Predictive Modeling
 
A hierarchy of choices
 

The goal of data mining is to produce new knowledge that the user can act upon. It does this by building a model of the real world based on data collected from a variety of sources, which may include corporate transactions, customer histories and demographic information, process control data, and relevant external databases such as credit bureau information or weather data. The result of the model building is a description of patterns and relationships in the data that can be confidently used for prediction.

To avoid confusing the different aspects of data mining, it helps to envision a hierarchy of the choices and decisions you need to make before you start:
     Business goal
     Type of prediction
     Model type
     Algorithm
     Product

At the highest level is the business goal: what is the ultimate purpose of mining this data? For example, if you are seeking patterns in your data to help you retain good customers, you might build one model to predict customer profitability and a second model to identify customers likely to leave (attrition). Your knowledge of your organization's needs and objectives will guide you in formulating the goal of your models.

The next step is deciding on the type of prediction that's most appropriate: (1) classification: predicting the category or class into which a case falls, or (2) regression: predicting the numeric value a variable will have (if the variable changes over time, this is called time series prediction). In the example above, you might use regression to forecast the amount of profitability, and classification to predict which customers might leave. These are discussed in more detail below.

Now you can choose the model type: a neural net to perform the regression, perhaps, and a decision tree for the classification. There are also traditional statistical models to choose from, such as logistic regression, discriminant analysis, or general linear models. The most important model types for data mining are described in the next section, Data Mining Models and Algorithms.

Many algorithms are available to build your models. You might build the neural net using backpropagation or radial basis functions. For the decision tree, you might choose among CART, C5.0, Quest, or CHAID. Some of these algorithms are also discussed in Data Mining Models and Algorithms, below.

When selecting a data mining product, be aware that products generally have different implementations of a particular algorithm even when they identify it by the same name. These implementation differences can affect operational characteristics such as memory usage and data storage, as well as performance characteristics such as speed and accuracy. Other key considerations to keep in mind are covered later in the section on Selecting Data Mining Products.

Many business goals are best met by building multiple model types using a variety of algorithms. You may not be able to determine which model type is best until you've tried several approaches.
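One way to try several approaches is to score each candidate on cases that were not used to build it. The sketch below is only an illustration: the data, the two stand-in models (`baseline` and `line`), and the choice of mean absolute error on a hold-out set are assumptions, not something the source prescribes.

```python
def mean_absolute_error(model, cases):
    """Average absolute gap between a model's estimate and the known value."""
    return sum(abs(model(x) - y) for x, y in cases) / len(cases)

# Hypothetical (predictor, response) pairs: one set for building, one held out.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
holdout = [(5.0, 10.5), (6.0, 11.5)]

train_mean = sum(y for _, y in train) / len(train)

def baseline(x):
    # Candidate 1: always predict the average response seen in training.
    return train_mean

def line(x):
    # Candidate 2: a straight line chosen by hand for this illustration.
    return 2.0 * x

candidates = {"baseline": baseline, "line": line}
scores = {name: mean_absolute_error(model, holdout)
          for name, model in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

The same loop works for any number of candidates, as long as each one can be called with a predictor value and returns an estimate.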

Some terminology
 

In predictive models, the values or classes we are predicting are called the response, dependent, or target variables. The values used to make the prediction are called the predictor or independent variables.

Predictive models are built, or trained, using data for which the value of the response variable is already known. This kind of training is sometimes referred to as supervised learning, because calculated or estimated values are compared with the known results. (By contrast, descriptive techniques such as clustering, described in the previous section, are sometimes referred to as unsupervised learning because there is no already-known result to guide the algorithms.)
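To make the terminology concrete, the following sketch lays out a few training cases with named predictor and response variables, then compares a model's estimates with the known results. The field names, the data, and the hand-written `income_rule` model are all hypothetical; a real model would be trained rather than written by hand.

```python
# Hypothetical training cases: the response value is already known for each.
cases = [
    {"age": 25, "income": 30_000, "responded": False},
    {"age": 40, "income": 60_000, "responded": True},
    {"age": 35, "income": 52_000, "responded": False},
    {"age": 50, "income": 80_000, "responded": True},
]

PREDICTORS = ["age", "income"]  # predictor (independent) variables
RESPONSE = "responded"          # response (dependent, target) variable

def split_case(case):
    """Separate one case into its predictor values and its known response."""
    return [case[name] for name in PREDICTORS], case[RESPONSE]

def training_error(model, cases):
    """Fraction of cases where the estimate differs from the known result."""
    wrong = 0
    for case in cases:
        x, known = split_case(case)
        if model(x) != known:
            wrong += 1
    return wrong / len(cases)

def income_rule(x):
    # Stand-in model (an assumption): predict a response when income is high.
    _age, income = x
    return income >= 50_000

print(training_error(income_rule, cases))
```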

   
Classification
 

Classification problems aim to identify the characteristics that indicate the group to which each case belongs. This pattern can be used both to understand the existing data and to predict how new instances will behave. For example, you may want to predict whether individuals can be classified as likely to respond to a direct mail solicitation, vulnerable to switching over to a competing long-distance phone service, or a good candidate for a surgical procedure.

Data mining creates classification models by examining already classified data (cases) and inductively finding a predictive pattern. These existing cases may come from a historical database, such as people who have already undergone a particular medical treatment or moved to a new long-distance service. They may come from an experiment in which a sample of the entire database is tested in the real world and the results used to create a classifier. For example, a sample of a mailing list would be sent an offer, and the results of the mailing used to develop a classification model to be applied to the entire database. Sometimes an expert classifies a sample of the database, and this classification is then used to create the model which will be applied to the entire database.
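The mailing-list example can be sketched as a one-split decision tree (a "stump") learned from the test mailing and then applied to the rest of the database. This is a simplified illustration, not the procedure of CART, C5.0, or any product: the data are invented, only one predictor is used, and Gini impurity is assumed as the splitting criterion.

```python
def gini(labels):
    """Gini impurity of a list of True/False labels (0.0 means pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(values, labels):
    """Pick the threshold on one predictor that gives the purest two groups."""
    best = None
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            # Each side predicts its majority class.
            left_class = 2 * sum(left) >= len(left)
            right_class = 2 * sum(right) >= len(right)
            best = (score, threshold, left_class, right_class)
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1:]

def classify(stump, value):
    threshold, left_class, right_class = stump
    return left_class if value <= threshold else right_class

# Hypothetical test mailing: prior purchases and whether the customer responded.
sample_purchases = [0, 1, 1, 2, 4, 5, 6, 8]
sample_responded = [False, False, False, False, True, False, True, True]
stump = fit_stump(sample_purchases, sample_responded)

# Apply the learned rule to customers who were not in the test mailing.
database_purchases = [0, 3, 7, 10]
predictions = [classify(stump, v) for v in database_purchases]
print(stump, predictions)
```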

   
Regression
 

Regression uses existing values to forecast what other values will be. In the simplest case, regression uses standard statistical techniques such as linear regression. Unfortunately, many real-world problems are not simply linear projections of previous values. For instance, sales volumes, negotiable securities prices, and product failure rates are all very difficult to predict because they may depend on complex interactions of multiple predictor variables. Therefore, more complex techniques (e.g., logistic regression, decision trees, or neural nets) may be necessary to forecast future values.
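For the simplest case mentioned above, a straight line can be fitted to one predictor by ordinary least squares. The sketch below uses invented figures that happen to lie exactly on a line; real data would scatter around it, which is where the more complex techniques come in. The names `spend` and `sales` are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical existing values: advertising spend and resulting sales.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(spend, sales)
forecast = slope * 6.0 + intercept  # estimate for a spend level not yet seen
print(slope, intercept, forecast)
```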

The same model types can often be used for both regression and classification. For example, the CART (Classification And Regression Trees) decision tree algorithm can be used to build both classification trees (to classify categorical response variables) and regression trees (to forecast continuous response variables). Neural nets too can create both classification and regression models.
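To illustrate how a tree can also forecast a continuous response, the sketch below fits a one-split regression tree: it searches for the threshold that minimises the squared error within each side and predicts each side's mean. It is a toy illustration in the spirit of regression trees, not the CART algorithm itself, and the data are invented.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_regression_stump(xs, ys):
    """Pick the split with the lowest total squared error; predict side means."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, threshold,
                    sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    return best[1:]

def predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Hypothetical cases: customer tenure in months and monthly spend.
tenure = [1, 2, 3, 10, 11, 12]
spend = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

stump = fit_regression_stump(tenure, spend)
print(stump, predict(stump, 2), predict(stump, 20))
```

The only differences from a classification split are the error measure (squared error rather than class impurity) and the prediction (a mean rather than a majority class).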

   
Time series
 

Time series forecasting predicts unknown future values based on a time-varying series of predictors. Like regression, it uses known results to guide its predictions. Models must take into account the distinctive properties of time, especially the hierarchy of periods (including such varied definitions as the five- or seven-day work week, the thirteen-"month" year, etc.), seasonality, calendar effects such as holidays, date arithmetic, and special considerations such as how much of the past is relevant.
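A very simple way to respect seasonality is a seasonal naive forecast: each future value is taken from the same position in the most recent complete period. The sketch below assumes a seven-day week and invented daily sales, and uses Python's standard `datetime` module for the date arithmetic. It ignores holidays and other calendar effects, which a practical model would need to handle.

```python
from datetime import date, timedelta

def seasonal_naive_forecast(history, period, horizon):
    """Repeat the last complete period of observations into the future."""
    if len(history) < period:
        raise ValueError("need at least one full period of history")
    last_cycle_start = len(history) - period
    return [history[last_cycle_start + (step % period)]
            for step in range(horizon)]

# Hypothetical daily sales for two seven-day weeks, starting on a Monday.
start = date(2003, 1, 6)
daily_sales = [20, 22, 21, 23, 30, 45, 40,
               21, 23, 22, 24, 31, 47, 41]

horizon = 3
forecasts = seasonal_naive_forecast(daily_sales, period=7, horizon=horizon)

# Date arithmetic: the first forecast day follows the last observed day.
forecast_dates = [start + timedelta(days=len(daily_sales) + i)
                  for i in range(horizon)]

for day, value in zip(forecast_dates, forecasts):
    print(day.isoformat(), value)
```

Changing `period` to 5 would model a five-day work week; how much history to keep is a separate choice that this sketch does not address.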

   
 
Reference: "Introduction to Data Mining and Knowledge Discovery" by Two Crows Corporation
  Copyright © 2003 - Hua Analytical Technology Co.,Ltd All rights reserved.  »¦ICP±¸09008869ºÅ