Data Mining  
Solutions    
 
Back to HomePage
   Introduction       >
   Methodology
   Applications       >
   Models and Algorithms
 
Your position£ºHome Page>Methodology
Methodology
 
Process Models
 

Recognizing that a systematic approach is essential to successful data mining, many vendor and consulting organizations have specified a process model designed to guide the user (especially someone new to building Predictive Modeling models) through a sequence of steps that will lead to good results. SPSS uses the 5A's ¡ª Assess, Access, Analyze, Act and Automate ¡ª and SAS uses SEMMA ¡ª Sample, Explore, Modify, Model, Assess.

Recently, a consortium of vendors and users consisting of NCR Systems Engineering Copenhagen (Denmark), Daimler-Benz AG (Germany), SPSS/Integral Solutions Ltd. (England) and OHRA Verzekeringen en Banking Groep B.V (The Netherlands) has been developing a specification called CRISP-DM ¡ª Cross-Industry Standard Process for Data Mining. CRISP-DM is similar to process models from other companies including the one from Two Crows Corporation. As of September 1999, CRISP-DM is a work in progress. It is a good start in helping people to understand the necessary steps in successful data mining.

The Two Crows Process Model
 

The Two Crows data mining process model described below is derived from the Two Crows process model discussed in the previous edition of this document, and also takes advantage of some insights from CRISP-DM.

Keep in mind that while the steps appear in a list, the data mining process is not linear ¡ª you will inevitably need to loop back to previous steps. For example, what you learn in the ¡°explore data¡± step may require you to add new data to the data mining database. The initial models you build may provide insights that lead you to create new variables.

The basic steps of data mining for knowledge discovery are:
     1. Define business problem
     2. Build data mining database
     3. Explore data
     4. Prepare data for modeling
     5. Build model
     6. Evaluate model
     7. Deploy model and results

Let's go through these steps to better understand the knowledge discovery process.

1. Define the business problem. First and foremost, the prerequisite to knowledge discovery is understanding your data and your business. Without this understanding, no algorithm, regardless of sophistication, is going to provide you with a result in which you should have confidence. Without this background you will not be able to identify the problems you're trying to solve, prepare the data for mining, or correctly interpret the results. To make the best use of data mining you must make a clear statement of your objectives. It may be that you wish to increase the response to a direct mail campaign. Depending on your specific goal, such as ¡°increasing the response rate¡± or ¡°increasing the value of a response,¡± you will build a very different model. An effective statement of the problem will include a way of measuring the results of your knowledge discovery project. It may also include a cost justification.

2. Build a data mining database. This step along with the next two constitute the core of the data preparation. Together, they take more time and effort than all the other steps combined. There may be repeated iterations of the data preparation and model building steps as you learn something from the model that suggests you modify the data. These data preparation steps may take anywhere from 50% to 90% of the time and effort of the entire knowledge discovery process!
The data to be mined should be collected in a database. Note that this does not necessarily imply a database management system must be used. Depending on the amount of the data, the complexity of the data, and the uses to which it is to be put, a flat file or even a spreadsheet may be adequate.
In general, it's not a good idea to use your corporate data warehouse for this. You will be better off creating a separate data mart. Mining the data will make you a very active user of the data warehouse, possibly causing resource allocation problems. You will often be joining many tables together and accessing substantial portions of the warehouse. A single trial model may require many passes through much of the warehouse.
Almost certainly you will be modifying the data from the data warehouse. In addition you may
want to bring in data from outside your company to overlay on the data warehouse data or you
may want to add new fields computed from existing fields. You may need to gather additional
data through surveys. Other people building different models from the data warehouse (some of whom will use the same data as you) may want to make similar alterations to the warehouse.However, data warehouse administrators do not look kindly on having data changed in what is unquestionably a corporate resource.
One more reason for a separate database is that the structure of the corporate data warehouse may not easily support the kinds of exploration you need to do to understand this data. This includes queries summarizing the data, multi-dimensional reports (sometimes called pivot tables), and many different kinds of graphs or visualizations.
Lastly, you may want to store this data in a different DBMS with a different physical design than the one you use for your corporate data warehouse. Increasingly, people are selecting specialpurpose DBMSs which support these data mining requirements quite well. If, however, your corporate data warehouse allows you to create logical data marts and if it can handle the resource demands of data mining, then it may also serve as a good data mining database.
The tasks in building a data mining database are:
     a. Data collection
     b. Descriptive Modeling
     c. Selection
     d. Data quality assessment and data cleansing
     e. Consolidation and integration
     f. Metadata construction
     g. Load the data mining database
     h. Maintain the data mining database
You must remember that these tasks are not performed in strict sequence, but as the need arises.For example, you will start constructing the metadata infrastructure as you collect the data,and modify it continuously. What you learn in consolidation or data quality assessment may change your initial selection decision.

a. Data collection. Identify the sources of the data you will be mining. A data-gathering phase may be necessary because some of the data you need may never have been collected. You
may need to acquire external data from public databases (such as census or weather data) or
proprietary databases (such as credit bureau data).
A Data Collection Report lists the properties of the different source data sets. Some of the
elements in this report should include:

  • Source of data (internal Applications or outside vendor)
  • Owner
  • Person/organization responsible for maintaining data
  • DBA
  • Cost (if purchased)
  • Storage organization (e.g., Oracle database,VSAM file, etc.)
  • Size in tables, rows, records, etc.
  • Size in bytes
  • Physical storage (CD-ROM, tape, server, etc.)
  • Security requirements
  • Restrictions on use
  • Privacy requirements

Be sure to make note of special security and privacy issues that your data mining database
will inherit from the source data. For example, many European data sets are constrained in
their use by privacy regulations that are far stricter than those in the United States.

b. Descriptive Modeling Describe the contents of each file or database table. Some of the properties
documented in a Descriptive Modeling Report are:

  • Number of fields/columns
  • Number/percentage of records with missing values
  • Field names
    For each field:
  • Data type
  • Definition
  • Descriptive
  • Source of field
  • Unit of measure
  • Number of unique values
  • List of values
  • Range of values
  • Number/percentage of missing values
  • Collection information (e.g., how, where,conditions)
  • Timeframe (e.g., daily, weekly, monthly)
  • Specific time data (e.g., every Monday or every Tuesday)
  • Primary key/foreign key relationships
  • 1999 Two Crows Corporation 25

c. Selection. The next step in preparing the data mining database is to select the subset of data to mine. This is not the same as sampling the database or choosing predictor variables.
Rather, it is a gross elimination of irrelevant or unneeded data. Other criteria for excluding data may include resource constraints, cost, restrictions on data use, or quality problems.

d. Data quality assessment and data cleansing. GIGO (Garbage In, Garbage Out) is quite
applicable to data mining, so if you want good models you need to have good data. A data
quality assessment identifies characteristics of the data that will affect the model quality.Essentially, you are trying to ensure not only the correctness and consistency of values but also that all the data you have is measuring the same thing in the same way.

There are a number of types of data quality problems. Single fields may have an incorrect
value. For example, recently a man's nine-digit Social Security identification number was
accidentally entered as income when the government computed his taxes! Even when individual fields have what appear to be correct values, there may be incorrect combinations,such as pregnant males. Sometimes the value for a field is missing. Inconsistencies must be
identified and removed when consolidating data from multiple sources.

Missing data can be a particularly pernicious problem. If you have to throw out every record
with a field missing, you may wind up with a very small database or an inaccurate picture of
the whole database. The fact that a value is missing may be significant in itself. Perhaps only wealthy customers regularly leave the ¡°income¡± field blank, for instance. It can be
worthwhile to create a new variable to identify missing values, build a model using it, and
compare the results with those achieved by substituting for the missing value to see which
leads to better predictions.

Another approach is to calculate a substitute value. Some common strategies for calculating
missing values include using the modal value (for nominal variables), the median (for ordinal variables), or the mean (for continuous variables). A less common strategy is to assign a missing value based on the distribution of values for that variable. For example, if a database consisted of 40% females and 60% males, then you might assign a missing gender entry the value of ¡°female¡± 40% of the time and ¡°male¡± 60% of the time. Sometimes people build Predictive Modeling models using data mining techniques to predict missing values. This usually gives a better result than a simple calculation, but is much more time-consuming.

Recognize that you will not be able to fix all the problems, so you will need to work around
them as best as possible. It is far preferable and more cost-effective to put in place procedures and checks to avoid the data quality problems ¡ª ¡°an ounce of prevention.¡± Usually, however, you must build the models you need with the data you now have, and avoidance is something you'll work toward for the future.

e. Integration and consolidation. The data you need may reside in a single database or in
multiple databases. The source databases may be transaction databases used by the
operational systems of your company. Other data may be in data warehouses or data marts
built for specific purposes. Still other data may reside in a proprietary database belonging to another company such as a credit bureau.

Data integration and consolidation combines data from different sources into a single mining
database and requires reconciling differences in data values from the various sources.
Improperly reconciled data is a major source of quality problems. There are often large
differences in the way data are defined and used in different databases. Some nconsistencies
may be easy to uncover, such as different addresses for the same customer. Making it more
difficult to resolve these problems is that they are often subtle. For example, the same
customer may have different names or ¡ª worse ¡ª multiple customer identification numbers.
The same name may be used for different entities (homonyms), or different names may be used for the same entity (synonyms). There are often unit incompatibilities, especially when data sources are consolidated from different countries; for example, U.S. dollars and Canadian dollars cannot be added without conversion.

f. Metadata construction. The information in the Dataset Descriptive and Descriptive Modeling
reports is the basis for the metadata infrastructure. In essence this is a database about the database itself. It provides information that will be used in the creation of the physical database as well as information that will be used by analysts in understanding the data and building the models.

g. Load the data mining database. In most cases the data should be stored in its own database.For large amounts or complex data, this will usually be a DBMS as opposed to a flat file.Having collected, integrated and cleaned the data, it is now necessary to actually load the database itself. Depending on the DBMS and hardware being used, the amount of data, and
the complexity of the database design, this may turn out to be a serious undertaking that
requires the expertise of information systems professionals.

h. Maintain the data mining database. Once created, a database needs to be cared for. It needs to be backed up periodically; its performance should be monitored; and it may need
occasional reorganization to reclaim disk storage or to improve performance. For a large,
complex database stored in a DBMS, the maintenance may also require the services of
information systems professionals.

3. Explore the data. See the Descriptive Modeling FOR DATA MINING section above for a detailed
discussion of visualization, link analysis, and other means of exploring the data. The goal is to identify the most important fields in predicting an outcome, and determine which derived values may be useful.

In a data set with hundreds or even thousands of columns, exploring the data can be as timeconsuming and labor-intensive as it is illuminating. A good interface and fast computer response are very important in this phase because the very nature of your exploration is changed when you have to wait even 20 minutes for some graphs, let alone a day.

4. Prepare data for modeling. This is the final data preparation step before building models. There are four main parts to this step:
     a. Select variables
     b. Select rows
     c. Construct new variables
     d. Transform variables
a. Select variables. Ideally, you would take all the variables you have, feed them to the data mining tool and let it find those which are the best predictors. In practice, this doesn't work very well. One reason is that the time it takes to build a model increases with the number of variables. Another reason is that blindly including extraneous columns can lead to incorrect models. A very common error, for example, is to use as a predictor variable data that can only be known if you know the value of the response variable. People have actually used date of birth to ¡°predict¡± age without realizing it.

While in principle some data mining Algorithms will automatically ignore irrelevant variables and properly account for related (covariant) columns, in practice it is wise to avoid depending solely on the tool. Often your knowledge of the problem domain can let you make many of these selections correctly. For example, including ID number or Social Security number as predictor variables will at best have no benefit and at worst may reduce the weight of other important variables.

b. Select rows. As in the case of selecting variables, you would like to use all the rows you have to build models. If you have a lot of data, however, this may take too long or require buying a bigger computer than you would like.

Consequently it is often a good idea to sample the data when the database is large. This yields no loss of information for most business problems, although sample selection must be done carefully to ensure the sample is truly random. Given a choice of either investigating a few models built on all the data or investigating more models built on a sample, the latter approach will usually help you develop a more accurate and robust model.

You may also want to throw out data that are clearly outliers. While in some cases outliers
may contain information important to your model building, often they can be ignored based
on your understanding of the problem. For example, they may be the result of incorrectly
entered data, or of a one-time occurrence such as a labor strike.

Sometimes you may need to add new records (e.g., for customers who made no purchases).

c. Construct new variables. It is often necessary to construct new predictors derived from the raw data. For example, forecasting credit risk using a debt-to-income ratio rather than just debt and income as predictor variables may yield more accurate results that are also easier to understand. Certain variables that have little effect alone may need to be combined with others, using various arithmetic or algebraic operations (e.g., addition, ratios). Some variables that extend over a wide range may be modified to construct a better predictor, such as using the log of income instead of income.

d. Transform variables. The tool you choose may dictate how you represent your data, for
instance, the categorical explosion required by neural nets. Variables may also be scaled to
fall within a limited range, such as 0 to 1. Many decision trees used for classification require continuous data such as income to be grouped in ranges (bins) such as High, Medium, and Low. The encoding you select can influence the result of your model. For example, the cutoff points for the bins may change the outcome of a model.

5. Data mining model building. The most important thing to remember about model building is
that it is an iterative process. You will need to explore alternative models to find the one that is most useful in solving your business problem. What you learn in searching for a good model may lead you to go back and make some changes to the data you are using or even modify your problem statement.

Once you have decided on the type of prediction you want to make (e.g., classification or
regression), you must choose a model type for making the prediction. This could be a decision tree, a neural net, a proprietary method, or that old standby, logistic regression. Your choice of model type will influence what data preparation you must do and how you go about it. For example, a neural net tool may require you to explode your categorical variables. Or the tool may require that the data be in a particular file format, thus requiring you to extract the data into that format. Once the data is ready, you can proceed with training your model.

The process of building Predictive Modeling models requires a well-defined training and validation protocol in order to insure the most accurate and robust predictions. This kind of protocol is sometimes called supervised learning. The essence of supervised learning is to train (estimate) your model on a portion of the data, then test and validate it on the remainder of the data. A model is built when the cycle of training and testing is completed. Sometimes a third data set, called the validation data set, is needed because the test data may be influencing features of the model, and the validation set acts as an independent measure of the model's accuracy.

Training and testing the data mining model requires the data to be split into at least two groups:one for model training (i.e., estimation of the model parameters) and one for model testing. If you don't use different training and test data, the accuracy of the model will be overestimated. After the model is generated using the training database, it is used to predict the test database, and the resulting accuracy rate is a good estimate of how the model will perform on future databases that are similar to the training and test databases. It does not guarantee that the model is correct. It simply says that if the same technique were used on a succession of databases with similar data to the training and test data, the average accuracy would be close to the one obtained this way.

Simple validation. The most basic testing method is called simple validation. To carry this out,you set aside a percentage of the database as a test database, and do not use it in any way in the model building and estimation. This percentage is typically between 5% and 33%. For all the future calculations to be correct, the division of the data into two groups must be random, so that the training and test data sets both reflect the data being modeled.

After building the model on the main body of the data, the model is used to predict the classes or values of the test database. Dividing the number of incorrect classifications by the total number of instances gives an error rate. Dividing the number of correct classifications by the total number of instances gives an accuracy rate (i.e., accuracy = 1 ¨C error). For a regression model, the goodness of fit or ¡°r-squared¡± is usually used as an estimate of the accuracy.

In building a single model, even this simple validation may need to be performed dozens of times.For example, when using a neural net, sometimes each training pass through the net is tested against a test database. Training then stops when the accuracy rates on the test database no longer improve with additional iterations.

Cross validation. If you have only a modest amount of data (a few thousand rows) for building the model, you can't afford to set aside a percentage of it for simple validation. Cross validation is a method that lets you use all your data. The data is randomly divided into two equal sets in order to estimate the Predictive Modeling accuracy of the model. First, a model is built on the first set and used to predict the outcomes in the second set and calculate an error rate. Then a model is built on the second set and used to predict the outcomes in the first set and again calculate an error rate.Finally, a model is built using all the data. There are now two independent error estimates which can be averaged to give a better estimate of the true accuracy of the model built on all the data.Typically, the more general n-fold cross validation is used. In this method, the data is randomly divided into n disjoint groups. For example, suppose the data is divided into ten groups. The first group is set aside for testing and the other nine are lumped together for model building. The model built on the 90% group is then used to predict the group that was set aside. This process is repeated a total of 10 times as each group in turn is set aside, the model is built on the remaining 90% of the data, and then that model is used to predict the set-aside group. Finally, a model is built using all the data. The mean of the 10 independent error rate predictions is used as the error rate for this last model.

Bootstrapping is another technique for estimating the error of a model; it is primarily used with very small data sets. As in cross validation, the model is built on the entire dataset. Then numerous data sets called bootstrap samples are created by sampling from the original data set.After each case is sampled, it is replaced and a case is selected again until the entire bootstrap sample is created. Note that records may occur more than once in the data sets thus created. A model is built on this data set, and its error rate is calculated. This is called the resubstitution error. Many bootstrap samples (sometimes over 1,000) are created. The final error estimate for the model built on the whole data set is calculated by taking the average of the estimates from each of the bootstrap samples.

Based upon the results of your model building, you may want to build another model using the
same technique but different parameters, or perhaps try other Algorithms or tools. For example,another approach may increase your accuracy. No tool or technique is perfect for all data, and it is difficult if not impossible to be sure before you start which technique will work the best. It is quite common to build numerous models before finding a satisfactory one.

6. Evaluation and interpretation.

a. Model Validation. After building a model, you must evaluate its results and interpret their significance. Remember that the accuracy rate found during testing applies only to the data on which the model was built. In practice, the accuracy may vary if the data to which the model is applied differs in important and unknowable ways from the original data. More
importantly, accuracy by itself is not necessarily the right metric for selecting the best model. You need to know more about the type of errors and the costs associated with them.

Confusion matrices. For classification problems, a confusion matrix is a very useful tool for understanding results. A confusion matrix (Figure 9) shows the counts of the actual versus predicted class values. It shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong. The following table is a sample confusion matrix. The columns show the actual classes, and the rows show the predicted classes. Therefore the diagonal shows all the correct predictions. In the confusion matrix, you can see that our model predicted 38 of the 46 Class B's correctly, but misclassified 8 of them: two as Class A and six as Class C. This is much more informative than simply telling us an overall accuracy rate of 82% (123 correct classifications out of 150 cases).

In particular, if there are different costs associated with different errors, a model with a lower overall accuracy may be preferable to one with higher accuracy but a greater cost to the organization due to the types of errors it makes. For example, suppose in the above confusion matrix each correct answer had a value of $10 and each incorrect answer for class A had a cost of $5, for class B a cost of $10, and for class C a cost of $20. Then the net value of the matrix would be :

But consider the following confusion matrix (Figure 10). The accuracy has dropped to 79%
(118/150). However when we apply the costs from above to this confusion matrix the net
value is:

Thus, if you wanted to maximize the value of the model, you would be better off choosing the less accurate model that has a higher net value.

The lift (gain) chart (Figure 11) is also a big help in evaluating the usefulness of a model. It shows how responses (e.g., to a direct mail solicitation or a surgical treatment) are changed by applying the model. This change ratio is called the lift. For example, instead of a 10% response rate when a random 10% of the population is treated, the response rate of a scored 10% of the population is over 30%. The lift is 3 in this case.

Another important component of interpretation is to assess the value of the model. Again, a
pattern may be interesting, but acting on it may cost more than the revenue or savings it generates. The ROI (Return on Investment) chart in Figure 12 is a good example of how attaching values to a response and costs to a program can provide additional guidance to decision making. (Here, ROI is defined as ratio of profit to cost.) Note that beyond the 8th
decile (80%), the ROI of the scored model becomes negative. It is at a maximum at the 2nd decile (20%).

Alternatively you may want to look at the profitability of a model (profit = revenue minus
cost), as shown in the following chart (Figure 13).

Note that in the example we've been using, the maximum lift (for the 10 deciles) was achieved at the 1st decile (10%), the maximum ROI at the 2nd decile (20%), and the maximum profit at the 3rd and 4th deciles.

Ideally, you can act on the results of a model in a profitable way. But remember, there may be no practical means to take advantage of the knowledge gained.

b. External validation. As pointed out above, no matter how good the accuracy of a model is estimated to be, there is no guarantee that it reflects the real world. A valid model is not necessarily a correct model. One of the main reasons for this problem is that there are always assumptions implicit in the model. For example, the inflation rate may not have been included as a variable in a model that predicts the propensity of an individual to buy, but a jump in inflation from 3% to 17% will certainly affect people's behavior. Also, the data used to build the model may fail to match the real world in some unknown way, leading to an incorrect model.

Therefore it is important to test a model in the real world. If a model is used to select a subset of a mailing list, do a test mailing to verify the model. If a model is used to predict credit risk,try the model on a small set of applicants before full deployment. The higher the risk associated with an incorrect model, the more important it is to construct an experiment to
check the model results.

7. Deploy the model and results. Once a data mining model is built and validated, it can be used in one of two main ways. The first way is for an analyst to recommend actions based on simply
viewing the model and its results. For example, the analyst may look at the clusters the model has identified, the rules that define the model, or the lift and ROI charts that depict the effect of the model.

The second way is to apply the model to different data sets. The model could be used to flag
records based on their classification, or assign a score such as the probability of an action (e.g.,responding to a direct mail solicitation). Or the model can select some records from the database and subject these to further analyses with an OLAP tool.

Often the models are part of a business process such as risk analysis, credit authorization or fraud detection. In these cases the model is incorporated into an Applications. For instance, a Predictive Modeling model may be integrated into a mortgage loan Applications to aid a loan officer in evaluating the applicant. Or a model might be embedded in an Applications such as an inventory ordering system that automatically generates an order when the forecast inventory levels drop below a threshold.

The data mining model is often applied to one event or transaction at a time, such as scoring a loan Applications for risk. The amount of time to process each new transaction, and the rate at which new transactions arrive, will determine whether a parallelized algorithm is needed. Thus, while loan Applicationss can easily be evaluated on modest-sized computers, monitoring credit card transactions or cellular telephone calls for fraud would require a parallel system to deal with the high transaction rate.

When delivering a complex Applications, data mining is often only a small, albeit critical, part of the final product. For example, knowledge discovered through data mining may be combined with the knowledge of domain experts and applied to data in the database and incoming transactions. In a fraud detection system, known patterns of fraud may be combined with discovered patterns. When suspected cases of fraud are passed on to fraud investigators for evaluation, the investigators may need to access database records about other claims filed by the claimant as well as other claims in which the same doctors and lawyers were involved.

Model monitoring. You must, of course, measure how well your model has worked after you use it. However, even when you think you're finished because your model works well, you must continually monitor the performance of the model. Over time, all systems evolve. Salespeople know that purchasing patterns change over time. External variables such as inflation rate may change enough to alter the way people behave. Thus, from time to time the model will have to be retested, retrained and possibly completely rebuilt. Charts of the residual differences between forecasted and observed values are an excellent way to monitor model results. Such charts are easy to use and understand, not computationally intensive, and could be built into the software that implements the model. Thus, the system could monitor itself.

Reference£º"Introduction to Data Mining and Knowledge Discovery" by Two Crows Corporation
  Copyright © 2003 - Hua Analytical Technology Co.,Ltd All rights reserved.  »¦ICP±¸09008869ºÅ