The Two Crows
data mining process model described below is derived
from the Two Crows process model discussed in
the previous edition of this document, and also
takes advantage of some insights from CRISP-DM.
Keep in mind that while the
steps appear in a list, the data mining process
is not linear ¡ª you will inevitably need to loop
back to previous steps. For example, what you
learn in the ¡°explore data¡± step may require you
to add new data to the data mining database. The
initial models you build may provide insights
that lead you to create new variables.
The basic steps of data mining
for knowledge discovery are:
1. Define business
problem
2. Build data mining
database
3. Explore data
4. Prepare data
for modeling
5. Build model
6. Evaluate model
7. Deploy model
and results
Let's go through these steps
to better understand the knowledge discovery process.
1. Define the business problem.
First and foremost, the prerequisite to knowledge
discovery is understanding your data and your
business. Without this understanding, no algorithm,
regardless of sophistication, is going to provide
you with a result in which you should have confidence.
Without this background you will not be able to
identify the problems you're trying to solve,
prepare the data for mining, or correctly interpret
the results. To make the best use of data mining
you must make a clear statement of your objectives.
It may be that you wish to increase the response
to a direct mail campaign. Depending on your specific
goal, such as ¡°increasing the response rate¡± or
¡°increasing the value of a response,¡± you will
build a very different model. An effective statement
of the problem will include a way of measuring
the results of your knowledge discovery project.
It may also include a cost justification.
2. Build a data mining database.
This step along with the next two constitute the
core of the data preparation. Together, they take
more time and effort than all the other steps
combined. There may be repeated iterations of
the data preparation and model building steps
as you learn something from the model that suggests
you modify the data. These data preparation steps
may take anywhere from 50% to 90% of the time
and effort of the entire knowledge discovery process!
The data to be mined should be collected in a
database. Note that this does not necessarily
imply a database management system must be used.
Depending on the amount of the data, the complexity
of the data, and the uses to which it is to be
put, a flat file or even a spreadsheet may be
adequate.
In general, it's not a good idea to use your corporate
data warehouse for this. You will be better off
creating a separate data mart. Mining the data
will make you a very active user of the data warehouse,
possibly causing resource allocation problems.
You will often be joining many tables together
and accessing substantial portions of the warehouse.
A single trial model may require many passes through
much of the warehouse.
Almost certainly you will be modifying the data
from the data warehouse. In addition you may
want to bring in data from outside your company
to overlay on the data warehouse data or you
may want to add new fields computed from existing
fields. You may need to gather additional
data through surveys. Other people building different
models from the data warehouse (some of whom will
use the same data as you) may want to make similar
alterations to the warehouse.However, data warehouse
administrators do not look kindly on having data
changed in what is unquestionably a corporate
resource.
One more reason for a separate database is that
the structure of the corporate data warehouse
may not easily support the kinds of exploration
you need to do to understand this data. This includes
queries summarizing the data, multi-dimensional
reports (sometimes called pivot tables), and many
different kinds of graphs or visualizations.
Lastly, you may want to store this data in a different
DBMS with a different physical design than the
one you use for your corporate data warehouse.
Increasingly, people are selecting specialpurpose
DBMSs which support these data mining requirements
quite well. If, however, your corporate data warehouse
allows you to create logical data marts and if
it can handle the resource demands of data mining,
then it may also serve as a good data mining database.
The tasks in building a data mining database are:
a. Data collection
b. Descriptive Modeling
c. Selection
d. Data quality
assessment and data cleansing
e. Consolidation
and integration
f. Metadata construction
g. Load the data
mining database
h. Maintain the
data mining database
You must remember that these tasks are not performed
in strict sequence, but as the need arises.For
example, you will start constructing the metadata
infrastructure as you collect the data,and modify
it continuously. What you learn in consolidation
or data quality assessment may change your initial
selection decision.
a. Data collection. Identify
the sources of the data you will be mining. A
data-gathering phase may be necessary because
some of the data you need may never have been
collected. You
may need to acquire external data from public
databases (such as census or weather data) or
proprietary databases (such as credit bureau data).
A Data Collection Report lists the properties
of the different source data sets. Some of the
elements in this report should include:
- Source of data (internal Applications or
outside vendor)
- Owner
- Person/organization responsible for maintaining
data
- DBA
- Cost (if purchased)
- Storage organization (e.g., Oracle database,VSAM
file, etc.)
- Size in tables, rows, records, etc.
- Size in bytes
- Physical storage (CD-ROM, tape, server,
etc.)
- Security requirements
- Restrictions on use
- Privacy requirements
Be sure to make note of special
security and privacy issues that your data mining
database
will inherit from the source data. For example,
many European data sets are constrained in
their use by privacy regulations that are far
stricter than those in the United States.
b. Descriptive Modeling Describe
the contents of each file or database table. Some
of the properties
documented in a Descriptive Modeling Report are:
- Number of fields/columns
- Number/percentage of records with missing
values
- Field names
For each field:
- Data type
- Definition
- Descriptive
- Source of field
- Unit of measure
- Number of unique values
- List of values
- Range of values
- Number/percentage of missing values
- Collection information (e.g., how, where,conditions)
- Timeframe (e.g., daily, weekly, monthly)
- Specific time data (e.g., every Monday or
every Tuesday)
- Primary key/foreign key relationships
- 1999 Two Crows Corporation 25
c. Selection. The next step
in preparing the data mining database is to select
the subset of data to mine. This is not the same
as sampling the database or choosing predictor
variables.
Rather, it is a gross elimination of irrelevant
or unneeded data. Other criteria for excluding
data may include resource constraints, cost, restrictions
on data use, or quality problems.
d. Data quality assessment and
data cleansing. GIGO (Garbage In, Garbage Out)
is quite
applicable to data mining, so if you want good
models you need to have good data. A data
quality assessment identifies characteristics
of the data that will affect the model quality.Essentially,
you are trying to ensure not only the correctness
and consistency of values but also that all the
data you have is measuring the same thing in the
same way.
There are a number of types
of data quality problems. Single fields may have
an incorrect
value. For example, recently a man's nine-digit
Social Security identification number was
accidentally entered as income when the government
computed his taxes! Even when individual fields
have what appear to be correct values, there may
be incorrect combinations,such as pregnant males.
Sometimes the value for a field is missing. Inconsistencies
must be
identified and removed when consolidating data
from multiple sources.
Missing data can be a particularly
pernicious problem. If you have to throw out every
record
with a field missing, you may wind up with a very
small database or an inaccurate picture of
the whole database. The fact that a value is missing
may be significant in itself. Perhaps only wealthy
customers regularly leave the ¡°income¡± field blank,
for instance. It can be
worthwhile to create a new variable to identify
missing values, build a model using it, and
compare the results with those achieved by substituting
for the missing value to see which
leads to better predictions.
Another approach is to calculate
a substitute value. Some common strategies for
calculating
missing values include using the modal value (for
nominal variables), the median (for ordinal variables),
or the mean (for continuous variables). A less
common strategy is to assign a missing value based
on the distribution of values for that variable.
For example, if a database consisted of 40% females
and 60% males, then you might assign a missing
gender entry the value of ¡°female¡± 40% of the
time and ¡°male¡± 60% of the time. Sometimes people
build Predictive Modeling models using data mining techniques
to predict missing values. This usually gives
a better result than a simple calculation, but
is much more time-consuming.
Recognize that you will not
be able to fix all the problems, so you will need
to work around
them as best as possible. It is far preferable
and more cost-effective to put in place procedures
and checks to avoid the data quality problems
¡ª ¡°an ounce of prevention.¡± Usually, however,
you must build the models you need with the data
you now have, and avoidance is something you'll
work toward for the future.
e. Integration and consolidation.
The data you need may reside in a single database
or in
multiple databases. The source databases may be
transaction databases used by the
operational systems of your company. Other data
may be in data warehouses or data marts
built for specific purposes. Still other data
may reside in a proprietary database belonging
to another company such as a credit bureau.
Data integration and consolidation
combines data from different sources into a single
mining
database and requires reconciling differences
in data values from the various sources.
Improperly reconciled data is a major source of
quality problems. There are often large
differences in the way data are defined and used
in different databases. Some nconsistencies
may be easy to uncover, such as different addresses
for the same customer. Making it more
difficult to resolve these problems is that they
are often subtle. For example, the same
customer may have different names or ¡ª worse ¡ª
multiple customer identification numbers.
The same name may be used for different entities
(homonyms), or different names may be used for
the same entity (synonyms). There are often unit
incompatibilities, especially when data sources
are consolidated from different countries; for
example, U.S. dollars and Canadian dollars cannot
be added without conversion.
f. Metadata construction. The
information in the Dataset Descriptive and Descriptive Modeling
reports is the basis for the metadata infrastructure.
In essence this is a database about the database
itself. It provides information that will be used
in the creation of the physical database as well
as information that will be used by analysts in
understanding the data and building the models.
g. Load the data mining database.
In most cases the data should be stored in its
own database.For large amounts or complex data,
this will usually be a DBMS as opposed to a flat
file.Having collected, integrated and cleaned
the data, it is now necessary to actually load
the database itself. Depending on the DBMS and
hardware being used, the amount of data, and
the complexity of the database design, this may
turn out to be a serious undertaking that
requires the expertise of information systems
professionals.
h. Maintain the data mining
database. Once created, a database needs to be
cared for. It needs to be backed up periodically;
its performance should be monitored; and it may
need
occasional reorganization to reclaim disk storage
or to improve performance. For a large,
complex database stored in a DBMS, the maintenance
may also require the services of
information systems professionals.
3. Explore the data. See the
Descriptive Modeling FOR DATA MINING section above
for a detailed
discussion of visualization, link analysis, and
other means of exploring the data. The goal is
to identify the most important fields in predicting
an outcome, and determine which derived values
may be useful.
In a data set with hundreds
or even thousands of columns, exploring the data
can be as timeconsuming and labor-intensive as
it is illuminating. A good interface and fast
computer response are very important in this phase
because the very nature of your exploration is
changed when you have to wait even 20 minutes
for some graphs, let alone a day.
4. Prepare data for modeling.
This is the final data preparation step before
building models. There are four main parts to
this step:
a. Select variables
b. Select rows
c. Construct new
variables
d. Transform variables
a. Select variables. Ideally, you would take all
the variables you have, feed them to the data
mining tool and let it find those which are the
best predictors. In practice, this doesn't work
very well. One reason is that the time it takes
to build a model increases with the number of
variables. Another reason is that blindly including
extraneous columns can lead to incorrect models.
A very common error, for example, is to use as
a predictor variable data that can only be known
if you know the value of the response variable.
People have actually used date of birth to ¡°predict¡±
age without realizing it.
While in principle some data
mining Algorithms will automatically ignore irrelevant
variables and properly account for related (covariant)
columns, in practice it is wise to avoid depending
solely on the tool. Often your knowledge of the
problem domain can let you make many of these
selections correctly. For example, including ID
number or Social Security number as predictor
variables will at best have no benefit and at
worst may reduce the weight of other important
variables.
b. Select rows. As in the case
of selecting variables, you would like to use
all the rows you have to build models. If you
have a lot of data, however, this may take too
long or require buying a bigger computer than
you would like.
Consequently it is often a good
idea to sample the data when the database is large.
This yields no loss of information for most business
problems, although sample selection must be done
carefully to ensure the sample is truly random.
Given a choice of either investigating a few models
built on all the data or investigating more models
built on a sample, the latter approach will usually
help you develop a more accurate and robust model.
You may also want to throw out
data that are clearly outliers. While in some
cases outliers
may contain information important to your model
building, often they can be ignored based
on your understanding of the problem. For example,
they may be the result of incorrectly
entered data, or of a one-time occurrence such
as a labor strike.
Sometimes you may need to add
new records (e.g., for customers who made no purchases).
c. Construct new variables.
It is often necessary to construct new predictors
derived from the raw data. For example, forecasting
credit risk using a debt-to-income ratio rather
than just debt and income as predictor variables
may yield more accurate results that are also
easier to understand. Certain variables that have
little effect alone may need to be combined with
others, using various arithmetic or algebraic
operations (e.g., addition, ratios). Some variables
that extend over a wide range may be modified
to construct a better predictor, such as using
the log of income instead of income.
d. Transform variables. The
tool you choose may dictate how you represent
your data, for
instance, the categorical explosion required by
neural nets. Variables may also be scaled to
fall within a limited range, such as 0 to 1. Many
decision trees used for classification require
continuous data such as income to be grouped in
ranges (bins) such as High, Medium, and Low. The
encoding you select can influence the result of
your model. For example, the cutoff points for
the bins may change the outcome of a model.
5. Data mining model building.
The most important thing to remember about model
building is
that it is an iterative process. You will need
to explore alternative models to find the one
that is most useful in solving your business problem.
What you learn in searching for a good model may
lead you to go back and make some changes to the
data you are using or even modify your problem
statement.
Once you have decided on the
type of prediction you want to make (e.g., classification
or
regression), you must choose a model type for
making the prediction. This could be a decision
tree, a neural net, a proprietary method, or that
old standby, logistic regression. Your choice
of model type will influence what data preparation
you must do and how you go about it. For example,
a neural net tool may require you to explode your
categorical variables. Or the tool may require
that the data be in a particular file format,
thus requiring you to extract the data into that
format. Once the data is ready, you can proceed
with training your model.
The process of building Predictive Modeling
models requires a well-defined training and validation
protocol in order to insure the most accurate
and robust predictions. This kind of protocol
is sometimes called supervised learning. The essence
of supervised learning is to train (estimate)
your model on a portion of the data, then test
and validate it on the remainder of the data.
A model is built when the cycle of training and
testing is completed. Sometimes a third data set,
called the validation data set, is needed because
the test data may be influencing features of the
model, and the validation set acts as an independent
measure of the model's accuracy.
Training and testing the data
mining model requires the data to be split into
at least two groups:one for model training (i.e.,
estimation of the model parameters) and one for
model testing. If you don't use different training
and test data, the accuracy of the model will
be overestimated. After the model is generated
using the training database, it is used to predict
the test database, and the resulting accuracy
rate is a good estimate of how the model will
perform on future databases that are similar to
the training and test databases. It does not guarantee
that the model is correct. It simply says that
if the same technique were used on a succession
of databases with similar data to the training
and test data, the average accuracy would be close
to the one obtained this way.
Simple validation. The most
basic testing method is called simple validation.
To carry this out,you set aside a percentage of
the database as a test database, and do not use
it in any way in the model building and estimation.
This percentage is typically between 5% and 33%.
For all the future calculations to be correct,
the division of the data into two groups must
be random, so that the training and test data
sets both reflect the data being modeled.
After building the model on
the main body of the data, the model is used to
predict the classes or values of the test database.
Dividing the number of incorrect classifications
by the total number of instances gives an error
rate. Dividing the number of correct classifications
by the total number of instances gives an accuracy
rate (i.e., accuracy = 1 ¨C error). For a regression
model, the goodness of fit or ¡°r-squared¡± is usually
used as an estimate of the accuracy.
In building a single model,
even this simple validation may need to be performed
dozens of times.For example, when using a neural
net, sometimes each training pass through the
net is tested against a test database. Training
then stops when the accuracy rates on the test
database no longer improve with additional iterations.
Cross validation. If you have
only a modest amount of data (a few thousand rows)
for building the model, you can't afford to set
aside a percentage of it for simple validation.
Cross validation is a method that lets you use
all your data. The data is randomly divided into
two equal sets in order to estimate the Predictive Modeling
accuracy of the model. First, a model is built
on the first set and used to predict the outcomes
in the second set and calculate an error rate.
Then a model is built on the second set and used
to predict the outcomes in the first set and again
calculate an error rate.Finally, a model is built
using all the data. There are now two independent
error estimates which can be averaged to give
a better estimate of the true accuracy of the
model built on all the data.Typically, the more
general n-fold cross validation is used. In this
method, the data is randomly divided into n disjoint
groups. For example, suppose the data is divided
into ten groups. The first group is set aside
for testing and the other nine are lumped together
for model building. The model built on the 90%
group is then used to predict the group that was
set aside. This process is repeated a total of
10 times as each group in turn is set aside, the
model is built on the remaining 90% of the data,
and then that model is used to predict the set-aside
group. Finally, a model is built using all the
data. The mean of the 10 independent error rate
predictions is used as the error rate for this
last model.
Bootstrapping is another technique
for estimating the error of a model; it is primarily
used with very small data sets. As in cross validation,
the model is built on the entire dataset. Then
numerous data sets called bootstrap samples are
created by sampling from the original data set.After
each case is sampled, it is replaced and a case
is selected again until the entire bootstrap sample
is created. Note that records may occur more than
once in the data sets thus created. A model is
built on this data set, and its error rate is
calculated. This is called the resubstitution
error. Many bootstrap samples (sometimes over
1,000) are created. The final error estimate for
the model built on the whole data set is calculated
by taking the average of the estimates from each
of the bootstrap samples.
Based upon the results of your
model building, you may want to build another
model using the
same technique but different parameters, or perhaps
try other Algorithms or tools. For example,another
approach may increase your accuracy. No tool or
technique is perfect for all data, and it is difficult
if not impossible to be sure before you start
which technique will work the best. It is quite
common to build numerous models before finding
a satisfactory one.
6. Evaluation and interpretation.
a. Model Validation. After building
a model, you must evaluate its results and interpret
their significance. Remember that the accuracy
rate found during testing applies only to the
data on which the model was built. In practice,
the accuracy may vary if the data to which the
model is applied differs in important and unknowable
ways from the original data. More
importantly, accuracy by itself is not necessarily
the right metric for selecting the best model.
You need to know more about the type of errors
and the costs associated with them.
Confusion matrices. For classification
problems, a confusion matrix is a very useful
tool for understanding results. A confusion matrix
(Figure 9) shows the counts of the actual versus
predicted class values. It shows not only how
well the model predicts, but also presents the
details needed to see exactly where things may
have gone wrong. The following table is a sample
confusion matrix. The columns show the actual
classes, and the rows show the predicted classes.
Therefore the diagonal shows all the correct predictions.
In the confusion matrix, you can see that our
model predicted 38 of the 46 Class B's correctly,
but misclassified 8 of them: two as Class A and
six as Class C. This is much more informative
than simply telling us an overall accuracy rate
of 82% (123 correct classifications out of 150
cases).

In particular, if there are different costs associated
with different errors, a model with a lower overall
accuracy may be preferable to one with higher
accuracy but a greater cost to the organization
due to the types of errors it makes. For example,
suppose in the above confusion matrix each correct
answer had a value of $10 and each incorrect answer
for class A had a cost of $5, for class B a cost
of $10, and for class C a cost of $20. Then the
net value of the matrix would be :

But consider the following confusion matrix (Figure
10). The accuracy has dropped to 79%
(118/150). However when we apply the costs from
above to this confusion matrix the net
value is:

Thus, if you wanted to maximize
the value of the model, you would be better off
choosing the less accurate model that has a higher
net value.
The lift (gain) chart (Figure
11) is also a big help in evaluating the usefulness
of a model. It shows how responses (e.g., to a
direct mail solicitation or a surgical treatment)
are changed by applying the model. This change
ratio is called the lift. For example, instead
of a 10% response rate when a random 10% of the
population is treated, the response rate of a
scored 10% of the population is over 30%. The
lift is 3 in this case.

Another important component of
interpretation is to assess the value of the model.
Again, a
pattern may be interesting, but acting on it may
cost more than the revenue or savings it generates.
The ROI (Return on Investment) chart in Figure
12 is a good example of how attaching values to
a response and costs to a program can provide
additional guidance to decision making. (Here,
ROI is defined as ratio of profit to cost.) Note
that beyond the 8th
decile (80%), the ROI of the scored model becomes
negative. It is at a maximum at the 2nd decile
(20%).

Alternatively you may want to
look at the profitability of a model (profit =
revenue minus
cost), as shown in the following chart (Figure
13).
Note that in the example we've
been using, the maximum lift (for the 10 deciles)
was achieved at the 1st decile (10%), the maximum
ROI at the 2nd decile (20%), and the maximum profit
at the 3rd and 4th deciles.
Ideally, you can act on the results
of a model in a profitable way. But remember,
there may be no practical means to take advantage
of the knowledge gained.
b. External validation. As pointed
out above, no matter how good the accuracy of
a model is estimated to be, there is no guarantee
that it reflects the real world. A valid model
is not necessarily a correct model. One of the
main reasons for this problem is that there are
always assumptions implicit in the model. For
example, the inflation rate may not have been
included as a variable in a model that predicts
the propensity of an individual to buy, but a
jump in inflation from 3% to 17% will certainly
affect people's behavior. Also, the data used
to build the model may fail to match the real
world in some unknown way, leading to an incorrect
model.
Therefore it is important to
test a model in the real world. If a model is
used to select a subset of a mailing list, do
a test mailing to verify the model. If a model
is used to predict credit risk,try the model on
a small set of applicants before full deployment.
The higher the risk associated with an incorrect
model, the more important it is to construct an
experiment to
check the model results.
7. Deploy the model and results.
Once a data mining model is built and validated,
it can be used in one of two main ways. The first
way is for an analyst to recommend actions based
on simply
viewing the model and its results. For example,
the analyst may look at the clusters the model
has identified, the rules that define the model,
or the lift and ROI charts that depict the effect
of the model.
The second way is to apply the
model to different data sets. The model could
be used to flag
records based on their classification, or assign
a score such as the probability of an action (e.g.,responding
to a direct mail solicitation). Or the model can
select some records from the database and subject
these to further analyses with an OLAP tool.
Often the models are part of
a business process such as risk analysis, credit
authorization or fraud detection. In these cases
the model is incorporated into an Applications.
For instance, a Predictive Modeling model may be integrated
into a mortgage loan Applications to aid a loan
officer in evaluating the applicant. Or a model
might be embedded in an Applications such as an
inventory ordering system that automatically generates
an order when the forecast inventory levels drop
below a threshold.
The data mining model is often
applied to one event or transaction at a time,
such as scoring a loan Applications for risk. The
amount of time to process each new transaction,
and the rate at which new transactions arrive,
will determine whether a parallelized algorithm
is needed. Thus, while loan Applicationss can easily
be evaluated on modest-sized computers, monitoring
credit card transactions or cellular telephone
calls for fraud would require a parallel system
to deal with the high transaction rate.
When delivering a complex Applications,
data mining is often only a small, albeit critical,
part of the final product. For example, knowledge
discovered through data mining may be combined
with the knowledge of domain experts and applied
to data in the database and incoming transactions.
In a fraud detection system, known patterns of
fraud may be combined with discovered patterns.
When suspected cases of fraud are passed on to
fraud investigators for evaluation, the investigators
may need to access database records about other
claims filed by the claimant as well as other
claims in which the same doctors and lawyers were
involved.
Model monitoring. You must,
of course, measure how well your model has worked
after you use it. However, even when you think
you're finished because your model works well,
you must continually monitor the performance of
the model. Over time, all systems evolve. Salespeople
know that purchasing patterns change over time.
External variables such as inflation rate may
change enough to alter the way people behave.
Thus, from time to time the model will have to
be retested, retrained and possibly completely
rebuilt. Charts of the residual differences between
forecasted and observed values are an excellent
way to monitor model results. Such charts are
easy to use and understand, not computationally
intensive, and could be built into the software
that implements the model. Thus, the system could
monitor itself. |