The goal of
data mining is to produce new knowledge that the
user can act upon. It does this by building a
model of the real world based on data collected
from a variety of sources which may include corporate
transactions, customer histories and demographic
information, process control data, and relevant
external databases such as credit bureau information
or weather data. The result of the model building
is a description of patterns and relationships
in the data that can be confidently used for prediction.
To avoid confusing the different
aspects of data mining, it helps to envision a
hierarchy of the choices and decisions you need
to make before you start:
Business goal
Type of prediction
Model type
Algorithm
Product
At the highest level is the
business goal: what is the ultimate purpose of
mining this data? For example, if you are seeking patterns
in your data to help you retain good customers,
you might build one model to predict customer
profitability and a second model to identify customers
likely to leave (attrition). Your knowledge of
your organization's needs and objectives will
guide you in formulating the goal of your models.
The next step is deciding on
the type of prediction that's most appropriate:
(1) classification: predicting into which category
or class a case falls, or (2) regression: predicting
the numeric value a variable will have (if the
variable varies with time, this is called
time series prediction). In the example above,
you might use regression to forecast the amount
of profit, and classification to predict
which customers might leave. These are discussed
in more detail below.
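
To make the distinction concrete, here is a minimal Python sketch of both prediction types applied to the customer example. The data, the field names, and the two helper functions (fit_line, fit_threshold) are hypothetical and chosen only for illustration: a one-variable least-squares line stands in for the regression model, and a single cut-off rule stands in for the classifier. Neither is one of the model types recommended later in the text.

```python
from statistics import mean

# Hypothetical toy data: (months_as_customer, annual_profit, left_company)
customers = [
    (6, 120.0, True),
    (12, 180.0, True),
    (24, 300.0, False),
    (36, 420.0, False),
    (48, 540.0, False),
]
months = [c[0] for c in customers]
profits = [c[1] for c in customers]
left = [c[2] for c in customers]


def fit_line(xs, ys):
    """Regression: ordinary least squares with one predictor.

    Returns (slope, intercept) of the best-fitting straight line.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def fit_threshold(xs, labels):
    """Classification: choose the cut-off t that maximises the accuracy
    of the rule 'x < t means the customer leaves'."""
    candidates = sorted(set(xs)) + [max(xs) + 1]

    def accuracy(t):
        return sum((x < t) == label for x, label in zip(xs, labels)) / len(xs)

    return max(candidates, key=accuracy)


# Regression answers "how much?"; classification answers "which class?".
slope, intercept = fit_line(months, profits)
cutoff = fit_threshold(months, left)

predicted_profit = slope * 30 + intercept  # numeric value for a 30-month customer
likely_to_leave = 10 < cutoff              # class label for a 10-month customer

print("predicted profit:", round(predicted_profit, 2))
print("likely to leave:", likely_to_leave)
```

The regression output is a number and the classification output is a category, even though both models are built from the same customer records.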
Now you can choose the model
type: a neural net to perform the regression,
perhaps, and a decision tree for the classification.
There are also traditional statistical models
to choose from such as logistic regression, discriminant
analysis, or general linear models. The most important
model types for data mining are described in the
next section, Data Mining Models and Algorithms.
Many algorithms are available
to build your models. You might build the neural
net using backpropagation or radial basis functions. For
the decision tree, you might choose among CART,
C5.0, QUEST, or CHAID. Some of these algorithms
are also discussed in Data Mining Models and Algorithms,
below.
When selecting a data mining
product, be aware that products generally have different
implementations of a particular algorithm even
when they identify it by the same name. These
implementation differences can affect operational
characteristics such as memory usage and data
storage, as well as performance characteristics
such as speed and accuracy. Other key considerations
to keep in mind are covered later in the section
on Selecting Data Mining Products.
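
One way to keep the five levels of the hierarchy separate is to record them explicitly for each model you plan to build. The following Python sketch does this for the customer retention example. The MiningPlan class and its field names are hypothetical, and the product field is left as a placeholder because the text does not name a product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiningPlan:
    """One path through the hierarchy, from business goal down to product."""
    business_goal: str
    prediction_type: str  # "classification" or "regression"
    model_type: str
    algorithm: str
    product: str


# Two models serving the same business goal, as in the example in the text.
plans = [
    MiningPlan(
        business_goal="retain good customers",
        prediction_type="regression",       # forecast profitability
        model_type="neural net",
        algorithm="backpropagation",
        product="(not yet chosen)",
    ),
    MiningPlan(
        business_goal="retain good customers",
        prediction_type="classification",   # predict attrition
        model_type="decision tree",
        algorithm="CART",
        product="(not yet chosen)",
    ),
]

for plan in plans:
    print(plan.prediction_type, "->", plan.model_type, "->", plan.algorithm)
```

Writing the choices down this way makes it harder to confuse, say, a model type (decision tree) with an algorithm that builds it (CART).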
Many business goals are best
met by building multiple model types using a variety
of algorithms. You may not be able to determine
which model type is best until you've tried several
approaches.
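
Trying several approaches usually means building each candidate on the same training data and comparing the results on cases the models have not seen. The Python sketch below illustrates the idea under stated assumptions: the data are invented, and the two candidates (a majority-class baseline and a 1-nearest-neighbour rule) are simple stand-ins chosen for brevity, not the model types discussed above.

```python
# Hypothetical data: (monthly_support_calls, left_company)
train = [
    (0, False), (1, False), (2, False), (3, False),
    (5, True), (6, True), (7, True),
]
holdout = [(1, False), (3, False), (6, True), (8, True)]


def majority_model(rows):
    """Baseline: always predict the most common label in the training rows."""
    labels = [label for _, label in rows]
    most_common = max(set(labels), key=labels.count)
    return lambda x: most_common


def nearest_neighbour_model(rows):
    """1-nearest-neighbour: copy the label of the closest training case."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == label for x, label in rows) / len(rows)


# Build every candidate on the training data, then score it on held-out data.
candidates = {
    "majority": majority_model,
    "nearest_neighbour": nearest_neighbour_model,
}
scores = {name: accuracy(build(train), holdout)
          for name, build in candidates.items()}
best = max(scores, key=scores.get)

print(scores)
print("best on holdout:", best)
```

Scoring on held-out cases rather than on the training data matters here, because a model can fit the training data well and still predict new cases poorly.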