|
Your position£º Home Page> Data
Mining¡ª¡ªAlgorithms |
 |
Data
Mining Models and Algorithms |
 |
|
|
| |
| |
Now let's examine
some of the types of Models And Algorithms used
to mine data. Most products use variations of
Algorithms that have been published in computer
science or statistics journals, with their specific
implementations customized to meet the individual
vendor's goal. For example, many vendors sell
versions of the CART or CHAID decision trees with
enhancements to work on parallel computers. Some
vendors have proprietary Algorithms which, while
not extensions or enhancements of any published
approach, may work quite well.
Most of the Models And Algorithms
discussed in this section can be thought of as
generalizations of the standard workhorse of modeling,
the linear regression model. Much effort has been
expended in the statistics, computer science,
artificial intelligence and engineering communities
to overcome the limitations of this basic model.
The common characteristic of many of the newer
technologies we will consider is that the pattern-finding
mechanism is data-driven rather than user-driven.
That is, the relationships are found inductively
by the software itself based on the existing data
rather than requiring the modeler to specify the
functional form and interactions.
Perhaps the most important thing
to remember is that no one model or algorithm
can or should be used exclusively. For any given
problem, the nature of the data itself will affect
the choice of Models And Algorithms you choose.
There is no ¡°best¡± model or algorithm. Consequently,
you will need a variety of tools and technologies
in order to find the best possible model. |
| |
|
 |
Neural Networks |
| |
Neural networks
are of particular interest because they offer
a means of efficiently modeling large and complex
problems in which there may be hundreds of predictor
variables that have many interactions.(Actual
biological neural networks are incomparably more
complex.) Neural nets may be used in classification
problems (where the output is a categorical variable)
or for regressions (where the output variable
is continuous).
A neural network (Figure 4)
starts with an input layer, where each node corresponds
to a predictor variable. These input nodes are
connected to a number of nodes in a hidden layer.
Each input node is connected to every node in
the hidden layer. The nodes in the hidden layer
may be connected to nodes in another hidden layer,
or to an output layer. The output layer consists
of one or more response variables.
After the input layer, each node
takes in a set of inputs, multiplies them by a
connection weight Wxy (e.g., the weight from node
1 to 3 is W13 ¡ª see Figure 5), adds them together,
applies a function (called the activation or squashing
function) to them, and passes the output to the
node(s) in the next layer. For example, the value
passed from node 4 to node 6 is: Activation function
applied to ([W14 * value of node 1] + [W24 * value
of node 2])
Each node may be viewed as a
predictor variable (nodes 1 and 2 in this example)
or as a combination of predictor variables (nodes
3 through 6). Node 6 is a non-linear combination
of the values of nodes 1 and 2, because of the
activation function on the summed values at the
hidden nodes. In fact, if there is a linear activation
function but no hidden layer, neural nets are
equivalent to a linear regression; and with certain
non-linear activation functions, neural nets are
equivalent to logistic regression.
The connection weights (W's)
are the unknown parameters which are estimated
by a training method. Originally, the most common
training method was backpropagation; newer methods
include conjugate gradient, quasi-Newton, Levenberg-Marquardt,
and genetic Algorithms. Each training method has
a set of parameters that control various aspects
of training such as avoiding local optima or adjusting
the speed of conversion.
The architecture (or topology)
of a neural network is the number of nodes and
hidden layers, and how they are connected. In
designing a neural network, either the user or
the software must choose the number of hidden
nodes and hidden layers, the activation function,
and limits on the weights. While there are some
general guidelines, you may have to experiment
with these parameters.
One of the most common types
of neural network is the feed-forward backpropagation
network. For simplicity of discussion, we will
assume a single hidden layer.
Backpropagation training is
simply a version of gradient descent, a type of
algorithm that tries to
reduce a target value (error, in the case of neural
nets) at each step. The algorithm proceeds as
follows.
Feed forward: The value of the
output node is calculated based on the input node
values and
a set of initial weights. The values from the
input nodes are combined in the hidden layers,
and the values of those nodes are combined to
calculate the output value.
Backpropagation: The error in
the output is computed by finding the difference
between the
calculated output and the desired output (i.e.,
the actual values found in the training set).
Next, the error from the output is assigned to
the hidden layer nodes proportionally to their
weights. This permits an error to be computed
for every output node and hidden node in the
network. Finally, the error at each of the hidden
and output nodes is used by the algorithm to adjust
the weight coming into that node to reduce the
error.
This process is repeated for
each row in the training set. Each pass through
all rows in the training set is called an epoch.
The training set will be used repeatedly, until
the error no longer decreases. At that point the
neural net is considered to be trained to find
the pattern in the test set.
Because so many parameters may
exist in the hidden layers, a neural net with
enough hidden nodes will always eventually fit
the training set if left to run long enough. But
how well it will do on other data? To avoid an
overfitted neural network which will only work
well on the training data, you must know when
to stop training. Some implementations will evaluate
the neural net against the test data periodically
during training. As long as the error rate on
the test set is decreasing, training will continue.
If the error rate on the test data goes up, even
though the error rate on the training data is
still decreasing, then the neural net may be overfitting
the data.
The graph in Figure 6 illustrates
how the test data set helps us avoid overfitting.
You can see how the error rate decreases with
each pass the neural net makes through the data
(small circle markers), but the error rate for
the test data (triangle markers) bottoms out and
starts increasing. Since the goal of data mining
is to make predictions on data other than the
training set, you are clearly better off using
a neural net that minimizes the error on the test
data, not the training data.
Neural networks differ in philosophy
from many statistical methods in several ways.
First, a neural network usually has more parameters
than does a typical statistical model. For example,
there are thirteen parameters (nine weights and
four bias or constant terms) in the neural network
shown in Figure 4. Because they are so numerous,
and because so many combinations of parameters
result in similar predictions, the parameters
become uninterpretable and the network serves
as a ¡°black box¡± predictor. In fact, a given result
can be associated with several different sets
of weights. Consequently, the network weights
in general do not aid in understanding the underlying
process generating the
prediction. However, this is acceptable in many
Applicationss. A Banking may want to utomatically
recognize handwritten Applicationss, but does not
care about the form of the functional relationship
between the pixels and the characters they represent.
Some of the many Applicationss where hundreds of
variables may be input into models with thousands
of parameters (node weights) include modeling
of chemical plants, robots and financial markets,
and pattern recognition problems such as speech,
vision and handwritten character ecognition.
One advantage of neural network
models is that they can easily be implemented
to run on massively parallel computers with each
node simultaneously doing its own calculations.
Users must be conscious of several
facts about neural networks: First, neural networks
are not easily interpreted. There is no explicit
rationale given for the decisions or predictions
a neural network makes.
Second, they tend to overfit
the training data unless very stringent measures,
such as weight decay and/or cross validation,
are used judiciously. This is due to the very
large number of parameters of the neural network
which, if allowed to be of sufficient size, will
fit any data set arbitrarily well when allowed
to train to convergence.
Third, neural networks require
an extensive amount of training time unless the
problem is very small.Once trained, however, they
can provide predictions very quickly.
Fourth,they require no less
data preparation than any other method, which
is to say they require a lot of data preparation.
One myth of neural networks is that data of any
quality can be used to provide reasonable predictions.
The most successful implementations of neural
networks (or decision trees, or logistic regression,
or any other method) involve very careful data
cleansing, selection, preparation and pre-processing.
For instance, neural nets require that all variables
be numeric. Therefore categorical data such as
¡°state¡± is usually broken up into multiple dichotomous
variables (e.g.,¡°California,¡± ¡°New York¡±) , each
with a ¡°1¡± (yes) or ¡°0¡± (no) value. The resulting
increase in variables is called the categorical
explosion.
Finally, neural networks tend
to work best when the data set is sufficiently
large and the signal-tonoise ratio is reasonably
high. Because they are so flexible, they will
find many false patterns in a low signal-to-noise
ratio situation. |
| |
|
 |
Decision
Trees |
| |
Decision trees
are a way of representing a series of rules that
lead to a class or value. For example,you may
wish to classify loan applicants as good or bad
credit risks. Figure 7 shows a simple decision
tree that solves this problem while illustrating
all the basic components of a decision tree: the
decision node, branches and leaves.
The first component is the top
decision node, or root node, which specifies a
test to be carried out. The root node in this
example is ¡°Income > $40,000.¡± The results
of this test cause the tree to split into branches,
each representing one of the possible answers.
In this case, the test ¡°Income > $40,000¡± can
be answered either ¡°yes¡± or ¡°no,¡± and so we get
two branches.
Depending on the algorithm,
each node may have two or more branches. For example,
CART
generates trees with only two branches at each
node. Such a tree is called a binary tree. When
more than two branches are allowed it is called
a multiway tree.
Each branch will lead either
to another decision node or to the bottom of the
tree, called a leaf node.By navigating the decision
tree you can assign a value or class to a case
by deciding which branch to take, starting at
the root node and moving to each subsequent node
until a leaf node is reached. Each node uses the
data from the case to choose the appropriate branch.
Armed with this sample tree
and a loan Applications, a loan officer could determine
whether the applicant was a good or bad credit
risk. An individual with ¡°Income > $40,000¡±
and ¡°High Debt¡±would be classified a ¡°Bad Risk,¡±
whereas an individual with ¡°Income < $40,000¡±
and ¡°Job > 5 Years¡± would be classified a ¡°Good
Risk.¡±
Decision trees models are commonly
used in data mining to examine the data and induce
the tree and its rules that will be used to make
predictions. A number of different Algorithms
may be used for building decision trees including
CHAID (Chi-squared Automatic Interaction Detection),
CART (Classification And Regression Trees), Quest,
and C5.0.
Decision trees are grown through
an iterative splitting of data into discrete groups,
where the goal is to maximize the ¡°distance¡± between
groups at each split.
One of the distinctions between
decision tree methods is how they measure this
distance. While the details of such measurement
are beyond the scope of this Introduction, you
can think of each split as separating the data
into new groups which are as different from each
other as possible. This is also sometimes called
making the groups purer. Using our simple example
where the data had two possible output classes
¡ª Good Risk and Bad Risk ¡ª it would be preferable
if each data split found a criterion resulting
in ¡°pure¡± groups with instances of only one class
instead of both classes.
Decision trees which are used
to predict categorical variables are called classification
trees because they place instances in categories
or classes. Decision trees used to predict continuous
variables are called regression trees.
The example we've been using
up until now has been very simple. The tree is
easy to understand and interpret. However, trees
can become very complicated. Imagine the complexity
of a decision tree derived from a database of
hundreds of attributes and a response variable
with a dozen output classes. Such a tree would
be extremely difficult to understand, although
each path to a leaf is usually understandable.
In that sense a decision tree can explain its
predictions, which is an important advantage.
However, this clarity can be
somewhat misleading. For example, the hard splits
of decision trees imply a precision that is rarely
reflected in reality. (Why would someone whose
salary was $40,001 be a good credit risk whereas
someone whose salary was $40,000 not be?) Furthermore,
since several trees can often represent the same
data with equal accuracy, what interpretation
should be placed on the rules?
Decision trees make few passes
through the data (no more than one pass for each
level of the tree) and they work well with many
predictor variables. As a consequence, models
can be built very quickly, making them suitable
for large data sets.
Trees left to grow without bound
take longer to build and become unintelligible,
but more importantly they overfit the data. Tree
size can be controlled via stopping rules that
limit growth. One common stopping rule is simply
to limit the maximum depth to which a tree may
grow. Another stopping rule is to establish a
lower limit on the number of records in a node
and not do splits below this limit.
An alternative to stopping rules
is to prune the tree. The tree is allowed to grow
to its full size and then, using either built-in
heuristics or user intervention, the tree is pruned
back to the smallest size that does not compromise
accuracy. For example, a branch or subtree that
the user feels is inconsequential because it has
very few cases might be removed. CART prunes trees
by cross validating them to see if the improvement
in accuracy justifies the extra nodes.
A common criticism of decision
trees is that they choose a split using a ¡°greedy¡±
algorithm in which the decision on which variable
to split doesn't take into account any effect
the split might have on future splits. In other
words, the split decision is made at the node
¡°in the moment¡± and it is never revisited. In
addition, all splits are made sequentially, so
each split is dependent on its predecessor. Thus
all future splits are dependent on the first split,
which means the final solution could be very different
if a different first split is made. The benefit
of looking ahead to make the best splits based
on two or more levels at one time is unclear.
Such attempts to look ahead are in the research
stage, but are very computationally intensive
and presently unavailable in commercial implementations.
Furthermore, Algorithms used
for splitting are generally univariate; that is,
they consider only one predictor variable at a
time. And while this approach is one of the reasons
the model builds quickly ¡ª it limits the number
of possible splitting rules to test ¡ª it also
makes relationships between predictor variables
harder to detect.
Decision trees that are not
limited to univariate splits could use multiple
predictor variables in a single splitting rule.
Such a decision tree could allow linear combinations
of variables, also known as oblique trees. A criterion
for a split might be ¡°SALARY < (0.35 * MORTGAGE),¡±
for instance. Splitting on logical combinations
of variables (such as ¡°SALARY > 35,000 OR MORTGAGE
< 150,000¡±) is another kind of multivariate
split.
Decision trees handle non-numeric
data very well. This ability to accept categorical
data minimizes the amount of data transformations
and the explosion of predictor variables inherent
in neural nets.Some classification trees were
designed for and therefore work best when the
predictor variables are also categorical. Continuous
predictors can frequently be used even in these
cases by converting the continuous variable to
a set of ranges (binning). Some decision trees
do not support continuous response variables (i.e.,
will not build regression trees), in which case
the response variables in the training set must
also be binned to output classes. |
| |
|
 |
Multivariate
Adaptive Regression Splines (MARS) |
| |
In the mid-1980s
one of the inventors of CART, Jerome H. Friedman,
developed a method designed to address its shortcomings.
The main disadvantages he wanted
to eliminate were:
Discontinuous predictions
(hard splits).
Dependence of all
splits on previous ones.
Reduced interpretability
due to interactions, especially high-order interactions.
To this end he developed the
MARS algorithm. The basic idea of MARS is quite
simple, while the algorithm itself is rather involved.
Very briefly, the CART disadvantages are taken
care of by:
Replacing the discontinuous
branching at a node with a continuous transition
modeled by a
pair of straight lines. At the end of the model-building
process, the straight lines at each
node are replaced with a very smooth function
called a spline.
Not requiring that
new splits be dependent on previous splits.
Unfortunately, this means MARS
loses the tree structure of CART and cannot produce
rules. On the other hand, MARS automatically finds
and lists the most important predictor variables
as well as the interactions among predictor variables.
MARS also plots the dependence of the response
on each predictor. The result is an automatic
non-linear step-wise regression tool.
MARS, like most neural net and
decision tree Algorithms, has a tendency to overfit
the training data. This can be addressed in two
ways. First, manual cross validation can be performed
and the algorithm tuned to provide good prediction
on the test set. Second, there are various tuning
parameters in the algorithm itself that can guide
internal cross validation. |
 |
Rule Induction
|
| |
Rule induction
is a method for deriving a set of rules to classify
cases. Although decision trees can produce a set
of rules, rule induction methods generate a set
of independent rules which do not necessarily
(and are unlikely to) form a tree. Because the
rule inducer is not forcing splits at each level,
and can look ahead, it may be able to find different
and sometimes better patterns for classification.
Unlike trees, the rules generated may not cover
all possible situations. Also unlike trees, rules
may sometimes conflict in their predictions, in
which case it is necessary to choose which rule
to follow. One common method to resolve conflicts
is to assign a confidence to rules and use the
one in which you are most confident. Alternatively,
if more than two rules conflict, you may let them
vote, perhaps weighting their votes by the confidence
you have in each rule. |
 |
K-nearest
Neighbor and Memory-Based Reasoning (MBR)
|
| |
When trying
to solve new problems, people often look at Solutions
to similar problems that they have previously
solved. K-nearest neighbor (k-NN) is a classification
technique that uses a version of this same method.
It decides in which class to place a new case
by examining some number ¡ª the ¡°k¡± in k-nearest
neighbor ¡ª of the most similar cases or neighbors
(Figure 8). It counts the number of cases for
each class, and assigns the new case to the same
class to which most of its neighbors belong.
The first thing you must do to
apply k-NN is to find a measure of the distance
between attributes in the data and then calculate
it. While this is easy for numeric data, categorical
variables need special handling. For example,
what is the distance between blue and green? You
must then have a way of summing the distance measures
for the attributes. Once you can calculate the
distance between cases, you then select the set
of already classified cases to use as the basis
for classifying new cases, decide how large a
neighborhood in which to do the comparisons, and
also decide how to count the neighbors themselves
(e.g., you might give more weight to nearer neighbors
than farther neighbors).
K-NN puts a large computational
load on the computer because the calculation time
increases as the factorial of the total number
of points. While it's a rapid process to apply
a decision tree or neural net to a new case, k-NN
requires that a new calculation be made for each
new case. To speed up k-NN,frequently all the
data is kept in memory. Memory-based reasoning
usually refers to a k-NN classifier kept in memory.
K-NN models are very easy to
understand when there are few predictor variables.
They are also
useful for building models that involve non-standard
data types, such as text. The only requirement
for being able to include a data type is the existence
of an appropriate metric. |
 |
Logistic
Regression |
| |
Logistic regression
is a generalization of linear regression. It is
used primarily for predicting binary variables
(with values such as yes/no or 0/1) and occasionally
multi-class variables. Because the response variable
is discrete, it cannot be modeled directly by
linear regression. Therefore, rather than predict
whether the event itself (the response variable)
will occur, we build the model to predict the
logarithm of the odds of its occurrence. This
logarithm is called the log odds or the logit
transformation.
has the same interpretation
as in the more casual use of odds in games of
chance or sporting events.When we say that the
odds are 3 to 1 that a particular team will win
a soccer game, we mean that the probability of
their winning is three times as great as the probability
of their losing. So we believe they have a 75%
chance of winning and a 25% chance of losing.
Similar terminology can be applied to the chances
of a particular type of customer (e.g., a customer
with a given gender, income, marital status, etc.)
replying to a mailing. If we say the odds are
3 to 1 that the customer will respond, we mean
that the probability of that type of customer
responding is three times as great as the probability
of him or her not responding.
Having predicted the log odds,
you then take the anti-log of this number to find
the odds. Odds of 62% would mean that the case
is assigned to the class designated ¡°1¡± or ¡°yes,¡±
for example.
While logistic regression is
a very powerful modeling tool, it assumes that
the response variable (the log odds, not the event
itself) is linear in the coefficients of the predictor
variables. Furthermore, the modeler, based on
his or her experience with the data and data analysis,
must choose the right inputs and specify their
functional relationship to the response variable.
So, for example, the modeler must choose among
income or (income)2 or log (income) as a predictor
variable. Additionally the modeler must explicitly
add terms for any interactions. It is up to the
model builder to search for the right variables,
find their correct expression, and account for
their possible interactions. Doing this effectively
requires a great deal of skill and experience
on the part of the analyst.
Neural nets, on the other hand,
use their hidden layers to estimate the forms
of the non-linear terms and interaction in a semi-automated
way. Users need a different set of analytic skills
to apply neural nets successfully. For example,
the choice of an activation function will affect
the speed with which a neural net trains. |
 |
Discriminant
Analysis |
| |
Discriminant
analysis is the oldest mathematical classification
technique, having been first published by R. A.
Fisher in 1936 to classify the famous Iris botanical
data into three species. It finds hyperplanes
(e.g., lines in two dimensions, planes in three,
etc.) that separate the classes. The resultant
model is very easy to interpret because all the
user has to do is determine on which side of the
line (or hyper-plane) a point falls. Training
is simple and scalable. The technique is very
sensitive to patterns in the data. It is used
very often in certain disciplines such as medicine,
the social sciences, and field
biology.
Discriminant analysis is not
popular in data mining, however, for three main
reasons. First, it assumes that all of the predictor
variables are normally distributed (i.e., their
histograms look like bell-shaped curves), which
may not be the case. Second, unordered categorical
predictor variables (e.g., red/blue/ green) cannot
be used at all. Third, the boundaries that separate
the classes are all linear forms (such as lines
or planes), but sometimes the data just can't
be separated that way.
Recent versions of discriminant
analysis address some of these problems by allowing
the boundaries to be quadratic as well as linear,
which significantly increases the sensitivity
in certain cases. There are also techniques that
allow the normality assumption to be replaced
with an estimate of the real distribution (i.e.,
replace the theoretical bell-shaped curve with
the histograms of the predictor variables). Ordered
categorical data can be modeled by forming the
histogram from the bins defined by the categorical
variables. |
 |
Generalized
Additive Models (GAM) |
| |
There is
a class of models extending both linear and logistic
regression, known as generalized additive models
or GAM. They are called additive because we assume
that the model can be written as the sum of possibly
non-linear functions, one for each predictor.
GAM can be used either for regression or for classification
of a binary response. The response variable can
be virtually any function of the predictors as
long as there are not discontinuous steps. For
example, suppose that payment delinquency is a
rather complicated function of income where the
probability of delinquency initially declines
as income increases. It then turns around and
starts to increase again for moderate income,
finally peaking before coming down again for higher
income card-holders. In such a case, a linear
model may fail to see any relationship between
income and delinquency due to the non-linear
behavior. GAM, using computer power in place of
theory or knowledge of the functional form, will
produce a smooth curve, summarizing the relationship
as described above. The most common
estimation procedure is backfitting. Instead of
estimating large numbers of parameters as neural
nets do, GAM goes one step further and estimates
a value of the output for each value of the input
¡ª one point, one estimate. As with the neural
net, GAM generates a curve automatically, choosing
the amount of complexity based on the data. |
 |
Boosting
|
| |
If you were
to build a model using one sample of data, and
then build a new model using the same algorithm
but on a different sample, you might get a different
result. After validating the two models, you could
choose the one that best met your objectives.
Even better results might be achieved if you built
several models and let them vote, making a prediction
based on what the majority recommended. Of course,
any interpretability of the prediction would be
lost, but the improved results might be worth
it.
This is exactly the approach
taken by boosting, a technique first published
by Freund and Schapire in 1996. Basically, boosting
takes multiple random samples from the data and
builds a classification model for each. The training
set is changed based on the result of the previous
models. The final classification is the class
assigned most often by the models. The exact Algorithms
for boosting have evolved from the original, but
the underlying idea is the same.
Boosting has become a very popular
addition to data mining packages. |
 |
Genetic Algorithms
|
| |
Genetic Algorithms
are not used to find patterns per se, but rather
to guide the learning process of data mining Algorithms
such as neural nets. Essentially, genetic Algorithms
act as a method for performing a guided search
for good models in the solution space.
They are called genetic Algorithms
because they loosely follow the pattern of biological
evolution in which the members of one generation
(of models) compete to pass on their characteristics
to the next generation (of models), until the
best (model) is found. The information to be passed
on is contained in ¡°chromosomes,¡± which contain
the parameters for building the model.
For example, in building a neural
net, genetic Algorithms can replace backpropagation
as a way to adjust the weights. The chromosome
in this case would contain the weights. Alternatively,
genetic Algorithms might be used to find the best
architecture, and the chromosomes would contain
the number of hidden layers and the number of
nodes in each layer.
While genetic Algorithms are
an interesting approach to optimizing models,
they add a lot of
computational overhead. |
| |
Reference£º"Introduction
to Data Mining and Knowledge Discovery"
by Two Crows Corporation
|
|
| |
|