Introduction
 |
Data mining: In brief

Databases today can range in size into the terabytes, more
than 1,000,000,000,000 bytes of data. Within these
masses of data lies hidden information of strategic
importance. But when there are so many trees,
how do you draw meaningful conclusions about the
forest?
The newest answer is data mining,
which is being used both to increase revenues
and to reduce costs. The potential returns are
enormous. Innovative organizations worldwide are
already using data mining to locate and appeal
to higher-value customers, to reconfigure their
product offerings to increase sales, and to minimize
losses due to error or fraud.
Data mining is a process that
uses a variety of data analysis tools to discover
patterns and
relationships in data that may be used to make
valid predictions.
The first and simplest analytical
step in data mining is to describe the data:
summarize its statistical attributes (such as
means and standard deviations), visually review
it using charts and graphs, and look for potentially
meaningful links among variables (such as values
that often occur together). As emphasized in the
section on THE DATA MINING PROCESS, collecting,
exploring and selecting the right data are critically
important.
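
As a minimal sketch of this descriptive step, the following Python snippet summarizes a small customer table using only the standard library. The records, column names, and product names are all invented for illustration; they are assumptions, not data from any real system.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, stdev

# Hypothetical customer records; every name and value is invented for illustration.
customers = [
    {"age": 34, "income": 52000, "products": {"checking", "credit_card"}},
    {"age": 45, "income": 61000, "products": {"checking", "mortgage"}},
    {"age": 29, "income": 48000, "products": {"checking", "credit_card"}},
    {"age": 52, "income": 75000, "products": {"checking", "mortgage", "credit_card"}},
]

def describe(rows, column):
    """Return the mean and sample standard deviation of a numeric column."""
    values = [row[column] for row in rows]
    return mean(values), stdev(values)

def co_occurrences(rows):
    """Count how often each pair of products is held by the same customer."""
    counts = Counter()
    for row in rows:
        for pair in combinations(sorted(row["products"]), 2):
            counts[pair] += 1
    return counts

age_mean, age_sd = describe(customers, "age")
pair_counts = co_occurrences(customers)

print(f"age: mean={age_mean:.1f}, sd={age_sd:.1f}")
print("most frequent product pairs:", pair_counts.most_common(2))
```

Pair counts like these are only a first look at "values that often occur together"; they suggest where to look, not what to conclude.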
But description alone cannot
provide an action plan. You must build a predictive
model based on patterns determined from known
results, then test that model on results outside
the original sample. A good model should never
be confused with reality (you know a road map
isn't a perfect representation of the actual road),
but it can be a useful guide to understanding
your business.
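
The idea of testing a model on results outside the original sample can be sketched as follows. This is an illustration under stated assumptions, not a recommended modeling method: the data is synthetic, and the "model" is a deliberately simple income cutoff chosen from the training rows only.

```python
import random

def fit_threshold(rows):
    """Pick the income cutoff that classifies the most training rows correctly."""
    best_cutoff, best_correct = None, -1
    for cutoff in sorted({income for income, _ in rows}):
        correct = sum((income >= cutoff) == responded for income, responded in rows)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff

def accuracy(rows, cutoff):
    """Share of rows where 'income >= cutoff' matches the known outcome."""
    return sum((income >= cutoff) == responded for income, responded in rows) / len(rows)

# Synthetic (income, responded) pairs: responders tend to have higher incomes,
# and about 10% of the outcomes are flipped so the pattern is not perfect.
rng = random.Random(0)
data = []
for _ in range(200):
    income = rng.randint(20_000, 100_000)
    responded = (income > 60_000) != (rng.random() < 0.1)
    data.append((income, responded))

rng.shuffle(data)
train, holdout = data[:150], data[150:]  # the model never sees the hold-out rows

cutoff = fit_threshold(train)
train_acc = accuracy(train, cutoff)
holdout_acc = accuracy(holdout, cutoff)
print(f"cutoff={cutoff}, train accuracy={train_acc:.2f}, hold-out accuracy={holdout_acc:.2f}")
```

The hold-out accuracy is the more honest number, because the cutoff was tuned to fit the training rows.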
The final step is to empirically
verify the model. For example, from a database
of customers who have already responded to a particular
offer, you've built a model predicting which prospects
are likeliest to respond to the same offer. Can
you rely on this prediction? Send a mailing to
a portion of the new list and see what results
you get.
 |
Data mining: What it can't do

Data mining
is a tool, not a magic wand. It won't sit in your
database watching what happens and send you e-mail
to get your attention when it sees an interesting
pattern. It doesn't eliminate the need to know
your business, to understand your data, or to
understand analytical methods. Data mining assists
business analysts with finding patterns and relationships
in the data; it does not tell you the value of
the patterns to the organization. Furthermore,
the patterns uncovered by data mining must be
verified in the real world.
Remember that the predictive
relationships found via data mining are not necessarily
causes of an action or behavior. For example,
data mining might determine that males with incomes
between $50,000 and $65,000 who subscribe to certain
magazines are likely purchasers of a product you
want to sell. While you can take advantage of
this pattern, say by aiming your marketing at
people who fit the pattern, you should not assume
that any of these factors cause them to buy your
product.
To ensure meaningful results,
it's vital that you understand your data. The
quality of your output will often be sensitive
to outliers (data values that are very different
from the typical values in your database), irrelevant
columns or columns that vary together (such as
age and date of birth), the way you encode your
data, and the data you leave in and the data you
exclude. Algorithms vary in their sensitivity
to such data issues, but it is unwise to depend
on a data mining product to make all the right
decisions on its own.
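
Two of these data issues, outliers and columns that vary together, can be checked with a few lines of Python. The sketch below assumes made-up columns, a snapshot year of 2000 for computing age, and a z-score cutoff of 2; all of these are illustrative choices rather than rules.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Distance of each value from the mean, in population standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

# Hypothetical columns; every value is invented for illustration.
birth_year = [1960, 1975, 1982, 1990, 1968, 1985]
age = [2000 - year for year in birth_year]  # assumes a snapshot taken in 2000
income = [41_000, 52_000, 48_000, 45_000, 950_000, 50_000]

# Flag incomes more than 2 standard deviations from the mean (assumed cutoff).
outliers = [v for v, z in zip(income, z_scores(income)) if abs(z) > 2]

# Age and birth year carry the same information, so they correlate perfectly.
redundancy = pearson(age, birth_year)

print("possible outliers:", outliers)
print(f"correlation of age with birth year: {redundancy:.3f}")
```

Whether a flagged value is an error or a genuinely important customer is a judgment that still depends on knowing your data.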
Data mining will not automatically
discover solutions without guidance. Rather than
setting the vague goal, "Help improve the response
to my direct mail solicitation," you might use
data mining to find the characteristics of people
who (1) respond to your solicitation, or (2) respond
AND make a large purchase. The patterns data mining
finds for those two goals may be very different.
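
A small sketch shows how the two goals lead to different targets and therefore different profiles. The mailing results and the $250 cutoff for a "large" purchase are assumptions made up for this example.

```python
from statistics import mean

# Hypothetical mailing results: (age, responded, purchase_amount_in_dollars).
results = [
    (25, True, 20.0), (31, True, 35.0), (28, False, 0.0),
    (54, True, 400.0), (61, True, 650.0), (47, False, 0.0),
    (33, True, 15.0), (58, False, 0.0),
]

LARGE_PURCHASE = 250.0  # assumed cutoff for what counts as a large purchase

def responded(row):
    """Goal 1: the person responded to the solicitation."""
    return row[1]

def responded_with_large_purchase(row):
    """Goal 2: the person responded AND made a large purchase."""
    return row[1] and row[2] >= LARGE_PURCHASE

def average_age(rows, target):
    """Average age of the rows that meet the given target definition."""
    return mean(row[0] for row in rows if target(row))

avg_age_responders = average_age(results, responded)
avg_age_big_buyers = average_age(results, responded_with_large_purchase)

print(f"average age, responders: {avg_age_responders:.1f}")
print(f"average age, responders with a large purchase: {avg_age_big_buyers:.1f}")
```

In this invented data the typical responder is much younger than the typical large purchaser, so a mailing tuned to the first goal would miss the second.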
Although a good data mining tool
shelters you from the intricacies of statistical
techniques, it requires you to understand the
workings of the tools you choose and the algorithms
on which they are based. The choices you make
in setting up your data mining tool and the optimizations
you choose will affect the accuracy and speed
of your models.
Data mining does not replace
skilled business analysts or managers, but rather
gives them a powerful new tool to improve the
job they are doing. Any company that knows its
business and its customers is already aware of
many important, high-payoff patterns that its
employees have observed over the years. What data
mining can do is confirm such empirical observations
and find new, subtle patterns that yield steady
incremental improvement (plus the occasional breakthrough
insight).
 |
Data mining and data warehousing

Frequently,
the data to be mined is first extracted from an
enterprise data warehouse into a data mining database
or data mart (Figure 1). There is some real benefit
if your data is already part of a data warehouse.
As we shall see later on, the problems of cleansing
data for a data warehouse and for data mining
are very similar. If the data has already been
cleansed for a data warehouse, then it most likely
will not need further cleaning in order to be
mined. Furthermore, you will have already addressed
many of the problems of data consolidation and
put in place maintenance procedures.
The data mining database may
be a logical rather than a physical subset of
your data warehouse, provided that the data warehouse
DBMS can support the additional resource demands
of data mining. If it cannot, then you will be
better off with a separate data mining database.
A data warehouse is not a requirement
for data mining. Setting up a large data warehouse
that consolidates data from multiple sources,
resolves data integrity problems, and loads the
data into a query database can be an enormous
task, sometimes taking years and costing millions
of dollars. You could, however, mine data from
one or more operational or transactional databases
by simply extracting it into a read-only database
(Figure 2). This new database functions as a type
of data mart.
 |
Data mining and OLAP

One of the
most common questions from data processing professionals
is about the difference
between data mining and OLAP (On-Line Analytical
Processing). As we shall see, they are very
different tools that can complement each other.
OLAP is part of the spectrum
of decision support tools. Traditional query and
report tools describe what is in a database. OLAP
goes further; it's used to answer why certain
things are true. The user forms a hypothesis about
a relationship and verifies it with a series of
queries against the data. For example, an analyst
might want to determine the factors that lead
to loan defaults. He or she might initially hypothesize
that people with low incomes are bad credit risks
and analyze the database with OLAP to verify (or
disprove) this assumption. If that hypothesis
were not borne out by the data, the analyst might
then look at high debt as the determinant of risk.
If the data did not support this guess either,
he or she might then try debt and income together
as the best predictor of bad credit risks.
In other words, the OLAP analyst
generates a series of hypothetical patterns and
relationships and uses queries against the database
to verify them or disprove them. OLAP analysis
is essentially a deductive process. But what happens
when the number of variables being analyzed is
in the dozens or even hundreds? It becomes much
more difficult and time-consuming to find a good
hypothesis (let alone be confident that there
is not a better explanation than the one found),
and analyze the database with OLAP to verify or
disprove it.
Data mining is different from
OLAP because rather than verify hypothetical patterns,
it uses the data itself to uncover such patterns.
It is essentially an inductive process. For example,
suppose the analyst who wanted to identify the
risk factors for loan default were to use a data
mining tool. The data mining tool might discover
that people with high debt and low incomes were
bad credit risks (as above), but it might go further
and also discover a pattern the analyst did not
think to try, such as that age is also a determinant
of risk.
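
The contrast between the two approaches can be sketched in Python. In the first function the analyst supplies the hypothesis, as in the OLAP example; in the second, a simple exhaustive search proposes the pattern. The loan records are invented, and the one-variable split search is only a stand-in for what a real data mining algorithm does.

```python
# Hypothetical loan records (income and debt in thousands); all values invented.
loans = [
    {"income": 25, "debt": 30, "age": 22, "defaulted": True},
    {"income": 30, "debt": 10, "age": 24, "defaulted": True},
    {"income": 80, "debt": 40, "age": 23, "defaulted": True},
    {"income": 28, "debt": 35, "age": 45, "defaulted": False},
    {"income": 60, "debt": 5, "age": 50, "defaulted": False},
    {"income": 75, "debt": 20, "age": 38, "defaulted": False},
    {"income": 35, "debt": 25, "age": 27, "defaulted": True},
    {"income": 90, "debt": 15, "age": 55, "defaulted": False},
]

def rate(rows):
    """Fraction of the given rows that defaulted."""
    return sum(r["defaulted"] for r in rows) / len(rows)

def default_rate(rows, predicate):
    """Verification style: the analyst states the hypothesis as a predicate."""
    return rate([r for r in rows if predicate(r)])

def best_split(rows, variables):
    """Discovery style: try every variable and cutoff, and keep the split
    whose two sides differ most in default rate."""
    best = None
    for var in variables:
        # Skip the smallest value so that both sides of the split are non-empty.
        for cutoff in sorted({r[var] for r in rows})[1:]:
            low = [r for r in rows if r[var] < cutoff]
            high = [r for r in rows if r[var] >= cutoff]
            gap = abs(rate(low) - rate(high))
            if best is None or gap > best[2]:
                best = (var, cutoff, gap)
    return best

# The analyst's hypothesis: people with low incomes are bad credit risks.
low_income_rate = default_rate(loans, lambda r: r["income"] < 40)

# The search is not told which variable matters.
found = best_split(loans, ["income", "debt", "age"])

print(f"default rate for income under 40: {low_income_rate:.2f}")
print("strongest single split found:", found)
```

With this data the hypothesis about income is partly supported, but the search turns up age as the stronger separator, a variable the analyst did not think to test.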
Here is where data mining and
OLAP can complement each other. Before acting
on the pattern, the analyst needs to know what
the financial implications would be of using the
discovered pattern to govern who gets credit.
The OLAP tool can allow the analyst to answer
those kinds of questions.
Furthermore, OLAP is also complementary
in the early stages of the knowledge discovery
process because it can help you explore your data,
for instance by focusing attention on important
variables, identifying exceptions, or finding interactions.
This is important because the better you understand
your data, the more effective the knowledge discovery
process will be.
 |
Data mining, machine learning and statistics

Data mining
takes advantage of advances in the fields of artificial
intelligence (AI) and statistics. Both disciplines
have been working on problems of pattern recognition
and classification. Both communities have made
great contributions to the understanding and application
of neural nets and decision trees.
Data mining does not replace
traditional statistical techniques. Rather, it
is an extension of statistical methods that is
in part the result of a major change in the statistics
community. The development of most statistical
techniques was, until recently, based on elegant
theory and analytical methods that worked quite
well on the modest amounts of data being analyzed.
The increased power of computers and their lower
cost, coupled with the need to analyze enormous
data sets with millions of rows, have allowed
the development of new techniques based on a brute-force
exploration of possible solutions.
New techniques include relatively
recent algorithms like neural nets and decision
trees, and new approaches to older algorithms
such as discriminant analysis. By virtue of bringing
to bear the increased computer power on the huge
volumes of available data, these techniques can
approximate almost any functional form or interaction
on their own. Traditional statistical techniques
rely on the modeler to specify the functional
form and interactions.
The key point is that data mining
is the application of these and other AI and statistical
techniques to common business problems in a fashion
that makes these techniques available to the skilled
knowledge worker as well as the trained statistics
professional. Data mining is a tool for increasing
the productivity of people trying to build predictive
models.
 |
Data mining and hardware/software trends

A key enabler
of data mining is the major progress in hardware
price and performance. The dramatic 99% drop in
the price of computer disk storage in just the
last few years has radically changed the economics
of collecting and storing massive amounts of data.
At $10/megabyte, one terabyte of data costs $10,000,000
to store. At 10¢/megabyte, one terabyte of data
costs only $100,000 to store! This doesn't even
include the savings in real estate from greater
storage capacities.
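
The arithmetic behind these figures can be checked with a short calculation. The sketch assumes the decimal convention of 1,000,000 megabytes per terabyte, which is what the dollar amounts above imply.

```python
MEGABYTES_PER_TERABYTE = 1_000_000  # decimal convention, as the figures above imply

def storage_cost_dollars(terabytes, cents_per_megabyte):
    """Storage cost in whole dollars, computed in integer cents to avoid rounding."""
    total_cents = terabytes * MEGABYTES_PER_TERABYTE * cents_per_megabyte
    return total_cents // 100

old_cost = storage_cost_dollars(1, 1000)  # $10 per megabyte
new_cost = storage_cost_dollars(1, 10)    # 10 cents per megabyte
drop_percent = 100 * (old_cost - new_cost) / old_cost

print(f"at $10/MB: ${old_cost:,}")
print(f"at 10 cents/MB: ${new_cost:,}")
print(f"price drop: {drop_percent:.0f}%")
```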
The drop in the cost of computer
processing has been equally dramatic. Each generation
of chips greatly increases the power of the CPU,
while allowing further drops on the cost curve.
This is also reflected in the price of RAM (random
access memory), where the cost of a megabyte has
dropped from hundreds of dollars to around a dollar
in just a few years. PCs routinely have 64 megabytes
or more of RAM, and workstations may have 256
megabytes or more, while servers with gigabytes
of main memory are not a rarity.
While the power of the individual
CPU has greatly increased, the real advances in
scalability stem from parallel computer architectures.
Virtually all servers today support multiple CPUs
using symmetric multi-processing, and clusters
of these SMP servers can be created that allow
hundreds of CPUs to work on finding patterns in
the data.
Advances in database management
systems to take advantage of this hardware parallelism
also
benefit data mining. If you have a large or complex
data mining problem requiring a great deal of
access to an existing database, native DBMS access
provides the best possible performance.
The result of these trends is
that many of the performance barriers to finding
patterns in large amounts of data are being eliminated.
 |
Data mining applications

Data mining
is increasingly popular because of the substantial
contribution it can make. It can be used to control
costs as well as contribute to revenue increases.
Many organizations are using
data mining to help manage all phases of the customer
life cycle, including acquiring new customers,
increasing revenue from existing customers, and
retaining good customers. By determining characteristics
of good customers (profiling), a company can target
prospects with similar characteristics. By profiling
customers who have bought a particular product
it can focus attention on similar customers who
have not bought that product (cross-selling).
By profiling customers who have left, a company
can act to retain customers who are at risk for
leaving (reducing churn or attrition), because
it is usually far less expensive to retain a customer
than acquire a new one.
Data mining offers value across
a broad spectrum of industries. Telecommunications
and credit card companies are two of the leaders
in applying data mining to detect fraudulent use
of their services. Insurance companies and stock
exchanges are also interested in applying this
technology to reduce fraud. Medical applications
are another fruitful area: data mining can be
used to predict the effectiveness of surgical
procedures, medical tests or medications. Companies
active in the financial markets use data mining
to determine market and industry characteristics
as well as to predict individual company and stock
performance. Retailers are making more use of
data mining to decide which products to stock
in particular stores (and even how to place them
within a store), as well as to assess the effectiveness
of promotions and coupons. Pharmaceutical firms
are mining large databases of chemical compounds
and of genetic material to discover substances
that might be candidates for development as agents
for the treatment of disease.
 |
Successful data mining

There are
two keys to success in data mining. First is coming
up with a precise formulation of the problem you
are trying to solve. A focused statement usually
results in the best payoff. The second key is
using the right data. After choosing from the
data available to you, or perhaps buying external
data, you may need to transform and combine it
in significant ways.
The more the model builder can
"play" with the data, build models, evaluate results,
and work with the data some more (in a given unit
of time), the better the resulting model will
be. Consequently, the degree to which a data mining
tool supports this interactive data exploration
is more important than the algorithms it uses.
Ideally, the data exploration
tools (graphics/visualization, query/OLAP) are
well-integrated with the analytics or algorithms
that build the models.

Reference: "Introduction
to Data Mining and Knowledge Discovery"
by Two Crows Corporation