Data Mining  
Solutions    
 
Back to HomePage
   Introduction       >
   Methodology
   Applications       >
   Models and Algorithms
 
Your position£ºHome Page>Data Mining¡ª¡ªDescriptive
Descriptive Modeling
 
Summaries and visualization
 

Before you can build good Predictive Modeling models, you must understand your data. Start by gathering a variety of numerical summaries (including descriptive statistics such as averages, standard deviations and so forth) and looking at the distribution of the data. You may want to produce cross tabulations (pivot tables) for multi-dimensional data.

Data can be continuous, having any numerical value (e.g., quantity sold) or categorical, fitting into discrete classes (e.g., red, blue, green). Categorical data can be further defined as either ordinal, having a meaningful order (e.g., high/medium/low), or nominal, that is unordered (e.g., postal codes).

Graphing and visualization tools are a vital aid in data preparation and their importance to effective data analysis cannot be overemphasized. Data visualization most often provides the Aha! leading to new insights and success. Some of the common and very useful graphical displays of data are histograms or box plots that display distributions of values. You may also want to look at scatter plots in two or three dimensions of different pairs of variables. The ability to add a third, overlay variable greatly increases the usefulness of some types of graphs.

Visualization works because it exploits the broader information bandwidth of graphics as opposed to text or numbers. It allows people to see the forest and zoom in on the trees. Patterns, relationships, exceptional values and missing values are often easier to perceive when shown graphically, rather than as lists of numbers and text.

The problem in using visualization stems from the fact that models have many dimensions or
variables, but we are restricted to showing these dimensions on a two-dimensional computer screen or paper. For example, we may wish to view the relationship between credit risk and age, sex, marital status, own-or-rent, years in job, etc. Consequently, visualization tools must use clever representations to collapse n dimensions into two. Increasingly powerful and sophisticated data visualization tools are being developed, but they often require people to train their eyes through practice in order to understand the information being conveyed. Users who are color-blind or who are not spatially oriented may also have problems with visualization tools.

   
Clustering
 

Clustering divides a database into different groups. The goal of clustering is to find groups that are very different from each other, and whose members are very similar to each other. Unlike classification (see Predictive Modeling Data Mining, below), you don't know what the clusters will be when you start, or by which attributes the data will be clustered. Consequently, someone who is knowledgeable in the business must interpret the clusters. Often it is necessary to modify the clustering by excluding variables that have been employed to group instances, because upon examination the user identifies them as irrelevant or not meaningful. After you have found clusters that reasonably segment your database, these clusters may then be used to classify new data. Some of the common Algorithms used to perform clustering include Kohonen feature maps and K-means.

Don't confuse clustering with segmentation. Segmentation refers to the general problem of
identifying groups that have common characteristics. Clustering is a way to segment data into groups that are not previously defined, whereas classification is a way to segment data by assigning it to groups that are already defined.

Link analysis
 

Link analysis is a descriptive approach to exploring data that can help identify relationships among values in a database. The two most common approaches to link analysis are association discovery and sequence discovery. Association discovery finds rules about items that appear together in an event such as a purchase transaction. Market-basket analysis is a well-known example of association discovery. Sequence discovery is very similar, in that a sequence is an association related over time.

Associations are written as A T B, where A is called the antecedent or left-hand side (LHS), and B is called the consequent or right-hand side (RHS). For example, in the association rule ¡°If people buy a hammer then they buy nails,¡± the antecedent is ¡°buy a hammer¡± and the consequent is ¡°buy nails.¡±

It's easy to determine the proportion of transactions that contain a particular item or item set: simply count them. The frequency with which a particular association (e.g., the item set ¡°hammers and nails¡±) appears in the database is called its support or prevalence. If, say, 15 transactions out of 1,000 consist of ¡°hammer and nails,¡± the support for this association would be 1.5%. A low level of support (say, one transaction out of a million) may indicate that the particular association isn't very important ¡ª or it may indicated the presence of bad data (e.g., ¡°male and pregnant¡±).

To discover meaningful rules, however, we must also look at the relative frequency of occurrence of the items and their combinations. Given the occurrence of item A (the antecedent), how often does item B (the consequent) occur? That is, what is the conditional predictability of B, given A?Using the above example, this would mean asking ¡°When people buy a hammer, how often do they also buy nails?¡± Another term for this conditional predictability is confidence. Confidence is calculated as a ratio: (frequency of A and B)/(frequency of A).

Let's specify our hypothetical database in more detail to illustrate these concepts:
     Total hardware-store transactions: 1,000
     Number which include ¡°hammer¡±: 50
     Number which include ¡°nails¡±: 80
     Number which include ¡°lumber¡±: 20
     Number which include ¡°hammer¡± and ¡°nails¡±: 15
     Number which include ¡°nails¡± and ¡°lumber¡±: 10
     Number which include ¡°hammer¡± and ¡°lumber¡±: 10
     Number which include ¡°hammer,¡± ¡°nails¡± and ¡°lumber¡±:5

We can now calculate:
     Support for ¡°hammer and nails¡± = 1.5% (15/1,000)
     Support for ¡°hammer, nails and lumber¡± = 0.5%(5/1,000)
     Confidence of ¡°hammer T nails¡± = 30% (15/50)
     Confidence of ¡°nails T hammer¡± = 19% (15/80)
     Confidence of ¡°hammer and nails T lumber¡± = 33%(5/15)
     Confidence of ¡°lumber T hammer and nails¡± = 25%(5/20)

Thus we can see that the likelihood that a hammer buyer will also purchase nails (30%) is greater than the likelihood that someone buying nails will also purchase a hammer (19%). The prevalence of this hammer-and-nails association (i.e., the support is 1.5%) is high enough to suggest a meaningful rule.Lift is another measure of the power of an association. The greater the lift, the greater the influence that the occurrence of A has on the likelihood that B will occur. Lift is calculated as the ratio (confidence of A T B)/ (frequency of B). From our example:
     Lift of ¡°hammer T nails¡±: 3.75 (30%/8%)
     Lift of ¡°hammer and nails T lumber¡±: 16.5 (33%/2%)

Association Algorithms find these rules by doing the equivalent of sorting the data while counting occurrences so that they can calculate confidence and support. The efficiency with which they can do this is one of the differentiators among Algorithms. This is especially important because of the combinatorial explosion that results in enormous numbers of rules, even for market baskets in the express lane. Some Algorithms will create a database of rules, confidence factors, and support that can be queried (for example, ¡°Show me all associations in which ice cream is the consequent, that have a confidence factor of over 80% and a support of 2% or more¡±).

Another common attribute of association rule generators is the ability to specify an item hierarchy. In our example we have looked at all nails and hammers, not individual types. It is important to choose a proper level of aggregation or you'll be unlikely to find associations of interest. An item hierarchy allows you to control the level of aggregation and experiment with different levels.

Remember that association or sequence rules are not really rules, but rather Descriptives of
relationships in a particular database. There is no formal testing of models on other data to increase the Predictive Modeling power of these rules. Rather there is an implicit assumption that the past behavior will continue in the future.

It is often difficult to decide what to do with association rules you've discovered. In store planning,for example, putting associated items physically close together may reduce the total value of market baskets ¡ª customers may buy less overall because they no longer pick up unplanned items while walking through the store in search of the desired items. Insight, analysis and experimentation are usually required to achieve any benefit from association rules.

Graphical methods may also be very useful in seeing the structure of links. In Figure 3 each of the circles represents a value or an event. The lines connecting them show a link. The thicker lines represent stronger or more frequent linkages, thus emphasizing potentially more important relationships such as associations. For instance, looking at an Insurance database to detect potential fraud might reveal that a particular doctor and lawyer work together on an unusually large number of cases.

 

Reference£º"Introduction to Data Mining and Knowledge Discovery" by Two Crows Corporation

 
  Copyright © 2003 - Hua Analytical Technology Co.,Ltd All rights reserved.