Link analysis
is a descriptive approach to exploring data that
can help identify relationships among values in
a database. The two most common approaches to
link analysis are association discovery and sequence
discovery. Association discovery finds rules about
items that appear together in an event such as
a purchase transaction. Market-basket analysis
is a well-known example of association discovery.
Sequence discovery is very similar, in that a
sequence is an association related over time.
Associations are written as
A T B, where A is called the antecedent or left-hand
side (LHS), and B is called the consequent or
right-hand side (RHS). For example, in the association
rule ¡°If people buy a hammer then they buy nails,¡±
the antecedent is ¡°buy a hammer¡± and the consequent
is ¡°buy nails.¡±
It's easy to determine the proportion
of transactions that contain a particular item
or item set: simply count them. The frequency
with which a particular association (e.g., the
item set ¡°hammers and nails¡±) appears in the database
is called its support or prevalence. If, say,
15 transactions out of 1,000 consist of ¡°hammer
and nails,¡± the support for this association would
be 1.5%. A low level of support (say, one transaction
out of a million) may indicate that the particular
association isn't very important ¡ª or it may indicated
the presence of bad data (e.g., ¡°male and pregnant¡±).
To discover meaningful rules,
however, we must also look at the relative frequency
of occurrence of the items and their combinations.
Given the occurrence of item A (the antecedent),
how often does item B (the consequent) occur?
That is, what is the conditional predictability
of B, given A?Using the above example, this would
mean asking ¡°When people buy a hammer, how often
do they also buy nails?¡± Another term for this
conditional predictability is confidence. Confidence
is calculated as a ratio: (frequency of A and
B)/(frequency of A).
Let's specify our hypothetical
database in more detail to illustrate these concepts:
Total hardware-store
transactions: 1,000
Number which include
¡°hammer¡±: 50
Number which include
¡°nails¡±: 80
Number which include
¡°lumber¡±: 20
Number which include
¡°hammer¡± and ¡°nails¡±: 15
Number which include
¡°nails¡± and ¡°lumber¡±: 10
Number which include
¡°hammer¡± and ¡°lumber¡±: 10
Number which include
¡°hammer,¡± ¡°nails¡± and ¡°lumber¡±:5
We can now calculate:
Support for ¡°hammer
and nails¡± = 1.5% (15/1,000)
Support for ¡°hammer,
nails and lumber¡± = 0.5%(5/1,000)
Confidence of ¡°hammer
T nails¡± = 30% (15/50)
Confidence of ¡°nails
T hammer¡± = 19% (15/80)
Confidence of ¡°hammer
and nails T lumber¡± = 33%(5/15)
Confidence of ¡°lumber
T hammer and nails¡± = 25%(5/20)
Thus we can see that the likelihood
that a hammer buyer will also purchase nails (30%)
is greater than the likelihood that someone buying
nails will also purchase a hammer (19%). The prevalence
of this hammer-and-nails association (i.e., the
support is 1.5%) is high enough to suggest a meaningful
rule.Lift is another measure of the power of an
association. The greater the lift, the greater
the influence that the occurrence of A has on
the likelihood that B will occur. Lift is calculated
as the ratio (confidence of A T B)/ (frequency
of B). From our example:
Lift of ¡°hammer
T nails¡±: 3.75 (30%/8%)
Lift of ¡°hammer
and nails T lumber¡±: 16.5 (33%/2%)
Association Algorithms find
these rules by doing the equivalent of sorting
the data while counting occurrences so that they
can calculate confidence and support. The efficiency
with which they can do this is one of the differentiators
among Algorithms. This is especially important
because of the combinatorial explosion that results
in enormous numbers of rules, even for market
baskets in the express lane. Some Algorithms will
create a database of rules, confidence factors,
and support that can be queried (for example,
¡°Show me all associations in which ice cream is
the consequent, that have a confidence factor
of over 80% and a support of 2% or more¡±).
Another common attribute of
association rule generators is the ability to
specify an item hierarchy. In our example we have
looked at all nails and hammers, not individual
types. It is important to choose a proper level
of aggregation or you'll be unlikely to find associations
of interest. An item hierarchy allows you to control
the level of aggregation and experiment with different
levels.
Remember that association or
sequence rules are not really rules, but rather
Descriptives of
relationships in a particular database. There
is no formal testing of models on other data to
increase the Predictive Modeling power of these
rules. Rather there is an implicit assumption
that the past behavior will continue in the future.
It is often difficult to decide
what to do with association rules you've discovered.
In store planning,for example, putting associated
items physically close together may reduce the
total value of market baskets ¡ª customers may
buy less overall because they no longer pick up
unplanned items while walking through the store
in search of the desired items. Insight, analysis
and experimentation are usually required to achieve
any benefit from association rules.
Graphical methods may also be
very useful in seeing the structure of links.
In Figure 3 each of the circles represents a value
or an event. The lines connecting them show a
link. The thicker lines represent stronger or
more frequent linkages, thus emphasizing potentially
more important relationships such as associations.
For instance, looking at an Insurance database
to detect potential fraud might reveal that a
particular doctor and lawyer work together on
an unusually large number of cases.
|