
1 Motivation

HTTP cookies are used to track web-surfing and Internet traffic. We often notice that promotions (ads) on websites tend to match our needs, reveal our prior browsing history, or reflect our interests. That is not an accident. Nowadays, recommendation systems are largely based on machine learning methods that can learn the behavior, e.g., purchasing patterns, of individual consumers. In this chapter, we will uncover some of the mystery behind recommendation systems by studying market basket analysis. Specifically, we will (1) discuss association rules and their support and confidence, (2) present the Apriori algorithm for association rule learning, and (3) cover step-by-step a set of case-studies, including a toy example, Head and Neck Cancer Medications, Grocery purchases, and survival of Titanic passengers.

2 Association Rules

Association rules are the result of process analytics (e.g., market basket analysis) that specify patterns of relationships among items. One specific example would be: \[\{charcoal, \, lighter, \, chicken\, wings\}\rightarrow\{barbecue\, sauce\}\] In words, charcoal, lighter, and chicken wings imply barbecue sauce. The curly brackets indicate that we have a set of items and the arrow suggests a direction of the association. Items in a set are called elements. When an item-set like \(\{charcoal, \, lighter, \, chicken\, wings, \, barbecue\, sauce\}\) appears in our dataset with some regularity, we can mine and discover patterns of association with other item-sets.

Association rules are commonly used for unsupervised discovery of knowledge rather than prediction of pre-specified outcomes. In biomedical research, association rules are widely used to:

  • search for interesting or frequently occurring patterns of DNA,
  • search for protein sequences in analyses of cancer data,
  • find patterns of medical claims that occur in combination with credit card or insurance fraud.

3 The Apriori algorithm for association rule learning

Association rules are mostly applied to transactional data, i.e., records of trades, exchanges, or arrangements, e.g., medical records. Such datasets are typically very large in both the number of transactions and the number of features, e.g., electronic health records (EHR). In these data archives, the number and complexity of transactions is high, which complicates efforts to extract patterns, conduct market analysis, or predict basket purchases.

Apriori association rules help untangle such difficult modeling problems. If we have a simple prior belief about the properties of frequent elements, we may be able to efficiently reduce the number of features or combinations that we need to look at.

The Apriori algorithm is based on a simple apriori belief that all subsets of a frequent item-set must also be frequent. This is known as the Apriori property. In the last example, the full set \(\{charcoal, \, lighter, \, chicken\, wings, \, barbecue\, sauce\}\) can be frequent only if all of its subsets, including single elements, pairs, and triples, occur frequently; conversely, if any subset is infrequent, then every superset containing it must also be infrequent. Naturally, the apriori rule is designed for finding patterns in large datasets where patterns that appear frequently are considered “interesting”, “valuable”, or “important”.

4 Rule support and confidence

We can measure a rule’s importance by computing its support and confidence metrics. The support and confidence represent two criteria useful in deciding whether a pattern is “valuable”. By setting thresholds for these two criteria, we can easily limit the number of interesting rules or item-sets reported.

For item-sets \(X\) and \(Y\), the support of an item-set measures how (relatively) frequently it appears in the data: \[support(X)=\frac{count(X)}{N},\] where N is the total number of transactions in the database and count(X) is the number of observations (transactions) containing the item-set X.

In a set-theoretic sense, the union of item-sets is an item-set itself. In other words, if \(Z=X\cup Y\), then \[support(Z)=support(X,Y).\]

For a given rule \(X \rightarrow Y\), the rule's confidence measures the relative accuracy of the rule: \[confidence(X \rightarrow Y)=\frac{support(X, Y)}{support(X)}.\]

The confidence measures the joint occurrence of X and Y over the X domain. If whenever X appears Y tends to also be present, then we will have a high \(confidence(X\rightarrow Y)\). This set-theoretic formulation of the confidence can also be expressed as a conditional probability, since the numerator is effectively the probability of the intersection of the events \(E_X\) and \(E_Y\) corresponding to observing the transactions \(X\) and \(Y\), respectively. That is, the support corresponds to the joint probability \(support(X,Y)=Prob(E_X\cap E_Y)\), and the confidence corresponds to the conditional probability \[confidence(X \rightarrow Y)=\frac{Prob(E_X\cap E_Y)}{Prob(E_X)}\equiv Prob(E_Y | E_X).\]

Note that the ranges of the support and the confidence are \(0 \leq support,\ confidence \leq 1\).
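
To make these definitions concrete, below is a minimal base-R sketch (using a small set of hypothetical transactions) that computes support and confidence directly from the formulas above.

# minimal sketch: support and confidence computed from first principles
transactions <- list(
  c("peanut butter", "bread", "jelly"),
  c("peanut butter", "bread"),
  c("bread", "milk"),
  c("peanut butter", "milk"))
N <- length(transactions)
# support(X) = count(X)/N
support <- function(itemset)
  sum(sapply(transactions, function(t) all(itemset %in% t))) / N
# confidence(X -> Y) = support(X, Y)/support(X)
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)
support("peanut butter")                # 3/4 = 0.75
confidence("peanut butter", "bread")    # (2/4)/(3/4) = 0.67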

One intuitive example of a strong association rule is \(\{peanut\, butter\}\rightarrow\{bread\}\), because it has high support as well as high confidence in grocery store transactions. Shoppers tend to purchase bread when they get peanut butter; these items tend to appear in the same baskets, which yields high confidence for the rule \(\{peanut\, butter\}\rightarrow\{bread\}\). Of course, weaker and less intuitive rules, such as \(\{peanut\, butter\}\rightarrow\{celery\}\), may also emerge. Recommendation systems are not perfect and can fail in unexpected ways; see the “How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did?” article.

5 Building a set of rules with the Apriori principle

Remember that the number of ways to choose \(k\) elements out of \(n\), i.e., the number of combinations, grows very rapidly with the size of the item inventory (\(n\)), see this SOCR activity. This is precisely why, if a restaurant only uses 10 ingredients, there are \({10\choose 5}=252\) possible 5-ingredient menu items for customers to order. Clearly, the number of possible “baskets” rapidly increases with the inventory of available items or ingredients; see the short R check after this list. For instance,

  • \(n=100\) (ingredients) and \(k=50\) (menus of 50 ingredients), yields \({100\choose 50}=100891344545564193334812497256\gt 10^{29}\) possible orders, and
  • \(n=100\) (ingredients) and \(k=5\) (menus of only 5 ingredients), yields \({100\choose 5}=75287520\gt 75\) million possible orders.
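
These binomial coefficients are easy to verify directly in R:

choose(10, 5)    # 252
choose(100, 50)  # 1.008913e+29
choose(100, 5)   # 75287520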

To avoid this complexity, we will introduce a two-step process for building a few simple and informative sets of rules (a minimal R sketch follows the two steps):

  • Step 1: Filter all item-sets with a minimum support threshold. This is accomplished iteratively by increasing the size of the item-sets. In the first iteration, we compute the support of singletons, 1-item-sets. In the next iteration, we compute the support of pairs of items, then triples of items, etc. Item-sets passing iteration i are considered as candidates for the next iteration, i+1. If {A}, {B}, {C} are all frequent singletons, but D is not frequent in the first singleton-selection round, then in the second iteration we only consider the support of the pairs {A,B}, {A,C}, {B,C}, ignoring all pairs including D. This substantially reduces the cardinality of the potential item-sets and ensures the feasibility of the algorithm. In the third iteration, if {A,C} and {B,C} are frequently occurring, but {A,B} is not, then the algorithm may terminate, as the support of {A,B,C} is trivial (it cannot pass the support threshold), given that {A,B} was not frequent enough.

  • Step 2: Using the item-sets selected in step 1, generate new rules with confidence larger than a predefined minimum confidence threshold. The candidate item-sets that passed step 1 would include all frequent item-sets. For the highly-supported item-set {A, C}, we would compute the confidence measures for \(\{A\}\rightarrow\{C\}\) as well as \(\{C\}\rightarrow\{A\}\) and compare these against the minimum confidence threshold. The surviving rules are the ones with confidence levels exceeding that minimum threshold.
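
The following minimal base-R sketch (on a hypothetical mini-database of 6 transactions) illustrates the level-wise filtering in step 1: the infrequent singleton is dropped immediately, and candidate pairs are generated only from the surviving frequent items.

# hypothetical mini-database; support threshold of 50%
transactions <- list(c("A","B","C"), c("A","B"), c("A","C"),
                     c("B","C"), c("A","B"), c("A","D"))
minSupport <- 0.5
support <- function(items) mean(sapply(transactions, function(t) all(items %in% t)))
# iteration 1: frequent singletons ({D} is pruned, support 1/6)
items <- unique(unlist(transactions))
L1 <- items[sapply(items, support) >= minSupport]
# iteration 2: candidate pairs built only from the frequent singletons
C2 <- combn(L1, 2, simplify = FALSE)
L2 <- Filter(function(p) support(p) >= minSupport, C2)
L2   # only {A,B} survives; {A,C}, {B,C} are infrequent, so {A,B,C} need not be checked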

6 A toy example

Assume that a large supermarket tracks sales data by stock-keeping unit (SKU) for each item, i.e., each item, such as “butter” or “bread”, is identified by an SKU number. The supermarket has a database of transactions where each transaction is a set of SKUs that were bought together.

Suppose the database of transactions consists of the following item-sets, each representing a purchasing order:

require(knitr)
item_table = as.data.frame(t(c("{1,2,3,4}","{1,2,4}","{1,2}","{2,3,4}","{2,3}","{3,4}","{2,4}")))
colnames(item_table) <- c("choice1","choice2","choice3","choice4","choice5","choice6","choice7")
kable(item_table, caption = "Item table")
Item table
choice1 choice2 choice3 choice4 choice5 choice6 choice7
{1,2,3,4} {1,2,4} {1,2} {2,3,4} {2,3} {3,4} {2,4}

We will use Apriori to determine the frequent item-sets of this database. To do so, we will say that an item-set is frequent if it appears in at least \(3\) transactions of the database, i.e., the value \(3\) is the support threshold.

The first step of Apriori is to count up the number of occurrences, i.e., the support, of each member item separately. By scanning the database for the first time, we obtain:

item_table = as.data.frame(t(c(3,6,4,5)))
colnames(item_table) <- c("items: {1}","{2}","{3}","{4}")
rownames(item_table) <- "(N=7)*support"
kable(item_table,caption = "Size 1 Support")
Size 1 Support
items: {1} {2} {3} {4}
(N=7)*support 3 6 4 5

All the singletons, item-sets of size 1, have a support of at least 3, so they are all frequent. The next step is to generate a list of all pairs of frequent items.

For example, regarding the pair \(\{1,2\}\): the item table above shows items 1 and 2 appearing together in three of the transactions; therefore, we say that the support of the item-set \(\{1,2\}\) is \(3\).

item_table = as.data.frame(t(c(3,1,2,3,4,3)))
colnames(item_table) <- c("{1,2}","{1,3}","{1,4}","{2,3}","{2,4}","{3,4}")
rownames(item_table) <- "N*support"
kable(item_table,caption = "Size 2 Support")
Size 2 Support
{1,2} {1,3} {1,4} {2,3} {2,4} {3,4}
N*support 3 1 2 3 4 3

The pairs \(\{1,2\}\), \(\{2,3\}\), \(\{2,4\}\), and \(\{3,4\}\) all meet or exceed the minimum support of \(3\), so they are frequent. The pairs \(\{1,3\}\) and \(\{1,4\}\) are not, and any larger set that contains \(\{1,3\}\) or \(\{1,4\}\) cannot be frequent. In this way, we can prune sets: we will now look for frequent triples in the database, but we can already exclude all the triples that contain one of these two pairs:

item_table = as.data.frame(t(c(2)))
colnames(item_table) <- c("{2,3,4}")
rownames(item_table) <- "N*support"
kable(item_table,caption = "Size 3 Support")
Size 3 Support
{2,3,4}
N*support 2

In the example, there are no frequent triplets: the support of the item-set \(\{2,3,4\}\) is below the minimal threshold, and the other triplets were excluded because they are supersets of pairs whose support was already below the threshold. We have thus determined the frequent sets of items in the database, and illustrated how some item-sets were never counted because one of their subsets was already known to be below the threshold.
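
As a sanity check, the toy example can be reproduced with the arules package (a short sketch, assuming arules is installed); the threshold of 3 out of 7 transactions corresponds to support = 3/7.

library(arules)
toy <- list(c("1","2","3","4"), c("1","2","4"), c("1","2"),
            c("2","3","4"), c("2","3"), c("3","4"), c("2","4"))
toyTrans <- as(toy, "transactions")
freqSets <- apriori(toyTrans, parameter = list(support = 3/7, target = "frequent itemsets"))
inspect(sort(freqSets, by = "support"))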

7 Case Study 1: Head and Neck Cancer Medications

7.1 Step 1 - collecting data

To demonstrate the Apriori algorithm in a real biomedical case-study, we will use transactional healthcare data representing a subset of the Head and Neck Cancer Medication data, which is available in our case-studies collection as 10_medication_descriptions.csv. It consists of inpatient medications for head and neck cancer patients.

The data is in wide format, see Chapter 1, where each row represents a patient. During the study period, each patient had records for a maximum of 5 encounters. NA indicates that there was no medication administration record at that time point for the specific patient. This dataset contains a total of 528 patients.

7.2 Step 2 - exploring and preparing the data

Unlike the data imports in previous chapters, transactional data need to be ingested in R using the read.transactions() function. This function stores the data as a sparse matrix, with each row representing a basket (or transaction) and each column representing an item that may appear in a transaction.

Let’s load the dataset and delete the irrelevant index column. Using the write.csv(R data, "path") function, we can output our R data into a local CSV file. To avoid generating another index column in the output CSV file, we use the row.names=F option.

med<-read.csv("https://umich.instructure.com/files/1678540/download?download_frd=1", stringsAsFactors = FALSE)
med<-med[, -1]
write.csv(med, "medication.csv", row.names=F)
library(knitr)
kable(med[1:5, ])
MEDICATION_DESC.1 MEDICATION_DESC.2 MEDICATION_DESC.3 MEDICATION_DESC.4 MEDICATION_DESC.5
acetaminophen uh cefazolin ivpb uh NA NA NA
docusate fioricet heparin injection ondansetron injection uh simvastatin
hydrocodone acetaminophen 5mg 325mg NA NA NA NA
fentanyl injection uh NA NA NA NA
cefazolin ivpb uh hydrocodone acetaminophen 5mg 325mg NA NA NA

Now we can use read.transactions() in the arules package to read the CSV file we just outputted.

# install.packages("arules")
library(arules)
med<-read.transactions("medication.csv", sep = ",", skip = 1, rm.duplicates=TRUE)
## distribution of transactions with duplicates:
## items
##   1   2   3 
##  79 166 248
summary(med)
## transactions as itemMatrix in sparse format with
##  528 rows (elements/itemsets/transactions) and
##  88 columns (items) and a density of 0.02085486 
## 
## most frequent items:
##                fentanyl injection uh  hydrocodone acetaminophen 5mg 325mg 
##                                  211                                  165 
##                    cefazolin ivpb uh                    heparin injection 
##                                  108                                  105 
## hydrocodone acetamin 75mg 500mg 15ml                              (Other) 
##                                   60                                  320 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5 
## 248 166  79  23  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.835   2.000   5.000 
## 
## includes extended item information - examples:
##                            labels
## 1                         09 nacl
## 2                   09 nacl bolus
## 3 acetaminophen  multiroute    uh

Here we use the option rm.duplicates=TRUE because a patient may have repeated administration records of the same medication; duplicated items within a transaction are removed (the printed table above shows the distribution of transactions that contained duplicates). The option skip=1 means we skip the header line in the CSV file. Now we have transactional data with unique items within each row.

The summary of the transactional data contains rich information. The first block tells us that we have 528 rows (transactions) and 88 different medicines (items) in this matrix. Using the density, we can calculate how many non-NA medication records are in the data. In total, there are \(528\times 88=46,464\) cells in the matrix, so \(46,464\times 0.02085\approx 969\) medication records were documented during the study period.
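
We can double-check this count directly from the reported dimensions and density:

# total item occurrences = (number of transactions) x (number of items) x density
round(528 * 88 * 0.02085486)   # 969 medication records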

The second block lists the most frequent medicines and their frequencies in the matrix. For example, fentanyl injection uh appeared 211 times, which represents a proportion of \(211/528\approx 0.4\) of the (treatment) transactions. Since fentanyl is frequently used to help prevent pain after surgery or other medical treatments, many of these patients may have undergone significant medical procedures requiring post-operative pain management.

The last block shows statistics about the transaction sizes: 248 patients had only one medicine in the study period, while 12 of them had 5 medication records, one for each time point. On average, each patient had about 1.8 different medicines.

7.2.1 Visualizing item support - item frequency plots

The summary may still be fairly abstract, so we can visualize the information.

inspect(med[1:5,])
##     items                                
## [1] {acetaminophen uh,                   
##      cefazolin ivpb uh}                  
## [2] {docusate,                           
##      fioricet,                           
##      heparin injection,                  
##      ondansetron injection uh,           
##      simvastatin}                        
## [3] {hydrocodone acetaminophen 5mg 325mg}
## [4] {fentanyl injection uh}              
## [5] {cefazolin ivpb uh,                  
##      hydrocodone acetaminophen 5mg 325mg}

The inspect() call shows the transactional dataset. We can see that the medication records of each patient are nicely formatted as item-sets.

We can further analyze the frequent terms using itemFrequency(). Below we display the item frequencies (supports) of the first five items, ordered alphabetically.

library(plotly)

itemFrequency(med[, 1:5])
##                                 09 nacl                           09 nacl bolus 
##                             0.013257576                             0.003787879 
##         acetaminophen  multiroute    uh acetaminophen codeine 120 mg 12 mg 5 ml 
##                             0.001893939                             0.001893939 
##       acetaminophen codeine 300mg 30 mg 
##                             0.020833333
# itemFrequencyPlot(med, topN=20)

# convert the data to a matrix
mat <- as.matrix(med@data)
# View matrix as image
# image(mat)

# capture the row/column names
rowNames <- med@itemInfo$labels
colNames <- paste0("S", c(1:dim(mat)[2]))
rownames(mat) <- rowNames
colnames(mat) <- colNames

# convert matrix to DF for processing, order rows based on their average (across subjects/cases/columns), back to matrix for display
df <- as.data.frame(1*mat)
df$avg <- rowMeans(df)
dfOrdered <- df[order(df$avg, decreasing = T), ]
matOrdered <- as.matrix(dfOrdered)

# track the ordered row names
rowNames <- rownames(dfOrdered)
colNames <- colnames(dfOrdered)

# 2D top 20 terms bar plot
# To order the meds based on "avg", instead of alphabetically (mind the "-" sign to order Large to small!)
plot_ly(x = reorder(rowNames[c(1:20)], -dfOrdered[1:20, "avg"]), y=dfOrdered[1:20, "avg"], name="Top 20 Meds", type="bar")  %>%
  layout(title='Frequency of Medications (Top 20) based on averaging across Cases',
           xaxis = list(title="Term"),
           yaxis = list(title="Relative Frequency"))
# 3D surface plot
plot_ly(x = colNames, y = rowNames, z = 2*matOrdered, type = "surface") %>%
  layout(title='Term (X) by Sample (Y) Frequency (Z) Plot',
           xaxis = list(title="Term"),
           yaxis = list(title="Sample ID")) %>% hide_colorbar()

Here we display only the top 20 medicines that most frequently appear in this dataset. Consistent with the prior summary() output, fentanyl is still the most frequent item. You can also plot the items using a support threshold: instead of topN=20, use the option support=0.1, which will show all items that have a support greater than or equal to \(0.1\).

Let’s generalize this process and define a new function, itemFrequencyPlotly(transactionObject, numTopItemps = 10), that can generate frequency plots for any transaction object.

# define a generic plot_ly ItemFrequency plotting function
itemFrequencyPlotly <- function(transactionObject, numTopItemps = 10, name="") {
  name <- ifelse(name=="", 
                 paste0('Frequency of Items (Top ', numTopItemps, 
                     ' ) based on averaging across Cases'),
                 paste0('Frequency of Items (Top ', numTopItemps, 
                     ' ) based on averaging across Cases (Data=', name, ')'))
  
  mat <- as.matrix(transactionObject@data)
  # View matrix as image
  # image(mat)
  
  # capture the row/column names
  rowNames <- transactionObject@itemInfo$labels
  colNames <- paste0("S", c(1:dim(mat)[2]))
  rownames(mat) <- rowNames
  colnames(mat) <- colNames
  
  # convert matrix to DF for processing, order rows based on their average (across subjects/cases/columns), back to matrix for display
  df <- as.data.frame(1*mat)
  df$avg <- rowMeans(df)
  dfOrdered <- df[order(df$avg, decreasing = T), ]
  matOrdered <- as.matrix(dfOrdered)
  
  # track the ordered row names
  rowNames <- rownames(dfOrdered)
  colNames <- colnames(dfOrdered)
  
  # 2D bar plot of the top numTopItemps terms
  # To order the meds based on "avg", instead of alphabetically (mind the "-" sign to order Large to small!)
  plot_ly(x = reorder(rowNames[c(1:numTopItemps)], -dfOrdered[1:numTopItemps, "avg"]), 
          y=dfOrdered[1:numTopItemps, "avg"], name=paste0("Top ", numTopItemps, " Meds"), type="bar")  %>%
    layout(title=name,
             xaxis = list(title="Terms"),
             yaxis = list(title="Relative Frequency"))
}

7.2.2 Visualizing transaction data - plotting the sparse matrix

The sparse matrix will show what medications were prescribed for each patient. Below we only show the top-20 medications for the first 15 cases.

# image(med[1:5, ])

plot_ly(x=reorder(rowNames[c(1:20)], -dfOrdered[1:20, "avg"]), y=colNames[1:15], 
        z=2*matOrdered[1:15, 1:20], type="heatmap") %>% 
  layout(title='Heatmap - Top-20 Medications for the first 15 Cases') %>% hide_colorbar()

This image has 15 rows (transactions, as we only selected the first 15 patients) and 20 columns (top-20 meds). Although the picture may be a little hard to interpret, it gives a sense of what kind of medicine is prescribed for each patient in the study.

Let’s see an expanded graph including the top 30 medications for a random roster of 50 patients.

subset_int <- sample(ncol(matOrdered) - 1, 50, replace = FALSE)  # exclude the appended "avg" column
# image(med[subset_int, ])

plot_ly(x=rowNames[1:30], y=colNames[subset_int], 
        z=2*t(matOrdered[1:30, subset_int]), type="heatmap") %>% 
  layout(title='Heatmap - Top-30 Medications for a Random Set of 50 Patients') %>% hide_colorbar()

It shows us clearly that some medications are more popular than others. Now, let’s fit the Apriori model.

7.3 Step 3 - training a model on the data

With the data in place, we can build the association rules using the arules::apriori() function.

myrules <- apriori(data=mydata, parameter=list(support=0.1, confidence=0.8, minlen=1))

  • data: a sparse matrix created by read.transactions().
  • support: minimum threshold for support.
  • confidence: minimum threshold for confidence.
  • minlen: minimum required rule items (in our case, medications).

Setting these thresholds can be tricky. You don’t want them too high, which yields no rules or only rules everyone already knows, nor too low, which yields an unmanageably large number of rules. Let’s see what we get under the default setting support=0.1, confidence=0.8:

apriori(med)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 52 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[88 item(s), 528 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 0 rules

Not surprisingly, we get 0 rules; the default thresholds are too high for these data. In practice, we might need some time to fine-tune them, which may require certain familiarity with the underlying process or clinical phenomenon.

In this case study, we set support=0.01 and confidence=0.25. This requires rules that have appeared for at least 1% of the head and neck cancer patients in the study. Also, the rules have to be at least 25% accurate. Moreover, minlen=2 is a very helpful option because it removes all rules that have fewer than two items.

med_rule <- apriori(med, parameter=list(support=0.01, confidence=0.25, minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 5 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[88 item(s), 528 transaction(s)] done [0.00s].
## sorting and recoding items ... [16 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
med_rule
## set of 29 rules

The result suggests we have a new med_rule object consisting of 29 rules.

7.4 Step 4 - evaluating model performance

First, we can obtain the overall summary of this set of rules.

summary(med_rule)
## set of 29 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4 
## 13 12  4 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.00    3.00    2.69    3.00    4.00 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01136   Min.   :0.2500   Min.   :0.01894   Min.   :0.7583  
##  1st Qu.:0.01705   1st Qu.:0.3390   1st Qu.:0.03788   1st Qu.:1.3333  
##  Median :0.01894   Median :0.4444   Median :0.06250   Median :1.7481  
##  Mean   :0.03448   Mean   :0.4491   Mean   :0.08392   Mean   :1.8636  
##  3rd Qu.:0.03788   3rd Qu.:0.5000   3rd Qu.:0.08902   3rd Qu.:2.2564  
##  Max.   :0.11174   Max.   :0.8000   Max.   :0.31250   Max.   :3.9111  
##      count      
##  Min.   : 6.00  
##  1st Qu.: 9.00  
##  Median :10.00  
##  Mean   :18.21  
##  3rd Qu.:20.00  
##  Max.   :59.00  
## 
## mining info:
##  data ntransactions support confidence
##   med           528    0.01       0.25

We have 13 rules that contain two items, 12 rules that contain three items, and the remaining 4 rules that contain four items.

The lift column shows how much more likely one medicine is to be prescribed to a patient given that another medicine is already prescribed. It is obtained by the following formula: \[lift(X\rightarrow Y)=\frac{confidence(X\rightarrow Y)}{support(Y)}.\] Note that \(lift(X\rightarrow Y)\) is the same as \(lift(Y\rightarrow X)\). The range of \(lift\) is \([0,\infty)\); \(lift=1\) indicates that \(X\) and \(Y\) are independent, and larger values indicate stronger positive associations. We don’t need to worry about the support, since we already set a threshold that the support must exceed.

Using the arulesViz package we can visualize the confidence and support scatter plots for all the rules.

# install.packages("arulesViz")
library(arulesViz)
# plot(sort(med_rule))

sortedRule <- sort(med_rule)

x1   <- sortedRule@quality$support
y1   <- sortedRule@quality$confidence
z1   <- sortedRule@quality$lift
col1 <- sortedRule@quality$count
ruleNames <- paste0("Rule", c(1:length(sortedRule@quality$support)))

plot_ly(x = ~x1, y = ~y1, z = ~z1, color = ~z1, name=ruleNames) %>% 
  add_markers() %>% 
  layout(title=paste0("Arule Support-Confidence-Lift Plot (for all ", length(sortedRule@quality$support), " rules)"),
         scene = list(xaxis = list(title = 'Support'),
                     yaxis = list(title = 'Confidence'),
                     zaxis = list(title = 'Lift'))) %>% 
  hide_colorbar()

Again, we can utilize the inspect() function to see exactly what these rules are.

inspect(med_rule[1:3])
##     lhs                               rhs                 support    confidence
## [1] {acetaminophen uh}             => {cefazolin ivpb uh} 0.01136364 0.4615385 
## [2] {ampicillin sulbactam ivpb uh} => {heparin injection} 0.01893939 0.3448276 
## [3] {ondansetron injection uh}     => {heparin injection} 0.01704545 0.2727273 
##     coverage   lift     count
## [1] 0.02462121 2.256410  6   
## [2] 0.05492424 1.733990 10   
## [3] 0.06250000 1.371429  9

Here, lhs and rhs refer to “left hand side” and “right hand side” of the rule, respectively. lhs is the given condition and rhs is the predicted result. Using the first row as an example: If a head-and-neck patient has been prescribed acetaminophen (pain reliever and fever reducer), it is likely that the patient is also prescribed cefazolin (antibiotic prescribed for treatment of resistant bacterial infections); bacterial infections are associated with fevers and some cancers.
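
As a quick sanity check, we can reproduce the confidence and lift of the first rule by hand, using counts we have already seen (a joint count of 6, a coverage count of 13 for acetaminophen uh, and 108 occurrences of cefazolin ivpb uh in the earlier summary(med) output).

suppXY <- 6/528      # support of {acetaminophen uh, cefazolin ivpb uh}
suppX  <- 13/528     # coverage = support of the lhs {acetaminophen uh}
suppY  <- 108/528    # support of the rhs {cefazolin ivpb uh}
suppXY / suppX               # confidence = 0.4615385
(suppXY / suppX) / suppY     # lift = 2.25641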

7.5 Step 5 - sorting the set of association rules

Sorting the association rules by lift will help us identify the most useful rules.

inspect(sort(med_rule, by="lift")[1:3])
##     lhs                                      rhs                    support confidence   coverage     lift count
## [1] {fentanyl injection uh,                                                                                     
##      heparin injection,                                                                                         
##      hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh} 0.01515152  0.8000000 0.01893939 3.911111     8
## [2] {cefazolin ivpb uh,                                                                                         
##      fentanyl injection uh,                                                                                     
##      hydrocodone acetaminophen 5mg 325mg} => {heparin injection} 0.01515152  0.6153846 0.02462121 3.094505     8
## [3] {heparin injection,                                                                                         
##      hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh} 0.03787879  0.6250000 0.06060606 3.055556    20

These rules may need to be interpreted by clinicians and experts in the specific context of the study. For instance, the first row, {fentanyl, heparin, hydrocodone acetaminophen} implies {cefazolin}. Fentanyl and hydrocodone acetaminophen are both pain relievers that may be prescribed after surgery to relieve moderate to severe pain, combining a narcotic opioid pain reliever (hydrocodone) with a non-opioid pain reliever (acetaminophen). Heparin is usually used before surgery to reduce the risk of blood clots. This rule may suggest that patients who have undergone surgical treatments are likely to need cefazolin to prevent post-surgical bacterial infection. Cefazolin is an antibiotic used for the treatment of bacterial infections and also to prevent group B streptococcal disease around the time of delivery and before general surgery.

7.6 Step 6 - taking subsets of association rules

If we are more interested in investigating associations that are linked to a specific medicine, we can narrow the rules down by making subsets. Let us try investigating rules related to fentanyl, since it appears to be the most frequently prescribed medicine. Fentanyl is used in the management of chronic cancer pain.

fi_rules<-subset(med_rule, items %in% "fentanyl injection uh")
inspect(fi_rules)
##      lhs                                      rhs                                      support confidence   coverage      lift count
## [1]  {ondansetron injection uh}            => {fentanyl injection uh}               0.01893939  0.3030303 0.06250000 0.7582938    10
## [2]  {fentanyl injection uh,                                                                                                        
##       ondansetron injection uh}            => {hydrocodone acetaminophen 5mg 325mg} 0.01136364  0.6000000 0.01893939 1.9200000     6
## [3]  {hydrocodone acetaminophen 5mg 325mg,                                                                                          
##       ondansetron injection uh}            => {fentanyl injection uh}               0.01136364  0.3750000 0.03030303 0.9383886     6
## [4]  {cefazolin ivpb uh,                                                                                                            
##       fentanyl injection uh}               => {heparin injection}                   0.01893939  0.5000000 0.03787879 2.5142857    10
## [5]  {fentanyl injection uh,                                                                                                        
##       heparin injection}                   => {cefazolin ivpb uh}                   0.01893939  0.4761905 0.03977273 2.3280423    10
## [6]  {cefazolin ivpb uh,                                                                                                            
##       fentanyl injection uh}               => {hydrocodone acetaminophen 5mg 325mg} 0.02462121  0.6500000 0.03787879 2.0800000    13
## [7]  {fentanyl injection uh,                                                                                                        
##       hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh}                   0.02462121  0.3250000 0.07575758 1.5888889    13
## [8]  {fentanyl injection uh,                                                                                                        
##       heparin injection}                   => {hydrocodone acetaminophen 5mg 325mg} 0.01893939  0.4761905 0.03977273 1.5238095    10
## [9]  {heparin injection,                                                                                                            
##       hydrocodone acetaminophen 5mg 325mg} => {fentanyl injection uh}               0.01893939  0.3125000 0.06060606 0.7819905    10
## [10] {fentanyl injection uh,                                                                                                        
##       hydrocodone acetaminophen 5mg 325mg} => {heparin injection}                   0.01893939  0.2500000 0.07575758 1.2571429    10
## [11] {cefazolin ivpb uh,                                                                                                            
##       fentanyl injection uh,                                                                                                        
##       heparin injection}                   => {hydrocodone acetaminophen 5mg 325mg} 0.01515152  0.8000000 0.01893939 2.5600000     8
## [12] {cefazolin ivpb uh,                                                                                                            
##       heparin injection,                                                                                                            
##       hydrocodone acetaminophen 5mg 325mg} => {fentanyl injection uh}               0.01515152  0.4000000 0.03787879 1.0009479     8
## [13] {cefazolin ivpb uh,                                                                                                            
##       fentanyl injection uh,                                                                                                        
##       hydrocodone acetaminophen 5mg 325mg} => {heparin injection}                   0.01515152  0.6153846 0.02462121 3.0945055     8
## [14] {fentanyl injection uh,                                                                                                        
##       heparin injection,                                                                                                            
##       hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh}                   0.01515152  0.8000000 0.01893939 3.9111111     8

Earlier, we saw that %in% is a simple, intuitive interface to match(), used as a binary operator returning a logical vector that indicates whether there is a match (TRUE) or not (FALSE) for its left operand within the right (table) operand. In total, there are 14 rules related to this item. Let’s plot them.
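
Besides %in%, the arules package also provides the %ain% (match all) and %pin% (partial, substring matching) subset operators; for example:

# rules whose items contain BOTH fentanyl and heparin
inspect(subset(med_rule, items %ain% c("fentanyl injection uh", "heparin injection")))
# rules whose item labels partially match the string "hydrocodone"
inspect(subset(med_rule, items %pin% "hydrocodone"))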

7.7 Graphical depiction of association rules

# ?arulesViz::plot()
# plot(sort(fi_rules, by="lift"), method="grouped", control=list(type="items"), 
#      main = "Grouped Matrix for the 14 Fentanyl-associated Rules")

# plot(sort(fi_rules, by="lift"), method = "graph", engine = "plotly")
# plot(sort(fi_rules, by="lift"), method = "graph", engine = "igraph")
# plot(sort(fi_rules, by="lift"), method = "graph", engine = "paracoord")

plot(sort(fi_rules, by="lift"), method = "graph", engine = "htmlwidget",
     control=list(main = list(title="Grouped Matrix for the 14 Fentanyl-associated Rules")))
## Available control parameters (with default values):
## itemCol   =  #CBD2FC
## nodeCol   =  c("#EE0000", "#EE0303", "#EE0606", "#EE0909", "#EE0C0C", "#EE0F0F", "#EE1212", "#EE1515", "#EE1818", "#EE1B1B", "#EE1E1E", "#EE2222", "#EE2525", "#EE2828", "#EE2B2B", "#EE2E2E", "#EE3131", "#EE3434", "#EE3737", "#EE3A3A", "#EE3D3D", "#EE4040", "#EE4444", "#EE4747", "#EE4A4A", "#EE4D4D", "#EE5050", "#EE5353", "#EE5656", "#EE5959", "#EE5C5C", "#EE5F5F", "#EE6262", "#EE6666", "#EE6969", "#EE6C6C", "#EE6F6F", "#EE7272", "#EE7575", "#EE7878", "#EE7B7B", "#EE7E7E", "#EE8181", "#EE8484", "#EE8888", "#EE8B8B",  "#EE8E8E", "#EE9191", "#EE9494", "#EE9797", "#EE9999", "#EE9B9B", "#EE9D9D", "#EE9F9F", "#EEA0A0", "#EEA2A2", "#EEA4A4", "#EEA5A5", "#EEA7A7", "#EEA9A9", "#EEABAB", "#EEACAC", "#EEAEAE", "#EEB0B0", "#EEB1B1", "#EEB3B3", "#EEB5B5", "#EEB7B7", "#EEB8B8", "#EEBABA", "#EEBCBC", "#EEBDBD", "#EEBFBF", "#EEC1C1", "#EEC3C3", "#EEC4C4", "#EEC6C6", "#EEC8C8", "#EEC9C9", "#EECBCB", "#EECDCD", "#EECFCF", "#EED0D0", "#EED2D2", "#EED4D4", "#EED5D5", "#EED7D7", "#EED9D9", "#EEDBDB", "#EEDCDC", "#EEDEDE", "#EEE0E0",  "#EEE1E1", "#EEE3E3", "#EEE5E5", "#EEE7E7", "#EEE8E8", "#EEEAEA", "#EEECEC", "#EEEEEE")
## precision     =  3
## igraphLayout  =  layout_nicely
## interactive   =  TRUE
## engine    =  visNetwork
## max   =  100
## selection_menu    =  TRUE
## degree_highlight  =  1
## verbose   =  FALSE
subrules2 <- sample(subset(fi_rules, lift > 2), 5)
plot(sort(subrules2, by="lift"), method="grouped", control=list(type="items"), engine = "htmlwidget")
## Available control parameters (with default values):
## k     =  20
## aggr.fun  =  function (x, ...)  UseMethod("mean")
## rhs_max   =  10
## lhs_label_items   =  2
## col   =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## verbose   =  FALSE
plot(sort(fi_rules, by="lift"), method="grouped",  k = 7, control=list(type="items"), engine = "htmlwidget")
## Available control parameters (with default values):
## k     =  20
## aggr.fun  =  function (x, ...)  UseMethod("mean")
## rhs_max   =  10
## lhs_label_items   =  2
## col   =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## verbose   =  FALSE
m <- rules2matrix(sort(fi_rules, by="lift")[1:10], measure = "lift")
#  m
plot(fi_rules[1:10], method = "matrix", engine = "htmlwidget")
## Grouped matrix
# create a matrix with LHSs grouped in k = 10 groups
m <- rules2groupedMatrix(sort(fi_rules, by="lift")[1:10], k = 10)
# m$m
# number of rules per group
# table(m$clustering_rules)
# get rules for group 1
inspect(fi_rules[m$clustering_rules == 1])
##     lhs                           rhs                                      support confidence   coverage      lift count
## [1] {ondansetron injection uh} => {fentanyl injection uh}               0.01893939  0.3030303 0.06250000 0.7582938    10
## [2] {cefazolin ivpb uh,                                                                                                 
##      fentanyl injection uh,                                                                                             
##      heparin injection}        => {hydrocodone acetaminophen 5mg 325mg} 0.01515152  0.8000000 0.01893939 2.5600000     8
# the corresponding plot
plot(fi_rules, method = "grouped matrix", k = 7, engine = "htmlwidget")
# plot(fi_rules, method = "grouped matrix", k = 7)

#### For interactive runs try the RShinyApp:
# ruleExplorer(fi_rules)
plot(fi_rules, method="graph", measure = "support", engine="htmlwidget", # nodeCol=rainbow(14),
     shading = "lift", control = list(verbose = TRUE))
## Used control parameters:
## itemCol   =  #CBD2FC
## nodeCol   =  c("#EE0000", "#EE0303", "#EE0606", "#EE0909", "#EE0C0C", "#EE0F0F", "#EE1212", "#EE1515", "#EE1818", "#EE1B1B", "#EE1E1E", "#EE2222", "#EE2525", "#EE2828", "#EE2B2B", "#EE2E2E", "#EE3131", "#EE3434", "#EE3737", "#EE3A3A", "#EE3D3D", "#EE4040", "#EE4444", "#EE4747", "#EE4A4A", "#EE4D4D", "#EE5050", "#EE5353", "#EE5656", "#EE5959", "#EE5C5C", "#EE5F5F", "#EE6262", "#EE6666", "#EE6969", "#EE6C6C", "#EE6F6F", "#EE7272", "#EE7575", "#EE7878", "#EE7B7B", "#EE7E7E", "#EE8181", "#EE8484", "#EE8888", "#EE8B8B",  "#EE8E8E", "#EE9191", "#EE9494", "#EE9797", "#EE9999", "#EE9B9B", "#EE9D9D", "#EE9F9F", "#EEA0A0", "#EEA2A2", "#EEA4A4", "#EEA5A5", "#EEA7A7", "#EEA9A9", "#EEABAB", "#EEACAC", "#EEAEAE", "#EEB0B0", "#EEB1B1", "#EEB3B3", "#EEB5B5", "#EEB7B7", "#EEB8B8", "#EEBABA", "#EEBCBC", "#EEBDBD", "#EEBFBF", "#EEC1C1", "#EEC3C3", "#EEC4C4", "#EEC6C6", "#EEC8C8", "#EEC9C9", "#EECBCB", "#EECDCD", "#EECFCF", "#EED0D0", "#EED2D2", "#EED4D4", "#EED5D5", "#EED7D7", "#EED9D9", "#EEDBDB", "#EEDCDC", "#EEDEDE", "#EEE0E0",  "#EEE1E1", "#EEE3E3", "#EEE5E5", "#EEE7E7", "#EEE8E8", "#EEEAEA", "#EEECEC", "#EEEEEE")
## precision     =  3
## igraphLayout  =  layout_nicely
## interactive   =  TRUE
## engine    =  htmlwidget
## max   =  100
## selection_menu    =  TRUE
## degree_highlight  =  1
## verbose   =  TRUE

7.8 Saving association rules to a file or data frame

We can save these rules into a CSV file using write(). It is similar to the function write.csv() that we mentioned at the beginning of this case study.

write(med_rule, file = "medrule.csv", sep=",", row.names=F)

Sometimes it is more convenient to convert the rules into a data frame.

med_df <- as(med_rule, "data.frame")
str(med_df)
## 'data.frame':    29 obs. of  6 variables:
##  $ rules     : chr  "{acetaminophen uh} => {cefazolin ivpb uh}" "{ampicillin sulbactam ivpb uh} => {heparin injection}" "{ondansetron injection uh} => {heparin injection}" "{ondansetron injection uh} => {fentanyl injection uh}" ...
##  $ support   : num  0.0114 0.0189 0.017 0.0189 0.0303 ...
##  $ confidence: num  0.462 0.345 0.273 0.303 0.485 ...
##  $ coverage  : num  0.0246 0.0549 0.0625 0.0625 0.0625 ...
##  $ lift      : num  2.256 1.734 1.371 0.758 1.552 ...
##  $ count     : int  6 10 9 10 16 13 21 17 48 48 ...

As we can see, the rules are converted into a data frame, with the rules themselves stored as a character vector alongside their quality measures. Finally, remember that matrices and data-frames can be converted to the arules transactions format using arulesObject <- as(input.df, "transactions").
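
For instance, here is a minimal sketch of this coercion (with hypothetical data); note that all columns must be factors (or logicals) before converting a data frame to transactions.

input.df <- data.frame(drugA = factor(c("yes", "no", "yes")),
                       drugB = factor(c("yes", "yes", "no")))
arulesObject <- as(input.df, "transactions")
inspect(arulesObject)   # each row becomes an item-set, e.g., {drugA=yes, drugB=yes}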

8 Practice Problems

8.1 Groceries

In this practice problem, we will investigate the associations of frequently purchased groceries using the Groceries dataset included in the arules package. First, let’s load the data.

data("Groceries")
summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

We will identify the most frequently purchased grocery items and plot them (the top 10).

# itemFrequencyPlot(Groceries, topN=5)
itemFrequencyPlotly(Groceries, 10, "groceries")

Then, use support = 0.006, confidence = 0.25, minlen = 2 to set up the grocery association rules, and sort to display the top 3 rules with the highest lift.

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 463 rules
##     lhs                   rhs                      support confidence   coverage     lift count
## [1] {herbs}            => {root vegetables}    0.007015760  0.4312500 0.01626843 3.956477    69
## [2] {berries}          => {whipped/sour cream} 0.009049314  0.2721713 0.03324860 3.796886    89
## [3] {tropical fruit,                                                                           
##      other vegetables,                                                                         
##      whole milk}       => {root vegetables}    0.007015760  0.4107143 0.01708185 3.768074    69

The number of rules (\(463\)) appears excessive. We can try more stringent parameters. In practice, raising the confidence threshold makes it more likely that the reported rules reflect real underlying patterns. Here we set \(confidence=0.6\).

groceryrules <- apriori(Groceries, parameter = list(support = 0.006, confidence = 0.6, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
groceryrules
## set of 8 rules
inspect(sort(groceryrules, by = "lift")[1:3])
##     lhs                            rhs          support     confidence
## [1] {butter,whipped/sour cream} => {whole milk} 0.006710727 0.6600000 
## [2] {butter,yogurt}             => {whole milk} 0.009354347 0.6388889 
## [3] {root vegetables,butter}    => {whole milk} 0.008235892 0.6377953 
##     coverage   lift     count
## [1] 0.01016777 2.583008 66   
## [2] 0.01464159 2.500387 92   
## [3] 0.01291307 2.496107 81

We observe mainly rules between dairy products. It makes sense that customers pick up milk when they walk down the dairy products aisle. Experiment further with various parameter settings and try to interpret the results in the context of this grocery case-study.

8.2 Titanic Passengers

Next we’ll use the Titanic Passengers Dataset. Let’s start by loading the data.

dat <- read.csv("https://umich.instructure.com/files/9372716/download?download_frd=1")

# Choose only the key data features
dat <- dat[ , c("pclass", "survived", "sex", "age", "fare", "cabin")]

# Factorize categorical features
dat$pclass <- as.factor(dat$pclass)
dat$survived <- factor(dat$survived, levels = c("0", "1"))
dat$sex <- as.factor(dat$sex)

# Convert the Cabin number to a character A-F, Z=missing cabin ID
dat$cabin <- substring(dat$cabin,1,1)
for (i in 1:length(dat$cabin))
     if ((dat$cabin[i]=="")) dat$cabin[i] <- "Z"
dat$cabin <- as.factor(dat$cabin)

# Convert the Ticket Fare from numeric to categorical label
f <- as.character(dat$fare)
for (i in 1:length(dat$fare)) {
  if (is.na(dat$fare[i])) f[i] <- as.character("low")
  else if (dat$fare[i]<50) f[i] <- as.character("low") 
  else if (50<=dat$fare[i] && dat$fare[i]<100) f[i] <- as.character("medium") 
  else if (100<=dat$fare[i] && dat$fare[i]<200)  f[i] <- as.character("high")
  else f[i] <- as.character("extreme")   #  if (200<=dat$fare[i]) 
}
dat$fare <- as.factor(f)
table(as.factor(dat$fare))
## 
## extreme    high     low  medium 
##      38      46    1067     158
# Convert Age from numeric to categorical (Decade-of-life) label
f <- as.character(dat$age)
for (i in 1:length(dat$age)) {
  if (is.na(dat$age[i])) f[i] <- as.character("1")
  else {
    a = 1 + dat$age[i] %/% 10  # integer division by 10 (per decade of life)
    f[i] <- as.character(a)
  } 
}
dat$age <- as.factor(f)
table(as.factor(dat$age))
## 
##   1   2   3   4   5   6   7   8   9 
## 345 143 344 232 135  70  32   7   1
str(dat)
## 'data.frame':    1309 obs. of  6 variables:
##  $ pclass  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ survived: Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 1 2 1 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age     : Factor w/ 9 levels "1","2","3","4",..: 3 1 1 4 3 5 7 4 6 8 ...
##  $ fare    : Factor w/ 4 levels "extreme","high",..: 1 2 2 2 2 3 4 3 4 3 ...
##  $ cabin   : Factor w/ 9 levels "A","B","C","D",..: 2 3 3 3 3 5 4 1 3 9 ...
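
As an aside, the fare discretization loop above can be written more concisely with cut(); this sketch reproduces the same four categories (NA fares map to low, matching the loop above).

fare_raw <- read.csv("https://umich.instructure.com/files/9372716/download?download_frd=1")$fare
fare_cat <- cut(ifelse(is.na(fare_raw), 0, fare_raw),
                breaks = c(-Inf, 50, 100, 200, Inf),
                labels = c("low", "medium", "high", "extreme"), right = FALSE)
table(fare_cat)   # same counts as the fare table above, with levels in a different order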

Next, we can mine the association rules.

library(arules)

rules <- apriori(dat, parameter = list(minlen=5, supp=0.02, conf=0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.02      5
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 26 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[29 item(s), 1309 transaction(s)] done [0.00s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [186 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(rules[1:20])
##      lhs                                       rhs          support   
## [1]  {pclass=3,survived=0,sex=male,age=5}   => {cabin=Z}    0.02062643
## [2]  {pclass=3,sex=male,age=5,cabin=Z}      => {survived=0} 0.02062643
## [3]  {pclass=3,survived=0,sex=male,age=5}   => {fare=low}   0.02139037
## [4]  {pclass=3,sex=male,age=5,fare=low}     => {survived=0} 0.02139037
## [5]  {pclass=3,survived=0,age=5,cabin=Z}    => {fare=low}   0.02750191
## [6]  {pclass=3,survived=0,age=5,fare=low}   => {cabin=Z}    0.02750191
## [7]  {pclass=3,age=5,fare=low,cabin=Z}      => {survived=0} 0.02750191
## [8]  {pclass=3,sex=male,age=5,cabin=Z}      => {fare=low}   0.02215432
## [9]  {pclass=3,sex=male,age=5,fare=low}     => {cabin=Z}    0.02215432
## [10] {survived=0,sex=male,age=5,cabin=Z}    => {fare=low}   0.03743316
## [11] {survived=0,sex=male,age=5,fare=low}   => {cabin=Z}    0.03743316
## [12] {survived=0,age=5,fare=low,cabin=Z}    => {sex=male}   0.03743316
## [13] {sex=male,age=5,fare=low,cabin=Z}      => {survived=0} 0.03743316
## [14] {survived=1,sex=female,age=2,cabin=Z}  => {fare=low}   0.02368220
## [15] {survived=1,sex=female,age=2,fare=low} => {cabin=Z}    0.02368220
## [16] {survived=1,age=2,fare=low,cabin=Z}    => {sex=female} 0.02368220
## [17] {pclass=3,sex=female,age=2,cabin=Z}    => {fare=low}   0.02750191
## [18] {pclass=3,sex=female,age=2,fare=low}   => {cabin=Z}    0.02750191
## [19] {pclass=3,survived=0,sex=male,age=2}   => {cabin=Z}    0.03819710
## [20] {pclass=3,sex=male,age=2,cabin=Z}      => {survived=0} 0.03819710
##      confidence coverage   lift     count
## [1]  0.9642857  0.02139037 1.244822 27   
## [2]  0.9310345  0.02215432 1.506458 27   
## [3]  1.0000000  0.02139037 1.226804 28   
## [4]  0.9333333  0.02291826 1.510177 28   
## [5]  1.0000000  0.02750191 1.226804 36   
## [6]  0.9729730  0.02826585 1.256037 36   
## [7]  0.8780488  0.03132162 1.420724 36   
## [8]  1.0000000  0.02215432 1.226804 29   
## [9]  0.9666667  0.02291826 1.247896 29   
## [10] 0.9423077  0.03972498 1.156027 49   
## [11] 0.8166667  0.04583652 1.054257 49   
## [12] 0.8305085  0.04507257 1.289603 49   
## [13] 0.9245283  0.04048892 1.495930 49   
## [14] 1.0000000  0.02368220 1.226804 31   
## [15] 0.8857143  0.02673797 1.143393 31   
## [16] 0.8378378  0.02826585 2.353497 31   
## [17] 1.0000000  0.02750191 1.226804 36   
## [18] 1.0000000  0.02750191 1.290927 36   
## [19] 0.9803922  0.03896104 1.265615 50   
## [20] 0.9090909  0.04201681 1.470952 50

We can focus on the binary survival outcome (Survived=1/0).

# examine the rules with containing "Survived=1" in the RHS
rules <- apriori(dat,
  parameter = list(minlen=3, supp=0.02, conf=0.7),
  appearance = list(rhs="survived=1", default="lhs"), control = list(verbose=F))
rules.sorted <- sort(rules, by="lift")
inspect(head(rules.sorted, 30))
##      lhs                                       rhs          support   
## [1]  {sex=female,cabin=B}                   => {survived=1} 0.02750191
## [2]  {pclass=1,sex=female,cabin=B}          => {survived=1} 0.02750191
## [3]  {pclass=1,sex=female,fare=medium}      => {survived=1} 0.05271199
## [4]  {pclass=1,sex=female,age=4}            => {survived=1} 0.02826585
## [5]  {pclass=1,sex=female}                  => {survived=1} 0.10618793
## [6]  {sex=female,fare=medium}               => {survived=1} 0.05500382
## [7]  {sex=female,fare=high}                 => {survived=1} 0.02062643
## [8]  {pclass=1,sex=female,fare=high}        => {survived=1} 0.02062643
## [9]  {sex=female,cabin=C}                   => {survived=1} 0.03208556
## [10] {pclass=1,sex=female,cabin=C}          => {survived=1} 0.03208556
## [11] {pclass=2,sex=female}                  => {survived=1} 0.07181054
## [12] {pclass=2,sex=female,fare=low}         => {survived=1} 0.06951872
## [13] {pclass=2,sex=female,cabin=Z}          => {survived=1} 0.06264324
## [14] {pclass=2,sex=female,fare=low,cabin=Z} => {survived=1} 0.06035141
## [15] {pclass=2,sex=female,age=3}            => {survived=1} 0.02521008
## [16] {pclass=2,sex=female,age=3,fare=low}   => {survived=1} 0.02368220
## [17] {pclass=2,sex=female,age=3,cabin=Z}    => {survived=1} 0.02139037
## [18] {sex=female,age=4}                     => {survived=1} 0.05194805
## [19] {sex=female,age=5}                     => {survived=1} 0.02750191
## [20] {pclass=1,fare=high}                   => {survived=1} 0.02597403
## [21] {sex=female,age=2}                     => {survived=1} 0.03590527
## [22] {pclass=1,cabin=B}                     => {survived=1} 0.03590527
## [23] {sex=female,age=3}                     => {survived=1} 0.06264324
## [24] {pclass=1,age=3}                       => {survived=1} 0.02826585
## [25] {pclass=1,age=4}                       => {survived=1} 0.03896104
## [26] {pclass=1,fare=medium}                 => {survived=1} 0.06799083
## [27] {pclass=1,cabin=D}                     => {survived=1} 0.02139037
##      confidence coverage   lift     count
## [1]  1.0000000  0.02750191 2.618000  36  
## [2]  1.0000000  0.02750191 2.618000  36  
## [3]  1.0000000  0.05271199 2.618000  69  
## [4]  0.9736842  0.02902979 2.549105  37  
## [5]  0.9652778  0.11000764 2.527097 139  
## [6]  0.9350649  0.05882353 2.448000  72  
## [7]  0.9310345  0.02215432 2.437448  27  
## [8]  0.9310345  0.02215432 2.437448  27  
## [9]  0.9130435  0.03514133 2.390348  42  
## [10] 0.9130435  0.03514133 2.390348  42  
## [11] 0.8867925  0.08097785 2.321623  94  
## [12] 0.8834951  0.07868602 2.312990  91  
## [13] 0.8817204  0.07104660 2.308344  82  
## [14] 0.8777778  0.06875477 2.298022  79  
## [15] 0.8684211  0.02902979 2.273526  33  
## [16] 0.8611111  0.02750191 2.254389  31  
## [17] 0.8484848  0.02521008 2.221333  28  
## [18] 0.7906977  0.06569901 2.070047  68  
## [19] 0.7826087  0.03514133 2.048870  36  
## [20] 0.7391304  0.03514133 1.935043  34  
## [21] 0.7343750  0.04889228 1.922594  47  
## [22] 0.7230769  0.04965623 1.893015  47  
## [23] 0.7130435  0.08785332 1.866748  82  
## [24] 0.7115385  0.03972498 1.862808  37  
## [25] 0.7083333  0.05500382 1.854417  51  
## [26] 0.7007874  0.09702063 1.834661  89  
## [27] 0.7000000  0.03055768 1.832600  28

Next, we prune redundant association rules, i.e., rules that provide no extra knowledge because a more general rule with the same or higher lift already covers them. Pruning reduces the number of rules from 27 to 18.

# search for redundant rules
rules_lift <- sort(rules, by = 'lift')
rules_pruned <- rules_lift[!is.redundant(rules_lift, measure="lift")]
inspect(head(rules_pruned, 30))
##      lhs                                  rhs          support    confidence
## [1]  {sex=female,cabin=B}              => {survived=1} 0.02750191 1.0000000 
## [2]  {pclass=1,sex=female,fare=medium} => {survived=1} 0.05271199 1.0000000 
## [3]  {pclass=1,sex=female,age=4}       => {survived=1} 0.02826585 0.9736842 
## [4]  {pclass=1,sex=female}             => {survived=1} 0.10618793 0.9652778 
## [5]  {sex=female,fare=medium}          => {survived=1} 0.05500382 0.9350649 
## [6]  {sex=female,fare=high}            => {survived=1} 0.02062643 0.9310345 
## [7]  {sex=female,cabin=C}              => {survived=1} 0.03208556 0.9130435 
## [8]  {pclass=2,sex=female}             => {survived=1} 0.07181054 0.8867925 
## [9]  {sex=female,age=4}                => {survived=1} 0.05194805 0.7906977 
## [10] {sex=female,age=5}                => {survived=1} 0.02750191 0.7826087 
## [11] {pclass=1,fare=high}              => {survived=1} 0.02597403 0.7391304 
## [12] {sex=female,age=2}                => {survived=1} 0.03590527 0.7343750 
## [13] {pclass=1,cabin=B}                => {survived=1} 0.03590527 0.7230769 
## [14] {sex=female,age=3}                => {survived=1} 0.06264324 0.7130435 
## [15] {pclass=1,age=3}                  => {survived=1} 0.02826585 0.7115385 
## [16] {pclass=1,age=4}                  => {survived=1} 0.03896104 0.7083333 
## [17] {pclass=1,fare=medium}            => {survived=1} 0.06799083 0.7007874 
## [18] {pclass=1,cabin=D}                => {survived=1} 0.02139037 0.7000000 
##      coverage   lift     count
## [1]  0.02750191 2.618000  36  
## [2]  0.05271199 2.618000  69  
## [3]  0.02902979 2.549105  37  
## [4]  0.11000764 2.527097 139  
## [5]  0.05882353 2.448000  72  
## [6]  0.02215432 2.437448  27  
## [7]  0.03514133 2.390348  42  
## [8]  0.08097785 2.321623  94  
## [9]  0.06569901 2.070047  68  
## [10] 0.03514133 2.048870  36  
## [11] 0.03514133 1.935043  34  
## [12] 0.04889228 1.922594  47  
## [13] 0.04965623 1.893015  47  
## [14] 0.08785332 1.866748  82  
## [15] 0.03972498 1.862808  37  
## [16] 0.05500382 1.854417  51  
## [17] 0.09702063 1.834661  89  
## [18] 0.03055768 1.832600  28

The package arulesViz supplies rule-visualization routines using scatter, bubble, and parallel coordinates plots. The visualization of frequent itemsets using parallel coordinates places all items on the vertical axis, with each item’s position determined by its group and by its frequency in descending order (the vertical ranking is based on the support of the 1-itemset containing the item), see this paper. To render the maximal frequent itemsets, all subsets are implicitly drawn as sub-segments of one polyline. All itemsets are visualized as parallel coordinate polylines, and the support can be mapped to line color or width. Itemsets that share common parts may have overlapping polylines.

library(arulesViz)
plot(rules)

# scatter plot of association rules
plot(rules_pruned, method="graph", control=list(verbose = FALSE), engine="htmlwidget")
# display association rules
# plot(rules_pruned, method="paracoord", control=list(reorder=TRUE, verbose = FALSE), engine="htmlwidget")

# interactive association network graph
plot(rules_pruned, method="graph", measure = "support", engine="htmlwidget", nodeCol=rainbow(14),
     shading = "lift", control = list(verbose = FALSE))
# m <- rules2matrix(sort(rules_pruned, by="lift")[1:10], measure = "lift")
# m
# plot(rules_pruned, method = "matrix", engine = "htmlwidget")
## Grouped matrix
# create a matrix with LHSs grouped in k = 10 groups; m is used by inspect() below
m <- rules2groupedMatrix(sort(rules_pruned, by="lift")[1:10], k = 10)
# m$m
# number of rules per group
# table(m$clustering_rules)
# get rules for group 1
inspect(rules_pruned[m$clustering_rules == 1])
##     lhs                     rhs          support    confidence coverage  
## [1] {sex=female,cabin=B} => {survived=1} 0.02750191 1.0000000  0.02750191
## [2] {pclass=1,fare=high} => {survived=1} 0.02597403 0.7391304  0.03514133
##     lift     count
## [1] 2.618000 36   
## [2] 1.935043 34
# the corresponding plot
plot(rules_pruned, method = "grouped matrix", k = 7, engine = "htmlwidget")

Consider plotting all rules (not only the pruned list of association rules) and examining rules predicting passenger death (use "survived=0").

9 Summary

  • The Apriori algorithm for association rule learning is best suited to large transactional data; for some small datasets, it might not be very helpful.
  • It is useful for discovering associations, mostly in early phases of an exploratory study.
  • Some rules can be built due to chance and may need further verification.
  • See also Chapter 19 (Text Mining and NLP).

Try to replicate these results with other data from the list of our Case-Studies.
