#' --- #' title: "DSPA2: Data Science and Predictive Analytics (UMich HS650)" #' subtitle: "

Text Mining, Natural Language Processing, Apriori Association Rules Learning

" #' author: "

SOCR/MIDAS (Ivo Dinov)

" #' date: "`r format(Sys.time(), '%B %Y')`" #' tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics] #' output: #' html_document: #' theme: spacelab #' highlight: tango #' includes: #' before_body: SOCR_header.html #' toc: true #' number_sections: true #' toc_depth: 4 #' toc_float: #' collapsed: false #' smooth_scroll: true #' code_folding: show #' self_contained: yes #' --- #' #' As we have seen in the previous chapters, traditional statistical analyses and classical data modeling are applied to *relational data* where the observed information is represented by tables, vectors, arrays, tensors or data-frames containing binary, categorical, original, or numerical values. Such representations provide incredible advantages (e.g., quick reference and dereference of elements, search, discovery and navigation), but also limit the scope of applications. Relational data objects are quite effective for managing information that is based only on existing attributes. However, when data science inference needs to utilize attributes that are not included in the relational model, alternative non-relational representations are necessary. For instance, imagine that our data object includes qualitative data or free text (e.g., physician/nurse clinical notes, biospecimen samples) containing information about medical condition, treatment or outcome. It may be difficult, or sometimes even impossible, to include raw text in fully automated data analytic protocols solely using classical procedures and statistical models available for relational datasets. In this chapter, we will present ML/AI methods for qualitative and unstructured data. #' #' #' # Natural Language Processing (NLP) and Text Mining (TM) #' #' Natural Language Processing (NLP) and Text Mining (TM) refer to automated machine-driven algorithms for semantic mapping, information extraction, and understanding of (natural) human language. Sometimes, this involves deriving salient information from large amounts of unstructured text. To do so, we need to build semantic and syntactic mapping algorithms for effective processing of heavy text. Recall the related NLP/TM work we did in [Chapter 5](https://socr.umich.edu/DSPA2/DSPA2_notes/05_SupervisedClassification.html) demonstrating an interesting text classifier using the naïve Bayes algorithm. #' #' In this section, we will present details about various text processing strategies in `R`. Specifically, we will present simulated and real examples of text processing and computing document term frequency (TF), inverse document frequency (IDF), and cosine similarity transformation. #' #' ## A simple NLP/TM example #' #' Text mining or text analytics (TM/TA) examines large volumes of unstructured text (corpus) aiming to obtain interpretable information, discover context, identify linguistic motifs, or transform the text and derive quantitative data that can be further analyzed. Natural language processing (NLP) is one example of a TM analytical technique. Whereas TM's goal is to discover relevant contextual information, which may be unknown, hidden, or obfuscated, NLP is focused on linguistic analysis that trains a machine to interpret voluminous textual content. To decipher the semantics and ambiguities in human-interpretable language, NLP employs automatic summarization, tagging, disambiguation, extraction of entities and relations, pattern recognition and frequency analyses. The [United States International Trade Commission reports an International Data Corporation (IDC) Estimate](https://www.usitc.gov/publications/332/executive_briefings/ebot_data_centers_around_the_world.pdf) of the expected total amount of information by 2025 - a staggering 175 Zettabytes ($1ZB=10^{21}=2^{70}$ bytes). The amount of data we obtain and record [doubles every 12-14 months (Kryder's law)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4795481/). A small fraction of this massive information ($<0.0001\%$ or $<1PB=10^{15}$ bytes) represents newly written or transcribed text, including code. However, it is impossible (cf. efficiency, time, resources) for humans to read, synthesize, curate, interpret, and react to all this information without direct assistance of TM/NLP. The information content in text could be substantially higher than that of other information media. Remember that "a picture may be worth a thousand words", yet, "a word may also be worth a thousand pictures". As an example, the simple sentence *"The second edition of the data science and predictive analytics textbook includes 14 Chapters"* takes 93 bytes to store the text as a character array. However, a color image showing this sentence as a scale vector graphic or printed text could be much larger, e.g., 10 megabytes (MB), and a high definition video of a speaker reading the same sentence could easily surpass 50MB. Text mining and natural language processing may be used to automatically analyze and interpret written, coded or transcribed content to assess news, moods, emotions, and biosocial trends related to specific topics. #' #' In general, text analysis protocols involve: #' #' - Construction of a document-term matrix (DTM) from the input documents, vectorizing the text, e.g., creating a map of single words or `n-grams` into a vector space. In other words, we generate a *vectorizer function mapping terms to indices*. #' - Apply a model-based statistical analysis or a model-free machine learning technique for prediction, clustering, classification, similarity search, network/sentiment analysis, or forecasting using the DTM. This step also includes tuning and internally validating the performance of the method. #' - Apply and evaluate the technique to new data. #' #' *Define and load the unstructured-text documents* #' #' Let's create some documents we can use to demonstrate the use of the `tm` package to do text mining. The 5 documents below represent portions of the syllabi of 5 recent courses taught by [Ivo Dinov](https://www.umich.edu/~dinov): #' #' - [HS650: Data Science and Predictive Analytics (DSPA)](https://www.socr.umich.edu/people/dinov/2022/Fall/DSPA_HS650/) #' - [Bioinformatics 501 (Mathematical Foundations for Bioinformatics)](https://umich.instructure.com/courses/537305) #' - [HS 853: Scientific Methods for Health Sciences: Special Topics](https://www.socr.umich.edu/people/dinov/courses/HS853.html) #' - [HS851: Scientific Methods for Health Sciences: Applied Inference](https://www.socr.umich.edu/people/dinov/2014/Fall/HS851), and #' - [HS550: Scientific Methods for Health Sciences: Fundamentals](https://www.socr.umich.edu/people/dinov/courses/HS550.html). #' #' We import the syllabi into several separate segments represented as `documents`. We'll embed the course syllabi directly into `R`, however, using the `rvest::read_html()` method also facilitates loading in the five course syllabi directly from the corresponding course websites. #' #' doc1 <- "HS650: The Data Science and Predictive Analytics (DSPA) course (offered as a massive open online course, MOOC, as well as a traditional University of Michigan class) aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. It explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. Concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. Some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary. This open graduate course will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course." #' #' #' doc2 <- "Bioinformatics 501: The Mathematical Foundations for Bioinformatics course covers some of the fundamental mathematical techniques commonly used in bioinformatics and biomedical research. These include: 1) principles of multi-variable calculus, and complex numbers/functions, 2) foundations of linear algebra, such as linear spaces, eigen-values and vectors, singular value decomposition, spectral graph theory and Markov chains, 3) differential equations and their usage in biomedical system, which includes topic such as existence and uniqueness of solutions, two dimensional linear systems, bifurcations in one and two dimensional systems and cellular dynamics, and 4) optimization methods, such as free and constrained optimization, Lagrange multipliers, data denoising using optimization and heuristic methods. Demonstrations using MATLAB, R, and Python are included throughout the course." #' #' #' doc3 <- "HS 853: This course covers a number of modern analytical methods for advanced healthcare research. Specific focus will be on reviewing and using innovative modeling, computational, analytic and visualization techniques to address concrete driving biomedical and healthcare applications. The course will cover the 5 dimensions of Big-Data (volume, complexity, multiple scales, multiple sources, and incompleteness). HS853 is a 4 credit hour course (3 lectures + 1 lab/discussion). Students will learn how to conduct research, employ and report on recent advanced health sciences analytical methods; read, comprehend and present recent reports of innovative scientific methods; apply a broad range of health problems; and experiment with real Big-Data. Topics Covered include: Foundations of R, Scientific Visualization, Review of Multivariate and Mixed Linear Models, Causality/Causal Inference and Structural Equation Models, Generalized Estimating Equations, PCOR/CER methods Heterogeneity of Treatment Effects, Big-Data, Big-Science, Internal statistical cross-validation, Missing data, Genotype-Environment-Phenotype, associations, Variable selection (regularized regression and controlled/knockoff filtering), medical imaging, Databases/registries, Meta-analyses, classification methods, Longitudinal data and time-series analysis, Geographic Information Systems (GIS), Psychometrics and Rasch measurement model analysis, MCMC sampling for Bayesian inference, and Network Analysis" #' #' #' doc4 <- "HS 851: This course introduces students to applied inference methods in studies involving multiple variables. Specific methods that will be discussed include linear regression, analysis of variance, and different regression models. This course will emphasize the scientific formulation, analytical modeling, computational tools and applied statistical inference in diverse health-sciences problems. Data interrogation, modeling approaches, rigorous interpretation and inference will be emphasized throughout. HS851 is a 4 credit hour course (3 lectures + 1 lab/discussion). Students will learn how to: Understand the commonly used statistical methods of published scientific papers , Conduct statistical calculations/analyses on available data , Use software tools to analyze specific case-studies data , Communicate advanced statistical concepts/techniques , Determine, explain and interpret assumptions and limitations. Topics Covered include Epidemiology , Correlation/SLR , and slope inference, 1-2 samples , ROC Curve , ANOVA , Non-parametric inference , Cronbach's $\alpha$, Measurement Reliability/Validity , Survival Analysis , Decision theory , CLT/LLNs - limiting results and misconceptions , Association Tests , Bayesian Inference , PCA/ICA/Factor Analysis , Point/Interval Estimation (CI) - MoM, MLE , Instrument performance Evaluation , Study/Research Critiques , Common mistakes and misconceptions in using probability and statistics, identifying potential assumption violations, and avoiding them." #' #' #' doc5 <- "HS550: This course provides students with an introduction to probability reasoning and statistical inference. Students will learn theoretical concepts and apply analytic skills for collecting, managing, modeling, processing, interpreting and visualizing (mostly univariate) data. Students will learn the basic probability modeling and statistical analysis methods and acquire knowledge to read recently published health research publications. HS550 is a 4 credit hour course (3 lectures + 1 lab/discussion). Students will learn how to: Apply data management strategies to sample data files , Carry out statistical tests to answer common healthcare research questions using appropriate methods and software tools , Understand the core analytical data modeling techniques and their appropriate use Examples of Topics Covered , EDA/Charts , Ubiquitous variation , Parametric inference , Probability Theory , Odds Ratio/Relative Risk , Distributions , Exploratory data analysis , Resampling/Simulation , Design of Experiments , Intro to Epidemiology , Estimation , Hypothesis testing , Experiments vs. Observational studies , Data management (tables, streams, cloud, warehouses, DBs, arrays, binary, ASCII, handling, mechanics) , Power, sample-size, effect-size, sensitivity, specificity , Bias/Precision , Association vs. Causality , Rate-of-change , Clinical vs. Stat significance , Statistical Independence Bayesian Rule." #' #' #' #' *Create a new VCorpus object* #' #' The `VCorpus` object includes all the text and some meta-data (e.g., indexing) about the text. #' #' docs <- c(doc1, doc2, doc3, doc4, doc5) class(docs) #' #' #' Then let's make a `VCorpus` object using the `tm` package. To complete this task, we need to know the source type. Here `docs` has a vector with "character" class so we should use `VectorSource()`. If it is a dataframe, we should use `DataframeSource()` instead. `VCorpus()` creates a *volatile corpus*, which is the data type used by the `tm` package for text mining. #' #' library(tm) doc_corpus <- VCorpus(VectorSource(docs)) doc_corpus doc_corpus[[1]]$content #' #' #' This is a list that contains the information for the 5 documents we have created. Now we can apply the `tm_map()` function on this object to edit the text. Similarly to human semantic language understanding, the goal here is to algorithmically process the text and output (structured) quantitative information as a signature tensor representing the original (unstructured) text. #' #' *Transform text to lowercase* #' #' The text itself contains upper case letters as well as lower case letters. The first thing to do is to convert everything to lowercase. #' #' doc_corpus <- tm_map(doc_corpus, tolower) doc_corpus[[1]] #' #' #' *Text pre-processing* #' #' Removing Stopwords: These documents contain a lot of "stopwords" or common words that have important semantic meaning but low analytic value. We can remove these by the following command. #' #' stopwords("english") doc_corpus <- tm_map(doc_corpus, removeWords, stopwords("english")) doc_corpus[[1]] #' #' #' We removed all the stopwards included in the `stopwords("english")` list. You can always make your own stop-word list and just use `doc_corpus<-tm_map(doc_corpus, removeWords, your_own_words_list)` to apply this list. #' #' From the output of `doc1`, we notice that the removal of stopwords may create extra blank spaces. Thus, the next step would be to remove them. #' #' doc_corpus <- tm_map(doc_corpus, stripWhitespace) doc_corpus[[1]] #' #' #' Remove punctuation: Now we notice the irrelevant punctuation in the text, which can be removed by using a combination of `tm_map()` and `removePunctuation()` functions. #' #' doc_corpus <- tm_map(doc_corpus, removePunctuation) doc_corpus[[2]] #' #' #' The above `tm_map` commands changed the structure of our `doc_corpus` object. We can apply the `PlainTextDocument` function to convert it back to the original format. #' #' doc_corpus <- tm_map(doc_corpus, PlainTextDocument) #' #' #' Stemming: Removal of plurals and action suffixes. Let's inspect the first three documents. We notice that there are some words ending with "ing", "es", "s". #' #' doc_corpus[[1]]$content doc_corpus[[2]]$content doc_corpus[[3]]$content #' #' #' If we have multiple terms that only differ in their endings (e.g., past, present, present-perfect-continuous tense), the algorithm will treat them differently because it does not understand language semantics the way a human would. To make things easier for the computer, we can delete these endings by "stemming" documents. Remember to load the package `SnowballC` before using the function `stemDocument()`. The earliest stemmer was written by Julie Beth Lovins in 1968, which had great influence on all subsequent work. Currently, one of the most popular stemming approaches was proposed by Martin Porter and is used in `stemDocument()`, [more on Porter's algorithm](https://tartarus.org/martin/PorterStemmer/). #' #' # install.packages("SnowballC") library(SnowballC) doc_corpus <- tm_map(doc_corpus, stemDocument) doc_corpus[[1]]$content #' #' #' This stemming process has to be done after the `PlainTextDocument` function because `stemDocument` only can be applied to plain text. #' #' Bags of words: It's very useful to be able to tokenize text documents into `n-grams`, sequences of words, e.g., a `2-gram` represents two-word phrases that appear together in order. This allows us to form bags of words and extract information about word ordering. The *bag of words model* is a common way to represent documents in matrix form based on their term frequencies (TFs). We can construct an $n\times t$ document-term matrix (DTM), where $n$ is the number of documents, and $t$ is the number of unique terms. Each column in the DTM represents a unique term, the $(i,j)^{th}$ cell represents how many of term $j$ are present in document $i$. #' #' The basic bag of words model is invariant to ordering of the words within a document. Once we compute the DTM, we can use machine learning techniques to interpret the derived signature information contained in the resulting matrices. #' #' *Document-term matrix* #' #' Now the `doc_corpus` object is quite clean. Next, we can make a document-term matrix to explore all the terms in 5 documents. The document-term matrix is a bunch of dummy variables that tell us if a given term appears in a specific document. #' #' doc_dtm <- TermDocumentMatrix(doc_corpus) doc_dtm #' #' #' The summary of the document-term matrix is informative. We have 329 different terms in the 5 documents. There are 540 non-zero and 1105 sparse entries. Thus, the sparsity is $\frac{1105}{(540+1105)}\approx 67\%$, which measures the term sparsity across documents. A high sparsity means terms are not repeated often among different documents. #' #' Recall that we applied the `PlainTextDocument` function to your `doc_corpus` object. This removes all document metadata. To relabel the documents in the document-term matrix we can use the following commands: #' #' doc_dtm$dimnames$Docs <- as.character(1:5) inspect(doc_dtm) #' #' #' We might want to find and report the frequent terms using this document-term matrix. #' #' findFreqTerms(doc_dtm, lowfreq = 2) #' #' #' This gives us the terms that appear in at least 2 documents. High-frequency terms like `comput`, `statist`, `model`, `healthcar`, and `learn` make perfect sense to be included in this shortlist, as these courses cover modeling, statistical and computational methods with applications to health sciences. #' #' The `tm` package provides the functionality to compute the correlations between terms. Here is a mechanism to determine the words that are highly correlated with `statist`, ($\rho(statist, ?)\ge 0.8$). #' #' findAssocs(doc_dtm, "statist", corlimit = 0.8) #' #' #' ## Case-Study: Job ranking #' #' Next, we will mine an interesting dataset containing both quantitative and qualitative (free text) elements. The [2011 USA Jobs Ranking Dataset](https://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking) from [SOCR data archive](https://wiki.socr.umich.edu/index.php/SOCR_Data). More recent jobs description data are provided by the [US Bureau of Labor Statistics (BLS)](https://www.bls.gov/oes/current/oes_nat.htm) and can be used to practice similar *mixed data* modeling approaches. #' #' library(rvest) wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking") html_nodes(wiki_url, "#content") job <- html_table(html_nodes(wiki_url, "table")[[1]]) head(job) #' #' #' Note that low or high job indices represent more or less desirable jobs, respectively. Thus, in 2011, the most desirable job among top 200 common jobs would be Software Engineer. The aim of our study now is to explore the difference between the *top 30* desirable jobs and the *bottom 100* jobs based on their textural job descriptions (JDs). #' #' We will go through the same procedure as we did earlier in the course syllabi example. The documents we are using represent the *Description* column (a text vector of JDs) in the dataset. #' #' *Step 1: make a VCorpus object* #' #' jobs <- as.list(job$Description) jobCorpus <- VCorpus(VectorSource(jobs)) #' #' #' *Step 2: clean the VCorpus object* #' #' jobCorpus <- tm_map(jobCorpus, tolower) for(j in seq(jobCorpus)){ jobCorpus[[j]] <- gsub("_", " ", jobCorpus[[j]]) } #' #' #' Here we used a loop to substitute "_" (underscore) with blank space. This is necessary as the underscore character connecting words will cause problems with using `removePunctuation` to separate terms. In this situation, global pattern matching, `gsub`, will find and replace the underscores with spaces. #' #' jobCorpus <- tm_map(jobCorpus, removeWords, stopwords("english")) jobCorpus <- tm_map(jobCorpus, removePunctuation) jobCorpus <- tm_map(jobCorpus, stripWhitespace) jobCorpus <- tm_map(jobCorpus, PlainTextDocument) jobCorpus <- tm_map(jobCorpus, stemDocument) #' #' #' *Step 3: build document-term matrix* #' #' Term Document Matrix (TDM) objects (`tm::DocumentTermMatrix`) contain a sparse term-document matrix or document-term matrix and attribute weights of the matrix. First make sure that we got a clean VCorpus object. #' #' jobCorpus[[1]]$content #' #' #' Then we can start to build the DTM and reassign labels to the `Docs`. #' #' dtm <- DocumentTermMatrix(jobCorpus) dtm dtm$dimnames$Docs<-as.character(1:200) inspect(dtm[1:10, 1:10]) # tdm <- TermDocumentMatrix(jobCorpus, control = list(removePunctuation = TRUE, stopwords = TRUE)) # dtm <- DocumentTermMatrix(jobCorpus, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE)) #' #' #' Let's subset the `dtm` into top 30 jobs and bottom 100 jobs. #' #' dtm_top30 <- dtm[1:30, ] dtm_bot100 <- dtm[101:200, ] dtm_top30 dtm_bot100 #' #' #' In this case, since the sparsity is very high, we can try to remove some of the words that rarely appear in the job descriptions. #' #' dtms_top30 <- removeSparseTerms(dtm_top30, 0.90) dtms_top30 dtms_bot100 <- removeSparseTerms(dtm_bot100, 0.94) dtms_bot100 #' #' #' On the top, instead of the initial 846 terms, we only have 19 terms appearing in at least 10% of the JD's. #' #' Similarly, in the bottom, instead of the initial 846 terms, we only have 14 terms appearing in at least 6% of the bottom 100 jobs. #' #' Similar to what we did in [Chapter 5](https://socr.umich.edu/DSPA2/DSPA2_notes/05_SupervisedClassification.html), visualization of the terms-world clouds can be accomplished in `R` by combining functionalities from the `tm` and `wordcloud` packages. First, we can count the term frequencies in two document-term matrices. #' #' library(plotly) # Let's calculate the cumulative frequencies of words across documents and sort: freq1<-sort(colSums(as.matrix(dtms_top30)), decreasing=T) freq1 freq2<-sort(colSums(as.matrix(dtms_bot100)), decreasing=T) freq2 # Plot frequent words (for bottom 100 jobs) wf2=data.frame(term=names(freq2), occurrences=freq2) # library(ggplot2) # p <- ggplot(subset(wf2, freq2>2), aes(term, occurrences)) # p <- p + geom_bar(stat="identity") # p <- p + theme(axis.text.x=element_text(angle=45, hjust=1)) # p df.freq2 <- subset(wf2, freq2>2) plot_ly(data=df.freq2, x=~term, y=~occurrences, type="bar") %>% layout(title="Bottom 100 Job Descriptions (Frequent Terms)", xaxis = list(categoryorder = "total descending")) # Plot frequent words (for top 30 jobs) wf1=data.frame(term=names(freq1), occurrences=freq1) # p2 <- ggplot(subset(wf1, freq1>2), aes(term, occurrences, fill = freq1)) # p2 <- p2 + geom_bar(stat="identity") # p2 <- p2 + theme(axis.text.x=element_text(angle=45, hjust=1)) # p2 df.freq1 <- subset(wf1, freq1>2) plot_ly(data=df.freq1, x=~term, y=~occurrences, type="bar") %>% layout(title="Top 30 Job Descriptions (Frequent Terms)", xaxis = list(categoryorder = "total descending")) # what is common (frequently occurring words) # in the description of the top 30 and the bottom 100 jobs? intersect(df.freq1$term, df.freq2$term) #' #' #' Then we apply the `wordcloud` function to the `freq` dataset. #' #' library(wordcloud) set.seed(123) # wordcloud(names(freq1), freq1) wordcloud(names(freq1), freq1, min.freq=2, colors=brewer.pal(6,"Spectral")) # Color code the frequencies using an appropriate color map: # Sequential palettes names include: # Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd # Diverging palettes include # BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral wordcloud(names(freq2), freq2, min.freq=5, colors=brewer.pal(6, "Spectral")) #wordcloud(names(freq2), freq2, min.freq=5, colors=brewer.pal(length(names(subset(wf2, freq2>5))), "Dark2")) #' #' #' It becomes apparent that top 30 jobs tend to focus more on research or discovery and include frequent keywords like “study”, “theory”, and “science”. The bottom 100 jobs are more focused on mechanistic operations of objects or equipment, with frequent keywords like “operation”, “repair”, and “perform”. #' #' ## Area Under ROC Curve #' #' In [Chapter 9](https://socr.umich.edu/DSPA2/DSPA2_notes/09_ModelEvalImprovement.html) we will discuss the *Receiver Operating Characteristic (ROC)* curve. We can use document-term matrices to build classifiers and use the *area under ROC curve* to evaluate those classifiers. Assume we want to predict whether a job ranks top 30 in the job list. #' #' The first task would be to create an indicator of high rank (job is in the top 30 list). We can use the `ifelse()` function that we are already familiar with. #' #' job$highrank <- ifelse(job$Index<30, 1, 0) #' #' #' Next we load the `glmnet` package to help us build the prediction model and draw graphs. #' #' # install.packages("glmnet") library(glmnet) #' #' #' The function we will be using is the `cv.glmnet`, where `cv` stands for cross-validation. Since the derived job ranking variable `highrank` is binary, we specify the option `family='binomial'`. Also, we want to use a 10-fold CV method for internal statistical (resampling-based) prediction validation. #' #' set.seed(25) fit <- cv.glmnet(x = as.matrix(dtm), y = job[['highrank']], family = 'binomial', # lasso penalty alpha = 1, # interested in the area under ROC curve type.measure = "auc", # 10-fold cross-validation nfolds = 10, # high value is less accurate, but has faster training thresh = 1e-3, # again lower number of iterations for faster training maxit = 1e3) # plot(fit) print(paste("max AUC =", round(max(fit$cvm), 4))) plotCV.glmnet <- function(cv.glmnet.object, name="") { df <- as.data.frame(cbind(x=log(cv.glmnet.object$lambda), y=cv.glmnet.object$cvm, errorBar=cv.glmnet.object$cvsd), nzero=cv.glmnet.object$nzero) featureNum <- cv.glmnet.object$nzero xFeature <- log(cv.glmnet.object$lambda) yFeature <- max(cv.glmnet.object$cvm)+max(cv.glmnet.object$cvsd) dataFeature <- data.frame(featureNum, xFeature, yFeature) plot_ly(data = df) %>% # add error bars for each CV-mean at log(lambda) add_trace(x = ~x, y = ~y, type = 'scatter', mode = 'markers', name = 'CV MSE', error_y = ~list(array = errorBar)) %>% # add the lambda-min and lambda 1SD vertical dash lines add_lines(data=df, x=c(log(cv.glmnet.object$lambda.min), log(cv.glmnet.object$lambda.min)), y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)), showlegend=F, line=list(dash="dash"), name="lambda.min", mode = 'lines+markers') %>% add_lines(data=df, x=c(log(cv.glmnet.object$lambda.1se), log(cv.glmnet.object$lambda.1se)), y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)), showlegend=F, line=list(dash="dash"), name="lambda.1se") %>% # Add Number of Features Annotations on Top add_trace(dataFeature, x = ~xFeature, y = ~yFeature, type = 'scatter', name="Number of Features", mode = 'text', text = ~featureNum, textposition = 'middle right', textfont = list(color = '#000000', size = 9)) %>% # Add top x-axis (non-zero features) # add_trace(data=df, x=~c(min(cv.glmnet.object$nzero),max(cv.glmnet.object$nzero)), # y=~c(max(y)+max(errorBar),max(y)+max(errorBar)), showlegend=F, # name = "Non-Zero Features", yaxis = "ax", mode = "lines+markers", type = "scatter") %>% layout(title = paste0("Cross-Validation MSE (", name, ")"), xaxis = list(title=TeX("\\log(\\lambda)"), side="bottom", showgrid=TRUE), # type="log" hovermode = "x unified", legend = list(orientation='h'), # xaxis2 = ax, yaxis = list(title = cv.glmnet.object$name, side="left", showgrid = TRUE)) %>% config(mathjax = 'cdn') } plotCV.glmnet(fit, "LASSO") #' #' #' Here $x$ is a matrix of covariates and $y$ is the response (outcome) variable. The graph is showing all the AUC we got from models we created. The last line of code helps us select the best AUC among all models. The resulting $AUC\sim 0.63$ represents a relatively good prediction model for this small sample size. #' #' ## TF-IDF #' #' To enhance the performance of DTM matrix, we introduce the **TF-IDF (term frequency - inverse document frequency)** concept. Unlike pure frequency, TF-IDF measures the relative importance of a term. If a term appears in almost every document, the term will be considered common with high information capacity. Alternatively, rare terms would be considered as less informational. #' #' ### Term Frequency (TF) #' #' **TF** is a ratio $\frac{\text{a term's occurrences in a document}}{\text{the number of occurrences of the most frequent word within the same document}}$. #' #' $$TF(t,d)=\frac{f_d(t)}{\max_{w\in d}{f_d(w)}}.$$ #' #' ### Inverse Document Frequency (IDF) #' #' The **TF** definition may allow high scores for irrelevant words that naturally show up often in a long text, even if these may have been triaged in a prior preprocessing step. The **IDF** attempts to rectify that. **IDF** represents the inverse of the share of the documents in which the regarded term can be found. The lower the number of documents containing the term, relative to the size of the corpus, the higher the term factor. #' #' IDF involves a logarithm function because otherwise the effective scoring penalty of showing up in two documents would be too extreme. Typically, the IDF for a term found in just one document is twice the IDF for another term found in two docs. The $\ln()$ function rectifies this bias of ranking in favor of rare terms, even if the TF-factor may be high. It is rather unlikely that a term's relevance is only high in one doc and not all others. #' #' $$IDF(t,D) = \ln\left ( \frac{|D|}{|\{ d\in D: t\in d\}|} \right ).$$ #' #' ### TF-IDF #' #' Both TF and IDF yield high scores for highly relevant terms. TF relies on local information (search over $d$), whereas IDF incorporates a more global perspective (search over $D$). The product $TF\times IDF$, gives the classical **TF-IDF** formula. However, alternative expressions may be formulated to get other univariate expressions using alternative weights for TF and IDF. #' #' $$TF\_IDF(t,d,D)= TF(t,d)\times IDF(t,D).$$ #' An example of an alternative TF-IDF metric can be defined by: #' $$ TF\_IDF'(t,d,D) =\frac{IDF(t,D)}{|D|} + TF\_IDF(t,d,D).$$ #' #' Let's make another DTM with TF-IDF weights and compare the differences between the *unweighted* and *weighted* DTM. #' #' dtm.tfidf <- DocumentTermMatrix(jobCorpus, control = list(weighting=weightTfIdf)) dtm.tfidf dtm.tfidf$dimnames$Docs <- as.character(1:200) inspect(dtm.tfidf[1:9, 1:10]) inspect(dtm[1:9, 1:10]) #' #' #' Inspecting the two different DTMs we can see that TF-IDF is not only counting the frequency but also assigning different weights to each term according to the importance of the term. Next, we are going to fit another model with this new DTM (`dtm.tfidf`). #' #' set.seed(1234) fit1 <- cv.glmnet(x = as.matrix(dtm.tfidf), y = job[['highrank']], family = 'binomial', # lasso penalty alpha = 1, # interested in the area under ROC curve type.measure = "auc", # 10-fold cross-validation nfolds = 10, # high value is less accurate, but has faster training thresh = 1e-3, # again lower number of iterations for faster training maxit = 1e3) # plot(fit1) print(paste("max AUC =", round(max(fit1$cvm), 4))) plotCV.glmnet(fit1, "LASSO") #' #' #' This performance is about the same as the previous jobs ranking prediction classifier (based on the unweighted DTM). Due to random sampling, each run of the protocols may generate a slightly different result. The idea behind using TF-IDF is that one would expect to get more unbiased estimates of word importance. If the document includes stopwords, like "the" or "one", the unweighted DTM may distort the results, but TF-IDF can help fix this problem. #' #' Next we can report a more intuitive representation of the job ranking prediction reflecting the agreement of the binary (top-30 or not) classification between the real labels and the predicted labels. Note that this validation uses the training data itself, however, later we will illustrate model performance assessment using independent testing data. #' #' # Binarize the LASSO probability prediction preffit1 <- predict(fit1, newx=as.matrix(dtm.tfidf), s="lambda.min", type = "class") binPredfit1 <- ifelse(preffit1<0.5, 0, 1) table(binPredfit1, job[['highrank']]) #' #' #' Let's try to predict the job ranking of a new (testing or validation) job description (JD). There are [many job descriptions provided online](https://www.bls.gov/ocs/ocsjobde.htm) that we can use for independent testing, by extracting the JD text and using the pre-trained model to predict the job ranking of the corresponding positions. Trying several alternative job categories, e.g., some high-tech or fin-tech and some manufacturing and construction jobs, may provide some intuition to the power of the jobs-classifier we built. Below, we will compare the BLS JDs for the positions of *accountant*, *attorney*, and *machinist.* #' #' # install.packages("text2vec"); install.packages("data.table") library(text2vec) library(data.table) # Choose the JD for a PUBLIC ACCOUNTANTS 1430 (https://www.bls.gov/ocs/ocsjobde.htm) xTestAccountant <- "Performs professional auditing work in a public accounting firm. Work requires at least a bachelor's degree in accounting. Participates in or conducts audits to ascertain the fairness of financial representations made by client companies. May also assist the client in improving accounting procedures and operations. Examines financial reports, accounting records, and related documents and practices of clients. Determines whether all important matters have been disclosed and whether procedures are consistent and conform to acceptable practices. Samples and tests transactions, internal controls, and other elements of the accounting system(s) as needed to render the accounting firm's final written opinion. As an entry level public accountant, serves as a junior member of an audit team. Receives classroom and on-the-job training to provide practical experience in applying the principles, theories, and concepts of accounting and auditing to specific situations. (Positions held by trainee public accountants with advanced degrees, such as MBA's are excluded at this level.) Complete instructions are furnished and work is reviewed to verify its accuracy, conformance with required procedures and instructions, and usefulness in facilitating the accountant's professional growth. Any technical problems not covered by instructions are brought to the attention of a superior. Carries out basic audit tests and procedures, such as: verifying reports against source accounts and records; reconciling bank and other accounts; and examining cash receipts and disbursements, payroll records, requisitions, receiving reports, and other accounting documents in detail to ascertain that transactions are properly supported and recorded. Prepares selected portions of audit working papers" xTestAttorney <- "Performs consultation, advisory and/or trail work and carries out the legal processes necessary to effect the rights, privileges, and obligations of the organization. The work performed requires completion of law school with an L.L.B. degree or J.D. degree and admission to the bar. Responsibilities or functions include one or more of the following or comparable duties: 1. Preparing and reviewing various legal instruments and documents, such as contracts, leases, licenses, purchases, sales, real estate, etc.; 2. Acting as agent of the organization in its transactions; 3. Examining material (e.g., advertisements, publications, etc.) for legal implications; advising officials of proposed legislation which might affect the organization; 4. Applying for patents, copyrights, or registration of the organization's products, processes, devices, and trademarks; advising whether to initiate or defend lawsuits; 5. Conducting pre trial preparations; defending the organization in lawsuits; 6. Prosecuting criminal cases for a local or state government or defending the general public (for example, public defenders and attorneys rendering legal services to students); or 7. Advising officials on tax matters, government regulations, and/or legal rights. Attorney jobs are matched at one of six levels according to two factors: 1. Difficulty level of legal work; and 2. Responsibility level of job. Attorney jobs which meet the above definitions are to be classified and coded in accordance with a chart available upon request. Legal questions are characterized by: facts that are well-established; clearly applicable legal precedents; and matters not of substantial importance to the organization. (Usually relatively limited sums of money, e.g., a few thousand dollars, are involved.) a. legal investigation, negotiation, and research preparatory to defending the organization in potential or actual lawsuits involving alleged negligence where the facts can be firmly established and there are precedent cases directly applicable to the situation; b. searching case reports, legal documents, periodicals, textbooks, and other legal references, and preparing draft opinions on employee compensation or benefit questions where there is a substantial amount of clearly applicable statutory, regulatory, and case material; c. drawing up contracts and other legal documents in connection with real property transactions requiring the development of detailed information but not involving serious questions regarding titles to property or other major factual or legal issues. d. preparing routine criminal cases for trial when the legal or factual issues are relatively straightforward and the impact of the case is limited; and e. advising public defendants in regard to routine criminal charges or complaints and representing such defendants in court when legal alternatives and facts are relatively clear and the impact of the outcome is limited primarily to the defendant. Legal work is regularly difficult by reason of one or more of the following: the absence of clear and directly applicable legal precedents; the different possible interpretations that can be placed on the facts, the laws, or the precedents involved; the substantial importance of the legal matters to the organization (e.g., sums as large as $100,000 are generally directly or indirectly involved); or the matter is being strongly pressed or contested in formal proceedings or in negotiations by the individuals, corporations, or government agencies involved. a. advising on the legal implications of advertising representations when the facts supporting the representations and the applicable precedent cases are subject to different interpretations; b. reviewing and advising on the implications of new or revised laws affecting the organization; c. presenting the organization's defense in court in a negligence lawsuit which is strongly pressed by counsel for an organized group; d. providing legal counsel on tax questions complicated by the absence of precedent decisions that are directly applicable to the organization's situation; e. preparing and prosecuting criminal cases when the facts of the cases are complex or difficult to determine or the outcome will have a significant impact within the jurisdiction; and f. advising and representing public defendants in all phases of criminal proceedings when the facts of the case are complex or difficult to determine, complex or unsettled legal issues are involved, or the prosecutorial jurisdiction devotes substantial resources to obtaining a conviction." xTestMachinist <- "Produces replacement parts and new parts in making repairs of metal parts of mechanical equipment. Work involves most of the following: interpreting written instructions and specifications; planning and laying out of work; using a variety of machinist's handtools and precision measuring instruments; setting up and operating standard machine tools; shaping of metal parts to close tolerances; making standard shop computations relating to dimensions of work, tooling, feeds, and speeds of machining; knowledge of the working properties of the common metals; selecting standard materials, parts, and equipment required for this work; and fitting and assembling parts into mechanical equipment. In general, the machinist's work normally requires a rounded training in machine-shop practice usually acquired through a formal apprenticeship or equivalent training and experience. Industrial machinery repairer. Repairs machinery or mechanical equipment. Work involves most of the following: examining machines and mechanical equipment to diagnose source of trouble; dismantling or partly dismantling machines and performing repairs that mainly involve the use of handtools in scraping and fitting parts; replacing broken or defective parts with items obtained from stock; ordering the production of a replacement part by a machine shop or sending the machine to a machine shop for major repairs; preparing written specifications for major repairs or for the production of parts ordered from machine shops; reassembling machines; and making all necessary adjustments for operation. In general, the work of a machinery maintenance mechanic requires rounded training and experience usually acquired through a formal apprenticeship or equivalent training and experience. Excluded from this classification are workers whose primary duties involve setting up or adjusting machines. Vehicle and mobile equipment mechanics and repairers. Repairs, rebuilds, or overhauls major assemblies of internal combustion automobiles, buses, trucks, or tractors. Work involves most of the following: Diagnosing the source of trouble and determining the extent of repairs required; replacing worn or broken parts such as piston rings, bearings, or other engine parts; grinding and adjusting valves; rebuilding carburetors; overhauling transmissions; and repairing fuel injection, lighting, and ignition systems. In general, the work of the motor vehicle mechanic requires rounded training and experience usually acquired through a formal apprenticeship or equivalent training and experience" # Define the testing cases (1) list, (2) Y-category (top-30 or not), and (3) ID names testJDs <- as.list(c(xTestAccountant, xTestAttorney, xTestMachinist)) testTop30 <- c(1,1,0) testNames <- c("xTestAccountant", "xTestAttorney", "xTestMachinist") ### 1. Training Phase Labeled data # loop to substitute "_" with blank space for(j in seq(jobs)){ jobs[[j]] <- gsub("_", " ", jobs[[j]]) } prep_fun = tolower tok_fun = word_tokenizer trainIterator = itoken(unlist(jobs), preprocessor = prep_fun, tokenizer = tok_fun, ids = rownames(jobs), progressbar = FALSE) vocab = create_vocabulary(trainIterator) vocab # `text2vec` package has alternative tokenizer functions (?tokenizers), we can use simple wrappers of the `base::gsub()` function or # write new tokenize. Let's construct a document-term matrix using the vocabulary of the trainIterator vectorizer = vocab_vectorizer(vocab) dtm_train = create_dtm(trainIterator, vectorizer) dim(dtm_train) ### 2. Fit LASSO model set.seed(1234) fit1 <- cv.glmnet(x = dtm_train, y = job[['highrank']], family = 'binomial', # lasso penalty alpha = 1, # interested in the area under ROC curve type.measure = "auc", # 10-fold cross-validation nfolds = 10, # high value is less accurate, but has faster training thresh = 1e-5, # again lower number of iterations for faster training maxit = 1e4) print(paste("max AUC =", round(max(fit1$cvm), 4))) #### 3. Testing Phase (new JDs) testIterator = tok_fun(prep_fun(unlist(testJDs))) # turn off progress bar because it won't look nice in rmd testIterator = itoken(testIterator, ids = testNames, progressbar = FALSE) dtm_test = create_dtm(testIterator, vectorizer) predictedJDs = predict(fit1, dtm_test, type = 'response')[,1] # Type can be: "link", "response", "coefficients", "class", "nonzero" predictedJDs # glmnet:::auc(testTop30, predictedJDs) #' #' #' Note that the results may change somewhat (e.g., AUC~0.62). Above we assessed the JD predictive (LASSO) model using the three out of bag job descriptions. Below is a plot of the model's MSE. #' #' # plot(fit1) # plot(fit1, xvar="lambda", label="TRUE") # mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5) plotCV.glmnet(fit1, "LASSO") #' #' #' The output of the predictions shows that: #' #' * On the *training data*, the predicted probabilities rapidly decrease with the indexing of the jobs, corresponding to the *overall job ranking* (highly ranked/desired jobs are listed on the top). #' * On the three *testing job description data* (accountant, attorney, and machinist), there is a clear ranking difference between the machinist and the other two professions. #' #' Also see the discussion in [Chapter 11](https://socr.umich.edu/DSPA2/DSPA2_notes/11_FeatureSelection.html) about the different *types of predictions* that can be generated as outputs of `cv.glmnet` regularized forecasting methods. #' #' ## Cosine similarity #' #' As we mentioned above, text data are often transformed, or weighted, by the Term Frequency-Inverse Document Frequency (TF-IDF). In many text-mining applications, this preprocessing step may often yield better results compared to the raw (unweighted) term-frequencies. Alternative transformations may be based on different distance measures, such as cosine distance. The cosine similarity and distance are defined by #' #' $$similarity = \cos(\theta) = \frac{A\cdot B}{||A||_2||B||_2},$$ #' $$Cosine\ Distance = 1-Similarity = 1 - \frac{A\cdot B}{||A||_2||B||_2},$$ #' where $\theta$ represents the angle between two vectors $A$ and $B$ in Euclidean space spanned by the DTM matrix. Note that the cosine *similarity* of two text documents (more specifically two DTM's with or without TF-IDF weights) will always be in the range $[0,1]$, since the term frequencies are always non-negative. In other words, the angle between two term frequency vectors cannot exceed $90^o$, and therefore, $0\leq Cosine\_Distance\leq1$, albeit, it's not a proper distance metric as it does not satisfy the triangle inequality, in general. Mind the dimensions of the corresponding matrices: $\dim(dtm)=200\times 846$ and $\dim(dist\_cos)=200\times 200$. #' #' cos_dist = function(mat){ numer = tcrossprod(mat) denom1 = sqrt(apply(mat, 1, crossprod)) denom2 = sqrt(apply(mat, 1, crossprod)) 1 - numer / outer(denom1,denom2) } # Recall # dtm <- DocumentTermMatrix(jobCorpus) # jobs <- as.list(job$Description) # trainIterator = itoken(unlist(jobs), ...) # dtm_train = create_dtm(trainIterator, vectorizer) # fit <- cv.glmnet(x = as.matrix(dtm), y = job[['highrank']], ...) dist_cos = cos_dist(as.matrix(dtm)) # also try with dtm_train set.seed(1234) fit_cos <- cv.glmnet(x = dist_cos, y = job[['highrank']], family = 'binomial', # lasso penalty alpha = 1, # interested in the area under ROC curve type.measure = "auc", # 10-fold cross-validation nfolds = 10, # high value is less accurate, but has faster training thresh = 1e-5, # again lower number of iterations for faster training maxit = 1e5) # plot(fit_cos) plotCV.glmnet(fit_cos, "Cosine-transformed LASSO") print(paste("max AUC =", round(max(fit_cos$cvm), 4))) #' #' #' The AUC now is significantly higher, $0.84$, which is a pretty good result, an improvement over the prior unweighted and TF-IDF-weighted DTMs. This suggests that cosine-transforming the data improves natural language processing results and leads to a more acceptable content classifier. #' #' #' ## Sentiment analysis #' #' Let's use the `text2vec::movie_review` dataset, which consists of 5,000 movie reviews dichotomized as `positive` or `negative`. In the subsequent predictive analytics, this *sentiment* will represent our output feature: #' $$Y= Sentiment=\left\{ #' \begin{array}{ll} #' 0, & \quad negative \\ #' 1, & \quad positive #' \end{array} #' \right. .$$ #' #' #' *Data Preprocessing* #' #' The `data.table` package will also be used for some data manipulation. Let's start with splitting the data into *training* and *testing* sets. #' #' # install.packages("text2vec"); install.packages("data.table") library(text2vec) library(data.table) # Load the movie reviews data data("movie_review") # coerce the movie reviews data to a data.table (DT) object setDT(movie_review) # create a key for the movie-reviews data table setkey(movie_review, id) # View the data # View(movie_review) head(movie_review); dim(movie_review); colnames(movie_review) # Generate 80-20% training-testing split of the reviews all_ids = movie_review$id set.seed(1234) train_ids = sample(all_ids, 5000*0.8) test_ids = setdiff(all_ids, train_ids) train = movie_review[train_ids, ] test = movie_review[test_ids, ] #' #' #' Next, we will vectorize the reviews by creating terms to *termID* mappings. Note that terms may include arbitrary *n-grams*, not just single words. The set of reviews will be represented as a sparse matrix, with rows and columns corresponding to reviews/reviewers and terms, respectively. This vectorization may be accomplished in several alternative ways, e.g., by using the corpus vocabulary, feature hashing, etc. #' #' The vocabulary-based DTM, created by the `create_vocabulary()` function, relies on all unique terms from all reviews, where each term has a unique ID. In this example, we will create the review vocabulary using an *iterator* construct abstracting the input details and enabling *in memory* processing of the (training) data by chunks. #' #' # define the test preprocessing # either a simple (tolower case) function preproc_fun = tolower # or a more elaborate "cleaning" function preproc_fun = function(x) # text data { require("tm") x = gsub("<.*?>", " ", x) # regex removing HTML tags x = iconv(x, "latin1", "ASCII", sub="") # remove non-ASCII characters x = gsub("[^[:alnum:]]", " ", x) # remove non-alpha-numeric values x = tolower(x) # convert to lowercase characters # x = removeNumbers(x) # removing numbers x = stripWhitespace(x) # removing white space x = gsub("^\\s+|\\s+$", "", x) # remove leading and trailing white space return(x) } # define the tokenization function token_fun = word_tokenizer # iterator for both training and testing sets iter_train = itoken(train$review, preprocessor = preproc_fun, tokenizer = token_fun, ids = train$id, progressbar = TRUE) iter_test = itoken(test$review, preprocessor = preproc_fun, tokenizer = token_fun, ids = test$id, progressbar = TRUE) reviewVocab = create_vocabulary(iter_train) # report the head and tail of the reviewVocab reviewVocab #' #' #' The next step computes the *document term matrix* (DTM). #' #' reviewVectorizer = vocab_vectorizer(reviewVocab) t0 = Sys.time() dtm_train = create_dtm(iter_train, reviewVectorizer) dtm_test = create_dtm(iter_test, reviewVectorizer) t1 = Sys.time() print(difftime(t1, t0, units = 'sec')) # check the DTM dimensions dim(dtm_train); dim(dtm_test) # confirm that the training data review DTM dimensions are consistent # with training review IDs, i.e., #rows = number of documents, and # #columns = number of unique terms (n-grams), dim(dtm_train)[[2]] identical(rownames(dtm_train), train$id) #' #' #' ## NLP/TM Analytics #' #' We can now fit statistical models or derive machine learning AI predictions. Let's start by using `glmnet()` to fit a *logit model* with LASSO ($L_1$) regularization and 10-fold cross-validation, see [Chapter 11](https://socr.umich.edu/DSPA2/DSPA2_notes/11_FeatureSelection.html). #' #' library(glmnet) nFolds = 10 t0 = Sys.time() glmnet_classifier = cv.glmnet(x = dtm_train, y = train[['sentiment']], family = "binomial", # LASSO L1 penalty alpha = 1, # interested in the area under ROC curve or MSE type.measure = "auc", # n-fold internal (training data) stats cross-validation nfolds = nFolds, # threshold: high value is less accurate / faster training thresh = 1e-2, # again lower number of iterations for faster training maxit = 1e3 ) lambda.best <- glmnet_classifier$lambda.min lambda.best # report execution time t1 = Sys.time() print(difftime(t1, t0, units = 'sec')) # some prediction plots # plot(glmnet_classifier) # # plot(glmnet_classifier, xvar="lambda", label="TRUE") # mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5) plotCV.glmnet(glmnet_classifier) #' #' #' Now let's look at external validation, i.e., testing the model on the independent 20% of the reviews we kept aside. The performance of the binary prediction (binary sentiment analysis of these movie reviews) on the test data is roughly the same as we had from the internal statistical 10-fold cross-validation. #' #' # library('glmnet') # report the mean internal cross-validated error print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4))) # report TESTING data prediction accuracy xTest = dtm_test yTest = test[['sentiment']] predLASSO <- predict(glmnet_classifier, s = glmnet_classifier$lambda.1se, newx = xTest) testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO # Binarize the LASSO probability prediction binPredLASSO <- ifelse(predLASSO<0.5, 0, 1) table(binPredLASSO, yTest) # and testing data AUC glmnet:::auc(yTest, predLASSO) # report the top 20 negative and positive predictive terms: predict == predict.cv.glmnet summary(predLASSO) sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients"))[1:20] rev(sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients")))[1:20] #' #' #' The (external) prediction performance, measured by AUC, on the testing data is about the same as the internal 10-fold stats cross-validation we reported above. #' #' #' ### Prediction Optimization #' #' Earlier we saw that we can also prune the vocabulary and perhaps improve prediction performance, e.g., by removing non-salient terms like stopwords and by using *n-grams* instead of single words. #' #' reviewVocab = create_vocabulary(iter_train, stopwords=tm::stopwords("english"), ngram = c(1L, 2L)) prunedReviewVocab = prune_vocabulary(reviewVocab, term_count_min = 10, doc_proportion_max = 0.5, doc_proportion_min = 0.001) prunedVectorizer = vocab_vectorizer(prunedReviewVocab) t0 = Sys.time() dtm_train = create_dtm(iter_train, prunedVectorizer) dtm_test = create_dtm(iter_test, prunedVectorizer) t1 = Sys.time() print(difftime(t1, t0, units = 'sec')) #' #' #' Next, refit the model and report the performance. Did this yield an improvement in the prediction accuracy? #' #' glmnet_prunedClassifier=cv.glmnet(x=dtm_train, y=train[['sentiment']], family = "binomial", # LASSO L1 penalty alpha = 1, # interested in the area under ROC curve or MSE type.measure = "auc", # n-fold internal (training data) stats cross-validation nfolds = nFolds, # threshold: high value is less accurate / faster training thresh = 1e-4, # again lower number of iterations for faster training maxit = 1e5 ) lambda.best <- glmnet_prunedClassifier$lambda.min lambda.best # report execution time t1 = Sys.time() print(difftime(t1, t0, units = 'sec')) # some prediction plots # plot(glmnet_prunedClassifier) # mtext("Pruned-Model CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5) plotCV.glmnet(glmnet_prunedClassifier) # report the mean internal cross-validated error print(paste("max AUC =", round(max(glmnet_prunedClassifier$cvm), 4))) # report TESTING data prediction accuracy xTest = dtm_test yTest = test[['sentiment']] predLASSO = predict(glmnet_prunedClassifier, dtm_test, type = 'response')[,1] testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO # Binarize the LASSO probability prediction binPredLASSO <- ifelse(predLASSO<0.5, 0, 1) table(binPredLASSO, yTest) # and testing data AUC glmnet:::auc(yTest, predLASSO) # report the top 20 negative and positive predictive terms summary(predLASSO) sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients"))[1:20] rev(sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients")))[1:20] # Binarize the LASSO probability prediction # and construct an approximate confusion matrix binPredLASSO <- ifelse(predLASSO<0.5, 0, 1) table(binPredLASSO, yTest) #' #' #' Using `n-grams` somewhat improved the sentiment prediction model. #' #' #' # Apriori Association Rules Learning #' #' [HTTP cookies](https://en.wikipedia.org/wiki/HTTP_cookie) are used to track web-surfing and Internet traffic. We often notice that promotions (ads) on websites tend to match our needs, reveal our prior browsing history, or may reflect our interests. That is not an accident. Nowadays, [recommendation systems](https://en.wikipedia.org/wiki/Recommender_system) are highly based on machine learning methods that can learn the behavior, e.g., purchasing patterns, of individual consumers. In this section, we will uncover some of the mystery behind recommendation systems like market basket analysis. Specifically, we will (1) discuss association rules and their support and confidence, (2) present the *Apriori algorithm* for association rule learning, and (3) cover step-by-step a set of case-studies, including a toy example and studies of Head and Neck Cancer Medications, Grocery purchases, and survival of Titanic passengers. #' #' ## Association Rules #' #' Association rules are the result of process analytics (e.g., market basket analysis) that specify patterns of relationships among items. One specific example would be: #' $$\{charcoal, \, lighter, \, chicken\, wings\}\rightarrow\{barbecue\, sauce\}$$ #' In other words, charcoal, lighter and chicken wings imply barbecue sauce. The curly brackets indicate that we have a set of items and the arrow suggests a direction of the association. Items in a set are called *elements.* When an item-set like $\{charcoal, \, lighter, \, chicken\, wings, \, barbecue\, sauce\}$ appears in our dataset with some regularity, we can mine and discover patterns of association with other item-sets. #' #' Association rules are commonly used for unsupervised discovery of knowledge rather than prediction of pre-specified outcomes. In biomedical research, association rules are widely used to: #' #' - searching for interesting or frequently occurring patterns of DNA, #' - searching for protein sequences in an analysis of cancer data, #' - finding patterns of medical claims that occur in combination with credit card or insurance fraud. #' #' ## The Apriori algorithm for association rule learning #' #' Association rules are mostly applied to *transactional data*, which is usually records of trade, exchange, or arrangement, e.g., medical records. These datasets are typically very large in number of transactions and features, e.g., [electronic health record (EHR)](https://en.wikipedia.org/wiki/Electronic_health_record). In such data archives, the number and complexity of transactions is high, which complicates efforts to extract patterns, conduct market analysis, or predict basket purchases. #' #' **Apriori association rules** help untangle such difficult modeling problems. If we have a simple prior belief about the properties of frequent elements, we may be able to efficiently reduce the number of features or combinations that we need to look at. #' #' The Apriori algorithm is based on a simple `apriori` belief that *all subsets of a frequent item-set must also be frequent*. This is known as the **Apriori property**. In the last example, the full set $\{charcoal, \, lighter, \, chicken\, wings, \, barbecue\, sauce\}$ is *frequent* if and only if itself and all of its subsets, including single elements, pairs, and triples, occur frequently. Naturally, the apriori rule is designed for finding patterns in large datasets where patterns that appear frequently are considered "interesting", "valuable", or "important". #' #' ## Rule **support** and **confidence** #' #' We can measure rule's importance by computing its **support** and **confidence** metrics. The support and confidence represent two criteria useful in deciding whether a pattern is "valuable". By setting thresholds for these two criteria, we can easily limit the number of interesting rules or item-sets reported. #' #' For item-sets $X$ and $Y$, the `support` of an item-set measures how (relatively) frequently it appears in the data: #' $$support(X)=\frac{count(X)}{N},$$ #' where *N* is the total number of transactions in the database and *count(X)* is the number of observations (transactions) containing the item-set *X*. #' #' In a set-theoretic sense, the union of item-sets is an item-set itself. In other words, if $Z=\{X,Y\}=\{X\cup Y\}$, then $$support(Z)=support(X,Y).$$ #' #' For a given rule $X \rightarrow Y$, the `rule's confidence` measures the relative accuracy of the rule: #' $$confidence(X \rightarrow Y)=\frac{support(X, Y)}{support(X)}.$$ #' #' The `confidence` measures the joint occurrence of *X* and *Y* over the *X* domain. If whenever *X* appears *Y* tends to also be present, then we will have a high $confidence(X\rightarrow Y)$. This set-theoretic formulation of the confidence can also be expressed as a conditional probability, since the numerator is effectively the probability of the intersection of the events $E_X$ and $E_Y$ corresponding to observing the transactions $X$ and $Y$, respectively. That is, the support corresponds to the [conditional probability](https://wiki.socr.umich.edu/index.php/AP_Statistics_Curriculum_2007_Prob_Rules#Conditional_Probability) $support(X,Y)=Prob(E_X\cap E_Y)$ and #' $$confidence(X \rightarrow Y)=\frac{Prob(E_X\cap E_Y)}{Prob(E_X)}\equiv Prob(E_Y | E_X).$$ #' #' Note that the ranges of the support and the confidence are $0 \leq support,\ confidence \leq 1$. #' #' #' One intuitive example of a strong association rule is $\{peanut\, butter\}\rightarrow\{bread\}$, because it has high *support* as well as high *confidence* in grocery store transactions. Shoppers tend to purchase bread when they get peanut butter. These items tend to appear in the same baskets, which yields high confidence for the rule $\{peanut\, butter\}\rightarrow\{bread\}$. Of course, there may be cases where $\{peanut\, butter\}\rightarrow\{celery\}$. These recommendation systems may not be perfect and can fail in unexpected ways, see the ["How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did?" article](https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did). #' #' ## Building a set of rules with the Apriori principle #' #' Remember that the number of arrangements of $n$ elements taken $k$ at a time, i.e., the [number of combinations](https://en.wikipedia.org/wiki/Combination), increases exponentially with the size of the item inventory ($n$), see [this SOCR activity](http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_Prob_Count). This is precisely why if a restaurant only uses 10 ingredients, there are ${10\choose 5}=253$ possible menu items for customers to order. clearly the complexity of the number of "baskets" rapidly increases with the inventory of the available items or ingredients. For instance, #' #' - $n=100$ (ingredients) and $k=50$ (menus of 50 ingredients), yields ${100\choose 50}=100891344545564193334812497256\gt 10^{29}$ possible arrangements (combinations), and #' - $n=100$ (ingredients) and $k=5$ (menus of only 5 ingredients), yields ${100\choose 5}=75287520\gt 7M$ possible orders. #' #' To avoid this complexity, we will introduce a two-step process of building few, simple, and informative sets of rules: #' #' - **Step 1**: Filter all item-sets with a minimum *support* threshold. This is accomplished iteratively by increasing the size of the item-sets. In the first iteration, we compute the support of singletons, 1-item-sets. Next iteration, we compute the support of pairs of items, then, triples of items, etc. Item-sets passing iteration *i* could be considered as candidates for the next iteration, *i+1*. If *{A}*, *{B}*, *{C}* are all frequent singletons, but *D* is not frequent in the first singleton-selection round, then in the second iteration we only consider the support of these pairs *{A, B}*, *{A,C}*, *{B,C}*, ignoring all pairs including *D*. This substantially reduces the cardinality of the potential item-sets and ensures the feasibility of the algorithm. At the third iteration, if *{A,C}*, and *{B,C}* are frequently occurring, but *{A, B}* is not, then the algorithm may terminate, as the support of *{A,B,C}* is trivial (does not pass the support threshold), given that *{A, B}* was not frequent enough. #' #' - **Step 2**: Using the item-sets selected in step 1, generate new rules with *confidence* larger than a predefined minimum confidence threshold. The candidate item-sets that passed step 1 would include all frequent item-sets. For the highly-supported item-set *{A, C}*, we would compute the confidence measures for $\{A\}\rightarrow\{C\}$ as well as $\{C\}\rightarrow\{A\}$ and compare these against the minimum confidence threshold. The *surviving rules are the ones with confidence levels exceeding that minimum threshold*. #' #' ## A toy example #' #' Assume that a large supermarket tracks sales data by stock-keeping unit (SKU) for each item, i.e., each item, such as "butter" or "bread", is identified by an SKU number. The supermarket has a database of transactions where each transaction is a set of SKUs that were bought together. #' #' Suppose the database of transactions consist of following item-sets, each representing a purchasing order: #' #' require(knitr) item_table = as.data.frame(t(c("{1,2,3,4}","{1,2,4}","{1,2}","{2,3,4}","{2,3}","{3,4}","{2,4}"))) colnames(item_table) <- c("choice1","choice2","choice3","choice4","choice5","choice6","choice7") kable(item_table, caption = "Item table") #' #' #' We will use *Apriori* to determine the frequent item-sets of this database. To do so, we will say that an item-set is frequent if it appears in at least $3$ transactions of the database, i.e., the value $3$ is the support threshold. #' #' The first step of Apriori is to count up the number of occurrences, i.e., the support, of each member item separately. By scanning the database for the first time, we obtain the following size-1 support: #' #' item_table = as.data.frame(t(c(3,6,4,5))) colnames(item_table) <- c("items: {1}","{2}","{3}","{4}") rownames(item_table) <- "(N=7)*support" kable(item_table,caption = "Size 1 Support") #' #' #' All the singletons, item-sets of size 1, have a support of at least 3, so they are all frequent. The next step is to generate a list of all pairs of frequent items. #' #' For example, regarding the pair $\{1,2\}$: the first table of Example 2 shows items 1 and 2 appearing together in three of the item-sets; therefore, we say that the support of the item $\{1,2\}$ is $3$. #' #' item_table = as.data.frame(t(c(3,1,2,3,4,3))) colnames(item_table) <- c("{1,2}","{1,3}","{1,4}","{2,3}","{2,4}","{3,4}") rownames(item_table) <- "N*support" kable(item_table,caption = "Size 2 Support") #' #' #' The pairs $\{1,2\}$, $\{2,3\}$, $\{2,4\}$, and $\{3,4\}$ all meet or exceed the minimum support of $3$, so they are *frequent*. The pairs $\{1,3\}$ and #' $\{1,4\}$ are not and any larger set which contains $\{1,3\}$ or $\{1,4\}$ cannot be frequent. In this way, we can prune sets: we will now look for frequent triples in the database, but we can already exclude all the triples that contain one of these two pairs: #' #' item_table = as.data.frame(t(c(2))) colnames(item_table) <- c("{2,3,4}") rownames(item_table) <- "N*support" kable(item_table,caption = "Size 3 Support") #' #' #' In the example, there are no frequent triplets -- the support of the item-set $\{2,3,4\}$ is below the minimal threshold, and the other triplets were excluded because they were super sets of pairs that were already below the threshold. We have thus determined the frequent sets of items in the database, and illustrated how some items were not counted because some of their subsets were already known to be below the threshold. #' #' ## Case Study 1: Head and Neck Cancer Medications #' #' *Step 1 - data import* #' #' To demonstrate the *Apriori* algorithm in a real biomedical case-study, we will use a transactional healthcare data representing [a subset of the Head and Neck Cancer Medication data](https://umich.instructure.com/files/1678540/download?download_frd=1), which it is available in [our case-studies collection](https://umich.instructure.com/courses/38100/files/folder/data) as `10_medication_descriptions.csv`. It consists of inpatient medications of head and neck cancer patients. #' #' The data is in a wide format (see [Chapter 1](https://socr.umich.edu/DSPA2/DSPA2_notes/01_Introduction.html)) where each row represents a patient. During the study period, each patient had records for a maximum of 5 encounters. *NA* represents no medication administration records in this specific time point for the specific patient. This dataset contains a total of 528 patients. #' #' #subsetting patients with 2-5 encounters in_medication <- read.csv("https://umich.instructure.com/files/1678540/download?download_frd=1", header=T) library(plyr) a <- count(inmedwpid$PID) b <- a[a$freq %in% c(2:5), ] hn_med1 <- inmedwpid[inmedwpid$PID%in% b$x, ] table(hn_med1$PID) # To save the results locally # write.csv(hn_med1, "D:\\folder\\hn_med1.csv") #' #' #' #make a csv file that is read.transactions() friendly. # hn_med1<-read.csv("D:\\folder\\hn_med1.csv") hn_med1<-hn_med1[, -1] hn_med1<-hn_med1[, c(3, 9)] med.l<-count(hn_med1, c("PID", "MEDICATION_DESC")) m_count<-count(med.l$PID) enc<-c() for (i in 1:length(m_count$x)){ for (j in 1:m_count$freq[i]){ enc=c(enc, j) } } med.l<-cbind(med.l, enc) med.l$MEDICATION_DESC<-gsub("[[:punct:]]", " ", med.l$MEDICATION_DESC) med.l$MEDICATION_DESC<-casefold(med.l$MEDICATION_DESC, upper=F) med.w<-reshape(med.l, timevar="enc", idvar =c("PID", "freq"), direction="wide") med.w<-med.w[, -c(1, 2)] write.csv(med.w, " med.w.csv") #' #' #' *Step 2 - exploring and preparing the data* #' #' Different from our data imports in the previous chapters, transactional data need to be ingested in R using the `read.transactions()` function. This function will store data as a matrix with each row representing a basket (or transaction example) and each column representing items in the transaction. #' #' Let's load the dataset and delete the irrelevant *index* column. Using the `write.csv(R data, "path")` function, we can output our R data file into a local CSV file. To avoid generating another index column in the output CSV file, we can use the `row.names=F` option. #' #' med<-read.csv("https://umich.instructure.com/files/1678540/download?download_frd=1", stringsAsFactors = FALSE) med<-med[, -1] write.csv(med, "medication.csv", row.names=F) library(knitr) kable(med[1:5, ]) #' #' #' Now we can use `read.transactions()` in the `arules` package to read the CSV file we just outputted. #' #' # install.packages("arules") library(arules) med <- read.transactions("medication.csv", sep = ",", skip = 1, rm.duplicates=TRUE) summary(med) #' #' #' Here we use the option `rm.duplicates=T` because we may have similar medication administration records for two different patients. The option `skip=1` means we skip the heading line in the CSV file. Now we get transactional data with unique rows. #' #' The summary of a transactional data contains rich information. The first block of information tells us that we have 528 rows and 88 different medicines in this matrix. Using the density number we can calculate how many non *NA* medication records are in the data. In total, we have $528\times 88=46,464$ positions in the matrix. Thus, there are $46,464\times 0.0209=971$ medicines prescribed during the study period. #' #' The second block lists the most frequent medicines and their frequencies in the matrix. For example `fentanyl injection uh` appeared 211 times, which represents a proportion of $211/528=0.4$ of the (treatment) transactions. Since [fentanyl](https://en.wikipedia.org/wiki/Fentanyl) is frequently used to help prevent pain after surgery or other medical treatments, we can see that many of these patients may have undergone some significant medical procedures that require postoperative pain management. #' #' The last block shows statistics about the size of the transaction. 248 patients had only one medicine in the study period, while 12 of them had 5 medication records, one for each time point. On average, the patients are having 1.8 different medicines. #' #' ### Visualizing item support - item frequency plots #' #' The summary may still be fairly abstract, so we can visualize the information. #' #' inspect(med[1:5,]) #' #' #' The `inspect()` call shows the transactional dataset. We can see that the medication records of each patient are nicely formatted as item-sets. #' #' We can further analyze the frequent terms using `itemFrequency()`. This will show all item frequencies alphabetically ordered from the first five outputs. #' #' library(plotly) itemFrequency(med[, 1:5]) # itemFrequencyPlot(med, topN=20) # convert the data to a matrix mat <- as.matrix(med@data) # View matrix as image # image(mat) # capture the row/column names rowNames <- med@itemInfo$labels colNames <- paste0("S", c(1:dim(mat)[2])) rownames(mat) <- rowNames colnames(mat) <- colNames # convert matrix to DF for processing, order rows based on their average (across subjects/cases/columns), back to matrix for display df <- as.data.frame(1*mat) df$avg <- rowMeans(df) dfOrdered <- df[order(df$avg, decreasing = T), ] matOrdered <- as.matrix(dfOrdered) # track the ordered row names rowNames <- rownames(dfOrdered) colNames <- colnames(dfOrdered) # 2D top 20 terms bar plot # To order the meds based on "avg", instead of alphabetically (mind the "-" sign to order Large to small!) plot_ly(x = reorder(rowNames[c(1:20)], -dfOrdered[1:20, "avg"]), y=dfOrdered[1:20, "avg"], name="Top 20 Meds", type="bar") %>% layout(title='Frequency of Medications (Top 20) based on averaging across Cases', xaxis = list(title="Term"), yaxis = list(title="Relative Frequency")) # 3D surface plot plot_ly(x = colNames, y = rowNames, z = 2*matOrdered, type = "surface") %>% layout(title='Term (X) by Sample (Y) Frequency (Z) Plot', xaxis = list(title="Term"), yaxis = list(title="Sample ID")) %>% hide_colorbar() #' #' #' We can only display the top 20 medicines that are most frequently present in this dataset. Consistent with the prior `summary()` output, `fentanyl` is still the most frequent item. You can also try to plot the items with a threshold for support. Instead of `topN=20`, just use the option `support=0.1`, which will give you all the items that have a support greater or equal to $0.1$. #' #' Let's generalize this process and define a new function, #' `itemFrequencyPlotly(transactionObject, numTopItemps = 10)`, that can generate frequency plots for any transaction object. #' #' # define a generic plot_ly ItemFrequency plotting function itemFrequencyPlotly <- function(transactionObject, numTopItemps = 10, name="") { name <- ifelse(name=="", paste0('Frequency of Items (Top ', numTopItemps, ' ) based on averaging across Cases'), paste0('Frequency of Items (Top ', numTopItemps, ' ) based on averaging across Cases (Data=', name, ')')) mat <- as.matrix(transactionObject@data) # View matrix as image # image(mat) # capture the row/column names rowNames <- transactionObject@itemInfo$labels colNames <- paste0("S", c(1:dim(mat)[2])) rownames(mat) <- rowNames colnames(mat) <- colNames # convert matrix to DF for processing, order rows based on their average (across subjects/cases/columns), back to matrix for display df <- as.data.frame(1*mat) df$avg <- rowMeans(df) dfOrdered <- df[order(df$avg, decreasing = T), ] matOrdered <- as.matrix(dfOrdered) # track the ordered row names rowNames <- rownames(dfOrdered) colNames <- colnames(dfOrdered) # 2D top "20"numTopItemps" terms bar plot # To order the meds based on "avg", instead of alphabetically (mind the "-" sign to order Large to small!) plot_ly(x = reorder(rowNames[c(1:numTopItemps)], -dfOrdered[1:numTopItemps, "avg"]), y=dfOrdered[1:numTopItemps, "avg"], name=paste0("Top ", numTopItemps, " Meds"), type="bar") %>% layout(title=name, xaxis = list(title="Terms"), yaxis = list(title="Relative Frequency")) } #' #' #' ### Visualizing transaction data - plotting the sparse matrix #' #' The sparse matrix will show what medications were prescribed for each patient. Below we only show the top-20 medications for the first 15 cases. #' #' # image(med[1:5, ]) plot_ly(x=reorder(rowNames[c(1:20)], -dfOrdered[1:20, "avg"]), y=colNames[1:15], z=2*matOrdered[1:15, 1:20], type="heatmap") %>% layout(title='Heatmap - Top-20 Medications for the first 15 Cases') %>% hide_colorbar() #' #' #' This image has 15 rows (translations), as we only specified the first 15 patients, and 20 columns (top-20 meds). Although the picture may be a little hard to interpret, it gives a sense of what kind of medicine is prescribed for each patient in the study. #' #' Let's see an expanded graph including the top 30 medications for a random roster of 50 patients. #' #' subset_int <- sample(ncol(matOrdered), 50, replace = F) # image(med[subset_int, ]) plot_ly(x=rowNames[1:30], y=colNames[subset_int], z=2*t(matOrdered[1:30, subset_int]), type="heatmap") %>% layout(title='Heatmap - Bottom-30 Medications for a Random set of 50 Patients') %>% hide_colorbar() #' #' #' It shows us clearly that some medications are more popular than others. Now, let's fit the *Apriori* model. #' #' *Step 3 - training a model on the data* #' #' With the data in place, we can build the *association rules* using the `arules::apriori()` function. #' #' `myrules <- apriori(data=mydata, parameter=list(support=0.1, confidence=0.8, minlen=1))` #' #' - *data*: a sparse matrix created by `read.transacations()`. #' - *support*: minimum threshold for support. #' - *confidence*: minimum threshold for confidence. #' - *minlen*: minimum required rule items (in our case, medications). #' #' Setting up the threshold could be challenging; too high threshold may generate no rules or common-sense and non-informative rules, whereas too low values may overlook too many relevant rules present in the transactional data. We’ll start with using the default setting `support=0.1, confidence=0.8`. #' #' apriori(med) #' #' #' Not surprisingly, we obtained 0 rules, as the default setting is too high. In practice, we might need some time to fine-tune these thresholds, which may require certain familiarity with the underlying process or embedding human clinical knowledge into the AI decision support. #' #' In this case study, we set `support=0.01` and `confidence=0.25`. This requires rules that have appeared in at least 1% of the head and neck cancer patients in the study. Also, the rules have to have at least 25% accuracy. Moreover, `minlen=2` would be a very helpful option because it removes all rules that have fewer than two items. #' #' med_rule <- apriori(med, parameter=list(support=0.01, confidence=0.25, minlen=2)) med_rule #' #' #' The result suggests we have a new `med_rule` object consisting of 29 rules. #' #' *Step 4 - evaluating model performance* #' #' First, we can obtain the overall summary of this set of rules. #' #' summary(med_rule) #' #' #' We have 13 rules that contain two items; 12 rules containing 3 items, and the remaining 4 rules contain 4 items. #' #' The `lift` column shows how much more likely one medicine is to be prescribed to a patient given another medicine is prescribed. It is obtained by the following formula: #' $$lift(X\rightarrow Y)=\frac{confidence(X\rightarrow Y)}{support(Y)}.$$ #' Note that $lift(X\rightarrow Y)$ is the same as $lift(Y\rightarrow X)$. The range of $lift$ is $[0,\infty)$ and higher $lift$ is better. We don't need to worry about the support, since we already set a threshold that the support must exceed. #' #' Using the `arulesViz` package we can visualize the confidence and support scatter plots for all the rules. #' #' # install.packages("arulesViz") library(arulesViz) # plot(sort(med_rule)) sortedRule <- sort(med_rule) x1 <- sortedRule@quality$support y1 <- sortedRule@quality$confidence z1 <- sortedRule@quality$lift col1 <- sortedRule@quality$count ruleNames <- paste0("Rule", c(1:length(sortedRule@quality$support))) plot_ly(x = ~x1, y = ~y1, z = ~z1, color = ~z1, name=ruleNames) %>% add_markers() %>% layout(title=paste0("Arule Support-Confidence-Lift Plot (for all ", length(sortedRule@quality$support), " rules)"), scene = list(xaxis = list(title = 'Support'), yaxis = list(title = 'Confidence'), zaxis = list(title = 'Lift'))) %>% hide_colorbar() #' #' #' Again, we can utilize the `inspect()` function to see exactly what these rules are. #' #' inspect(med_rule[1:3]) #' #' #' Here, `lhs` and `rhs` refer to the "left hand side" and "right hand side" of the rule, respectively. `lhs` is the given condition and `rhs` is the predicted result. Using the first row as an example: If a head-and-neck patient has been prescribed acetaminophen (pain reliever and fever reducer), it is likely that the patient is also prescribed *cefazolin* (antibiotic prescribed for treatment of resistant bacterial infections); bacterial infections are associated with fevers and some cancers. #' #' *Step 5 - sorting the set of association rules* #' #' Sorting the resulting association rules corresponding to high **lift** values will help us identify the most useful rules. #' #' inspect(sort(med_rule, by="lift")[1:3]) #' #' #' These rules may need to be interpreted by clinicians and experts in the specific context of the study. For instance, the first row, *{fentanyl, heparin, hydrocodone acetaminophen}* implies *{cefazolin}*. *Fentanyl* and hydrocodone acetaminophen are both pain relievers that may be prescribed after surgery to relieve moderate to severe pain based on a narcotic opioid pain reliever (*hydrocodone*) and a non-opioid pain reliever (acetaminophen). *Heparin* is usually used before surgery to reduce the risk of blood clots. The third rule suggests that surgery patients that are treated with *heparin* may also need *cefazolin* to prevent post-surgical bacterial infection *Cefazolin* is an antibiotic used for the treatment of bacterial infections and also to prevent group *B streptococcal disease* around the time of delivery and before general surgery. #' #' *Step 6 - taking subsets of association rules* #' #' If we are more interested in investigating associations that are linked to a specific medicine, we can narrow the rules down by making subsets. Let us try investigating rules related to [fentanyl](https://en.wikipedia.org/wiki/Fentanyl), since it appears to be the most frequently prescribed medicine. [Fentanyl](https://en.wikipedia.org/wiki/Fentanyl) is used in the management of chronic cancer pain. #' #' fi_rules<-subset(med_rule, items %in% "fentanyl injection uh") inspect(fi_rules) #' #' #' Earlier, we saw that $\%in\%$ is a simple intuitive interface to `match()` and is used as a binary operator returning a logical vector indicating if there is a match (T) or not (F) for its left operand within the right table object. In total, there are 14 rules related to this item. Let's plot them. #' #' ## Graphical depiction of association rules #' #' # ?arulesViz::plot() # plot(sort(fi_rules, by="lift"), method="grouped", control=list(type="items"), # main = "Grouped Matrix for the 14 Fentanyl-associated Rules") # plot(sort(fi_rules, by="lift"), method = "graph", engine = "plotly") # plot(sort(fi_rules, by="lift"), method = "graph", engine = "igraph") # plot(sort(fi_rules, by="lift"), method = "graph", engine = "paracoord") plot(sort(fi_rules, by="lift"), method = "graph", engine = "htmlwidget", control=list(main = list(title="Grouped Matrix for the 14 Fentanyl-associated Rules"))) subrules2 <- sample(subset(fi_rules, lift > 2), 5) plot(sort(subrules2, by="lift"), method="grouped", control=list(type="items"), engine = "htmlwidget") plot(sort(fi_rules, by="lift"), method="grouped", k = 7, control=list(type="items"), engine = "htmlwidget") m <- rules2matrix(sort(fi_rules, by="lift")[1:10], measure = "lift") # m plot(fi_rules[1:10], method = "matrix", engine = "htmlwidget") ## Grouped matrix # create a matrix with LHSs grouped in k = 10 groups m <- rules2groupedMatrix(sort(fi_rules, by="lift")[1:10], k = 10) # m$m # number of rules per group # table(m$clustering_rules) # get rules for group 1 inspect(fi_rules[m$clustering_rules == 1]) # the corresponding plot plot(fi_rules, method = "grouped matrix", k = 7, engine = "htmlwidget") # plot(fi_rules, method = "grouped matrix", k = 7) #### For interactive runs try the RShinyApp: # ruleExplorer(fi_rules) #' #' #' plot(fi_rules, method="graph", measure = "support", engine="htmlwidget", # nodeCol=rainbow(14), shading = "lift", control = list(verbose = TRUE)) #' #' #' ## Saving association rules to a file or a data frame #' #' We can save these rules into a CSV file using `write()`. It is similar to the function `write.csv()` that we have mentioned in the beginning of this case study. #' #' write(med_rule, file = "medrule.csv", sep=",", row.names=F) #' #' #' Sometimes it is more convenient to convert the rules into a data frame. #' #' med_df <- as(med_rule, "data.frame") str(med_df) #' #' #' As we can see, the rules are converted into a factor vector. Finally, remember that matrices and data-frames can be converted to `arules` *transactions* format using `arulesObject <- as(input.df, "transactions")`. #' #' # Summary #' #' - The Apriori algorithm for association rule learning is only suitable for large transactional data. For some small datasets, it might not be very helpful. #' - It is useful for discovering associations, mostly in early phases of an exploratory study. #' - Some rules can be built due to chance and may need further verification. #' #' # Practice Problems #' #' ## Groceries #' #' In this practice problem, we will investigate the associations of frequently purchased groceries using the *grocery* dataset in the R base. Firstly, let's load the data. #' #' data("Groceries") summary(Groceries) #' #' #' We will try to find out the top 5 frequent grocery items and plot them. #' #' # itemFrequencyPlot(Groceries, topN=5) itemFrequencyPlotly(Groceries, 10, "groceries") #' #' #' Then, try to use `support = 0.006, confidence = 0.25, minlen = 2` to set up the grocery association rules. Sort the top 3 rules with highest lift. #' #' groceryrules <- apriori(Groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2)) groceryrules inspect(sort(groceryrules, by = "lift")[1:3]) #' #' #' The number of rules ($463$) appears excessive. We can try stringer parameters. In practice, it's more possible to observe underlying rules if you set a higher confidence. Here we set the $confidence=0.6$. #' #' groceryrules <- apriori(Groceries, parameter = list(support = 0.006, confidence = 0.6, minlen = 2)) groceryrules inspect(sort(groceryrules, by = "lift")[1:3]) #' #' #' We observe mainly rules between dairy products. It makes sense that customers pick up milk when they walk down the dairy products aisle. Experiment further with various parameter settings and try to interpret the results in the context of this grocery case-study. #' #' ## Titanic Passengers #' #' Next we'll use the [Titanic Passengers Dataset](https://umich.instructure.com/courses/38100/files/folder/data/16_TitanicPassengerSurvivalDataset). Let's start by loading the data first. #' #' dat <- read.csv("https://umich.instructure.com/files/9372716/download?download_frd=1") # Choose only the key data features dat <- dat[ , c("pclass", "survived", "sex", "age", "fare", "cabin")] # Factorize categorical features dat$pclass <- as.factor(dat$pclass) dat$survived <- factor(dat$survived, levels = c("0", "1")) dat$sex <- as.factor(dat$sex) # Convert the Cabin number to a character A-F, Z=missing cabin ID dat$cabin <- substring(dat$cabin,1,1) for (i in 1:length(dat$cabin)) if ((dat$cabin[i]=="")) dat$cabin[i] <- "Z" dat$cabin <- as.factor(dat$cabin) # Convert the Ticket Fair from numeric to categorical label f <- as.character(dat$fare) for (i in 1:length(dat$fare)) { if (is.na(dat$fare[i])) f[i] <- as.character("low") else if (dat$fare[i]<50) f[i] <- as.character("low") else if (50<=dat$fare[i] && dat$fare[i]<100) f[i] <- as.character("medium") else if (100<=dat$fare[i] && dat$fare[i]<200) f[i] <- as.character("high") else f[i] <- as.character("extreme") # if (200<=dat$fare[i]) } dat$fare <- as.factor(f) table(as.factor(dat$fare)) # Convert Age from numeric to categorical (Decade-of-life) label f <- as.character(dat$age) for (i in 1:length(dat$age)) { if (is.na(dat$age[i])) f[i] <- as.character("1") else { a = 1 + dat$age[i] %/% 10 # integer division by 10 (per decade of life) f[i] <- as.character(a) } } dat$age <- as.factor(f) table(as.factor(dat$age)) str(dat) #' #' #' Next, we can mine the association rules. #' #' library(arules) rules <- apriori(dat, parameter = list(minlen=5, supp=0.02, conf=0.8)) inspect(rules[1:20]) #' #' #' We can focus on the binary *survival* outcome (Survived=1/0). #' #' # examine the rules with containing "Survived=1" in the RHS rules <- apriori(dat, parameter = list(minlen=3, supp=0.02, conf=0.7), appearance = list(rhs="survived=1", default="lhs"), control = list(verbose=F)) rules.sorted <- sort(rules, by="lift") inspect(head(rules.sorted, 30)) #' #' #' Prune any redundant association rules. For instance, some rules may provide no extra knowledge. Rules that are highly related with prior rules may be redundant. Pruning reduces the number of rules from *27 to 18*. #' #' # search for redundant rules rules_lift <- sort(rules, by = 'lift') rules_pruned <- rules_lift[!is.redundant(rules_lift, measure="lift")] inspect(head(rules_pruned, 30)) #' #' #' The package `arulesViz` supplies rule-visualization routines using scatter, bubble, and parallel coordinates plots. The visualization of the frequent itemsets using parallel coordinates allows all items to be placed on the vertical axis with their position determined by their group and by the frequency of the item in descending order, vertical ranking based the support of the 1-itemset containing the item, [see this paper](https://link.springer.com/chapter/10.1007/978-3-030-04921-8_12). To render the maximal frequent itemsets, all subsets are implicitly drawn as sub-segments of one polyline. All itemsets are visualized as parallel coordinate polylines. The *support* can be mapped to line color or width. This shows effectively that itemsets that share common parts may have overlapping polylines. #' #' #' library(arulesViz) plot(rules) # scatter plot of association rules plot(rules_pruned, method="graph", control=list(verbose = FALSE), engine="htmlwidget") # display association rules # plot(rules_pruned, method="paracoord", control=list(reorder=TRUE, verbose = FALSE), engine="htmlwidget") # interactive association network graph plot(rules_pruned, method="graph", measure = "support", engine="htmlwidget", nodeCol=rainbow(14), shading = "lift", control = list(verbose = FALSE)) # <- rules2matrix(sort(rules_pruned, by="lift")[1:10], measure = "lift") # m # plot(rules_pruned, method = "matrix", engine = "htmlwidget") ## Grouped matrix # create a matrix with LHSs grouped in k = 10 groups # m <- rules2groupedMatrix(sort(rules_pruned, by="lift")[1:10], k = 10) # m$m # number of rules per group # table(m$clustering_rules) # get rules for group 1 inspect(rules_pruned[m$clustering_rules == 1]) # the corresponding plot plot(rules_pruned, method = "grouped matrix", k = 7, engine = "htmlwidget") #' #' #' Consider plotting *all rules* (not only the pruned list of association rules) and examine rules predicting passenger death (specify `"survived=0"`). #' #' # Try these TM/NLP techniques for other applications. #' #' - Apply TM/NLP to model the [US Bureau of Labor Statistics (BLS) jobs data](https://www.bls.gov/oes/current/oes_nat.htm). #' - MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). [DOI: 10.1038/sdata.2016.35](https://doi.org/10.1038/sdata.2016.35). #' - [Other data from the DSPA list of Case-Studies](https://umich.instructure.com/courses/38100/files/). #' - Other available free text or transactional data archive. #' #' #' #'

#' #' #' #' #' #' #' #' #' #' #' #' #'