
As we have seen in the previous chapters, traditional statistical analyses and classical data modeling are applied to relational data, where the observed information is represented by tables, vectors, arrays, tensors, or data frames containing binary, categorical, ordinal, or numerical values. Such representations provide tremendous advantages (e.g., quick reference and dereference of elements, search, discovery, and navigation), but they also limit the scope of applications. Relational data objects are quite effective for managing information that is based only on existing attributes. However, when data science inference needs to utilize attributes that are not included in the relational model, alternative non-relational representations are necessary. For instance, imagine that our data object includes qualitative data or free text (e.g., physician/nurse clinical notes, biospecimen sample annotations) containing information about medical condition, treatment, or outcome. It may be difficult, or sometimes even impossible, to include raw text in fully automated data analytic protocols using only the classical procedures and statistical models available for relational datasets. In this chapter, we will present ML/AI methods for qualitative and unstructured data.

1 Natural Language Processing (NLP) and Text Mining (TM)

Natural Language Processing (NLP) and Text Mining (TM) refer to automated machine-driven algorithms for semantic mapping, information extraction, and understanding of (natural) human language. Often, this involves deriving salient information from large amounts of unstructured text. To do so, we need to build semantic and syntactic mapping algorithms for effective processing of large volumes of text. Recall the related NLP/TM work in Chapter 5, where we demonstrated a text classifier based on the naïve Bayes algorithm.

In this section, we will present various text processing strategies in R. Specifically, we will work through simulated and real examples of text processing and compute term frequencies (TF), inverse document frequencies (IDF), and cosine-similarity transformations.

1.1 A simple NLP/TM example

Text mining or text analytics (TM/TA) examines large volumes of unstructured text (corpora) aiming to obtain interpretable information, discover context, identify linguistic motifs, or transform the text and derive quantitative data that can be further analyzed. Natural language processing (NLP) is one example of a TM analytical technique. Whereas TM’s goal is to discover relevant contextual information, which may be unknown, hidden, or obfuscated, NLP is focused on linguistic analysis that trains a machine to interpret voluminous textual content. To decipher the semantics and ambiguities in human-interpretable language, NLP employs automatic summarization, tagging, disambiguation, extraction of entities and relations, pattern recognition, and frequency analyses. The United States International Trade Commission cites an International Data Corporation (IDC) estimate of the expected total amount of information by 2025 - a staggering 175 Zettabytes (\(1\,ZB=10^{21}\approx 2^{70}\) bytes). The amount of data we obtain and record doubles every 12-14 months (Kryder’s law). A small fraction of this massive information (\(<0.0001\%\), or \(<1\,PB=10^{15}\) bytes) represents newly written or transcribed text, including code. However, it is impossible (cf. efficiency, time, resources) for humans to read, synthesize, curate, interpret, and react to all this information without the direct assistance of TM/NLP. The information content in text can be substantially higher than that of other information media. Remember that “a picture may be worth a thousand words”, yet “a word may also be worth a thousand pictures”. As an example, the simple sentence “The second edition of the data science and predictive analytics textbook includes 14 Chapters” takes 93 bytes to store as a character array. However, a color image showing this sentence as a scalable vector graphic or printed text could be much larger, e.g., 10 megabytes (MB), and a high-definition video of a speaker reading the same sentence could easily surpass 50MB. Text mining and natural language processing may be used to automatically analyze and interpret written, coded, or transcribed content to assess news, moods, emotions, and biosocial trends related to specific topics.
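
As a quick sanity check of the storage arithmetic above, we can count the characters of that sentence directly in R (each ASCII character occupies one byte).

# Storage needed for the example sentence as a plain character array
sentence <- "The second edition of the data science and predictive analytics textbook includes 14 Chapters"
nchar(sentence)               # 93 characters
length(charToRaw(sentence))   # 93 bytes (one byte per ASCII character)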

In general, text analysis protocols involve:

  • Construct a document-term matrix (DTM) from the input documents by vectorizing the text, e.g., mapping single words or n-grams into a vector space. In other words, we generate a vectorizer function mapping terms to indices (see the short sketch following this list).
  • Apply a model-based statistical analysis or a model-free machine learning technique for prediction, clustering, classification, similarity search, network/sentiment analysis, or forecasting using the DTM. This step also includes tuning and internally validating the performance of the method.
  • Apply and evaluate the technique on new data.
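
To make the first two steps concrete, here is a minimal sketch using the text2vec package (also used later in this chapter); the two-document docs_demo corpus is a hypothetical toy example, and any statistical or machine learning method could be substituted in the last step.

# Minimal sketch of text vectorization and DTM construction with text2vec
library(text2vec)
docs_demo <- c("data science is fun", "text mining extracts information from unstructured text")
it <- itoken(docs_demo, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)           # vocabulary: each unique term gets an index
vectorizer <- vocab_vectorizer(vocab)    # the vectorizer function mapping terms to indices
dtm_demo <- create_dtm(it, vectorizer)   # sparse documents-by-terms matrix
dim(dtm_demo)                            # 2 documents by (number of unique terms)
# step 3: pass dtm_demo to a model-based or model-free learner, e.g., glmnet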

Define and load the unstructured-text documents

Let’s create some documents we can use to demonstrate the use of the tm package to do text mining. The 5 documents below represent portions of the syllabi of 5 recent courses taught by Ivo Dinov:

We import the syllabi as several separate documents. We will embed the course syllabi directly into R; alternatively, the rvest::read_html() method facilitates loading the five course syllabi directly from the corresponding course websites.

doc1 <- "HS650: The Data Science and Predictive Analytics (DSPA) course (offered as a massive open online course, MOOC, as well as a traditional University of Michigan class) aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. It explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. Concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. Some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary. This open graduate course will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course."
doc2 <- "Bioinformatics 501: The Mathematical Foundations for Bioinformatics course covers some of the fundamental mathematical techniques commonly used in bioinformatics and  biomedical research. These include: 1) principles of multi-variable calculus, and complex numbers/functions, 2) foundations of linear algebra, such as linear spaces, eigen-values and vectors, singular value decomposition, spectral graph theory and Markov chains, 3) differential equations and their usage in biomedical system, which includes topic such as existence and uniqueness of solutions, two dimensional linear systems, bifurcations in one and two dimensional systems and cellular dynamics, and 4) optimization methods, such as free and constrained optimization, Lagrange multipliers, data denoising using optimization and heuristic methods. Demonstrations using MATLAB, R, and Python are included throughout the course."
doc3 <- "HS 853: This course covers a number of modern analytical methods for advanced healthcare research. Specific focus will be on reviewing and using innovative modeling, computational, analytic and visualization techniques to address concrete driving biomedical and healthcare applications. The course will cover the 5 dimensions of Big-Data (volume, complexity, multiple scales, multiple sources, and incompleteness). HS853 is a 4 credit hour course (3 lectures + 1 lab/discussion). Students will learn how to conduct research, employ and report on recent advanced health sciences analytical methods; read, comprehend and present recent reports of innovative scientific methods; apply a broad range of health problems; and experiment with real Big-Data. Topics Covered include: Foundations of R, Scientific Visualization, Review of Multivariate and Mixed Linear Models, Causality/Causal Inference and Structural Equation Models, Generalized Estimating Equations, PCOR/CER methods Heterogeneity of Treatment Effects, Big-Data, Big-Science, Internal statistical cross-validation, Missing data, Genotype-Environment-Phenotype, associations, Variable selection (regularized regression and controlled/knockoff filtering), medical imaging, Databases/registries, Meta-analyses, classification methods, Longitudinal data and time-series analysis, Geographic Information Systems (GIS), Psychometrics and Rasch measurement model analysis, MCMC sampling for Bayesian inference, and Network Analysis"
doc4 <- "HS 851: This course introduces students to applied inference methods in studies involving multiple variables. Specific methods that will be discussed include linear regression, analysis of variance, and different regression models. This course will emphasize the scientific formulation, analytical modeling, computational tools and applied statistical inference in diverse health-sciences problems. Data interrogation, modeling approaches, rigorous interpretation and inference will be emphasized throughout. HS851 is a 4 credit hour course (3 lectures + 1 lab/discussion).  Students will learn how to: Understand the commonly used statistical methods of published scientific papers , Conduct statistical calculations/analyses on available data , Use software tools to analyze specific case-studies data , Communicate advanced statistical concepts/techniques , Determine, explain and interpret assumptions and limitations. Topics Covered  include   Epidemiology , Correlation/SLR , and slope inference, 1-2 samples , ROC Curve , ANOVA , Non-parametric inference , Cronbach's $\alpha$, Measurement Reliability/Validity , Survival Analysis , Decision theory , CLT/LLNs - limiting results and misconceptions , Association Tests , Bayesian Inference , PCA/ICA/Factor Analysis , Point/Interval Estimation (CI) - MoM, MLE , Instrument performance Evaluation , Study/Research Critiques , Common mistakes and misconceptions in using probability and statistics, identifying potential assumption violations, and avoiding them."
doc5 <- "HS550: This course provides students with an introduction to probability reasoning and statistical inference. Students will learn theoretical concepts and apply analytic skills for collecting, managing, modeling, processing, interpreting and visualizing (mostly univariate) data. Students will learn the basic probability modeling and statistical analysis methods and acquire knowledge to read recently published health research publications. HS550 is a 4 credit hour course (3 lectures + 1 lab/discussion).  Students will learn how to:  Apply data management strategies to sample data files , Carry out statistical tests to answer common healthcare research questions using appropriate methods and software tools , Understand the core analytical data modeling techniques and their appropriate use  Examples of Topics Covered ,  EDA/Charts , Ubiquitous variation , Parametric inference , Probability Theory , Odds Ratio/Relative Risk , Distributions , Exploratory data analysis , Resampling/Simulation , Design of Experiments , Intro to Epidemiology , Estimation , Hypothesis testing , Experiments vs. Observational studies , Data management (tables, streams, cloud, warehouses, DBs, arrays, binary, ASCII, handling, mechanics) , Power, sample-size, effect-size, sensitivity, specificity , Bias/Precision , Association vs. Causality , Rate-of-change , Clinical vs. Stat significance , Statistical Independence Bayesian Rule."

Create a new VCorpus object

The VCorpus object includes all the text and some meta-data (e.g., indexing) about the text.

docs <- c(doc1, doc2, doc3, doc4, doc5)
class(docs)
## [1] "character"

Then let’s make a VCorpus object using the tm package. To complete this task, we need to know the source type. Here docs is a vector of class “character”, so we should use VectorSource(). If the text is stored in a data frame, we should use DataframeSource() instead. VCorpus() creates a volatile corpus, which is the data type used by the tm package for text mining.
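
For instance, here is a minimal sketch of the data-frame alternative; the df_demo object is hypothetical, and tm’s DataframeSource() expects the columns to be named doc_id and text.

# Hypothetical example of building a corpus from a data frame
df_demo <- data.frame(doc_id = c("d1", "d2"),
                      text   = c("first demo document", "second demo document"),
                      stringsAsFactors = FALSE)
df_corpus <- VCorpus(DataframeSource(df_demo))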

library(tm)
## Loading required package: NLP
doc_corpus <- VCorpus(VectorSource(docs))
doc_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 5
doc_corpus[[1]]$content
## [1] "HS650: The Data Science and Predictive Analytics (DSPA) course (offered as a massive open online course, MOOC, as well as a traditional University of Michigan class) aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. It explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. Concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. Some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary. This open graduate course will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (Big Data). The focus will be to expose students to common challenges related to handling Big Data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. Biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. Students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. The collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course."

This is a list that contains the information for the 5 documents we have created. Now we can apply the tm_map() function to this object to edit the text. Analogous to human semantic language understanding, the goal here is to algorithmically process the text and output (structured) quantitative information as a signature tensor representing the original (unstructured) text.

Transform text to lowercase

The text itself contains upper case letters as well as lower case letters. The first thing to do is to convert everything to lowercase.

doc_corpus <- tm_map(doc_corpus, tolower)
doc_corpus[[1]]
## [1] "hs650: the data science and predictive analytics (dspa) course (offered as a massive open online course, mooc, as well as a traditional university of michigan class) aims to build computational abilities, inferential thinking, and practical skills for tackling core data scientific challenges. it explores foundational concepts in data management, processing, statistical computing, and dynamic visualization using modern programming tools and agile web-services. concepts, ideas, and protocols are illustrated through examples of real observational, simulated and research-derived datasets. some prior quantitative experience in programming, calculus, statistics, mathematical models, or linear algebra will be necessary. this open graduate course will provide a general overview of the principles, concepts, techniques, tools and services for managing, harmonizing, aggregating, preprocessing, modeling, analyzing and interpreting large, multi-source, incomplete, incongruent, and heterogeneous data (big data). the focus will be to expose students to common challenges related to handling big data and present the enormous opportunities and power associated with our ability to interrogate such complex datasets, extract useful information, derive knowledge, and provide actionable forecasting. biomedical, healthcare, and social datasets will provide context for addressing specific driving challenges. students will learn about modern data analytic techniques and develop skills for importing and exporting, cleaning and fusing, modeling and visualizing, analyzing and synthesizing complex datasets. the collaborative design, implementation, sharing and community validation of high-throughput analytic workflows will be emphasized throughout the course."

Text pre-processing

Removing Stopwords: These documents contain many “stopwords”, i.e., common function words that carry grammatical structure but little analytic value. We can list and remove them with the following commands.

stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
doc_corpus <- tm_map(doc_corpus, removeWords, stopwords("english"))
doc_corpus[[1]]
## [1] "hs650:  data science  predictive analytics (dspa) course (offered   massive open online course, mooc,  well   traditional university  michigan class) aims  build computational abilities, inferential thinking,  practical skills  tackling core data scientific challenges.  explores foundational concepts  data management, processing, statistical computing,  dynamic visualization using modern programming tools  agile web-services. concepts, ideas,  protocols  illustrated  examples  real observational, simulated  research-derived datasets.  prior quantitative experience  programming, calculus, statistics, mathematical models,  linear algebra will  necessary.  open graduate course will provide  general overview   principles, concepts, techniques, tools  services  managing, harmonizing, aggregating, preprocessing, modeling, analyzing  interpreting large, multi-source, incomplete, incongruent,  heterogeneous data (big data).  focus will   expose students  common challenges related  handling big data  present  enormous opportunities  power associated   ability  interrogate  complex datasets, extract useful information, derive knowledge,  provide actionable forecasting. biomedical, healthcare,  social datasets will provide context  addressing specific driving challenges. students will learn  modern data analytic techniques  develop skills  importing  exporting, cleaning  fusing, modeling  visualizing, analyzing  synthesizing complex datasets.  collaborative design, implementation, sharing  community validation  high-throughput analytic workflows will  emphasized throughout  course."

We removed all the stopwords included in the stopwords("english") list. You can always construct your own stop-word list and apply it with doc_corpus <- tm_map(doc_corpus, removeWords, your_own_words_list).
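
For example, here is a small sketch of a domain-specific stop-word list; the terms below are arbitrary illustrations, and the tm_map() call is left commented out so the results shown in the rest of this section remain unchanged.

# Hypothetical custom stop-words (not applied here, to keep the downstream output unchanged)
my_stopwords <- c("course", "students", "will")
# doc_corpus <- tm_map(doc_corpus, removeWords, my_stopwords)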

From the output of doc1, we notice that the removal of stopwords may create extra blank spaces. Thus, the next step would be to remove them.

doc_corpus <- tm_map(doc_corpus, stripWhitespace)
doc_corpus[[1]]
## [1] "hs650: data science predictive analytics (dspa) course (offered massive open online course, mooc, well traditional university michigan class) aims build computational abilities, inferential thinking, practical skills tackling core data scientific challenges. explores foundational concepts data management, processing, statistical computing, dynamic visualization using modern programming tools agile web-services. concepts, ideas, protocols illustrated examples real observational, simulated research-derived datasets. prior quantitative experience programming, calculus, statistics, mathematical models, linear algebra will necessary. open graduate course will provide general overview principles, concepts, techniques, tools services managing, harmonizing, aggregating, preprocessing, modeling, analyzing interpreting large, multi-source, incomplete, incongruent, heterogeneous data (big data). focus will expose students common challenges related handling big data present enormous opportunities power associated ability interrogate complex datasets, extract useful information, derive knowledge, provide actionable forecasting. biomedical, healthcare, social datasets will provide context addressing specific driving challenges. students will learn modern data analytic techniques develop skills importing exporting, cleaning fusing, modeling visualizing, analyzing synthesizing complex datasets. collaborative design, implementation, sharing community validation high-throughput analytic workflows will emphasized throughout course."

Remove punctuation: Now we notice the irrelevant punctuation in the text, which can be removed by using a combination of tm_map() and removePunctuation() functions.

doc_corpus <- tm_map(doc_corpus, removePunctuation)
doc_corpus[[2]]
## [1] "bioinformatics 501 mathematical foundations bioinformatics course covers fundamental mathematical techniques commonly used bioinformatics biomedical research include 1 principles multivariable calculus complex numbersfunctions 2 foundations linear algebra linear spaces eigenvalues vectors singular value decomposition spectral graph theory markov chains 3 differential equations usage biomedical system includes topic existence uniqueness solutions two dimensional linear systems bifurcations one two dimensional systems cellular dynamics 4 optimization methods free constrained optimization lagrange multipliers data denoising using optimization heuristic methods demonstrations using matlab r python included throughout course"

The above tm_map commands changed the structure of our doc_corpus object. We can apply the PlainTextDocument function to convert it back to the original format.

doc_corpus <- tm_map(doc_corpus, PlainTextDocument)

Stemming: Removal of plurals and action suffixes. Let’s inspect the first three documents. We notice that there are some words ending with “ing”, “es”, “s”.

doc_corpus[[1]]$content
## [1] "hs650 data science predictive analytics dspa course offered massive open online course mooc well traditional university michigan class aims build computational abilities inferential thinking practical skills tackling core data scientific challenges explores foundational concepts data management processing statistical computing dynamic visualization using modern programming tools agile webservices concepts ideas protocols illustrated examples real observational simulated researchderived datasets prior quantitative experience programming calculus statistics mathematical models linear algebra will necessary open graduate course will provide general overview principles concepts techniques tools services managing harmonizing aggregating preprocessing modeling analyzing interpreting large multisource incomplete incongruent heterogeneous data big data focus will expose students common challenges related handling big data present enormous opportunities power associated ability interrogate complex datasets extract useful information derive knowledge provide actionable forecasting biomedical healthcare social datasets will provide context addressing specific driving challenges students will learn modern data analytic techniques develop skills importing exporting cleaning fusing modeling visualizing analyzing synthesizing complex datasets collaborative design implementation sharing community validation highthroughput analytic workflows will emphasized throughout course"
doc_corpus[[2]]$content
## [1] "bioinformatics 501 mathematical foundations bioinformatics course covers fundamental mathematical techniques commonly used bioinformatics biomedical research include 1 principles multivariable calculus complex numbersfunctions 2 foundations linear algebra linear spaces eigenvalues vectors singular value decomposition spectral graph theory markov chains 3 differential equations usage biomedical system includes topic existence uniqueness solutions two dimensional linear systems bifurcations one two dimensional systems cellular dynamics 4 optimization methods free constrained optimization lagrange multipliers data denoising using optimization heuristic methods demonstrations using matlab r python included throughout course"
doc_corpus[[3]]$content
## [1] "hs 853 course covers number modern analytical methods advanced healthcare research specific focus will reviewing using innovative modeling computational analytic visualization techniques address concrete driving biomedical healthcare applications course will cover 5 dimensions bigdata volume complexity multiple scales multiple sources incompleteness hs853 4 credit hour course 3 lectures  1 labdiscussion students will learn conduct research employ report recent advanced health sciences analytical methods read comprehend present recent reports innovative scientific methods apply broad range health problems experiment real bigdata topics covered include foundations r scientific visualization review multivariate mixed linear models causalitycausal inference structural equation models generalized estimating equations pcorcer methods heterogeneity treatment effects bigdata bigscience internal statistical crossvalidation missing data genotypeenvironmentphenotype associations variable selection regularized regression controlledknockoff filtering medical imaging databasesregistries metaanalyses classification methods longitudinal data timeseries analysis geographic information systems gis psychometrics rasch measurement model analysis mcmc sampling bayesian inference network analysis"

If we have multiple terms that differ only in their endings (e.g., past, present, or present-perfect-continuous tense), the algorithm will treat them as distinct terms because it does not understand language semantics the way a human would. To make things easier for the computer, we can delete these endings by “stemming” the documents. Remember to load the package SnowballC before using the function stemDocument(). The earliest stemmer was written by Julie Beth Lovins in 1968 and had great influence on all subsequent work. Currently, one of the most popular stemming approaches, proposed by Martin Porter, is the one used in stemDocument().

# install.packages("SnowballC")
library(SnowballC)
doc_corpus <- tm_map(doc_corpus, stemDocument)
doc_corpus[[1]]$content
## [1] "hs650 data scienc predict analyt dspa cours offer massiv open onlin cours mooc well tradit univers michigan class aim build comput abil inferenti think practic skill tackl core data scientif challeng explor foundat concept data manag process statist comput dynam visual use modern program tool agil webservic concept idea protocol illustr exampl real observ simul researchderiv dataset prior quantit experi program calculus statist mathemat model linear algebra will necessari open graduat cours will provid general overview principl concept techniqu tool servic manag harmon aggreg preprocess model analyz interpret larg multisourc incomplet incongru heterogen data big data focus will expos student common challeng relat handl big data present enorm opportun power associ abil interrog complex dataset extract use inform deriv knowledg provid action forecast biomed healthcar social dataset will provid context address specif drive challeng student will learn modern data analyt techniqu develop skill import export clean fuse model visual analyz synthes complex dataset collabor design implement share communiti valid highthroughput analyt workflow will emphas throughout cours"

This stemming step has to be performed after the PlainTextDocument transformation because stemDocument() can only be applied to plain text documents.

Bags of words: It is very useful to be able to tokenize text documents into n-grams, i.e., sequences of adjacent words; e.g., a 2-gram represents a two-word phrase that appears together in order. Tokenizing into n-grams lets us retain some information about local word ordering. The bag of words model is a common way to represent documents in matrix form based on their term frequencies (TFs). We can construct an \(n\times t\) document-term matrix (DTM), where \(n\) is the number of documents and \(t\) is the number of unique terms. Each column in the DTM represents a unique term, and the \((i,j)^{th}\) cell records how many times term \(j\) appears in document \(i\).

The basic bag of words model is invariant to ordering of the words within a document. Once we compute the DTM, we can use machine learning techniques to interpret the derived signature information contained in the resulting matrices.
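
For instance, a bigram (2-gram) version of the matrix can be generated with a custom tokenizer, following an idiom commonly suggested for the tm package; this sketch is illustrative and is not used in the subsequent analyses.

# Sketch: build a bigram (2-gram) term-document matrix using a custom tokenizer
# (illustration only; not used in the analyses below)
BigramTokenizer <- function(x)
  unlist(lapply(NLP::ngrams(NLP::words(x), 2), paste, collapse = " "), use.names = FALSE)
doc_tdm_bigram <- TermDocumentMatrix(doc_corpus, control = list(tokenize = BigramTokenizer))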

Document-term matrix

Now the doc_corpus object is quite clean. Next, we can construct a matrix to explore all the terms in the 5 documents, whose cells record how many times a given term appears in a specific document. Note that TermDocumentMatrix() arranges terms as rows and documents as columns, whereas DocumentTermMatrix() produces the transposed (documents by terms) layout.

doc_dtm <- TermDocumentMatrix(doc_corpus)
doc_dtm
## <<TermDocumentMatrix (terms: 336, documents: 5)>>
## Non-/sparse entries: 489/1191
## Sparsity           : 71%
## Maximal term length: 27
## Weighting          : term frequency (tf)

The summary of the term-document matrix is informative. We have 336 different terms across the 5 documents. There are 489 non-zero and 1191 zero (sparse) entries. Thus, the sparsity is \(\frac{1191}{(489+1191)}\approx 71\%\), which measures the term sparsity across documents. A high sparsity means terms are not repeated often among different documents.
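
We can double-check the reported sparsity directly from the sparse triplet representation used internally by the tm package (slots i, j, v).

# Verify the sparsity reported above from the sparse (triplet) representation
nonzero <- length(doc_dtm$v)                 # 489 non-zero entries
total   <- nTerms(doc_dtm) * nDocs(doc_dtm)  # 336 terms x 5 documents = 1680 cells
round(100 * (1 - nonzero/total))             # ~71% sparsity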

Recall that we applied the PlainTextDocument function to the doc_corpus object, which removes the document metadata. To relabel the documents in the term-document matrix, we can use the following commands:

doc_dtm$dimnames$Docs <- as.character(1:5)
inspect(doc_dtm)
## <<TermDocumentMatrix (terms: 336, documents: 5)>>
## Non-/sparse entries: 489/1191
## Sparsity           : 71%
## Maximal term length: 27
## Weighting          : term frequency (tf)
## Sample             :
##          Docs
## Terms     1 2 3 4 5
##   analyt  3 0 3 1 2
##   cours   4 2 3 3 2
##   data    7 1 2 3 6
##   infer   0 0 2 6 2
##   method  0 2 5 3 2
##   model   3 0 4 3 3
##   statist 2 0 1 5 4
##   student 2 0 1 2 4
##   use     2 3 1 3 2
##   will    6 0 3 4 3

We might want to find and report the frequent terms using this document-term matrix.

findFreqTerms(doc_dtm, lowfreq = 2)
##   [1] "abil"        "address"     "advanc"      "algebra"     "analysi"    
##   [6] "analyt"      "analyz"      "appli"       "appropri"    "associ"     
##  [11] "assumpt"     "bayesian"    "big"         "bigdata"     "bioinformat"
##  [16] "biomed"      "calculus"    "challeng"    "common"      "complex"    
##  [21] "comput"      "concept"     "conduct"     "core"        "cours"      
##  [26] "cover"       "credit"      "data"        "dataset"     "design"     
##  [31] "dimension"   "drive"       "dynam"       "emphas"      "epidemiolog"
##  [36] "equat"       "estim"       "exampl"      "experi"      "focus"      
##  [41] "foundat"     "general"     "handl"       "health"      "healthcar"  
##  [46] "heterogen"   "hour"        "hs550"       "includ"      "incomplet"  
##  [51] "infer"       "inform"      "innov"       "interpret"   "interrog"   
##  [56] "knowledg"    "labdiscuss"  "learn"       "lectur"      "limit"      
##  [61] "linear"      "manag"       "mathemat"    "measur"      "method"     
##  [66] "misconcept"  "model"       "modern"      "multipl"     "multivari"  
##  [71] "observ"      "open"        "optim"       "power"       "present"    
##  [76] "principl"    "probabl"     "problem"     "process"     "program"    
##  [81] "provid"      "publish"     "read"        "real"        "recent"     
##  [86] "regress"     "report"      "research"    "review"      "sampl"      
##  [91] "scienc"      "scientif"    "skill"       "softwar"     "specif"     
##  [96] "statist"     "student"     "studi"       "system"      "techniqu"   
## [101] "test"        "theori"      "throughout"  "tool"        "topic"      
## [106] "two"         "understand"  "use"         "variabl"     "visual"     
## [111] "will"

This gives us the terms that appear at least twice across the corpus (total term frequency \(\ge 2\)). High-frequency terms like comput, statist, model, healthcar, and learn make perfect sense to be included in this shortlist, as these courses cover modeling, statistical, and computational methods with applications to the health sciences.

The tm package also provides the functionality to compute correlations between terms. Here is a mechanism to determine the words that are highly correlated with the term statist (\(\rho(statist, \cdot)\ge 0.8\)).

findAssocs(doc_dtm, "statist", corlimit = 0.8)
## $statist
## epidemiolog   interpret     publish     softwar       studi  understand 
##        0.92        0.92        0.92        0.92        0.92        0.92 
##      specif       appli 
##        0.85        0.84

1.2 Case-Study: Job ranking

Next, we will mine an interesting dataset containing both quantitative and qualitative (free-text) elements, the 2011 USA Jobs Ranking Dataset from the SOCR data archive. More recent job-description data are provided by the US Bureau of Labor Statistics (BLS) and can be used to practice similar mixed-data modeling approaches.

library(rvest)
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body" role="main">\n\t\t\t<a id="top"></a>\n\ ...
job <- html_table(html_nodes(wiki_url, "table")[[1]])
head(job)
## # A tibble: 6 × 10
##   Index Job_Title           Overall_Score `Average_Income(USD)` Work_Environment
##   <int> <chr>                       <int>                 <int>            <dbl>
## 1     1 Software_Engineer              60                 87140            150  
## 2     2 Mathematician                  73                 94178             89.7
## 3     3 Actuary                       123                 87204            179. 
## 4     4 Statistician                  129                 73208             89.5
## 5     5 Computer_Systems_A…           147                 77153             90.8
## 6     6 Meteorologist                 175                 85210            180. 
## # ℹ 5 more variables: Stress_Level <dbl>, Stress_Category <int>,
## #   Physical_Demand <dbl>, Hiring_Potential <dbl>, Description <chr>

Note that low and high job indices represent more and less desirable jobs, respectively. Thus, in 2011, the most desirable job among the top 200 common jobs was Software Engineer. The aim of our study is to explore the differences between the top 30 desirable jobs and the bottom 100 jobs based on their textual job descriptions (JDs).

We will go through the same procedure as we did earlier in the course syllabi example. The documents we are using represent the Description column (a text vector of JDs) in the dataset.

Step 1: make a VCorpus object

jobs <- as.list(job$Description)
jobCorpus <- VCorpus(VectorSource(jobs))

Step 2: clean the VCorpus object

jobCorpus <- tm_map(jobCorpus, tolower)
for(j in seq(jobCorpus)){
  jobCorpus[[j]] <- gsub("_", " ", jobCorpus[[j]])
}

Here we used a loop to substitute underscores (“_”) with blank spaces. This is necessary because removePunctuation() would simply delete the underscores and fuse the adjacent words together. Global pattern matching with gsub() finds the underscores and replaces them with spaces.
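
Alternatively, the same substitution can be expressed as a tm transformation via content_transformer(), avoiding the explicit loop; this sketch is left commented out since the loop above already performed the replacement.

# Equivalent sketch using tm's content_transformer() instead of the explicit loop
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# jobCorpus <- tm_map(jobCorpus, toSpace, "_")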

jobCorpus <- tm_map(jobCorpus, removeWords, stopwords("english"))
jobCorpus <- tm_map(jobCorpus, removePunctuation)
jobCorpus <- tm_map(jobCorpus, stripWhitespace)
jobCorpus <- tm_map(jobCorpus, PlainTextDocument)
jobCorpus <- tm_map(jobCorpus, stemDocument)

Step 3: build document-term matrix

Term-document matrix (TDM) and document-term matrix (DTM) objects (tm::TermDocumentMatrix, tm::DocumentTermMatrix) store a sparse term-document (or document-term) matrix along with its weighting attributes. First, let’s make sure that we have a clean VCorpus object.

jobCorpus[[1]]$content
## [1] "research design develop maintain softwar system along hardwar develop medic scientif industri purpos"

Then we can start to build the DTM and reassign labels to the Docs.

dtm <- DocumentTermMatrix(jobCorpus)
dtm
## <<DocumentTermMatrix (documents: 200, terms: 846)>>
## Non-/sparse entries: 1818/167382
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)
dtm$dimnames$Docs<-as.character(1:200)
inspect(dtm[1:10, 1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 2/98
## Sparsity           : 98%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs 16wheel abnorm access accid accord account accur achiev act activ
##   1        0      0      0     0      0       0     0      0   0     0
##   10       0      0      0     0      0       0     0      0   0     0
##   2        0      0      0     0      0       0     0      0   0     0
##   3        0      0      0     1      0       0     0      0   0     0
##   4        0      0      0     0      0       0     0      0   0     0
##   5        0      0      0     0      0       0     0      0   0     0
##   6        0      0      0     0      0       0     0      0   0     0
##   7        0      0      0     0      0       0     0      0   0     0
##   8        0      0      0     0      1       0     0      0   0     0
##   9        0      0      0     0      0       0     0      0   0     0
# tdm <- TermDocumentMatrix(jobCorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))
# dtm <- DocumentTermMatrix(jobCorpus, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE))

Let’s subset the dtm into top 30 jobs and bottom 100 jobs.

dtm_top30  <- dtm[1:30, ]
dtm_bot100 <- dtm[101:200, ]
dtm_top30
## <<DocumentTermMatrix (documents: 30, terms: 846)>>
## Non-/sparse entries: 293/25087
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)
dtm_bot100
## <<DocumentTermMatrix (documents: 100, terms: 846)>>
## Non-/sparse entries: 870/83730
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)

In this case, since the sparsity is very high, we can try to remove some of the words that rarely appear in the job descriptions.

dtms_top30  <- removeSparseTerms(dtm_top30, 0.90)
dtms_top30
## <<DocumentTermMatrix (documents: 30, terms: 19)>>
## Non-/sparse entries: 70/500
## Sparsity           : 88%
## Maximal term length: 10
## Weighting          : term frequency (tf)
dtms_bot100 <- removeSparseTerms(dtm_bot100, 0.94)
dtms_bot100
## <<DocumentTermMatrix (documents: 100, terms: 14)>>
## Non-/sparse entries: 122/1278
## Sparsity           : 91%
## Maximal term length: 10
## Weighting          : term frequency (tf)

For the top 30 jobs, instead of the initial 846 terms, we retain only the 19 terms appearing in at least 10% of the JDs.

Similarly, for the bottom 100 jobs, instead of the initial 846 terms, we retain only the 14 terms appearing in at least 6% of the JDs.
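
The connection between the sparse argument of removeSparseTerms() and these percentages can be checked directly; roughly, a retained term must appear in at least a fraction (1 - sparse) of the documents.

# Approximate minimum document frequency implied by the removeSparseTerms() thresholds
(1 - 0.90) * nDocs(dtm_top30)    # about 3 of the 30 top-job JDs (10%)
(1 - 0.94) * nDocs(dtm_bot100)   # about 6 of the 100 bottom-job JDs (6%)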

Similar to what we did in Chapter 5, visualization of term word-clouds can be accomplished in R by combining functionalities from the tm and wordcloud packages. First, we count the term frequencies in the two document-term matrices.

library(plotly)
# Let's calculate the cumulative frequencies of words across documents and sort:
freq1<-sort(colSums(as.matrix(dtms_top30)), decreasing=T)
freq1
##    develop     assist      natur      studi     analyz    concern   individu 
##          6          5          5          5          4          4          4 
##   industri     physic       plan       busi     inform   institut    problem 
##          4          4          4          3          3          3          3 
##   research   scientif     theori  treatment understand 
##          3          3          3          3          3
freq2<-sort(colSums(as.matrix(dtms_bot100)), decreasing=T)
freq2
##       oper     repair    perform     instal      build     prepar       busi 
##         17         15         11          9          8          8          7 
##   commerci  construct   industri     machin manufactur    product  transport 
##          7          7          7          7          7          7          7
# Plot frequent words (for bottom 100 jobs)
wf2=data.frame(term=names(freq2), occurrences=freq2)
# library(ggplot2)
# p <- ggplot(subset(wf2, freq2>2), aes(term, occurrences))
# p <- p + geom_bar(stat="identity")
# p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
# p

df.freq2 <- subset(wf2, freq2>2)
plot_ly(data=df.freq2, x=~term, y=~occurrences, type="bar") %>%
  layout(title="Bottom 100 Job Descriptions (Frequent Terms)",
         xaxis = list(categoryorder = "total descending"))
# Plot frequent words (for top 30 jobs)
wf1=data.frame(term=names(freq1), occurrences=freq1)
# p2 <- ggplot(subset(wf1, freq1>2), aes(term, occurrences, fill = freq1))
# p2 <- p2 + geom_bar(stat="identity")
# p2 <- p2 + theme(axis.text.x=element_text(angle=45, hjust=1))
# p2

df.freq1 <- subset(wf1, freq1>2)
plot_ly(data=df.freq1, x=~term, y=~occurrences, type="bar") %>%
  layout(title="Top 30 Job Descriptions (Frequent Terms)",
         xaxis = list(categoryorder = "total descending"))
# what is common (frequently occurring words)
# in the description of the top 30 and the bottom 100 jobs?
intersect(df.freq1$term, df.freq2$term)
## [1] "industri" "busi"

Then we apply the wordcloud function to the freq dataset.

library(wordcloud)
set.seed(123)
# wordcloud(names(freq1), freq1)
wordcloud(names(freq1), freq1, min.freq=2, colors=brewer.pal(6,"Spectral"))

# Color code the frequencies using an appropriate color map:

# Sequential palettes names include: 
# Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd

# Diverging palettes include  
# BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral
wordcloud(names(freq2), freq2, min.freq=5, colors=brewer.pal(6, "Spectral"))

#wordcloud(names(freq2), freq2, min.freq=5, colors=brewer.pal(length(names(subset(wf2, freq2>5))), "Dark2"))

It becomes apparent that top 30 jobs tend to focus more on research or discovery and include frequent keywords like “study”, “theory”, and “science”. The bottom 100 jobs are more focused on mechanistic operations of objects or equipment, with frequent keywords like “operation”, “repair”, and “perform”.

1.3 Area Under ROC Curve

In Chapter 9 we will discuss the Receiver Operating Characteristic (ROC) curve. We can use document-term matrices to build classifiers and use the area under ROC curve to evaluate those classifiers. Assume we want to predict whether a job ranks top 30 in the job list.

The first task would be to create an indicator of high rank (job is in the top 30 list). We can use the ifelse() function that we are already familiar with.

# Flag the top-ranked jobs (Index < 30 selects the 29 highest-ranked jobs;
# use Index <= 30 to capture exactly the top 30)
job$highrank <- ifelse(job$Index<30, 1, 0)

Next we load the glmnet package to help us build the prediction model and draw graphs.

# install.packages("glmnet")
library(glmnet)

The function we will be using is the cv.glmnet, where cv stands for cross-validation. Since the derived job ranking variable highrank is binary, we specify the option family='binomial'. Also, we want to use a 10-fold CV method for internal statistical (resampling-based) prediction validation.

set.seed(25)
fit <- cv.glmnet(x = as.matrix(dtm), y = job[['highrank']], 
                 family = 'binomial', 
                 # lasso penalty
                 alpha = 1, 
                 # interested in the area under ROC curve
                 type.measure = "auc", 
                 # 10-fold cross-validation
                 nfolds = 10, 
                 # high value is less accurate, but has faster training
                 thresh = 1e-3, 
                 # again lower number of iterations for faster training
                 maxit = 1e3)
# plot(fit)
print(paste("max AUC =", round(max(fit$cvm), 4)))
## [1] "max AUC = 0.6348"
plotCV.glmnet <- function(cv.glmnet.object, name="") {
  df <- as.data.frame(cbind(x=log(cv.glmnet.object$lambda), y=cv.glmnet.object$cvm, 
                           errorBar=cv.glmnet.object$cvsd))

  featureNum <- cv.glmnet.object$nzero
  xFeature <- log(cv.glmnet.object$lambda)
  yFeature <- max(cv.glmnet.object$cvm)+max(cv.glmnet.object$cvsd)
  dataFeature <- data.frame(featureNum, xFeature, yFeature)

  plot_ly(data = df) %>%
    # add error bars for each CV-mean at log(lambda)
    add_trace(x = ~x, y = ~y, type = 'scatter', mode = 'markers',
        name = 'CV MSE', error_y = ~list(array = errorBar)) %>% 
    # add the lambda-min and lambda 1SD vertical dash lines
    add_lines(data=df, x=c(log(cv.glmnet.object$lambda.min), log(cv.glmnet.object$lambda.min)), 
              y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)),
              showlegend=F, line=list(dash="dash"), name="lambda.min", mode = 'lines+markers') %>%
    add_lines(data=df, x=c(log(cv.glmnet.object$lambda.1se), log(cv.glmnet.object$lambda.1se)), 
              y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)), 
              showlegend=F, line=list(dash="dash"), name="lambda.1se") %>%
    # Add Number of Features Annotations on Top
    add_trace(dataFeature, x = ~xFeature, y = ~yFeature, type = 'scatter', name="Number of Features",
        mode = 'text', text = ~featureNum, textposition = 'middle right',
        textfont = list(color = '#000000', size = 9)) %>%
    # Add top x-axis (non-zero features)
    # add_trace(data=df, x=~c(min(cv.glmnet.object$nzero),max(cv.glmnet.object$nzero)),
    #           y=~c(max(y)+max(errorBar),max(y)+max(errorBar)), showlegend=F, 
    #           name = "Non-Zero Features", yaxis = "ax", mode = "lines+markers", type = "scatter") %>%
    layout(title = paste0("Cross-Validation MSE (", name, ")"),
                            xaxis = list(title=TeX("\\log(\\lambda)"),  side="bottom", showgrid=TRUE), # type="log"
                            hovermode = "x unified", legend = list(orientation='h'),  # xaxis2 = ax,  
                            yaxis = list(title = cv.glmnet.object$name, side="left", showgrid = TRUE)) %>%
    config(mathjax = 'cdn')
}

plotCV.glmnet(fit, "LASSO")

Here \(x\) is the matrix of covariates (the DTM) and \(y\) is the binary response (outcome) variable. The graph shows the cross-validated AUC estimates across the sequence of penalty values \(\lambda\), and the max(fit$cvm) expression above reports the best AUC among these models. The resulting \(AUC\sim 0.63\) represents a reasonably good prediction model for this small sample size.

1.4 TF-IDF

To enhance the utility of the DTM, we introduce the TF-IDF (term frequency - inverse document frequency) weighting. Unlike the raw frequency, TF-IDF measures the relative importance of a term. If a term appears in almost every document, it is considered common and carries little discriminative information; conversely, rarer terms are considered more informative.

1.4.1 Term Frequency (TF)

TF is a ratio \(\frac{\text{a term's occurrences in a document}}{\text{the number of occurrences of the most frequent word within the same document}}\).

\[TF(t,d)=\frac{f_d(t)}{\max_{w\in d}{f_d(w)}}.\]

1.4.2 Inverse Document Frequency (IDF)

The TF definition may allow high scores for irrelevant words that naturally show up often in a long text, even if these may have been triaged in a prior preprocessing step. The IDF attempts to rectify that. IDF is based on the inverse of the share of documents in which the term appears; the lower the number of documents containing the term, relative to the size of the corpus, the higher the IDF factor.

IDF involves a logarithm because otherwise the scoring penalty for appearing in additional documents would be too extreme; without it, a term found in just one document would be weighted twice as heavily as a term found in two documents. The \(\ln()\) function dampens this bias of ranking in favor of very rare terms, even when their TF-factors are high, since it is rather unlikely that a term’s relevance is high in only one document and not in any others.

\[IDF(t,D) = \ln\left ( \frac{|D|}{|\{ d\in D: t\in d\}|} \right ).\]

1.4.3 TF-IDF

Both TF and IDF yield high scores for highly relevant terms. TF relies on local information (search over \(d\)), whereas IDF incorporates a more global perspective (search over \(D\)). Their product, \(TF\times IDF\), gives the classical TF-IDF formula. However, alternative expressions may be formulated using different relative weights for the TF and IDF components.

\[TF\_IDF(t,d,D)= TF(t,d)\times IDF(t,D).\] An example of an alternative TF-IDF metric can be defined by: \[ TF\_IDF'(t,d,D) =\frac{IDF(t,D)}{|D|} + TF\_IDF(t,d,D).\]
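
Here is a small worked example implementing the two definitions above by hand on a hypothetical toy count matrix. Note that tm’s weightTfIdf(), used in the next code chunk, employs a slightly different normalization, so these numbers are for illustration only.

# Hand-computed TF-IDF on a toy 3-document x 4-term count matrix,
# using the max-frequency TF and natural-log IDF definitions above
counts <- matrix(c(3, 1, 0, 2,
                   1, 0, 2, 0,
                   2, 0, 0, 4),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("data", "model", "text", "health")))
TF    <- counts / apply(counts, 1, max)           # TF(t,d) = f_d(t) / max_w f_d(w)
IDF   <- log(nrow(counts) / colSums(counts > 0))  # IDF(t,D) = ln(|D| / #{d : t in d})
TFIDF <- sweep(TF, 2, IDF, "*")                   # TF-IDF(t,d,D) = TF(t,d) * IDF(t,D)
round(TFIDF, 3)   # "data" occurs in every document, so its IDF and TF-IDF weights are 0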

Let’s make another DTM with TF-IDF weights and compare the differences between the unweighted and weighted DTM.

dtm.tfidf <- DocumentTermMatrix(jobCorpus, control = list(weighting=weightTfIdf))
dtm.tfidf
## <<DocumentTermMatrix (documents: 200, terms: 846)>>
## Non-/sparse entries: 1818/167382
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
dtm.tfidf$dimnames$Docs <- as.character(1:200)
inspect(dtm.tfidf[1:9, 1:10]) 
## <<DocumentTermMatrix (documents: 9, terms: 10)>>
## Non-/sparse entries: 2/88
## Sparsity           : 98%
## Maximal term length: 7
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##     Terms
## Docs 16wheel abnorm access     accid    accord account accur achiev act activ
##    1       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    2       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    3       0      0      0 0.5536547 0.0000000       0     0      0   0     0
##    4       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    5       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    6       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    7       0      0      0 0.0000000 0.0000000       0     0      0   0     0
##    8       0      0      0 0.0000000 0.4321928       0     0      0   0     0
##    9       0      0      0 0.0000000 0.0000000       0     0      0   0     0
inspect(dtm[1:9, 1:10]) 
## <<DocumentTermMatrix (documents: 9, terms: 10)>>
## Non-/sparse entries: 2/88
## Sparsity           : 98%
## Maximal term length: 7
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs 16wheel abnorm access accid accord account accur achiev act activ
##    1       0      0      0     0      0       0     0      0   0     0
##    2       0      0      0     0      0       0     0      0   0     0
##    3       0      0      0     1      0       0     0      0   0     0
##    4       0      0      0     0      0       0     0      0   0     0
##    5       0      0      0     0      0       0     0      0   0     0
##    6       0      0      0     0      0       0     0      0   0     0
##    7       0      0      0     0      0       0     0      0   0     0
##    8       0      0      0     0      1       0     0      0   0     0
##    9       0      0      0     0      0       0     0      0   0     0

Inspecting the two different DTMs we can see that TF-IDF is not only counting the frequency but also assigning different weights to each term according to the importance of the term. Next, we are going to fit another model with this new DTM (dtm.tfidf).

set.seed(1234)
fit1 <- cv.glmnet(x = as.matrix(dtm.tfidf), y = job[['highrank']], 
                 family = 'binomial', 
                 # lasso penalty
                 alpha = 1, 
                 # interested in the area under ROC curve
                 type.measure = "auc", 
                 # 10-fold cross-validation
                 nfolds = 10, 
                 # high value is less accurate, but has faster training
                 thresh = 1e-3, 
                 # again lower number of iterations for faster training
                 maxit = 1e3)
# plot(fit1)
print(paste("max AUC =", round(max(fit1$cvm), 4)))
## [1] "max AUC = 0.6206"
plotCV.glmnet(fit1, "LASSO")

This performance is about the same as the previous jobs ranking prediction classifier (based on the unweighted DTM). Due to random sampling, each run of the protocols may generate a slightly different result. The idea behind using TF-IDF is that one would expect to get more unbiased estimates of word importance. If the document includes stopwords, like “the” or “one”, the unweighted DTM may distort the results, but TF-IDF can help fix this problem.

Next we can report a more intuitive representation of the job ranking prediction reflecting the agreement of the binary (top-30 or not) classification between the real labels and the predicted labels. Note that this validation uses the training data itself, however, later we will illustrate model performance assessment using independent testing data.

# Binarize the LASSO probability prediction 
preffit1 <- predict(fit1, newx=as.matrix(dtm.tfidf), s="lambda.min", type = "response")
binPredfit1 <- ifelse(preffit1<0.5, 0, 1)
table(binPredfit1, job[['highrank']])
##            
## binPredfit1   0   1
##           0 171   0
##           1   0  29
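
From this confusion table, summary agreement measures can be computed directly in base R; the sketch below treats the top-30 label (1) as the positive class.

cm <- table(binPredfit1, job[['highrank']])
accuracy    <- sum(diag(cm)) / sum(cm)        # overall agreement
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true-positive rate for top-30 jobs
specificity <- cm["0", "0"] / sum(cm[, "0"])  # true-negative rate for the remaining jobs
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)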

Let’s try to predict the job ranking of a new (testing or validation) job description (JD). Many job descriptions are available online for independent testing; we can extract the JD text and apply the pre-trained model to predict the job ranking of the corresponding positions. Trying several alternative job categories, e.g., some high-tech or fin-tech and some manufacturing and construction jobs, may provide some intuition about the power of the jobs classifier we built. Below, we compare the BLS JDs for the positions of accountant, attorney, and machinist.

# install.packages("text2vec"); install.packages("data.table")
library(text2vec)
library(data.table)

# Choose the JD for a PUBLIC ACCOUNTANTS 1430 (https://www.bls.gov/ocs/ocsjobde.htm)
xTestAccountant <- "Performs professional auditing work in a public accounting firm. Work requires at least a bachelor's degree in accounting. Participates in or conducts audits to ascertain the fairness of financial representations made by client companies. May also assist the client in improving accounting procedures and operations. Examines financial reports, accounting records, and related documents and practices of clients. Determines whether all important matters have been disclosed and whether procedures are consistent and conform to acceptable practices. Samples and tests transactions, internal controls, and other elements of the accounting system(s) as needed to render the accounting firm's final written opinion. As an entry level public accountant, serves as a junior member of an audit team. Receives classroom and on-the-job training to provide practical experience in applying the principles, theories, and concepts of accounting and auditing to specific situations. (Positions held by trainee public accountants with advanced degrees, such as MBA's are excluded at this level.) Complete instructions are furnished and work is reviewed to verify its accuracy, conformance with required procedures and instructions, and usefulness in facilitating the accountant's professional growth. Any technical problems not covered by instructions are brought to the attention of a superior. Carries out basic audit tests and procedures, such as: verifying reports against source accounts and records; reconciling bank and other accounts; and examining cash receipts and disbursements, payroll records, requisitions, receiving reports, and other accounting documents in detail to ascertain that transactions are properly supported and recorded. Prepares selected portions of audit working papers"

xTestAttorney <- "Performs consultation, advisory and/or trail work and carries out the legal processes necessary to effect the rights, privileges, and obligations of the organization. The work performed requires completion of law school with an L.L.B. degree or J.D. degree and admission to the bar. Responsibilities or functions include one or more of the following or comparable duties:
1. Preparing and reviewing various legal instruments and documents, such as contracts, leases, licenses, purchases, sales, real estate, etc.;
2. Acting as agent of the organization in its transactions;
3. Examining material (e.g., advertisements, publications, etc.) for legal implications; advising officials of proposed legislation which might affect the organization;
4. Applying for patents, copyrights, or registration of the organization's products, processes, devices, and trademarks; advising whether to initiate or defend lawsuits;
5. Conducting pre trial preparations; defending the organization in lawsuits;
6. Prosecuting criminal cases for a local or state government or defending the general public (for example, public defenders and attorneys rendering legal services to students); or
7. Advising officials on tax matters, government regulations, and/or legal rights.
Attorney jobs are matched at one of six levels according to two factors:
1. Difficulty level of legal work; and
2. Responsibility level of job.
Attorney jobs which meet the above definitions are to be classified and coded in accordance with a chart available upon request.
Legal questions are characterized by: facts that are well-established; clearly applicable legal precedents; and matters not of substantial importance to the organization. (Usually relatively limited sums of money, e.g., a few thousand dollars, are involved.)
a. legal investigation, negotiation, and research preparatory to defending the organization in potential or actual lawsuits involving alleged negligence where the facts can be firmly established and there are precedent cases directly applicable to the situation;
b. searching case reports, legal documents, periodicals, textbooks, and other legal references, and preparing draft opinions on employee compensation or benefit questions where there is a substantial amount of clearly applicable statutory, regulatory, and case material;
c. drawing up contracts and other legal documents in connection with real property transactions requiring the development of detailed information but not involving serious questions regarding titles to property or other major factual or legal issues.
d. preparing routine criminal cases for trial when the legal or factual issues are relatively straightforward and the impact of the case is limited; and
e. advising public defendants in regard to routine criminal charges or complaints and representing such defendants in court when legal alternatives and facts are relatively clear and the impact of the outcome is limited primarily to the defendant.

Legal work is regularly difficult by reason of one or more of the following: the absence of clear and directly applicable legal precedents; the different possible interpretations that can be placed on the facts, the laws, or the precedents involved; the substantial importance of the legal matters to the organization (e.g., sums as large as $100,000 are generally directly or indirectly involved); or the matter is being strongly pressed or contested in formal proceedings or in negotiations by the individuals, corporations, or government agencies involved.
a. advising on the legal implications of advertising representations when the facts supporting the representations and the applicable precedent cases are subject to different interpretations;
b. reviewing and advising on the implications of new or revised laws affecting the organization;
c. presenting the organization's defense in court in a negligence lawsuit which is strongly pressed by counsel for an organized group;
d. providing legal counsel on tax questions complicated by the absence of precedent decisions that are directly applicable to the organization's situation;
e. preparing and prosecuting criminal cases when the facts of the cases are complex or difficult to determine or the outcome will have a significant impact within the jurisdiction; and
f. advising and representing public defendants in all phases of criminal proceedings when the facts of the case are complex or difficult to determine, complex or unsettled legal issues are involved, or the prosecutorial jurisdiction devotes substantial resources to obtaining a conviction."

xTestMachinist <- "Produces replacement parts and new parts in making repairs of metal parts of mechanical equipment. Work involves most of the following: interpreting written instructions and specifications; planning and laying out of work; using a variety of machinist's handtools and precision measuring instruments; setting up and operating standard machine tools; shaping of metal parts to close tolerances; making standard shop computations relating to dimensions of work, tooling, feeds, and speeds of machining; knowledge of the working properties of the common metals; selecting standard materials, parts, and equipment required for this work; and fitting and assembling parts into mechanical equipment. In general, the machinist's work normally requires a rounded training in machine-shop practice usually acquired through a formal apprenticeship or equivalent training and experience. Industrial machinery repairer. Repairs machinery or mechanical equipment. Work involves most of the following: examining machines and mechanical equipment to diagnose source of trouble; dismantling or partly dismantling machines and performing repairs that mainly involve the use of handtools in scraping and fitting parts; replacing broken or defective parts with items obtained from stock; ordering the production of a replacement part by a machine shop or sending the machine to a machine shop for major repairs; preparing written specifications for major repairs or for the production of parts ordered from machine shops; reassembling machines; and making all necessary adjustments for operation. In general, the work of a machinery maintenance mechanic requires rounded training and experience usually acquired through a formal apprenticeship or equivalent training and experience. Excluded from this classification are workers whose primary duties involve setting up or adjusting machines. Vehicle and mobile equipment mechanics and repairers. Repairs, rebuilds, or overhauls major assemblies of internal combustion automobiles, buses, trucks, or tractors. Work involves most of the following: Diagnosing the source of trouble and determining the extent of repairs required; replacing worn or broken parts such as piston rings, bearings, or other engine parts; grinding and adjusting valves; rebuilding carburetors; overhauling transmissions; and repairing fuel injection, lighting, and ignition systems. In general, the work of the motor vehicle mechanic requires rounded training and experience usually acquired through a formal apprenticeship or equivalent training and experience"

# Define the testing cases (1) list, (2) Y-category (top-30 or not), and (3) ID names
testJDs <- as.list(c(xTestAccountant, xTestAttorney, xTestMachinist))
testTop30 <- c(1,1,0)
testNames <- c("xTestAccountant", "xTestAttorney", "xTestMachinist")
  
### 1. Training Phase (labeled data)
# loop to substitute "_" with blank space
for(j in seq(jobs)){
  jobs[[j]] <- gsub("_", " ", jobs[[j]])
}

prep_fun = tolower
tok_fun = word_tokenizer

trainIterator = itoken(unlist(jobs), 
             preprocessor = prep_fun, 
             tokenizer = tok_fun, 
             ids = rownames(jobs), 
             progressbar = FALSE)
vocab = create_vocabulary(trainIterator)
vocab
## Number of docs: 200 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##                term term_count doc_count
##              <char>      <int>     <int>
##    1:            16          1         1
##    2:      abnormal          1         1
##    3: abnormalities          1         1
##    4:        access          1         1
##    5:   accordingly          1         1
##   ---                                   
## 1109:            in         51        45
## 1110:            to         81        69
## 1111:           the         82        70
## 1112:            of        118        98
## 1113:           and        329       184
# The `text2vec` package offers alternative tokenizer functions (?tokenizers); we can also use simple wrappers
# of the `base::gsub()` function or write new tokenizers. Let's construct a document-term matrix using the vocabulary of the trainIterator
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(trainIterator, vectorizer)
dim(dtm_train)
## [1]  200 1113
### 2. Fit LASSO model
set.seed(1234)
fit1 <- cv.glmnet(x = dtm_train, y = job[['highrank']], 
                   family = 'binomial', 
                 # lasso penalty
                 alpha = 1, 
                 # interested in the area under ROC curve
                 type.measure = "auc", 
                 # 10-fold cross-validation
                 nfolds = 10, 
                 # high value is less accurate, but has faster training
                 thresh = 1e-5, 
                 # again lower number of iterations for faster training
                 maxit = 1e4)
print(paste("max AUC =", round(max(fit1$cvm), 4)))
## [1] "max AUC = 0.6191"
#### 3. Testing Phase (new JDs)
testIterator = tok_fun(prep_fun(unlist(testJDs)))
# turn off progress bar because it won't look nice in rmd
testIterator = itoken(testIterator, ids = testNames, progressbar = FALSE)

dtm_test = create_dtm(testIterator, vectorizer)

predictedJDs = predict(fit1, dtm_test, type = 'response')[,1] # Type can be: "link", "response", "coefficients", "class", "nonzero"
predictedJDs
## xTestAccountant   xTestAttorney  xTestMachinist 
##       0.2312489       0.1504954       0.1355345
# glmnet:::auc(testTop30, predictedJDs)

Note that the results may change somewhat between runs (e.g., AUC ~ 0.62). Above, we assessed the JD predictive (LASSO) model using the three out-of-sample job descriptions. Below is a plot of the model’s cross-validated performance.

# plot(fit1)
# plot(fit1, xvar="lambda", label="TRUE")
# mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

plotCV.glmnet(fit1, "LASSO")

The output of the predictions shows that:

  • On the training data, the predicted probabilities rapidly decrease with the indexing of the jobs, corresponding to the overall job ranking (highly ranked/desired jobs are listed on the top).
  • On the three testing job description data (accountant, attorney, and machinist), there is a clear ranking difference between the machinist and the other two professions.

Also see the discussion in Chapter 11 about the different types of predictions that can be generated as outputs of cv.glmnet regularized forecasting methods.
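
For reference, and anticipating that discussion, the short sketch below illustrates how the type argument of predict() changes the scale of the cv.glmnet output for the three test JDs above; the choice s = "lambda.min" is one common option for the regularization level.

predict(fit1, dtm_test, s = "lambda.min", type = "link")[, 1]      # linear predictor (logit scale)
predict(fit1, dtm_test, s = "lambda.min", type = "response")[, 1]  # predicted probabilities
predict(fit1, dtm_test, s = "lambda.min", type = "class")[, 1]     # hard 0/1 class labels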

1.5 Cosine similarity

As we mentioned above, text data are often transformed, or weighted, by the Term Frequency-Inverse Document Frequency (TF-IDF). In many text-mining applications, this preprocessing step may often yield better results compared to the raw (unweighted) term-frequencies. Alternative transformations may be based on different distance measures, such as cosine distance. The cosine similarity and distance are defined by

\[similarity = \cos(\theta) = \frac{A\cdot B}{||A||_2||B||_2},\] \[Cosine\ Distance = 1-Similarity = 1 - \frac{A\cdot B}{||A||_2||B||_2},\] where \(\theta\) represents the angle between two vectors \(A\) and \(B\) in Euclidean space spanned by the DTM matrix. Note that the cosine similarity of two text documents (more specifically two DTM’s with or without TF-IDF weights) will always be in the range \([0,1]\), since the term frequencies are always non-negative. In other words, the angle between two term frequency vectors cannot exceed \(90^o\), and therefore, \(0\leq Cosine\_Distance\leq1\), albeit, it’s not a proper distance metric as it does not satisfy the triangle inequality, in general. Mind the dimensions of the corresponding matrices: \(\dim(dtm)=200\times 846\) and \(\dim(dist\_cos)=200\times 200\).

cos_dist = function(mat){
  # pairwise dot products between the rows (documents) of mat
  numer = tcrossprod(mat)
  # Euclidean (L2) norm of each row
  denom1 = sqrt(apply(mat, 1, crossprod))
  denom2 = denom1   # the matrix is compared with itself, so the norms are reused
  # cosine distance = 1 - cosine similarity
  1 - numer / outer(denom1,denom2)
}
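
As a quick sanity check, we can compare cos_dist() against the direct formula for a pair of small, made-up term-frequency vectors.

A <- c(1, 2, 0, 3); B <- c(2, 0, 1, 1)
manual <- 1 - sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))
viaFun <- cos_dist(rbind(A, B))[1, 2]
all.equal(manual, viaFun)   # TRUE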

# Recall
# dtm  <- DocumentTermMatrix(jobCorpus)
# jobs <- as.list(job$Description)
# trainIterator = itoken(unlist(jobs), ...)
# dtm_train = create_dtm(trainIterator, vectorizer)
# fit <- cv.glmnet(x = as.matrix(dtm), y = job[['highrank']], ...)
dist_cos = cos_dist(as.matrix(dtm))   # also try with dtm_train
                 
set.seed(1234)
fit_cos <- cv.glmnet(x = dist_cos, y = job[['highrank']], 
                 family = 'binomial', 
                 # lasso penalty
                 alpha = 1, 
                 # interested in the area under ROC curve
                 type.measure = "auc", 
                 # 10-fold cross-validation
                 nfolds = 10, 
                 # high value is less accurate, but has faster training
                 thresh = 1e-5, 
                 # again lower number of iterations for faster training
                 maxit = 1e5)
# plot(fit_cos)
plotCV.glmnet(fit_cos, "Cosine-transformed LASSO")
print(paste("max AUC =", round(max(fit_cos$cvm), 4)))
## [1] "max AUC = 0.8377"

The AUC is now substantially higher, about \(0.84\), which is a reasonably good result and an improvement over both the unweighted and the TF-IDF-weighted DTM models. This suggests that the cosine-similarity transformation of the document-term data can improve this type of text classification and yields a more acceptable content classifier.

1.6 Sentiment analysis

Let’s use the text2vec::movie_review dataset, which consists of 5,000 movie reviews dichotomized as positive or negative. In the subsequent predictive analytics, this sentiment will represent our output feature: \[Y= Sentiment=\left\{ \begin{array}{ll} 0, & \quad negative \\ 1, & \quad positive \end{array} \right. .\]

Data Preprocessing

The data.table package will also be used for some data manipulation. Let’s start with splitting the data into training and testing sets.

# install.packages("text2vec"); install.packages("data.table")
library(text2vec)
library(data.table)

# Load the movie reviews data
data("movie_review")

# coerce the movie reviews data to a data.table (DT) object
setDT(movie_review)

# create a key for the movie-reviews data table
setkey(movie_review, id)

# View the data
# View(movie_review)
head(movie_review); dim(movie_review); colnames(movie_review)
## Key: <id>
##         id sentiment
##     <char>     <int>
## 1: 10000_8         1
## 2: 10001_4         0
## 3: 10004_3         0
## 4: 10004_8         1
## 5: 10006_4         0
## 6: 10008_7         1
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                review
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <char>
## 1: Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they'll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it's like to be homeless? That is Goddard Bolt's lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days without the luxuries; if Bolt succeeds, he can do what he wants with a future project of making more buildings. The bet's on where Bolt is thrown on the street with a bracelet on his leg to monitor his every move where he can't step off the sidewalk. He's given the nickname Pepto by a vagrant after it's written on his forehead where Bolt meets other characters including a woman by the name of Molly (Lesley Ann Warren) an ex-dancer who got divorce before losing her home, and her pals Sailor (Howard Morris) and Fumes (Teddy Wilson) who are already used to the streets. They're survivors. Bolt isn't. He's not used to reaching mutual agreements like he once did when being rich where it's fight or flight, kill or be killed.<br /><br />While the love connection between Molly and Bolt wasn't necessary to plot, I found \\"Life Stinks\\" to be one of Mel Brooks' observant films where prior to being a comedy, it shows a tender side compared to his slapstick work such as Blazing Saddles, Young Frankenstein, or Spaceballs for the matter, to show what it's like having something valuable before losing it the next day or on the other hand making a stupid bet like all rich people do when they don't know what to do with their money. Maybe they should give it to the homeless instead of using it like Monopoly money.<br /><br />Or maybe this film will inspire you to help others.
## 2:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            This film lacked something I couldn't put my finger on at first: charisma on the part of the leading actress. This inevitably translated to lack of chemistry when she shared the screen with her leading man. Even the romantic scenes came across as being merely the actors at play. It could very well have been the director who miscalculated what he needed from the actors. I just don't know.<br /><br />But could it have been the screenplay? Just exactly who was the chef in love with? He seemed more enamored of his culinary skills and restaurant, and ultimately of himself and his youthful exploits, than of anybody or anything else. He never convinced me he was in love with the princess.<br /><br />I was disappointed in this movie. But, don't forget it was nominated for an Oscar, so judge for yourself.
## 3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     \\"It appears that many critics find the idea of a Woody Allen drama unpalatable.\\" And for good reason: they are unbearably wooden and pretentious imitations of Bergman. And let's not kid ourselves: critics were mostly supportive of Allen's Bergman pretensions, Allen's whining accusations to the contrary notwithstanding. What I don't get is this: why was Allen generally applauded for his originality in imitating Bergman, but the contemporaneous Brian DePalma was excoriated for \\"ripping off\\" Hitchcock in his suspense/horror films? In Robin Wood's view, it's a strange form of cultural snobbery. I would have to agree with that.
## 4:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          This isn't the comedic Robin Williams, nor is it the quirky/insane Robin Williams of recent thriller fame. This is a hybrid of the classic drama without over-dramatization, mixed with Robin's new love of the thriller. But this isn't a thriller, per se. This is more a mystery/suspense vehicle through which Williams attempts to locate a sick boy and his keeper.<br /><br />Also starring Sandra Oh and Rory Culkin, this Suspense Drama plays pretty much like a news report, until William's character gets close to achieving his goal.<br /><br />I must say that I was highly entertained, though this movie fails to teach, guide, inspect, or amuse. It felt more like I was watching a guy (Williams), as he was actually performing the actions, from a third person perspective. In other words, it felt real, and I was able to subscribe to the premise of the story.<br /><br />All in all, it's worth a watch, though it's definitely not Friday/Saturday night fare.<br /><br />It rates a 7.7/10 from...<br /><br />the Fiend :.
## 5:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            I don't know who to blame, the timid writers or the clueless director. It seemed to be one of those movies where so much was paid to the stars (Angie, Charlie, Denise, Rosanna and Jon) that there wasn't enough left to really make a movie. This could have been very entertaining, but there was a veil of timidity, even cowardice, that hung over each scene. Since it got an R rating anyway why was the ubiquitous bubble bath scene shot with a 70-year-old woman and not Angie Harmon? Why does Sheen sleepwalk through potentially hot relationships WITH TWO OF THE MOST BEAUTIFUL AND SEXY ACTRESSES in the world? If they were only looking for laughs why not cast Whoopi Goldberg and Judy Tenuta instead? This was so predictable I was surprised to find that the director wasn't a five year old. What a waste, not just for the viewers but for the actors as well.
## 6:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       You know, Robin Williams, God bless him, is constantly shooting himself in the foot lately with all these dumb comedies he has done this decade (with perhaps the exception of \\"Death To Smoochy\\", which bombed when it came out but is now a cult classic). The dramas he has made lately have been fantastic, especially \\"Insomnia\\" and \\"One Hour Photo\\". \\"The Night Listener\\", despite mediocre reviews and a quick DVD release, is among his best work, period.<br /><br />This is a very chilling story, even though it doesn't include a serial killer or anyone that physically dangerous for that matter. The concept of the film is based on an actual case of fraud that still has yet to be officially confirmed. In high school, I read an autobiography by a child named Anthony Godby Johnson, who suffered horrific abuse and eventually contracted AIDS as a result. I was moved by the story until I read reports online that Johnson may not actually exist. When I saw this movie, the confused feelings that Robin Williams so brilliantly portrayed resurfaced in my mind.<br /><br />Toni Collette probably gives her best dramatic performance too as the ultimately sociopathic \\"caretaker\\". Her role was a far cry from those she had in movies like \\"Little Miss Sunshine\\". There were even times she looked into the camera where I thought she was staring right at me. It takes a good actress to play that sort of role, and it's this understated (yet well reviewed) role that makes Toni Collette probably one of the best actresses of this generation not to have even been nominated for an Academy Award (as of 2008). It's incredible that there is at least one woman in this world who is like this, and it's scary too.<br /><br />This is a good, dark film that I highly recommend. Be prepared to be unsettled, though, because this movie leaves you with a strange feeling at the end.
## [1] 5000    3
## [1] "id"        "sentiment" "review"
# Generate 80-20% training-testing split of the reviews
all_ids = movie_review$id
set.seed(1234)
train_ids = sample(all_ids, 5000*0.8)
test_ids = setdiff(all_ids, train_ids)
train = movie_review[train_ids, ]
test = movie_review[test_ids, ]

Next, we will vectorize the reviews by creating term-to-termID mappings. Note that terms may include arbitrary n-grams, not just single words. The set of reviews will be represented as a sparse matrix, with rows corresponding to reviews and columns corresponding to terms. This vectorization may be accomplished in several alternative ways, e.g., using the corpus vocabulary or feature hashing (a hashing sketch is shown after the DTM construction below).

The vocabulary-based DTM, created by the create_vocabulary() function, relies on all unique terms from all reviews, where each term has a unique ID. In this example, we will create the review vocabulary using an iterator construct, which abstracts the input details and enables in-memory processing of the (training) data in chunks.

# define the text preprocessing function

# either a simple (tolower case) function
preproc_fun = tolower

# ... or a more elaborate "cleaning" function (this second definition overrides the simple one above)
preproc_fun = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # remove non-ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # remove non-alpha-numeric values
  x  =  tolower(x)                          # convert to lowercase characters
  # x  =  removeNumbers(x)                  # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}
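
A quick check of the cleaning function on a short, made-up review snippet illustrates the HTML-tag removal, punctuation stripping, lowercasing, and whitespace collapsing (the expected result is noted in the comment).

preproc_fun("Great <br /><br />MOVIE!!  A must-see...")
# expected result: "great movie a must see"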

# define the tokenization function
token_fun = word_tokenizer

# iterator for both training and testing sets
iter_train = itoken(train$review, 
             preprocessor = preproc_fun, 
             tokenizer = token_fun, 
             ids = train$id, 
             progressbar = TRUE)

iter_test = itoken(test$review, 
             preprocessor = preproc_fun, 
             tokenizer = token_fun, 
             ids = test$id, 
             progressbar = TRUE)
reviewVocab = create_vocabulary(iter_train)

# report the head and tail of the reviewVocab
reviewVocab
## Number of docs: 4000 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##          term term_count doc_count
##        <char>      <int>     <int>
##     1:  00015          1         1
##     2:     03          1         1
##     3:    041          1         1
##     4:     05          1         1
##     5:     09          1         1
##    ---                            
## 35560:     to      22103      3788
## 35561:     of      23585      3791
## 35562:      a      26468      3871
## 35563:    and      26832      3864
## 35564:    the      54245      3966

The next step computes the document term matrix (DTM).

reviewVectorizer = vocab_vectorizer(reviewVocab)
t0 = Sys.time()
dtm_train = create_dtm(iter_train, reviewVectorizer)
dtm_test = create_dtm(iter_test, reviewVectorizer)
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 1.460334 secs
# check the DTM dimensions
dim(dtm_train); dim(dtm_test)
## [1]  4000 35564
## [1]  1000 35564
# confirm that the training data review DTM dimensions are consistent 
# with training review IDs, i.e., #rows = number of documents, and
# #columns = number of unique terms (n-grams), dim(dtm_train)[[2]]
identical(rownames(dtm_train), train$id)
## [1] TRUE
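
As noted earlier, feature hashing is an alternative to the vocabulary-based vectorization used above. The sketch below is illustrative (the 2^14 hash size is an arbitrary choice): terms are mapped into a fixed number of columns without building an explicit vocabulary, at the cost of occasional hash collisions.

hashVectorizer <- hash_vectorizer(hash_size = 2^14, ngram = c(1L, 1L))
dtm_train_hash <- create_dtm(iter_train, hashVectorizer)
dtm_test_hash  <- create_dtm(iter_test, hashVectorizer)
dim(dtm_train_hash)   # 4000 x 16384, fixed and independent of the corpus vocabulary size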

1.7 NLP/TM Analytics

We can now fit statistical models or derive machine learning (AI) predictions. Let’s start by using cv.glmnet() to fit a logit model with LASSO (\(L_1\)) regularization and 10-fold cross-validation, see Chapter 11.

library(glmnet)
nFolds = 10
t0 = Sys.time()
glmnet_classifier = cv.glmnet(x = dtm_train, y = train[['sentiment']], 
        family = "binomial", 
        # LASSO L1 penalty
        alpha = 1,
        # interested in the area under ROC curve or MSE
        type.measure = "auc",
        # n-fold internal (training data) stats cross-validation
        nfolds = nFolds,
        # threshold: high value is less accurate / faster training
        thresh = 1e-2,
        # again lower number of iterations for faster training
        maxit = 1e3
      )

lambda.best <- glmnet_classifier$lambda.min
lambda.best
## [1] 0.009043106
# report execution time
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 4.041791 secs
# some prediction plots
# plot(glmnet_classifier)
# # plot(glmnet_classifier, xvar="lambda", label="TRUE")
# mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)
plotCV.glmnet(glmnet_classifier)

Now let’s look at external validation, i.e., testing the model on the independent 20% of the reviews we kept aside. The performance of the binary sentiment prediction on the testing data is roughly the same as the estimate from the internal 10-fold cross-validation.

# library('glmnet')
# report the mean internal cross-validated error
print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))
## [1] "max AUC = 0.9221"
# report TESTING data prediction accuracy
xTest = dtm_test
yTest = test[['sentiment']]
# Note: by default predict() returns values on the link (logit) scale;
# use type="response" to obtain probabilities (as in the pruned model below)
predLASSO <- predict(glmnet_classifier, 
              s = glmnet_classifier$lambda.1se, newx = xTest)
# this "MSE" mixes logit-scale predictions with 0/1 labels, so it is only a rough diagnostic
testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO
## [1] 2.482202
# Binarize the (logit-scale) LASSO prediction; 0.5 is used as a rough cutoff here,
# although a threshold of 0 on the logit scale (probability 0.5) would be more natural
binPredLASSO <- ifelse(predLASSO<0.5, 0, 1)
table(binPredLASSO, yTest)
##             yTest
## binPredLASSO   0   1
##            0 449 181
##            1  38 332
# and testing data AUC
glmnet:::auc(yTest, predLASSO)
## [1] 0.906773
# report the 20 most negative and most positive coefficients (predict == predict.cv.glmnet);
# note that sort() drops the term names and keeps only the coefficient values
summary(predLASSO)
##        s1         
##  Min.   :-8.2990  
##  1st Qu.:-0.9875  
##  Median : 0.1347  
##  Mean   :-0.1050  
##  3rd Qu.: 0.9050  
##  Max.   : 5.8133
sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients"))[1:20]
##  [1] -5.3773471 -2.4413170 -2.0576560 -1.8147407 -1.7930992 -1.6398103
##  [7] -1.5799485 -1.4489793 -1.3448756 -1.2109861 -1.2032443 -1.1045763
## [13] -1.0823501 -1.0753680 -1.0609291 -1.0525903 -1.0398367 -1.0111934
## [19] -0.9996942 -0.9048566
rev(sort(predict(glmnet_classifier, s = lambda.best, type = "coefficients")))[1:20]
##  [1] 2.6990847 2.0802809 1.4634998 1.0710930 1.0179945 0.9351993 0.8824431
##  [8] 0.8106505 0.7944252 0.7789477 0.7634964 0.7611448 0.7361530 0.7230270
## [15] 0.7225707 0.7190906 0.7075955 0.7073987 0.7038402 0.6936936

The (external) prediction performance on the testing data, measured by AUC, is about the same as the internal 10-fold cross-validation result reported above.

1.7.1 Prediction Optimization

Earlier we saw that we can also prune the vocabulary and perhaps improve prediction performance, e.g., by removing non-salient terms like stopwords and by using n-grams instead of single words.

reviewVocab = create_vocabulary(iter_train, stopwords=tm::stopwords("english"), ngram = c(1L, 2L))

prunedReviewVocab = prune_vocabulary(reviewVocab, 
                term_count_min = 10, 
                doc_proportion_max = 0.5,
                doc_proportion_min = 0.001)
prunedVectorizer = vocab_vectorizer(prunedReviewVocab)

t0 = Sys.time()
dtm_train = create_dtm(iter_train, prunedVectorizer)
dtm_test = create_dtm(iter_test, prunedVectorizer)
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 1.450669 secs

Next, refit the model and report the performance. Did this yield an improvement in the prediction accuracy?

glmnet_prunedClassifier=cv.glmnet(x=dtm_train, 
        y=train[['sentiment']], 
        family = "binomial", 
        # LASSO L1 penalty
        alpha = 1,
        # interested in the area under ROC curve or MSE
        type.measure = "auc",
        # n-fold internal (training data) stats cross-validation
        nfolds = nFolds,
        # threshold: high value is less accurate / faster training
        thresh = 1e-4,
        # again lower number of iterations for faster training
        maxit = 1e5
      )

lambda.best <- glmnet_prunedClassifier$lambda.min
lambda.best
## [1] 0.007865232
# report execution time (measured from t0 above, so it includes the DTM construction)
t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))
## Time difference of 4.020988 secs
# some prediction plots
# plot(glmnet_prunedClassifier)
# mtext("Pruned-Model CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)
plotCV.glmnet(glmnet_prunedClassifier)
# report the mean internal cross-validated error
print(paste("max AUC =", round(max(glmnet_prunedClassifier$cvm), 4)))
## [1] "max AUC = 0.9301"
# report TESTING data prediction accuracy
xTest = dtm_test
yTest = test[['sentiment']]
predLASSO = predict(glmnet_prunedClassifier,
          dtm_test, type = 'response')[,1]

testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO
## [1] 0.1232584
# Binarize the LASSO probability prediction 
binPredLASSO <- ifelse(predLASSO<0.5, 0, 1)
table(binPredLASSO, yTest)
##             yTest
## binPredLASSO   0   1
##            0 397  80
##            1  90 433
# and testing data AUC
glmnet:::auc(yTest, predLASSO)
## [1] 0.9181407
# report the top 20 negative and positive predictive terms
summary(predLASSO)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000352 0.2256975 0.5240056 0.4943751 0.7449730 0.9985883
sort(predict(glmnet_prunedClassifier, s = lambda.best, type = "coefficients"))[1:20]
##  [1] -5.996667 -2.917531 -2.570840 -2.129353 -2.011366 -1.924831 -1.905654
##  [8] -1.672907 -1.623986 -1.536025 -1.525004 -1.430932 -1.363108 -1.330195
## [15] -1.274768 -1.258665 -1.176855 -1.170610 -1.146671 -1.136090
rev(sort(predict(glmnet_prunedClassifier, s = lambda.best, type = "coefficients")))[1:20]
##  [1] 2.9487927 2.5906712 1.6395806 1.3466682 1.2665379 1.1657157 1.1609455
##  [8] 1.0732492 1.0469346 1.0196068 1.0174891 1.0051070 0.9562316 0.9241320
## [15] 0.9162853 0.8817782 0.8803928 0.8668196 0.8604375 0.8561196

Pruning the vocabulary and adding 2-grams somewhat improved the sentiment prediction model (testing AUC of about 0.92 vs. 0.91 before).

2 Apriori Association Rules Learning

HTTP cookies are used to track web-surfing and Internet traffic. We often notice that promotions (ads) on websites tend to match our needs, reflect our prior browsing history, or align with our interests. That is not an accident. Nowadays, recommendation systems rely heavily on machine learning methods that can learn the behavior, e.g., purchasing patterns, of individual consumers. In this section, we will uncover some of the mystery behind recommendation systems such as market basket analysis. Specifically, we will (1) discuss association rules and their support and confidence, (2) present the Apriori algorithm for association rule learning, and (3) cover step-by-step a set of case-studies, including a toy example and studies of Head and Neck Cancer Medications, Grocery purchases, and survival of Titanic passengers.

2.1 Association Rules

Association rules are the result of process analytics (e.g., market basket analysis) that specify patterns of relationships among items. One specific example would be: \[\{charcoal, \, lighter, \, chicken\, wings\}\rightarrow\{barbecue\, sauce\}\] In other words, charcoal, lighter and chicken wings imply barbecue sauce. The curly brackets indicate that we have a set of items and the arrow suggests a direction of the association. Items in a set are called elements. When an item-set like \(\{charcoal, \, lighter, \, chicken\, wings, \, barbecue\, sauce\}\) appears in our dataset with some regularity, we can mine and discover patterns of association with other item-sets.

Association rules are commonly used for unsupervised discovery of knowledge rather than prediction of pre-specified outcomes. In biomedical research, association rules are widely used for:

  • searching for interesting or frequently occurring patterns of DNA,
  • searching for protein sequences in an analysis of cancer data,
  • finding patterns of medical claims that occur in combination with credit card or insurance fraud.

2.2 The Apriori algorithm for association rule learning

Association rules are mostly applied to transactional data, i.e., records of trades, exchanges, or arrangements, e.g., medical records. Such datasets are typically very large in both the number of transactions and the number of features, e.g., electronic health records (EHR). In such data archives, the number and complexity of transactions is high, which complicates efforts to extract patterns, conduct market analysis, or predict basket purchases.

Apriori association rules help untangle such difficult modeling problems. If we have a simple prior belief about the properties of frequent elements, we may be able to efficiently reduce the number of features or combinations that we need to look at.

The Apriori algorithm is based on a simple a priori belief: all subsets of a frequent item-set must also be frequent. This is known as the Apriori property. In the last example, the full set \(\{charcoal, \, lighter, \, chicken\, wings, \, barbecue\, sauce\}\) can be frequent only if all of its subsets, including the single elements, pairs, and triples, occur frequently. Naturally, the Apriori algorithm is designed for finding patterns in large datasets where patterns that appear frequently are considered “interesting”, “valuable”, or “important”.

2.3 Rule support and confidence

We can measure a rule’s importance by computing its support and confidence metrics. The support and confidence represent two criteria useful in deciding whether a pattern is “valuable”. By setting thresholds for these two criteria, we can easily limit the number of interesting rules or item-sets reported.

For item-sets \(X\) and \(Y\), the support of an item-set measures how (relatively) frequently it appears in the data: \[support(X)=\frac{count(X)}{N},\] where N is the total number of transactions in the database and count(X) is the number of observations (transactions) containing the item-set X.

In a set-theoretic sense, the union of item-sets is an item-set itself. In other words, if \(Z=\{X,Y\}=\{X\cup Y\}\), then \[support(Z)=support(X,Y).\]

For a given rule \(X \rightarrow Y\), the rule's confidence measures the relative accuracy of the rule: \[confidence(X \rightarrow Y)=\frac{support(X, Y)}{support(X)}.\]

The confidence measures the joint occurrence of X and Y relative to the occurrence of X. If, whenever X appears, Y tends to also be present, then we will have a high \(confidence(X\rightarrow Y)\). This set-theoretic formulation of the confidence can also be expressed in terms of probabilities, since the numerator is effectively the probability of the intersection of the events \(E_X\) and \(E_Y\) corresponding to observing the transactions \(X\) and \(Y\), respectively. That is, the support corresponds to the joint probability, \(support(X,Y)=Prob(E_X\cap E_Y)\), and the confidence corresponds to the conditional probability \[confidence(X \rightarrow Y)=\frac{Prob(E_X\cap E_Y)}{Prob(E_X)}\equiv Prob(E_Y | E_X).\]

Note that the ranges of the support and the confidence are \(0 \leq support,\ confidence \leq 1\).

One intuitive example of a strong association rule is \(\{peanut\, butter\}\rightarrow\{bread\}\), because it has high support as well as high confidence in grocery store transactions. Shoppers tend to purchase bread when they get peanut butter. These items tend to appear in the same baskets, which yields high confidence for the rule \(\{peanut\, butter\}\rightarrow\{bread\}\). Of course, weaker or spurious rules, e.g., \(\{peanut\, butter\}\rightarrow\{celery\}\), may also arise. Such recommendation systems are not perfect and can fail in unexpected ways, see the “How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did?” article.
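
To make these definitions concrete, here is a minimal hand computation of the support and confidence of \(\{peanut\, butter\}\rightarrow\{bread\}\) on a small, hypothetical list of grocery baskets; the basket contents below are made up purely for illustration.

baskets <- list(c("peanut butter", "bread", "jelly"),
                c("bread", "milk"),
                c("peanut butter", "bread"),
                c("peanut butter", "celery"),
                c("bread", "eggs"))
N <- length(baskets)
suppPB      <- sum(sapply(baskets, function(b) "peanut butter" %in% b)) / N                   # support(X)
suppPBbread <- sum(sapply(baskets, function(b) all(c("peanut butter", "bread") %in% b))) / N  # support(X,Y)
confPBbread <- suppPBbread / suppPB                                                           # confidence(X -> Y)
c(support = suppPBbread, confidence = confPBbread)   # 0.4 and about 0.67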

2.4 Building a set of rules with the Apriori principle

Remember that the number of arrangements of \(n\) elements taken \(k\) at a time, i.e., the number of combinations, grows very rapidly with the size of the item inventory (\(n\)), see this SOCR activity. For example, a restaurant using only 10 ingredients can offer \({10\choose 5}=252\) different 5-ingredient menu items for customers to order. Clearly, the number of possible “baskets” increases quickly with the inventory of available items or ingredients (a quick numerical check appears after the list below). For instance,

  • \(n=100\) (ingredients) and \(k=50\) (menus of 50 ingredients), yields \({100\choose 50}=100891344545564193334812497256\gt 10^{29}\) possible arrangements (combinations), and
  • \(n=100\) (ingredients) and \(k=5\) (menus of only 5 ingredients), yields \({100\choose 5}=75287520\gt 75M\) possible orders.
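
These counts are easy to verify with R’s built-in choose() function.

choose(10, 5)     # 252
choose(100, 50)   # about 1.0089e+29
choose(100, 5)    # 75287520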

To manage this complexity, we will introduce a two-step process for building a small number of simple and informative rules:

  • Step 1: Filter all item-sets by a minimum support threshold. This is accomplished iteratively by increasing the size of the item-sets. In the first iteration, we compute the support of singletons (1-item-sets). In the next iteration, we compute the support of pairs of items, then triples of items, and so on. Item-sets passing iteration \(i\) become the candidates for the next iteration, \(i+1\). If {A}, {B}, {C} are all frequent singletons, but D is not frequent in the first singleton-selection round, then in the second iteration we only consider the support of the pairs {A, B}, {A, C}, {B, C}, ignoring all pairs that include D. This substantially reduces the cardinality of the candidate item-sets and ensures the feasibility of the algorithm. In the third iteration, if {A, C} and {B, C} are frequent but {A, B} is not, then the algorithm may terminate, since the support of {A, B, C} cannot pass the support threshold, given that its subset {A, B} was not frequent enough.

  • Step 2: Using the item-sets selected in step 1, generate new rules with confidence larger than a predefined minimum confidence threshold. The candidate item-sets that passed step 1 would include all frequent item-sets. For the highly-supported item-set {A, C}, we would compute the confidence measures for \(\{A\}\rightarrow\{C\}\) as well as \(\{C\}\rightarrow\{A\}\) and compare these against the minimum confidence threshold. The surviving rules are the ones with confidence levels exceeding that minimum threshold.

2.5 A toy example

Assume that a large supermarket tracks sales data by stock-keeping unit (SKU) for each item, i.e., each item, such as “butter” or “bread”, is identified by an SKU number. The supermarket has a database of transactions where each transaction is a set of SKUs that were bought together.

Suppose the database of transactions consists of the following item-sets, each representing a purchasing order:

require(knitr)
item_table = as.data.frame(t(c("{1,2,3,4}","{1,2,4}","{1,2}","{2,3,4}","{2,3}","{3,4}","{2,4}")))
colnames(item_table) <- c("choice1","choice2","choice3","choice4","choice5","choice6","choice7")
kable(item_table, caption = "Item table")
Item table
choice1 choice2 choice3 choice4 choice5 choice6 choice7
{1,2,3,4} {1,2,4} {1,2} {2,3,4} {2,3} {3,4} {2,4}

We will use Apriori to determine the frequent item-sets of this database. To do so, we will say that an item-set is frequent if it appears in at least \(3\) transactions of the database, i.e., the value \(3\) is the support threshold.

The first step of Apriori is to count up the number of occurrences, i.e., the support, of each member item separately. By scanning the database for the first time, we obtain the following size-1 support:

item_table = as.data.frame(t(c(3,6,4,5)))
colnames(item_table) <- c("items: {1}","{2}","{3}","{4}")
rownames(item_table) <- "(N=7)*support"
kable(item_table,caption = "Size 1 Support")
Size 1 Support
items: {1} {2} {3} {4}
(N=7)*support 3 6 4 5

All the singletons, item-sets of size 1, have a support of at least 3, so they are all frequent. The next step is to generate a list of all pairs of frequent items.

For example, regarding the pair \(\{1,2\}\): the transaction table above shows items 1 and 2 appearing together in three of the item-sets; therefore, we say that the support of the pair \(\{1,2\}\) is \(3\).

item_table = as.data.frame(t(c(3,1,2,3,4,3)))
colnames(item_table) <- c("{1,2}","{1,3}","{1,4}","{2,3}","{2,4}","{3,4}")
rownames(item_table) <- "N*support"
kable(item_table,caption = "Size 2 Support")
Size 2 Support
{1,2} {1,3} {1,4} {2,3} {2,4} {3,4}
N*support 3 1 2 3 4 3

The pairs \(\{1,2\}\), \(\{2,3\}\), \(\{2,4\}\), and \(\{3,4\}\) all meet or exceed the minimum support of \(3\), so they are frequent. The pairs \(\{1,3\}\) and \(\{1,4\}\) are not frequent, and any larger set that contains \(\{1,3\}\) or \(\{1,4\}\) cannot be frequent either. In this way, we can prune candidate sets: we now look for frequent triples in the database, but we can already exclude all triples containing either of these two pairs:

item_table = as.data.frame(t(c(2)))
colnames(item_table) <- c("{2,3,4}")
rownames(item_table) <- "N*support"
kable(item_table,caption = "Size 3 Support")
Size 3 Support
{2,3,4}
N*support 2

In this example, there are no frequent triplets: the support of the item-set \(\{2,3,4\}\) is below the minimal threshold, and the other triplets were excluded because they are supersets of pairs that already fell below the threshold. We have thus determined the frequent item-sets in the database, and illustrated how some item-sets were never counted because one of their subsets was already known to be below the threshold.
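
We can verify these hand computations using the arules package. The following is a small sketch (the object names toy_list, toy_trans, and freq_sets are our own) that encodes the seven toy transactions as an arules transactions object and mines all frequent item-sets at the same support threshold of \(3/7\).

library(arules)
toy_list <- list(c("1","2","3","4"), c("1","2","4"), c("1","2"),
                 c("2","3","4"), c("2","3"), c("3","4"), c("2","4"))
toy_trans <- as(toy_list, "transactions")
# mine all frequent item-sets appearing in at least 3 of the 7 transactions
freq_sets <- eclat(toy_trans, parameter = list(support = 3/7, minlen = 1))
inspect(sort(freq_sets, by = "support"))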

2.6 Case Study 1: Head and Neck Cancer Medications

Step 1 - data import

To demonstrate the Apriori algorithm in a real biomedical case-study, we will use a transactional healthcare dataset representing a subset of the Head and Neck Cancer Medication data, which is available in our case-studies collection as 10_medication_descriptions.csv. It consists of inpatient medications of head and neck cancer patients.

The data are in wide format (see Chapter 1), where each row represents a patient. During the study period, each patient had records for a maximum of 5 encounters. NA indicates no medication administration record at that time point for that patient. The dataset contains a total of 528 patients.

Step 2 - exploring and preparing the data

Unlike the data imports in previous chapters, transactional data need to be ingested in R using the read.transactions() function. This function stores the data as a sparse matrix, with each row representing a basket (or transaction) and each column representing an item.

Let’s load the dataset and delete the irrelevant index column. Using the write.csv(R data, "path") function, we can output our R data file into a local CSV file. To avoid generating another index column in the output CSV file, we can use the row.names=F option.

med<-read.csv("https://umich.instructure.com/files/1678540/download?download_frd=1", stringsAsFactors = FALSE)
med<-med[, -1]
write.csv(med, "medication.csv", row.names=F)
library(knitr)
kable(med[1:5, ])
MEDICATION_DESC.1 MEDICATION_DESC.2 MEDICATION_DESC.3 MEDICATION_DESC.4 MEDICATION_DESC.5
acetaminophen uh cefazolin ivpb uh NA NA NA
docusate fioricet heparin injection ondansetron injection uh simvastatin
hydrocodone acetaminophen 5mg 325mg NA NA NA NA
fentanyl injection uh NA NA NA NA
cefazolin ivpb uh hydrocodone acetaminophen 5mg 325mg NA NA NA

Now we can use read.transactions() from the arules package to read the CSV file we just wrote out.

# install.packages("arules")
library(arules)
med <- read.transactions("medication.csv", sep = ",", skip = 1, rm.duplicates=TRUE)
## distribution of transactions with duplicates:
## items
##   1   2   3 
##  79 166 248
summary(med)
## transactions as itemMatrix in sparse format with
##  528 rows (elements/itemsets/transactions) and
##  88 columns (items) and a density of 0.02085486 
## 
## most frequent items:
##                fentanyl injection uh  hydrocodone acetaminophen 5mg 325mg 
##                                  211                                  165 
##                    cefazolin ivpb uh                    heparin injection 
##                                  108                                  105 
## hydrocodone acetamin 75mg 500mg 15ml                              (Other) 
##                                   60                                  320 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5 
## 248 166  79  23  12 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.835   2.000   5.000 
## 
## includes extended item information - examples:
##                            labels
## 1                         09 nacl
## 2                   09 nacl bolus
## 3 acetaminophen  multiroute    uh

Here we use the option rm.duplicates=T because a patient's encounter list may contain repeated entries of the same medication; duplicated items within a transaction are removed. The option skip=1 skips the header line of the CSV file. The result is transactional data with unique items within each transaction.

The summary of the transactional data contains rich information. The first block tells us that we have 528 rows (transactions) and 88 different medicines (items) in this matrix. Using the reported density, we can calculate the number of non-NA medication records in the data. In total, there are \(528\times 88=46,464\) cells in the matrix, so approximately \(46,464\times 0.02085\approx 969\) medication administrations were recorded during the study period.
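
As a quick sanity check of this arithmetic (assuming the med transactions object created above), the same count can be recovered either from the reported density or by summing the individual transaction sizes.

# total number of recorded medication administrations (two equivalent computations)
round(dim(med)[1] * dim(med)[2] * 0.02085486)   # rows x columns x density
sum(size(med))                                  # sum of all transaction sizes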

The second block lists the most frequent medicines and their frequencies in the matrix. For example, fentanyl injection uh appeared 211 times, which corresponds to a proportion of \(211/528\approx 0.4\) of the (treatment) transactions. Since fentanyl is frequently used to manage pain after surgery or other medical treatments, many of these patients likely underwent significant medical procedures requiring postoperative pain management.

The last block shows statistics about the transaction sizes: 248 patients had only one medication record during the study period, while 12 had 5 records, one for each time point. On average, each patient had about 1.8 different medicines.

2.6.1 Visualizing item support - item frequency plots

The summary may still be fairly abstract, so we can visualize the information.

inspect(med[1:5,])
##     items                                
## [1] {acetaminophen uh,                   
##      cefazolin ivpb uh}                  
## [2] {docusate,                           
##      fioricet,                           
##      heparin injection,                  
##      ondansetron injection uh,           
##      simvastatin}                        
## [3] {hydrocodone acetaminophen 5mg 325mg}
## [4] {fentanyl injection uh}              
## [5] {cefazolin ivpb uh,                  
##      hydrocodone acetaminophen 5mg 325mg}

The inspect() call shows the transactional dataset. We can see that the medication records of each patient are nicely formatted as item-sets.

We can further analyze the frequent terms using itemFrequency(). Item frequencies (supports) are reported in alphabetical order; below we display the first five items.

library(plotly)

itemFrequency(med[, 1:5])
##                                 09 nacl                           09 nacl bolus 
##                             0.013257576                             0.003787879 
##         acetaminophen  multiroute    uh acetaminophen codeine 120 mg 12 mg 5 ml 
##                             0.001893939                             0.001893939 
##       acetaminophen codeine 300mg 30 mg 
##                             0.020833333
# itemFrequencyPlot(med, topN=20)

# convert the data to a matrix
mat <- as.matrix(med@data)
# View matrix as image
# image(mat)

# capture the row/column names
rowNames <- med@itemInfo$labels
colNames <- paste0("S", c(1:dim(mat)[2]))
rownames(mat) <- rowNames
colnames(mat) <- colNames

# convert matrix to DF for processing, order rows based on their average (across subjects/cases/columns), back to matrix for display
df <- as.data.frame(1*mat)
df$avg <- rowMeans(df)
dfOrdered <- df[order(df$avg, decreasing = T), ]
matOrdered <- as.matrix(dfOrdered)

# track the ordered row names
rowNames <- rownames(dfOrdered)
colNames <- colnames(dfOrdered)

# 2D top 20 terms bar plot
# To order the meds based on "avg", instead of alphabetically (mind the "-" sign to order Large to small!)
plot_ly(x = reorder(rowNames[c(1:20)], -dfOrdered[1:20, "avg"]), y=dfOrdered[1:20, "avg"], name="Top 20 Meds", type="bar")  %>%
  layout(title='Frequency of Medications (Top 20) based on averaging across Cases',
           xaxis = list(title="Term"),
           yaxis = list(title="Relative Frequency"))
# 3D surface plot
plot_ly(x = colNames, y = rowNames, z = 2*matOrdered, type = "surface") %>%
  layout(title='Term (X) by Sample (Y) Frequency (Z) Plot',
           xaxis = list(title="Term"),
           yaxis = list(title="Sample ID")) %>% hide_colorbar()

Here we display only the top 20 medicines that appear most frequently in this dataset. Consistent with the prior summary() output, fentanyl is still the most frequent item. You can also plot the items using a support threshold: instead of topN=20, use the option support=0.1, which will show all items with support greater than or equal to \(0.1\).
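
For instance, the following short sketch (reusing the med object from above) lists all items with support of at least \(0.1\); the commented line shows the corresponding base arules frequency plot.

# items with support >= 0.1 (the same items that itemFrequencyPlot(med, support=0.1) would display)
medItemSupport <- itemFrequency(med)
medItemSupport[medItemSupport >= 0.1]
# itemFrequencyPlot(med, support = 0.1)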

Let’s generalize this process and define a new function, itemFrequencyPlotly(transactionObject, numTopItemps = 10), that can generate frequency plots for any transaction object.

# define a generic plot_ly ItemFrequency plotting function
itemFrequencyPlotly <- function(transactionObject, numTopItemps = 10, name="") {
  name <- ifelse(name=="", 
                 paste0('Frequency of Items (Top ', numTopItemps, 
                     ' ) based on averaging across Cases'),
                 paste0('Frequency of Items (Top ', numTopItemps, 
                     ' ) based on averaging across Cases (Data=', name, ')'))
  
  mat <- as.matrix(transactionObject@data)
  # View matrix as image
  # image(mat)
  
  # capture the row/column names
  rowNames <- transactionObject@itemInfo$labels
  colNames <- paste0("S", c(1:dim(mat)[2]))
  rownames(mat) <- rowNames
  colnames(mat) <- colNames
  
  # convert matrix to DF for processing, order rows based on their average (across subjects/cases/columns), back to matrix for display
  df <- as.data.frame(1*mat)
  df$avg <- rowMeans(df)
  dfOrdered <- df[order(df$avg, decreasing = T), ]
  matOrdered <- as.matrix(dfOrdered)
  
  # track the ordered row names
  rowNames <- rownames(dfOrdered)
  colNames <- colnames(dfOrdered)
  
  # 2D bar plot of the top numTopItemps terms
  # To order the meds based on "avg", instead of alphabetically (mind the "-" sign to order Large to small!)
  plot_ly(x = reorder(rowNames[c(1:numTopItemps)], -dfOrdered[1:numTopItemps, "avg"]), 
          y=dfOrdered[1:numTopItemps, "avg"], name=paste0("Top ", numTopItemps, " Meds"), type="bar")  %>%
    layout(title=name,
             xaxis = list(title="Terms"),
             yaxis = list(title="Relative Frequency"))
}
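
For example, the new helper function can be applied directly to the medication transactions to reproduce the earlier top-20 frequency plot (the data label is an arbitrary string).

itemFrequencyPlotly(med, 20, "Head and Neck Cancer Medications")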

2.6.2 Visualizing transaction data - plotting the sparse matrix

The sparse matrix will show what medications were prescribed for each patient. Below we only show the top-20 medications for the first 15 cases.

# image(med[1:5, ])

plot_ly(x=reorder(rowNames[c(1:20)], -dfOrdered[1:20, "avg"]), y=colNames[1:15], 
        z=2*matOrdered[1:15, 1:20], type="heatmap") %>% 
  layout(title='Heatmap - Top-20 Medications for the first 15 Cases') %>% hide_colorbar()

This image has 15 rows (transactions), as we only specified the first 15 patients, and 20 columns (the top-20 meds). Although the picture may be a little hard to interpret, it gives a sense of which medications were prescribed for each patient in the study.

Let’s see an expanded graph including the top 30 medications for a random roster of 50 patients.

subset_int <- sample(ncol(matOrdered), 50, replace = F)  
# image(med[subset_int, ])

plot_ly(x=rowNames[1:30], y=colNames[subset_int], 
        z=2*t(matOrdered[1:30, subset_int]), type="heatmap") %>% 
  layout(title='Heatmap - Top-30 Medications for a Random Set of 50 Patients') %>% hide_colorbar()

It shows us clearly that some medications are more popular than others. Now, let’s fit the Apriori model.

Step 3 - training a model on the data

With the data in place, we can build the association rules using the arules::apriori() function.

myrules <- apriori(data=mydata, parameter=list(support=0.1, confidence=0.8, minlen=1))

  • data: a sparse transaction matrix created by read.transactions().
  • support: minimum threshold for support.
  • confidence: minimum threshold for confidence.
  • minlen: minimum required rule items (in our case, medications).

Setting these thresholds can be challenging: thresholds that are too high may generate no rules, or only obvious and non-informative rules, whereas thresholds that are too low may produce an overwhelming number of spurious or unreliable rules. We will start with the default settings, support=0.1 and confidence=0.8.

apriori(med)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 52 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[88 item(s), 528 transaction(s)] done [0.00s].
## sorting and recoding items ... [5 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 0 rules

Not surprisingly, we obtained 0 rules because the default support threshold is too high for this dataset. In practice, fine-tuning these thresholds takes some time and may require familiarity with the underlying process, or embedding human clinical knowledge into the AI decision support.

In this case study, we set support=0.01 and confidence=0.25. This requires rule item-sets to appear in at least 1% of the head and neck cancer patients' transactions, and the rules to have a confidence of at least 25%. Moreover, minlen=2 is a very helpful option because it removes all rules that have fewer than two items.

med_rule <- apriori(med, parameter=list(support=0.01, confidence=0.25, minlen=2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 5 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[88 item(s), 528 transaction(s)] done [0.00s].
## sorting and recoding items ... [16 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
med_rule
## set of 29 rules

The result suggests we have a new med_rule object consisting of 29 rules.

Step 4 - evaluating model performance

First, we can obtain the overall summary of this set of rules.

summary(med_rule)
## set of 29 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4 
## 13 12  4 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.00    3.00    2.69    3.00    4.00 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01136   Min.   :0.2500   Min.   :0.01894   Min.   :0.7583  
##  1st Qu.:0.01705   1st Qu.:0.3390   1st Qu.:0.03788   1st Qu.:1.3333  
##  Median :0.01894   Median :0.4444   Median :0.06250   Median :1.7481  
##  Mean   :0.03448   Mean   :0.4491   Mean   :0.08392   Mean   :1.8636  
##  3rd Qu.:0.03788   3rd Qu.:0.5000   3rd Qu.:0.08902   3rd Qu.:2.2564  
##  Max.   :0.11174   Max.   :0.8000   Max.   :0.31250   Max.   :3.9111  
##      count      
##  Min.   : 6.00  
##  1st Qu.: 9.00  
##  Median :10.00  
##  Mean   :18.21  
##  3rd Qu.:20.00  
##  Max.   :59.00  
## 
## mining info:
##  data ntransactions support confidence
##   med           528    0.01       0.25
##                                                                                  call
##  apriori(data = med, parameter = list(support = 0.01, confidence = 0.25, minlen = 2))

We have 13 rules containing two items, 12 rules containing three items, and the remaining 4 rules containing four items.

The lift column shows how much more likely one medicine is to be prescribed to a patient given that another medicine is prescribed. It is obtained by the following formula: \[lift(X\rightarrow Y)=\frac{confidence(X\rightarrow Y)}{support(Y)}.\] Note that \(lift(X\rightarrow Y)\) is the same as \(lift(Y\rightarrow X)\). The range of \(lift\) is \([0,\infty)\); a lift of \(1\) corresponds to the LHS and RHS being independent, and values substantially above \(1\) indicate a positive association. We do not need to worry about support here, since we already imposed a minimum support threshold.
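
As a quick sanity check of the lift formula (assuming the med and med_rule objects created above), we can recompute the lift of the first mined rule from its confidence and the support of its right-hand-side item; the object names r1 and rhs_item are our own.

r1 <- med_rule[1]                                       # first mined rule (rule order may vary)
rhs_item <- unlist(LIST(rhs(r1)))                       # label of the item on the right-hand side
quality(r1)$confidence / itemFrequency(med)[rhs_item]   # should match quality(r1)$lift
quality(r1)$lift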

Using the arulesViz package we can visualize the confidence and support scatter plots for all the rules.

# install.packages("arulesViz")
library(arulesViz)
# plot(sort(med_rule))

sortedRule <- sort(med_rule)

x1   <- sortedRule@quality$support
y1   <- sortedRule@quality$confidence
z1   <- sortedRule@quality$lift
col1 <- sortedRule@quality$count
ruleNames <- paste0("Rule", c(1:length(sortedRule@quality$support)))

plot_ly(x = ~x1, y = ~y1, z = ~z1, color = ~z1, name=ruleNames) %>% 
  add_markers() %>% 
  layout(title=paste0("Arule Support-Confidence-Lift Plot (for all ", length(sortedRule@quality$support), " rules)"),
         scene = list(xaxis = list(title = 'Support'),
                     yaxis = list(title = 'Confidence'),
                     zaxis = list(title = 'Lift'))) %>% 
  hide_colorbar()

Again, we can utilize the inspect() function to see exactly what these rules are.

inspect(med_rule[1:3])
##     lhs                               rhs                 support    confidence
## [1] {acetaminophen uh}             => {cefazolin ivpb uh} 0.01136364 0.4615385 
## [2] {ampicillin sulbactam ivpb uh} => {heparin injection} 0.01893939 0.3448276 
## [3] {ondansetron injection uh}     => {heparin injection} 0.01704545 0.2727273 
##     coverage   lift     count
## [1] 0.02462121 2.256410  6   
## [2] 0.05492424 1.733990 10   
## [3] 0.06250000 1.371429  9

Here, lhs and rhs refer to the “left hand side” and “right hand side” of the rule, respectively. lhs is the given condition and rhs is the predicted result. Using the first row as an example: If a head-and-neck patient has been prescribed acetaminophen (pain reliever and fever reducer), it is likely that the patient is also prescribed cefazolin (antibiotic prescribed for treatment of resistant bacterial infections); bacterial infections are associated with fevers and some cancers.

Step 5 - sorting the set of association rules

Sorting the resulting association rules by decreasing lift will help us identify the most useful rules.

inspect(sort(med_rule, by="lift")[1:3])
##     lhs                                      rhs                    support confidence   coverage     lift count
## [1] {fentanyl injection uh,                                                                                     
##      heparin injection,                                                                                         
##      hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh} 0.01515152  0.8000000 0.01893939 3.911111     8
## [2] {cefazolin ivpb uh,                                                                                         
##      fentanyl injection uh,                                                                                     
##      hydrocodone acetaminophen 5mg 325mg} => {heparin injection} 0.01515152  0.6153846 0.02462121 3.094505     8
## [3] {heparin injection,                                                                                         
##      hydrocodone acetaminophen 5mg 325mg} => {cefazolin ivpb uh} 0.03787879  0.6250000 0.06060606 3.055556    20

These rules may need to be interpreted by clinicians and experts in the specific context of the study. For instance, the first row, {fentanyl, heparin, hydrocodone acetaminophen} implies {cefazolin}. Fentanyl and hydrocodone acetaminophen are both pain relievers that may be prescribed after surgery to relieve moderate to severe pain, combining a narcotic opioid pain reliever (hydrocodone) and a non-opioid pain reliever (acetaminophen). Heparin is usually used before surgery to reduce the risk of blood clots. The third rule suggests that surgery patients who are treated with heparin may also need cefazolin to prevent post-surgical bacterial infection. Cefazolin is an antibiotic used for the treatment of bacterial infections and also to prevent group B streptococcal disease around the time of delivery and before general surgery.

Step 6 - taking subsets of association rules

If we are more interested in investigating associations that are linked to a specific medicine, we can narrow the rules down by making subsets. Let us try investigating rules related to fentanyl, since it appears to be the most frequently prescribed medicine. Fentanyl is used in the management of chronic cancer pain.

fi_rules<-subset(med_rule, items %in% "fentanyl injection uh")
inspect(fi_rules)
##      lhs                                       rhs                                      support confidence   coverage      lift count
## [1]  {ondansetron injection uh}             => {fentanyl injection uh}               0.01893939  0.3030303 0.06250000 0.7582938    10
## [2]  {fentanyl injection uh,                                                                                                         
##       ondansetron injection uh}             => {hydrocodone acetaminophen 5mg 325mg} 0.01136364  0.6000000 0.01893939 1.9200000     6
## [3]  {hydrocodone acetaminophen 5mg 325mg,                                                                                           
##       ondansetron injection uh}             => {fentanyl injection uh}               0.01136364  0.3750000 0.03030303 0.9383886     6
## [4]  {cefazolin ivpb uh,                                                                                                             
##       fentanyl injection uh}                => {heparin injection}                   0.01893939  0.5000000 0.03787879 2.5142857    10
## [5]  {fentanyl injection uh,                                                                                                         
##       heparin injection}                    => {cefazolin ivpb uh}                   0.01893939  0.4761905 0.03977273 2.3280423    10
## [6]  {cefazolin ivpb uh,                                                                                                             
##       fentanyl injection uh}                => {hydrocodone acetaminophen 5mg 325mg} 0.02462121  0.6500000 0.03787879 2.0800000    13
## [7]  {fentanyl injection uh,                                                                                                         
##       hydrocodone acetaminophen 5mg 325mg}  => {cefazolin ivpb uh}                   0.02462121  0.3250000 0.07575758 1.5888889    13
## [8]  {fentanyl injection uh,                                                                                                         
##       heparin injection}                    => {hydrocodone acetaminophen 5mg 325mg} 0.01893939  0.4761905 0.03977273 1.5238095    10
## [9]  {heparin injection,                                                                                                             
##       hydrocodone acetaminophen 5mg 325mg}  => {fentanyl injection uh}               0.01893939  0.3125000 0.06060606 0.7819905    10
## [10] {fentanyl injection uh,                                                                                                         
##       hydrocodone acetaminophen 5mg 325mg}  => {heparin injection}                   0.01893939  0.2500000 0.07575758 1.2571429    10
## [11] {cefazolin ivpb uh,                                                                                                             
##       fentanyl injection uh,                                                                                                         
##       heparin injection}                    => {hydrocodone acetaminophen 5mg 325mg} 0.01515152  0.8000000 0.01893939 2.5600000     8
## [12] {cefazolin ivpb uh,                                                                                                             
##       heparin injection,                                                                                                             
##       hydrocodone acetaminophen 5mg 325mg}  => {fentanyl injection uh}               0.01515152  0.4000000 0.03787879 1.0009479     8
## [13] {cefazolin ivpb uh,                                                                                                             
##       fentanyl injection uh,                                                                                                         
##       hydrocodone acetaminophen 5mg 325mg}  => {heparin injection}                   0.01515152  0.6153846 0.02462121 3.0945055     8
## [14] {fentanyl injection uh,                                                                                                         
##       heparin injection,                                                                                                             
##       hydrocodone acetaminophen 5mg 325mg}  => {cefazolin ivpb uh}                   0.01515152  0.8000000 0.01893939 3.9111111     8

Earlier, we saw that \(\%in\%\) is a simple, intuitive interface to match(); it is used as a binary operator returning a logical vector that indicates whether there is a match (TRUE) or not (FALSE) for each element of its left operand within the right (table) operand (a short illustration of this and related arules matching operators follows). In total, there are 14 rules related to this item, which we plot in the next subsection.
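
Below is a brief illustration of \(\%in\%\), together with (commented) sketches of the related arules matching operators %pin% (partial, substring matching) and %ain% (matching all listed items), which can also be used inside subset(); the string "not a medication" is a deliberately fictitious item label.

c("fentanyl injection uh", "not a medication") %in% itemLabels(med)    # TRUE FALSE
# subset(med_rule, items %pin% "fentanyl")                                       # partial (substring) match
# subset(med_rule, lhs %ain% c("fentanyl injection uh", "heparin injection"))    # LHS must contain both items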

2.7 Graphical depiction of association rules

# ?arulesViz::plot()
# plot(sort(fi_rules, by="lift"), method="grouped", control=list(type="items"), 
#      main = "Grouped Matrix for the 14 Fentanyl-associated Rules")

# plot(sort(fi_rules, by="lift"), method = "graph", engine = "plotly")
# plot(sort(fi_rules, by="lift"), method = "graph", engine = "igraph")
# plot(sort(fi_rules, by="lift"), method = "graph", engine = "paracoord")

plot(sort(fi_rules, by="lift"), method = "graph", engine = "htmlwidget",
     control=list(main = list(title="Grouped Matrix for the 14 Fentanyl-associated Rules")))
## Available control parameters (with default values):
## itemCol   =  #CBD2FC
## nodeCol   =  c("#EE0000", "#EE0303", "#EE0606", "#EE0909", "#EE0C0C", "#EE0F0F", "#EE1212", "#EE1515", "#EE1818", "#EE1B1B", "#EE1E1E", "#EE2222", "#EE2525", "#EE2828", "#EE2B2B", "#EE2E2E", "#EE3131", "#EE3434", "#EE3737", "#EE3A3A", "#EE3D3D", "#EE4040", "#EE4444", "#EE4747", "#EE4A4A", "#EE4D4D", "#EE5050", "#EE5353", "#EE5656", "#EE5959", "#EE5C5C", "#EE5F5F", "#EE6262", "#EE6666", "#EE6969", "#EE6C6C", "#EE6F6F", "#EE7272", "#EE7575", "#EE7878", "#EE7B7B", "#EE7E7E", "#EE8181", "#EE8484", "#EE8888", "#EE8B8B",  "#EE8E8E", "#EE9191", "#EE9494", "#EE9797", "#EE9999", "#EE9B9B", "#EE9D9D", "#EE9F9F", "#EEA0A0", "#EEA2A2", "#EEA4A4", "#EEA5A5", "#EEA7A7", "#EEA9A9", "#EEABAB", "#EEACAC", "#EEAEAE", "#EEB0B0", "#EEB1B1", "#EEB3B3", "#EEB5B5", "#EEB7B7", "#EEB8B8", "#EEBABA", "#EEBCBC", "#EEBDBD", "#EEBFBF", "#EEC1C1", "#EEC3C3", "#EEC4C4", "#EEC6C6", "#EEC8C8", "#EEC9C9", "#EECBCB", "#EECDCD", "#EECFCF", "#EED0D0", "#EED2D2", "#EED4D4", "#EED5D5", "#EED7D7", "#EED9D9", "#EEDBDB", "#EEDCDC", "#EEDEDE", "#EEE0E0",  "#EEE1E1", "#EEE3E3", "#EEE5E5", "#EEE7E7", "#EEE8E8", "#EEEAEA", "#EEECEC", "#EEEEEE")
## precision     =  3
## igraphLayout  =  layout_nicely
## interactive   =  TRUE
## engine    =  visNetwork
## max   =  100
## selection_menu    =  TRUE
## degree_highlight  =  1
## verbose   =  FALSE
subrules2 <- sample(subset(fi_rules, lift > 2), 5)
plot(sort(subrules2, by="lift"), method="grouped", control=list(type="items"), engine = "htmlwidget")
## Available control parameters (with default values):
## k     =  20
## aggr.fun  =  function (x, ...)  UseMethod("mean")
## rhs_max   =  10
## lhs_label_items   =  2
## col   =  c("#EE0000FF", "#EEEEEEFF")
## groups    =  NULL
## engine    =  ggplot2
## verbose   =  FALSE
plot(sort(fi_rules, by="lift"), method="grouped",  k = 7, control=list(type="items"), engine = "htmlwidget")
## Available control parameters (with default values):
## k     =  20
## aggr.fun  =  function (x, ...)  UseMethod("mean")
## rhs_max   =  10
## lhs_label_items   =  2
## col   =  c("#EE0000FF", "#EEEEEEFF")
## groups    =  NULL
## engine    =  ggplot2
## verbose   =  FALSE
m <- rules2matrix(sort(fi_rules, by="lift")[1:10], measure = "lift")
#  m
plot(fi_rules[1:10], method = "matrix", engine = "htmlwidget")
## Grouped matrix
# create a matrix with LHSs grouped in k = 10 groups
m <- rules2groupedMatrix(sort(fi_rules, by="lift")[1:10], k = 10)
# m$m
# number of rules per group
# table(m$clustering_rules)
# get rules for group 1
inspect(fi_rules[m$clustering_rules == 1])
##     lhs                           rhs                                      support confidence   coverage      lift count
## [1] {ondansetron injection uh} => {fentanyl injection uh}               0.01893939  0.3030303 0.06250000 0.7582938    10
## [2] {cefazolin ivpb uh,                                                                                                 
##      fentanyl injection uh,                                                                                             
##      heparin injection}        => {hydrocodone acetaminophen 5mg 325mg} 0.01515152  0.8000000 0.01893939 2.5600000     8
# the corresponding plot
plot(fi_rules, method = "grouped matrix", k = 7, engine = "htmlwidget")
# plot(fi_rules, method = "grouped matrix", k = 7)

#### For interactive runs try the RShinyApp:
# ruleExplorer(fi_rules)
plot(fi_rules, method="graph", measure = "support", engine="htmlwidget", # nodeCol=rainbow(14),
     shading = "lift", control = list(verbose = TRUE))
## Used control parameters:
## itemCol   =  #CBD2FC
## nodeCol   =  c("#EE0000", "#EE0303", "#EE0606", "#EE0909", "#EE0C0C", "#EE0F0F", "#EE1212", "#EE1515", "#EE1818", "#EE1B1B", "#EE1E1E", "#EE2222", "#EE2525", "#EE2828", "#EE2B2B", "#EE2E2E", "#EE3131", "#EE3434", "#EE3737", "#EE3A3A", "#EE3D3D", "#EE4040", "#EE4444", "#EE4747", "#EE4A4A", "#EE4D4D", "#EE5050", "#EE5353", "#EE5656", "#EE5959", "#EE5C5C", "#EE5F5F", "#EE6262", "#EE6666", "#EE6969", "#EE6C6C", "#EE6F6F", "#EE7272", "#EE7575", "#EE7878", "#EE7B7B", "#EE7E7E", "#EE8181", "#EE8484", "#EE8888", "#EE8B8B",  "#EE8E8E", "#EE9191", "#EE9494", "#EE9797", "#EE9999", "#EE9B9B", "#EE9D9D", "#EE9F9F", "#EEA0A0", "#EEA2A2", "#EEA4A4", "#EEA5A5", "#EEA7A7", "#EEA9A9", "#EEABAB", "#EEACAC", "#EEAEAE", "#EEB0B0", "#EEB1B1", "#EEB3B3", "#EEB5B5", "#EEB7B7", "#EEB8B8", "#EEBABA", "#EEBCBC", "#EEBDBD", "#EEBFBF", "#EEC1C1", "#EEC3C3", "#EEC4C4", "#EEC6C6", "#EEC8C8", "#EEC9C9", "#EECBCB", "#EECDCD", "#EECFCF", "#EED0D0", "#EED2D2", "#EED4D4", "#EED5D5", "#EED7D7", "#EED9D9", "#EEDBDB", "#EEDCDC", "#EEDEDE", "#EEE0E0",  "#EEE1E1", "#EEE3E3", "#EEE5E5", "#EEE7E7", "#EEE8E8", "#EEEAEA", "#EEECEC", "#EEEEEE")
## precision     =  3
## igraphLayout  =  layout_nicely
## interactive   =  TRUE
## engine    =  htmlwidget
## max   =  100
## selection_menu    =  TRUE
## degree_highlight  =  1
## verbose   =  TRUE

2.8 Saving association rules to a file or a data frame

We can save these rules into a CSV file using write(), which is similar to the write.csv() function we used at the beginning of this case study.

write(med_rule, file = "medrule.csv", sep=",", row.names=F)

Sometimes it is more convenient to convert the rules into a data frame.

med_df <- as(med_rule, "data.frame")
str(med_df)
## 'data.frame':    29 obs. of  6 variables:
##  $ rules     : chr  "{acetaminophen uh} => {cefazolin ivpb uh}" "{ampicillin sulbactam ivpb uh} => {heparin injection}" "{ondansetron injection uh} => {heparin injection}" "{ondansetron injection uh} => {fentanyl injection uh}" ...
##  $ support   : num  0.0114 0.0189 0.017 0.0189 0.0303 ...
##  $ confidence: num  0.462 0.345 0.273 0.303 0.485 ...
##  $ coverage  : num  0.0246 0.0549 0.0625 0.0625 0.0625 ...
##  $ lift      : num  2.256 1.734 1.371 0.758 1.552 ...
##  $ count     : int  6 10 9 10 16 13 21 17 48 48 ...

As we can see, the rules are converted into a data frame whose rules column is a character vector, together with the corresponding quality measures. Finally, remember that matrices and data-frames can be converted to the arules transactions format using arulesObject <- as(input.df, "transactions").
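
A minimal coercion sketch, using a small hypothetical data frame of factors, looks like this:

# coerce a small data.frame of factors into arules transactions (each row becomes one transaction)
input.df <- data.frame(color = factor(c("red", "blue", "red")),
                       size  = factor(c("S", "M", "S")))
arulesObject <- as(input.df, "transactions")
summary(arulesObject)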

3 Summary

  • The Apriori algorithm for association rule learning is best suited to large transactional datasets; for small datasets it may not be very helpful.
  • It is useful for discovering associations, mostly in early phases of an exploratory study.
  • Some rules may arise by chance and need further verification.

4 Practice Problems

4.1 Groceries

In this practice problem, we will investigate associations among frequently purchased groceries using the Groceries dataset included in the arules package. First, let's load the data.

data("Groceries")
summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

We will identify the most frequent grocery items and plot the top 10 using our itemFrequencyPlotly() helper (the commented line shows the base itemFrequencyPlot() alternative with topN=5).

# itemFrequencyPlot(Groceries, topN=5)
itemFrequencyPlotly(Groceries, 10, "groceries")

Then, try to use support = 0.006, confidence = 0.25, minlen = 2 to set up the grocery association rules, and report the top 3 rules with the highest lift (one possible solution sketch and its output are shown below).
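
One possible solution sketch that generates the output shown below (the object name groceryrules_all is our own choice):

groceryrules_all <- apriori(Groceries,
                            parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
groceryrules_all
inspect(sort(groceryrules_all, by = "lift")[1:3])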

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [463 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
## set of 463 rules
##     lhs                    rhs                      support confidence   coverage     lift count
## [1] {herbs}             => {root vegetables}    0.007015760  0.4312500 0.01626843 3.956477    69
## [2] {berries}           => {whipped/sour cream} 0.009049314  0.2721713 0.03324860 3.796886    89
## [3] {tropical fruit,                                                                            
##      other vegetables,                                                                          
##      whole milk}        => {root vegetables}    0.007015760  0.4107143 0.01708185 3.768074    69

The number of rules (\(463\)) appears excessive. We can try more stringent parameters. In practice, it is easier to identify meaningful rules if we set a higher confidence threshold. Here we set \(confidence=0.6\).

groceryrules <- apriori(Groceries, parameter = list(support = 0.006, confidence = 0.6, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [109 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
groceryrules
## set of 8 rules
inspect(sort(groceryrules, by = "lift")[1:3])
##     lhs                             rhs          support     confidence
## [1] {butter, whipped/sour cream} => {whole milk} 0.006710727 0.6600000 
## [2] {butter, yogurt}             => {whole milk} 0.009354347 0.6388889 
## [3] {root vegetables, butter}    => {whole milk} 0.008235892 0.6377953 
##     coverage   lift     count
## [1] 0.01016777 2.583008 66   
## [2] 0.01464159 2.500387 92   
## [3] 0.01291307 2.496107 81

We observe mainly rules between dairy products. It makes sense that customers pick up milk when they walk down the dairy products aisle. Experiment further with various parameter settings and try to interpret the results in the context of this grocery case-study.

4.2 Titanic Passengers

Next, we'll use the Titanic Passengers dataset. Let's start by loading the data.

dat <- read.csv("https://umich.instructure.com/files/9372716/download?download_frd=1")

# Choose only the key data features
dat <- dat[ , c("pclass", "survived", "sex", "age", "fare", "cabin")]

# Factorize categorical features
dat$pclass <- as.factor(dat$pclass)
dat$survived <- factor(dat$survived, levels = c("0", "1"))
dat$sex <- as.factor(dat$sex)

# Convert the cabin number to its deck letter (first character); Z = missing cabin ID
dat$cabin <- substring(dat$cabin,1,1)
for (i in 1:length(dat$cabin))
     if ((dat$cabin[i]=="")) dat$cabin[i] <- "Z"
dat$cabin <- as.factor(dat$cabin)

# Convert the ticket fare from numeric to a categorical label
f <- as.character(dat$fare)
for (i in 1:length(dat$fare)) {
  if (is.na(dat$fare[i])) f[i] <- as.character("low")
  else if (dat$fare[i]<50) f[i] <- as.character("low") 
  else if (50<=dat$fare[i] && dat$fare[i]<100) f[i] <- as.character("medium") 
  else if (100<=dat$fare[i] && dat$fare[i]<200)  f[i] <- as.character("high")
  else f[i] <- as.character("extreme")   #  if (200<=dat$fare[i]) 
}
dat$fare <- as.factor(f)
table(as.factor(dat$fare))
## 
## extreme    high     low  medium 
##      38      46    1067     158
# Convert Age from numeric to categorical (Decade-of-life) label
f <- as.character(dat$age)
for (i in 1:length(dat$age)) {
  if (is.na(dat$age[i])) f[i] <- as.character("1")
  else {
    a = 1 + dat$age[i] %/% 10  # integer division by 10 (per decade of life)
    f[i] <- as.character(a)
  } 
}
dat$age <- as.factor(f)
table(as.factor(dat$age))
## 
##   1   2   3   4   5   6   7   8   9 
## 345 143 344 232 135  70  32   7   1
str(dat)
## 'data.frame':    1309 obs. of  6 variables:
##  $ pclass  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ survived: Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 1 2 1 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age     : Factor w/ 9 levels "1","2","3","4",..: 3 1 1 4 3 5 7 4 6 8 ...
##  $ fare    : Factor w/ 4 levels "extreme","high",..: 1 2 2 2 2 3 4 3 4 3 ...
##  $ cabin   : Factor w/ 9 levels "A","B","C","D",..: 2 3 3 3 3 5 4 1 3 9 ...
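
As an aside, the same fare discretization can be written more idiomatically with cut(). The following sketch, which re-reads a fresh copy of the raw data into a hypothetical raw object, should reproduce the same four fare categories as the loop above.

raw <- read.csv("https://umich.instructure.com/files/9372716/download?download_frd=1")
fare_cat <- as.character(cut(raw$fare, breaks = c(-Inf, 50, 100, 200, Inf),
                             labels = c("low", "medium", "high", "extreme"), right = FALSE))
fare_cat[is.na(fare_cat)] <- "low"       # treat missing fares as "low", as in the loop above
table(factor(fare_cat))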

Next, we can mine the association rules.

library(arules)

rules <- apriori(dat, parameter = list(minlen=5, supp=0.02, conf=0.8))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.02      5
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 26 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[29 item(s), 1309 transaction(s)] done [0.00s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [186 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(rules[1:20])
##      lhs                                          rhs          support   
## [1]  {pclass=3, survived=0, sex=male, age=5}   => {cabin=Z}    0.02062643
## [2]  {pclass=3, sex=male, age=5, cabin=Z}      => {survived=0} 0.02062643
## [3]  {pclass=3, survived=0, sex=male, age=5}   => {fare=low}   0.02139037
## [4]  {pclass=3, sex=male, age=5, fare=low}     => {survived=0} 0.02139037
## [5]  {pclass=3, survived=0, age=5, cabin=Z}    => {fare=low}   0.02750191
## [6]  {pclass=3, survived=0, age=5, fare=low}   => {cabin=Z}    0.02750191
## [7]  {pclass=3, age=5, fare=low, cabin=Z}      => {survived=0} 0.02750191
## [8]  {pclass=3, sex=male, age=5, cabin=Z}      => {fare=low}   0.02215432
## [9]  {pclass=3, sex=male, age=5, fare=low}     => {cabin=Z}    0.02215432
## [10] {survived=0, sex=male, age=5, cabin=Z}    => {fare=low}   0.03743316
## [11] {survived=0, sex=male, age=5, fare=low}   => {cabin=Z}    0.03743316
## [12] {survived=0, age=5, fare=low, cabin=Z}    => {sex=male}   0.03743316
## [13] {sex=male, age=5, fare=low, cabin=Z}      => {survived=0} 0.03743316
## [14] {survived=1, sex=female, age=2, cabin=Z}  => {fare=low}   0.02368220
## [15] {survived=1, sex=female, age=2, fare=low} => {cabin=Z}    0.02368220
## [16] {survived=1, age=2, fare=low, cabin=Z}    => {sex=female} 0.02368220
## [17] {pclass=3, sex=female, age=2, cabin=Z}    => {fare=low}   0.02750191
## [18] {pclass=3, sex=female, age=2, fare=low}   => {cabin=Z}    0.02750191
## [19] {pclass=3, survived=0, sex=male, age=2}   => {cabin=Z}    0.03819710
## [20] {pclass=3, sex=male, age=2, cabin=Z}      => {survived=0} 0.03819710
##      confidence coverage   lift     count
## [1]  0.9642857  0.02139037 1.244822 27   
## [2]  0.9310345  0.02215432 1.506458 27   
## [3]  1.0000000  0.02139037 1.226804 28   
## [4]  0.9333333  0.02291826 1.510177 28   
## [5]  1.0000000  0.02750191 1.226804 36   
## [6]  0.9729730  0.02826585 1.256037 36   
## [7]  0.8780488  0.03132162 1.420724 36   
## [8]  1.0000000  0.02215432 1.226804 29   
## [9]  0.9666667  0.02291826 1.247896 29   
## [10] 0.9423077  0.03972498 1.156027 49   
## [11] 0.8166667  0.04583652 1.054257 49   
## [12] 0.8305085  0.04507257 1.289603 49   
## [13] 0.9245283  0.04048892 1.495930 49   
## [14] 1.0000000  0.02368220 1.226804 31   
## [15] 0.8857143  0.02673797 1.143393 31   
## [16] 0.8378378  0.02826585 2.353497 31   
## [17] 1.0000000  0.02750191 1.226804 36   
## [18] 1.0000000  0.02750191 1.290927 36   
## [19] 0.9803922  0.03896104 1.265615 50   
## [20] 0.9090909  0.04201681 1.470952 50

We can focus on the binary survival outcome (Survived=1/0).

# examine the rules with containing "Survived=1" in the RHS
rules <- apriori(dat,
  parameter = list(minlen=3, supp=0.02, conf=0.7),
  appearance = list(rhs="survived=1", default="lhs"), control = list(verbose=F))
rules.sorted <- sort(rules, by="lift")
inspect(head(rules.sorted, 30))
##      lhs              rhs             support confidence   coverage     lift count
## [1]  {sex=female,                                                                 
##       cabin=B}     => {survived=1} 0.02750191  1.0000000 0.02750191 2.618000    36
## [2]  {pclass=1,                                                                   
##       sex=female,                                                                 
##       cabin=B}     => {survived=1} 0.02750191  1.0000000 0.02750191 2.618000    36
## [3]  {pclass=1,                                                                   
##       sex=female,                                                                 
##       fare=medium} => {survived=1} 0.05271199  1.0000000 0.05271199 2.618000    69
## [4]  {pclass=1,                                                                   
##       sex=female,                                                                 
##       age=4}       => {survived=1} 0.02826585  0.9736842 0.02902979 2.549105    37
## [5]  {pclass=1,                                                                   
##       sex=female}  => {survived=1} 0.10618793  0.9652778 0.11000764 2.527097   139
## [6]  {sex=female,                                                                 
##       fare=medium} => {survived=1} 0.05500382  0.9350649 0.05882353 2.448000    72
## [7]  {sex=female,                                                                 
##       fare=high}   => {survived=1} 0.02062643  0.9310345 0.02215432 2.437448    27
## [8]  {pclass=1,                                                                   
##       sex=female,                                                                 
##       fare=high}   => {survived=1} 0.02062643  0.9310345 0.02215432 2.437448    27
## [9]  {sex=female,                                                                 
##       cabin=C}     => {survived=1} 0.03208556  0.9130435 0.03514133 2.390348    42
## [10] {pclass=1,                                                                   
##       sex=female,                                                                 
##       cabin=C}     => {survived=1} 0.03208556  0.9130435 0.03514133 2.390348    42
## [11] {pclass=2,                                                                   
##       sex=female}  => {survived=1} 0.07181054  0.8867925 0.08097785 2.321623    94
## [12] {pclass=2,                                                                   
##       sex=female,                                                                 
##       fare=low}    => {survived=1} 0.06951872  0.8834951 0.07868602 2.312990    91
## [13] {pclass=2,                                                                   
##       sex=female,                                                                 
##       cabin=Z}     => {survived=1} 0.06264324  0.8817204 0.07104660 2.308344    82
## [14] {pclass=2,                                                                   
##       sex=female,                                                                 
##       fare=low,                                                                   
##       cabin=Z}     => {survived=1} 0.06035141  0.8777778 0.06875477 2.298022    79
## [15] {pclass=2,                                                                   
##       sex=female,                                                                 
##       age=3}       => {survived=1} 0.02521008  0.8684211 0.02902979 2.273526    33
## [16] {pclass=2,                                                                   
##       sex=female,                                                                 
##       age=3,                                                                      
##       fare=low}    => {survived=1} 0.02368220  0.8611111 0.02750191 2.254389    31
## [17] {pclass=2,                                                                   
##       sex=female,                                                                 
##       age=3,                                                                      
##       cabin=Z}     => {survived=1} 0.02139037  0.8484848 0.02521008 2.221333    28
## [18] {sex=female,                                                                 
##       age=4}       => {survived=1} 0.05194805  0.7906977 0.06569901 2.070047    68
## [19] {sex=female,                                                                 
##       age=5}       => {survived=1} 0.02750191  0.7826087 0.03514133 2.048870    36
## [20] {pclass=1,                                                                   
##       fare=high}   => {survived=1} 0.02597403  0.7391304 0.03514133 1.935043    34
## [21] {sex=female,                                                                 
##       age=2}       => {survived=1} 0.03590527  0.7343750 0.04889228 1.922594    47
## [22] {pclass=1,                                                                   
##       cabin=B}     => {survived=1} 0.03590527  0.7230769 0.04965623 1.893015    47
## [23] {sex=female,                                                                 
##       age=3}       => {survived=1} 0.06264324  0.7130435 0.08785332 1.866748    82
## [24] {pclass=1,                                                                   
##       age=3}       => {survived=1} 0.02826585  0.7115385 0.03972498 1.862808    37
## [25] {pclass=1,                                                                   
##       age=4}       => {survived=1} 0.03896104  0.7083333 0.05500382 1.854417    51
## [26] {pclass=1,                                                                   
##       fare=medium} => {survived=1} 0.06799083  0.7007874 0.09702063 1.834661    89
## [27] {pclass=1,                                                                   
##       cabin=D}     => {survived=1} 0.02139037  0.7000000 0.03055768 1.832600    28

Next, we prune redundant association rules, i.e., rules that provide no additional knowledge because a more general rule with the same or higher lift already exists. Pruning reduces the number of rules from 27 to 18.

# search for redundant rules
rules_lift <- sort(rules, by = 'lift')
rules_pruned <- rules_lift[!is.redundant(rules_lift, measure="lift")]
inspect(head(rules_pruned, 30))
##      lhs                                    rhs          support    confidence
## [1]  {sex=female, cabin=B}               => {survived=1} 0.02750191 1.0000000 
## [2]  {pclass=1, sex=female, fare=medium} => {survived=1} 0.05271199 1.0000000 
## [3]  {pclass=1, sex=female, age=4}       => {survived=1} 0.02826585 0.9736842 
## [4]  {pclass=1, sex=female}              => {survived=1} 0.10618793 0.9652778 
## [5]  {sex=female, fare=medium}           => {survived=1} 0.05500382 0.9350649 
## [6]  {sex=female, fare=high}             => {survived=1} 0.02062643 0.9310345 
## [7]  {sex=female, cabin=C}               => {survived=1} 0.03208556 0.9130435 
## [8]  {pclass=2, sex=female}              => {survived=1} 0.07181054 0.8867925 
## [9]  {sex=female, age=4}                 => {survived=1} 0.05194805 0.7906977 
## [10] {sex=female, age=5}                 => {survived=1} 0.02750191 0.7826087 
## [11] {pclass=1, fare=high}               => {survived=1} 0.02597403 0.7391304 
## [12] {sex=female, age=2}                 => {survived=1} 0.03590527 0.7343750 
## [13] {pclass=1, cabin=B}                 => {survived=1} 0.03590527 0.7230769 
## [14] {sex=female, age=3}                 => {survived=1} 0.06264324 0.7130435 
## [15] {pclass=1, age=3}                   => {survived=1} 0.02826585 0.7115385 
## [16] {pclass=1, age=4}                   => {survived=1} 0.03896104 0.7083333 
## [17] {pclass=1, fare=medium}             => {survived=1} 0.06799083 0.7007874 
## [18] {pclass=1, cabin=D}                 => {survived=1} 0.02139037 0.7000000 
##      coverage   lift     count
## [1]  0.02750191 2.618000  36  
## [2]  0.05271199 2.618000  69  
## [3]  0.02902979 2.549105  37  
## [4]  0.11000764 2.527097 139  
## [5]  0.05882353 2.448000  72  
## [6]  0.02215432 2.437448  27  
## [7]  0.03514133 2.390348  42  
## [8]  0.08097785 2.321623  94  
## [9]  0.06569901 2.070047  68  
## [10] 0.03514133 2.048870  36  
## [11] 0.03514133 1.935043  34  
## [12] 0.04889228 1.922594  47  
## [13] 0.04965623 1.893015  47  
## [14] 0.08785332 1.866748  82  
## [15] 0.03972498 1.862808  37  
## [16] 0.05500382 1.854417  51  
## [17] 0.09702063 1.834661  89  
## [18] 0.03055768 1.832600  28

The arulesViz package supplies rule-visualization routines, including scatter, bubble, and parallel-coordinates plots. In the parallel-coordinates visualization of frequent itemsets, all items are placed on the vertical axis, with their positions determined by their group and by item frequency in descending order (the vertical ranking is based on the support of the 1-itemset containing the item), see this paper. To render the maximal frequent itemsets, all subsets are implicitly drawn as sub-segments of one polyline, and every itemset is visualized as a parallel-coordinates polyline. The support can be mapped to line color or width, which effectively shows that itemsets sharing common parts have overlapping polylines.

library(arulesViz)
plot(rules)

# scatter plot of association rules
plot(rules_pruned, method="graph", control=list(verbose = FALSE), engine="htmlwidget")
# display association rules
# plot(rules_pruned, method="paracoord", control=list(reorder=TRUE, verbose = FALSE), engine="htmlwidget")

# interactive association network graph
plot(rules_pruned, method="graph", measure = "support", engine="htmlwidget", nodeCol=rainbow(14),
     shading = "lift", control = list(verbose = FALSE))
#  m <- rules2matrix(sort(rules_pruned, by="lift")[1:10], measure = "lift")
#  m
# plot(rules_pruned, method = "matrix", engine = "htmlwidget")
## Grouped matrix
# create a matrix with LHSs grouped in k = 10 groups
# m <- rules2groupedMatrix(sort(rules_pruned, by="lift")[1:10], k = 10)
# m$m
# number of rules per group
# table(m$clustering_rules)
# get rules for group 1
inspect(rules_pruned[m$clustering_rules == 1])
##     lhs                      rhs          support    confidence coverage  
## [1] {sex=female, cabin=B} => {survived=1} 0.02750191 1.0000000  0.02750191
## [2] {pclass=1, fare=high} => {survived=1} 0.02597403 0.7391304  0.03514133
##     lift     count
## [1] 2.618000 36   
## [2] 1.935043 34
# the corresponding plot
plot(rules_pruned, method = "grouped matrix", k = 7, engine = "htmlwidget")

Consider plotting all rules (not only the pruned list of association rules) and examining rules predicting passenger death (specify "survived=0"); a possible sketch follows.
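
A possible sketch for the second part of this exercise, mirroring the earlier apriori() call but constraining the right-hand side to passenger death, is:

# rules predicting passenger death (survived=0), sorted by lift
rules_death <- apriori(dat,
  parameter = list(minlen = 3, supp = 0.02, conf = 0.7),
  appearance = list(rhs = "survived=0", default = "lhs"), control = list(verbose = FALSE))
inspect(head(sort(rules_death, by = "lift"), 10))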

5 Try these TM/NLP techniques for other applications.

6 Transformer NN Models for NLP

Transformers are specific attention-based architectures used in the design of deep neural networks. Transformer-based DNNs learn more quickly compared to recurrent neural networks, e.g., RNNs and long short-term memory (LSTM) networks. As of 2023, transformers are dominant in human language modeling, including DataSifting and statistical obfuscation of medical text, Bidirectional Encoder Representations from Transformers (BERT), and training large language generative-AI models (GAIMs), such as the generative pre-trained transformers (GPTs) used in the SOCR AI Bot.

## https://blogs.rstudio.com/ai/posts/2022-09-29-r-text/
## https://www.r-text.org/articles/huggingface_in_r_extended_installation_guide.html 

################# See Python version/installation instructions #################
# https://rstudio.github.io/reticulate/articles/versions.html
#
# conda_list()
# use_python("C:\\Users\\dinov\\Anaconda3\\envs\\r-reticulate/python.exe")
# use_python("C:\\Users\\dinov\\Anaconda3/python.exe")
################################################################################

# install.packages("text")  # devtools::install_github("oscarkjell/text")
library(text)
library(reticulate)

# Check Conda environment (textrpp_condaenv)
# conda_list()

# May need to install the python 'absl-py' library
# py_install("absl-py", "google", "six", "pandas")
# conda_install(packages="google-api-python-client")
# conda_install(packages="protobuf")
# conda_install(packages="google-api-python-client", python_version = "C:\\Users\\dinov\\Anaconda3\\envs\\textrpp_condaenv/")
# Install text required python packages in a conda environment (with defaults).
# textrpp_install()

# conda_create("r-reticulate")

# Initialize the installed conda environment.
# (Optional) save_profile = TRUE saves the settings for next time

# In Anaconda shell do:
# pip install --target=C:\\Users\\dinov\\Anaconda3\\envs\\textrpp_condaenv/ google-api-python-client
# pip install --target=C:\\Users\\dinov\\Anaconda3\\envs\\textrpp_condaenv/ protobuf
# pip install --target=C:\\Users\\dinov\\Anaconda3\\envs\\textrpp_condaenv grpcio-status==1.33.2 protobuf==3.19.6

text::textrpp_install_virtualenv(
  rpp_version=c("torch==1.7.1", "transformers==4.12.5", "numpy", "nltk", "protobuf"),
  python_path="C:/Users/IvoD/Anaconda3/envs/textrpp_condaenv/python.exe",
  envname = "textrpp_virtualenv")

textrpp_initialize(refresh_settings = TRUE)

# Initialize the installed conda environment.
# save_profile = TRUE saves the settings so that you don't have to run textrpp_initialize() after restarting R. 
textrpp_initialize(save_profile = TRUE)

# Example text
texts <- c("I feel great!")

# Defaults
embeddings <- textEmbed(texts)
embeddings
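The resulting embeddings can be compared numerically, e.g., via cosine similarity. The sketch below assumes (as in recent versions of the text package) that textEmbed() returns a list whose $texts$texts element is a tibble with one embedding row per input sentence; the exact structure may differ across package versions.

# A minimal sketch; the structure embeddings$texts$texts is an assumption that may vary across 'text' versions
texts2 <- c("I feel great!", "I am very happy today.", "The weather ruined our trip.")
emb2 <- textEmbed(texts2)
E <- as.matrix(emb2$texts$texts)   # one row per sentence, one column per embedding dimension
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(E[1, ], E[2, ])   # semantically related sentences should score higher
cosine(E[1, ], E[3, ])   # unrelated sentences should score lower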
library(reticulate)
# use_python("C:/Users/IvoD/Anaconda3/python.exe")
# reticulate::repl_python()

# See: https://tensorflow.rstudio.com/install/
#      May need to install PyTorch and TensorFlow
# install.packages("remotes")
# remotes::install_github("rstudio/tensorflow")
# library(tensorflow)
# following huggingface protocol
# https://huggingface.co/docs/transformers/tasks/question_answering

# Use the Anaconda shell to install the datasets python package
# %> pip install datasets
# %> pip install ipywidgets
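# Note: the code below is Python (e.g., run via reticulate::repl_python() or within a Python code chunk)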

from datasets import load_dataset

# 1. Try a smaller subset of the SQuAD dataset from the Datasets library
squad = load_dataset("squad", split="train[:10000]")

# 2. Split the dataset into train and test sets using train_test_split
squad = squad.train_test_split(test_size=0.2)

# 3. Check some examples
squad["train"][0]  # the result below will change at execution time
# {'id': '56cc62406d243a140015ef7a', 'title': 'IPod', 'context': 'The name iPod was proposed by Vinnie Chieco, a freelance copywriter, who (with others) was called by Apple to figure out how to introduce the new player to the public. After Chieco saw a prototype, he thought of the movie 2001: A Space Odyssey and the phrase "Open the pod bay door, Hal!", which refers to the white EVA Pods of the Discovery One spaceship. Chieco saw an analogy to the relationship between the spaceship and the smaller independent pods in the relationship between a personal computer and the music player. Apple researched the trademark and found that it was already in use. Joseph N. Grasso of New Jersey had originally listed an "iPod" trademark with the U.S. Patent and Trademark Office (USPTO) in July 2000 for Internet kiosks. The first iPod kiosks had been demonstrated to the public in New Jersey in March 1998, and commercial use began in January 2000, but had apparently been discontinued by 2001. The trademark was registered by the USPTO in November 2003, and Grasso assigned it to Apple Computer, Inc. in 2005.', 'question': 'What film inspired the name of the iPod?', 'answers': {'text': ['2001: A Space Odyssey'], 'answer_start': [222]}}

# 4. Preprocessing - load a DistilBERT tokenizer to process the question and context fields
#  Preprocessing steps particular to question answering tasks include:
#    4.1. Some examples in the dataset have a very long context that exceeds the model's maximum input length. To handle longer sequences, truncate only the context by setting truncation="only_second".
#    4.2. Map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
#    4.3. Use the offset mapping to find the start and end tokens of the answer, and use the sequence_ids method to determine which part of the offsets corresponds to the question and which to the context.

# Using the Anaconda shell, install the transformers Python package
# %> pip install transformers

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# 5. Create a function to truncate and map the start and end tokens of the answer in the context of the prompt
def preprocess_function(examples):
  questions = [q.strip() for q in examples["question"]]
  inputs = tokenizer(
    questions,
    examples["context"],
    max_length=384,
    truncation="only_second",
    return_offsets_mapping=True,
    padding="max_length",
  )

  offset_mapping = inputs.pop("offset_mapping")
  answers = examples["answers"]
  start_positions = []
  end_positions = []

  for i, offset in enumerate(offset_mapping):
    answer = answers[i]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)

    # Find the start and end of the context
    idx = 0
    while sequence_ids[idx] != 1:
      idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
      idx += 1
    context_end = idx - 1

    # If the answer is not fully inside the context, label it (0, 0)
    if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
      start_positions.append(0)
      end_positions.append(0)
    else:
      # Otherwise, find the start and end token positions of the answer
      idx = context_start
      while idx <= context_end and offset[idx][0] <= start_char:
        idx += 1
      start_positions.append(idx - 1)

      idx = context_end
      while idx >= context_start and offset[idx][1] >= end_char:
        idx -= 1
      end_positions.append(idx + 1)

  inputs["start_positions"] = start_positions
  inputs["end_positions"] = end_positions
  return inputs