1 Mining Twitter Data

Use the R Data Mining Twitter data or the UM Twitter Decahose data to apply NLP/TM methods and investigate the Twitter corpus.

Construct a VCorpus object
Clean the VCorpus object
Build document-term matrix (DTM)
Compute the TF-IDF (term frequency - inverse document frequency)
Use the DTM to construct a word cloud.
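
A minimal sketch of the five steps above, assuming the tm package is installed and using three made-up tweets (the `tweets` vector is hypothetical and only stands in for the real Twitter data). The word cloud call is guarded because it assumes the optional wordcloud package.

```r
library(tm)

# Hypothetical toy tweets standing in for the real Twitter corpus
tweets <- c(
  "Data mining with R is fun! #rstats",
  "Text mining and NLP in R: building a corpus",
  "R makes data science and text mining easier"
)

# 1. Construct a VCorpus object
corpus <- VCorpus(VectorSource(tweets))

# 2. Clean: lower-case, then drop punctuation, numbers, stop words, extra spaces
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# 3. Document-term matrix of raw term counts
#    (by default tm keeps only terms with at least 3 characters)
dtm <- DocumentTermMatrix(corpus)

# 4. TF-IDF weighted version of the same matrix
dtm_tfidf <- weightTfIdf(dtm)

# 5. Overall term frequencies, then a word cloud if wordcloud is available
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(freq), freq = freq, min.freq = 1)
}
```

On a real corpus the cleaning steps usually need tuning, for example stripping URLs and user handles before removing punctuation.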

2 Mining Cancer Clinical Notes

Use Head and Neck Cancer Medication Data to apply NLP/TM methods and investigate the information content. In Chapter 5, we already saw some preliminary TM analysis. Now we need to go further.

Use MEDICATION_SUMMARY to construct a VCorpus object
Clean the VCorpus object
Build a document term matrix (DTM)
Add a column to indicate early and later cancer stage according to seer_stage (refer to Chapter 5)
Use the DTM to construct a word cloud for early stage, later stage and the entire dataset
Interpret the word clouds
Compute the TF-IDF (Term Frequency - Inverse Document Frequency)
Apply LASSO on the unweighted and weighted DTM respectively and evaluate the results according to AUC
Try the cosine similarity transformation, apply LASSO, and compare the results
Try other type.measure options, such as “class”, in cv.glmnet()
Does it appear that these classifiers may provide an automated machine interpretation of unstructured free text?
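
The LASSO comparison above could be sketched as follows, assuming the glmnet package. Everything here is simulated: `x` is a hypothetical count matrix standing in for `as.matrix(dtm)`, `y` is a hypothetical early/later stage label, and `cos_dist()` is an illustrative helper that uses the document-by-document cosine distance matrix as the feature set, which is one possible reading of the cosine transformation.

```r
library(glmnet)

set.seed(1234)
n <- 200  # documents
p <- 20   # terms

# Hypothetical term counts standing in for as.matrix(dtm)
x <- matrix(rpois(n * p, lambda = 1), nrow = n,
            dimnames = list(NULL, paste0("term", 1:p)))
# Hypothetical binary label (0 = early, 1 = later stage) driven by two terms
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

# TF-IDF by hand: length-normalized term frequency times log2(N / doc frequency)
idf <- log2(n / pmax(colSums(x > 0), 1))
x_tfidf <- sweep(x / rowSums(x), 2, idf, `*`)

# Cosine distance between every pair of documents (n x n matrix)
cos_dist <- function(m) {
  norms <- sqrt(rowSums(m^2))
  1 - tcrossprod(m) / outer(norms, norms)
}
x_cos <- cos_dist(x)

# Cross-validated LASSO (alpha = 1 is the glmnet default) scored by AUC
fit_raw   <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")
fit_tfidf <- cv.glmnet(x_tfidf, y, family = "binomial", type.measure = "auc")
fit_cos   <- cv.glmnet(x_cos, y, family = "binomial", type.measure = "auc")

# Same model scored by misclassification error instead of AUC
fit_class <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

# With "auc", cvm holds the cross-validated AUC, so larger is better
c(raw = max(fit_raw$cvm), tfidf = max(fit_tfidf$cvm), cosine = max(fit_cos$cvm))
# With "class", cvm holds the error rate, so smaller is better
min(fit_class$cvm)
```

Documents with no terms at all would produce a zero norm and NaN distances, so empty rows should be dropped from a real DTM before calling `cos_dist()`.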

3 Use the SOCR Jobs Data to practice Apriori Association Rule learning

Load the Jobs Data
Use this guide to load HTML data
Focus on the Description feature. Replace all underscore characters “_” with spaces
Save the data using write.csv() and then use read.transactions() from the arules package to read the CSV data file. Visualize the item support using item frequency plots
Generate the sparse terms matrix for each job category. What terms appear as more popular?
Fit a model: myrules <- apriori(data = jobs, parameter = list(support = 0.1, confidence = 0.8, minlen = 1)). Try out several rule thresholds trading off gain and accuracy
Evaluate model performance with lift
Try to improve the model performance
Sort the set of association rules
Investigate associations that may be linked to specific job-description terms.
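
A minimal sketch of the Apriori workflow above, assuming the arules package. The five comma-separated lines are made-up job-description terms, not the SOCR Jobs Data, and they are written with writeLines() rather than write.csv() to keep the toy file simple. The thresholds are illustrative, and minlen = 2 is used here only to skip rules with an empty left-hand side.

```r
library(arules)

# Hypothetical job descriptions: one transaction per line, terms as items
toy_lines <- c(
  "data,analysis,statistics",
  "data,analysis,software",
  "software,design,testing",
  "data,statistics,research",
  "data,analysis,statistics,research"
)
csv_file <- tempfile(fileext = ".csv")
writeLines(toy_lines, csv_file)

# Read the file as basket-format transactions
jobs <- read.transactions(csv_file, sep = ",")

# Visualize item support
itemFrequencyPlot(jobs, topN = 5)

# Fit association rules
myrules <- apriori(data = jobs,
                   parameter = list(support = 0.2, confidence = 0.8, minlen = 2))

# Sort by lift and look at the strongest rules
rules_by_lift <- sort(myrules, by = "lift")
inspect(head(rules_by_lift, 5))

# Keep only rules that involve a specific term
data_rules <- subset(myrules, items %in% "data")
inspect(data_rules)
```

Lowering support or confidence yields more rules at the cost of weaker ones, so it helps to rerun apriori() over a small grid of thresholds and compare the rule counts and lift values.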

DSPA2: Data Science and Predictive Analytics (UMich HS650)

Assignment 7: Text Mining, Natural Language Processing, Apriori Association Rules Learning

SOCR/MIDAS (Ivo Dinov)

March 2022
