1 Mining Twitter Data

Use the R Data Mining Twitter data or the UM Twitter Decahose data to apply NLP/TM methods and investigate the Twitter corpus.

  • Construct a VCorpus object
  • Clean the VCorpus object
  • Build a document-term matrix (DTM)
  • Compute the TF-IDF (term frequency - inverse document frequency)
  • Use the DTM to construct a word cloud
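
The steps above can be sketched in R as follows. This is a minimal illustration rather than a reference solution: the three-element tweets vector is a hypothetical stand-in for the real Twitter text, and the choice of cleaning steps (lower-casing, removing numbers, punctuation, and English stop words) is an assumption that should be adapted to the actual corpus.

```r
library(tm)         # VCorpus(), tm_map(), DocumentTermMatrix(), weightTfIdf()
library(wordcloud)  # wordcloud()

# Hypothetical stand-in for the tweet text; replace with the real corpus
tweets <- c(
  "Data mining with R is fun! #rstats",
  "Text mining finds patterns in 1000s of tweets",
  "R makes data analysis and text mining easier"
)

# 1. Construct a VCorpus object (one document per tweet)
corpus <- VCorpus(VectorSource(tweets))

# 2. Clean: lower-case, drop numbers, punctuation, stop words, extra spaces
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# 3. Build the document-term matrix (rows = documents, columns = terms)
dtm <- DocumentTermMatrix(corpus)

# 4. Compute the TF-IDF weighted version of the DTM
dtm_tfidf <- weightTfIdf(dtm)

# 5. Word cloud from the raw term frequencies in the DTM
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wordcloud(names(freq), freq, min.freq = 1, random.order = FALSE)
```

For a large corpus, converting the whole DTM with as.matrix() may be too memory-hungry; in that case, consider applying removeSparseTerms() to the DTM first.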

2 Mining Cancer Clinical Notes

Use the Head and Neck Cancer Medication Data to apply NLP/TM methods and investigate the information content. Chapter 5 already covered some preliminary TM analysis; this exercise goes further.

  • Use MEDICATION_SUMMARY to construct a VCorpus object
  • Clean the VCorpus object
  • Build a document-term matrix (DTM)
  • Add a column indicating early or later cancer stage, based on seer_stage (see Chapter 5)
  • Use the DTM to construct a word cloud for early stage, later stage and the entire dataset
  • Interpret the word clouds
  • Compute the TF-IDF (Term Frequency - Inverse Document Frequency)
  • Apply LASSO to the unweighted and the TF-IDF weighted DTM separately, and evaluate the results using AUC
  • Try the cosine similarity transformation, apply LASSO, and compare the results
  • Use other performance measures, such as “class” (misclassification error), as the type.measure argument of cv.glmnet()
  • Does it appear that these classifiers may provide an automated machine interpretation of unstructured free text?
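
The modeling steps above (TF-IDF weighting, cosine transformation, and cross-validated LASSO scored by AUC or by "class") can be sketched as follows. This is only an illustration under stated assumptions: the counts matrix and the stage label are simulated, hypothetical stand-ins for the real DTM and the seer_stage-derived indicator, and the TF-IDF weights are computed by hand (term counts divided by document length, times log2 of the inverse document frequency) rather than through tm.

```r
library(glmnet)

set.seed(1234)
# Hypothetical stand-in for the clinical-notes DTM: 200 documents x 30 terms
n <- 200; p <- 30
counts <- matrix(rpois(n * p, lambda = 1), nrow = n,
                 dimnames = list(NULL, paste0("term", 1:p)))
storage.mode(counts) <- "double"
# Hypothetical binary label (1 = later stage) driven by the first two terms
stage <- rbinom(n, 1, plogis(counts[, 1] - counts[, 2]))

# TF-IDF by hand: tf = count / document length, idf = log2(N / doc frequency)
doc_len <- pmax(rowSums(counts), 1)
idf     <- log2(n / pmax(colSums(counts > 0), 1))
tfidf   <- sweep(counts / doc_len, 2, idf, `*`)

# Cosine transformation: rescale every document (row) to unit Euclidean length
cosine_norm <- function(m) m / pmax(sqrt(rowSums(m^2)), 1e-12)
counts_cos  <- cosine_norm(counts)

# Use the same folds for every model so the comparison is like-for-like
foldid <- sample(rep(1:5, length.out = n))

# Cross-validated LASSO (alpha = 1) logistic regression
fit_lasso <- function(x, measure) {
  cv.glmnet(x, stage, family = "binomial", alpha = 1,
            type.measure = measure, foldid = foldid)
}

cv_raw   <- fit_lasso(counts,     "auc")
cv_tfidf <- fit_lasso(tfidf,      "auc")
cv_cos   <- fit_lasso(counts_cos, "auc")

# For type.measure = "auc", cvm holds the mean CV AUC per lambda (higher is better)
auc <- c(raw = max(cv_raw$cvm), tfidf = max(cv_tfidf$cvm), cosine = max(cv_cos$cvm))
print(round(auc, 3))

# Alternative measure: misclassification error (lower is better)
cv_class <- fit_lasso(counts, "class")
print(min(cv_class$cvm))
```

With the real data, the rows of the DTM and the stage indicator must be aligned document by document before fitting, and a held-out test set would give a more honest performance estimate than the cross-validation curve alone.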

3 Use the SOCR Jobs Data to practice Apriori Association Rule learning

  • Load the Jobs Data
  • Use this guide to load the HTML data
  • Focus on the Description feature. Replace all underscore characters “_” with spaces
  • Save the data using write.csv() and then use read.transactions() from the arules package to read the CSV data file. Visualize the item support using item frequency plots
  • Generate the sparse terms matrix for each job category. Which terms appear to be more popular?
  • Fit a model: myrules <- apriori(data = jobs, parameter = list(support = 0.1, confidence = 0.8, minlen = 1)). Try out several rule thresholds, trading off gain and accuracy
  • Evaluate model performance with lift
  • Try to improve the model performance
  • Sort the set of association rules
  • Investigate associations that may be linked to specific job-description terms.
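
The workflow above can be sketched in R as follows. This is a minimal sketch, not a full solution: the jobs_df data frame is a hypothetical stand-in for the Description feature of the Jobs Data, and treating each space-separated word of a description as one item is an assumption about how the transactions should be defined.

```r
library(arules)

# Hypothetical stand-in for the Description feature of the Jobs Data
jobs_df <- data.frame(Description = c(
  "manage_staff_plan_budget",
  "plan_budget_review_reports",
  "manage_staff_review_reports",
  "manage_staff_plan_budget_review",
  "plan_budget"
))

# Replace all underscore characters with spaces
jobs_df$Description <- gsub("_", " ", jobs_df$Description, fixed = TRUE)

# Save as CSV, then read back as transactions (one description per line;
# skip = 1 drops the header line that write.csv() adds)
csv_file <- tempfile(fileext = ".csv")
write.csv(jobs_df, csv_file, row.names = FALSE, quote = FALSE)
jobs <- read.transactions(csv_file, format = "basket", sep = " ",
                          skip = 1, rm.duplicates = TRUE)

# Visualize item support
itemFrequencyPlot(jobs, support = 0.1)

# Fit the association rules model with the thresholds from the exercise
myrules <- apriori(data = jobs,
                   parameter = list(support = 0.1, confidence = 0.8, minlen = 1))

# Sort by lift and look at the strongest rules
sorted_rules <- sort(myrules, by = "lift")
inspect(head(sorted_rules, 5))

# Investigate rules involving a specific term (here the toy term "budget")
budget_rules <- subset(myrules, items %in% "budget")
inspect(head(sort(budget_rules, by = "lift"), 5))
```

Rerunning apriori() with different support and confidence values, and comparing the number and lift of the resulting rules, is one simple way to explore the threshold trade-off.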