Mining Cancer Clinical Notes
Use the Head and Neck Cancer Medication Data to apply NLP/TM methods and investigate its information content. In Chapter 5, we already saw some preliminary TM analysis; now we need to go further.
- Use MEDICATION_SUMMARY to construct a VCorpus object
- Clean the VCorpus object
- Build a document-term matrix (DTM)
- Add a column indicating early or later cancer stage according to seer_stage (refer to Chapter 5)
- Use the DTM to construct word clouds for the early-stage notes, the later-stage notes, and the entire dataset
- Interpret the word clouds
- Compute the TF-IDF (Term Frequency - Inverse Document Frequency)
- Apply LASSO to the unweighted and the TF-IDF-weighted DTM, respectively, and evaluate the results according to AUC
- Try the cosine similarity transformation, apply LASSO, and compare the results
- Use other performance measures, such as “class”, for cv.glmnet()
- Does it appear that these classifiers may provide an automated machine interpretation of unstructured free text?
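As a rough illustration of the TF-IDF and cosine-normalization steps listed above, the following base-R sketch works on a small, invented document-term count matrix rather than on the real MEDICATION_SUMMARY corpus. The term names, the counts, and the object names (counts, tf, idf, tfidf, cos_sim) are assumptions made for illustration. The weighting divides term counts by document length and uses a log2 inverse document frequency, which is one common convention among several.

```r
# Toy document-term count matrix: 3 documents (rows) x 4 terms (columns).
# The terms and counts are invented for illustration only.
counts <- matrix(
  c(2, 1, 0, 0,
    0, 1, 3, 0,
    1, 1, 0, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("pain", "daily", "tablet", "oral"))
)

# Term frequency: each count divided by the total term count of its document.
tf <- counts / rowSums(counts)

# Inverse document frequency:
# log2(number of documents / number of documents containing the term).
n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)
idf      <- log2(n_docs / doc_freq)

# TF-IDF: scale every term column by its IDF.
tfidf <- sweep(tf, 2, idf, "*")

# Cosine (L2) normalization: rescale each document row to unit Euclidean length,
# so the inner product of two rows equals their cosine similarity.
# (A document whose weights are all zero would need special handling here.)
row_norm  <- sqrt(rowSums(tfidf^2))
tfidf_cos <- tfidf / row_norm

# Pairwise cosine similarity between documents.
cos_sim <- tfidf_cos %*% t(tfidf_cos)

print(round(tfidf, 3))
print(round(cos_sim, 3))
```

A term that occurs in every document (here, daily) receives an IDF of zero and drops out of the weighted matrix. With real data, the weighted or normalized matrix would take the place of the raw DTM as the predictor matrix passed to cv.glmnet().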
 
 
Use the SOCR Jobs Data to practice Apriori Association Rule learning
- Load the Jobs Data
- Use this guide to load the HTML data
- Focus on the Description feature. Replace all underscore characters “_” with spaces
- Save the data using write.csv() and then use read.transactions() in the arules package to read the CSV data file. Visualize the item support using item frequency plots
- Generate the sparse term matrix for each job category. Which terms appear to be more popular?
- Fit a model: myrules <- apriori(data = jobs, parameter = list(support = 0.1, confidence = 0.8, minlen = 1)). Try out several rule thresholds, trading off gain and accuracy
- Evaluate model performance with lift
- Try to improve the model performance
- Sort the set of association rules
- Investigate associations that may be linked to specific job-description terms.
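To make the rule-quality measures concrete, the sketch below computes support, confidence, and lift by hand for single-item rules on a tiny, invented term-occurrence table. The terms, the values, and the helper rule_metrics() are hypothetical and are not part of the arules package; with real data, apriori() reports these measures for the rules it finds.

```r
# Toy transaction data: each row is a job description, each column a term.
# TRUE means the term occurs in the description. All values are invented.
trans <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  TRUE,  TRUE,
    TRUE,  FALSE, TRUE,
    FALSE, TRUE,  FALSE,
    TRUE,  TRUE,  FALSE),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("manage", "plan", "repair"))
)

# Quality measures for a single-item rule lhs => rhs (hypothetical helper).
rule_metrics <- function(m, lhs, rhs) {
  supp_lhs  <- mean(m[, lhs])              # P(lhs)
  supp_rhs  <- mean(m[, rhs])              # P(rhs)
  supp_both <- mean(m[, lhs] & m[, rhs])   # P(lhs and rhs), the rule support
  conf      <- supp_both / supp_lhs        # P(rhs | lhs)
  c(support = supp_both, confidence = conf, lift = conf / supp_rhs)
}

print(rule_metrics(trans, "manage", "plan"))
print(rule_metrics(trans, "manage", "repair"))
```

A lift above 1 suggests that the left-hand term makes the right-hand term more likely than its baseline rate, whereas a lift below 1 suggests the opposite; this is why lift is a natural key when sorting the rule set.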