Processing text data for analysis
Load the SOCR 2011 US Job Satisfaction data. The last column (Description) contains free text about each job. Notice that white spaces are replaced by underscores, __. Mine this text field and suggest some meta-data analytics.
- Convert the textual meta-data into a corpus object.
- Clean the corpus: strip irrelevant punctuation and other symbols, convert all text to lower case, etc.
- Tokenize the job descriptions into words and report the sparsity of the resulting word (document-term) matrix.
- Examine the distributions of Stress_Level and Overall_Score.
- Dichotomize Stress_Level and Overall_Score into two categories each by choosing appropriate cutoffs.
- Generate a word cloud to visualize the job description text.
- Visualize the differences between low and high Stress_Level, as well as between low scores (highly ranked jobs) and high scores (poorly ranked jobs).
- Ignore low-frequency words and report the sparsity of your categorical data matrix before and after deleting them.
- Transform the word count features into binary data.
- Apply the Naive Bayes classifier using a binarized version of the job overall ranking score as the outcome. What do you find?
- Apply LDA (linear discriminant analysis) and compare it with Naive Bayes with respect to classification error, specificity, and sensitivity.
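
The core of the steps above (cleaning, tokenizing, measuring sparsity, dropping low-frequency words, binarizing the counts, and fitting Naive Bayes) can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not a solution to the exercise: the six job descriptions and scores in `toy_jobs` are made up for the example and are not taken from the SOCR data, the cutoff of 100 and the `min_docs=2` threshold are arbitrary choices, all function names are hypothetical, and the classifier is a hand-written Bernoulli Naive Bayes with Laplace smoothing rather than any particular library implementation. The word cloud, the distribution plots, and the LDA comparison are not covered here.

```python
import math
import re
from collections import Counter

# Hypothetical toy records (description, overall_score); NOT the SOCR data.
toy_jobs = [
    ("Designs__and_tests software systems, for clients.", 10),
    ("Analyzes data and builds statistical models", 20),
    ("Develops software tools for data analysis", 30),
    ("Lifts heavy loads at outdoor construction sites", 170),
    ("Drives heavy trucks over long outdoor routes", 180),
    ("Repairs roofs at outdoor construction sites", 190),
]

def tokenize(text):
    """Lower-case, turn underscores and punctuation into spaces, split into words."""
    text = text.lower().replace("_", " ")
    return re.sub(r"[^a-z\s]", " ", text).split()

def doc_term_matrix(docs):
    """Return (sorted vocabulary, count matrix) with one row per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted({w for c in counts for w in c})
    return vocab, [[c[w] for w in vocab] for c in counts]

def sparsity(matrix):
    """Fraction of cells in the matrix that are zero."""
    cells = [v for row in matrix for v in row]
    return sum(v == 0 for v in cells) / len(cells)

def drop_rare_terms(vocab, matrix, min_docs):
    """Keep only the terms that occur in at least min_docs documents."""
    keep = [j for j in range(len(vocab))
            if sum(row[j] > 0 for row in matrix) >= min_docs]
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in matrix]

def binarize(matrix):
    """Replace word counts by presence (1) / absence (0) indicators."""
    return [[1 if v > 0 else 0 for v in row] for row in matrix]

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Bernoulli Naive Bayes: per class, a log prior and smoothed P(word present)."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        n = len(rows)
        probs = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*rows)]
        model[label] = (math.log(n / len(y)), probs)
    return model

def predict(model, x):
    """Return the class with the largest log posterior for binary vector x."""
    def log_posterior(label):
        log_prior, probs = model[label]
        return log_prior + sum(math.log(p if v else 1 - p)
                               for v, p in zip(x, probs))
    return max(model, key=log_posterior)

def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

docs = [description for description, _ in toy_jobs]
cutoff = 100  # assumed cutoff; in practice, pick it from the score distribution
labels = [1 if score > cutoff else 0 for _, score in toy_jobs]  # 1 = poorly ranked

vocab, dtm = doc_term_matrix(docs)
sparsity_before = sparsity(dtm)
vocab_kept, dtm_kept = drop_rare_terms(vocab, dtm, min_docs=2)
sparsity_after = sparsity(dtm_kept)

X = binarize(dtm_kept)
model = fit_bernoulli_nb(X, labels)
predictions = [predict(model, x) for x in X]  # evaluated on the training rows
sens, spec = sensitivity_specificity(labels, predictions)

print(f"terms: {len(vocab)} -> {len(vocab_kept)}")
print(f"sparsity: {sparsity_before:.3f} -> {sparsity_after:.3f}")
print(f"sensitivity: {sens:.2f}, specificity: {spec:.2f}")
```

Because the sketch scores the same rows it was trained on, its sensitivity and specificity are optimistic; with the real data, a held-out test split would give a fairer comparison between Naive Bayes and LDA. Stop words such as "and" or "for" are also left in here, and removing them is a natural extra cleaning step.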