Processing text data for analysis
Load the SOCR 2011 US Job Satisfaction data. The last column (Description) contains free text describing each job type. Notice that spaces are replaced by underscores, __. To mine the text field and suggest some meta-data analytics, construct an R protocol that performs the following steps:
- Convert the textual meta-data into a corpus object.
- Remove irrelevant punctuation and other symbols from the corpus documents, convert all text to lower case, etc.
- Tokenize the job descriptions into words. Examine the distributions of Stress_Category and Hiring_Potential.
- Classify the job Stress into two categories.
- Generate a word cloud to visualize the job description text.
- Graphically visualize the difference between the low and high Stress_Category groups.
- Transform the word count features into categorical data.
- Ignore the low-frequency words and report the sparsity of your categorical data matrix with and without deleting those low-frequency words. Note that the sparsity of a matrix is the fraction: \(Sparsity(A) =\frac{\text{number of zero-valued elements}}{\text{total number of matrix elements (} m\times n\text{)}}\).
- Apply the Naive Bayes classifier to the original matrix and to the lower-dimensional matrix. What do you observe?
- Apply and compare LDA and Naive Bayes classifiers with respect to the error, specificity and sensitivity.
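
As a starting point, several of the steps above (cleaning, tokenizing, counting words, computing sparsity, dropping low-frequency words, and scoring a classifier) can be sketched in base R. This is only a minimal illustration under stated assumptions: the three descriptions, the label vectors, the frequency cut-off of 2 documents, and the helper names `sparsity` and `class_metrics` are all hypothetical and are not part of the SOCR data. It does not build a corpus object or fit any classifier; in a full protocol, packages such as tm, wordcloud, e1071 and MASS are typical choices for those steps.

```r
# Hypothetical stand-in for the Description column (NOT the SOCR data)
desc <- c("Research_and_design_software_systems",
          "Design_and_analyze_statistical_surveys",
          "Teach_students_and_design_courses")

# Clean: replace punctuation (including underscores) with spaces, lower-case
clean <- tolower(gsub("[[:punct:]]+", " ", desc))

# Tokenize each description on whitespace
tokens <- strsplit(trimws(clean), "\\s+")

# Document-term count matrix: one row per description, one column per word
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens,
                function(w) as.integer(table(factor(w, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

# Sparsity = number of zero-valued elements / total number of elements (m x n)
sparsity <- function(m) sum(m == 0) / length(m)

# Drop low-frequency words: keep words that occur in at least 2 descriptions
# (the threshold 2 is an arbitrary choice for this toy example)
keep <- colSums(dtm > 0) >= 2
dtm_small <- dtm[, keep, drop = FALSE]

# Turn word counts into categorical (presence/absence) features
dtm_cat <- ifelse(dtm_small > 0, "Yes", "No")

# Error rate, sensitivity and specificity from true and predicted labels
class_metrics <- function(truth, pred, positive) {
  tp <- sum(truth == positive & pred == positive)
  tn <- sum(truth != positive & pred != positive)
  fp <- sum(truth != positive & pred == positive)
  fn <- sum(truth == positive & pred != positive)
  c(error       = (fp + fn) / length(truth),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Hypothetical labels, only to exercise class_metrics()
truth <- c("high", "high", "low", "low", "high", "low")
pred  <- c("high", "low",  "low", "low", "high", "high")

print(sparsity(dtm))        # full matrix
print(sparsity(dtm_small))  # after removing low-frequency words
print(class_metrics(truth, pred, positive = "high"))
```

On this toy input the full matrix has 3 rows and 11 columns with 18 zero entries, so its sparsity is 18/33, while the reduced matrix keeps only the two words shared by all descriptions and has sparsity 0. The same `class_metrics` helper could be applied to the predictions of each classifier to compare them on equal terms.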