
1 Explain these two concepts

  • Bayes Theorem
  • Laplace Estimation
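
As a reference point for these two concepts, their usual textbook forms can be written as below. The notation is an assumption introduced here for illustration, not taken from the course material: C is a class, x an observed feature vector, n_{v,C} the number of class-C training cases in which feature x_j takes value v, n_C the number of class-C cases, k the number of distinct values x_j can take, and \alpha a smoothing constant (\alpha = 1 is the classic add-one Laplace estimate).

```latex
% Bayes theorem: posterior = likelihood * prior / evidence
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}

% Laplace (add-\alpha) estimate of a conditional probability
\hat{P}(x_j = v \mid C) = \frac{n_{v,C} + \alpha}{n_C + \alpha\, k}
```

The smoothing term keeps an unseen feature value from receiving probability zero, which would otherwise force the whole product of conditional probabilities in a Naive Bayes posterior to zero.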

2 Processing text data for analysis

Load the SOCR 2011 US Job Satisfaction data. The last column (Description) contains free text about each job. Notice that white spaces are replaced by underscores, __. Mine this text field and suggest some meta-data analytics.

  • Convert the textual meta-data into a corpus object.
  • Remove irrelevant punctuation and other symbols from the corpus documents, convert all text to lower case, etc.
  • Tokenize the job descriptions into words and report the sparsity of the resulting document-term matrix.
  • Examine the distributions of Stress_Level and Overall_Score.
  • Dichotomize Stress_Level and Overall_Score into two categories each by choosing appropriate cutoffs.
  • Generate a word cloud to visualize the job description text.
  • Visualize the differences between low and high Stress_Level, as well as between low Overall_Score (highly ranked jobs) and high Overall_Score (poorly ranked jobs).
  • Ignore low frequency words and report the sparsity of your categorical data matrix before and after deleting the low frequency words.
  • Transform the word count features into binary data.
  • Apply the Naive Bayes classifier using a binarized version of the job overall ranking score as the outcome. What do you find?
  • Apply linear discriminant analysis (LDA), and compare it with Naive Bayes with respect to classification error, specificity and sensitivity.
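
The text-processing and Naive Bayes steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the intended solution: the six `records` are hypothetical stand-ins for the Description and Overall_Score columns (not the SOCR data), the median cutoff and the `min_docs` threshold are arbitrary choices, and the classifier is a hand-written Bernoulli Naive Bayes with add-one smoothing evaluated on its own training data. The word cloud, the distribution plots and the LDA comparison are not covered.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for (Description, Overall_Score) pairs; not the SOCR data.
records = [
    ("Design_and_test_software_systems,_analyze_data", 10),
    ("Analyze_data_and_build_statistical_models", 25),
    ("Design_software_and_analyze_user_data", 40),
    ("Lift_heavy_loads_outdoors_in_bad_weather", 160),
    ("Work_outdoors,_lift_and_haul_heavy_loads", 180),
    ("Haul_heavy_loads_in_bad_weather", 195),
]

def tokenize(text):
    """Undo the underscore encoding, lower-case, and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.replace("_", " ").lower())

docs = [tokenize(text) for text, _ in records]
scores = [score for _, score in records]

# Document-term matrix of raw counts, one row per job description.
vocab = sorted({w for d in docs for w in d})
counts = [Counter(d) for d in docs]
dtm = [[c[w] for w in vocab] for c in counts]

def sparsity(matrix):
    """Fraction of zero entries in the matrix."""
    cells = [v for row in matrix for v in row]
    return sum(v == 0 for v in cells) / len(cells)

# Drop terms that appear in fewer than min_docs documents (arbitrary cutoff).
min_docs = 2
keep = [j for j in range(len(vocab)) if sum(row[j] > 0 for row in dtm) >= min_docs]
vocab_kept = [vocab[j] for j in keep]
binary = [[int(row[j] > 0) for j in keep] for row in dtm]  # counts -> 0/1 indicators

# Binarize the outcome at the median score: 1 = high score (poorly ranked job).
cutoff = sorted(scores)[len(scores) // 2]
labels = [int(s >= cutoff) for s in scores]

def train_nb(X, y, alpha=1.0):
    """Bernoulli Naive Bayes with Laplace (add-alpha) smoothing."""
    model = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        prior = len(rows) / len(X)
        # P(word present | class), smoothed over the two outcomes present/absent
        probs = [(sum(col) + alpha) / (len(rows) + 2 * alpha) for col in zip(*rows)]
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    """Return the class with the highest log-posterior for one binary row."""
    def log_post(c):
        prior, probs = model[c]
        return math.log(prior) + sum(
            math.log(p if xi else 1 - p) for xi, p in zip(x, probs))
    return max(model, key=log_post)

model = train_nb(binary, labels)
pred = [predict_nb(model, x) for x in binary]  # in-sample predictions only

# Confusion-matrix summaries, treating label 1 as the "positive" class.
tp = sum(p == 1 and t == 1 for p, t in zip(pred, labels))
tn = sum(p == 0 and t == 0 for p, t in zip(pred, labels))
sensitivity = tp / sum(labels)
specificity = tn / (len(labels) - sum(labels))
error = sum(p != t for p, t in zip(pred, labels)) / len(labels)

print(f"sparsity before: {sparsity(dtm):.2f}, after: {sparsity(binary):.2f}")
print(f"error={error:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

On the real data, the in-sample evaluation should be replaced by a train/test split, since error rates measured on the training rows are optimistic. The same sensitivity and specificity summaries can then be computed for the LDA predictions to make the comparison asked for in the last step.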
