
Use the kNN algorithm to classify the data in the TBI case study (CaseStudy11_TBI). Determine an appropriate *k*, then train and evaluate the performance of the classification model on the data. Report model quality statistics for a couple of different values of *k* and use these to rank-order (and perhaps plot the classification results of) the models.
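A minimal R sketch of this exercise is shown below. It assumes the TBI case-study data are loaded into a data frame `tbi` whose first column, `outcome` (a hypothetical name; substitute the actual class label in CaseStudy11_TBI), is the binary response and whose remaining columns are numeric predictors.

```r
# Sketch only: `tbi` and its `outcome` column are assumed/hypothetical names.
library(class)   # knn()

set.seed(1234)
idx <- sample(nrow(tbi), 2/3 * nrow(tbi))     # hold out 1/3 of cases for testing
X   <- scale(tbi[ , -1])                      # z-score the numeric predictors
y   <- factor(tbi$outcome)

for (k in c(1, 3, 5, 7, 11, 21)) {            # compare several candidate k values
  pred <- knn(train = X[idx, ], test = X[-idx, ], cl = y[idx], k = k)
  cat("k =", k, " test accuracy =", mean(pred == y[-idx]), "\n")
}
```

Accuracy alone may be misleading for imbalanced outcomes, so the reported quality statistics could also include sensitivity and specificity from a confusion matrix, e.g. `table(pred, y[-idx])`.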

Use the 05_PPMI_top_UPDRS_Integrated_LongFormat1 data to practice kNN classification.

- **Preprocess the data**: delete the `index` and `ID` columns; convert the response variable `ResearchGroup` to a binary 0-1 factor; detect `NA` (missing) values and impute them if necessary.
- **Summarize the dataset**: use `str`, `summary`, `cor`, and `ggpairs`.
- **Scale/normalize the data**: as appropriate, rescale to the range 0 to 1; transform via \(log(x+1)\); discretize (0 or 1).
- **Partition the data into training and testing sets**: use `set.seed` and a random `sample`, with train:test = 2:1.
- **Select the optimal \(k\) for each version of the scaled data**: plot an error graph against \(k\) that includes three lines: training error, cross-validation error, and testing error, respectively.
- **What is the impact of \(k\)?**: Formulate a hypothesis about the relation between \(k\) and the error rates. You can try to use `knn.tunning` to verify the results (*Hint*: select the same folds, although you may obtain slightly different results).
- **Interpret the results**: *Hint*: considering the number of dimensions of the data, how many points are necessary to obtain the same density in a 100-dimensional space as in a 1-dimensional space?
- **Report the error rates** for both the training and the testing data. What do you find?
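The \(k\)-selection step above can be sketched in R as follows. This is an illustration only, assuming the preprocessed data sit in a data frame `ppmi` (a placeholder name) whose `ResearchGroup` column is the 0-1 factor response and whose remaining columns are numeric; `knn.cv` provides the leave-one-out cross-validation line.

```r
# Sketch: `ppmi` is an assumed name for the preprocessed PPMI data frame.
library(class)   # knn(), knn.cv()

normalize <- function(x) (x - min(x)) / (max(x) - min(x))   # rescale to [0, 1]
X <- as.data.frame(lapply(ppmi[ , names(ppmi) != "ResearchGroup"], normalize))
y <- ppmi$ResearchGroup

set.seed(2024)
idx <- sample(nrow(X), 2/3 * nrow(X))                       # train:test = 2:1
ks  <- seq(1, 25, by = 2)
train_err <- cv_err <- test_err <- numeric(length(ks))

for (i in seq_along(ks)) {
  train_err[i] <- mean(knn(X[idx, ], X[idx, ],  y[idx], k = ks[i]) != y[idx])
  cv_err[i]    <- mean(knn.cv(X[idx, ], y[idx], k = ks[i]) != y[idx])
  test_err[i]  <- mean(knn(X[idx, ], X[-idx, ], y[idx], k = ks[i]) != y[-idx])
}

# Plot the three error curves against k on one graph
matplot(ks, cbind(train_err, cv_err, test_err), type = "b", pch = 1:3,
        col = 1:3, xlab = "k", ylab = "error rate")
legend("topright", c("training", "LOO cross-validation", "testing"),
       pch = 1:3, col = 1:3)
```

The same loop can be rerun on each scaled version of the data (original, normalized, \(log(x+1)\)-transformed, discretized) to compare the optimal \(k\) across scalings.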

Try the above protocol again, but select only columns 1 to 5 as predictors (after deleting the index and ID columns). Now, what about the \(k\) you select and the error rates for each kind of scaled data (original data, normalized data)? Comment on any interesting observations.

- Bayes Theorem
- Laplace Estimation

Load the SOCR 2011 US Job Satisfaction data. The last column (`Description`) contains free text describing each job type. Notice that spaces are replaced by underscores, `__`. To mine the text field and suggest some meta-data analytics, construct an R protocol for:

- Convert the textual meta-data into a corpus object.
- Triage some of the irrelevant punctuation and other symbols in the corpus document, change all text to lowercase, etc.
- Tokenize the job descriptions into words. Examine the distributions of `Stress_Category` and `Hiring_Potential`.
- Classify the job stress into two categories.
- Generate a word cloud to visualize the job description text.
- Graphically visualize the difference between low and high `Stress_Category`.
- Transform the word count features into categorical data.
- Ignore the low-frequency words and report the sparsity of your categorical data matrix with and without deleting those low-frequency words. Note that the `sparsity` of a matrix is the fraction: \(Sparsity(A) =\frac{\text{number of zero-valued elements}}{\text{total number of matrix elements (} m\times n\text{)}}\).
- Apply the Naive Bayes classifier to the original matrix and to the lower-dimensional matrix; what do you observe?
- Apply and compare the LDA and Naive Bayes classifiers with respect to error, specificity, and sensitivity.
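The core of this protocol can be sketched in R as follows, assuming the job data are loaded into a data frame `jobs` (a placeholder name) with the free-text `Description` column and a two-level factor `Stress_Category`; the `tm` and `e1071` packages supply the corpus tools and the Naive Bayes classifier.

```r
# Sketch only: `jobs` is an assumed name for the loaded Job Satisfaction data.
library(tm)      # corpus construction and document-term matrices
library(e1071)   # naiveBayes()

corpus <- VCorpus(VectorSource(jobs$Description))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("_", " ", x)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

dtm <- DocumentTermMatrix(corpus)
sparsity  <- sum(as.matrix(dtm) == 0) / prod(dim(dtm))  # fraction of zero cells
dtm_small <- removeSparseTerms(dtm, 0.95)               # drop low-frequency words

# Binarize the word counts into Yes/No factors, then fit Naive Bayes
# with Laplace smoothing (laplace = 1)
to_factor <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))
X <- as.data.frame(lapply(as.data.frame(as.matrix(dtm_small)), to_factor))
fit <- naiveBayes(X, jobs$Stress_Category, laplace = 1)
table(predict(fit, X), jobs$Stress_Category)            # confusion matrix
```

Recomputing `sparsity` on `dtm_small` shows the effect of deleting the low-frequency terms, and refitting on the full `dtm` versus `dtm_small` gives the original-matrix versus lower-dimensional comparison the exercise asks about.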

- Information Gain Measure
- Impurity
- Entropy
- Gini

Use the SOCR Neonatal Pain data to build and display a decision tree that recursively partitions the data, using the provided features and attributes to split the data into clusters.

- Create two classes using the variable `Cluster`.
- Create random training and test datasets.
- Train a decision tree model on the data using `C5.0` and `rpart`, separately.
- Evaluate the model performance and compare the `C5.0` and `rpart` results.
- Tune the parameters for `rpart` and evaluate again.
- Make predictions on the testing data and assess the prediction accuracy; report the confusion matrix.
- Comment on the classification performance.
- Try to apply Random Forest classification: report the variable importance plot, make predictions on the testing data, and assess the prediction accuracy.
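The steps above can be sketched in R as follows; this is a minimal illustration assuming the Neonatal Pain data are in a data frame `pain` (a placeholder name) with a two-level factor `Cluster` as the outcome and using the `C50`, `rpart`, and `randomForest` packages.

```r
# Sketch only: `pain` and its `Cluster` outcome are assumed names.
library(C50)           # C5.0()
library(rpart)         # rpart()
library(randomForest)  # randomForest(), varImpPlot()

set.seed(11)
idx   <- sample(nrow(pain), 0.75 * nrow(pain))
train <- pain[idx, ]
test  <- pain[-idx, ]

# Fit the two decision-tree models separately
fit_c50 <- C5.0(Cluster ~ ., data = train)
fit_rp  <- rpart(Cluster ~ ., data = train,
                 control = rpart.control(cp = 0.01))   # cp is a tuning knob

# Compare test-set accuracy and report confusion matrices
pred_c50 <- predict(fit_c50, test, type = "class")
pred_rp  <- predict(fit_rp,  test, type = "class")
cat("C5.0 accuracy: ", mean(pred_c50 == test$Cluster), "\n")
cat("rpart accuracy:", mean(pred_rp  == test$Cluster), "\n")
table(pred_rp, test$Cluster)

# Random forest: variable importance plot and test-set predictions
fit_rf <- randomForest(Cluster ~ ., data = train, importance = TRUE)
varImpPlot(fit_rf)
table(predict(fit_rf, test), test$Cluster)
```

Lowering `cp` (or pruning with `prune()` after inspecting `printcp(fit_rp)`) is one way to carry out the `rpart` tuning step before the final evaluation.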