
Use the kNN algorithm to classify the data in the TBI case study (CaseStudy11_TBI). Determine an appropriate *k*, then train and evaluate the performance of the classification model on the data. Report model quality statistics for a couple of different values of *k* and use these to rank-order (and perhaps plot the classification results of) the models.
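A minimal R sketch of this exercise is shown below. It assumes the TBI case-study data are loaded into a data frame `tbi` whose first column, `outcome` (a hypothetical name; substitute the actual class label in CaseStudy11_TBI), is the binary response and whose remaining columns are numeric predictors.

```r
# Sketch only: `tbi` and its `outcome` column are assumed/hypothetical names.
library(class)   # knn()

set.seed(1234)
idx <- sample(nrow(tbi), 2/3 * nrow(tbi))     # hold out 1/3 of cases for testing
X   <- scale(tbi[ , -1])                      # z-score the numeric predictors
y   <- factor(tbi$outcome)

for (k in c(1, 3, 5, 7, 11, 21)) {            # compare several candidate k values
  pred <- knn(train = X[idx, ], test = X[-idx, ], cl = y[idx], k = k)
  cat("k =", k, " test accuracy =", mean(pred == y[-idx]), "\n")
}
```

Accuracy alone may be misleading for imbalanced outcomes, so the reported quality statistics could also include sensitivity and specificity from a confusion matrix, e.g. `table(pred, y[-idx])`.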

Use the 05_PPMI_top_UPDRS_Integrated_LongFormat1 data to practice kNN classification.

- **Preprocess the data**: delete the `index` and `ID` columns; convert the response variable `ResearchGroup` to a binary 0-1 factor; detect `NA` (missing) values and impute them if necessary.
- **Summarize the dataset**: use `str`, `summary`, `cor`, and `ggpairs`.
- **Scale/normalize the data**: as appropriate, rescale to the range 0 to 1; transform via \(log(x+1)\); discretize (0 or 1).
- **Partition the data into training and testing sets**: use `set.seed` and a random `sample`, with train:test = 2:1.
- **Select the optimal \(k\) for each version of the scaled data**: plot an error graph against \(k\) that includes three lines: training error, cross-validation error, and testing error, respectively.
- **What is the impact of \(k\)?**: Formulate a hypothesis about the relation between \(k\) and the error rates. You can try to use `knn.tunning` to verify the results (*Hint*: select the same folds, although you may obtain slightly different results).
- **Interpret the results**: *Hint*: considering the number of dimensions of the data, how many points are necessary to obtain the same density in a 100-dimensional space as in a 1-dimensional space?
- **Report the error rates** for both the training and the testing data. What do you find?
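The \(k\)-selection step above can be sketched in R as follows. This is an illustration only, assuming the preprocessed data sit in a data frame `ppmi` (a placeholder name) whose `ResearchGroup` column is the 0-1 factor response and whose remaining columns are numeric; `knn.cv` provides the leave-one-out cross-validation line.

```r
# Sketch: `ppmi` is an assumed name for the preprocessed PPMI data frame.
library(class)   # knn(), knn.cv()

normalize <- function(x) (x - min(x)) / (max(x) - min(x))   # rescale to [0, 1]
X <- as.data.frame(lapply(ppmi[ , names(ppmi) != "ResearchGroup"], normalize))
y <- ppmi$ResearchGroup

set.seed(2024)
idx <- sample(nrow(X), 2/3 * nrow(X))                       # train:test = 2:1
ks  <- seq(1, 25, by = 2)
train_err <- cv_err <- test_err <- numeric(length(ks))

for (i in seq_along(ks)) {
  train_err[i] <- mean(knn(X[idx, ], X[idx, ],  y[idx], k = ks[i]) != y[idx])
  cv_err[i]    <- mean(knn.cv(X[idx, ], y[idx], k = ks[i]) != y[idx])
  test_err[i]  <- mean(knn(X[idx, ], X[-idx, ], y[idx], k = ks[i]) != y[-idx])
}

# Plot the three error curves against k on one graph
matplot(ks, cbind(train_err, cv_err, test_err), type = "b", pch = 1:3,
        col = 1:3, xlab = "k", ylab = "error rate")
legend("topright", c("training", "LOO cross-validation", "testing"),
       pch = 1:3, col = 1:3)
```

The same loop can be rerun on each scaled version of the data (original, normalized, \(log(x+1)\)-transformed, discretized) to compare the optimal \(k\) across scalings.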

Try the above protocol again, but select only columns 1 to 5 as predictors (after deleting the index and ID columns). Now, what about the \(k\) you select and the error rates for each kind of scaled data (original data, normalized data)? Comment on any interesting observations.

- Bayes Theorem
- Laplace Estimation

Load the SOCR 2011 US Job Satisfaction data. The last column (`Description`) contains free text describing each job type. Notice that spaces are replaced by underscores, `__`. To mine the text field and suggest some meta-data analytics, construct an R protocol for:

- Convert the textual meta-data into a corpus object.
- Triage some of the irrelevant punctuation and other symbols in the corpus document, change all text to lowercase, etc.
- Tokenize the job descriptions into words. Examine the distributions of `Stress_Category` and `Hiring_Potential`.
- Classify the job stress into two categories.
- Generate a word cloud to visualize the job description text.
- Graphically visualize the difference between low and high `Stress_Category`.
- Transform the word count features into categorical data.
- Ignore the low-frequency words and report the sparsity of your categorical data matrix with and without deleting those low-frequency words. Note that the `sparsity` of a matrix is the fraction: \(Sparsity(A) =\frac{\text{number of zero-valued elements}}{\text{total number of matrix elements (} m\times n\text{)}}\).
- Apply the Naive Bayes classifier to the original matrix and to the lower-dimensional matrix; what do you observe?
- Apply and compare the LDA and Naive Bayes classifiers with respect to error, specificity, and sensitivity.
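The core of this protocol can be sketched in R as follows, assuming the job data are loaded into a data frame `jobs` (a placeholder name) with the free-text `Description` column and a two-level factor `Stress_Category`; the `tm` and `e1071` packages supply the corpus tools and the Naive Bayes classifier.

```r
# Sketch only: `jobs` is an assumed name for the loaded Job Satisfaction data.
library(tm)      # corpus construction and document-term matrices
library(e1071)   # naiveBayes()

corpus <- VCorpus(VectorSource(jobs$Description))
corpus <- tm_map(corpus, content_transformer(function(x) gsub("_", " ", x)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

dtm <- DocumentTermMatrix(corpus)
sparsity  <- sum(as.matrix(dtm) == 0) / prod(dim(dtm))  # fraction of zero cells
dtm_small <- removeSparseTerms(dtm, 0.95)               # drop low-frequency words

# Binarize the word counts into Yes/No factors, then fit Naive Bayes
# with Laplace smoothing (laplace = 1)
to_factor <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))
X <- as.data.frame(lapply(as.data.frame(as.matrix(dtm_small)), to_factor))
fit <- naiveBayes(X, jobs$Stress_Category, laplace = 1)
table(predict(fit, X), jobs$Stress_Category)            # confusion matrix
```

Recomputing `sparsity` on `dtm_small` shows the effect of deleting the low-frequency terms, and refitting on the full `dtm` versus `dtm_small` gives the original-matrix versus lower-dimensional comparison the exercise asks about.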

- Information Gain Measure
- Impurity
- Entropy
- Gini

Use the SOCR Neonatal Pain data to build and display a decision tree that recursively partitions the data, using the provided features and attributes to split the data into clusters.

- Create two classes using the variable `Cluster`.
- Create random training and test datasets.
- Train a decision tree model on the data using `C5.0` and `rpart`, separately.
- Evaluate the model performance and compare the `C5.0` and `rpart` results.
- Tune the parameters for `rpart` and evaluate again.
- Make predictions on the testing data and assess the prediction accuracy; report the confusion matrix.
- Comment on the classification performance.
- Try to apply Random Forest classification: report the variable importance plot, make predictions on the testing data, and assess the prediction accuracy.
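The steps above can be sketched in R as follows; this is a minimal illustration assuming the Neonatal Pain data are in a data frame `pain` (a placeholder name) with a two-level factor `Cluster` as the outcome and using the `C50`, `rpart`, and `randomForest` packages.

```r
# Sketch only: `pain` and its `Cluster` outcome are assumed names.
library(C50)           # C5.0()
library(rpart)         # rpart()
library(randomForest)  # randomForest(), varImpPlot()

set.seed(11)
idx   <- sample(nrow(pain), 0.75 * nrow(pain))
train <- pain[idx, ]
test  <- pain[-idx, ]

# Fit the two decision-tree models separately
fit_c50 <- C5.0(Cluster ~ ., data = train)
fit_rp  <- rpart(Cluster ~ ., data = train,
                 control = rpart.control(cp = 0.01))   # cp is a tuning knob

# Compare test-set accuracy and report confusion matrices
pred_c50 <- predict(fit_c50, test, type = "class")
pred_rp  <- predict(fit_rp,  test, type = "class")
cat("C5.0 accuracy: ", mean(pred_c50 == test$Cluster), "\n")
cat("rpart accuracy:", mean(pred_rp  == test$Cluster), "\n")
table(pred_rp, test$Cluster)

# Random forest: variable importance plot and test-set predictions
fit_rf <- randomForest(Cluster ~ ., data = train, importance = TRUE)
varImpPlot(fit_rf)
table(predict(fit_rf, test), test$Cluster)
```

Lowering `cp` (or pruning with `prune()` after inspecting `printcp(fit_rp)`) is one way to carry out the `rpart` tuning step before the final evaluation.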