Use the kNN algorithm to classify the data in the TBI case study (CaseStudy11_TBI). Determine an appropriate k, then train the classification model and evaluate its performance on the data. Report some model quality statistics for a couple of different values of k and use these to rank-order the models (and perhaps plot their classification results).
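As a rough illustration of the "model quality statistics" step, the sketch below computes accuracy, sensitivity, and specificity from true and predicted binary labels and ranks two models by accuracy. It is written in plain Python for illustration only; the function name `quality`, the toy label vectors, and the choice of accuracy as the ranking criterion are all assumptions, not part of the case study.

```python
def quality(truth, pred, positive=1):
    """Return accuracy, sensitivity, and specificity for binary labels."""
    pairs = list(zip(truth, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

# Hypothetical test labels and predictions from two kNN models (k = 1, k = 5).
truth = [1, 1, 1, 0, 0, 0, 0, 1]
preds = {
    1: [1, 0, 1, 0, 1, 0, 0, 1],
    5: [1, 1, 1, 0, 0, 0, 1, 1],
}

stats = {k: quality(truth, p) for k, p in preds.items()}

# Rank the models from best to worst accuracy (one possible criterion).
ranking = sorted(stats, key=lambda k: stats[k]["accuracy"], reverse=True)
for k in ranking:
    print(k, stats[k])
```

Ranking by a single statistic is a simplification; with unbalanced classes, sensitivity and specificity may order the models differently.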
Use the 05_PPMI_top_UPDRS_Integrated_LongFormat1 data to practice kNN classification.
Preprocess the data: delete the index and ID columns; convert the response variable ResearchGroup to a binary 0-1 factor; detect NA (missing) values (impute if necessary)
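The preprocessing steps above can be sketched as follows, in plain Python for illustration. Only the column names index, ID, and ResearchGroup come from the text; the toy rows, the predictor name `x1`, the choice of "PD" as the positive class, and mean imputation are assumptions made for this example.

```python
from statistics import mean

# Hypothetical toy rows standing in for the real dataset.
rows = [
    {"index": 1, "ID": 3001, "ResearchGroup": "PD", "x1": 2.0},
    {"index": 2, "ID": 3002, "ResearchGroup": "Control", "x1": None},
    {"index": 3, "ID": 3003, "ResearchGroup": "PD", "x1": 4.0},
]

# 1. Drop the index and ID columns.
clean = [{k: v for k, v in r.items() if k not in ("index", "ID")} for r in rows]

# 2. Recode the response as 0/1 (assumption: "PD" is the positive class).
for r in clean:
    r["ResearchGroup"] = 1 if r["ResearchGroup"] == "PD" else 0

# 3. Count missing values per predictor, then mean-impute them.
predictors = [k for k in clean[0] if k != "ResearchGroup"]
na_counts = {k: sum(r[k] is None for r in clean) for k in predictors}
for k in predictors:
    observed = [r[k] for r in clean if r[k] is not None]
    fill = mean(observed)
    for r in clean:
        if r[k] is None:
            r[k] = fill

print(na_counts)
print(clean)
```

Mean imputation is only one option; whether to impute at all depends on how many values are missing and in which columns.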
Summarize the dataset: use str, summary, cor, ggpairs
Scale/Normalize the data: as appropriate, scale to the range [0, 1]; apply the transform \(\log(x+1)\); discretize (0 or 1)
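The three rescaling options listed above might look like this for a single numeric feature. This is a minimal sketch in plain Python; the helper names, the sample values, and the discretization threshold are assumptions, since the text does not specify a cut point.

```python
import math

def min_max(xs):
    """Rescale values linearly to [0, 1]; a constant column maps to 0."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def log1p_transform(xs):
    """Apply log(x + 1) to each value; requires x > -1."""
    return [math.log1p(x) for x in xs]

def discretize(xs, threshold):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [1 if x > threshold else 0 for x in xs]

x = [0.0, 5.0, 10.0, 20.0]             # hypothetical feature values
scaled = min_max(x)
logged = log1p_transform(x)
binary = discretize(x, threshold=7.5)  # threshold chosen arbitrarily

print(scaled, logged, binary)
```

Because kNN relies on distances, features on larger numeric scales dominate the neighbor search unless they are rescaled, which is why each variant is worth comparing.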
Partition the data into training and testing sets: use set.seed and random sample, train:test = 2:1
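A seeded 2:1 random partition can be sketched as below. This is plain Python for illustration, not the R set.seed and sample calls named in the text; the function name `train_test_split`, the seed value, and the 30-row stand-in dataset are assumptions.

```python
import random

def train_test_split(rows, train_fraction=2 / 3, seed=1234):
    """Shuffle row positions with a fixed seed and split them 2:1."""
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_train = round(len(rows) * train_fraction)
    train = [rows[i] for i in idx[:n_train]]
    test = [rows[i] for i in idx[n_train:]]
    return train, test

data = list(range(30))                 # hypothetical stand-in for 30 rows
train, test = train_test_split(data)
print(len(train), len(test))
```

Fixing the seed means every rerun yields the same partition, so error rates for different scalings and values of \(k\) are compared on identical training and testing rows.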
Select the optimal \(k\) for each scaled version of the data: plot an error graph against \(k\) with three lines: training_error, cross-validation_error, and testing_error
What is the impact of \(k\)?: Formulate a hypothesis about the relation between \(k\) and the error rates. You can try to use knn.tunning to verify the results (Hint: select the same folds, or you may obtain slightly different results)
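To make the three error curves concrete, here is a minimal from-scratch sketch in plain Python that computes training, cross-validation, and testing error for several values of \(k\). It does not use the R functions mentioned in the text; the synthetic two-class data, the helper names, the candidate values of \(k\), and the 5-fold layout are all assumptions. The same fold assignment is reused for every \(k\), in line with the hint about selecting the same folds.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def error_rate(train_X, train_y, X, y, k):
    """Fraction of points in (X, y) misclassified by kNN fit on the training set."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != t for x, t in zip(X, y))
    return wrong / len(y)

def cv_error(X, y, k, folds):
    """Average held-out error over fixed folds (lists of row positions)."""
    errs = []
    for fold in folds:
        held = set(fold)
        keep = [i for i in range(len(X)) if i not in held]
        errs.append(error_rate([X[i] for i in keep], [y[i] for i in keep],
                               [X[i] for i in fold], [y[i] for i in fold], k))
    return sum(errs) / len(errs)

# Hypothetical data: two Gaussian clouds in 2 dimensions, 45 points per class.
rng = random.Random(1234)
X = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (0, 2) for _ in range(45)]
y = [c for c in (0, 1) for _ in range(45)]

idx = list(range(len(X)))
rng.shuffle(idx)
tr, te = idx[:60], idx[60:]            # train:test = 2:1
Xtr, ytr = [X[i] for i in tr], [y[i] for i in tr]
Xte, yte = [X[i] for i in te], [y[i] for i in te]
folds = [list(range(f, len(Xtr), 5)) for f in range(5)]  # same folds for every k

# For each k: (training_error, cross-validation_error, testing_error).
curves = {
    k: (error_rate(Xtr, ytr, Xtr, ytr, k),
        cv_error(Xtr, ytr, k, folds),
        error_rate(Xtr, ytr, Xte, yte, k))
    for k in (1, 3, 5, 9, 15)
}
for k, errs in curves.items():
    print(k, errs)
```

With \(k = 1\) the training error is zero because every training point is its own nearest neighbor, which is one reason training error alone is a poor guide for choosing \(k\). The tuples in `curves` can be passed to any plotting tool to draw the three lines.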
Interpret the results: Hint: Considering the number of dimensions of the data, how many points are necessary to obtain the same density in a 100-dimensional space as in a 1-dimensional space?
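The arithmetic behind this hint can be checked with a short calculation. The sketch assumes that "density" means points per unit length along each axis of a unit hypercube, so matching a grid of \(n\) points on the unit interval requires \(n^d\) points in \(d\) dimensions; the function name and the choice of \(n = 10\) are illustrative assumptions.

```python
def points_needed(n_per_axis, dimensions):
    """Points required to keep n_per_axis points along every axis of a unit cube."""
    return n_per_axis ** dimensions

one_dim = points_needed(10, 1)        # 10 points on the unit interval
hundred_dim = points_needed(10, 100)  # same per-axis spacing in 100 dimensions

print(one_dim)
print(len(str(hundred_dim)) - 1)      # order of magnitude of the 100-D count
```

Under this assumption the required sample size grows exponentially with the number of dimensions, so with a fixed number of observations the nearest neighbors in a high-dimensional space are far less "local" than in a low-dimensional one.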
Report the error rates for both the training and the testing data. What do you find?
Try the above protocol again, but select only columns 1 to 5 as predictors (after deleting the index and ID columns). How do the selected \(k\) and the error rates change for each version of the data (original data, normalized data)? Comment on any interesting observations.