In the next several chapters we will concentrate on progressively more advanced machine learning, classification, and clustering techniques. There are two categories of machine learning techniques: unsupervised and supervised (human-guided). In general, supervised classification methods aim to identify or predict predefined classes and label new objects as members of specific classes, whereas unsupervised clustering approaches attempt to group objects into subsets, without knowing a priori labels, and determine relationships between objects.

In the context of machine learning, classification is supervised learning and clustering is unsupervised learning.

Unsupervised classification refers to methods where the outcomes (groupings with common characteristics) are automatically derived based on intrinsic affinities and associations in the data, without human indication of clustering. Unsupervised learning is based purely on input data ($X$) without corresponding output labels. The goal is to model the underlying structure, affinities, or distribution in the data in order to learn more about its intrinsic characteristics. It is called unsupervised learning because there are no a priori correct answers and there is no human guidance. Algorithms are left to their own devices to discover and present the interesting structure in the data. Clustering (discovering the inherent groupings in the data) and association (discovering association rules that describe the data) represent the core unsupervised learning problems. The k-means clustering and the Apriori association-rule algorithms provide solutions to unsupervised learning problems.

Unsupervised Clustering Approaches
Family	Approach	Description / Examples
Bayesian	Decision Based	The Bayesian classifier has high computational requirements. As there are no a priori given labels, some prior is needed/specified to train the classifier, which is in turn used to label new data. The newly labeled samples are subsequently used to train a new (supervised) classifier, i.e., decision-directed unsupervised learning. If the initial classifier is not appropriate, the process may diverge
Bayesian	Non-parametric	Chinese restaurant process (CRP), infinite Hidden Markov Model (HMM)
Hierarchical	Divisive (Top-Down)	Principal Direction Divisive Partitioning (PDDP)
Hierarchical	Agglomerative (Bottom-Up)	Start with each case in a separate cluster and repeatedly join the closest pairs into clusters, until a stopping criterion is met: (1) there is only a single cluster for all cases, (2) a predetermined number of clusters is reached, (3) the distance between the closest clusters exceeds a threshold
Partitioning Based	Spectral	SpecC
Partitioning Based	K-Means / Centroid	kmeans, hkmeans
Partitioning Based	Graph-Theoretic	Growing Neural Gas
Partitioning Based	Model Based	mclust

Supervised classification methods utilize user provided labels representative of specific classes associated with concrete observations, cases or units. These training classes/outcomes are used as references for the classification. Many problems can be addressed by decision-support systems utilizing combinations of supervised and unsupervised classification processes. Supervised learning involves input variables ($X$) and an outcome variable ($Y$) to learn mapping functions from the input to the output: $Y = f(X)$. The goal is to approximate the mapping function so that when it is applied to new (validation) data ($Z$) it (accurately) predicts the (expected) outcome variables ($Y$). It is called supervised learning because the learning process is supervised by initial training labels guiding and correcting the learning until the algorithm achieves an acceptable level of performance.

Regression (output variable is a real value) and classification (output variable is a category) problems represent the two types of supervised learning. Examples of supervised machine learning algorithms include linear regression and random forest. Both provide solutions for regression-type problems, but random forest also provides solutions to classification problems.

Just as the categorization of exploratory data analytics (Chapter 3) is challenging, so is the systematic codification of machine learning techniques. The table below attempts to provide a rough representation of common machine learning methods. However, it is not intended to be a gold-standard protocol for choosing the best analytical method. Before you settle on a specific strategy for data analysis, you should always review the data characteristics in light of the assumptions of each technique and assess the potential to gain new knowledge or extract valid information from applying a specific technique.

Inference	Outcome	Supervised	Unsupervised
Classification & Prediction	Binary	Classification-Rules, OneR, kNN, NaiveBayes, Decision-Tree, C5.0, AdaBoost, XGBoost, LDA/QDA, Logit/Poisson, SVM	Apriori, Association-Rules, k-Means, NaiveBayes
Classification & Prediction	Categorical	Regression Modeling & Forecasting	Apriori, Association-Rules, k-Means, NaiveBayes
Regression Modeling	Real Quantitative	(MLR) Regression Modeling, LDA/QDA, SVM, Decision-Tree, NeuralNet	Regression Modeling Tree, Apriori/Association-Rules

Many of these will be discussed in later chapters. In this chapter, we will present step-by-step the k-nearest neighbor (kNN) algorithm. Specifically, we will demonstrate (1) data retrieval and normalization, (2) splitting the data into training and testing sets, (3) fitting models on the training data, (4) evaluating model performance on testing data, (5) improving model performance, and (6) determining optimal values of $k$.

In Chapter 13, we will present detailed strategies, and evaluation metrics, to assess the performance of all clustering and classification methods.

1 Motivation

Classification tasks could be very difficult when the features and target classes are numerous, complicated or extremely difficult to understand. In those scenarios, where the items of similar class type tend to be homogeneous, nearest neighbor classification may be appropriate because assigning unlabeled cases to their most similar labeled neighbors may be fairly easy to accomplish.

Such classification methods can help us understand the story behind the unlabeled data using known data, without having to analyze those complicated features and target classes. This is because these techniques make no prior distributional assumptions. However, this non-parametric approach makes the methods rely heavily on the training instances, which explains their designation as lazy learning algorithms.

2 The kNN algorithm Overview

The kNN algorithm involves the following steps:

Create a training dataset that has classified examples labeled by nominal variables and different features in ordinal or numerical variables.
Create a testing dataset containing unlabeled examples with similar features as the training data.
Given a predetermined number $k$, match each test case with the $k$ closest training records that are “nearest” to the test case, according to a certain similarity or distance measure.
Assign a test case class label according to the majority vote of the $k$ nearest training cases.

Mathematically, for a given $k$, a specific similarity metric $d$, and a new testing case $x$, the kNN classifier performs two steps ($k$ is typically odd to avoid ties):

Runs through the whole training dataset ($y$) computing $d(x,y)$. Let $A$ represent the $k$ closest points to $x$ in the training data $y$.
Estimates the conditional probability for each class, which corresponds to the fraction of points in $A$ with that given class label. If $I(z)$ is an indicator function $I(z) = \begin{cases} 1 & z= true \\ 0 & otherwise \end{cases}$, then the testing data input $x$ gets assigned to the class with the largest probability, $P(y=j|X=x)$:

\[P(y=j|X=x) =\frac{1}{k} \sum_{i\in A}{I(y^{(i)}=j)}.\]
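The two steps above can be sketched in a few lines of base R. This is a minimal illustration rather than the implementation used by any package: the helper name `knn_one` and the toy data are made up for this example, and the Euclidean distance is assumed as the metric $d$.

```r
# Hypothetical helper: classify one new case x using its k nearest training cases
knn_one <- function(train_X, train_y, x, k = 3) {
  # Step 1: Euclidean distance from x to every training row
  d <- sqrt(rowSums(sweep(as.matrix(train_X), 2, x)^2))
  A <- order(d)[1:k]                      # indices of the k closest training cases
  # Step 2: P(y = j | X = x) = fraction of the neighbors in A that carry label j
  p <- table(factor(train_y[A], levels = unique(train_y))) / k
  list(prob = p, class = names(p)[which.max(p)])
}

# Toy data (made up for illustration): two well-separated groups in 2D
train_X <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(6, 7), c(7, 6))
train_y <- c("a", "a", "a", "b", "b", "b")

res <- knn_one(train_X, train_y, x = c(2, 2), k = 3)
res$class   # label with the largest estimated probability
res$prob    # estimated class probabilities
```

Here the three nearest neighbors of the point $(2, 2)$ all belong to class "a", so the estimated probability of "a" is 1 and the new case is labeled "a".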

2.1 Distance Function and Dummy coding

How do we measure the similarity between records? We can think of similarity measures as distance metrics between two records or cases. There are many distance functions to choose from. Traditionally, we use the Euclidean distance as our similarity metric.

If we use a line to connect the two points representing the testing and the training records in $n$ dimensional space, the length of the line is the Euclidean distance. If $a$ and $b$ both have $n$ features, their coordinates are $(a_1, a_2, ..., a_n)$ and $(b_1, b_2, ..., b_n)$, and the distance is:

\[dist(a, b)=\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+...+(a_n-b_n)^2}.\]
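As a quick check of the formula, the distance can be computed directly and compared against the base-R `dist()` function, which uses the Euclidean metric by default. The function name `euclid` and the two vectors are made up for this sketch.

```r
# Direct implementation of the Euclidean distance formula
euclid <- function(a, b) sqrt(sum((a - b)^2))

a <- c(1, 2, 3)
b <- c(4, 6, 3)

euclid(a, b)                      # sqrt(3^2 + 4^2 + 0^2) = 5
as.numeric(dist(rbind(a, b)))     # same value from stats::dist()
```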

When we have nominal features, it requires a little trick to apply the Euclidean distance formula. We can create dummy variables as indicators of the nominal feature levels. A dummy variable equals one when the case has the given feature level and zero otherwise. We show two examples:

\[ Gender= \left\{ \begin{array}{ll} 0 & X=male \\ 1 & X=female \\ \end{array} \right. \] \[ Cold= \left\{ \begin{array}{ll} 0 & Temp \geq 37F \\ 1 & Temp < 37F \\ \end{array} \right. \]

Each dummy variable encodes only a binary distinction. If a nominal feature has multiple categories, we create one dummy variable per category and then apply the Euclidean distance.
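For a nominal feature with more than two levels, one way to build the indicators in R is `model.matrix()`. The sketch below assumes a made-up feature called `color`; the `- 1` in the formula drops the intercept so that every level gets its own 0/1 column.

```r
# Toy nominal feature with three levels (made up for illustration)
df <- data.frame(color = factor(c("red", "green", "blue", "green")))

# One indicator column per level; each row has exactly one 1
dummies <- model.matrix(~ color - 1, data = df)
dummies
colnames(dummies)   # "colorblue" "colorgreen" "colorred"
```

The resulting numeric columns can be passed to a distance function alongside the other (rescaled) numeric features.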

2.2 Ways to Determine k

The parameter $k$ should be neither too large nor too small. If $k$ is too large, the test record tends to be classified as the most popular class in the training records, rather than the most similar one. On the other hand, if $k$ is too small, outliers, noise, or mislabeled training cases might lead to errors in predictions.

A common practice is to calculate the square root of the number of training cases and use that number as an (initial) estimate of $k$.

A more robust way would be to choose several k values and select the one with the optimal (best) classifying performance.
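The square-root rule of thumb is easy to compute. The sketch below assumes a hypothetical training set of 150 cases and nudges an even result to the next odd value, since odd values of $k$ reduce the chance of tied votes in two-class problems.

```r
n_train <- 150                      # hypothetical number of training cases
k0 <- round(sqrt(n_train))          # square-root rule of thumb: sqrt(150) is about 12.2
if (k0 %% 2 == 0) k0 <- k0 + 1      # prefer an odd k to reduce voting ties
k0
```

This value is only a starting point; comparing several candidate values of $k$ on held-out data remains the more reliable approach.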

2.3 Rescaling of the features

Different features might have different scales. For example, we can have a measure of pain scaling from 1 to 10 or from 1 to 100. Some similarity or distance measures assume the same measuring unit in all feature dimensions, so the data may need to be transformed onto a common scale. Re-scaling makes each feature contribute to the distance in a relatively equal manner, avoiding potential bias.

2.4 Rescaling Formulas

There are many alternative strategies to rescale the data.

2.4.1 min-max normalization

\[X_{new}=\frac{X-min(X)}{max(X)-min(X)}.\]

After re-scaling the data, $X_{new}$ ranges from 0 to 1, representing the distance between each value and the minimum as a fraction of the range. Larger values indicate a greater distance from the minimum; a value of 1 (100%) means that the value is at the maximum.

2.4.2 z-score standardization

\[X_{new}=\frac{X-\mu}{\sigma}=\frac{X-Mean(X)}{SD(X)}\]

This is based on the properties of the normal distribution that we discussed in Chapter 2. After z-score standardization, the re-scaled feature has an unbounded range. This differs from min-max normalization, which always has a finite range from 0 to 1. Following z-score standardization, the transformed features are unitless and may resemble a standard normal distribution.
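To make the contrast concrete, the following sketch applies both formulas to a small made-up vector containing one large value. It also checks that the manual z-score agrees with the base-R `scale()` function.

```r
x <- c(2, 4, 6, 8, 100)                       # toy feature with one extreme value

x_minmax <- (x - min(x)) / (max(x) - min(x))  # min-max normalization
x_z      <- (x - mean(x)) / sd(x)             # z-score standardization

range(x_minmax)                               # always 0 and 1
round(c(mean(x_z), sd(x_z)), 10)              # mean 0 and SD 1, but no fixed range
all.equal(x_z, as.numeric(scale(x)))          # scale() gives the same z-scores
```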

3 Case Study: Youth Development

3.1 Step 1: Collecting Data

The data we are using for this case study is the “Boys Town Study of Youth Development”, which is the second case study, CaseStudy02_Boystown_Data.csv.

Variables:

ID: Case subject identifier
Sex: dichotomous variable (1=male, 2=female)
GPA: Interval-level variable with range of 0-5 (0-“A” average, 1-“B” average, 2-“C” average, 3-“D” average, 4-“E”, 5-“F”)
Alcohol use: Interval-level variable from 0-11 (drink every day - never drank)
Attitudes on drinking in the household: Alcatt- Interval level variable from 0-6 (totally approve - totally disapprove)
DadJob: 1-yes, dad has a job and 2- no
MomJob: 1-yes and 2-no
Parent closeness (example: In your opinion, does your mother make you feel close to her?)
- Dadclose: Interval level variable 0-7 (usually-never)
- Momclose: interval level variable 0-7 (usually-never)
Delinquency:
- larceny (how many times have you taken things >$50?): Interval level data 0-4 (never - many times),
- vandalism: Interval level data 0-7 (never - many times)

3.2 Step 2: Exploring and preparing the data

First, we need to load the data and do some data manipulation. Because we are using the Euclidean distance, dummy variables should be used for the nominal features. The following code recodes sex, dadjob, and momjob into 0/1 dummy variables.

library(class)
library(gmodels)
boystown<-read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
boystown$sex<-boystown$sex-1
boystown$dadjob <- -1*(boystown$dadjob-2)
boystown$momjob <- -1*(boystown$momjob-2)
str(boystown)

## 'data.frame':    200 obs. of  11 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex       : num  0 0 0 0 1 1 0 0 1 1 ...
##  $ gpa       : int  5 0 3 2 3 3 1 5 1 3 ...
##  $ Alcoholuse: int  2 4 2 2 6 3 2 6 5 2 ...
##  $ alcatt    : int  3 2 3 1 2 0 0 3 0 1 ...
##  $ dadjob    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ momjob    : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ dadclose  : int  1 3 2 1 2 1 3 6 3 1 ...
##  $ momclose  : int  1 4 2 2 1 2 1 2 3 2 ...
##  $ larceny   : int  1 0 0 3 1 0 0 0 1 1 ...
##  $ vandalism : int  3 0 2 2 2 0 5 1 4 0 ...

The str() function tells us that we have 200 observations and 11 variables. However, the ID variable is not important in this case study, so we can delete it. In this case study, we can focus on academic performance (GPA), recidivism (vandalism and larceny), alcohol use, or other outcome variables. One concrete example involves predicting recidivism, using kNN to classify participants into two categories.

Let’s focus on a specific outcome variable representing two or more infractions of vandalism and larceny, which can be considered as “Recidivism”. Participants with one or no infractions can be labeled as “Controls”. First we can use PCA to explore the data and then construct the new derived recidivism variable and split the data into training and testing sets.

# GPA study
# boystown <- boystown[, -1]
# table(boystown$gpa)
# boystown$grade <- boystown$gpa %in% c(3, 4, 5)    # GPA: (1-“A” average, 2-“B” average, 3-“C” average, 4+ below “C” average)
# boystown$grade <- factor(boystown$grade, levels=c(F, T), labels = c("above_avg", "avg_or_below"))
# table(boystown$grade)

# You can also try with alternative outcomes, e.g., alcohol use
# Outcome Alcohol Use 6 categories (1- everyday, 2- once or twice/wk, 3- once or twice/month, 4- less than once or twice per month, 5- once or twice, 6- never)
# Y <- as.factor(ifelse(boystown$Alcoholuse > 3, "Heavy Alcohol Use", "Low Alcohol Use")); table(Y)
# Y <- as.factor(boystown$Alcoholuse); table(Y)
# X <- boystown[, -c(1, 4)]

# First explore the data by running a PCA
rawData <- boystown[ , -1]; head(rawData)

##   sex gpa Alcoholuse alcatt dadjob momjob dadclose momclose larceny vandalism
## 1   0   5          2      3      1      0        1        1       1         3
## 2   0   0          4      2      1      0        3        4       0         0
## 3   0   3          2      3      1      0        2        2       0         2
## 4   0   2          2      1      1      0        1        2       3         2
## 5   1   3          6      2      1      1        2        1       1         2
## 6   1   3          3      0      1      0        1        2       0         0

pca1 <- prcomp(as.matrix(rawData), center = T)
summary(pca1)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6     PC7
## Standard deviation     1.9264 1.6267 1.4556 1.3887 1.3023 1.2495 0.95092
## Proportion of Variance 0.2463 0.1756 0.1406 0.1280 0.1126 0.1036 0.06001
## Cumulative Proportion  0.2463 0.4219 0.5625 0.6905 0.8031 0.9067 0.96672
##                            PC8     PC9    PC10
## Standard deviation     0.46931 0.44030 0.29546
## Proportion of Variance 0.01462 0.01287 0.00579
## Cumulative Proportion  0.98134 0.99421 1.00000

pca1$rotation

##                     PC1          PC2          PC3          PC4         PC5
## sex         0.017329782 -0.041942445  0.038644169  0.028574405 -0.01345929
## gpa         0.050811919  0.251727224 -0.552866390  0.482012801 -0.12517612
## Alcoholuse -0.955960402 -0.171043716 -0.004495945  0.207154515  0.09577419
## alcatt     -0.087619514  0.345341205 -0.639579077 -0.330797550  0.46933240
## dadjob      0.007936535 -0.001870547 -0.017028011  0.006888380 -0.01469275
## momjob      0.030397552 -0.003518869  0.008763667 -0.001482878 -0.02342974
## dadclose    0.052947782 -0.689075136 -0.246264250 -0.452876476  0.21528193
## momclose   -0.055904765 -0.273643927 -0.435005164 -0.092086033 -0.75752753
## larceny     0.063455041 -0.005344755 -0.108928362  0.033669839 -0.07405310
## vandalism  -0.254240056  0.486423422  0.147159970 -0.632255411 -0.35813577
##                     PC6         PC7          PC8          PC9         PC10
## sex         0.026402404  0.02939770 -0.996267983  0.036561664 -0.001236127
## gpa        -0.599548626 -0.13874602 -0.035498809  0.004136514  0.019965416
## Alcoholuse -0.015534060  0.05830281 -0.002438336  0.032626941 -0.008177261
## alcatt      0.362724708  0.01546912 -0.045913748  0.019571649  0.001150003
## dadjob      0.001537909 -0.04457354  0.001761749  0.050310596 -0.997425089
## momjob      0.006536869 -0.01490905  0.037550689  0.997051461  0.051468175
## dadclose   -0.456556766 -0.04166384 -0.008660225  0.005147468  0.001021197
## momclose    0.381040880 -0.06639096  0.008710480 -0.018261764  0.020666371
## larceny    -0.084351462  0.98381408  0.026413939  0.013630662 -0.039663208
## vandalism  -0.383715180 -0.00198434 -0.042593385  0.003164270 -0.004956997

Y <- ifelse (boystown$vandalism + boystown$larceny > 1, "Recidivism", "Control")  # more than 1 vandalism or larceny conviction
X <- boystown[, -c(1, 10, 11)]  # covariate set excludes these 3 columns index/ID, vandalism, and larceny

boystown_z <- as.data.frame(lapply(X, scale))
bt_train <- boystown_z[1:150, ]
bt_test  <- boystown_z[151:200, ]
bt_train_labels <- Y[1:150]  
bt_test_labels  <- Y[151:200]
table(bt_train_labels)

## bt_train_labels
##    Control Recidivism 
##         46        104

Let’s look at the proportions for the two categories in both the training and testing sets.

round(prop.table(table(bt_train_labels)), digits=2)

## bt_train_labels
##    Control Recidivism 
##       0.31       0.69

round(prop.table(table(bt_test_labels)), digits=2)

## bt_test_labels
##    Control Recidivism 
##        0.2        0.8

We can see that most of the participants have incidents related to recidivism (~70-80%).

The remaining 8 features use different measuring scales. If we use these features directly, variables with larger scales may have a greater impact on the classification performance. Therefore, re-scaling may be useful to level the playing field.

3.3 Normalizing Data

First let’s create a function of our own using the min-max normalization formula. We can check the function using some trial vectors.

normalize<-function(x){
  return((x-min(x))/(max(x)-min(x)))
}

# some test examples:
normalize(c(1, 2, 3, 4, 5))

## [1] 0.00 0.25 0.50 0.75 1.00

normalize(c(1, 3, 6, 7, 9))

## [1] 0.000 0.250 0.625 0.750 1.000

After confirming the function definition, we can use lapply() to apply the normalization transformation to each element in a “list” of predictors of the binary outcome=recidivism.

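# Note: bt_train holds the first 150 rows of the z-scored features; min-max rescaling
# is invariant to that earlier linear transformation, so the result lies in [0, 1]
boystown_n <- as.data.frame(lapply(bt_train, normalize))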

# alternatively we can use "scale", as we showed above
#boystown_z <- as.data.frame(lapply(X, scale))

Let’s compare the two alternative normalization approaches on one of the features, Alcohol use.

summary(boystown_n$Alcoholuse)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2727  0.3636  0.3576  0.4545  1.0000

summary(boystown_z$Alcoholuse)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.04795 -0.98958  0.06879  0.00000  0.59798  3.77309

3.4 Data preparation - creating training and test datasets

We have 200 observations in this dataset. In general, the more data we use to train the algorithm, the more reliable the predictions tend to be. We can use $3/4$ of the data for training and the remaining $1/4$ for testing. For simplicity, we will just take the first 150 cases for training and the remaining 50 for testing. Alternatively, we can use a randomization strategy to split the data into testing and training sets.

# may want to use random split of the raw data into training and testing
# subset_int <- sample(nrow(boystown_n),floor(nrow(boystown_n)*0.8))  
# 80% training + 20% testing
# bt_train<- boystown_n [subset_int, ]; bt_test<-boystown_n[-subset_int, ] 

# Note that the object boystown_z already excludes the ID and the outcome-related variables (larceny and vandalism)
bt_train <- boystown_z[1:150, ]
bt_test  <- boystown_z[151:200, ]

Then let’s extract the recidivism labels or classes for the training and testing sets.

bt_train_labels <- Y[1:150]  
bt_test_labels  <- Y[151:200]
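Because the sequential split above yields somewhat different class proportions in the training and testing sets, a random split may be preferable. The following is a minimal sketch of a 75%/25% random split by row index; the seed value and the index names `train_idx` and `test_idx` are arbitrary choices made for illustration.

```r
set.seed(1234)                                   # arbitrary seed, for reproducibility only
n <- 200                                         # number of cases in this dataset
train_idx <- sample(n, size = floor(0.75 * n))   # 150 randomly chosen training rows
test_idx  <- setdiff(seq_len(n), train_idx)      # the remaining 50 rows

length(train_idx); length(test_idx)

# With the objects defined above, one could then use, e.g.:
# bt_train <- boystown_z[train_idx, ]; bt_train_labels <- Y[train_idx]
# bt_test  <- boystown_z[test_idx, ];  bt_test_labels  <- Y[test_idx]
```

Note that a random split changes the results reported in the rest of this section, which are based on the sequential split.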

3.5 Step 3 - Training a model on the data

First, we will use the class::knn() method.

#install.packages('class', repos = "http://cran.us.r-project.org")
library(class)

The function knn() has the following components:

p <- knn(train, test, cl, k)

train: data frame containing numeric training data (features)
test: data frame containing numeric testing data (features)
cl: class label for each observation in the training data
k: predetermined integer indicating the number of nearest neighbors

We can first test with $k=7$, which is less than the square root of our number of observations: $\sqrt{200}\approx 14$.

bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=7)

3.6 Step 4 - Evaluating model performance

We utilize the CrossTable() function, introduced in Chapter 2, to evaluate the kNN model. We have two classes in this example. The goal is to create a $2\times 2$ table that shows the matched true and predicted classes as well as the unmatched ones. Chi-square values are not needed, so we use the option prop.chisq=FALSE to suppress them.

# install.packages("gmodels", repos="http://cran.us.r-project.org")
library(gmodels)
CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_pred 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          0 |         10 |         10 | 
##                |      0.000 |      1.000 |      0.200 | 
##                |      0.000 |      0.213 |            | 
##                |      0.000 |      0.200 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          3 |         37 |         40 | 
##                |      0.075 |      0.925 |      0.800 | 
##                |      1.000 |      0.787 |            | 
##                |      0.060 |      0.740 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          3 |         47 |         50 | 
##                |      0.060 |      0.940 |            | 
## ---------------|------------|------------|------------|
## 
##

In this table, the cells in the first row-first column and the second row-second column contain the counts of cases whose predicted classes match the true class labels. The other two cells are the counts of mismatched cases. The accuracy of this classifier is $\frac{cell[1, 1]+cell[2, 2]}{total}=\frac{0+37}{50}=0.74$. Note that this value may fluctuate slightly between runs, because knn() breaks voting ties at random.
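The accuracy calculation can be reproduced directly from the cell counts. In the sketch below, the counts are copied by hand from the CrossTable() output above (rows are the true labels, columns are the predicted labels); the object name `conf_mat` is made up for this example.

```r
# Confusion matrix from the table above (filled column by column)
conf_mat <- matrix(c(0, 3, 10, 37), nrow = 2,
                   dimnames = list(true      = c("Control", "Recidivism"),
                                   predicted = c("Control", "Recidivism")))
conf_mat

# Accuracy = (matched cases on the diagonal) / (all cases)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy   # 37 / 50 = 0.74
```

Notice that none of the 10 true "Control" cases is identified correctly, so the overall accuracy alone hides the poor performance on the minority class.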

3.7 Step 5 - Improving model performance

The normalization strategy may play a role in the classification performance. We can try alternative standardization methods, e.g., standard z-score centering and scaling (via the scale() method). Let’s give standardization a try.

Then, we can proceed to training the kNN classifier, predicting, and assessing the accuracy of the results.

# bt_train <- boystown_z[1:150, ]
# bt_test  <- boystown_z[151:200, ]
# bt_train_labels <- Y[1:150]
# bt_test_labels  <- Y[151:200]
# bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=7)
# CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)

# bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=18) # retrieve Probabilities
# getTraingDataThreshold <- quantile(attributes(bt_test_pred)$prob, c(table(bt_train_labels)[1]/length(bt_train_labels), 0.5))[1]
# bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > getTraingDataThreshold, 1, 0)          # Binarize the probabilities
bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=14) # retrieve Probabilities
# Caution: attr(, "prob") is the vote share of the winning class, which is not necessarily P(Recidivism)
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control")          # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_predBin 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          0 |         10 |         10 | 
##                |      0.000 |      1.000 |      0.200 | 
##                |      0.000 |      0.217 |            | 
##                |      0.000 |      0.200 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          4 |         36 |         40 | 
##                |      0.100 |      0.900 |      0.800 | 
##                |      1.000 |      0.783 |            | 
##                |      0.080 |      0.720 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          4 |         46 |         50 | 
##                |      0.080 |      0.920 |            | 
## ---------------|------------|------------|------------|
## 
##

# CT 
print(paste0("Prediction accuracy of model 'bt_test_pred' (k=14) is ", (CT$prop.tbl[1,1]+CT$prop.tbl[2,2]) ))

## [1] "Prediction accuracy of model 'bt_test_pred' (k=14) is 0.72"

Under the z-score method, the prediction result is similar to the previous result. In general, there may be marginal differences; e.g., a few more cases may be correctly labeled under one of the standardization or normalization approaches.
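When thresholding kNN probabilities, keep in mind that class::knn() with prob=TRUE returns, as the "prob" attribute, the proportion of votes for the winning class, whichever class that is. For a two-class problem, the vote share of one specific class can be recovered as sketched below. The toy data and the names `win_share` and `p_recid` are made up for this illustration.

```r
library(class)   # the class package ships with standard R installations

# Toy data (made up): one numeric feature, two classes
train <- data.frame(x = c(1, 2, 3, 10, 11, 12))
cl    <- factor(c("Control", "Control", "Control",
                  "Recidivism", "Recidivism", "Recidivism"))
test  <- data.frame(x = c(2.5, 10.5))

pred      <- knn(train, test, cl, k = 3, prob = TRUE)
win_share <- attr(pred, "prob")            # vote share of the *winning* class

# Two-class case only: convert to the vote share of "Recidivism" specifically
p_recid <- ifelse(pred == "Recidivism", win_share, 1 - win_share)
data.frame(pred, win_share, p_recid)
```

A cutoff such as 0.6 can then be applied to `p_recid`, so that a confidently predicted "Control" case is not mistaken for a high-probability "Recidivism" case.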

3.8 Testing alternative values of k

Above, we used $k=7$ and $k=14$ (approximately the square root of 200). However, these might not be the best choices of $k$ in this study. We can test different values of $k$ and compare their predictive performance.

# bt_train <- boystown_z[1:150, ]
# bt_test  <- boystown_z[151:200, ]
# bt_train_labels <- Y[1:150]
# bt_test_labels  <- Y[151:200]
# bt_test_pred1   <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=1)
# bt_test_pred9   <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=9)
# bt_test_pred11  <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, k=11)
# bt_test_pred21  <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, k=21)
# bt_test_pred27  <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=27)
# ct_1  <- CrossTable(x=bt_test_labels, y=bt_test_pred1, prop.chisq = F)
# ct_9  <- CrossTable(x=bt_test_labels, y=bt_test_pred9, prop.chisq = F)
# ct_11 <- CrossTable(x=bt_test_labels, y=bt_test_pred11, prop.chisq = F)
# ct_21 <- CrossTable(x=bt_test_labels, y=bt_test_pred21, prop.chisq = F)
# ct_27 <- CrossTable(x=bt_test_labels, y=bt_test_pred27, prop.chisq = F)

bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=18)        # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control")     # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_predBin 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          1 |          9 |         10 | 
##                |      0.100 |      0.900 |      0.200 | 
##                |      1.000 |      0.184 |            | 
##                |      0.020 |      0.180 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          0 |         40 |         40 | 
##                |      0.000 |      1.000 |      0.800 | 
##                |      0.000 |      0.816 |            | 
##                |      0.000 |      0.800 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          1 |         49 |         50 | 
##                |      0.020 |      0.980 |            | 
## ---------------|------------|------------|------------|
## 
##

# CT 
print(paste0("Prediction accuracy of model 'bt_test_pred' (k=18) is ", (CT$prop.tbl[1,1]+CT$prop.tbl[2,2]) ))

## [1] "Prediction accuracy of model 'bt_test_pred' (k=18) is 0.82"
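Rather than refitting the model by hand for each candidate value, the comparison can be automated with a loop over $k$. The sketch below is self-contained and uses the built-in iris data as a stand-in (an assumption made so that the example runs on its own); the same pattern applies to the bt_train and bt_test objects defined above.

```r
library(class)

set.seed(2024)                                   # arbitrary seed
idx   <- sample(nrow(iris), 100)                 # 100 training rows, 50 testing rows
feats <- scale(iris[, 1:4])                      # z-score the four numeric features
ks    <- c(1, 5, 9, 13, 17, 21)                  # candidate values of k

acc <- sapply(ks, function(k) {
  pred <- knn(feats[idx, ], feats[-idx, ], cl = iris$Species[idx], k = k)
  mean(pred == iris$Species[-idx])               # proportion of correctly labeled test cases
})

data.frame(k = ks, accuracy = acc)
ks[which.max(acc)]                               # candidate k with the highest test accuracy
```

Selecting $k$ on the same test set that is used to report the final accuracy gives an optimistic estimate, which is one reason to prefer the cross-validation tuning shown next.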

The choice of $k$ in kNN classification is very important, and it can be fine-tuned using the e1071::tune.knn() or the caret::train() methods. Note that using the tune.knn() method without explicit control over the class-probability cutoff does not generate good results. Specifying that a probability exceeding $0.6$ corresponds to recidivism appears to improve the kNN performance when using the caret::train() and predict() methods.

# install.packages("e1071")
library(e1071)
knntuning = tune.knn(x= bt_train, y = as.factor(bt_train_labels), k = 1:30)
knntuning

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  k
##  9
## 
## - best performance: 0.3

summary(knntuning)

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  k
##  9
## 
## - best performance: 0.3 
## 
## - Detailed performance results:
##     k     error dispersion
## 1   1 0.4933333 0.16982198
## 2   2 0.4466667 0.09962894
## 3   3 0.4000000 0.15396007
## 4   4 0.4066667 0.11088867
## 5   5 0.3466667 0.12881224
## 6   6 0.3600000 0.14470063
## 7   7 0.3266667 0.12746338
## 8   8 0.3266667 0.11946382
## 9   9 0.3000000 0.12272623
## 10 10 0.3000000 0.12272623
## 11 11 0.3000000 0.10540926
## 12 12 0.3400000 0.12352837
## 13 13 0.3266667 0.13498971
## 14 14 0.3133333 0.12976712
## 15 15 0.3266667 0.13498971
## 16 16 0.3200000 0.13259052
## 17 17 0.3133333 0.14418781
## 18 18 0.3200000 0.14673399
## 19 19 0.3133333 0.14418781
## 20 20 0.3066667 0.14124665
## 21 21 0.3066667 0.14124665
## 22 22 0.3066667 0.14124665
## 23 23 0.3066667 0.14124665
## 24 24 0.3066667 0.14124665
## 25 25 0.3066667 0.14124665
## 26 26 0.3066667 0.14124665
## 27 27 0.3066667 0.14124665
## 28 28 0.3066667 0.14124665
## 29 29 0.3066667 0.14124665
## 30 30 0.3066667 0.14124665

library(caret)
knnControl <- trainControl(
    method = "cv", ## cross validation
    number = 10,   ## 10-fold
    summaryFunction = twoClassSummary,
    classProbs = TRUE,
    verboseIter = FALSE
)
# bt_train_stringLabels <- ifelse (bt_train_labels==1, "Recidivism", "Control")
knn_model <- train(x=bt_train, y=bt_train_labels , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
print(knn_model)

## k-Nearest Neighbors 
## 
## 150 samples
##   8 predictor
##   2 classes: 'Control', 'Recidivism' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 134, 135, 135, 134, 136, 136, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens   Spec     
##    5  0.2885455  0.070  0.9009091
##    7  0.3599773  0.085  0.9318182
##    9  0.4181364  0.070  0.9809091
##   11  0.4244318  0.070  0.9809091
##   13  0.4669091  0.040  0.9709091
##   15  0.4750455  0.000  0.9800000
##   17  0.4890682  0.000  0.9900000
##   19  0.5230227  0.000  0.9900000
##   21  0.5215455  0.000  0.9900000
##   23  0.5326136  0.000  1.0000000
##   25  0.5069773  0.000  1.0000000
##   27  0.5193864  0.000  1.0000000
##   29  0.5493409  0.000  1.0000000
##   31  0.5391591  0.000  1.0000000
##   33  0.5753636  0.000  1.0000000
##   35  0.5628409  0.000  1.0000000
##   37  0.5512955  0.000  1.0000000
##   39  0.5595682  0.000  1.0000000
##   41  0.5457500  0.000  1.0000000
##   43  0.5563636  0.000  1.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 33.

# Here we are providing a cutoff of 60% probability for derived class label=Recidivism
library(dplyr)
summaryPredictions <- predict(knn_model, newdata = bt_test, type = "prob") # %>% mutate('knnPredClass'=names(.)[apply(., 1, which.max)])
summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
testDataPredSummary <- as.data.frame(cbind(trueLabels=bt_test_labels, controlProb=summaryPredictions$Control, 
                                           recidivismProb=summaryPredictions$Recidivism, knnPredLabel=summaryPredictionsLabel))
# Accuracy = percentage of testing cases whose predicted label matches the true label
print(paste0("Accuracy = ", round(100 * mean(testDataPredSummary$trueLabels == testDataPredSummary$knnPredLabel)), "%"))

## [1] "Accuracy = 78%"
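
To see why the probability cutoff matters, here is a minimal sketch on made-up numbers. The probabilities, the true labels, and the helper label_by_cutoff() are all hypothetical and are not taken from the recidivism data; the point is only that moving the cutoff changes which borderline cases are labeled as recidivism.

```r
# Hypothetical predicted probabilities of the "Recidivism" class for six cases
recidivism_prob <- c(0.55, 0.72, 0.61, 0.30, 0.90, 0.58)
true_labels <- c("Control", "Recidivism", "Recidivism",
                 "Control", "Recidivism", "Control")

# Illustrative helper: label a case "Recidivism" only when prob exceeds the cutoff
label_by_cutoff <- function(prob, cutoff) {
  ifelse(prob > cutoff, "Recidivism", "Control")
}

pred_050 <- label_by_cutoff(recidivism_prob, 0.5)
pred_060 <- label_by_cutoff(recidivism_prob, 0.6)

# Proportion of correctly labeled cases under each cutoff
acc_050 <- mean(pred_050 == true_labels)
acc_060 <- mean(pred_060 == true_labels)
print(c(cutoff_0.5 = acc_050, cutoff_0.6 = acc_060))
```

In this toy example the two borderline controls (probabilities $0.55$ and $0.58$) are mislabeled at a cutoff of $0.5$ and correctly labeled at $0.6$. On real data the best cutoff has to be chosen empirically, e.g., by cross-validation.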

It’s useful to visualize the error rate against the value of $k$. This can help us select the optimal $k$ parameter that minimizes the cross-validation (CV) error, see Chapter 20.

library(class)
library(ggplot2)
library(reshape2)
# define a function that generates CV folds
cv_partition <- function(y, num_folds = 10, seed = NULL) {
  if(!is.null(seed)) {
    set.seed(seed)
  }
  n <- length(y)
  
  # split() divides the data into the folds defined by gl().
  # gl() generates factors according to the pattern of their levels

  folds <- split(sample(seq_len(n), n), gl(n = num_folds, k = 1, length = n))
  folds <- lapply(folds, function(fold) {
    list(
      training = which(!seq_along(y) %in% fold),
      test = fold
    )
  })
  names(folds) <- paste0("Fold", names(folds))
  return(folds)
}

# Generate 10-folds of the data
folds = cv_partition(bt_train_labels, num_folds = 10)

# Define a function returning the training, CV, and testing error rates for a given K
train_cv_error = function(K) {
  # Fix the number of neighbors at K; otherwise caret re-tunes k on every call
  # and the training and CV errors would not depend on K
  kGrid <- data.frame(k = K)

  # Training error (probability cutoff of 0.6 for Recidivism)
  knn_model <- train(x=bt_train, y=bt_train_labels , metric = "ROC", method = "knn", tuneGrid = kGrid, trControl = knnControl)
  summaryPredictions <- predict(knn_model, newdata = bt_train, type = "prob")
  summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
  train_error = mean(summaryPredictionsLabel != bt_train_labels)

  # CV error: fit on the training part of each fold, evaluate on the held-out part
  cverrbt = sapply(folds, function(fold) {
    knn_model <- train(x=bt_train[fold$training,], y=bt_train_labels[fold$training] , metric = "ROC", method = "knn", tuneGrid = kGrid, trControl = knnControl)
    summaryPredictions <- predict(knn_model, newdata = bt_train[fold$test,], type = "prob")
    summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
    mean(summaryPredictionsLabel != bt_train_labels[fold$test])
    }
  )

  cv_error = mean(cverrbt)

  # Test error (majority vote via class::knn, no probability cutoff)
  knn.test = knn(train = bt_train, test = bt_test, cl = bt_train_labels, k = K)
  test_error = mean(knn.test != bt_test_labels)
  return(c(train_error, cv_error, test_error))
}

k_err = sapply(1:30, function(k) train_cv_error(k))
df_errs = data.frame(t(k_err), 1:30)
colnames(df_errs) = c('Train', 'CV', 'Test', 'K')
dataL <- melt(df_errs, id="K")

# require(ggplot2)
# library(reshape2)
# ggplot(dataL, aes_string(x="K", y="value", colour="variable",
#    group="variable", linetype="variable", shape="variable")) +
#    geom_line(size=0.8) + labs(x = "Number of nearest neighbors (k)",
#            y = "Classification error",
#            colour="", group="",
#            linetype="", shape="") +
#   geom_point(size=2.8) +
#   geom_vline(xintercept=4:5, colour = "pink")+
#   geom_text(aes(4,0,label = "4", vjust = 1)) +
#   geom_text(aes(5,0,label = "5", vjust = 1))

library(plotly)
plot_ly(dataL, x = ~K, y = ~value, color = ~variable, type = "scatter", mode = "markers+lines") %>% 
  add_segments(x=25, xend=25, y=0.0, yend=0.33, type = "scatter", name="k=25",
               line=list(color="darkgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  add_segments(x=14, xend=14, y=0.0, yend=0.33, type = "scatter", name="k=14",
               line=list(color="lightgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  add_segments(x=18, xend=18, y=0.0, yend=0.36, type = "scatter",  name="k=18",
               line=list(color="gray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  layout(title='K-NN Training, CV, and Testing Error Rates against k', 
           legend=list(title=list(text='<b> Samples </b>')), 
           xaxis=list(title='Number of nearest neighbors (k)'), yaxis=list(title='Classification error'))
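
The same error-versus-$k$ diagnostic can be sketched in a few self-contained lines. The example below is only an illustration on the built-in iris data (not the recidivism data); it assumes a 5-fold split and uses class::knn() directly. The names fold_id and cv_error_for_k are hypothetical helpers introduced for this sketch.

```r
library(class)   # knn()

set.seed(2021)
X <- scale(iris[, 1:4])        # standardized predictors
y <- iris$Species              # class labels
n <- nrow(X)

# Randomly assign each case to one of 5 folds of (nearly) equal size
fold_id <- sample(rep_len(1:5, n))

# Average held-out misclassification rate for a given number of neighbors k
cv_error_for_k <- function(k) {
  fold_errors <- sapply(1:5, function(f) {
    test_idx <- which(fold_id == f)
    pred <- knn(train = X[-test_idx, ], test = X[test_idx, ],
                cl = y[-test_idx], k = k)
    mean(pred != y[test_idx])
  })
  mean(fold_errors)
}

k_grid <- seq(1, 15, by = 2)
cv_errors <- sapply(k_grid, cv_error_for_k)
best_k <- k_grid[which.min(cv_errors)]
print(data.frame(k = k_grid, cv_error = round(cv_errors, 3)))
```

The value best_k is the grid point with the smallest CV error. When several values of $k$ are nearly tied, a larger $k$ gives a smoother decision boundary.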

3.9 Quantitative Assessment

First review the fundamentals of hypothesis testing inference and recall that:

Confusion Matrix	Negative	Positive
kNN Fails to reject	TN	FN
kNN rejects	FP	TP
Metrics	Specificity: TN/(TN+FP)	Sensitivity: TP/(TP+FN)
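
As a quick worked example of these formulas, the sketch below computes the metrics from a set of hypothetical counts. The counts are made up for illustration and do not come from the recidivism data.

```r
# Hypothetical confusion-matrix counts (illustrative only)
TN <- 50; FP <- 10; FN <- 5; TP <- 35

sensitivity <- TP / (TP + FN)                  # proportion of actual positives detected
specificity <- TN / (TN + FP)                  # proportion of actual negatives detected
accuracy <- (TP + TN) / (TP + TN + FP + FN)    # overall proportion of correct labels

print(round(c(sensitivity = sensitivity,
              specificity = specificity,
              accuracy = accuracy), 3))
```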

Suppose we want to evaluate how well the kNN model ($k=12$) predicts recidivism. Let’s manually report some of its accuracy metrics, starting with the model sensitivity and specificity.

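library(gmodels)   # provides CrossTable()
# Caution: with prob=T, the "prob" attribute returned by class::knn() is the vote share
# of the winning class for each case, not the probability of the Recidivism class
bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=12) # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control")          # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)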

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_predBin 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          2 |          8 |         10 | 
##                |      0.200 |      0.800 |      0.200 | 
##                |      0.286 |      0.186 |            | 
##                |      0.040 |      0.160 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          5 |         35 |         40 | 
##                |      0.125 |      0.875 |      0.800 | 
##                |      0.714 |      0.814 |            | 
##                |      0.100 |      0.700 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          7 |         43 |         50 | 
##                |      0.140 |      0.860 |            | 
## ---------------|------------|------------|------------|
## 
##

# Row proportions of the cross-table, treating Recidivism as the positive class
mod12_TN <- CT$prop.row[1, 1]  
mod12_FP <- CT$prop.row[1, 2]
mod12_FN <- CT$prop.row[2, 1]
mod12_TP <- CT$prop.row[2, 2]

# Sensitivity = TP/(TP+FN), Specificity = TN/(TN+FP)
mod12_sensi <- mod12_TP/(mod12_TP+mod12_FN) 
mod12_speci <- mod12_TN/(mod12_TN+mod12_FP)
print(paste0("kNN model k=12 Sensitivity=", mod12_sensi))

## [1] "kNN model k=12 Sensitivity=0.875"

print(paste0("kNN model k=12 Specificity=", mod12_speci))

## [1] "kNN model k=12 Specificity=0.2"

table(bt_test_labels, bt_test_predBin)

##               bt_test_predBin
## bt_test_labels Control Recidivism
##     Control          2          8
##     Recidivism       5         35

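Therefore, model12, corresponding to $k=12$, correctly classifies $37$ of the $50$ testing cases ($74\%$ accuracy). However, it recovers only $2$ of the $10$ controls, and always predicting recidivism would reach $80\%$ accuracy on this test set, so its performance is marginal.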
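
One practical caveat when thresholding class::knn() output: with prob=TRUE, the returned prob attribute is the proportion of votes for the winning class, not the probability of one fixed class. Below is a minimal sketch, on simulated one-dimensional data where all names and values are illustrative assumptions, of converting that vote share into a probability for the Recidivism class before applying a cutoff.

```r
library(class)   # knn()

set.seed(1)
# Simulated training data: controls centered at 0, recidivism cases centered at 2
train_x <- matrix(c(rnorm(20, mean = 0), rnorm(20, mean = 2)), ncol = 1)
train_y <- factor(rep(c("Control", "Recidivism"), each = 20))
test_x <- matrix(c(-1, 0.9, 1.1, 3), ncol = 1)

pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5, prob = TRUE)
winning_share <- attr(pred, "prob")   # vote share of the predicted (winning) class

# Convert to the vote share of the Recidivism class specifically
recidivism_prob <- ifelse(pred == "Recidivism", winning_share, 1 - winning_share)

# Apply the 0.6 cutoff to the class-specific probability
pred_cutoff <- ifelse(recidivism_prob > 0.6, "Recidivism", "Control")
print(data.frame(pred, winning_share, recidivism_prob, pred_cutoff))
```

With two classes the winning share is never below $0.5$, so thresholding it directly mostly measures how confident the vote was, rather than which class was favored.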

Another strategy for model validation and improvement involves the use of the caret::confusionMatrix() method, which reports several complementary metrics quantifying the performance of the prediction model.

Let’s examine more closely the performance of model12 in predicting recidivism.

# corr12 <- cor(as.numeric(bt_test_labels), as.numeric(bt_test_predBin))
# corr12
# plot(as.numeric(bt_test_labels), as.numeric(bt_test_pred9))
# table(as.numeric(bt_test_labels), as.numeric(bt_test_pred9))

# install.packages("caret")
library("caret")

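# compute the accuracy, kappa, and sensitivity/specificity of the kNN model

# Model 12: bt_test_predBin
# Note: confusionMatrix(data, reference) expects the predictions first. Here the true labels
# are passed as `data`, so the "Prediction" rows below are the true labels and the
# "Reference" columns are the kNN predictions; the 'Positive' class defaults to Control.
confusionMatrix(as.factor(bt_test_labels), as.factor(bt_test_predBin))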

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Control Recidivism
##   Control          2          8
##   Recidivism       5         35
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.5966, 0.8537)
##     No Information Rate : 0.86            
##     P-Value [Acc > NIR] : 0.9927          
##                                           
##                   Kappa : 0.0845          
##                                           
##  Mcnemar's Test P-Value : 0.5791          
##                                           
##             Sensitivity : 0.2857          
##             Specificity : 0.8140          
##          Pos Pred Value : 0.2000          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.1400          
##          Detection Rate : 0.0400          
##    Detection Prevalence : 0.2000          
##       Balanced Accuracy : 0.5498          
##                                           
##        'Positive' Class : Control         
##
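
Several of the reported statistics can be re-derived by hand from the $2\times 2$ table. The following sketch, which uses only the four counts printed above, recomputes the accuracy, Cohen's kappa, and the balanced accuracy. The variable names are illustrative, and the row/column orientation and positive class (Control) follow the printed output.

```r
# Counts copied from the confusion matrix printed above
cm <- matrix(c(2, 5, 8, 35), nrow = 2,
             dimnames = list(Prediction = c("Control", "Recidivism"),
                             Reference  = c("Control", "Recidivism")))
n <- sum(cm)

accuracy <- sum(diag(cm)) / n                       # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
kappa <- (accuracy - expected) / (1 - expected)

sensitivity <- cm[1, 1] / sum(cm[, 1])   # 'Positive' class is Control
specificity <- cm[2, 2] / sum(cm[, 2])
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, kappa = kappa,
              balanced_accuracy = balanced_accuracy), 4))
```

A kappa this close to zero indicates that the agreement between the two labelings is barely better than what chance alone would produce.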

Finally, we can use a 3D plot to display the row proportions from the model12 cross-table (mod12_TN, mod12_FN, mod12_FP, mod12_TP).

# install.packages("scatterplot3d")
library(scatterplot3d)
grid_xy <- matrix(c(0, 1, 1, 0), nrow=2, ncol=2)
intensity <- matrix(c(mod12_TN, mod12_FN, mod12_FP, mod12_TP), nrow=2, ncol=2)

# scatterplot3d(grid_xy, intensity, pch=16, highlight.3d=TRUE, type="h", main="3D Scatterplot") 

s3d.dat <- data.frame(cols=as.vector(col(grid_xy)), 
      rows=as.vector(row(grid_xy)), 
      value=as.vector(intensity))
# x = column index of the intensity matrix (predicted class), y = row index (real class)
scatterplot3d(s3d.dat, pch=16, highlight.3d=TRUE, type="h", xlab="predicted", ylab="real", zlab="Agreement", 
              main="3D Scatterplot: Model12 Results (FP, FN, TP, TN)")

# scatterplot3d(s3d.dat, type="h", lwd=5, pch=" ", xlab="real", ylab="predicted", zlab="Agreement", main="Model9 Results (FP, FN, TP, TN)")

plot_ly(x = c("TN", "FN", "FP", "TP"),
  y = c(mod12_TN, mod12_FN, mod12_FP, mod12_TP),
  name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Confusion Matrix", 
           legend=list(title=list(text='<b> Model k=12; Performance Metrics </b>')), 
           xaxis=list(title='Metrics'), yaxis=list(title='Probability'))

# plot_ly(type = 'barpolar', r = c(0.2, 0.8, 0.2, 0.85),theta = c(0, 90, 180, 360)) %>%
#     layout(polar = list(radialaxis = list(visible = T,range = c(0,10)),
#       sector = c(0,360), radialaxis = list(tickfont = list(size = 45)),
#       angularaxis = list(tickfont = list(size = 10))))

4 Case Study: Predicting Galaxy Spins

Let’s now use the SOCR Case-Study 22 (22_SDSS_GalaxySpins_Case_Study) to train a kNN classifier on $49,122$ training cases (randomly selected out of a total of $51,122$) and test the accuracy of predicting the Galactic Spin (L=left or R=right hand spin) on the remaining $2,000$ randomly chosen testing galaxies. Evaluate and report the classifier performance. Try to compare and improve the performance of several independent classifiers, e.g., kNN, naive Bayesian, LDA.

Report/graph the training, testing and CV error rates for different $k$ parameters and justify your “optimal” choice for $k=k_o$. How good is your Galactic spin classifier? Provide some visual and numeric evidence.

Some sample code to get you started is included below.

#loading libraries
library(e1071); library(caret); library(class); library(gmodels)

#loading Galaxy data
galaxy_data <- read.csv("https://umich.instructure.com/files/6105118/download?download_frd=1", sep=",")

#View(head(galaxy_data))
dim(galaxy_data)

## [1] 51122    12

#dropping Galaxy ID
galaxy_data <- galaxy_data[ , -1]
str(galaxy_data)

## 'data.frame':    51122 obs. of  11 variables:
##  $ RA    : num  236 237 238 238 238 ...
##  $ DEC   : num  -0.493 -0.482 -0.506 -0.544 -0.527 ...
##  $ HAND  : chr  "R" "L" "L" "L" ...
##  $ UZS   : num  17.4 19.2 19.4 19.5 17.8 ...
##  $ GZS   : num  16.2 18 17.5 18.3 16.6 ...
##  $ RZS   : num  15.6 17.4 16.6 17.9 16.2 ...
##  $ IZS   : num  15.2 17 16.2 17.6 15.8 ...
##  $ ZZS   : num  14.9 16.7 15.8 17.3 15.6 ...
##  $ ELLIPS: num  0.825 0.961 0.411 0.609 0.505 ...
##  $ PHIS  : num  154 100.2 29.2 109.4 39.6 ...
##  $ RSS   : num  0.0547 0.0977 0.0784 0.076 0.0796 ...

#randomly assigning 2k for validation purposes
subset_int <- sample(nrow(galaxy_data), 2000) 

#training data, dropping the label
galaxy_data_train <- galaxy_data[-subset_int, -3]; dim(galaxy_data_train) # training features

## [1] 49122    10

galaxy_data_test <- galaxy_data[subset_int, -3]; dim(galaxy_data_test);

## [1] 2000   10

#summary(galaxy_data_train)
# labels
galaxy_train_labels <- as.factor(galaxy_data[-subset_int, 3])
galaxy_test_labels <- as.factor(galaxy_data[subset_int, 3])

#summary(galaxy_data_test)
# trying to figure out the best K for this kNN model
# testing alternative values of K
gd_test_pred1<-knn(train=galaxy_data_train, test=galaxy_data_test,
                   cl=galaxy_train_labels, k=1)
gd_test_pred10<-knn(train=galaxy_data_train, test=galaxy_data_test,
                    cl=galaxy_train_labels, k=10)
gd_test_pred20<-knn(train=galaxy_data_train, test=galaxy_data_test,
                     cl=galaxy_train_labels, k=20)
gd_test_pred50<-knn(train=galaxy_data_train, test=galaxy_data_test,
                    cl=galaxy_train_labels, k=50)
ct_1<-CrossTable(x=galaxy_test_labels, y=gd_test_pred1, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                    | gd_test_pred1 
## galaxy_test_labels |         L |         R | Row Total | 
## -------------------|-----------|-----------|-----------|
##                  L |      1003 |         8 |      1011 | 
##                    |     0.992 |     0.008 |     0.505 | 
##                    |     0.994 |     0.008 |           | 
##                    |     0.501 |     0.004 |           | 
## -------------------|-----------|-----------|-----------|
##                  R |         6 |       983 |       989 | 
##                    |     0.006 |     0.994 |     0.494 | 
##                    |     0.006 |     0.992 |           | 
##                    |     0.003 |     0.491 |           | 
## -------------------|-----------|-----------|-----------|
##       Column Total |      1009 |       991 |      2000 | 
##                    |     0.504 |     0.495 |           | 
## -------------------|-----------|-----------|-----------|
## 
##

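# Note: the true labels are passed first (as `data`), so the "Prediction" rows below are the true labels
confusionMatrix(galaxy_test_labels, gd_test_pred1)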

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    L    R
##          L 1003    8
##          R    6  983
##                                           
##                Accuracy : 0.993           
##                  95% CI : (0.9883, 0.9962)
##     No Information Rate : 0.5045          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.986           
##                                           
##  Mcnemar's Test P-Value : 0.7893          
##                                           
##             Sensitivity : 0.9941          
##             Specificity : 0.9919          
##          Pos Pred Value : 0.9921          
##          Neg Pred Value : 0.9939          
##              Prevalence : 0.5045          
##          Detection Rate : 0.5015          
##    Detection Prevalence : 0.5055          
##       Balanced Accuracy : 0.9930          
##                                           
##        'Positive' Class : L               
##

#Alternatively we can use the tuning function
knn.tune = tune.knn(x= galaxy_data_train, y = galaxy_train_labels, k = 1:20,
                    tunecontrol=tune.control(sampling = "fix") , fix=10)

#Summarize the resampling results set
summary(knn.tune)

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: fixed training/validation set 
## 
## - best parameters:
##  k
##  1
## 
## - best performance: 0.1910956 
## 
## - Detailed performance results:
##     k     error dispersion
## 1   1 0.1910956         NA
## 2   2 0.4174301         NA
## 3   3 0.4363625         NA
## 4   4 0.3995969         NA
## 5   5 0.4042995         NA
## 6   6 0.4322707         NA
## 7   7 0.4397826         NA
## 8   8 0.4390497         NA
## 9   9 0.4383779         NA
## 10 10 0.4463173         NA
## 11 11 0.4480884         NA
## 12 12 0.4503481         NA
## 13 13 0.4557225         NA
## 14 14 0.4510199         NA
## 15 15 0.4546232         NA
## 16 16 0.4532185         NA
## 17 17 0.4579211         NA
## 18 18 0.4586540         NA
## 19 19 0.4574325         NA
## 20 20 0.4584097         NA

# plot(knn.tune)
df <- as.data.frame(cbind(x=knn.tune$performance$k, y=knn.tune$performance$error))
plot_ly(df, x = ~x, y = ~y, type = "scatter", mode = "markers+lines") %>% 
  add_segments(x=1, xend=1, y=0.0, yend=0.45, type = "scatter", 
               line=list(color="gray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  add_segments(x=5, xend=5, y=0.0, yend=0.45, type = "scatter", 
               line=list(color="lightgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  layout(title='Galaxy-spin k-NN Prediction - Error Rate against k', 
           xaxis=list(title='Number of nearest neighbors (k)'), yaxis=list(title='Classification error'))
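
To compare kNN against another classifier, as suggested in the case-study prompt, one possible template is sketched below. It is only an illustration on the built-in iris data, not the galaxy data, and the train/test split size, $k=5$, and all variable names are assumptions. It also standardizes the predictors using the training-set means and standard deviations, which usually matters for distance-based methods such as kNN.

```r
library(class)   # knn()
library(MASS)    # lda()

set.seed(650)
test_idx <- sample(nrow(iris), 50)          # hold out 50 cases for testing
train_x <- iris[-test_idx, 1:4]
test_x  <- iris[test_idx, 1:4]
train_y <- iris$Species[-test_idx]
test_y  <- iris$Species[test_idx]

# Standardize both sets using the training-set centers and scales only
train_scaled <- scale(train_x)
test_scaled <- scale(test_x,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))

# kNN on the standardized predictors
knn_pred <- knn(train = train_scaled, test = test_scaled, cl = train_y, k = 5)

# Linear discriminant analysis on the raw predictors
lda_fit <- lda(x = train_x, grouping = train_y)
lda_pred <- predict(lda_fit, newdata = test_x)$class

accuracy <- c(kNN = mean(knn_pred == test_y), LDA = mean(lda_pred == test_y))
print(round(accuracy, 3))
```

For the galaxy data, the same pattern would apply with galaxy_data_train, galaxy_data_test, and the corresponding label vectors in place of the iris objects.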

Data Science and Predictive Analytics (UMich HS650)

Lazy Learning - Classification Using Nearest Neighbors

SOCR/MIDAS (Ivo Dinov)

November 2021