In this chapter, we cover two very powerful machine learning algorithms. These techniques have complex mathematical formulations; however, efficient algorithms and reliable software packages have been developed to apply them to various practical problems. We will (1) describe neural networks as analogues of biological neurons, (2) develop, hands-on, a neural net that can be trained to compute the square-root function, (3) describe support vector machine (SVM) classification, and (4) complete several case-studies, including optical character recognition (OCR), the Iris flowers, Google Trends and the Stock Market, and Quality of Life in chronic disease.
Later, in Chapter 22, we will provide more details and additional examples of deep neural network learning. For now, let’s start by exploring the magic inside the machine learning black box.
An Artificial Neural Network (ANN)
model mimics the way a biological brain responds to multisource (sensory-motor) stimuli (inputs). An ANN simulates the brain using a network of interconnected artificial neurons that acts as a massively parallel processor. Of course, it uses a network of artificial nodes, rather than actual brain cells, that is trained on data.
When we have three input signals \(x_1\), \(x_2\) and \(x_3\), the first step is to weight the features (\(w\)'s) according to their importance. The weighted signals are then summed by the "neuron cell" and this sum is passed on through an activation function, denoted by \(f\). The last step generates an output \(y\). A typical output has the following mathematical relationship to the inputs. \[y(x)=f\left (\sum_{i=1}^n w_ix_i\right ).\]
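To make this concrete, here is a minimal R sketch of a single artificial neuron; the specific inputs, weights, and the sigmoid choice of \(f\) below are illustrative values, not taken from this chapter's data.

# a single artificial neuron: weighted sum of inputs passed through an activation
f <- function(z) 1 / (1 + exp(-z))       # sigmoid activation (one possible choice of f)
x <- c(0.5, -1.2, 2.0)                   # three illustrative input signals x1, x2, x3
w <- c(0.8, 0.1, -0.4)                   # illustrative weights w1, w2, w3
y <- f(sum(w * x))                       # y = f(sum_i w_i x_i)
y                                        # single output signal in (0, 1)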
There are three important components for building a neural network: the activation function, the network topology (architecture), and the training algorithm.
Let's look at each of these components one by one.
One choice is the threshold activation function, which produces an output signal only once a specified input threshold has been attained.
\[ f(x)= \left\{ \begin{array}{ll} 0 & x<0 \\ 1 & x\geq 0 \\ \end{array} \right. . \]
This is the simplest form of activation function and it is seldom used in real-world situations. The most commonly used alternative is the sigmoid activation function, \(f(x)=\frac{1}{1+e^{-x}}\), where the Euler number \(e\)
is the base of the natural logarithm. The output signal is no longer binary; it can be any real number between 0 and 1.
Other activation functions might also be useful:
Basically, we can choose a proper activation function based on its codomain. For example, with the hyperbolic tangent activation function, outputs range from -1 to 1 regardless of the input. With the linear function, outputs can range from \(-\infty\) to \(+\infty\). The Gaussian activation function yields a model called a Radial Basis Function (RBF) network.
The number of layers: The \(x\)'s, or features, in the dataset are fed into the input nodes, while the predicted values are produced by the output nodes. Multilayer networks include one or more hidden layers. The following graph represents a two-layer neural network:
(Figure: a two-layer neural network)
When we have multiple layers, the information flow can become complicated.
The arrows in the last graph (with multiple layers) indicate a feed-forward network. In such a network, we can also model multiple outcomes simultaneously.
(Figure: a multi-output feed-forward network)
Alternatively, in a recurrent network (feedback network), information can also travel backwards in loops (or delay). This is illustrated in the following graph.
(Figure: a delay/feedback (recurrent) network)
This short-term memory increases the power of recurrent networks dramatically. However, in practice, recurrent networks are seldom used.
The numbers of input nodes and output nodes are predetermined by the dataset and the predictive variables. What we can tune is the number of hidden nodes in the model. Our goal is to use as few hidden nodes as possible, keeping the model simple, while the model performance remains satisfactory.
This algorithm determines the weights in the model using the strategy of back-propagating errors. First, we assign random numbers to the weights (but all weights must be non-trivial, i.e., \(\not= 0\)). For example, we can use a normal distribution, or any other random process, to assign initial weights. Then, we adjust the weights iteratively, repeating the process until a certain convergence or stopping criterion is met. Each iteration contains two phases.
In the end, we pick a set of weights, corresponding to the least total error, to be the weights in our network.
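Before applying this to real data, the following minimal R sketch illustrates the two phases (a forward pass computing the outputs, and a backward pass adjusting the weights against the error gradient) for a single sigmoid neuron on toy logical-OR data; the learning rate, iteration count, and squared-error loss are illustrative assumptions, not the exact scheme used by the neuralnet package.

# a minimal sketch of gradient-based weight updates for one sigmoid neuron
sigmoid <- function(z) 1 / (1 + exp(-z))
X <- cbind(bias = 1, x1 = c(0, 0, 1, 1), x2 = c(0, 1, 0, 1))
y <- c(0, 1, 1, 1)                        # target outputs (logical OR)
set.seed(1)
w <- rnorm(3, sd = 0.1)                   # random, non-trivial initial weights
eta <- 0.5                                # learning rate (arbitrary choice)
for (i in 1:5000) {
  y_hat <- sigmoid(X %*% w)               # forward phase: compute outputs
  grad  <- t(X) %*% ((y_hat - y) * y_hat * (1 - y_hat))  # backward phase: error gradient
  w     <- w - eta * grad                 # adjust weights against the gradient
}
round(cbind(target = y, predicted = as.vector(sigmoid(X %*% w))), 3)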
In this case study, we are going to use the Google Trends and stock market dataset. A doc file with the meta-data and the CSV data are available on the Case-Studies Canvas Site. These daily data (between 2008 and 2009) can be used to examine the associations between Google search trends and the daily market index, the Dow Jones Industrial Average.
Here we use the RealEstate
variable as our dependent variable. Let's see if the Google Real Estate Index can be predicted using the other variables in the dataset.
First things first, we need to load the dataset into R.
google<-read.csv("https://umich.instructure.com/files/416274/download?download_frd=1", stringsAsFactors = F)
Let’s delete the first two columns, since the only goal is to predict Google Real Estate Index with other indexes and DJI.
google<-google[, -c(1, 2)]
str(google)
## 'data.frame': 731 obs. of 24 variables:
## $ Unemployment : num 1.54 1.56 1.59 1.62 1.64 1.64 1.71 1.85 1.82 1.78 ...
## $ Rental : num 0.88 0.9 0.92 0.92 0.94 0.96 0.99 1.02 1.02 1.01 ...
## $ RealEstate : num 0.79 0.81 0.82 0.82 0.83 0.84 0.86 0.89 0.89 0.89 ...
## $ Mortgage : num 1 1.05 1.07 1.08 1.1 1.11 1.15 1.22 1.23 1.24 ...
## $ Jobs : num 0.99 1.05 1.1 1.14 1.17 1.2 1.3 1.41 1.43 1.44 ...
## $ Investing : num 0.92 0.94 0.96 0.98 0.99 0.99 1.02 1.09 1.1 1.1 ...
## $ DJI_Index : num 13044 13044 13057 12800 12827 ...
## $ StdDJI : num 4.3 4.3 4.31 4.14 4.16 4.16 4.16 4 4.1 4.17 ...
## $ Unemployment_30MA : num 1.37 1.37 1.38 1.38 1.39 1.4 1.4 1.42 1.43 1.44 ...
## $ Rental_30MA : num 0.72 0.72 0.73 0.73 0.74 0.75 0.76 0.77 0.78 0.79 ...
## $ RealEstate_30MA : num 0.67 0.67 0.68 0.68 0.68 0.69 0.7 0.7 0.71 0.72 ...
## $ Mortgage_30MA : num 0.98 0.97 0.97 0.97 0.98 0.98 0.98 0.99 0.99 1 ...
## $ Jobs_30MA : num 1.06 1.06 1.05 1.05 1.05 1.05 1.05 1.06 1.07 1.08 ...
## $ Investing_30MA : num 0.99 0.98 0.98 0.98 0.98 0.97 0.97 0.97 0.98 0.98 ...
## $ DJI_Index_30MA : num 13405 13396 13390 13368 13342 ...
## $ StdDJI_30MA : num 4.54 4.54 4.53 4.52 4.5 4.48 4.46 4.44 4.41 4.4 ...
## $ Unemployment_180MA: num 1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 1.44 ...
## $ Rental_180MA : num 0.87 0.87 0.87 0.87 0.87 0.87 0.86 0.86 0.86 0.86 ...
## $ RealEstate_180MA : num 0.89 0.89 0.88 0.88 0.88 0.88 0.88 0.88 0.88 0.87 ...
## $ Mortgage_180MA : num 1.18 1.18 1.18 1.18 1.17 1.17 1.17 1.17 1.17 1.17 ...
## $ Jobs_180MA : num 1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 1.24 ...
## $ Investing_180MA : num 1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.04 1.04 ...
## $ DJI_Index_180MA : num 13493 13492 13489 13486 13482 ...
## $ StdDJI_180MA : num 4.6 4.6 4.6 4.6 4.59 4.59 4.59 4.58 4.58 4.58 ...
As we can see from the structure of the data, these indexes and DJI have different ranges. We should rescale
the data. In Chapter 6, we learned that normalizing these features using our own normalize()
function could fix the problem. We can use lapply()
to apply the normalize()
function to each column.
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
google_norm<-as.data.frame(lapply(google, normalize))
summary(google_norm$RealEstate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.4615 0.6731 0.6292 0.8077 1.0000
The last step clearly normalizes all feature vectors into the 0 to 1 range.
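As a quick, optional sanity check (a sketch), we can verify that every normalized column indeed spans the interval from 0 to 1.

# range of each normalized feature; every minimum should be 0 and every maximum 1
sapply(google_norm, range)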
The next step is to split our google
dataset into training and test subsets. This time we will use the sample()
and floor()
functions to create the partition. sample(n, size)
draws size random row indices, which we can then use to subset the original dataset with randomly selected rows. floor()
takes a number x and returns the largest integer that does not exceed x.
sub<-sample(nrow(google_norm), floor(nrow(google_norm)*0.75))
google_train<-google_norm[sub, ]
google_test<-google_norm[-sub, ]
We are good to go! Let's move on to the training phase.
Here, we use the function neuralnet()
in the package neuralnet
. A call of the form
m<-neuralnet(target~predictors, data=mydata, hidden=1)
returns a fitted neural network (NN) object, where target is the outcome variable, predictors are the feature names (separated by +), data is the training data frame, and hidden specifies the number of hidden nodes. The fitted object stores, among other things, the estimated weights and the fitted values (net.result).
# install.packages("neuralnet")
library(neuralnet)
google_model<-neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train)
plot(google_model)
The above graph shows that we have only one hidden node. Error
stands for the sum of squared errors and Steps
is the number of iterations the model went through. Note that these outputs may differ when you run the exact same code twice, because the initial weights are randomly generated.
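If exact reproducibility is desired, one option (a sketch, not required for the analysis; the seed value and object name are arbitrary) is to fix the random number generator seed before fitting, so the random starting weights are the same on every run.

# fixing the RNG seed makes the random initial weights, and hence Error and Steps, reproducible
set.seed(1234)
google_model_reprod <- neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train)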
Similar to the predict()
function that we have used in previous chapters, compute()
is an alternative method that generates predictions from the fitted model:
p<-compute(m, test)
In our model we picked Unemployment, Rental, Mortgage, Jobs, Investing, DJI_Index, StdDJI
as our predictors. So we need to find these corresponding column numbers in the test dataset (1, 2, 4, 5, 6, 7, 8 respectively).
google_pred<-compute(google_model, google_test[, c(1:2, 4:8)])
pred_results<-google_pred$net.result
cor(pred_results, google_test$RealEstate)
## [,1]
## [1,] 0.9720240902
As mentioned in Chapter 9, we can still use the correlation between predicted results and observed Real Estate Index to evaluate the algorithm. A correlation over 0.9 is very good for real world datasets. Could this be improved further?
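Besides the correlation, we could also gauge the prediction accuracy on the normalized scale with a simple root-mean-squared-error calculation (a sketch; rmse here is our own helper function, not part of neuralnet).

# root mean squared error between observed and predicted (normalized) RealEstate
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(google_test$RealEstate, pred_results)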
This time we will include 4 hidden nodes in the model. Let's see what results we get from this more complex model.
google_model2<-neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train, hidden = 4)
plot(google_model2)
Although the graph looks very complicated, we have a smaller Error,
i.e., sum of squared errors. Neural networks can actually be used for both regression and classification, as we will see in the next part. Let's first continue with regression.
google_pred2<-compute(google_model2, google_test[, c(1:2, 4:8)])
pred_results2<-google_pred2$net.result
cor(pred_results2, google_test$RealEstate)
## [,1]
## [1,] 0.9875711149
We get an even higher correlation. This is almost an ideal result! The predicted and observed indexes have a strong linear relationship. Nevertheless, too many hidden nodes might even decrease the correlation between predicted and true values, which will be examined in the practice problems later in this chapter.
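One way to examine this empirically (a sketch; runtimes and exact correlations will vary because of the random weight initialization) is to loop over several hidden-node counts and compare the resulting test-set correlations.

# refit with several hidden-node counts and compare test-set correlations
for (h in c(1, 2, 4, 8)) {
  m_h <- neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI,
                   data=google_train, hidden=h)
  p_h <- compute(m_h, google_test[, c(1:2, 4:8)])$net.result
  cat("hidden =", h, " correlation =", round(cor(p_h, google_test$RealEstate), 4), "\n")
}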
We observe an even lower Error
by using three hidden layers with 4, 3, and 3 nodes, respectively.
google_model2<-neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI, data=google_train, hidden = c(4,3,3))
google_pred2<-compute(google_model2, google_test[, c(1:2, 4:8)])
pred_results2<-google_pred2$net.result
cor(pred_results2, google_test$RealEstate)
## [,1]
## [1,] 0.9874817017
This simple example demonstrates the foundation of the neural network prediction of a basic mathematical function: \(\sqrt {\ \ \ }: \Re^+ \longrightarrow \Re^+\).
# generate random training data: 1,000 |X_i|, where X_i ~ Uniform(0, 100)
rand_data <- abs(runif(1000, 0, 100))
# create a 2 column data-frame (rand_data, sqrt_data)
sqrt_df <- data.frame(rand_data, sqrt_data=sqrt(rand_data))
plot(rand_data, sqrt_df$sqrt_data)
# Train the neural net
net.sqrt <- neuralnet(sqrt_data ~ rand_data, sqrt_df, hidden=10, threshold=0.01)
# report the NN
# print(net.sqrt)
# generate testing data seq(from=0, to=N, by=0.1)
N <- 200.0 # out of range [100: 200] is also included in the testing!
test_data <- seq(0, N, 0.1); test_data_sqrt <- sqrt(test_data)
# try to predict the square-root values using 10 hidden nodes
# Compute or predict for test data, test_data
pred_sqrt <- compute(net.sqrt, test_data)$net.result
# compute uses the trained neural net (net.sqrt),
# to estimate the square-roots of the testing data
# compare real (test_data_sqrt) and NN-predicted (pred_sqrt) square roots of test_data
plot(pred_sqrt, test_data_sqrt, xlim=c(0, 12), ylim=c(0, 12)); abline(0,1, col="red", lty=2)
legend("bottomright", c("Pred vs. Actual SQRT", "Pred=Actual Line"), cex=0.8, lty=c(1,2), lwd=c(2,2),col=c("black","red"))
compare_df <-data.frame(pred_sqrt, test_data_sqrt); # compare_df
plot(test_data, test_data_sqrt)
lines(test_data, pred_sqrt, pch=22, col="red", lty=2)
legend("bottomright", c("Actual SQRT","Predicted SQRT"), lty=c(1,2), lwd=c(2,2),col=c("black","red"))
We observe that the NN, net.sqrt,
actually learns the square root function and predicts it quite closely. Of course, everyone's results may vary, as we randomly generate the training data (rand_data) and the NN construction (net.sqrt) is also stochastic.
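To quantify how close the approximation is (an optional sketch), we can compute the mean absolute error of the NN predictions over the testing grid; inputs in (100, 200] were never seen during training, so errors there are expected to be larger.

# mean absolute error of the NN square-root approximation over the testing grid
mean(abs(pred_sqrt - test_data_sqrt))
# restrict to the training support [0, 100] for a fairer comparison
in_range <- test_data <= 100
mean(abs(pred_sqrt[in_range] - test_data_sqrt[in_range]))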
In practice, NNs may be more useful as classifiers. Let's demonstrate this by using the Stock Market data again. We label the samples according to their RealEstate
value: those above the 75th percentile get label 0, those below the 25th percentile get label 1, and the rest get label 2. Even for classification, the response still must be numeric.
google_class = google_norm
id1 = which(google_class$RealEstate>quantile(google_class$RealEstate,0.75))
id2 = which(google_class$RealEstate<quantile(google_class$RealEstate,0.25))
id3 = setdiff(1:nrow(google_class),union(id1,id2))
google_class$RealEstate[id1]=0
google_class$RealEstate[id2]=1
google_class$RealEstate[id3]=2
summary(as.factor(google_class$RealEstate))
## 0 1 2
## 179 178 374
Here, we divide the data into training and testing sets. We also need 3 additional indicator columns, which correspond to the 3 outcome labels.
set.seed(2017)
train = sample(1:nrow(google_class),0.7*nrow(google_class))
google_tr = google_class[train,]
google_ts = google_class[-train,]
train_x = google_tr[,c(1:2,4:8)]
train_y = google_tr[,3]
colnames(train_x)
## [1] "Unemployment" "Rental" "Mortgage" "Jobs"
## [5] "Investing" "DJI_Index" "StdDJI"
test_x = google_ts[,c(1:2,4:8)]
test_y = google_ts[3]
train_y_ind = model.matrix(~factor(train_y)-1)
colnames(train_y_ind) = c("High","Median","Low")
train = cbind(train_x, train_y_ind)
We use a non-linear output and display the training progress every 2,000 iterations.
nn_single = neuralnet(High+Median+Low~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI,
data = train,
hidden=4,
linear.output=FALSE,
lifesign='full', lifesign.step=2000)
## hidden: 4 thresh: 0.01 rep: 1/1 steps: 2000 min thresh: 0.1370201548
## 4000 min thresh: 0.08524054094
## 6000 min thresh: 0.08524054094
## 8000 min thresh: 0.08524054094
## 10000 min thresh: 0.08524054094
## 12000 min thresh: 0.08524054094
## 14000 min thresh: 0.0837143363
## 16000 min thresh: 0.07801429993
## 18000 min thresh: 0.07076261657
## 20000 min thresh: 0.0613385125
## 22000 min thresh: 0.05320022886
## 24000 min thresh: 0.05320022886
## 26000 min thresh: 0.0529413719
## 28000 min thresh: 0.04668464218
## 30000 min thresh: 0.04456729135
## 32000 min thresh: 0.03594663725
## 34000 min thresh: 0.03593147665
## 36000 min thresh: 0.03323126867
## 38000 min thresh: 0.0289306258
## 40000 min thresh: 0.02427719823
## 42000 min thresh: 0.02158221449
## 44000 min thresh: 0.01831644589
## 46000 min thresh: 0.01682874388
## 48000 min thresh: 0.01572773551
## 50000 min thresh: 0.01311388938
## 52000 min thresh: 0.01241004281
## 54000 min thresh: 0.01131407008
## 55420 error: 7.01191 time: 19.22 secs
Below is a prediction function that translates the fitted model into forecasted class labels.
pred = function(nn, dat) {
# compute uses the trained neural net (nn=nn_single), and
# new testing data (dat=google_ts) to generate predictions (y_hat)
# compute returns a list containing:
# (1) neurons: a list of the neurons' output for each layer of the neural network, and
# (2) net.result: a matrix containing the overall result of the neural network.
yhat = compute(nn, dat)$net.result
# find the maximum in each row (1) in the net.result matrix
# to determine the first occurrence of a specific element in each row (1)
# we can use the apply function with which.max
yhat = apply(yhat, 1, which.max)-1
return(yhat)
}
mean(pred(nn_single, google_ts[,c(1:2,4:8)]) != as.factor(google_ts[,3]))
## [1] 0.03181818182
Now let’s inspect the structure of the Neural Network.
plot(nn_single)
Similarly, we can change hidden
to utilize multiple hidden layers. However, a more complicated model doesn't necessarily guarantee improved performance.
nn_single = neuralnet(High+Median+Low~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI,
data = train,
hidden=c(4,5),
linear.output=FALSE,
lifesign='full', lifesign.step=2000)
## hidden: 4, 5 thresh: 0.01 rep: 1/1 steps: 2000 min thresh: 0.307328437
## 4000 min thresh: 0.2875517033
## 6000 min thresh: 0.1383720887
## 8000 min thresh: 0.1115440575
## 10000 min thresh: 0.09233958192
## 12000 min thresh: 0.0766173347
## 14000 min thresh: 0.05763223509
## 16000 min thresh: 0.03417989426
## 18000 min thresh: 0.01473872843
## 20000 min thresh: 0.01101646653
## 20741 error: 7.00627 time: 11.05 secs
mean(pred(nn_single, google_ts[,c(1:2,4:8)]) != as.factor(google_ts[,3]))
## [1] 0.03181818182
Recall that the lazy learning methods in Chapter 6 assign classes using geometrical distances between features. In a multidimensional space (multiple features), this effectively uses spheres centered at the training observations to assign classes to the test cases. Can we use a fancier shape in the \(nD\) space to do the classification?
The simplest shape would be a plane. A Support Vector Machine (SVM) can use a hyperplane to separate data into several groups or classes. This works directly for datasets that are linearly separable.
Assume we have only two features: would you use hyperplane A or B to separate the data? Or perhaps another plane, C?
To answer this question, we need to search for the Maximum Margin Hyperplane (MMH), the hyperplane that creates the greatest separation between the two closest observations.
We define support vectors as the points from each class that are closest to the MMH. Each class must have at least one observation serving as a support vector.
Using the support vectors alone is not sufficient for finding the MMH. Although tricky mathematical calculations are involved, the principle of the process is fairly simple. Let's look at linearly separable data and non-linearly separable data separately.
If the dataset is linearly separable, we can find the outer boundaries of our two groups of data points. These boundaries are called convex hulls (red lines in the following graph). The MMH (solid black line) is the line perpendicular to the shortest segment connecting the two convex hulls.
An alternative approach is to pick two parallel planes that separate the data into two groups such that the distance between the two planes is as large as possible.
We can use vector notation to mathematically define planes. In n-dimensional space, a plane could be expressed by the following equation: \[\vec{w}\cdot\vec{x}+b=0,\] where the vectors \(\vec{w}\) (weights) and \(\vec{x}\) (unknowns) both have n coordinates and b is a (scalar) constant.
To clarify this notation, let's look at the situation in 3D space, where we can express (embed) 2D Euclidean planes using a point (\((x_o,y_o,z_o)\)) and a normal vector (\((a,b,c)\)). This is just a linear equation, where \(d=-(ax_o + by_o + cz_o)\): \[ax + by + cz + d = 0,\] or equivalently \[w_1x_1+w_2x_2+w_3x_3+b=0,\] which is exactly the vector notation above.
Using the vector notation, we can specify two hyperplanes as follows: \[\vec{w}\cdot\vec{x}+b\geq+1\] and \[\vec{w}\cdot\vec{x}+b\leq-1\] We require that all of the observations in the first class fall above the first plane and all observations in the other class fall below the second plane.
The distance between two planes is calculated as: \[\frac{2}{\lVert \vec{w}\rVert}\] where \(\lVert \vec{w}\rVert\) is the Euclidean norm. To maximize the distance we need to minimize the Euclidean norm.
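To see where the factor \(\frac{2}{\lVert \vec{w}\rVert}\) comes from, take a point \(\vec{x}_1\) on the first plane and the nearest point \(\vec{x}_2\) on the second plane. Their difference is parallel to \(\vec{w}\), so subtracting the two plane equations gives \[\vec{w}\cdot(\vec{x}_1-\vec{x}_2)=2 \ \Longrightarrow\ \lVert\vec{x}_1-\vec{x}_2\rVert=\frac{|\vec{w}\cdot(\vec{x}_1-\vec{x}_2)|}{\lVert\vec{w}\rVert}=\frac{2}{\lVert\vec{w}\rVert}.\]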
To sum up, we are going to find \(\min\frac{\lVert \vec{w}\rVert}{2}\) subject to the following constraint: \[y_i(\vec{w}\cdot\vec{x}_i-b)\geq1, \, \forall\vec{x}_i,\] where \(\forall\) means "for all".
For each nonlinear programming problem, the primal problem, there is related nonlinear programming problem, the Lagrangian dual problem. Under certain assumptions for convexity and suitable constraints, the primal and dual problems have equal optimal objective values. Primal optimization problems are typically described as: \[\begin{array}{rcl} \min_x{f(x)} \\ \text{subject to} \\ g_i(x) \leq 0\\ h_j(x) = 0 \\ \end{array}.\]
Then the Lagrangian dual problem is defined as a parallel nonlinear programming problem: \[\begin{array}{rcl} & \min_{u,v}{\theta(u,v)} & \\ & \text{subject to} & \\ & u \geq 0\\ \end{array},\] where \[ \theta(u,v)= \inf_{x}{ \left ( f(x)+\displaystyle\sum_i {u_i g_i(x)} +\displaystyle\sum_j {v_j h_j(x)} \right )}.\]
Chapter 21 provides additional technical details about optimization duality.
Suppose the Lagrange primal is \[L_p = \frac{1}{2}||w||^2-\sum_{i=1}^{n}\alpha_i[y_i(b+x_i^{T}w)-1],\ \text{where}\ \alpha_i\geq 0.\]
To optimize that objective function, we can set the partial derivatives equal to zero: \[\frac{\partial L_p}{\partial w}=0\ :\ w = \sum_{i=1}^{n}\alpha_iy_ix_i\]
\[\frac{\partial L_p}{\partial b}=0\ :\ 0 = \sum_{i=1}^{n}\alpha_iy_i.\]
Substituting into the Lagrange primal, we obtain the Lagrange dual:
\[L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}y_iy_{i'}x_i^Tx_{i'}.\]
We maximize \(L_D\) subject to \(\alpha_i \geq 0\) and \(\sum_{i=1}^{n}\alpha_iy_i =0\).
Since the solution satisfies the Karush-Kuhn-Tucker optimization conditions, we have \(\hat\alpha_i[y_i(\hat{b}+x_i^T\hat{w})-1]=0,\) which implies that if \(y_i\hat{f}(x_i)>1\), then \(\hat{\alpha}_i=0\).
The support of a function (\(f\)) is the smallest subset of the domain containing only arguments (\(x\)) that are not mapped to zero (\(f(x)\not=0\)). In our case, the solution \(\hat{w}\) is defined in terms of a linear combination of the support points:
\[\hat{w} = \sum_{i=1}^{n}\hat{\alpha}_iy_ix_i, \quad \text{with the fitted classifier}\ \ \hat{f}(x)=x^T\hat{w}+\hat{b}.\]
That’s where the name of Support Vector Machines (SVM) comes from.
For non-linearly separable data, we need to use a small trick. We still use a plane, but allow some of the points to be misclassified into the wrong class. To penalize for that, we add a cost term for the misclassified observations to the objective function that we minimize.
Therefore, the solution will optimize the following objective (cost) function: \[\min \left (\frac{\lVert \vec{w}\rVert}{2} \right )+C\sum_{i=1}^{n} \xi_i\] \[\text{subject to}\] \[y_i(\vec{w}\cdot\vec{x}-b)\geq1, \, \forall\vec{x}_i, \, \xi_i\geq0,\] where \(C\) is the cost and \(\xi_i\) is the distance between the misclassified observation i and the plane.
We have Lagrange primal problem: \[L_p = \frac{1}{2}||w||^2 + C\sum_{i=1}^{n}\xi_i-\sum_{i=1}^{n}\alpha_i[y_i(b+x_i^{T}w)-(1-\xi_i)] - \sum_{i=1}^{n}\gamma_i\xi_i,\] where \[\alpha_i,\gamma_i \geq 0.\]
Similar to what we did above for the separable case, we can use the derivatives of the primal problem to solve the dual problem.
Notice the inner product in the final expression. We can replace this inner product with a kernel function that maps the feature space into a higher dimensional space (e.g., using a polynomial kernel) or an infinite dimensional space (e.g., using a Gaussian kernel).
An alternative way to handle non-linearly separable data is the kernel trick. The idea is to add dimensions (or features) so that the non-linearly separable data become separable in a higher-dimensional space.
How can we do that? We transform our data using kernel functions. A general form for kernel functions would be: \[K(\vec{x_i}, \vec{x_j})=\phi(\vec{x_i})\cdot\phi(\vec{x_j}),\] where \(\phi\) is a mapping of the data into another space.
The linear kernel is the simplest one; it is just the dot product of the features. \[K(\vec{x_i}, \vec{x_j})=\vec{x_i}\cdot\vec{x_j}.\] The polynomial kernel of degree \(d\) adds a simple non-linear transformation of the data. \[K(\vec{x_i}, \vec{x_j})=(\vec{x_i}\cdot\vec{x_j}+1)^d.\] The sigmoid kernel is very similar to a neural network; it uses a sigmoid activation function. \[K(\vec{x_i}, \vec{x_j})=\tanh(k\, \vec{x_i}\cdot\vec{x_j}-\delta).\] The Gaussian RBF kernel is similar to an RBF neural network and is a good place to start when investigating a dataset. \[K(\vec{x_i}, \vec{x_j})=\exp \left (\frac{-\lVert \vec{x_i}-\vec{x_j}\rVert^2}{2\sigma^2}\right ) .\]
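As a small numerical illustration (a sketch using two arbitrary feature vectors; the degree \(d=2\) and bandwidth \(\sigma=1\) are assumed values), we can evaluate the linear, polynomial, and Gaussian RBF kernels directly in R.

# evaluate three kernel functions on two arbitrary feature vectors
xi <- c(1, 2, 3); xj <- c(2, 0, 1)
K_linear <- sum(xi * xj)                            # linear kernel: dot product
K_poly   <- (sum(xi * xj) + 1)^2                    # polynomial kernel with d = 2
sigma    <- 1                                       # assumed RBF bandwidth
K_rbf    <- exp(-sum((xi - xj)^2) / (2 * sigma^2))  # Gaussian RBF kernel
c(linear = K_linear, polynomial = K_poly, rbf = K_rbf)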
This example illustrates managing handwritten notes (text) and converting them to typeset or printed text representing the characters in the original notes (unstructured image data).
In this example, we use an optical document image (data) that has already been pre-partitioned into rectangular grid cells, each containing one character from the 26 English capital letters, A through Z.
The resulting gridded dataset is distributed by the UCI Machine Learning Data Repository. The dataset contains 20,000 examples of the 26 English capital letters printed using 20 different randomly reshaped and morphed fonts.
# read in data and examine structure
hand_letters <- read.csv("https://umich.instructure.com/files/2837863/download?download_frd=1", header = T)
str(hand_letters)
## 'data.frame': 20000 obs. of 17 variables:
## $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
## $ xbox : int 2 5 4 7 2 4 4 1 2 11 ...
## $ ybox : int 8 12 11 11 1 11 2 1 2 15 ...
## $ width : int 3 3 6 6 3 5 5 3 4 13 ...
## $ height: int 5 7 8 6 1 8 4 2 4 9 ...
## $ onpix : int 1 2 6 3 1 3 4 1 2 7 ...
## $ xbar : int 8 10 10 5 8 8 8 8 10 13 ...
## $ ybar : int 13 5 6 9 6 8 7 2 6 2 ...
## $ x2bar : int 0 5 2 4 6 6 6 2 2 6 ...
## $ y2bar : int 6 4 6 6 6 9 6 2 6 2 ...
## $ xybar : int 6 13 10 4 6 5 7 8 12 12 ...
## $ x2ybar: int 10 3 3 4 5 6 6 2 4 1 ...
## $ xy2bar: int 8 9 7 10 9 6 6 8 8 9 ...
## $ xedge : int 0 2 3 6 1 0 2 1 1 8 ...
## $ xedgey: int 8 8 7 10 7 8 8 6 6 1 ...
## $ yedge : int 0 4 3 2 5 9 7 2 1 1 ...
## $ yedgex: int 8 10 9 8 10 7 10 7 7 8 ...
# divide into training (3/4) and testing (1/4) data
hand_letters_train <- hand_letters[1:15000, ]
hand_letters_test <- hand_letters[15001:20000, ]
We can specify vanilladot as a linear kernel, or alternatively:

- rbfdot: Radial Basis kernel, i.e., "Gaussian"
- polydot: Polynomial kernel
- tanhdot: Hyperbolic tangent kernel
- laplacedot: Laplacian kernel
- besseldot: Bessel kernel
- anovadot: ANOVA RBF kernel
- splinedot: Spline kernel
- stringdot: String kernel

# begin by training a simple linear SVM
library(kernlab)
set.seed(123)
hand_letter_classifier <- ksvm(letter ~ ., data = hand_letters_train, kernel = "vanilladot")
## Setting default kernel parameters
# look at basic information about the model
hand_letter_classifier
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 6618
##
## Objective Function Value : -13.2947 -19.6051 -20.8982 -5.6651 -7.2092 -31.5151 -48.3253 -17.6236 -57.0476 -30.532 -15.7162 -31.49 -28.2706 -45.741 -11.7891 -33.3161 -28.2251 -16.5347 -13.2693 -30.88 -29.4259 -7.7099 -11.1685 -29.4289 -13.0857 -9.2631 -144.1105 -52.7747 -71.052 -109.7783 -158.3152 -51.2839 -39.6499 -67.0061 -23.8637 -27.6083 -26.3461 -35.2626 -38.6346 -116.8967 -173.8336 -214.2196 -20.7925 -10.3812 -53.1156 -12.228 -46.6132 -8.6867 -18.9108 -11.0535 -94.5751 -26.5689 -224.0215 -70.5714 -8.3232 -4.5265 -132.5431 -74.6876 -19.5742 -12.7352 -81.7894 -11.6983 -25.4835 -17.582 -23.934 -27.022 -50.7092 -10.9228 -4.3852 -13.7216 -3.8547 -3.5723 -8.419 -36.9773 -47.1418 -172.6874 -42.457 -44.0342 -42.7695 -13.0527 -16.7534 -78.7849 -101.8146 -32.1141 -30.3349 -104.0695 -32.1258 -24.6301 -32.6087 -17.0808 -5.1347 -40.5505 -6.684 -16.2962 -56.364 -147.3669 -49.0907 -37.8334 -32.8068 -73.248 -127.7819 -10.5342 -5.2495 -11.9568 -30.1631 -135.5915 -51.521 -176.2669 -99.0973 -10.295 -14.5906 -3.7822 -64.1452 -7.4813 -84.9109 -40.9146 -87.2437 -66.8629 -69.9932 -20.5294 -12.7577 -7.0328 -22.9219 -12.3975 -223.9411 -29.9969 -24.0552 -132.6252 -133.7033 -9.2959 -33.1873 -5.8016 -57.3392 -60.9046 -27.1766 -200.8554 -29.9334 -15.9359 -130.0183 -154.4587 -43.5779 -24.4852 -135.7896 -74.1531 -303.5043 -131.4741 -149.5403 -30.4917 -29.8086 -47.3454 -24.6204 -44.2792 -6.2064 -8.6708 -36.4412 -68.712 -179.7303 -44.7489 -84.8608 -136.6786 -569.3398 -113.0779 -138.435 -303.8556 -32.8011 -60.4546 -139.3525 -108.9841 -34.277 -64.9071 -38.6148 -7.5086 -204.222 -12.9572 -29.0252 -2.0352 -5.9916 -14.3706 -21.5773 -57.0064 -19.6546 -178.0543 -19.812 -4.145 -4.5318 -0.8101 -116.8649 -7.8269 -53.3445 -21.4812 -13.5066 -5.3881 -15.1061 -27.6061 -18.9239 -68.8104 -26.1223 -93.0231 -15.1693 -9.7999 -7.6137 -1.5301 -84.9531 -5.4551 -93.187 -93.4153 -43.8334 -23.6706 -59.1468 -22.0933 -47.8381 -219.9936 -39.5596 -47.2643 -34.0752 -20.2532 -11.239 -118.4152 -6.4126 -5.1846 -8.7272 -9.4584 -20.8522 -22.0878 -113.0806 -29.0912 -80.397 -29.6206 -13.7422 -8.9416 -3.0785 -79.842 -6.1869 -13.9663 -63.3665 -93.2067 -11.5593 -13.0449 -48.2558 -2.9343 -8.25 -76.4361 -33.5374 -109.112 -4.1731 -6.1978 -1.2664 -84.1287 -18.3054 -7.2209 -45.5509 -3.3567 -16.8612 -60.5094 -43.9956 -53.0592 -6.1407 -17.4499 -2.3741 -65.023 -102.1593 -103.4312 -23.1318 -17.3394 -50.6654 -31.4407 -57.6065 -19.6857 -5.2667 -4.1767 -55.8445 -30.92 -57.2396 -30.1101 -7.611 -47.7711 -12.1616 -19.1572 -53.5364 -3.8024 -53.124 -225.6075 -12.6791 -11.5852 -16.6614 -9.7186 -65.824 -16.3897 -42.3931 -50.513 -24.752 -14.513 -40.495 -16.5124 -57.1813 -4.7974 -5.2949 -81.7477 -3.272 -6.3448 -1.1259 -114.3256 -22.3232 -339.8619 -31.0491 -31.3872 -4.9625 -82.4936 -123.6225 -72.8463 -23.4836 -33.1608 -11.7133 -19.7607 -1.8599 -50.1148 -8.2868 -143.3592 -1.8508 -1.9699 -9.4175 -0.5202 -25.0654 -30.0489 -5.6248
## Training error : 0.129733
# predictions on testing dataset
hand_letter_predictions <- predict(hand_letter_classifier, hand_letters_test)
head(hand_letter_predictions)
## [1] C U K U E I
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
table(hand_letter_predictions, hand_letters_test$letter)
##
## hand_letter_predictions A B C D E F G H I J K L
## A 191 0 1 0 0 0 0 0 0 1 0 0
## B 0 157 0 9 2 0 1 3 0 0 1 0
## C 0 0 142 0 5 0 14 3 2 0 2 4
## D 1 1 0 196 0 1 4 12 5 3 4 4
## E 0 0 8 0 164 2 1 1 0 0 3 5
## F 0 0 0 0 0 171 4 2 8 2 0 0
## G 1 1 4 1 10 3 150 2 0 0 1 2
## H 0 3 0 1 0 2 2 122 0 2 4 2
## I 0 0 0 0 0 0 0 0 175 10 0 0
## J 2 2 0 0 0 3 0 2 7 158 0 0
## K 2 1 11 0 0 0 4 6 0 0 148 0
## L 0 0 0 0 1 0 1 1 0 0 0 176
## M 0 0 1 1 0 0 1 2 0 0 0 0
## N 0 0 0 1 0 1 0 1 0 0 0 0
## O 0 0 1 2 0 0 2 1 0 2 0 0
## P 0 0 0 1 0 3 1 0 0 0 0 0
## Q 0 0 0 0 0 0 9 3 0 0 0 3
## R 2 5 0 1 1 0 2 9 0 0 11 0
## S 1 2 0 0 1 1 5 0 2 2 0 3
## T 0 0 0 0 3 6 0 1 0 0 1 0
## U 1 0 3 3 0 0 0 2 0 0 0 0
## V 0 0 0 0 0 1 6 3 0 0 0 0
## W 0 0 0 0 0 0 1 0 0 0 0 0
## X 0 1 0 0 2 0 0 1 3 0 2 6
## Y 3 0 0 0 0 0 0 1 0 0 0 0
## Z 2 0 0 0 2 0 0 0 3 3 0 0
##
## hand_letter_predictions M N O P Q R S T U V W X
## A 1 2 2 0 5 0 2 1 1 0 1 0
## B 3 0 0 2 4 8 5 0 0 3 0 1
## C 0 0 2 0 0 0 0 0 0 0 0 0
## D 0 6 5 3 1 4 0 0 0 0 0 5
## E 0 0 0 0 6 0 10 0 0 0 0 4
## F 0 0 0 18 0 0 5 2 0 0 0 1
## G 1 0 0 2 11 2 5 3 0 0 0 1
## H 2 5 23 0 2 6 0 4 1 4 0 0
## I 0 0 0 1 0 0 3 0 0 0 0 4
## J 0 0 1 1 4 0 1 0 0 0 0 2
## K 0 2 0 1 1 7 0 1 3 0 0 4
## L 0 0 0 0 1 0 4 0 0 0 0 1
## M 177 5 1 0 0 0 0 0 4 0 8 0
## N 0 172 0 0 0 3 0 0 1 0 2 0
## O 0 1 132 2 4 0 0 0 3 0 0 0
## P 0 0 3 168 1 0 0 1 0 0 0 0
## Q 0 0 5 1 163 0 5 0 0 0 0 0
## R 1 1 1 1 0 176 0 1 0 2 0 0
## S 0 0 0 0 11 0 135 2 0 0 0 2
## T 0 0 0 0 0 0 3 163 1 0 0 0
## U 0 1 0 1 0 0 0 0 197 0 1 1
## V 0 3 1 0 2 1 0 0 0 152 1 0
## W 2 0 4 0 0 0 0 0 4 7 154 0
## X 0 0 1 0 0 1 2 0 0 0 0 160
## Y 0 0 0 6 0 0 0 3 0 0 0 0
## Z 0 0 0 0 1 0 18 3 0 0 0 0
##
## hand_letter_predictions Y Z
## A 0 0
## B 0 0
## C 0 0
## D 3 1
## E 0 3
## F 3 0
## G 0 0
## H 3 0
## I 1 1
## J 0 11
## K 0 0
## L 0 1
## M 0 0
## N 0 0
## O 0 0
## P 1 0
## Q 3 0
## R 0 0
## S 0 10
## T 5 2
## U 1 0
## V 5 0
## W 0 0
## X 1 1
## Y 157 0
## Z 0 164
# look only at agreement vs. non-agreement
# construct a vector of TRUE/FALSE indicating correct/incorrect predictions
agreement <- hand_letter_predictions == hand_letters_test$letter # check if characters agree
table(agreement)
## agreement
## FALSE TRUE
## 780 4220
prop.table(table(agreement))
## agreement
## FALSE TRUE
## 0.156 0.844
Replacing the vanilladot
linear kernel with the rbfdot
Radial Basis Function ("Gaussian") kernel may improve the OCR prediction.
hand_letter_classifier_rbf <- ksvm(letter ~ ., data = hand_letters_train, kernel = "rbfdot")
hand_letter_predictions_rbf <- predict(hand_letter_classifier_rbf, hand_letters_test)
agreement_rbf <- hand_letter_predictions_rbf == hand_letters_test$letter
table(agreement_rbf)
## agreement_rbf
## FALSE TRUE
## 360 4640
prop.table(table(agreement_rbf))
## agreement_rbf
## FALSE TRUE
## 0.072 0.928
Note the improvement of automated (SVM) classification accuracy (\(0.928\)) for rbfdot
compared to the previous (vanilladot
) result (\(0.844\)).
Let's take another look at the iris data that we saw in Chapter 2.
SVMs require all features to be numeric, and each feature has to be scaled into a relatively small interval. We are using Edgar Anderson's Iris Data in R for this case study. This dataset measures the lengths and widths of sepals and petals from three Iris flower species.
Let’s load the data first. In this case study we want to explore the variable Species
.
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
table(iris$Species)
##
## setosa versicolor virginica
## 50 50 50
The data look good. However, recall that we need fairly normalized data. We could normalize the data by hand, but luckily the R package we are going to use will normalize the dataset automatically.
Now we can separate the training and test datasets using the 75%-25% rule.
sub<-sample(nrow(iris), floor(nrow(iris)*0.75))
iris_train<-iris[sub, ]
iris_test<-iris[-sub, ]
Let’s first try a toy (iris data) example.
require(e1071)
## Loading required package: e1071
iris.svm_1 <- svm(Species~Petal.Length+Petal.Width, data=iris_train,
kernel="linear", cost=1)
iris.svm_2 <- svm(Species~Petal.Length+Petal.Width, data=iris_train,
kernel="radial", cost=1)
par(mfrow=c(2,1))
plot(iris.svm_1, iris[,c(5,3,4)]); legend("center", "Linear")
plot(iris.svm_2, iris[,c(5,3,4)]); legend("center", "Radial")
We are going to use kernlab
for this case study. However, other packages like e1071
and klaR
are also available if you are familiar with C++.
Let's break down the function call
m<-ksvm(target~predictors, data=mydata, kernel="rbfdot", C=1)
, where target is the outcome variable, predictors are the features, kernel specifies the kernel function (the default is the Gaussian rbfdot
), and C is the cost of constraint violation. Let's install the package and play with the data now.
# install.packages("kernlab")
library(kernlab)
iris_clas<-ksvm(Species~., data=iris_train, kernel="vanilladot")
## Setting default kernel parameters
iris_clas
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 24
##
## Objective Function Value : -1.0066 -0.3309 -13.8658
## Training error : 0.026786
Here, we used all the variables other than Species
in the dataset as predictors. We also used the vanilladot
kernel, i.e., the linear kernel, in this model. We get a training error of less than 0.03.
Our old friend, the predict()
function, is used again to make predictions. Here we have a factor outcome, so we need the table()
command to show us how well the predictions match the actual data.
iris.pred<-predict(iris_clas, iris_test)
table(iris.pred, iris_test$Species)
##
## iris.pred setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 0
## virginica 0 1 10
We can see that only one or two cases may be misclassified (here, one Iris versicolor flower is labeled as virginica). The species of the vast majority of the flowers are correctly identified.
To see the results more clearly, we can use the proportional table to show the agreements of the categories.
agreement<-iris.pred==iris_test$Species
prop.table(table(agreement))
## agreement
## FALSE TRUE
## 0.02631578947 0.97368421053
Here ==
means "equal to". About 97% of the predictions are correct. Nevertheless, is there any chance that we can improve the outcome? What if we try a Gaussian kernel?
The linear kernel is the simplest one, but usually not the best one. Let's try the RBF (Radial Basis "Gaussian" Function) kernel instead.
iris_clas1<-ksvm(Species~., data=iris_train, kernel="rbfdot")
iris_clas1
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.877982617394805
##
## Number of Support Vectors : 52
##
## Objective Function Value : -4.6939 -5.1534 -16.2297
## Training error : 0.017857
iris.pred1<-predict(iris_clas1, iris_test)
table(iris.pred1, iris_test$Species)
##
## iris.pred1 setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 2
## virginica 0 1 8
agreement<-iris.pred1==iris_test$Species
prop.table(table(agreement))
## agreement
## FALSE TRUE
## 0.07894736842 0.92105263158
Unfortunately, the model performance is actually somewhat worse than the previous one (you might get slightly different results). This is because the Iris classes are nearly linearly separable, so the linear kernel already fits well. In practice, we can try alternative kernel functions and see which one fits the dataset best.
We can tune the SVM using the tune.svm
function in the package e1071
.
costs = exp(-5:8)
tune.svm(Species~., kernel = "radial", data = iris_train, cost = costs)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 1
##
## - best performance: 0.03636363636
Further, we can draw a cross-validation (CV) plot to gauge the model performance:
set.seed(2017)
require(sparsediscrim); require (reshape); require(ggplot2)
## Loading required package: sparsediscrim
## Loading required package: reshape
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:kernlab':
##
## alpha
folds = cv_partition(iris$Species, num_folds = 5)
train_cv_error_svm = function(costC) {
#Train
ir.svm = svm(Species~., data=iris,
kernel="radial", cost=costC)
train_error = sum(ir.svm$fitted != iris$Species) / nrow(iris)
#Test
test_error = sum(predict(ir.svm, iris_test) != iris_test$Species) / nrow(iris_test)
#CV error
ire.cverr = sapply(folds, function(fold) {
svmcv = svm(Species~.,data = iris, kernel="radial", cost=costC, subset = fold$training)
svmpred = predict(svmcv, iris[fold$test,])
return(sum(svmpred != iris$Species[fold$test]) / length(fold$test))
})
cv_error = mean(ire.cverr)
return(c(train_error, cv_error, test_error))
}
costs = exp(-5:8)
ir_cost_errors = sapply(costs, function(cost) train_cv_error_svm(cost))
df_errs = data.frame(t(ir_cost_errors), costs)
colnames(df_errs) = c('Train', 'CV', 'Test', 'Logcost')
dataL <- melt(df_errs, id="Logcost")
ggplot(dataL, aes_string(x="Logcost", y="value", colour="variable",
group="variable", linetype="variable", shape="variable")) +
geom_line(size=1) + labs(x = "Cost",
y = "Classification error",
colour="",group="",
linetype="",shape="") + scale_x_log10()
Now, let’s attempt to improve the performance of a Gaussian kernel by tuning:
set.seed(2020)
gammas = exp(-5:5)
tune_g = tune.svm(Species~., kernel = "radial", data = iris_train, cost = costs, gamma = gammas)
tune_g
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.01831563889 20.08553692
##
## - best performance: 0.025
Using these tuned parameters, the model achieves better predictions.
iris.svm_g <- svm(Species~., data=iris_train,
kernel="radial", gamma=0.0183, cost=20)
table(iris_test$Species, predict(iris.svm_g, iris_test))
##
## setosa versicolor virginica
## setosa 13 0 0
## versicolor 0 14 1
## virginica 0 0 10
agreement<-predict(iris.svm_g, iris_test)==iris_test$Species
prop.table(table(agreement))
## agreement
## FALSE TRUE
## 0.02631578947 0.97368421053
Chapter 22 provides more details about neural networks and deep learning.
Use the Google Trends data. Fit a neural network model with the same training data as in Case Study 1. This time, use Investing
as the target and Unemployment, Rental, RealEstate, Mortgage, Jobs, DJI_Index, StdDJI
as predictors. Use 3 hidden nodes.
Note: remember to change the columns you include in the test dataset when predicting.
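One possible setup is sketched below (the object names are arbitrary, and your correlation will differ slightly because the initial weights are random).

# sketch: Investing as the target, 3 hidden nodes
invest_model <- neuralnet(Investing~Unemployment+Rental+RealEstate+Mortgage+Jobs+DJI_Index+StdDJI,
                          data=google_train, hidden=3)
# Investing is column 6 of google_norm, so the predictor columns are 1:5, 7, 8
invest_pred <- compute(invest_model, google_test[, c(1:5, 7:8)])$net.result
cor(invest_pred, google_test$Investing)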
The following number is the correlation between predicted and observed values.
## [,1]
## [1,] 0.899149293
You might get slightly different results since the weights are generated randomly.
Use the same data we showed in Chapter 8, Quality of life and chronic disease and the corresponding meta-data doc.
Let’s load the data first. In this case study, we want to use the variable CHARLSONSCORE
as our target variable.
qol<-read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
str(qol)
## 'data.frame': 2356 obs. of 41 variables:
## $ ID : int 171 171 172 179 180 180 181 182 183 186 ...
## $ INTERVIEWDATE : int 0 427 0 0 0 42 0 0 0 0 ...
## $ LANGUAGE : int 1 1 1 1 1 1 1 1 1 2 ...
## $ AGE : int 49 49 62 44 64 64 52 48 49 78 ...
## $ RACE_ETHNICITY : int 3 3 3 7 3 3 3 3 3 4 ...
## $ SEX : int 2 2 2 2 1 1 2 1 1 1 ...
## $ QOL_Q_01 : int 4 4 3 6 3 3 4 2 3 5 ...
## $ QOL_Q_02 : int 4 3 3 6 2 5 4 1 4 6 ...
## $ QOL_Q_03 : int 4 4 4 6 3 6 4 3 4 4 ...
## $ QOL_Q_04 : int 4 4 2 6 3 6 2 2 5 2 ...
## $ QOL_Q_05 : int 1 5 4 6 2 6 4 3 4 3 ...
## $ QOL_Q_06 : int 4 4 2 6 1 2 4 1 2 4 ...
## $ QOL_Q_07 : int 1 2 5 -1 0 5 8 4 3 7 ...
## $ QOL_Q_08 : int 6 1 3 6 6 6 3 1 2 4 ...
## $ QOL_Q_09 : int 3 4 3 6 2 2 4 2 2 4 ...
## $ QOL_Q_10 : int 3 1 3 6 3 6 3 2 4 3 ...
## $ MSA_Q_01 : int 1 3 2 6 2 3 4 1 1 2 ...
## $ MSA_Q_02 : int 1 1 2 6 1 6 4 3 2 4 ...
## $ MSA_Q_03 : int 2 1 2 6 1 2 3 3 1 2 ...
## $ MSA_Q_04 : int 1 3 2 6 1 2 1 4 1 5 ...
## $ MSA_Q_05 : int 1 1 1 6 1 2 1 6 3 2 ...
## $ MSA_Q_06 : int 1 2 2 6 1 2 1 1 2 2 ...
## $ MSA_Q_07 : int 2 1 3 6 1 1 1 1 1 5 ...
## $ MSA_Q_08 : int 1 1 1 6 1 1 1 1 2 1 ...
## $ MSA_Q_09 : int 1 1 1 6 2 2 4 6 2 1 ...
## $ MSA_Q_10 : int 1 1 1 6 1 1 1 1 1 3 ...
## $ MSA_Q_11 : int 2 3 2 6 1 1 2 1 1 5 ...
## $ MSA_Q_12 : int 1 1 2 6 1 1 2 6 1 3 ...
## $ MSA_Q_13 : int 1 1 1 6 1 6 2 1 4 2 ...
## $ MSA_Q_14 : int 1 1 1 6 1 2 1 1 3 1 ...
## $ MSA_Q_15 : int 2 1 1 6 1 1 3 2 1 3 ...
## $ MSA_Q_16 : int 2 3 5 6 1 2 1 2 1 2 ...
## $ MSA_Q_17 : int 2 1 1 6 1 1 1 1 1 3 ...
## $ PH2_Q_01 : int 3 2 1 5 1 1 3 1 2 3 ...
## $ PH2_Q_02 : int 4 4 1 5 1 2 1 1 4 2 ...
## $ TOS_Q_01 : int 2 2 2 4 1 1 2 2 1 1 ...
## $ TOS_Q_02 : int 1 1 1 4 4 4 1 2 4 4 ...
## $ TOS_Q_03 : int 4 4 4 4 4 4 4 4 4 4 ...
## $ TOS_Q_04 : int 5 5 5 5 5 5 5 5 5 5 ...
## $ CHARLSONSCORE : int 2 2 3 1 0 0 2 8 0 1 ...
## $ CHRONICDISEASESCORE: num 1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...
Delete the first two columns (we don't need the ID variables) and the rows that have missing values in CHARLSONSCORE
(coded as "-9"). The condition !qol$CHARLSONSCORE==-9
keeps all the rows whose CHARLSONSCORE is not equal to -9; the exclamation mark indicates negation ("exclude"). Also, we need to convert our categorical variable CHARLSONSCORE
into a factor.
qol<-qol[!qol$CHARLSONSCORE==-9 , -c(1, 2)]
qol$CHARLSONSCORE<-as.factor(qol$CHARLSONSCORE)
str(qol)
## 'data.frame': 2328 obs. of 39 variables:
## $ LANGUAGE : int 1 1 1 1 1 1 1 1 1 2 ...
## $ AGE : int 49 49 62 44 64 64 52 48 49 78 ...
## $ RACE_ETHNICITY : int 3 3 3 7 3 3 3 3 3 4 ...
## $ SEX : int 2 2 2 2 1 1 2 1 1 1 ...
## $ QOL_Q_01 : int 4 4 3 6 3 3 4 2 3 5 ...
## $ QOL_Q_02 : int 4 3 3 6 2 5 4 1 4 6 ...
## $ QOL_Q_03 : int 4 4 4 6 3 6 4 3 4 4 ...
## $ QOL_Q_04 : int 4 4 2 6 3 6 2 2 5 2 ...
## $ QOL_Q_05 : int 1 5 4 6 2 6 4 3 4 3 ...
## $ QOL_Q_06 : int 4 4 2 6 1 2 4 1 2 4 ...
## $ QOL_Q_07 : int 1 2 5 -1 0 5 8 4 3 7 ...
## $ QOL_Q_08 : int 6 1 3 6 6 6 3 1 2 4 ...
## $ QOL_Q_09 : int 3 4 3 6 2 2 4 2 2 4 ...
## $ QOL_Q_10 : int 3 1 3 6 3 6 3 2 4 3 ...
## $ MSA_Q_01 : int 1 3 2 6 2 3 4 1 1 2 ...
## $ MSA_Q_02 : int 1 1 2 6 1 6 4 3 2 4 ...
## $ MSA_Q_03 : int 2 1 2 6 1 2 3 3 1 2 ...
## $ MSA_Q_04 : int 1 3 2 6 1 2 1 4 1 5 ...
## $ MSA_Q_05 : int 1 1 1 6 1 2 1 6 3 2 ...
## $ MSA_Q_06 : int 1 2 2 6 1 2 1 1 2 2 ...
## $ MSA_Q_07 : int 2 1 3 6 1 1 1 1 1 5 ...
## $ MSA_Q_08 : int 1 1 1 6 1 1 1 1 2 1 ...
## $ MSA_Q_09 : int 1 1 1 6 2 2 4 6 2 1 ...
## $ MSA_Q_10 : int 1 1 1 6 1 1 1 1 1 3 ...
## $ MSA_Q_11 : int 2 3 2 6 1 1 2 1 1 5 ...
## $ MSA_Q_12 : int 1 1 2 6 1 1 2 6 1 3 ...
## $ MSA_Q_13 : int 1 1 1 6 1 6 2 1 4 2 ...
## $ MSA_Q_14 : int 1 1 1 6 1 2 1 1 3 1 ...
## $ MSA_Q_15 : int 2 1 1 6 1 1 3 2 1 3 ...
## $ MSA_Q_16 : int 2 3 5 6 1 2 1 2 1 2 ...
## $ MSA_Q_17 : int 2 1 1 6 1 1 1 1 1 3 ...
## $ PH2_Q_01 : int 3 2 1 5 1 1 3 1 2 3 ...
## $ PH2_Q_02 : int 4 4 1 5 1 2 1 1 4 2 ...
## $ TOS_Q_01 : int 2 2 2 4 1 1 2 2 1 1 ...
## $ TOS_Q_02 : int 1 1 1 4 4 4 1 2 4 4 ...
## $ TOS_Q_03 : int 4 4 4 4 4 4 4 4 4 4 ...
## $ TOS_Q_04 : int 5 5 5 5 5 5 5 5 5 5 ...
## $ CHARLSONSCORE : Factor w/ 11 levels "0","1","2","3",..: 3 3 4 2 1 1 3 9 1 2 ...
## $ CHRONICDISEASESCORE: num 1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...
Now the dataset is ready. First, separate the dataset into training and test sets using the 75%-25% rule. Then, build an SVM model using all the other variables in the dataset as predictors. Try different costs of misclassification: rather than the default C=1
, use C=2
and C=3
and see how the model behaves. Here we utilize the radial basis kernel.
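A sketch of one possible setup is shown below (the object names and the seed are arbitrary; the two outputs that follow correspond to C=2 and C=3).

# sketch: 75%-25% split and a radial-basis SVM with cost C = 2 (use C = 3 analogously)
set.seed(2017)
sub_qol <- sample(nrow(qol), floor(nrow(qol)*0.75))
qol_train <- qol[sub_qol, ]
qol_test <- qol[-sub_qol, ]
qol_clas2 <- ksvm(CHARLSONSCORE~., data=qol_train, kernel="rbfdot", C=2)
qol_clas2
qol.pred2 <- predict(qol_clas2, qol_test)
table(qol.pred2, qol_test$CHARLSONSCORE)
agreement <- qol.pred2==qol_test$CHARLSONSCORE
prop.table(table(agreement))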
Output for C=2
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 2
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0174510649312293
##
## Number of Support Vectors : 1703
##
## Objective Function Value : -1798.9778 -666.9432 -352.2265 -46.2968 -15.9236 -9.2176 -7.1853 -27.9366 -16.3096 -3.5681 -697.4275 -362.6579 -47.0801 -16.3701 -9.6556 -6.9882 -28.2074 -16.4556 -3.5121 -321.0676 -44.7405 -15.8416 -9.1439 -6.8161 -26.7174 -15.4833 -3.3944 -43.1026 -15.2923 -7.994 -6.58 -24.8459 -14.6379 -3.4484 -13.9377 -5.2876 -5.6728 -15.2542 -9.8408 -3.255 -4.6982 -4.8924 -9.2482 -6.5144 -2.9608 -2.7409 -6.2056 -6.0476 -2.0833 -6.1775 -4.919 -2.7715 -10.5691 -3.0835 -2.566
## Training error : 0.310997
##
## qol.pred2 0 1 2 3 4 5 6 7 8 9 10
## 0 126 76 24 5 1 0 1 0 3 0 1
## 1 88 170 47 19 9 2 1 0 4 3 0
## 2 1 0 0 1 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0 0
## agreement
## FALSE TRUE
## 0.4914089347 0.5085910653
Output for C=3
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 3
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.0168577510531693
##
## Number of Support Vectors : 1695
##
## Objective Function Value : -2440.0638 -915.9967 -492.6748 -63.2895 -21.0929 -11.9108 -10.2404 -39.1843 -21.976 -5.0624 -970.6173 -514.9584 -64.7791 -22.0947 -12.8987 -9.8114 -39.7908 -22.2957 -4.9403 -431.5178 -59.9296 -20.9408 -11.7468 -9.4269 -36.602 -20.1783 -4.6829 -56.9469 -19.7357 -9.238 -8.9047 -32.6121 -18.4667 -4.8007 -17.3102 -5.4133 -6.9733 -17.2097 -10.3016 -4.3739 -4.7816 -5.7083 -9.7236 -6.6365 -3.723 -2.7726 -6.4151 -6.4453 -2.1222 -8.03 -5.411 -3.3088 -11.9186 -3.996 -2.8572
## Training error : 0.266896
##
## qol.pred3 0 1 2 3 4 5 6 7 8 9 10
## 0 131 79 24 6 2 0 1 0 4 1 1
## 1 83 165 47 18 8 2 1 0 3 2 0
## 2 1 2 0 1 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0 0
## agreement
## FALSE TRUE
## 0.4914089347 0.5085910653
Can you reproduce or improve these results?
Below is some additional R
code demonstrating various results reported in this Chapter.
#Picture 1
x<-runif(1000, -10, 10)
y<-ifelse(x>=0, 1, 0)
plot(x, y, xlab = "Sum of input signals", ylab = "Output signal", main = "Threshold")
abline(v=0, lty=2)
#Picture 2
x<-runif(100000, -10, 10)
y<-1/(1+exp(-x))
plot(x, y, xlab = "Sum of input signals", ylab = "Output signal", main = "Sigmoid")
#Picture 3
x<-runif(100000, -10, 10)
y1<-x
y2<-ifelse(x<=-5, -5, ifelse(x>=5, 5, x))
y3<-(exp(x)-exp(-x))/(exp(x)+exp(-x))
y4<-exp(-x^2/2)
par(mfrow=c(2, 2))
plot(x, y1, main="Linear", xlab="", ylab="")
plot(x, y2, main="Saturated Linear", xlab="", ylab="")
plot(x, y3, main="Hyperbolic tangent", xlab="", ylab="")
plot(x, y4, main = "Gaussian", xlab="", ylab="")
#Picture 4
A<-c(1, 4, 3, 2, 4, 8, 6, 10, 9)
B<-c(1, 5, 3, 2, 3, 8, 8, 7, 10)
plot(A, B, xlab="", ylab="", pch=16, cex=2)
abline(v=5, col="red", lty=2)
text(5.4, 9, labels="A")
abline(12, -1, col="red", lty=2)
text(6, 5.4, labels="B")
#Picture 5
plot(A, B, xlab="", ylab="", pch=16, cex=2)
segments(1, 1, 4, 5, lwd=1, col = "red")
segments(1, 1, 4, 3, lwd = 1, col = "red")
segments(4, 3, 4, 5, lwd = 1, col = "red")
segments(6, 8, 10, 7, lwd = 1, col = "red")
segments(6, 8, 9, 10, lwd = 1, col = "red")
segments(10, 7, 9, 10, lwd = 1, col = "red")
segments(6, 8, 4, 5, lwd = 1, lty=2)
abline(9.833, -2/3, lwd=2)
Try to replicate these results with other data from the list of our Case-Studies.