In previous chapters (6, 7, and 8), we covered classification methods that use mathematical formalism to address everyday prediction problems. In this chapter, we focus on model-based statistical methods that provide forecasting and classification functionality. Specifically, we will (1) demonstrate the predictive power of multiple linear regression, (2) present the foundations of regression trees and model trees, and (3) examine two complementary case studies (Baseball Players and Heart Attack).
It may be helpful to first review Chapter 4 (Linear Algebra/Matrix Manipulations) and Chapter 6 (Intro to Machine Learning).
Regression is a measurement of the relationship between a dependent variable (the value to be predicted) and a group of independent variables (predictors, similar to the features discussed in Chapter 6). In linear regression, we assume that the relationship between the dependent variable and the independent variables follows a straight line.
First recall the material in Chapter 4: Linear Algebra & Matrix Computing.
The simplest case of regression involves only one predictor: \[y=a+bx.\]
Does this formula appear familiar? In this slope-intercept form, \(a\) is the intercept and \(b\) is the slope. That is the expression for simple linear regression. If we know \(a\) and \(b\), then for any given \(x\) we can calculate \(y\) via the formula above. If we plot \(x\) and \(y\) in a coordinate system, we get a straight line.
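For instance, here is a quick R illustration with an arbitrarily chosen intercept and slope (the values a=1 and b=2 are only for demonstration):
a <- 1; b <- 2                               # arbitrary intercept and slope, for illustration only
x <- seq(0, 10, by = 0.5)
y <- a + b*x
plot(x, y, type = "l", main = "y = a + b*x")   # a perfectly straight line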
However, this is the ideal case. When we plot real world data, the pattern is harder to recognize. Let's look at the scatter plot (recall Chapter 2) and the simple linear regression line for two variables: hospital charges, CHARGES (the dependent variable), and length of stay in the hospital, LOS (the predictor). The data can be found in our class file CaseStudy12_AdultsHeartAttack_Data. We removed two observations with missing data using the command heart_attack<-heart_attack[complete.cases(heart_attack), ].
heart_attack<-read.csv("https://umich.instructure.com/files/1644953/download?download_frd=1", stringsAsFactors = F)
heart_attack$CHARGES<-as.numeric(heart_attack$CHARGES)
## Warning: NAs introduced by coercion
heart_attack<-heart_attack[complete.cases(heart_attack), ]
fit1<-lm(CHARGES~LOS, data=heart_attack)   # simple linear regression of CHARGES on LOS
par(cex=.8)
plot(heart_attack$LOS, heart_attack$CHARGES, xlab="LOS", ylab = "CHARGES")
abline(fit1, lwd=2)                        # add the fitted regression line to the scatter plot
It seems common sense that the longer you stay in the hospital, the higher the medical costs will be. However, the scatter plot shows only a cloud of points with, at best, a weak increasing trend.
The estimated expression for this regression line is: \[\hat{y}=4582.70+212.29\times x,\] or equivalently \[CHARGES=4582.70+212.29\times LOS.\] It is simple to make predictions with this regression line. Assume we have a patient who spent 10 days in the hospital, so LOS=10. The predicted charge is \(\$4582.70+\$212.29\times 10=\$6705.60\). Plugging \(x\) into the estimated equation automatically gives us a predicted value of the outcome \(y\). This chapter of the Probability and Statistics EBook provides an introduction to linear modeling.
How did we obtain the estimated expression? The most common estimation method in statistics is ordinary least squares (OLS). OLS estimators are obtained by minimizing the sum of the squared errors, that is, the sum of the squared vertical distances from each point on the scatter plot to the regression line.
OLS minimizes the following objective: \[\sum_{i=1}^{n}(y_i-\hat{y}_i)^2=\sum_{i=1}^{n}(y_i-(a+b\times x_i))^2=\sum_{i=1}^{n}e_i^2.\] After some calculus, the value of \(b\) that minimizes the squared error is: \[b=\frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2},\] while the optimal \(a\) is: \[a=\bar y-b\bar x.\]
These expressions might seem confusing, but they build on familiar quantities. As we learned in Chapter 2, the variance is obtained by averaging the squared deviations from the mean (\(var(x)=\frac{1}{n}\sum^{n}_{i=1} (x_i-\mu)^2\)). When we use \(\bar{x}\) to estimate the mean of \(X\), the sample variance is \(var(x)=\frac{1}{n-1}\sum^{n}_{i=1} (x_i-\bar{x})^2\), which is \(\frac{1}{n-1}\) times the denominator of \(b\). Similarly, the covariance of \(x\) and \(y\) measures the average product of the deviations of \(x\) and \(y\): \[Cov(x, y)=\frac{1}{n}\sum^{n}_{i=1} (x_i-\mu_x)(y_i-\mu_y).\] If we use the sample averages (\(\bar{x}\), \(\bar{y}\)), we have:
\[Cov(x, y)=\frac{1}{n-1}\sum^{n}_{i=1} (x_i-\bar{x})(y_i-\bar{y}),\]
which is \(\frac{1}{n-1}\) times the numerator of \(b\).
Combining the above, we get an estimate of the slope coefficient (effect-size of LOS on Charge): \[b=\frac{Cov(x, y)}{var(x)}.\] Let’s examine these using the heart attack data.
b<-cov(heart_attack$LOS, heart_attack$CHARGES)/var(heart_attack$LOS)
b
## [1] 212.2869
a<-mean(heart_attack$CHARGES)-b*mean(heart_attack$LOS)
a
## [1] 4582.7
We can see that these match exactly the coefficients in the previously stated regression equation.
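We can also confirm these manual estimates against the coefficients reported by lm() for the fit1 model fitted above:
coef(fit1)    # intercept and LOS slope from the fitted lm object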
Regression modeling has five key assumptions: a linear relationship between the outcome and the predictors, independence of the errors (no autocorrelation), homoscedasticity (constant error variance), normality of the residuals, and little or no multicollinearity among the predictors.
Note: The SOCR Interactive Scatterplot Game (requires Java enabled browser) provides a dynamic interface demonstrating linear models, trends, correlations, slopes and residuals.
Based on the covariance we can calculate the correlation, which indicates how closely the relationship between two variables follows a straight line. \[\rho_{x, y}=Corr(x, y)=\frac{Cov(x, y)}{\sigma_x\sigma_y}=\frac{Cov(x, y)}{\sqrt{Var(x)Var(y)}}.\] In R, the correlation is given by cor(), while the standard deviation (square root of the variance) is given by sd().
r<-cov(heart_attack$LOS, heart_attack$CHARGES)/(sd(heart_attack$LOS)*sd(heart_attack$CHARGES))
r
## [1] 0.2449743
cor(heart_attack$LOS, heart_attack$CHARGES)
## [1] 0.2449743
Both approaches produce the same output. This correlation is a positive but relatively small number, so we can say there is a weak positive linear association between these two variables; a negative number would indicate a negative linear association. We have a weak association when \(0.1 \leq Cor < 0.3\), a moderate association for \(0.3 \leq Cor < 0.5\), and a strong association for \(0.5 \leq Cor \leq 1.0\). A correlation below \(0.1\) suggests little to no linear relationship between the variables.
In practice, we usually encounter situations with multiple predictors and one dependent variable, which may follow a multiple linear model. That is: \[y=\alpha+\beta_1x_1+\beta_2x_2+...+\beta_kx_k+\epsilon,\] or equivalently \[y=\beta_0+\beta_1x_1+\beta_2x_2+ ... +\beta_kx_k+\epsilon .\] The second notation is the one usually used in statistics. This equation expresses the linear relationship between \(k\) predictors and the dependent variable. In total, we have \(k+1\) coefficients to estimate.
The matrix notation for the above equation is: \[Y=X\beta+\epsilon,\] where \[Y=\left(\begin{array}{c} y_1 \\ y_2\\ \vdots\\ y_n \end{array}\right),\]
\[X=\left(\begin{array}{ccccc} 1 & x_{11}&x_{21}&\cdots&x_{k1} \\ 1 & x_{12}&x_{22}&\cdots&x_{k2} \\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1 & x_{1n}&x_{2n}&\cdots&x_{kn} \end{array}\right), \ \ \ \beta=\left(\begin{array}{c} \beta_0 \\ \beta_1\\ \vdots\\ \beta_k \end{array}\right),\]
and
\[\epsilon=\left(\begin{array}{c} \epsilon_1 \\ \epsilon_2\\ \vdots\\ \epsilon_n \end{array}\right) \] is the error term.
Similar to simple linear regression, our goal is to minimize the sum of squared errors. Solving for \(\beta\) gives: \[\hat{\beta}=(X^TX)^{-1}X^TY .\] The solution is in matrix form, where \((X^TX)^{-1}\) is the inverse of the matrix \(X^TX\) and \(X^T\) is the transpose of \(X\).
Let’s make a function of our own using this matrix formula.
reg<-function(y, x){
  x<-as.matrix(x)
  x<-cbind(Intercept=1, x)        # prepend a column of 1's for the intercept
  solve(t(x)%*%x)%*%t(x)%*%y      # (X'X)^{-1} X'Y
}
Here solve() performs matrix inversion and %*% denotes matrix multiplication.
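As a quick illustration of these two operators on a small, arbitrary 2x2 matrix:
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # a small invertible matrix
solve(A)                               # its matrix inverse
A %*% solve(A)                         # multiplying them back recovers the identity matrix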
Next, we will apply this function to our heart attack dataset. To begin with, let’s check if the simple linear regression output is the same as we calculated earlier.
reg(y=heart_attack$CHARGES, x=heart_attack$LOS)
## [,1]
## Intercept 4582.6997
## 212.2869
It works! Now we can include more variables as predictors. As an example, let's add AGE to the model.
str(heart_attack)
## 'data.frame': 148 obs. of 8 variables:
## $ Patient : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DIAGNOSIS: int 41041 41041 41091 41081 41091 41091 41091 41091 41041 41041 ...
## $ SEX : chr "F" "F" "F" "F" ...
## $ DRG : int 122 122 122 122 122 121 121 121 121 123 ...
## $ DIED : int 0 0 0 0 0 0 0 0 0 1 ...
## $ CHARGES : num 4752 3941 3657 1481 1681 ...
## $ LOS : int 10 6 5 2 1 9 15 15 2 1 ...
## $ AGE : int 79 34 76 80 55 84 84 70 76 65 ...
reg(y=heart_attack$CHARGES, x=heart_attack[, c(7, 8)])
## [,1]
## Intercept 7280.55493
## LOS 259.67361
## AGE -43.67677
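We can cross-check these estimates against R's built-in lm() function for the same two-predictor model:
# should reproduce the Intercept, LOS and AGE estimates obtained above
lm(CHARGES ~ LOS + AGE, data=heart_attack)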
We utilize the mlb data “01a_data.txt”. The dataset contains 1034 records of heights and weights for some current and recent Major League Baseball (MLB) Players. These data were obtained from different resources (e.g., IBM Many Eyes).
Variables: Name, Team, Position, Height (in inches), Weight (in pounds), and Age (in years).
Let’s load this dataset first. We use as.is=T
to make non-numerical vectors into characters. Also, we delete the Name
variable because we don’t need players’ names in this case study.
mlb<- read.table('https://umich.instructure.com/files/330381/download?download_frd=1', as.is=T, header=T)
str(mlb)
## 'data.frame': 1034 obs. of 6 variables:
## $ Name : chr "Adam_Donachie" "Paul_Bako" "Ramon_Hernandez" "Kevin_Millar" ...
## $ Team : chr "BAL" "BAL" "BAL" "BAL" ...
## $ Position: chr "Catcher" "Catcher" "Catcher" "First_Baseman" ...
## $ Height : int 74 74 72 72 73 69 69 71 76 71 ...
## $ Weight : int 180 215 210 210 188 176 209 200 231 180 ...
## $ Age : num 23 34.7 30.8 35.4 35.7 ...
mlb<-mlb[, -1]
Looking at the str() output, we notice that the variables Team and Position are imported as characters. To fix this we can use the function as.factor(), which converts numerical or character vectors to factors.
mlb$Team<-as.factor(mlb$Team)
mlb$Position<-as.factor(mlb$Position)
The data is good to go. Let’s explore it using some summary statistics and plots.
summary(mlb$Weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 150.0 187.0 200.0 201.7 215.0 290.0
hist(mlb$Weight, main = "Histogram for Weights")
The histogram illustrates our dependent variable Weight. As we learned in Chapter 2, this distribution is somewhat right-skewed.
Applying ggpairs() (from the GGally package) to obtain a compact dataset summary, we can mark heavy and light players (split at the median, \(light \lt median \lt heavy\)) with different colors in the plot:
require(GGally)
## Loading required package: GGally
mlb_binary = mlb
mlb_binary$bi_weight = as.factor(ifelse(mlb_binary$Weight>median(mlb_binary$Weight),1,0))
g_weight <- ggpairs(data=mlb_binary[-1], title="MLB Light/Heavy Weights",
mapping=ggplot2::aes(colour = bi_weight),
lower=list(combo=wrap("facethist",binwidth=1)),
# upper = list(continuous = wrap("cor", size = 4.75, alignPercent = 1))
)
g_weight
Next, we may also mark player positions by different colors in the plot.
g_position <- ggpairs(data=mlb[-1], title="MLB by Position",
mapping=ggplot2::aes(colour = Position),
lower=list(combo=wrap("facethist",binwidth=1)))
g_position
What about our potential predictors?
table(mlb$Team)
##
## ANA ARZ ATL BAL BOS CHC CIN CLE COL CWS DET FLA HOU KC LA MIN MLW NYM
## 35 28 37 35 36 36 36 35 35 33 37 32 34 35 33 33 35 38
## NYY OAK PHI PIT SD SEA SF STL TB TEX TOR WAS
## 32 37 36 35 33 34 34 32 33 35 34 36
table(mlb$Position)
##
## Catcher Designated_Hitter First_Baseman Outfielder
## 76 18 55 194
## Relief_Pitcher Second_Baseman Shortstop Starting_Pitcher
## 315 58 52 221
## Third_Baseman
## 45
summary(mlb$Height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.0 72.0 74.0 73.7 75.0 83.0
summary(mlb$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.90 25.44 27.92 28.74 31.23 48.52
Here we have two numerical predictors, two categorical predictors and \(1034\) observations. Let's see how R treats these different classes of variables.
Before fitting a model, let’s examine the independence of our potential predictors and the dependent variable. Multiple linear regressions assume that predictors are all independent with each other. Is this assumption valid? As we mentioned earlier, cor()
function can answer this question in pairwise manner. Note we only look at numerical variables.
cor(mlb[c("Weight", "Height", "Age")])
## Weight Height Age
## Weight 1.0000000 0.53031802 0.15784706
## Height 0.5303180 1.00000000 -0.07367013
## Age 0.1578471 -0.07367013 1.00000000
Here we can see that \(cor(y, x)=cor(x, y)\) and \(cor(x, x)=1\). Also, our Height variable is weakly and negatively related to the players' Age. This looks good and should not cause any multicollinearity problems. If two of our predictors were highly correlated, they would provide almost the same information, which could cause multicollinearity; a common practice is then to drop one of them from the model.
To visualize the pairwise correlations, we can use a scatterplot matrix. In R, the pairs() function produces these plots.
pairs(mlb[c("Weight", "Height", "Age")])
You might get a sense of the relationships, but it is difficult to see any linear pattern. We can make a more informative graph using pairs.panels() in the psych package.
# install.packages("psych")
library(psych)
pairs.panels(mlb[, c("Weight", "Height", "Age")])
This plot gives us much more information about the three variables. Above the diagonal, we have the correlation coefficients in numerical form. On the diagonal, there are histograms of the variables. Below the diagonal, more visual information is presented to help us understand the trends. This specific graph shows that Height and Weight are positively and fairly strongly correlated, while the relationships between Age and Height and between Age and Weight are very weak (the nearly horizontal red fit lines in the panels below the diagonal indicate weak relationships).
The function we are going to use in this section is lm(); no extra package is needed to use it. The lm() function takes a model formula and a dataset:
m<-lm(dv ~ iv, data=mydata)
Here dv is the dependent variable and iv specifies the independent variables (predictors), using the same formula interface as OneR() in Chapter 8; mydata is the data frame containing these variables. If we use . as iv, all of the variables except the dependent variable (dv) are included as predictors.
fit<-lm(Weight~., data=mlb)
fit
##
## Call:
## lm(formula = Weight ~ ., data = mlb)
##
## Coefficients:
## (Intercept) TeamARZ
## -164.9995 7.1881
## TeamATL TeamBAL
## -1.5631 -5.3128
## TeamBOS TeamCHC
## -0.2838 0.4026
## TeamCIN TeamCLE
## 2.1051 -1.3160
## TeamCOL TeamCWS
## -3.7836 4.2944
## TeamDET TeamFLA
## 2.3024 2.6985
## TeamHOU TeamKC
## -0.6808 -4.7664
## TeamLA TeamMIN
## 2.8598 2.1269
## TeamMLW TeamNYM
## 4.2897 -1.9736
## TeamNYY TeamOAK
## 1.7483 -0.5464
## TeamPHI TeamPIT
## -6.8486 4.3023
## TeamSD TeamSEA
## 2.6133 -0.9147
## TeamSF TeamSTL
## 0.8411 -1.1341
## TeamTB TeamTEX
## -2.6616 -0.7695
## TeamTOR TeamWAS
## 1.3943 -1.7555
## PositionDesignated_Hitter PositionFirst_Baseman
## 8.9037 2.4237
## PositionOutfielder PositionRelief_Pitcher
## -6.2636 -7.7695
## PositionSecond_Baseman PositionShortstop
## -13.0843 -16.9562
## PositionStarting_Pitcher PositionThird_Baseman
## -7.3599 -4.6035
## Height Age
## 4.7175 0.8906
As we can see from the output, factors are included in the model by creating a set of indicator (dummy) variables, one for each factor level other than the reference level. Each numerical variable has just one coefficient.
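To see the indicator coding R generates for a factor, we can inspect the design matrix directly; for example, for the Position factor alone (a small illustration, not part of the original analysis):
# first few rows of the dummy-coded design matrix for Position
head(model.matrix(Weight ~ Position, data = mlb))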
As we did in previous case studies, let’s examine model performance.
summary(fit)
##
## Call:
## lm(formula = Weight ~ ., data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.692 -10.909 -0.778 9.858 73.649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -164.9995 19.3828 -8.513 < 2e-16 ***
## TeamARZ 7.1881 4.2590 1.688 0.091777 .
## TeamATL -1.5631 3.9757 -0.393 0.694278
## TeamBAL -5.3128 4.0193 -1.322 0.186533
## TeamBOS -0.2838 4.0034 -0.071 0.943492
## TeamCHC 0.4026 3.9949 0.101 0.919749
## TeamCIN 2.1051 3.9934 0.527 0.598211
## TeamCLE -1.3160 4.0356 -0.326 0.744423
## TeamCOL -3.7836 4.0287 -0.939 0.347881
## TeamCWS 4.2944 4.1022 1.047 0.295413
## TeamDET 2.3024 3.9725 0.580 0.562326
## TeamFLA 2.6985 4.1336 0.653 0.514028
## TeamHOU -0.6808 4.0634 -0.168 0.866976
## TeamKC -4.7664 4.0242 -1.184 0.236525
## TeamLA 2.8598 4.0817 0.701 0.483686
## TeamMIN 2.1269 4.0947 0.519 0.603579
## TeamMLW 4.2897 4.0243 1.066 0.286706
## TeamNYM -1.9736 3.9493 -0.500 0.617370
## TeamNYY 1.7483 4.1234 0.424 0.671655
## TeamOAK -0.5464 3.9672 -0.138 0.890474
## TeamPHI -6.8486 3.9949 -1.714 0.086778 .
## TeamPIT 4.3023 4.0210 1.070 0.284890
## TeamSD 2.6133 4.0915 0.639 0.523148
## TeamSEA -0.9147 4.0516 -0.226 0.821436
## TeamSF 0.8411 4.0520 0.208 0.835593
## TeamSTL -1.1341 4.1193 -0.275 0.783132
## TeamTB -2.6616 4.0944 -0.650 0.515798
## TeamTEX -0.7695 4.0283 -0.191 0.848556
## TeamTOR 1.3943 4.0681 0.343 0.731871
## TeamWAS -1.7555 4.0038 -0.438 0.661142
## PositionDesignated_Hitter 8.9037 4.4533 1.999 0.045842 *
## PositionFirst_Baseman 2.4237 3.0058 0.806 0.420236
## PositionOutfielder -6.2636 2.2784 -2.749 0.006084 **
## PositionRelief_Pitcher -7.7695 2.1959 -3.538 0.000421 ***
## PositionSecond_Baseman -13.0843 2.9638 -4.415 1.12e-05 ***
## PositionShortstop -16.9562 3.0406 -5.577 3.16e-08 ***
## PositionStarting_Pitcher -7.3599 2.2976 -3.203 0.001402 **
## PositionThird_Baseman -4.6035 3.1689 -1.453 0.146613
## Height 4.7175 0.2563 18.405 < 2e-16 ***
## Age 0.8906 0.1259 7.075 2.82e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.78 on 994 degrees of freedom
## Multiple R-squared: 0.3858, Adjusted R-squared: 0.3617
## F-statistic: 16.01 on 39 and 994 DF, p-value: < 2.2e-16
plot(fit, which = 1:2)
The summary shows how well the model fits the dataset.
Residuals: This section describes the residuals. If some observations have extremely large or extremely small residuals compared to the rest, they are either outliers due to reporting error or points the model fits poorly. We have \(73.649\) as our maximum and \(-48.692\) as our minimum residual. Their extremeness can be examined in the residual diagnostic plots.
Coefficients: In this section, we look at the rightmost column containing stars. Stars or dots next to a variable indicate that it is statistically significant and should be kept in the model; if nothing appears next to a variable, its estimated coefficient may well be 0 in the linear model. We can also look directly at the Pr(>|t|) column: a value close to 0 indicates that the corresponding variable is significant, otherwise it could be dropped from the model.
Here, none of the team indicators is significant at the 0.05 level, whereas most of the position indicators are, and Age and Height are both highly significant.
R-squared: the proportion of the variation in y explained by the included predictors. Here we have 38.58%, which indicates that the model is not bad but could be improved; a well-fitted linear regression would typically explain over 70%.
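The \(R^2\) values can also be extracted programmatically from the fitted model object:
summary(fit)$r.squared        # multiple R-squared
summary(fit)$adj.r.squared    # adjusted R-squared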
The diagnostic plots also help us understand the model fit.
Residuals vs Fitted: This is the residual diagnostic plot. We can see that the residuals of observations indexed \(65\), \(160\) and \(237\) are relatively far from the rest; they are potential influential points or outliers.
Normal Q-Q: This plot examines the normality assumption of the model. If the points follow the reference line, the normality assumption is plausible. In our case, the points stay relatively close to the line, so the model is approximately valid in terms of normality.
We can employ the step function to perform forward or backward selection of important features/predictors. It works for both lm and glm models. In most cases, backward selection is preferable because it starts from the full model and tends to retain larger models. There are various criteria for evaluating a model; commonly used ones include AIC, BIC and adjusted \(R^2\). Let's compare the backward and forward model selection approaches. The step function argument direction controls this (the default is both, which combines backward and forward steps and selects the better move at each stage).
step(fit,direction = "backward")
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5847.4
## <none> 279793 5871.0
## - Age 1 14090 293883 5919.8
## - Position 8 20301 300095 5927.5
## - Height 1 95356 375149 6172.3
##
## Step: AIC=5847.45
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5847.4
## - Age 1 14616 303877 5896.4
## - Position 8 20406 309668 5901.9
## - Height 1 100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) PositionDesignated_Hitter
## -168.0474 8.6968
## PositionFirst_Baseman PositionOutfielder
## 2.7780 -6.0457
## PositionRelief_Pitcher PositionSecond_Baseman
## -7.7782 -13.0267
## PositionShortstop PositionStarting_Pitcher
## -16.4821 -7.3961
## PositionThird_Baseman Height
## -4.1361 4.7639
## Age
## 0.8771
step(fit,direction = "forward")
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Call:
## lm(formula = Weight ~ Team + Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) TeamARZ
## -164.9995 7.1881
## TeamATL TeamBAL
## -1.5631 -5.3128
## TeamBOS TeamCHC
## -0.2838 0.4026
## TeamCIN TeamCLE
## 2.1051 -1.3160
## TeamCOL TeamCWS
## -3.7836 4.2944
## TeamDET TeamFLA
## 2.3024 2.6985
## TeamHOU TeamKC
## -0.6808 -4.7664
## TeamLA TeamMIN
## 2.8598 2.1269
## TeamMLW TeamNYM
## 4.2897 -1.9736
## TeamNYY TeamOAK
## 1.7483 -0.5464
## TeamPHI TeamPIT
## -6.8486 4.3023
## TeamSD TeamSEA
## 2.6133 -0.9147
## TeamSF TeamSTL
## 0.8411 -1.1341
## TeamTB TeamTEX
## -2.6616 -0.7695
## TeamTOR TeamWAS
## 1.3943 -1.7555
## PositionDesignated_Hitter PositionFirst_Baseman
## 8.9037 2.4237
## PositionOutfielder PositionRelief_Pitcher
## -6.2636 -7.7695
## PositionSecond_Baseman PositionShortstop
## -13.0843 -16.9562
## PositionStarting_Pitcher PositionThird_Baseman
## -7.3599 -4.6035
## Height Age
## 4.7175 0.8906
step(fit,direction = "both")
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5847.4
## <none> 279793 5871.0
## - Age 1 14090 293883 5919.8
## - Position 8 20301 300095 5927.5
## - Height 1 95356 375149 6172.3
##
## Step: AIC=5847.45
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5847.4
## + Team 29 9468 279793 5871.0
## - Age 1 14616 303877 5896.4
## - Position 8 20406 309668 5901.9
## - Height 1 100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) PositionDesignated_Hitter
## -168.0474 8.6968
## PositionFirst_Baseman PositionOutfielder
## 2.7780 -6.0457
## PositionRelief_Pitcher PositionSecond_Baseman
## -7.7782 -13.0267
## PositionShortstop PositionStarting_Pitcher
## -16.4821 -7.3961
## PositionThird_Baseman Height
## -4.1361 4.7639
## Age
## 0.8771
We can observe that forward selection (starting from the full model) retains the whole model, whereas backward step-wise selection produces the better, more parsimonious model.
Both backward and forward selection are greedy algorithms and neither guarantees an optimal model. Optimal feature selection requires exploring every possible combination of the predictors, which is practically infeasible due to the combinatorial explosion in the number of candidate subsets (\(2^k\) for \(k\) predictors).
Alternatively, we can choose models based on various information criteria.
step(fit,k=2)
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5847.4
## <none> 279793 5871.0
## - Age 1 14090 293883 5919.8
## - Position 8 20301 300095 5927.5
## - Height 1 95356 375149 6172.3
##
## Step: AIC=5847.45
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5847.4
## - Age 1 14616 303877 5896.4
## - Position 8 20406 309668 5901.9
## - Height 1 100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) PositionDesignated_Hitter
## -168.0474 8.6968
## PositionFirst_Baseman PositionOutfielder
## 2.7780 -6.0457
## PositionRelief_Pitcher PositionSecond_Baseman
## -7.7782 -13.0267
## PositionShortstop PositionStarting_Pitcher
## -16.4821 -7.3961
## PositionThird_Baseman Height
## -4.1361 4.7639
## Age
## 0.8771
step(fit,k=log(nrow(mlb)))
## Start: AIC=6068.69
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5901.8
## <none> 279793 6068.7
## - Position 8 20301 300095 6085.6
## - Age 1 14090 293883 6112.5
## - Height 1 95356 375149 6365.0
##
## Step: AIC=5901.8
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5901.8
## - Position 8 20406 309668 5916.8
## - Age 1 14616 303877 5945.8
## - Height 1 100435 389697 6203.0
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) PositionDesignated_Hitter
## -168.0474 8.6968
## PositionFirst_Baseman PositionOutfielder
## 2.7780 -6.0457
## PositionRelief_Pitcher PositionSecond_Baseman
## -7.7782 -13.0267
## PositionShortstop PositionStarting_Pitcher
## -16.4821 -7.3961
## PositionThird_Baseman Height
## -4.1361 4.7639
## Age
## 0.8771
\(k = 2\) yields the genuine AIC criterion, and \(k = log(n)\) refers to BIC. Let’s try to evaluate model performance again.
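As a quick aside, R also provides AIC() and BIC() functions that compute these criteria directly from a fitted model; note that their absolute values differ from the ones printed by step(), which relies on extractAIC() and drops an additive constant:
AIC(fit)    # Akaike information criterion for the full model
BIC(fit)    # Bayesian information criterion for the full model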
fit2 = step(fit,k=2,direction = "backward")
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5847.4
## <none> 279793 5871.0
## - Age 1 14090 293883 5919.8
## - Position 8 20301 300095 5927.5
## - Height 1 95356 375149 6172.3
##
## Step: AIC=5847.45
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5847.4
## - Age 1 14616 303877 5896.4
## - Position 8 20406 309668 5901.9
## - Height 1 100435 389697 6153.6
summary(fit2)
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.427 -10.855 -0.344 10.110 75.301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -168.0474 19.0351 -8.828 < 2e-16 ***
## PositionDesignated_Hitter 8.6968 4.4258 1.965 0.049679 *
## PositionFirst_Baseman 2.7780 2.9942 0.928 0.353741
## PositionOutfielder -6.0457 2.2778 -2.654 0.008072 **
## PositionRelief_Pitcher -7.7782 2.1913 -3.550 0.000403 ***
## PositionSecond_Baseman -13.0267 2.9531 -4.411 1.14e-05 ***
## PositionShortstop -16.4821 3.0372 -5.427 7.16e-08 ***
## PositionStarting_Pitcher -7.3961 2.2959 -3.221 0.001316 **
## PositionThird_Baseman -4.1361 3.1656 -1.307 0.191647
## Height 4.7639 0.2528 18.847 < 2e-16 ***
## Age 0.8771 0.1220 7.190 1.25e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.82 on 1023 degrees of freedom
## Multiple R-squared: 0.365, Adjusted R-squared: 0.3588
## F-statistic: 58.81 on 10 and 1023 DF, p-value: < 2.2e-16
plot(fit2, which = 1:2)
Sometimes, we prefer a simpler model even at a small cost in performance. In this case, we have a simpler model with \(R^2=0.365\), and the overall model is still highly significant. We can see that observations \(65\), \(160\) and \(237\) again have relatively large residuals; they are potential influential points or outliers.
We can also inspect the leverage points, that is, observations with unusual predictor values; such points may turn out to be outliers, influential points, or both.
# Half-normal plot for leverages
# install.packages("faraway")
library(faraway)
##
## Attaching package: 'faraway'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:GGally':
##
## happy
halfnorm(lm.influence(fit)$hat, nlab = 2, ylab="Leverages")
mlb[c(226,879),]
## Team Position Height Weight Age
## 226 NYY Designated_Hitter 75 230 36.14
## 879 SD Designated_Hitter 73 200 25.60
summary(mlb)
## Team Position Height Weight
## NYM : 38 Relief_Pitcher :315 Min. :67.0 Min. :150.0
## ATL : 37 Starting_Pitcher:221 1st Qu.:72.0 1st Qu.:187.0
## DET : 37 Outfielder :194 Median :74.0 Median :200.0
## OAK : 37 Catcher : 76 Mean :73.7 Mean :201.7
## BOS : 36 Second_Baseman : 58 3rd Qu.:75.0 3rd Qu.:215.0
## CHC : 36 First_Baseman : 55 Max. :83.0 Max. :290.0
## (Other):813 (Other) :115
## Age
## Min. :20.90
## 1st Qu.:25.44
## Median :27.93
## Mean :28.74
## 3rd Qu.:31.23
## Max. :48.52
##
A deeper discussion of variable selection, controlling the false discovery rate, is provided in Chapters 16 and 17.
In linear regression, the relationship between independent and dependent variables is assumed to be linear. However, this might not be the case. The relationship between age and weight could be quadratic, since middle-aged people might gain weight dramatically.
mlb$age2<-(mlb$Age)^2
fit2<-lm(Weight~., data=mlb)
summary(fit2)
##
## Call:
## lm(formula = Weight ~ ., data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.068 -10.775 -1.021 9.922 74.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -209.07068 27.49529 -7.604 6.65e-14 ***
## TeamARZ 7.41943 4.25154 1.745 0.081274 .
## TeamATL -1.43167 3.96793 -0.361 0.718318
## TeamBAL -5.38735 4.01119 -1.343 0.179552
## TeamBOS -0.06614 3.99633 -0.017 0.986799
## TeamCHC 0.14541 3.98833 0.036 0.970923
## TeamCIN 2.24022 3.98571 0.562 0.574201
## TeamCLE -1.07546 4.02870 -0.267 0.789563
## TeamCOL -3.87254 4.02069 -0.963 0.335705
## TeamCWS 4.20933 4.09393 1.028 0.304111
## TeamDET 2.66990 3.96769 0.673 0.501160
## TeamFLA 3.14627 4.12989 0.762 0.446343
## TeamHOU -0.77230 4.05526 -0.190 0.849000
## TeamKC -4.90984 4.01648 -1.222 0.221837
## TeamLA 3.13554 4.07514 0.769 0.441820
## TeamMIN 2.09951 4.08631 0.514 0.607512
## TeamMLW 4.16183 4.01646 1.036 0.300363
## TeamNYM -1.25057 3.95424 -0.316 0.751870
## TeamNYY 1.67825 4.11502 0.408 0.683482
## TeamOAK -0.68235 3.95951 -0.172 0.863212
## TeamPHI -6.85071 3.98672 -1.718 0.086039 .
## TeamPIT 4.12683 4.01348 1.028 0.304086
## TeamSD 2.59525 4.08310 0.636 0.525179
## TeamSEA -0.67316 4.04471 -0.166 0.867853
## TeamSF 1.06038 4.04481 0.262 0.793255
## TeamSTL -1.38669 4.11234 -0.337 0.736037
## TeamTB -2.44396 4.08716 -0.598 0.550003
## TeamTEX -0.68740 4.02023 -0.171 0.864270
## TeamTOR 1.24439 4.06029 0.306 0.759306
## TeamWAS -1.87599 3.99594 -0.469 0.638835
## PositionDesignated_Hitter 8.94440 4.44417 2.013 0.044425 *
## PositionFirst_Baseman 2.55100 3.00014 0.850 0.395368
## PositionOutfielder -6.25702 2.27372 -2.752 0.006033 **
## PositionRelief_Pitcher -7.68904 2.19166 -3.508 0.000471 ***
## PositionSecond_Baseman -13.01400 2.95787 -4.400 1.20e-05 ***
## PositionShortstop -16.82243 3.03494 -5.543 3.81e-08 ***
## PositionStarting_Pitcher -7.08215 2.29615 -3.084 0.002096 **
## PositionThird_Baseman -4.66452 3.16249 -1.475 0.140542
## Height 4.71888 0.25578 18.449 < 2e-16 ***
## Age 3.82295 1.30621 2.927 0.003503 **
## age2 -0.04791 0.02124 -2.255 0.024327 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.74 on 993 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3643
## F-statistic: 15.8 on 40 and 993 DF, p-value: < 2.2e-16
Adding the quadratic term raised the overall \(R^2\) to \(0.3889\).
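As a quick check (not part of the original analysis), we could formally test whether the quadratic term improves the fit via a nested-model F-test comparing the earlier full model fit with the quadratic model fit2:
anova(fit, fit2)    # F-test for the added age2 term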
As discussed earlier, middle-aged people might have a different weight pattern than younger people. Instead of a single line, the overall pattern may be better described by allowing a separate shift for young versus middle-aged players. We take 30 as the threshold: players aged 30 or above are allowed a different (shifted) regression line than those under 30. Here we use the ifelse() function, mentioned in Chapter 7, to create the threshold indicator.
mlb$age30<-ifelse(mlb$Age>=30, 1, 0)
fit3<-lm(Weight~Team+Position+Age+age30+Height, data=mlb)
summary(fit3)
##
## Call:
## lm(formula = Weight ~ Team + Position + Age + age30 + Height,
## data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.313 -11.166 -0.916 10.044 73.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -159.8884 19.8862 -8.040 2.54e-15 ***
## TeamARZ 7.4096 4.2627 1.738 0.082483 .
## TeamATL -1.4379 3.9765 -0.362 0.717727
## TeamBAL -5.3393 4.0187 -1.329 0.184284
## TeamBOS -0.1985 4.0034 -0.050 0.960470
## TeamCHC 0.4669 3.9947 0.117 0.906976
## TeamCIN 2.2124 3.9939 0.554 0.579741
## TeamCLE -1.1624 4.0371 -0.288 0.773464
## TeamCOL -3.6842 4.0290 -0.914 0.360717
## TeamCWS 4.1920 4.1025 1.022 0.307113
## TeamDET 2.4708 3.9746 0.622 0.534314
## TeamFLA 2.8563 4.1352 0.691 0.489903
## TeamHOU -0.4964 4.0659 -0.122 0.902846
## TeamKC -4.7138 4.0238 -1.171 0.241692
## TeamLA 2.9194 4.0814 0.715 0.474586
## TeamMIN 2.2885 4.0965 0.559 0.576528
## TeamMLW 4.4749 4.0269 1.111 0.266731
## TeamNYM -1.8173 3.9510 -0.460 0.645659
## TeamNYY 1.7074 4.1229 0.414 0.678867
## TeamOAK -0.3388 3.9707 -0.085 0.932012
## TeamPHI -6.6192 3.9993 -1.655 0.098220 .
## TeamPIT 4.6716 4.0332 1.158 0.247029
## TeamSD 2.8600 4.0965 0.698 0.485243
## TeamSEA -1.0121 4.0518 -0.250 0.802809
## TeamSF 1.0244 4.0545 0.253 0.800587
## TeamSTL -1.1094 4.1187 -0.269 0.787703
## TeamTB -2.4485 4.0980 -0.597 0.550312
## TeamTEX -0.6112 4.0300 -0.152 0.879485
## TeamTOR 1.3959 4.0674 0.343 0.731532
## TeamWAS -1.4189 4.0139 -0.354 0.723784
## PositionDesignated_Hitter 9.2378 4.4621 2.070 0.038683 *
## PositionFirst_Baseman 2.6074 3.0096 0.866 0.386501
## PositionOutfielder -6.0408 2.2863 -2.642 0.008367 **
## PositionRelief_Pitcher -7.5100 2.2072 -3.403 0.000694 ***
## PositionSecond_Baseman -12.8870 2.9683 -4.342 1.56e-05 ***
## PositionShortstop -16.8912 3.0406 -5.555 3.56e-08 ***
## PositionStarting_Pitcher -7.0825 2.3099 -3.066 0.002227 **
## PositionThird_Baseman -4.4307 3.1719 -1.397 0.162773
## Age 0.6904 0.2153 3.207 0.001386 **
## age30 2.2636 1.9749 1.146 0.251992
## Height 4.7113 0.2563 18.380 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.77 on 993 degrees of freedom
## Multiple R-squared: 0.3866, Adjusted R-squared: 0.3619
## F-statistic: 15.65 on 40 and 993 DF, p-value: < 2.2e-16
This model performs worse than the quadratic model in terms of \(R^2\), and age30 is not significant, so we will stick with the quadratic model.
So far, each feature’s individual effect has considered in our model. It is possible that features act in pairs to affect our independent variable. Let’s examine that deeper.
An interaction is a combined effect of two features. If we are not sure whether two features interact, we can test this by adding an interaction term to the model; if the interaction term is significant, we have evidence of an interaction between the two features.
fit4<-lm(Weight~Team+Height+Age*Position+age2, data=mlb)
summary(fit4)
##
## Call:
## lm(formula = Weight ~ Team + Height + Age * Position + age2,
## data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.761 -11.049 -0.761 9.911 75.533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -199.15403 29.87269 -6.667 4.35e-11 ***
## TeamARZ 8.10376 4.26339 1.901 0.0576 .
## TeamATL -0.81743 3.97899 -0.205 0.8373
## TeamBAL -4.64820 4.03972 -1.151 0.2502
## TeamBOS 0.37698 4.00743 0.094 0.9251
## TeamCHC 0.33104 3.99507 0.083 0.9340
## TeamCIN 2.56023 3.99603 0.641 0.5219
## TeamCLE -0.66254 4.03154 -0.164 0.8695
## TeamCOL -3.72098 4.03759 -0.922 0.3570
## TeamCWS 4.63266 4.10884 1.127 0.2598
## TeamDET 3.21380 3.98231 0.807 0.4199
## TeamFLA 3.56432 4.14902 0.859 0.3905
## TeamHOU -0.38733 4.07249 -0.095 0.9242
## TeamKC -4.66678 4.02384 -1.160 0.2464
## TeamLA 3.51766 4.09400 0.859 0.3904
## TeamMIN 2.31585 4.10502 0.564 0.5728
## TeamMLW 4.34793 4.02501 1.080 0.2803
## TeamNYM -0.28505 3.98537 -0.072 0.9430
## TeamNYY 1.87847 4.12774 0.455 0.6491
## TeamOAK -0.23791 3.97729 -0.060 0.9523
## TeamPHI -6.25671 3.99545 -1.566 0.1177
## TeamPIT 4.18719 4.01944 1.042 0.2978
## TeamSD 2.97028 4.08838 0.727 0.4677
## TeamSEA -0.07220 4.05922 -0.018 0.9858
## TeamSF 1.35981 4.07771 0.333 0.7388
## TeamSTL -1.23460 4.11960 -0.300 0.7645
## TeamTB -1.90885 4.09592 -0.466 0.6413
## TeamTEX -0.31570 4.03146 -0.078 0.9376
## TeamTOR 1.73976 4.08565 0.426 0.6703
## TeamWAS -1.43933 4.00274 -0.360 0.7192
## Height 4.70632 0.25646 18.351 < 2e-16 ***
## Age 3.32733 1.37088 2.427 0.0154 *
## PositionDesignated_Hitter -44.82216 30.68202 -1.461 0.1444
## PositionFirst_Baseman 23.51389 20.23553 1.162 0.2455
## PositionOutfielder -13.33140 15.92500 -0.837 0.4027
## PositionRelief_Pitcher -16.51308 15.01240 -1.100 0.2716
## PositionSecond_Baseman -26.56932 20.18773 -1.316 0.1884
## PositionShortstop -27.89454 20.72123 -1.346 0.1786
## PositionStarting_Pitcher -2.44578 15.36376 -0.159 0.8736
## PositionThird_Baseman -10.20102 23.26121 -0.439 0.6611
## age2 -0.04201 0.02170 -1.936 0.0531 .
## Age:PositionDesignated_Hitter 1.77289 1.00506 1.764 0.0780 .
## Age:PositionFirst_Baseman -0.71111 0.67848 -1.048 0.2949
## Age:PositionOutfielder 0.24147 0.53650 0.450 0.6527
## Age:PositionRelief_Pitcher 0.30374 0.50564 0.601 0.5482
## Age:PositionSecond_Baseman 0.46281 0.68281 0.678 0.4981
## Age:PositionShortstop 0.38257 0.70998 0.539 0.5901
## Age:PositionStarting_Pitcher -0.17104 0.51976 -0.329 0.7422
## Age:PositionThird_Baseman 0.18968 0.79561 0.238 0.8116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.73 on 985 degrees of freedom
## Multiple R-squared: 0.3945, Adjusted R-squared: 0.365
## F-statistic: 13.37 on 48 and 985 DF, p-value: < 2.2e-16
Here we can see that the overall \(R^2\) improved and some of the interactions are significant at the 0.1 level.
As we saw in Chapter 8, a decision tree is built from multiple if-else logical decisions and can classify observations. We can add regression into decision trees so that a tree can also make numerical predictions.
Numeric prediction trees are built in the same way as classification trees. Recall the discussion in Chapter 8 where the data are partitioned first via a divide-and-conquer strategy based on features. The homogeneity of the resulting classification trees is measured by various metrics, e.g., entropy. In prediction, tree homogeneity is measured by statistics such as variance, standard deviation or absolute deviation from the mean.
A common splitting criterion for regression trees is the standard deviation reduction (SDR):
\[SDR=sd(T)-\sum_{i} \frac{\left | T_i \right |}{\left | T \right |} \times sd(T_i),\]
where \(sd(T)\) is the standard deviation of the original data \(T\), the sum runs over all segments produced by the split, \(\frac{|T_i|}{|T|}\) is the proportion of observations falling in the \(i^{th}\) segment, and \(sd(T_i)\) is the standard deviation of the \(i^{th}\) segment.
An example for this would be: \[Original\, data:\{1, 2, 3, 3, 4, 5, 6, 6, 7, 8\}\] \[Split\, method\, 1:\{1, 2, 3|3, 4, 5, 6, 6, 7, 8\}\] \[Split\, method\, 2:\{1, 2, 3, 3, 4, 5|6, 6, 7, 8\}\] In split method 1, \(T_1=\{1, 2, 3\}\), \(T_2=\{3, 4, 5, 6, 6, 7, 8\}\). In split method 2, \(T_1=\{1, 2, 3, 3, 4, 5\}\), \(T_2=\{6, 6, 7, 8\}\).
ori<-c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8)
at1<-c(1, 2, 3)
at2<-c(3, 4, 5, 6, 6, 7, 8)
bt1<-c(1, 2, 3, 3, 4, 5)
bt2<-c(6, 6, 7, 8)
# SDR for each candidate split: sd of the parent minus the weighted average sd of the children
sdr_a<-sd(ori)-(length(at1)/length(ori)*sd(at1)+length(at2)/length(ori)*sd(at2))
sdr_b<-sd(ori)-(length(bt1)/length(ori)*sd(bt1)+length(bt2)/length(ori)*sd(bt2))
sdr_a
## [1] 0.7702557
sdr_b
## [1] 1.041531
Here length() is used to obtain the number of elements in each vector.
A larger SDR indicates a greater reduction in standard deviation after splitting. Here split method 2 has the greater SDR, so the tree uses the second split first; it produces more homogeneous subsets than the first one.
Next, the tree would split bt1 and bt2 further following the same rule (the split with the greater SDR wins). Assume we cannot split further, so that bt1 and bt2 are terminal nodes. Observations falling into bt1 are predicted with \(mean(bt1)=3\) and those falling into bt2 with \(mean(bt2)=6.75\).
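We can verify these terminal-node predictions directly from the vectors defined above:
mean(bt1)    # predicted value for observations in the first terminal node (3)
mean(bt2)    # predicted value for observations in the second terminal node (6.75)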
We still use the mlb dataset for this section. This dataset has 1034 observations. Let’s try to separate them into training and test datasets first.
set.seed(1234)
train_index <- sample(seq_len(nrow(mlb)), size = 0.75*nrow(mlb))
mlb_train<-mlb[train_index, ]
mlb_test<-mlb[-train_index, ]
Here we use a randomized split (75%-25%) to divide the training and test datasets.
In R, the function for fitting regression (prediction) tree models is rpart(), from the rpart package:
m<-rpart(dv~iv, data=mydata)
As before, dv is the dependent variable and iv specifies the independent variables. We use two numerical features of the mlb data ("01a_data.txt"), Age and Height, as predictors.
#install.packages("rpart")
library(rpart)
##
## Attaching package: 'rpart'
## The following object is masked from 'package:faraway':
##
## solder
mlb.rpart<-rpart(Weight~Height+Age, data=mlb_train)
mlb.rpart
## n= 775
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 775 323502.600 201.4361
## 2) Height< 73.5 366 112465.500 192.5000
## 4) Height< 70.5 55 9865.382 178.7818 *
## 5) Height>=70.5 311 90419.300 194.9260
## 10) Age< 31.585 234 71123.060 192.8547 *
## 11) Age>=31.585 77 15241.250 201.2208 *
## 3) Height>=73.5 409 155656.400 209.4328
## 6) Height< 76.5 335 118511.700 206.8627
## 12) Age< 28.6 194 75010.250 202.2938
## 24) Height< 74.5 76 20688.040 196.8026 *
## 25) Height>=74.5 118 50554.610 205.8305 *
## 13) Age>=28.6 141 33879.870 213.1489 *
## 7) Height>=76.5 74 24914.660 221.0676
## 14) Age< 25.37 12 3018.000 206.0000 *
## 15) Age>=25.37 62 18644.980 223.9839 *
The output contains rich information: split indicates the splitting rule, n is the number of observations that fall in the segment, and yval is the predicted value for observations that fall into that segment.
A fancy way of drawing the rpart decision tree is via the rpart.plot() function from the rpart.plot package.
# install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(mlb.rpart, digits=3)
A more detailed graph can be obtained by specifying more options in the function call.
rpart.plot(mlb.rpart, digits = 4, fallen.leaves = T, type=3, extra=101)
You can also draw an even fancier tree plot using the rattle package; from this plot you can observe the order and the rules of the splits.
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(mlb.rpart, cex = 0.8)
Let’s make predictions with the prediction tree. predict()
command is used in this section.
mlb.p<-predict(mlb.rpart, mlb_test)
summary(mlb.p)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 178.8 192.9 201.2 201.2 213.1 224.0
summary(mlb_test$Weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 155.0 185.0 200.0 202.6 220.0 290.0
Comparing the five-number summaries of the predicted and true Weight values, we see that the model cannot precisely identify extreme cases such as the maximum; however, within the IQR, the predictions are relatively accurate.
Correlation can be used to measure the correspondence between two numeric vectors of equal length. Let's use cor() to examine the prediction accuracy.
cor(mlb.p, mlb_test$Weight)
## [1] 0.502253
The predicted values (\(Weights\)) are moderately correlated with their true value counterparts. Chapter 13 provides additional strategies for model quality assessment.
To measure the distance between the predicted and the true values, we can use the mean absolute error (MAE): \[MAE=\frac{1}{n}\sum_{i=1}^{n}|pred_i-obs_i|,\] where \(pred_i\) is the \(i^{th}\) predicted value and \(obs_i\) is the \(i^{th}\) observed value. Let's write a corresponding MAE function in R and evaluate our model performance.
MAE<-function(obs, pred){
mean(abs(obs-pred))
}
MAE(mlb_test$Weight, mlb.p)
## [1] 15.10803
This implies that, on average, the difference between a predicted value and the observed value is about 15.1. Considering that the Weight variable in our test dataset ranges from 155 to 290, the model performs reasonably well.
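One informal way to put the MAE in context is to express it relative to the spread of the outcome in the test set:
# MAE as a fraction of the observed Weight range in the test data
MAE(mlb_test$Weight, mlb.p) / diff(range(mlb_test$Weight))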
What if we are using the most primitive method for prediction - the mean?
mean(mlb_test$Weight)
## [1] 202.556
MAE(mlb_test$Weight, 202.3643)
## [1] 18.17063
This shows that the regression tree predicts better than simply using the overall mean for every observation in the test dataset. However, it is not dramatically better, so there may be room for improvement.
To improve the performance of our tree, we are going to use a model tree instead of a regression tree. The M5P() function, from the RWeka package, implements the M5 algorithm and uses a similar syntax to rpart():
m<-M5P(dv~iv, data=mydata)
#install.packages("RWeka")
# Sometimes RWeka installations may be off a bit, see:
# http://stackoverflow.com/questions/41878226/using-rweka-m5p-in-rstudio-yields-java-lang-noclassdeffounderror-no-uib-cipr-ma
Sys.getenv("WEKA_HOME") # where does it point to? Maybe some obscure path?
## [1] ""
# if yes, correct the variable:
Sys.setenv(WEKA_HOME="C:\\MY\\PATH\\WEKA_WPM")
library(RWeka)
# WPM("list-packages", "installed")
mlb.m5<-M5P(Weight~Height+Age, data=mlb_train)
mlb.m5
## M5 pruned model tree:
## (using smoothed linear models)
## LM1 (775/82.142%)
##
## LM num: 1
## Weight =
## 5.0343 * Height
## + 1.0169 * Age
## - 198.8265
##
## Number of Rules : 1
Instead of using segment averages for prediction, this model uses a linear regression (LM1) at the terminal node. For datasets with more variables, M5P may produce multiple linear models at different terminal nodes.
summary(mlb.m5)
##
## === Summary ===
##
## Correlation coefficient 0.5703
## Mean absolute error 13.0634
## Root mean squared error 16.7824
## Relative absolute error 80.2838 %
## Root relative squared error 82.1424 %
## Total Number of Instances 775
mlb.p.m5<-predict(mlb.m5, mlb_test)
summary(mlb.p.m5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 168.6 192.1 200.8 201.3 210.0 258.2
cor(mlb.p.m5, mlb_test$Weight)
## [1] 0.5563317
MAE(mlb_test$Weight, mlb.p.m5)
## [1] 14.9374
summary(mlb.m5) gives us some rough diagnostic statistics. We can see that both the correlation and the MAE for this model are better than those of the previous rpart() model.
Let’s use the heart attack dataset for practice.
heart_attack<-read.csv("https://umich.instructure.com/files/1644953/download?download_frd=1", stringsAsFactors = F)
str(heart_attack)
## 'data.frame': 150 obs. of 8 variables:
## $ Patient : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DIAGNOSIS: int 41041 41041 41091 41081 41091 41091 41091 41091 41041 41041 ...
## $ SEX : chr "F" "F" "F" "F" ...
## $ DRG : int 122 122 122 122 122 121 121 121 121 123 ...
## $ DIED : int 0 0 0 0 0 0 0 0 0 1 ...
## $ CHARGES : chr "4752" "3941" "3657" "1481" ...
## $ LOS : int 10 6 5 2 1 9 15 15 2 1 ...
## $ AGE : int 79 34 76 80 55 84 84 70 76 65 ...
To begin with, we need to convert CHARGES (the dependent variable) to numeric form. NAs are introduced by the coercion, so we keep only the complete cases, as mentioned at the beginning of this chapter. Also, let's create a gender variable as an indicator for female patients using ifelse() and delete the original SEX column.
heart_attack$CHARGES<-as.numeric(heart_attack$CHARGES)
## Warning: NAs introduced by coercion
heart_attack<-heart_attack[complete.cases(heart_attack), ]
heart_attack$gender<-ifelse(heart_attack$SEX=="F", 1, 0)
heart_attack<-heart_attack[, -3]
Now we can build a model tree using M5P() with all the remaining features. As usual, we separate the heart_attack data into training and test datasets (using a 75%-25% split). After using the model to predict CHARGES in the test dataset, we obtain the following correlation and MAE.
## [1] 0.3316823
## [1] 2867.884
We can see that the predicted and observed values are moderately correlated. The MAE may seem very large at first glance.
range(ha_test$CHARGES)
## [1] 815 17137
# range width: 17137-815 = 16322
# 2867.884/16322 is approximately 0.18, i.e., within 20% of the range
However, the test data itself has a wide range and the MAE is within 20% of the range. With only 148 observations, the model did a fairly good job in prediction. Can you reproduce or perhaps improve these results?
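A minimal sketch of one possible workflow is shown below (assuming a 75%-25% random split; the exact correlation and MAE will depend on the seed and split):
set.seed(1234)                               # arbitrary seed, for reproducibility
ha_index <- sample(seq_len(nrow(heart_attack)), size = 0.75*nrow(heart_attack))
ha_train <- heart_attack[ha_index, ]
ha_test  <- heart_attack[-ha_index, ]
ha.m5 <- M5P(CHARGES ~ ., data = ha_train)   # model tree using all remaining features
ha.pred <- predict(ha.m5, ha_test)
cor(ha.pred, ha_test$CHARGES)
MAE(ha_test$CHARGES, ha.pred)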
Try to replicate these results with other data from the list of our Case-Studies.