In this chapter we present the fundamental linear algebra concepts underpinning many modern matrix computing operations, vector-based operations, linear modeling techniques, eigenspectra decompositions, multivariate linear regression modeling, ordinary least squares estimation, machine learning and artificial intelligence algorithms. We will demonstrate these techniques using simulated data and real observed data of baseball players and clinical heart attack patients.
Some students and readers may find it useful to first review some of the fundamental mathematical representations, analytical modeling techniques, and basic concepts. These foundations play critical roles in all subsequent chapters and sections. Examples of core mathematical principles include calculus of differentiation and integration; representation of scalars, vectors, matrices, and tensors; displacement, velocity, and acceleration; polynomials, exponents, and logarithmic functions; Taylor’s series; complex numbers; ordinary and partial differential equations; probability and statistics; statistical moments; and probability distributions.
Linear algebra is a branch of mathematics that studies linear associations using vectors, vector-spaces, linear equations, linear transformations and matrices. Although it is generally challenging to visualize complex data, e.g., large vectors, tensors, and tables in n-dimensional Euclidean spaces (\(n\ge 3\)), linear algebra allows us to represent, model, synthesize and summarize such complex data.
Virtually all natural processes permit first-order linear approximations. This is helpful because linear equations are easy to write, interpret, and solve, and these first-order approximations may be sufficient to practically assess the process, determine general trends, identify potential patterns, and suggest associations in the data.
Linear equations represent the simplest type of models for many processes. Higher-order models may include additional nonlinear terms, e.g., a Taylor-series expansion. Linear algebra provides the foundation for linear representation, analytics, solutions, inference and visualization of first-order affine models. Linear algebra is a special case of the much broader mathematical field of functional analysis, which can be viewed as the infinite-dimensional version of linear algebra.
Specifically, linear algebra allows us to computationally manipulate, model, solve, and interpret complex systems of equations representing large numbers of dimensions/variables. Arbitrarily large problems can be mathematically transformed into simple matrix equations of the form \(A x = b\) or \(A x = \lambda x\).
In this chapter, we review the fundamentals of linear algebra, matrix manipulation and their applications to representation, modeling, and analysis of real data. Specifically, we will cover (1) construction of matrices and matrix operations, (2) general matrix algebra notations, (3) eigenvalues and eigenvectors of linear operators, (4) least squares estimation, and (5) linear regression and variance-covariance matrices.
The easiest way to create a matrix in R is by using the matrix() or array() functions, which allow reshaping a long vector into a matrix or a tensor of a given size.
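The code chunks that generated the outputs below are not displayed in this extract; a minimal sketch that would produce similar results follows (the variable names m1-m3 are illustrative, and the entries of the random matrix m3 will differ from run to run).

m1 <- matrix(1:6, nrow=2, ncol=3)        # fold the vector 1:6 into a 2x3 matrix (column by column)
m1
m2 <- diag(1:6)                          # a 6x6 diagonal matrix (diag() is discussed below)
m2
m3 <- matrix(rnorm(20), nrow=5, ncol=4)  # a 5x4 matrix of standard normal draws
m3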
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 0 0 0 0 0
## [2,] 0 2 0 0 0 0
## [3,] 0 0 3 0 0 0
## [4,] 0 0 0 4 0 0
## [5,] 0 0 0 0 5 0
## [6,] 0 0 0 0 0 6
## [,1] [,2] [,3] [,4]
## [1,] 0.6853547 -0.2235185 1.67162225 1.3844034
## [2,] 1.0129383 -0.8752916 0.25179456 -1.9376570
## [3,] 1.1909151 -0.3486104 0.04481644 0.3068674
## [4,] -1.4903572 1.9290822 -1.71315143 -0.2505063
## [5,] 1.5944445 -0.2161392 1.03280995 -1.6371302
The function diag() is very useful. When the object is a vector, it creates a diagonal matrix with the vector in the principal diagonal.
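A sketch of the call that would generate the output below, assuming the vector 1:3:

diag(1:3)    # 3x3 diagonal matrix with 1, 2, 3 on the diagonal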
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 2 0
## [3,] 0 0 3
When the object is a matrix, diag() returns its principal diagonal.
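For instance, a 2x2 matrix filled column-by-column with 1:4 has diagonal elements 1 and 4, matching the output below (the matrix name is an assumption):

m <- matrix(1:4, nrow=2)
diag(m)    # extracts the principal diagonal: 1 4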
## [1] 1 4
When the object is a scalar, diag(k) returns a \(k\times k\) identity matrix.
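For example, diag(4) yields the \(4\times 4\) identity matrix shown below.

diag(4)    # 4x4 identity matrix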
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
The functions cbind() and rbind() are also useful for building matrices from vectors by column or row concatenation.
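A hypothetical reconstruction of the calls behind the outputs below, reusing the random 5x4 matrix m3 from the earlier sketch:

m4 <- cbind(m3, c1=1:5)    # append a named column c1
m4
m5 <- rbind(m3, r1=1:4)    # append a named row r1
m5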
## c1
## [1,] 0.6853547 -0.2235185 1.67162225 1.3844034 1
## [2,] 1.0129383 -0.8752916 0.25179456 -1.9376570 2
## [3,] 1.1909151 -0.3486104 0.04481644 0.3068674 3
## [4,] -1.4903572 1.9290822 -1.71315143 -0.2505063 4
## [5,] 1.5944445 -0.2161392 1.03280995 -1.6371302 5
## [,1] [,2] [,3] [,4]
## 0.6853547 -0.2235185 1.67162225 1.3844034
## 1.0129383 -0.8752916 0.25179456 -1.9376570
## 1.1909151 -0.3486104 0.04481644 0.3068674
## -1.4903572 1.9290822 -1.71315143 -0.2505063
## 1.5944445 -0.2161392 1.03280995 -1.6371302
## r1 1.0000000 2.0000000 3.00000000 4.0000000
Note that m5 has a row name r1 in its last row. We can remove row/column names by setting them to NULL.
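A one-line sketch that strips the row names and yields the unnamed matrix below:

rownames(m5) <- NULL
m5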
## [,1] [,2] [,3] [,4]
## [1,] 0.6853547 -0.2235185 1.67162225 1.3844034
## [2,] 1.0129383 -0.8752916 0.25179456 -1.9376570
## [3,] 1.1909151 -0.3486104 0.04481644 0.3068674
## [4,] -1.4903572 1.9290822 -1.71315143 -0.2505063
## [5,] 1.5944445 -0.2161392 1.03280995 -1.6371302
## [6,] 1.0000000 2.0000000 3.00000000 4.0000000
Each element in a matrix has a location indexed by the corresponding row and column. A[i, j] stores the element in the i-th row and j-th column of the matrix A. We can also access specific rows or columns using matrix subscripts.
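A sketch reproducing the indexing outputs below (it assumes the matrix A is filled column-by-column with 1:12):

A <- matrix(1:12, nrow=3)   # a 3x4 matrix filled column-by-column
A
A[1, 2]      # element in row 1, column 2
A[1, ]       # the entire first row
A[, 2]       # the entire second column
A[, 2:3]     # columns 2 and 3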
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
## [1] 4
## [1] 1 4 7 10
## [1] 4 5 6
## [,1] [,2]
## [1,] 4 7
## [2,] 5 8
## [3,] 6 9
The ordinary scalar operations addition, subtraction, multiplication and division can be generalized as matrix operations.
Matrix addition and subtraction require matrices of the same dimensions. The sum or difference of two matrices is a matrix whose elements are the scalar sums or differences, respectively, of the values in the corresponding positions of the two matrices.
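A possible reconstruction of the code behind the outputs below (the names m6 and m7 are assumptions):

m6 <- matrix(1:6, nrow=2)   # a 2x3 matrix
m7 <- m6 + 1                # add the scalar 1 to every element
m6
m7
m6 + m7                     # element-wise sum
m7 - m6                     # element-wise difference
m7 - 1                      # subtract a scalar from every element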
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 3 5 7
## [,1] [,2] [,3]
## [1,] 3 7 11
## [2,] 5 9 13
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Element-wise matrix multiplication is valid for matrices of the same size. However, matrix multiplication is different from component-wise scalar multiplication, and requires a special match between the dimensions of the multiplied matrices, \(P_{m\times k}=L_{m\times n} \cdot R_{n\times k}\). That is, the number of columns in the left matrix, \(L\), must equal the number of rows of the right matrix, \(R\). Then, the \(row\times column\) matrix multiplication rule yields a product matrix of dimensions corresponding to the number of rows (\(m\)) of \(L\) and the number of columns (\(k\)) of \(R\), i.e., \(P_{m\times k}\).
Element-wise matrix multiplication (\(*\)) involves scalar products of the elements in the same positions.
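For the matrices m6 and m7 defined in the sketch above, the output below corresponds to:

m6 * m7    # element-wise (Hadamard) product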
## [,1] [,2] [,3]
## [1,] 2 12 30
## [2,] 6 20 42
Matrix product (\(\%*\%\)) generates an output matrix having the same number of rows as the left matrix and the same number of columns as the right matrix.
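A hypothetical reconstruction of the matrices behind the outputs below (the names m8 and m9 follow the notation used in the text):

m8 <- matrix(2:7, nrow=2)   # a 2x3 matrix
dim(m8)
m9 <- matrix(3:8, nrow=3)   # a 3x2 matrix
m9
dim(m9)
m8 %*% m9                   # the 2x2 matrix product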
## [1] 2 3
## [,1] [,2]
## [1,] 3 6
## [2,] 4 7
## [3,] 5 8
## [1] 3 2
## [,1] [,2]
## [1,] 52 88
## [2,] 64 109
The product of the two matrices, \(m8_{2\times 3}\) and \(m9_{3\times 2}\), is another matrix \(M_{2\times 2}\) of dimensions \(2\times 2\).
One way to multiply two vectors \(u\) and \(v\) is the outer product. Using matrix multiplication, their outer product is \(u\ \underbrace{\% * \%}_{product}\ \overbrace{t(v)}^{transpose}\), or mathematically \(uv^t\). In R, the vector outer product operator is %o%, which generates second-order tensors (matrices).
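A sketch that reproduces the two identical 5x5 outer-product matrices shown below (the vectors u and v are assumptions):

u <- 1:5
v <- 4:8
u %*% t(v)   # outer product via matrix multiplication
u %o% v      # the same result using the outer-product operator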
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 5 6 7 8
## [2,] 8 10 12 14 16
## [3,] 12 15 18 21 24
## [4,] 16 20 24 28 32
## [5,] 20 25 30 35 40
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 5 6 7 8
## [2,] 8 10 12 14 16
## [3,] 12 15 18 21 24
## [4,] 16 20 24 28 32
## [5,] 20 25 30 35 40
What are the differences between \(u \%*\% v\), \(u \%*\% t(v)\), \(u * t(v)\), and \(u * v\)?
Element-wise matrix (or scalar) division is well defined for matrices of the same dimensions.
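Reusing the matrices m6 and m8 from the sketches above, the outputs below correspond to:

m8 / m6    # element-wise division of two same-size matrices
m8 / 2     # divide every element by a scalar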
## [,1] [,2] [,3]
## [1,] 2.0 1.333333 1.200000
## [2,] 1.5 1.250000 1.166667
## [,1] [,2] [,3]
## [1,] 1.0 2.0 3.0
## [2,] 1.5 2.5 3.5
However, matrix inversion is different. Recall that the transpose of a matrix is a matrix with swapped columns and rows. In R, matrix transposition is done by the function t().
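For the matrix m8 defined above:

m8        # the original 2x3 matrix
t(m8)     # its 3x2 transpose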
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 3 5 7
## [,1] [,2]
## [1,] 2 3
## [2,] 4 5
## [3,] 6 7
Notice that the \([1, 2]\) element of m8, m8[1, 2], is the same as the \([2, 1]\) element of its transpose, t(m8)[2, 1].
The right inverse of a matrix, (\(A^{-1}_{m\times n}\)), is a special matrix with the property that multiplying the original matrix (\(A_{n\times m}\)) on the right by this inverse (\(A^{-1}\)) yields the identity matrix, which has \(1\)’s on the main diagonal and \(0\)’s off the diagonal. That is,
\[A_{n\times m}\ A^{-1}_{m\times n} = I_{n\times n}.\] Similarly, a left matrix inverse is defined as a matrix, (\(A^{-1}_{m\times n}\)), with the property that multiplying the original matrix (\(A_{n\times m}\)) on the left by this inverse (\(A^{-1}\)) yields the identity matrix.
\[ A^{-1}_{m\times n}\ A_{n\times m} = I_{m\times m}.\]
A (two-sided) matrix inverse, where the left and right inverses coincide, is only defined for square matrices, \(m=n\).
Given four numbers \(a, b, c, d\) satisfying \(ad-bc \not= 0\), the following \(2\times 2\) matrix
\[A_{2\times 2}=\left(\begin{array}{cc} a & b \\ c & d \end{array}\right) ,\]
has an inverse matrix given by
\[A^{-1}_{2\times 2} =\frac{1}{ad-bc}\left(\begin{array}{cc} d & -b \\ -c & a \end{array}\right) .\]
It’s easy to validate that \(A_{2\times 2} A^{-1}_{2\times 2} =I_{2\times 2}\).
In higher dimensions, Cramer's rule may be used to compute the matrix inverse. Matrix inversion is available in R via the solve() function.
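A sketch reproducing the inversion outputs below (the matrix name m10 is an assumption):

m10 <- matrix(1:4, nrow=2)
m10
solve(m10)            # the matrix inverse
m10 %*% solve(m10)    # recovers the 2x2 identity matrix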
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [,1] [,2]
## [1,] -2 1.5
## [2,] 1 -0.5
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
Note that not all matrices are invertible, only special ones. These matrices must be square (have the same number of rows and columns) and non-singular.
Another function that can help us get the inverse of a matrix is ginv() in the MASS package. This function gives us the Moore-Penrose generalized inverse of a matrix.
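Assuming the 2x2 matrix m10 from the earlier sketch, the loading message and output below correspond to:

require(MASS)   # triggers the package-loading message shown below
ginv(m10)       # Moore-Penrose generalized inverse (equals solve(m10) for an invertible matrix)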
## Loading required package: MASS
## [,1] [,2]
## [1,] -2 1.5
## [2,] 1 -0.5
In addition, the function solve() can be used to solve matrix equations. For instance, solve(A, b) returns a vector \(x\) satisfying the equation \(b = Ax\), i.e., \(x= A^{-1}b\).
## [1] 0.5 0.5 0.5 0.5
The following table summarizes some of the basic matrix operation functions.
Expression | Explanation |
---|---|
t(x) | transpose |
diag(x) | diagonal |
%*% | matrix multiplication |
solve(a, b) | solves a %*% x = b for x |
solve(a) | matrix inverse of a |
rowsum(x) | sum of rows for a matrix-like object; rowSums(x) is a faster version |
colSums(x) | column sums (the column analog of rowSums(x)) |
rowMeans(x) | fast version of row means |
colMeans(x) | column means |
mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
mat1.inv <- solve(mat1)
mat1.identity <- mat1.inv %*% mat1
mat1.identity
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
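The solution vector printed below was presumably produced by solving mat1 %*% x = b for some right-hand side; a right-hand side b1 = c(1, 2) reproduces it exactly:

b1 <- c(1, 2)
solve(mat1, b1)   # solves mat1 %*% x = b1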
## [1] 1.785714 2.357143
Let's look at the basics of matrix notation and matrix algebra. The product \(AB\) between matrices \(A\) and \(B\) is defined only if the number of columns in \(A\) equals the number of rows in \(B\). That is, we can multiply an \(m\times n\) matrix \(A\) by an \(n\times k\) matrix \(B\), and the result will be an \((AB)_{m\times k}\) matrix. Each element of the product matrix, \((AB)_{i, j}\), represents the product of the \(i\)-th row of \(A\) and the \(j\)-th column of \(B\), which are both of length \(n\). Matrix multiplication is row-by-column.
Linear algebra notation simplifies the mathematical descriptions and manipulations of linear models, as well as coding in R.
The main point is to show how we can write linear models using matrix notation. Later, we'll explain how this is useful for solving least squares problems.
Linear algebra notation enables the mathematical analysis and derivation of solutions of systems of linear equations and provides generic machinery for solving linear problems. For example, consider the following system of three equations in three unknowns, \(a\), \(b\), and \(c\).
\[\begin{align*} a + b + 2c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 3 \end{align*} \]
\[\underbrace{\begin{pmatrix} 1&1&2\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}}_{\text{A}} \underbrace{\begin{pmatrix} a\\ b\\ c \end{pmatrix}}_{\text{x}} = \underbrace{\begin{pmatrix} 6\\ 2\\ 3 \end{pmatrix}}_{\text{b}}\]
That is, \(Ax = b\), which implies that:
\[\begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1&1&2\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 3 \end{pmatrix}\]
In other words, \(A^{-1}A x = x = A^{-1}b\).
Notice that this approach parallels the strategy for solving of simple (univariate) linear equations like: \[\underbrace{2}_{\text{(design matrix) A }} \overbrace{x}^{\text{unknown x }} \underbrace{-3}_{\text{simple constant term}} = \overbrace{5}^{\text{b}}.\]
The constant term, \(-3\), can be moved to, and absorbed into, the right-hand side, \(b\), to form a new term \(b'=5+3=8\). Thus, such shift factors can be absorbed into the right-hand side of linear equations, which simplifies the linear matrix equation to: \[\underbrace{2}_{\text{(design matrix) A }} \overbrace{x}^{\text{unknown x }} = \underbrace{5+3}_{b'}=\overbrace{8}^{b'}.\]
This (simple) linear equation is solved by multiplying both sides by the inverse (reciprocal) of the multiplier of \(x\), namely \(2\).
\[\frac{1}{2} 2 x = \frac{1}{2} 8.\] Thus, the unique solution is: \[x = \frac{1}{2} 8=4.\]
So, let's use exactly the same strategy to solve the corresponding matrix equation (linear equation, \(Ax = b\)) using R, where the unknown is \(x\), and the design matrix \(A\) and the constant vector \(b\) are known.
\[\underbrace{\begin{pmatrix} 1&1&2\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}}_{\text{A}} \underbrace{\begin{pmatrix} a\\ b\\ c \end{pmatrix}}_{\text{x}} = \underbrace{\begin{pmatrix} 6\\ 2\\ 3 \end{pmatrix}}_{\text{b}}.\]
A_matrix_values <- c(1, 1, 2, 3, -2, 1, 2, 1, -1)
# matrix elements are arranged by columns, so, we need to transpose them to arrange them by rows.
A <- t(matrix(A_matrix_values, nrow=3, ncol=3))
b <- c(6, 2, 3)
# to solve Ax = b, x=A^{-1}*b
x <- solve (A, b)
# Ax = b ==> x = A^{-1} * b
x
## [1] 1.35 1.75 1.45
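The zero output below presumably verifies the solution; a sketch of that check:

A %*% x - b   # should be the zero vector if x solves Ax = b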
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
What if we want to triple-check the consistency of the solve() method in providing accurate solutions to matrix-based systems of linear equations?
We can generate the solution (\(x\)) to the equation \(Ax=b\) by using first principles. \[ x = A^{-1}b.\]
A.inverse <- solve(A) # the inverse matrix A^{-1}
x1 <- A.inverse %*% b
# check if X and x1 are the same
x; x1
## [1] 1.35 1.75 1.45
## [,1]
## [1,] 1.35
## [2,] 1.75
## [3,] 1.45
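Again, the zero output below presumably confirms this alternative solution; a sketch of the check:

A %*% x1 - b   # should also vanish (up to numerical round-off)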
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
The identity matrix is the matrix analog to the multiplicative numeric identity, the number \(1\). Multiplying the identity matrix by any other matrix (\(B\)) does not change the matrix \(B\). This property requires that the multiplicative identity matrix must look like this.
\[\mathbf{I} = \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ \vdots &\vdots & \vdots & \ddots&\vdots&\vdots\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix}.\]
The identity matrix is always a square matrix with diagonal elements \(1\) and \(0\) at the off-diagonal elements.
Following the above matrix multiplication rules, we can see that:
\[\mathbf{X\times I} = \begin{pmatrix} x_{1, 1} & \dots & x_{1, p}\\ & \vdots & \\ x_{n, 1} & \dots & x_{n, p} \end{pmatrix} \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ & & &\vdots & &\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix}=\]
\[\begin{pmatrix} x_{1, 1} & \dots & x_{1, p}\\ & \vdots & \\ x_{n, 1} & \dots & x_{n, p} \end{pmatrix}= \mathbf{X}.\]
In R, we can express the identity matrix as follows.
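A sketch reproducing the outputs below (the name I3 is an assumption; A is the 3x3 design matrix defined earlier):

I3 <- diag(3)   # the 3x3 identity matrix
I3
A
I3 %*% A        # multiplying by the identity leaves A unchanged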
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## [,1] [,2] [,3]
## [1,] 1 1 2
## [2,] 3 -2 1
## [3,] 2 1 -1
## [,1] [,2] [,3]
## [1,] 1 1 2
## [2,] 3 -2 1
## [3,] 2 1 -1
Let's look at this notation more deeply using the baseball players data containing three quantitative variables, Height, Weight, and Age. Suppose the variable Weight is considered as a random response (outcome vector) denoted by \(Y_1, Y_2, \dots, Y_n\). We can express each player's Weight as a function of Age and Height.
# Data: https://umich.instructure.com/courses/38100/files/folder/data (01a_data.txt)
data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1', as.is=T, header=T)
attach(data)
head(data)
## Name Team Position Height Weight Age
## 1 Adam_Donachie BAL Catcher 74 180 22.99
## 2 Paul_Bako BAL Catcher 74 215 34.69
## 3 Ramon_Hernandez BAL Catcher 72 210 30.78
## 4 Kevin_Millar BAL First_Baseman 72 210 35.43
## 5 Chris_Gomez BAL First_Baseman 73 188 35.71
## 6 Brian_Roberts BAL Second_Baseman 69 176 29.39
In matrix form, we can express the outcome using one symbol, \(\mathbf{Y}\). We usually use bold face to distinguish between scalars, vectors, matrices and tensors.
\[\mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix}.\]
In R, the default representation of vector data is as columns, i.e., our outcome vector has dimension \(n\times 1\), as opposed to \(1 \times n\) used for row vectors (e.g., \(\mathbf{Y}^t\)).
Similarly, we can use matrix notation to represent the covariates, or predictors, Age and Height. In a case with two predictors, we can represent them like this:
\[\mathbf{X}_1 = \begin{pmatrix} x_{1, 1}\\ \vdots\\ x_{n, 1} \end{pmatrix} \mbox{ and } \mathbf{X}_2 = \begin{pmatrix} x_{1, 2}\\ \vdots\\ x_{n, 2} \end{pmatrix}.\]
In the case of the baseball players study, \(x_{1, 1}= Age_1\) and \(x_{i, 1}=Age_i\), with \(Age_i\) representing the Age of the \(i\)-th player; similarly, \(x_{i, 2}= Height_i\) is the height of the \(i\)-th player. These vectors can be thought of as \(n\times 1\) matrices. For instance, it is convenient to represent the covariates as design matrices.
\[\mathbf{X} = [ \mathbf{X}_1 \ \mathbf{X}_2 ] = \begin{pmatrix} x_{1, 1}&x_{1, 2}\\ \vdots & \vdots\\ x_{n, 1}&x_{n, 2} \end{pmatrix}. \]
This design matrix has dimension \(n \times 2\).
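Since the data frame was attached earlier, the Age and Height columns can be bound directly into a design matrix; a sketch reproducing the output below:

X <- cbind(Age, Height)   # n x 2 design matrix
head(X)
dim(X)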
## Age Height
## [1,] 22.99 74
## [2,] 34.69 74
## [3,] 30.78 72
## [4,] 35.43 72
## [5,] 35.71 73
## [6,] 29.39 69
## [1] 1034 2
We can also use this notation to denote an arbitrary number (\(k\)) of covariates with the following \(n\times k\) matrix.
\[\mathbf{X} = \begin{pmatrix} x_{1, 1}&\dots & x_{1, k} \\ x_{2, 1}&\dots & x_{2, k} \\ & \vdots & \\ x_{n, 1}&\dots & x_{n, k} \end{pmatrix}. \]
You can simulate such a design matrix in R using matrix() instead of cbind().
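A sketch consistent with the output below, filling a 1034 x 5 matrix with consecutive integers:

X <- matrix(1:(1034*5), nrow=1034, ncol=5)   # filled column-by-column by default
head(X)
dim(X)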
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1035 2069 3103 4137
## [2,] 2 1036 2070 3104 4138
## [3,] 3 1037 2071 3105 4139
## [4,] 4 1038 2072 3106 4140
## [5,] 5 1039 2073 3107 4141
## [6,] 6 1040 2074 3108 4142
## [1] 1034 5
By default, matrices are filled in column-by-column order; however, using the byrow=TRUE argument allows us to change the order to row-by-row.
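For example:

X <- matrix(1:(1034*5), nrow=1034, ncol=5, byrow=TRUE)   # fill row-by-row
head(X)
dim(X)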
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
## [5,] 21 22 23 24 25
## [6,] 26 27 28 29 30
## [1] 1034 5
Scalars are single numerical values, in contrast to their higher-dimensional counterparts, vectors, matrices, and tensors, which are usually denoted by bold characters.
To compute the sample average and variance of a dataset, we use the formulas: \[\bar{Y}=\frac{1}{n} \sum_{i=1}^n {Y_i}\]
and
\[\mbox{var}(Y)=\frac{1}{n-1} \sum_{i=1}^n {(Y_i - \bar{Y})}^2, \] which can be represented as matrix multiplications.
Define an \(n \times 1\) matrix made of \(1\)’s.
\[A=\begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix}.\]
This implies that.
\[\frac{1}{n} \mathbf{A}^\top Y = \frac{1}{n} \begin{pmatrix}1&1& \dots&1\end{pmatrix} \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix}= \frac{1}{n} \sum_{i=1}^n {Y_i}= \bar{Y}.\]
Recall that we multiply matrices and scalars, like \(\frac{1}{n}\), by *, whereas we multiply matrices using the matrix product operator, %*%.
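A sketch reproducing the three outputs below; the numeric value (about 73.7) suggests the players' Height was used as the outcome vector here, and the names y, n, and A follow the notation in the text:

y <- data$Height                    # outcome vector (players' heights)
n <- length(y)
A <- matrix(1, nrow=n, ncol=1)      # n x 1 column of 1's
mean(y)                             # built-in sample mean
(1/n) * t(A) %*% y                  # matrix-algebra version (a 1 x 1 matrix)
sum(y)/n                            # direct scalar computation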
## [1] 73.69729
## [,1]
## [1,] 73.69729
## [1] 73.69729
Multiplying the transpose of a matrix with another matrix is very common in linear modeling and statistical computing, so there is a dedicated function in R, crossprod().
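Using the n x 1 matrix of 1's, A, from the sketch above:

crossprod(A, y)/n    # computes t(A) %*% y, scaled by 1/n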
## [,1]
## [1,] 73.69729
There is a similar matrix algebra for computing the variance.
\[\mathbf{Y'}\equiv \begin{pmatrix} Y_1 - \bar{Y}\\ \vdots\\ Y_n - \bar{Y} \end{pmatrix}, \, \, \frac{1}{n-1} \mathbf{Y'}^\top\mathbf{Y'} = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar{Y})^2. \]
A crossprod() call with only one matrix argument computes \(Y^\top Y\).
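A sketch matching the variance outputs below (Yc denotes the centered outcome \(Y'\)):

Yc <- y - mean(y)          # centered outcome, Y'
crossprod(Yc)/(n - 1)      # matrix-algebra sample variance
var(y)                     # built-in sample variance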
## [,1]
## [1,] 5.316798
## [1] 5.316798
Let’s use the following pair of matrices.
\[\overbrace{\mathbf{Y}}^{outcome} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix} , \underbrace{\mathbf{X}}_{design} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_n \end{pmatrix} , \overbrace{\mathbf{\beta}}^{effects} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \underbrace{\mathbf{\varepsilon}}_{error} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{pmatrix}.\]
Then, we can express the problem as a linear model.
\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1, \dots, n\]
We can also express the complete problem formulation into a corresponding succinct matrix notation formula.
\[\begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_n \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{pmatrix}. \]
And in its matrix form.
\[\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\]
which is a simpler way to write the same model equation.
One way to obtain an optimal solution is by minimizing the residuals (\(\epsilon_i\)); small residuals indicate a good model fit. The least squares (LS) solution represents one way to solve this matrix equation (\(Y=X\beta+\epsilon\)). The LS solution is obtained by minimizing the residual sum of squares \[\langle\epsilon, \epsilon \rangle = (Y-X\beta)^t (Y-X\beta).\]
Let’s define the LS objective function using the cross-product notation.
\[f(\beta) = (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^t (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}). \]
The effect-size estimates, \({\hat {\beta}}\), are obtained by minimizing this expression. Of course, we can derive an analytic solution using calculus to find the minimum of the cost (objective) function, \(f(\beta)\).
There are a number of rules that help with solving partial derivative equations in matrix form. Recall that the critical points of an objective function are either on the domain boundary or at values where the derivative of the objective function vanishes, \(f'(x)=0\). Hence, solving for the unknown parameter \(\beta\) requires identifying the critical points, \({\hat {\beta}}\), which represent the candidate solution(s). Setting the derivative of the above expression to zero yields
\[2 \mathbf{X}^t (\mathbf{Y} - \mathbf{X} \boldsymbol{\hat{\beta}})=0,\]
\[\mathbf{X}^t \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^t \mathbf{Y},\]
\[\boldsymbol{\hat{\beta}} = (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{Y}.\]
This estimate \({\hat {\beta}}\) represents the desired LS solution to the linear modeling problem. The hat notation, \({\hat {\cdot}}\), is used to denote estimates. For instance, the solution for the unknown \(\beta\) parameters is denoted by the (data-driven) estimate \(\hat{\beta}\).
The least squares minimization works because minimizing a function corresponds to finding the roots of its (first) derivative. With ordinary least squares (OLS), we square the residuals.
\[(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^t (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}).\]
Notice that the minimum of \(f(x)\) and \(f^2(x)\) are achieved at the same roots of \(f'(x)\), as the derivative of \(f^2(x)\) is \(\frac{d}{dx}f^2(x) = 2f(x)f'(x)\).
Here is how we obtain the least squares estimate in R.
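The package-attachment messages below presumably come from loading the plotting library; plotly depends on ggplot2, which triggers the cascade of masking notes.

library(plotly)   # loads ggplot2 as a dependency and masks a few functions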
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'data':
##
## Position
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#x=cbind(data$Height, data$Age)
x=data$Height
y=data$Weight
X <- cbind(1, x)
beta_hat <- solve( t(X) %*% X ) %*% t(X) %*% y
###or
beta_hat <- solve( crossprod(X) ) %*% crossprod( X, y )
Now we can see the results of this by computing the estimated \(\hat{\beta}_0+\hat{\beta}_1 x\) (fitted model prediction) corresponding to any covariate input value of \(x\).
# newx <- seq(min(x), max(x), len=100)
X <- cbind(1, x)
fitted <- X%*%beta_hat
# or directly: fitted <- lm(y ~ x)$fitted
# plot(x, y, xlab="MLB Player's Height", ylab="Player's Weight")
# lines(x, fitted, col=2)
plot_ly(x = ~x) %>%
add_markers(y = ~y, name="Data Scatter") %>%
add_lines(x = ~x, y = ~fitted[,1], name="(Manual) Linear Model (Weight ~ Height)") %>%
add_lines(x = ~x, y = ~lm(y ~ x)$fitted, name="(Direct) lm(Weight ~ Height)",
line = list(width = 4, dash = 'dash')) %>%
layout(title='Baseball Players: Linear Model of Weight vs. Height',
xaxis = list(title="Height (in)"), yaxis = list(title="Weight (lb)"),
legend = list(orientation = 'h'))
The closed-form analytical expression for the LS estimate \[\hat{\boldsymbol{\beta}}=(\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{Y}\] is one of the most widely used results in data analysis. One of the advantages of this approach is that we can use it in many different situations.
In R, there is a very convenient function, lm(), that fits these linear models. We will learn more about this function later, but here is a demonstration that it agrees with the simple manual LS estimation approach we showed above.
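A short sketch of that comparison (x, y, and beta_hat are the objects defined in the code above):

lm(y ~ x)$coefficients   # intercept and slope estimated by lm()
beta_hat                 # the manual LS estimates computed above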
Starting in the 18th century, with Euler's work on rotational motion and later Lagrange's study of inertia matrices, the notions of principal axes (eigenvectors) and characteristic roots (eigenvalues) arose. However, it took close to 200 years until Hilbert and others working on integral operators settled on using the terminology eigen, "own", to denote eigenvalues (proper characteristic values) and eigenvectors (principal axes).
The eigen-spectrum (eigenspace) decomposition of linear operators (matrices) into eigenvalues and eigenvectors enables us to understand linear transformations and characterize their properties. The eigenvectors represent the “axes” (directions) along which a linear transformation acts by stretching, compressing, or flipping.
The eigenvalues represent the amounts of stretching or compression of the linear transformation along the corresponding eigenvector directions. In higher dimensions, there are more directions along which we need to understand the behavior of the linear transformation. The eigen-spectrum makes it easier to understand the linear transformation, especially when many (or all) of the eigenvectors are linearly independent (orthogonal).
For a given matrix \(A\), if we have \(A\vec{v}=\lambda \vec{v}\), then we say that a nonzero vector \(\vec{v}\) is a right eigenvector of the matrix \(A\) and the scale factor \(\lambda\) is the eigenvalue corresponding to that eigenvector.
With some calculations, we can show that \(A\vec{v}=\lambda \vec{v}\) is the same as \((\lambda I_n-A)\vec{v}=\vec{0}\), where \(I_n\) is the \(n\times n\) identity matrix. So, when we solve this equation, we get the corresponding eigenvalues and eigenvectors. As this is a very common operation, we don't need to do that by hand - the function eigen() provides this functionality.
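A sketch consistent with the outputs below; the matrix being decomposed appears to be the 2x2 identity (here stored under the assumed name m11). Because its eigenvalue 1 is repeated, the particular eigenvectors returned by eigen() are not unique and may differ across platforms:

m11 <- diag(2)   # the 2x2 matrix shown below
m11
eigen(m11)       # eigenvalues and eigenvectors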
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
## eigen() decomposition
## $values
## [1] 1 1
##
## $vectors
## [,1] [,2]
## [1,] 0 -1
## [2,] 1 0
We can easily validate that \((\lambda I_n-A)\vec{v}=\vec{0}\).
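A possible form of that validation, using the assumed m11 and its eigen-decomposition:

e <- eigen(m11)
(e$values[1] * diag(2) - m11) %*% e$vectors   # (lambda*I - A) v, a matrix of zeros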
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
As we mentioned earlier, diag(n) creates an \(n\times n\) identity matrix. Thus, diag(2) is the \(I_2\) matrix in the above equation. The zero output matrix proves that the equation \((\lambda I_n-A)\vec{v}=\vec{0}\) holds true.
Many interesting applications of the eigen-spectrum are shown here.
Other important functions about matrix operation are listed in the following table.
Functions | Math expression or explanation |
---|---|
crossprod(A, B) | \(A^T B\), where \(A\) and \(B\) are matrices |
y <- svd(A) | the output has the following components |
- y$d | vector containing the singular values of \(A\) |
- y$u | matrix whose columns contain the left singular vectors of \(A\) |
- y$v | matrix whose columns contain the right singular vectors of \(A\) |
k <- qr(A) | the output has the following components |
- k$qr | an upper triangle that contains the decomposition and a lower triangle that contains information on the \(Q\) decomposition |
- k$rank | the rank of \(A\) |
- k$qraux | a vector containing additional information on \(Q\) |
- k$pivot | information on the pivoting strategy used |
rowMeans(A)/colMeans(A) | returns a vector of row/column means |
rowSums(A)/colSums(A) | returns a vector of row/column sums |
Some flexible matrix operations can help us save time calculating row or column averages. For example, column averages can be calculated by the following matrix operation.
\[AX = \left(\begin{array}{cccc} \frac{1}{N}&\frac{1}{N}&\cdots&\frac{1}{N} \end{array}\right) \left(\begin{array}{ccc} X_{1, 1}&\cdots&X_{1, p}\\ X_{2, 1}&\cdots&X_{2, p}\\ \vdots&\vdots&\vdots\\ X_{N, 1}&\cdots&X_{N, p} \end{array}\right)= \left(\begin{array}{cccc} \bar{X}_1&\bar{X}_2&\cdots &\bar{X}_p \end{array}\right).\]
The row averages can be calculated similarly.
\[XB = \left(\begin{array}{ccc} X_{1, 1}&\cdots &X_{1, p}\\ X_{2, 1}&\cdots &X_{2, p}\\ \vdots &\vdots &\vdots \\ X_{N, 1}&\cdots&X_{N, p} \end{array}\right) \left(\begin{array}{c} \frac{1}{p}\\ \frac{1}{p}\\ \vdots\\ \frac{1}{p} \end{array}\right)= \left(\begin{array}{c} \bar{X}_1\\ \bar{X}_2\\ \vdots\\ \bar{X}_N \end{array}\right). \]
Expeditious matrix calculations can be done by multiplying a matrix on the left or on the right by another matrix or vector. In general, multiplying a matrix on the left by a row vector amounts to weighted averaging of its rows.
\[AX = \left(\begin{array}{cccc} a_1&a_2&\cdots &a_N \end{array}\right) \left(\begin{array}{ccc} X_{1, 1}&\cdots &X_{1, p}\\ X_{2, 1}&\cdots &X_{2, p}\\ \vdots&\vdots &\vdots \\ X_{N, 1}&\cdots &X_{N, p} \end{array}\right)= \left(\begin{array}{cccc} \sum_{i=1}^N a_i X_{i, 1}&\sum_{i=1}^N a_i X_{i, 2}&\cdots &\sum_{i=1}^N a_i X_{i, p} \end{array}\right). \]
Now let's try this matrix notation to look at gene expression data including \(8,793\) different genes for \(208\) subjects. These gene expression data represent a microarray experiment - GSE5859 - comparing gene expression profiles from lymphoblastoid cells. Specifically, the data compare the expression levels of genes in lymphoblasts from individuals in three HapMap populations {CEU, CHB, JPT}. The study found that the mean expression levels between the {CEU} and {CHB+JPT} samples were significantly different (\(p < 0.05\)) for more than a thousand genes.
The gene expression profiles data has two components:
gene <- read.csv("https://umich.instructure.com/files/2001417/download?download_frd=1",
header = T) # exprs_GSE5859.csv
info<-read.csv("https://umich.instructure.com/files/2001418/download?download_frd=1",
header=T) # exprs_MetaData_GSE5859.csv
Recall the lapply() function that we discussed in Chapter 2; together with sapply(), it can be used to calculate column and row averages. Let's compare the outputs of sapply() and the corresponding matrix algebra process.
colmeans <- sapply(gene[, -1], mean)
gene1 <- as.matrix(gene[, -1])
# can also use built in functions
# colMeans <- colMeans(gene1)
colmeans.matrix <- crossprod(rep(1/nrow(gene1), nrow(gene1)), gene1)
colmeans[1:15]
## GSM25581.CEL.gz GSM25681.CEL.gz GSM136524.CEL.gz GSM136707.CEL.gz
## 5.703998 5.721779 5.726300 5.743632
## GSM25553.CEL.gz GSM136676.CEL.gz GSM136711.CEL.gz GSM136542.CEL.gz
## 5.835499 5.742565 5.751601 5.732211
## GSM136535.CEL.gz GSM25399.CEL.gz GSM25552.CEL.gz GSM25542.CEL.gz
## 5.741741 5.618825 5.805147 5.733117
## GSM136544.CEL.gz GSM25662.CEL.gz GSM136563.CEL.gz
## 5.733175 5.716855 5.750600
## [1] 5.703998 5.721779 5.726300 5.743632 5.835499 5.742565 5.751601 5.732211
## [9] 5.741741 5.618825 5.805147 5.733117 5.733175 5.716855 5.750600
The same outputs are generated by both protocols. Note that we used rep(1/nrow(gene1), nrow(gene1)) to create the vector \[\left(\begin{array}{cccc} \frac{1}{N}&\frac{1}{N}&\cdots &\frac{1}{N} \end{array}\right) \] needed to manually obtain the column averages via matrix algebra. We can also examine the distribution of these column means.
colmeans <- as.matrix(colmeans)
h <- hist(colmeans, plot=F)
plot_ly(x = h$mids, y = h$counts, type = "bar", name = "Column Averages") %>%
layout(title='Average Gene Expression Histogram',
xaxis = list(title = "Column Means"),
yaxis = list(title = "Average Expression", side = "left"),
legend = list(orientation = 'h'))
The histogram shows that the distribution is symmetric, unimodal and bell-shaped, i.e., roughly normal.
We can also solve harder problems using matrix algebra. For example, let’s calculate the differences between genders for each gene. First, we need to get the gender information for each subject.
Then, we have to reorder the columns of the gender information to make them consistent with the feature matrix gene1 (a hypothetical sketch of these steps is shown below).
Next, we design the matrix that we want to multiply with the feature matrix. The plan is to multiply the following two matrices.
\[\left(\begin{array}{cccc} X_{1, 1}&\cdots&X_{1, p}\\ X_{2, 1}&\cdots&X_{2, p}\\ \vdots & \vdots &\vdots \\ X_{N, 1}&\cdots&X_{N, p} \end{array}\right) \left(\begin{array}{cc} \frac{1}{p}&a_1\\ \frac{1}{p}&a_2\\ \vdots & \vdots \\ \frac{1}{p}&a_p \end{array}\right)= \left(\begin{array}{cc} \bar{X}_1&gender.diff_1\\ \bar{X}_2&gender.diff_2\\ \vdots & \vdots \\ \bar{X}_N&gender.diff_N \end{array}\right),\] where \(a_i=-\frac{1}{N_F}\) if the subject is female and \(a_i=\frac{1}{N_M}\) if the subject is male. Thus, we gave each female and male the same weight before the subtraction. We average each gender and get their difference. \(\bar{X}_i\) is the average across both genders and \(gender.diff_i\) represents the gender difference for the i-th gene.
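The code extracting and aligning the gender information is not displayed here. A hypothetical sketch follows; the metadata column names filename and sex are assumptions about the layout of info:

gender <- info[, c("filename", "sex")]                        # assumed column names in the metadata
gender <- gender[match(colnames(gene1), gender$filename), ]   # align with the columns of gene1
table(gender$sex)                                             # counts of F and M subjects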
##
## F M
## 86 122
gender$vector <- ifelse(gender$sex=="F", -1/86, 1/122)
vec1 <- as.matrix(data.frame(rowavg=rep(1/ncol(gene1), ncol(gene1)), gender.diff=gender$vector))
gender.matrix <- gene1 %*% vec1
gender.matrix[1:15, ]
## rowavg gender.diff
## [1,] 6.383263 -0.003209464
## [2,] 7.091630 -0.031320597
## [3,] 5.477032 0.064806978
## [4,] 7.584042 -0.001300152
## [5,] 3.197687 0.015265502
## [6,] 7.338204 0.078434938
## [7,] 4.232132 0.008437864
## [8,] 3.716460 0.018235650
## [9,] 2.810554 -0.038698101
## [10,] 5.208787 0.020219666
## [11,] 6.498989 0.025979654
## [12,] 5.292992 -0.029988980
## [13,] 7.069081 0.038575442
## [14,] 5.952406 0.030352616
## [15,] 7.247116 0.046020066
As we mentioned earlier, the formula for linear regression can be written as
\[Y_i=\beta_0+X_{i, 1}\beta_1+\cdots+X_{i, p}\beta_p +\epsilon_i, i=1, \cdots, N\] This formula can also be expressed in matrix form.
\[\left(\begin{array}{c} Y_1\\ Y_2\\ \vdots\\ Y_N \end{array}\right)= \left(\begin{array}{c} 1\\ 1\\ \vdots\\ 1 \end{array}\right)\beta_0+ \left(\begin{array}{c} X_{1, 1}\\ X_{2, 1}\\ \vdots\\ X_{N, 1} \end{array}\right)\beta_1+\cdots + \left(\begin{array}{c} X_{1, p}\\ X_{2, p}\\ \vdots \\ X_{N, p} \end{array}\right)\beta_p+ \left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots \\ \epsilon_N \end{array}\right), \]
which can be compressed into a simple matrix equation \(Y=X\beta +\epsilon\).
\[\left(\begin{array}{c} Y_1\\ Y_2\\ \vdots \\ Y_N \end{array}\right)= \left(\begin{array}{cccc} 1&X_{1, 1}&\cdots&X_{1, p}\\ 1&X_{2, 1}&\cdots&X_{2, p}\\ \vdots&\vdots&\vdots&\vdots\\ 1&X_{N, 1}&\cdots&X_{N, p} \end{array}\right) \left(\begin{array}{c} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{array}\right)+ \left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_N \end{array}\right). \] Since \(Y=X\beta +\epsilon\) implies \(X^tY \sim X^t(X\beta)=(X^tX)\beta\), the LS solution for \(\beta\) is obtained by multiplying both sides by the inverse of the square cross-product matrix, \((X^tX)^{-1}\): \[\hat{\beta}=(X^tX)^{-1}X^tY.\]
Matrix calculations are much faster, especially on specialized computer chips, than manually fitting a regression model. Let's apply this to the Lahman baseball data representing yearly stats and standings. Let's download it first via this link baseball.data and save it in the R working directory. We can use the load() function to import the local RData. For this example, we subset the dataset by G==162 and yearID < 2002. Also, we create a new feature named Singles that is equal to H (Hits by batters) - X2B (Doubles) - X3B (Triples) - HR (Home Runs by batters). Finally, we only pick four features: R (Runs scored), Singles, HR (Home Runs by batters), and BB (Walks by batters).
#If you downloaded the .RData locally first, then you can easily load it into the`R`workspace by:
# load("Teams.RData")
# Alternatively you can also download the data in CSV format from https://umich.instructure.com/courses/38100/files/folder/data (teamsData.csv)
Teams <- read.csv('https://umich.instructure.com/files/2798317/download?download_frd=1', header=T)
dat <- Teams[Teams$G==162&Teams$yearID<2002, ]
dat$Singles <- dat$H-dat$X2B-dat$X3B-dat$HR
dat <- dat[, c("R", "Singles", "HR", "BB")]
head(dat)
## R Singles HR BB
## 439 505 997 11 344
## 1367 683 989 90 580
## 1368 744 902 189 681
## 1378 652 948 156 516
## 1380 707 1017 92 620
## 1381 632 1020 126 504
In this example, let’s work with R as the response variable and BB as the independent variable. For a full linear model, we need to add another column of \(1\)’s to the design matrix \(X\).
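A sketch of the outcome and design matrix construction consistent with the output below (the names Y and X match the later code):

Y <- dat$R              # outcome: Runs scored
X <- cbind(1, dat$BB)   # design matrix: an intercept column of 1's and BB
X[1:10, ]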
## [,1] [,2]
## [1,] 1 344
## [2,] 1 580
## [3,] 1 681
## [4,] 1 516
## [5,] 1 620
## [6,] 1 504
## [7,] 1 498
## [8,] 1 502
## [9,] 1 493
## [10,] 1 556
We use the LS analytical formula to obtain the beta (effects) estimates
\[\hat{\beta}=(X^t X)^{-1}X^t Y.\]
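A sketch of the estimation step producing the output below:

beta <- solve(t(X) %*% X) %*% t(X) %*% Y   # LS estimates: intercept and BB slope
beta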
## [,1]
## [1,] 326.8241628
## [2,] 0.7126402
To confirm this manual calculation, we can refit the linear equation using the lm() function and compare the computational times. In this simple example, is there evidence of higher computational efficiency using matrix calculations?
fit <- lm(R~BB, data=dat)
# fit <- lm(R~., data=dat)
# '.' indicates all other variables, very useful when fitting models with many predictors
fit
##
## Call:
## lm(formula = R ~ BB, data = dat)
##
## Coefficients:
## (Intercept) BB
## 326.8242 0.7126
##
## Call:
## lm(formula = R ~ BB, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -187.788 -53.977 -2.995 55.649 258.614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 326.82416 22.44340 14.56 <2e-16 ***
## BB 0.71264 0.04157 17.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.95 on 661 degrees of freedom
## Multiple R-squared: 0.3078, Adjusted R-squared: 0.3068
## F-statistic: 294 on 1 and 661 DF, p-value: < 2.2e-16
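The timing outputs below were presumably generated by system.time() calls; a sketch:

system.time(solve(t(X) %*% X) %*% t(X) %*% Y)   # matrix-algebra estimator
system.time(lm(R ~ BB, data=dat))               # lm() fit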
## user system elapsed
## 0 0 0
## user system elapsed
## 0 0 0
For a better model, we can expand the covariates to include multiple predictors and compare the resulting estimates.
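A sketch reproducing the multivariate outputs below (the names beta2 and fit2 are assumptions):

X <- cbind(1, dat$BB, dat$Singles, dat$HR)                 # expanded design matrix
X[1:10, ]
system.time(beta2 <- solve(t(X) %*% X) %*% t(X) %*% Y)     # manual LS fit
system.time(fit2 <- lm(R ~ BB + Singles + HR, data=dat))   # lm() fit
coef(fit2)    # lm() coefficient estimates
t(beta2)      # manual estimates (transposed for display)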
## [,1] [,2] [,3] [,4]
## [1,] 1 344 997 11
## [2,] 1 580 989 90
## [3,] 1 681 902 189
## [4,] 1 516 948 156
## [5,] 1 620 1017 92
## [6,] 1 504 1020 126
## [7,] 1 498 1064 167
## [8,] 1 502 937 180
## [9,] 1 493 1048 105
## [10,] 1 556 1073 116
## user system elapsed
## 0 0 0
## user system elapsed
## 0 0 0
## (Intercept) BB Singles HR
## -401.0057242 0.3666606 0.6705535 1.7175775
## [,1] [,2] [,3] [,4]
## [1,] -401.0057 0.3666606 0.6705535 1.717577
A scatter plot would show visually the relationship between the outcome R and one of the predictors BB.
# plot(dat$BB, dat$R, xlab = "BB", ylab = "R", main = "Scatter plot/regression for baseball data")
# abline(beta1[1, 1], beta1[2, 1], lwd=4, col="red")
plot_ly(x = ~dat$BB) %>%
add_markers(y = ~dat$R, name="Data Scatter") %>%
add_lines(x = ~dat$BB, y = ~lm(dat$R ~ dat$BB)$fitted,
name="lm(Runs scored ~ Walks by batters)", line = list(width = 4)) %>%
layout(title='Scatter plot/regression for baseball data',
xaxis = list(title="(BB) Walks by batters"), yaxis = list(title="(R) Runs scored"),
legend = list(orientation = 'h'))
Here the colored model line represents our regression model estimated using matrix algebra.
The power of matrix algebra becomes more apparent when we use multiple variables. We can add the variable HR to the model.
library(reshape2)
X <- cbind(rep(1, n=nrow(dat)), dat$BB, dat$HR)
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
## [,1]
## [1,] 287.7226756
## [2,] 0.3897178
## [3,] 1.5220448
# #install.packages("scatterplot3d")
# library(scatterplot3d)
# myScatter3D <- scatterplot3d(dat$BB, dat$HR, dat$R)
#
# fit = lm(dat$R ~ dat$BB + dat$HR, data = dat)
# # Plot the linear model
# # get the BB & HR ranges summary(dat$BB); summary(dat$HR)
# cf = fit$coefficients
# pltx = seq(344, 775,length.out = 100)
# plty = seq(11,264,length.out = 100)
# pltz = cf[1] + cf[2]*pltx + cf[3]*plty
# #Add the line to the plot
# myScatter3D$points3d(pltx,plty,pltz, type = "l", col = "red", lwd=3)
# # interactive *rgl* 3D plot
# library(rgl)
# fit <- lm(dat$R ~ dat$BB + dat$HR)
# coefs <- coef(fit)
# a <- coefs["dat$BB"]
# b <- coefs["dat$HR"]
# c <- -1
# d <- coefs["(Intercept)"]
# open3d()
# plot3d(dat$BB, dat$HR, dat$R, type = "s", col = "red", size = 1)
# planes3d(a, b, c, d, alpha = 0.5)
# # planes3d(b, a, -1.5, d, alpha = 0.5)
# # planes3d draws planes using the parametrization a*x + b*y + c*z + d = 0.
# # Multiple planes may be specified by giving multiple values for the normal
# # vector (a, b, c) and the offset parameter d
#
# pca1 <- prcomp(as.matrix(cbind(dat$BB, dat$HR, dat$R)), center = T); summary(pca1)
#
# # Given two vectors PCA1 and PCA2, the cross product V = PCA1 x PCA2
# # is orthogonal to both A and to B, and a normal vector to the
# # plane containing PCA1 and PCA2
# # If PCA1 = (a,b,c) and PCA2 = (d, e, f), then the cross product is
# # PCA1 x PCA2 = (bf - ce, cd - af, ae - bd)
# # PCA1 = pca1$rotation[,1] and PCAS2=pca1$rotation[,2]
# # https://en.wikipedia.org/wiki/Cross_product#Names
# # prcomp$rotation contains the matrix of variable loadings,
# # i.e., a matrix whose columns contain the eigenvectors
# #normVec = c(pca1$rotation[,1][2]*pca1$rotation[,2][3]-
# # pca1$rotation[,1][3]*pca1$rotation[,2][2],
# # pca1$rotation[,1][3]*pca1$rotation[,2][1]-
# # pca1$rotation[,1][1]*pca1$rotation[,2][3],
# # pca1$rotation[,1][1]*pca1$rotation[,2][2]-
# # pca1$rotation[,1][2]*pca1$rotation[,2][1]
# # )
# normVec = c(pca1$rotation[2,1]*pca1$rotation[3,2]-
# pca1$rotation[3,1]*pca1$rotation[2,2],
# pca1$rotation[3,1]*pca1$rotation[1,2]-
# pca1$rotation[1,1]*pca1$rotation[3,2],
# pca1$rotation[1,1]*pca1$rotation[2,2]-
# pca1$rotation[2,1]*pca1$rotation[1,2]
# )
#
# # Plot the PCA Plane
# plot3d(dat$BB, dat$HR, dat$R, type = "s", col = "red", size = 1)
# planes3d(normVec[1], normVec[2], normVec[3], 90, alpha = 0.5)
# myScatter3D <- scatterplot3d(dat$BB, dat$HR, dat$R)
dat$name <- Teams[Teams$G==162&Teams$yearID<2002, "name"]
fit = lm(dat$R ~ dat$BB + dat$HR, data = dat)
# Plot the linear model
# get the BB & HR ranges summary(dat$BB); summary(dat$HR)
cf = fit$coefficients
pltx = seq(344, 775,length.out = length(dat$BB))
plty = seq(11,264,length.out = length(dat$BB))
pltz = cf[1] + cf[2]*pltx + cf[3]*plty
# Plot Scatter and add the LM line to the plot
plot_ly() %>%
add_trace(x = ~pltx, y = ~plty, z = ~pltz, type="scatter3d", mode="lines",
line = list(color = "red", width = 4), name="lm(R ~ BB + HR)") %>%
add_markers(x = ~dat$BB, y = ~dat$HR, z = ~dat$R, color = ~dat$name, mode="markers") %>%
layout(scene = list(xaxis = list(title = '(BB) Walks by batters'),
yaxis = list(title = '(HR) Home runs by batters'),
zaxis = list(title = '(R) Runs scored')))
# Plot Scatter and add the LM PLANE to the plot
lm <- lm(R ~ 0 + HR + BB, data = dat)
#Setup Axis
axis_x <- seq(min(dat$HR), max(dat$HR), length.out=100)
axis_y <- seq(min(dat$BB), max(dat$BB), length.out=100)
#Sample points
lm_surface <- expand.grid(HR = axis_x, BB = axis_y, KEEP.OUT.ATTRS = F)
lm_surface$R <- predict.lm(lm, newdata = lm_surface)
lm_surface <- acast(lm_surface, HR ~ BB, value.var = "R") #`R`~ 0 + HR + BB
plot_ly(dat, x = ~HR, y = ~BB, z = ~R,
text = ~name, type = "scatter3d", mode = "markers", color = ~dat$name) %>%
add_trace(x = ~axis_x, y = ~axis_y, z = ~lm_surface, type="surface", color="gray", opacity=0.3) %>%
layout(title="3D Plane Regression (R ~ BB + HR); Color=BB Team", showlegend = F,
xaxis = list(title = '(BB) Walks by batters'),
yaxis = list(title = '(HR) Home runs by batters'),
zaxis = list(title = '(R) Runs scored')) %>%
hide_colorbar()
We can also express the covariance matrix for our features using matrix operations. Suppose
\[X_{N\times K}=\left(\begin{array}{cccc} X_{1, 1}&\cdots&X_{1, K}\\ X_{2, 1}&\cdots&X_{2, K}\\ \vdots&\ddots&\vdots\\ X_{N, 1}&\cdots&X_{N, K} \end{array}\right) =[X_1 \ X_2\ \cdots\ X_K]. \]
Then the \(K\times K\) (square) covariance matrix is: \[\Sigma_{K\times K} = (\Sigma_{i, j}),\] where \(\Sigma_{i, j}=Cov(X_i, X_j)=E\left( (X_i-\mu_i)(X_j-\mu_j)\right)\), \(1\leq i, j \leq K\). Note that earlier, we denoted the number of variables by \(p\), whereas here we use \(K\); this is to avoid potential confusion with probabilities, \(p\).
The sample covariance matrix is: \[\Sigma_{i, j}=\frac{1}{N-1}\sum_{m=1}^{N}\left( x_{m, i}-\bar{x}_i \right) \left( x_{m, j}-\bar{x}_j \right) , \]
where \[\bar{x}_{i}=\frac{1}{N}\sum_{m=1}^{N}x_{m, i}, \quad i=1, \cdots, K .\]
In general, \[\Sigma=\frac{1}{n-1}(X-\bar{X})^t(X-\bar{X}).\]
Suppose that we want to get the sample covariance matrix of the following \(5\times 3\) feature matrix \(x\).
x <- matrix(c(4.0, 4.2, 3.9, 4.3, 4.1, 2.0, 2.1, 2.0, 2.1, 2.2, 0.60, 0.59, 0.58, 0.62, 0.63), ncol=3)
x
## [,1] [,2] [,3]
## [1,] 4.0 2.0 0.60
## [2,] 4.2 2.1 0.59
## [3,] 3.9 2.0 0.58
## [4,] 4.3 2.1 0.62
## [5,] 4.1 2.2 0.63
Notice that this matrix represents the design matrix of \(3\) features and \(5\) observations. Let’s compute the column means first.
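A sketch reproducing the column means and the manual covariance computation shown below (the names vec2, x.bar, D, and S are assumptions):

vec2 <- rep(1/nrow(x), nrow(x))        # weights 1/N
x.bar <- crossprod(vec2, x)            # 1 x 3 matrix of column means
x.bar
D <- x - rep(1, nrow(x)) %*% x.bar     # center each column
S <- crossprod(D)/(nrow(x) - 1)        # 3 x 3 sample covariance matrix
S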
## [,1] [,2] [,3]
## [1,] 4.1 2.08 0.604
## [,1] [,2] [,3]
## [1,] 0.02500 0.00750 0.00175
## [2,] 0.00750 0.00700 0.00135
## [3,] 0.00175 0.00135 0.00043
In the covariance matrix, \(S[i, i]\) is the variance of the i-th feature and \(S[i, j]\) is the covariance of i-th and j-th features.
Compare this to the automated calculation of the variance-covariance matrix.
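The built-in estimator yields the same result:

cov(x)    # variance-covariance matrix of the columns of x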
## [,1] [,2] [,3]
## [1,] 0.02500 0.00750 0.00175
## [2,] 0.00750 0.00700 0.00135
## [3,] 0.00175 0.00135 0.00043
Later, in Chapter 5 (Supervised classification), we will cover some classification methods that use this mathematical framework for model-based and model-free ML/AI prediction. However, let’s start with linear model-based statistical methods providing forecasting and classification functionality. Specifically, we will (1) demonstrate the predictive power of multivariate linear regression, (2) show the foundation of regression trees and model trees, and (3) examine two complementary case-studies (Baseball Players and Heart Attack).
Regression represents a model of the relationship between a dependent variable (the value to be predicted) and a group of independent variables (predictors or features). We assume the relationship between the dependent outcome variable and the independent variables is linear.
Earlier, we discussed the straightforward case of regression as simple linear regression, which involves a single predictor \[y=a+bx.\]
In this slope-intercept formula, a is the model intercept and b is the model slope. Thus, simple linear regression may be expressed as a bivariate equation. If we know a and b, then for any given x we can estimate, or predict, y via the regression formula. If we plot x against y in a 2D coordinate system, where the two variables are exactly linearly related, the result will be a straight line.
However, this is the ideal case. Bivariate scatterplots using real world data may show patterns that are not necessarily precisely linear, see Chapter 2. Let's look at a bivariate scatterplot and try to fit a simple linear regression line using two variables, e.g., hospital charges (CHARGES) as a dependent variable, and length of stay in the hospital (LOS) as an independent predictor. The data is available in the DSPA Data folder as CaseStudy12_AdultsHeartAttack_Data. We can remove the pair of observations with missing values using the command heart_attack<-heart_attack[complete.cases(heart_attack), ].
library(plotly)
heart_attack <- read.csv("https://umich.instructure.com/files/1644953/download?download_frd=1", stringsAsFactors = F)
heart_attack$CHARGES <- as.numeric(heart_attack$CHARGES)
heart_attack <- heart_attack[complete.cases(heart_attack), ]
fit1 <- lm(CHARGES ~ LOS, data=heart_attack)
# par(cex=.8)
# plot(heart_attack$LOS, heart_attack$CHARGES, xlab="LOS", ylab = "CHARGES")
# abline(fit1, lwd=2, col="red")
plot_ly(heart_attack, x = ~LOS, y = ~CHARGES, type = 'scatter', mode = "markers", name="Data") %>%
add_trace(x=~mean(LOS), y=~mean(CHARGES), type="scatter", mode="markers",
name="(mean(LOS), mean(Charge))", marker=list(size=20, color='blue', line=list(color='yellow', width=2))) %>%
add_lines(x = ~LOS, y = fit1$fitted.values, mode = "lines", name="Linear Model") %>%
layout(title=paste0("lm(CHARGES ~ LOS), Cor(LOS,CHARGES) = ",
round(cor(heart_attack$LOS, heart_attack$CHARGES),3)))
As expected, longer hospital stays are associated with higher medical costs, or hospital charges. The scatterplot shows dots for each pair of observed measurements (\(x=LOS\) and \(y=CHARGES\)), and an increasing linear trend.
The estimated expression for this regression line is: \[\hat{y}=4582.70+212.29\times x\]
or equivalently
\[CHARGES=4582.70+212.29\times LOS.\]
Once the linear model is fit, i.e., its coefficients are estimated, we can make predictions using this explicit regression model. Assume we have a patient who spent 10 days in the hospital, i.e., LOS=10. The predicted charge is likely to be \(\$ 4582.70 + \$ 212.29 \times 10= \$ 6705.6\). Plugging x into the regression equation automatically gives us an estimated value of the outcome y.
This chapter of the Probability and Statistics EBook provides an introduction to linear modeling.
How did we get the estimated expression? The most common estimating method in statistics is ordinary least squares (OLS). OLS estimators are obtained by minimizing the sum of the squared errors - that is the sum of squared vertical distance from each dot on the scatter plot to the regression line.
OLS is minimizing the following expression:
\[\langle\epsilon , \epsilon\rangle \equiv \sum_{i=1}^{n}(y_i-\hat{y}_i)^2=\sum_{i=1}^{n}\left (\underbrace{y_i}_{\text{observed outcome}}-\underbrace{(a+b\times x_i)}_{\text{predicted outcome}}\right )^2=\sum_{i=1}^{n}\underbrace{\epsilon_i^2}_{\text{squared residual}}.\]
Some calculus-based calculations suggest that the value of b minimizing the squared error is: \[b=\frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}.\] Then, the corresponding constant term (\(y\)-intercept) a is: \[a=\bar y-b\bar x.\]
These expressions become apparent if you review the material in Chapter 2. Recall that the variance is obtained by averaging the sums of squared deviations (\(var(x)=\frac{1}{n}\sum^{n}_{i=1} (x_i-\mu)^2\)). When we use \(\bar{x}\) to estimate the mean of \(x\), the sample variance is \(var(x)=\frac{1}{n-1}\sum^{n}_{i=1} (x_i-\bar{x})^2\). Note that this is \(\frac{1}{n-1}\) times the denominator of b. Similar to the variance, the covariance of x and y measures the average product of the deviations of x and the deviations of y:
\[Cov(x, y)=\frac{1}{n}\sum^{n}_{i=1} (x_i-\mu_x)(y_i-\mu_y).\]
If we utilize the sample averages (\(\bar{x}\), \(\bar{y}\)) as estimates of the corresponding population means, we have:
\[Cov(x, y)=\frac{1}{n-1}\sum^{n}_{i=1} (x_i-\bar{x})(y_i-\bar{y}).\]
This is \(\frac{1}{n-1}\) times the numerator of b. Thus, combining the above 2 expressions, we get an estimate of the slope coefficient (effect-size of LOS on Charge) expressed as:
\[b=\frac{Cov(x, y)}{var(x)}.\]
Let’s use the heart attack data to demonstrate these calculations.
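A sketch of these calculations on the heart attack data, matching the outputs below (fit1 is the lm() model from the earlier code):

b <- cov(heart_attack$LOS, heart_attack$CHARGES)/var(heart_attack$LOS)   # slope
b
a <- mean(heart_attack$CHARGES) - b * mean(heart_attack$LOS)             # intercept
a
fit1$coefficients[1]    # intercept reported by lm()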
## [1] 212.2869
## [1] 4582.7
## (Intercept)
## 4582.7
We can see that this is exactly the same as the previously computed estimate of the constant intercept term using lm().
Regression modeling has five key assumptions: a linear relationship between the outcome and the predictors, independence of the errors, homoscedasticity (constant error variance), approximate normality of the residuals, and little or no multicollinearity among the predictors.
If these assumptions are violated, the model may provide invalid estimates and unreliable predictions.
Note: The SOCR Interactive Scatterplot Game (requires Java enabled browser) provides a dynamic interface demonstrating linear models, trends, correlations, slopes and residuals.
Based on covariance we can calculate correlation, which indicates how closely the relationship between two variables follows a straight line.
\[\rho_{x, y}=Corr(x, y)=\frac{Cov(x, y)}{\sigma_x\sigma_y}=\frac{Cov(x, y)}{\sqrt{Var(x)Var(y)}}.\]
In R, the correlation may be computed using the method cor(), and the square root of the variance, or the standard deviation, is computed by sd().
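A sketch matching the two identical outputs below:

cov(heart_attack$LOS, heart_attack$CHARGES)/(sd(heart_attack$LOS)*sd(heart_attack$CHARGES))
cor(heart_attack$LOS, heart_attack$CHARGES)   # built-in correlation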
## [1] 0.2449743
## [1] 0.2449743
The same outputs are obtained. This correlation is a positive number that is relatively small. We can say there is a weak positive linear association between these two variables. If we had a negative number, then it would indicate a negative linear association. We have a weak association when \(0.1 \leq Cor < 0.3\), a moderate association for \(0.3 \leq Cor < 0.5\), and a strong association for \(0.5 \leq Cor \leq 1.0\). If the correlation is below \(0.1\), then it suggests little to no linear relation between the variables.
In practice, we usually have more situations with multiple predictors and one dependent variable, which may follow a multiple linear model. That is:
\[y=\alpha+\beta_1x_1+\beta_2x_2+ \cdots +\beta_kx_k+\epsilon,\]
or equivalently
\[y=\beta_0+\beta_1x_1+\beta_2x_2+ \cdots +\beta_kx_k+\epsilon .\]
We usually use the second notation method in statistics. This equation shows the linear relationship between k predictors and a dependent variable. In total we have k+1 coefficients to estimate.
The matrix notation for the above equation is
\[Y=X\beta+\epsilon,\]
where
\[Y=\left(\begin{array}{c} y_1 \\ y_2\\ \vdots \\ y_n \end{array}\right),\]
\[X=\left(\begin{array}{ccccc} 1 & x_{11}&x_{21}&\cdots&x_{k1} \\ 1 & x_{12}&x_{22}&\cdots&x_{k2} \\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{1n}&x_{2n}&\cdots&x_{kn} \end{array}\right) \] \[\beta=\left(\begin{array}{c} \beta_0 \\ \beta_1\\ \vdots\\ \beta_k \end{array}\right),\]
and
\[\epsilon=\left(\begin{array}{c} \epsilon_1 \\ \epsilon_2\\ \vdots \\ \epsilon_n \end{array}\right)\] is the error term.
Similar to simple linear regression, our goal is to minimize the sum of squared errors. Solving the matrix equation for \(\beta\), we get the OLS solution for the parameter vector:
\[\hat{\beta}=(X^tX)^{-1}X^tY .\]
The solution is presented in matrix form, where \((X^tX)^{-1}\) is the matrix inverse of the cross-product \(X^tX\) and \(X^t\) is the transpose of the original design matrix \(X\). This example demonstrates making de novo a simple regression (least squares estimating) function reg().
We saw earlier that a clever use of matrix multiplication (%*%) and solve() can help with the explicit OLS solution.
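The body of reg() is not displayed in this extract; a minimal sketch of such a manual least-squares estimator follows (it assumes y is a numeric outcome and x a vector, matrix, or data frame of predictors):

reg <- function(y, x) {
  x <- as.matrix(x)                    # coerce the predictors to a matrix
  x <- cbind(Intercept = 1, x)         # prepend an intercept column of 1's
  solve(t(x) %*% x) %*% t(x) %*% y     # OLS: (X^t X)^{-1} X^t Y
}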
Next, we will apply our function reg() to the heart attack data. To begin with, let's check if the simple linear regression (lm()) output coincides with the results of our manual regression estimator, reg().
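A sketch of that comparison, consistent with the outputs below:

reg(y = heart_attack$CHARGES, x = heart_attack$LOS)   # manual estimator
lm(CHARGES ~ LOS, data = heart_attack)                # built-in lm() fit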
## [,1]
## Intercept 4582.6997
## 212.2869
##
## Call:
## lm(formula = CHARGES ~ LOS, data = heart_attack)
##
## Coefficients:
## (Intercept) LOS
## 4582.7 212.3
The results of the official (lm()) and the manual (reg()) simple linear models agree, and we can proceed with testing the multivariate functionality using additional variables as predictors, e.g., just adding age as a second variable into the model.
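A sketch consistent with the structure listing and the multivariate estimates shown below:

str(heart_attack)
reg(y = heart_attack$CHARGES, x = heart_attack[, c("LOS", "AGE")])   # manual fit
lm(CHARGES ~ LOS + AGE, data = heart_attack)                         # lm() fit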
## 'data.frame': 148 obs. of 8 variables:
## $ Patient : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DIAGNOSIS: int 41041 41041 41091 41081 41091 41091 41091 41091 41041 41041 ...
## $ SEX : chr "F" "F" "F" "F" ...
## $ DRG : int 122 122 122 122 122 121 121 121 121 123 ...
## $ DIED : int 0 0 0 0 0 0 0 0 0 1 ...
## $ CHARGES : num 4752 3941 3657 1481 1681 ...
## $ LOS : int 10 6 5 2 1 9 15 15 2 1 ...
## $ AGE : int 79 34 76 80 55 84 84 70 76 65 ...
## [,1]
## Intercept 7280.55493
## LOS 259.67361
## AGE -43.67677
##
## Call:
## lm(formula = CHARGES ~ LOS + AGE, data = heart_attack)
##
## Coefficients:
## (Intercept) LOS AGE
## 7280.55 259.67 -43.68
The following sections provide additional examples of simple and multivariate regression, and Chapter 9 will generalize the OLS regression modeling to regularized linear model estimation, which facilitates joint model fitting and feature selection.
In this example, we will utilize the MLB data (01a_data.txt). The data contains \(1,034\) records of heights and weights for some recent Major League Baseball (MLB) Players. These data were obtained from different resources (e.g., IBM Many Eyes).
Variables:
Let’s first load this dataset using as.is=T
to keep
non-numerical vectors as characters. Also, we will delete the
Name
variable because we don’t need the players’ names in
this case study.
mlb <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1', as.is=T, header=T)
str(mlb)
## 'data.frame': 1034 obs. of 6 variables:
## $ Name : chr "Adam_Donachie" "Paul_Bako" "Ramon_Hernandez" "Kevin_Millar" ...
## $ Team : chr "BAL" "BAL" "BAL" "BAL" ...
## $ Position: chr "Catcher" "Catcher" "Catcher" "First_Baseman" ...
## $ Height : int 74 74 72 72 73 69 69 71 76 71 ...
## $ Weight : int 180 215 210 210 188 176 209 200 231 180 ...
## $ Age : num 23 34.7 30.8 35.4 35.7 ...
By looking at the str() output, we notice that the variables Team and Position are represented as character vectors, even though they are categorical. To fix this, we can use the function as.factor() to convert numeric or character vectors to factors.
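A minimal sketch of this conversion (modifying the mlb data frame in place):
mlb$Team <- as.factor(mlb$Team)
mlb$Position <- as.factor(mlb$Position)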
The data is now ready to compute some summary statistics and generate simple plots.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 150.0 187.0 200.0 201.7 215.0 290.0
# hist(mlb$Weight, main = "Histogram for Weights")
plot_ly(x = mlb$Weight, type = "histogram", name= "Histogram for Weights") %>%
layout(title="Baseball Players' Weight Histogram", bargap=0.1,
xaxis=list(title="Weight"), # control the y:x axes aspect ratio
yaxis = list(title="Frequency"))
The above plot illustrates our dependent variable
Weight
. As we saw in Chapter
1, this distribution appears a little right-skewed.
Displaying pairs plots
provides a compact summary of
different features in the data.
# require(GGally)
# mlb_binary = mlb
# mlb_binary$bi_weight = as.factor(ifelse(mlb_binary$Weight>median(mlb_binary$Weight),1,0))
# g_weight <- ggpairs(data=mlb_binary[-1], title="MLB Light/Heavy Weights",
# mapping=ggplot2::aes(colour = bi_weight),
# lower=list(combo=wrap("facethist",binwidth=1)),
# # upper = list(continuous = wrap("cor", size = 4.75, alignPercent = 1))
# )
# g_weight
plot_ly(mlb) %>%
add_trace(type = 'splom', dimensions = list( list(label='', values=~Position), # Position
list(label='Height', values=~Height), list(label='Weight', values=~Weight),
list(label='Age', values=~Age), list(label='', values=~Team)), # Team
text=~Team,
marker = list(color = as.integer(mlb$Team),
size = 7, line = list(width = 1, color = 'rgb(230,230,230)')
)
) %>%
style(diagonal = list(visible = FALSE)) %>%
layout(title= 'MLB Pairs Plot', hovermode='closest', dragmode= 'select',
plot_bgcolor='rgba(240,240,240, 0.95)')
# We may also mark player positions by different colors in the ggpairs plot
# g_position <- ggpairs(data=mlb[-1], title="MLB by Position",
# mapping=ggplot2::aes(colour = Position),
# lower=list(combo=wrap("facethist",binwidth=1)))
# g_position
Let’s try to summarize certain candidate predictors.
##
## ANA ARZ ATL BAL BOS CHC CIN CLE COL CWS DET FLA HOU KC LA MIN MLW NYM NYY OAK
## 35 28 37 35 36 36 36 35 35 33 37 32 34 35 33 33 35 38 32 37
## PHI PIT SD SEA SF STL TB TEX TOR WAS
## 36 35 33 34 34 32 33 35 34 36
##
## Catcher Designated_Hitter First_Baseman Outfielder
## 76 18 55 194
## Relief_Pitcher Second_Baseman Shortstop Starting_Pitcher
## 315 58 52 221
## Third_Baseman
## 45
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.0 72.0 74.0 73.7 75.0 83.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.90 25.44 27.93 28.74 31.23 48.52
Here we have two numerical predictors and two categorical predictors for the \(1,034\) observations. Let's see how R treats these different classes of variables.
Before fitting linear models, let's examine the inter-dependence of our potential predictors and the dependent variable. Multiple linear regression works best when the predictors are not strongly correlated with one another, i.e., when there is no severe multicollinearity. Is this condition satisfied here? As we mentioned earlier, the correlation function, cor(), can help answer this question for linear pairwise dependencies between numerical variables.
## Weight Height Age
## Weight 1.0000000 0.53031802 0.15784706
## Height 0.5303180 1.00000000 -0.07367013
## Age 0.1578471 -0.07367013 1.00000000
Of course, the correlation is symmetric, \(cor(y, x)=cor(x, y)\), and \(cor(x, x)=1\). Also, our Height variable is weakly (negatively) related to the players' age. These results look very good and do not suggest potential multicollinearity problems. If two of our predictors were highly correlated, they would provide largely redundant information. Such multicollinearity can destabilize the model by inflating the variance of the coefficient estimates, and one common practice is to remove one of the highly correlated predictors prior to fitting the model.
In general multivariate regression analysis, we can use the
variance inflation factors (VIFs)
to detect potential
multicollinearity between all covariates. The variance inflation factor
quantifies the amount of artificial inflation of the variance due to
observed multicollinearity in the covariates. The \(VIF_l\)’s represent the expected inflation
of the corresponding estimated variances. In a simple linear regression
model with a single predictor \(x_l\), \(y_i=\beta_0 + \beta_l x_{i,l}+\epsilon_i\), with residual (baseline) variance \(\sigma^2\), the lower bound (minimum) of the variance of the estimated effect-size, \(\beta_l\), is:
\[Var(\beta_l)_{min}=\frac{\sigma^2}{\sum_{i=1}^n{\left ( x_{i,l}-\bar{x}_l\right )^2}}.\]
This allows us to track the inflation of the \(\beta_l\) variance \(\left (Var(\beta_l)\right )\) in the presence of correlated predictors in the regression model. Suppose the linear model includes \(k\) covariates with some of them being multicollinear or correlated.
\[y_i=\beta_o+\beta_1x_{i,1} + \beta_2 x_{i,2} + \cdots + \underbrace{\beta_l x_{i,l}}_{\text{effect*feature}} + \cdots + \beta_k x_{i,k} +\epsilon_i.\]
Assume some of the predictors are correlated with the feature \({x_l}\), then the variance of its effect, \(Var(\beta_l)\), will be inflated as follows:
\[Var(\beta_l)=\frac{\sigma^2}{\sum_{i=1}^n{\left ( x_{i,l}-\bar{x}_l\right )^2}}\times \frac{1}{1-R_l^2},\]
where \(R_l^2\) is the \(R^2\)-value computed by regressing the \(l^{th}\) feature on the remaining \((k-1)\) predictors. The stronger the linear dependence between the \(l^{th}\) feature and the remaining predictors, the larger the corresponding \(R_l^2\) value will be, the smaller the denominator in the inflation factor (\(VIF_l\)), and the larger the variance estimate of \(\beta_l\).
The variance inflation factor (\(VIF_l\)) is the ratio of the two variances - the variance of the effect (numerator) and the lower bound minimum variance of the effect (denominator).
\[VIF_l=\frac{Var(\beta_l)}{Var(\beta_l)_{min}}= \frac{\frac{\sigma^2}{\sum_{i=1}^n{\left ( x_{i,l}-\bar{x}_l\right )^2}}\times \frac{1}{1-R_l^2}}{\frac{\sigma^2}{\sum_{i=1}^n{\left ( x_{i,l}-\bar{x}_l\right )^2}}}=\frac{1}{1-R_l^2}.\]
The regression model's VIFs measure how much the variance of the estimated regression coefficients, \(\beta_l\), may be "inflated" by the unavoidable presence of multicollinearity among the model predictor features. \(VIF_l\sim 1\) implies that there is no substantial multicollinearity involving the \(l^{th}\) predictor and the remaining features, and hence, the variance estimate of \(\beta_l\) is not inflated. On the other hand, when \(VIF_l > 4\), potential multicollinearity is likely, and when \(VIF_l> 10\), there may be serious multicollinearity in the data, which may require some model correction to account for variance estimates that may be substantially inflated.
We can use the function car::vif()
to compute and report
the VIF factors.
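A sketch of such a call, assuming the two-numeric-predictor model Weight ~ Height + Age fitted on the MLB data (which is consistent with the VIF output below):
library(car)
vif(lm(Weight ~ Height + Age, data = mlb))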
## Height Age
## 1.005457 1.005457
In Chapter 11 (Feature Selection), we will discuss the methods and computational strategies to identify salient features in high-dimensional datasets. Let's briefly identify some practical approaches to address multicollinearity problems and tackle challenges related to large numbers of inter-dependencies in the data. Data that contain a large number of predictors are likely to include completely unrelated variables exhibiting high sample correlations. To see the nature of this problem, assume we generate a random Gaussian \(n\times k\) matrix, \(X=(X_1,X_2, \cdots, X_k)\), of \(k\) feature vectors, \(X_i, 1\leq i\leq k\), using IID standard normal random samples. Then, the expected maximum correlation between any pair of columns, \(\rho(X_{i_1},X_{i_2})\), can be substantial, especially when \(k\gg n\).
Even in this IID sampling problem, we still expect a high rate of intrinsic and strong feature correlations. In general, this phenomenon is amplified for high-dimensional observational data, which would be expected to have a high degree of collinearity. This problem presents a number of computational, model-fitting, model-interpretation, and selection of salient predictors challenges, e.g., function singularities and negative definite Hessian matrices.
There are some techniques that allow us to resolve such multicollinearity issues in high-dimensional data. Let's denote \(n\) to be the number of cases (samples, subjects, units, etc.) and \(k\) the number of features. Using a divide-and-conquer strategy, we can split the problem into two special cases, e.g., depending on whether \(k < n\) or \(k \geq n\).
There are many alternative ways to visualize correlations, e.g.,
pairs()
, ggpairs()
, or
plot_ly()
.
# pairs(mlb[c("Weight", "Height", "Age")])
plot_ly(mlb) %>%
add_trace(type = 'splom', dimensions = list( list(label='Height', values=~Height),
list(label='Weight', values=~Weight), list(label='Age', values=~Age)),
text=~Position,
marker = list(color = as.integer(mlb$Team),
size = 7, line = list(width = 1, color = 'rgb(230,230,230)')
)
) %>%
layout(title= 'MLB Pairs Plot', hovermode='closest', dragmode= 'select',
plot_bgcolor='rgba(240,240,240, 0.95)')
Some of these plots may give a sense of variable associations or show
specific patterns in the data. The psych::pairs.panels()
function provides another sophisticated display (SPLOM =
scatter plot matrix) that is often useful in exploring multivariate
relations.
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
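A minimal sketch of the call generating such a SPLOM display (assuming the three numeric variables Weight, Height, and Age are the ones displayed):
library(psych)
pairs.panels(mlb[, c("Weight", "Height", "Age")])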
This plot gives us much more information about the three selected variables. Above the diagonal, we have the correlation coefficients in numerical form. On the diagonal, there are histograms of the variables. Below the diagonal, more visual information is presented to help us understand the corresponding bivariate trends. Specifically, this graph shows that height and weight are strongly positively correlated. There are also some weaker relationships, e.g., between age and height, and between age and weight (the nearly horizontal red trend lines in the below-diagonal panels indicate weak relationships).
The base R
method we are going to use now is the linear
modeling function, lm()
. No extra package is needed when
using this function. The lm()
function has the following
invocation protocol:
m <- lm(dv ~ iv, data=mydata)
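As a concrete instance of this protocol for the MLB case study, the (hidden) call producing the output below is presumably of the following form; the object name fit is an assumption, consistent with the later references to fit$fitted.values, and it presumes the Name column has already been dropped from mlb, as described earlier:
fit <- lm(Weight ~ ., data = mlb)
fit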
This is the same formula interface we used with OneR() in Chapter 5. If we use . as the iv term, then all of the variables, except the dependent variable (\(dv\)), are included as model predictors.
##
## Call:
## lm(formula = Weight ~ ., data = mlb)
##
## Coefficients:
## (Intercept) TeamARZ
## -164.9995 7.1881
## TeamATL TeamBAL
## -1.5631 -5.3128
## TeamBOS TeamCHC
## -0.2838 0.4026
## TeamCIN TeamCLE
## 2.1051 -1.3160
## TeamCOL TeamCWS
## -3.7836 4.2944
## TeamDET TeamFLA
## 2.3024 2.6985
## TeamHOU TeamKC
## -0.6808 -4.7664
## TeamLA TeamMIN
## 2.8598 2.1269
## TeamMLW TeamNYM
## 4.2897 -1.9736
## TeamNYY TeamOAK
## 1.7483 -0.5464
## TeamPHI TeamPIT
## -6.8486 4.3023
## TeamSD TeamSEA
## 2.6133 -0.9147
## TeamSF TeamSTL
## 0.8411 -1.1341
## TeamTB TeamTEX
## -2.6616 -0.7695
## TeamTOR TeamWAS
## 1.3943 -1.7555
## PositionDesignated_Hitter PositionFirst_Baseman
## 8.9037 2.4237
## PositionOutfielder PositionRelief_Pitcher
## -6.2636 -7.7695
## PositionSecond_Baseman PositionShortstop
## -13.0843 -16.9562
## PositionStarting_Pitcher PositionThird_Baseman
## -7.3599 -4.6035
## Height Age
## 4.7175 0.8906
The output model report includes both numeric and factor predictors. For each factor variable, the model creates a set of several indicators (one-hot-encoding, dummy variables) with corresponding coefficients matching each factor level (except for all reference factor levels, as the effects of all factors are reported relative to the corresponding reference level). For each numerical variable, there is just one coefficient (the matching effect).
Let’s examine the linear model performance.
##
## Call:
## lm(formula = Weight ~ ., data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.692 -10.909 -0.778 9.858 73.649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -164.9995 19.3828 -8.513 < 2e-16 ***
## TeamARZ 7.1881 4.2590 1.688 0.091777 .
## TeamATL -1.5631 3.9757 -0.393 0.694278
## TeamBAL -5.3128 4.0193 -1.322 0.186533
## TeamBOS -0.2838 4.0034 -0.071 0.943492
## TeamCHC 0.4026 3.9949 0.101 0.919749
## TeamCIN 2.1051 3.9934 0.527 0.598211
## TeamCLE -1.3160 4.0356 -0.326 0.744423
## TeamCOL -3.7836 4.0287 -0.939 0.347881
## TeamCWS 4.2944 4.1022 1.047 0.295413
## TeamDET 2.3024 3.9725 0.580 0.562326
## TeamFLA 2.6985 4.1336 0.653 0.514028
## TeamHOU -0.6808 4.0634 -0.168 0.866976
## TeamKC -4.7664 4.0242 -1.184 0.236525
## TeamLA 2.8598 4.0817 0.701 0.483686
## TeamMIN 2.1269 4.0947 0.519 0.603579
## TeamMLW 4.2897 4.0243 1.066 0.286706
## TeamNYM -1.9736 3.9493 -0.500 0.617370
## TeamNYY 1.7483 4.1234 0.424 0.671655
## TeamOAK -0.5464 3.9672 -0.138 0.890474
## TeamPHI -6.8486 3.9949 -1.714 0.086778 .
## TeamPIT 4.3023 4.0210 1.070 0.284890
## TeamSD 2.6133 4.0915 0.639 0.523148
## TeamSEA -0.9147 4.0516 -0.226 0.821436
## TeamSF 0.8411 4.0520 0.208 0.835593
## TeamSTL -1.1341 4.1193 -0.275 0.783132
## TeamTB -2.6616 4.0944 -0.650 0.515798
## TeamTEX -0.7695 4.0283 -0.191 0.848556
## TeamTOR 1.3943 4.0681 0.343 0.731871
## TeamWAS -1.7555 4.0038 -0.438 0.661142
## PositionDesignated_Hitter 8.9037 4.4533 1.999 0.045842 *
## PositionFirst_Baseman 2.4237 3.0058 0.806 0.420236
## PositionOutfielder -6.2636 2.2784 -2.749 0.006084 **
## PositionRelief_Pitcher -7.7695 2.1959 -3.538 0.000421 ***
## PositionSecond_Baseman -13.0843 2.9638 -4.415 1.12e-05 ***
## PositionShortstop -16.9562 3.0406 -5.577 3.16e-08 ***
## PositionStarting_Pitcher -7.3599 2.2976 -3.203 0.001402 **
## PositionThird_Baseman -4.6035 3.1689 -1.453 0.146613
## Height 4.7175 0.2563 18.405 < 2e-16 ***
## Age 0.8906 0.1259 7.075 2.82e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.78 on 994 degrees of freedom
## Multiple R-squared: 0.3858, Adjusted R-squared: 0.3617
## F-statistic: 16.01 on 39 and 994 DF, p-value: < 2.2e-16
#plot(fit, which = 1:2)
plot_ly(x=fit$fitted.values, y=fit$residuals, type="scatter", mode="markers") %>%
layout(title="LM: Fitted-values vs. Model-Residuals",
xaxis=list(title="Fitted"), yaxis = list(title="Residuals"))
The summary shows how well the model fits the dataset.
Residuals: This section summarizes the distribution of the residuals. If some observations have extremely large or extremely small residuals compared to the rest, they are either outliers (e.g., due to reporting errors) or indications that the model fits the data poorly. Here the maximum is \(73.649\) and the minimum is \(-48.692\); their extremeness can be examined in the residual diagnostic plot.
Coefficients: In this section, stronger effects are indicated by more stars (\(*\)) in the right-most column. Stars, or dots, next to the probability value of each variable indicate whether the variable is a significant predictor of the outcome and, therefore, should be included in the model. An empty field suggests that, statistically speaking, the variable does not contribute significantly (in the specified model) to predicting the outcome, i.e., there is no strong evidence to suggest its estimated effect is non-zero. The column Pr(>|t|) contains the estimated probability corresponding to the t-statistic for this covariate. Smaller values (close to \(0\)) indicate the variable is a significant covariate, and conversely, larger values indicate lack of significance and suggest that the variable may be dropped from the model. In our example, none of the team indicators are statistically significant, several of the position indicators are, and Age and Height are highly significant predictors of the outcome, Weight.
R-squared: Quantifies the percent of the variability in the outcome (\(Y\)) that is explained by the included predictors (\(X\)). Here, we have \(R^2=38.58\%\), which indicates the model is not bad but could be improved. Usually, a well-fitted linear regression would have \(R^2>50\%\).
Diagnostic plots are also helpful for understanding the model performance relative to the data.
Residual vs Fitted: This is the residual diagnostic plot. We can see that the residuals of observations indexed \(65\), \(160\) and \(237\) are relatively far apart from the rest. They are potential influential points or outliers.
Normal Q-Q: This plot examines the normality assumption of the model. If these dots follow the line on the graph, the normality assumption is valid. In our case, it is relatively close to the line. So, we can say that our model is valid in terms of normality.
We can employ the step function to perform forward or backward selection of important features/predictors. It works for both lm() and glm() models. In most cases, backward selection is preferable because it starts with the full model and tends to retain larger models. There are various criteria to evaluate a model; commonly used criteria include the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the adjusted \(R^2\), etc. Let's compare the backward and forward model selection approaches. The step function argument direction allows this control (the default is both, which considers both adding and dropping terms at each step). Later, in Chapter 11, we will present details about alternative feature selection approaches.
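The (hidden) calls generating the three outputs below are presumably of the following form (a sketch, assuming the full model object fit from above):
step(fit, direction = "backward")
step(fit, direction = "forward")
step(fit, direction = "both")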
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5847.4
## <none> 279793 5871.0
## - Age 1 14090 293883 5919.8
## - Position 8 20301 300095 5927.5
## - Height 1 95356 375149 6172.3
##
## Step: AIC=5847.45
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5847.4
## - Age 1 14616 303877 5896.4
## - Position 8 20406 309668 5901.9
## - Height 1 100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) PositionDesignated_Hitter
## -168.0474 8.6968
## PositionFirst_Baseman PositionOutfielder
## 2.7780 -6.0457
## PositionRelief_Pitcher PositionSecond_Baseman
## -7.7782 -13.0267
## PositionShortstop PositionStarting_Pitcher
## -16.4821 -7.3961
## PositionThird_Baseman Height
## -4.1361 4.7639
## Age
## 0.8771
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Call:
## lm(formula = Weight ~ Team + Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) TeamARZ
## -164.9995 7.1881
## TeamATL TeamBAL
## -1.5631 -5.3128
## TeamBOS TeamCHC
## -0.2838 0.4026
## TeamCIN TeamCLE
## 2.1051 -1.3160
## TeamCOL TeamCWS
## -3.7836 4.2944
## TeamDET TeamFLA
## 2.3024 2.6985
## TeamHOU TeamKC
## -0.6808 -4.7664
## TeamLA TeamMIN
## 2.8598 2.1269
## TeamMLW TeamNYM
## 4.2897 -1.9736
## TeamNYY TeamOAK
## 1.7483 -0.5464
## TeamPHI TeamPIT
## -6.8486 4.3023
## TeamSD TeamSEA
## 2.6133 -0.9147
## TeamSF TeamSTL
## 0.8411 -1.1341
## TeamTB TeamTEX
## -2.6616 -0.7695
## TeamTOR TeamWAS
## 1.3943 -1.7555
## PositionDesignated_Hitter PositionFirst_Baseman
## 8.9037 2.4237
## PositionOutfielder PositionRelief_Pitcher
## -6.2636 -7.7695
## PositionSecond_Baseman PositionShortstop
## -13.0843 -16.9562
## PositionStarting_Pitcher PositionThird_Baseman
## -7.3599 -4.6035
## Height Age
## 4.7175 0.8906
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5847.4
## <none> 279793 5871.0
## - Age 1 14090 293883 5919.8
## - Position 8 20301 300095 5927.5
## - Height 1 95356 375149 6172.3
##
## Step: AIC=5847.45
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5847.4
## + Team 29 9468 279793 5871.0
## - Age 1 14616 303877 5896.4
## - Position 8 20406 309668 5901.9
## - Height 1 100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) PositionDesignated_Hitter
## -168.0474 8.6968
## PositionFirst_Baseman PositionOutfielder
## 2.7780 -6.0457
## PositionRelief_Pitcher PositionSecond_Baseman
## -7.7782 -13.0267
## PositionShortstop PositionStarting_Pitcher
## -16.4821 -7.3961
## PositionThird_Baseman Height
## -4.1361 4.7639
## Age
## 0.8771
We can observe that forward selection retains the whole model, whereas the better-performing feature selection here is produced by backward stepwise selection. Both backward and forward feature selection methods utilize greedy algorithms and do not guarantee an optimal model selection result. Identifying the best feature subset would require exploring every possible combination of the predictors, which is often not practically feasible due to the computational complexity of examining all \(2^k\) possible subsets of \(k\) features.
Alternatively, we can choose models based on various information criteria.
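A sketch of such calls, using the step() penalty parameter k (\(k=2\) corresponds to AIC and \(k=\log(n)\) to BIC); the exact invocations are assumptions:
step(fit, k = 2)                # AIC-based selection
step(fit, k = log(nrow(mlb)))   # BIC-based selection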
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5847.4
## <none> 279793 5871.0
## - Age 1 14090 293883 5919.8
## - Position 8 20301 300095 5927.5
## - Height 1 95356 375149 6172.3
##
## Step: AIC=5847.45
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5847.4
## - Age 1 14616 303877 5896.4
## - Position 8 20406 309668 5901.9
## - Height 1 100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) PositionDesignated_Hitter
## -168.0474 8.6968
## PositionFirst_Baseman PositionOutfielder
## 2.7780 -6.0457
## PositionRelief_Pitcher PositionSecond_Baseman
## -7.7782 -13.0267
## PositionShortstop PositionStarting_Pitcher
## -16.4821 -7.3961
## PositionThird_Baseman Height
## -4.1361 4.7639
## Age
## 0.8771
## Start: AIC=6068.69
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5901.8
## <none> 279793 6068.7
## - Position 8 20301 300095 6085.6
## - Age 1 14090 293883 6112.5
## - Height 1 95356 375149 6365.0
##
## Step: AIC=5901.8
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5901.8
## - Position 8 20406 309668 5916.8
## - Age 1 14616 303877 5945.8
## - Height 1 100435 389697 6203.0
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Coefficients:
## (Intercept) PositionDesignated_Hitter
## -168.0474 8.6968
## PositionFirst_Baseman PositionOutfielder
## 2.7780 -6.0457
## PositionRelief_Pitcher PositionSecond_Baseman
## -7.7782 -13.0267
## PositionShortstop PositionStarting_Pitcher
## -16.4821 -7.3961
## PositionThird_Baseman Height
## -4.1361 4.7639
## Age
## 0.8771
Setting the parameter \(k = 2\) yields the genuine AIC criterion, while \(k = \log(n)\) corresponds to BIC. Let's try to evaluate the model performance again.
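A minimal sketch of this re-evaluation (the object name fit2 is an assumption, consistent with the later references to fit2$fitted.values):
fit2 <- step(fit, k = 2)
summary(fit2)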
## Start: AIC=5871.04
## Weight ~ Team + Position + Height + Age
##
## Df Sum of Sq RSS AIC
## - Team 29 9468 289262 5847.4
## <none> 279793 5871.0
## - Age 1 14090 293883 5919.8
## - Position 8 20301 300095 5927.5
## - Height 1 95356 375149 6172.3
##
## Step: AIC=5847.45
## Weight ~ Position + Height + Age
##
## Df Sum of Sq RSS AIC
## <none> 289262 5847.4
## - Age 1 14616 303877 5896.4
## - Position 8 20406 309668 5901.9
## - Height 1 100435 389697 6153.6
##
## Call:
## lm(formula = Weight ~ Position + Height + Age, data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.427 -10.855 -0.344 10.110 75.301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -168.0474 19.0351 -8.828 < 2e-16 ***
## PositionDesignated_Hitter 8.6968 4.4258 1.965 0.049679 *
## PositionFirst_Baseman 2.7780 2.9942 0.928 0.353741
## PositionOutfielder -6.0457 2.2778 -2.654 0.008072 **
## PositionRelief_Pitcher -7.7782 2.1913 -3.550 0.000403 ***
## PositionSecond_Baseman -13.0267 2.9531 -4.411 1.14e-05 ***
## PositionShortstop -16.4821 3.0372 -5.427 7.16e-08 ***
## PositionStarting_Pitcher -7.3961 2.2959 -3.221 0.001316 **
## PositionThird_Baseman -4.1361 3.1656 -1.307 0.191647
## Height 4.7639 0.2528 18.847 < 2e-16 ***
## Age 0.8771 0.1220 7.190 1.25e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.82 on 1023 degrees of freedom
## Multiple R-squared: 0.365, Adjusted R-squared: 0.3588
## F-statistic: 58.81 on 10 and 1023 DF, p-value: < 2.2e-16
#plot(fit2, which = 1:2)
plot_ly(x=fit2$fitted.values, y=fit2$residuals, type="scatter", mode="markers") %>%
layout(title="LM: Fitted-values vs. Model-Residuals",
xaxis=list(title="Fitted"),
yaxis = list(title="Residuals"))
# compute the quantiles
QQ <- qqplot(fit2$fitted.values, fit2$residuals, plot.it=FALSE)
# take a smaller sample size to expedite the viz
ind <- sample(1:length(QQ$x), 1000, replace = FALSE)
plot_ly() %>%
add_markers(x=~QQ$x, y=~QQ$y, name="Quantiles Scatter", type="scatter", mode="markers") %>%
add_trace(x = ~c(160,260), y = ~c(-50,80), type="scatter", mode="lines",
line = list(color = "red", width = 4), name="Line", showlegend=F) %>%
layout(title='Quantile plot',
xaxis = list(title="Fitted"),
yaxis = list(title="Residuals"),
legend = list(orientation = 'h'))
Sometimes, simpler models are preferable, even when there is a little bit of loss of performance. In this case, we have a simpler model and \(R^2=0.365\). The whole model is still very significant. We can see that observations \(65\), \(160\) and \(237\) are relatively far from the bulk of other residuals. These cases represent potentially influential points, or outliers.
Also, we can observe the leverage points - those that are either outliers, influential points, or both. In a regression model setting, observation leverage is the relative distance of the observation (data point) from the mean of the explanatory variable. Observations near the mean of the explanatory variable have low leverage and those far from the mean have high leverage. Yet, not all points of high leverage are necessarily influential.
# Half-normal plot for leverages
# install.packages("faraway")
library(faraway)
halfnorm(lm.influence(fit)$hat, nlab = 2, ylab="Leverages")
## Team Position Height Weight Age
## 226 NYY Designated_Hitter 75 230 36.14
## 879 SD Designated_Hitter 73 200 25.60
## Team Position Height Weight
## NYM : 38 Relief_Pitcher :315 Min. :67.0 Min. :150.0
## ATL : 37 Starting_Pitcher:221 1st Qu.:72.0 1st Qu.:187.0
## DET : 37 Outfielder :194 Median :74.0 Median :200.0
## OAK : 37 Catcher : 76 Mean :73.7 Mean :201.7
## BOS : 36 Second_Baseman : 58 3rd Qu.:75.0 3rd Qu.:215.0
## CHC : 36 First_Baseman : 55 Max. :83.0 Max. :290.0
## (Other):813 (Other) :115
## Age
## Min. :20.90
## 1st Qu.:25.44
## Median :27.93
## Mean :28.74
## 3rd Qu.:31.23
## Max. :48.52
##
A deeper discussion of variable selection, controlling the false discovery rate, is provided in Chapter 11.
In linear regression, the relationship between the independent and dependent variables is assumed to be affine. However, in general, this might not be the case; the relationship between age and weight could be quadratic, logarithmic, exponential, etc. For instance, middle-aged people may be expected to gain weight dramatically and then lose it as they age. Capturing such a pattern requires adding a non-linear (quadratic) term to the linear model. Note that the model is still referred to as linear, because it remains linear in the coefficients and has a linear matrix representation.
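A sketch of this step, adding a squared-age column named age2 (matching the coefficient name in the output below) and refitting the full model; the object name fit_q is hypothetical:
mlb$age2 <- (mlb$Age)^2
fit_q <- lm(Weight ~ ., data = mlb)
summary(fit_q)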
##
## Call:
## lm(formula = Weight ~ ., data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.068 -10.775 -1.021 9.922 74.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -209.07068 27.49529 -7.604 6.65e-14 ***
## TeamARZ 7.41943 4.25154 1.745 0.081274 .
## TeamATL -1.43167 3.96793 -0.361 0.718318
## TeamBAL -5.38735 4.01119 -1.343 0.179552
## TeamBOS -0.06614 3.99633 -0.017 0.986799
## TeamCHC 0.14541 3.98833 0.036 0.970923
## TeamCIN 2.24022 3.98571 0.562 0.574201
## TeamCLE -1.07546 4.02870 -0.267 0.789563
## TeamCOL -3.87254 4.02069 -0.963 0.335705
## TeamCWS 4.20933 4.09393 1.028 0.304111
## TeamDET 2.66990 3.96769 0.673 0.501160
## TeamFLA 3.14627 4.12989 0.762 0.446343
## TeamHOU -0.77230 4.05526 -0.190 0.849000
## TeamKC -4.90984 4.01648 -1.222 0.221837
## TeamLA 3.13554 4.07514 0.769 0.441820
## TeamMIN 2.09951 4.08631 0.514 0.607512
## TeamMLW 4.16183 4.01646 1.036 0.300363
## TeamNYM -1.25057 3.95424 -0.316 0.751870
## TeamNYY 1.67825 4.11502 0.408 0.683482
## TeamOAK -0.68235 3.95951 -0.172 0.863212
## TeamPHI -6.85071 3.98672 -1.718 0.086039 .
## TeamPIT 4.12683 4.01348 1.028 0.304086
## TeamSD 2.59525 4.08310 0.636 0.525179
## TeamSEA -0.67316 4.04471 -0.166 0.867853
## TeamSF 1.06038 4.04481 0.262 0.793255
## TeamSTL -1.38669 4.11234 -0.337 0.736037
## TeamTB -2.44396 4.08716 -0.598 0.550003
## TeamTEX -0.68740 4.02023 -0.171 0.864270
## TeamTOR 1.24439 4.06029 0.306 0.759306
## TeamWAS -1.87599 3.99594 -0.469 0.638835
## PositionDesignated_Hitter 8.94440 4.44417 2.013 0.044425 *
## PositionFirst_Baseman 2.55100 3.00014 0.850 0.395368
## PositionOutfielder -6.25702 2.27372 -2.752 0.006033 **
## PositionRelief_Pitcher -7.68904 2.19166 -3.508 0.000471 ***
## PositionSecond_Baseman -13.01400 2.95787 -4.400 1.20e-05 ***
## PositionShortstop -16.82243 3.03494 -5.543 3.81e-08 ***
## PositionStarting_Pitcher -7.08215 2.29615 -3.084 0.002096 **
## PositionThird_Baseman -4.66452 3.16249 -1.475 0.140542
## Height 4.71888 0.25578 18.449 < 2e-16 ***
## Age 3.82295 1.30621 2.927 0.003503 **
## age2 -0.04791 0.02124 -2.255 0.024327 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.74 on 993 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3643
## F-statistic: 15.8 on 40 and 993 DF, p-value: < 2.2e-16
Including a quadratic factor may change the overall \(R^2\).
As discussed earlier, middle-aged people might exhibit a different
weight pattern, compared to younger or older people. The overall pattern
may not always be cumulative, i.e., weight may represent two separate
trajectories for young and middle-aged people. For concreteness, let’s
use the age of 30 as a threshold segregating young and middle-aged
people. People over 30 may have a steeper line for weight change than
those under 30. Here we use an ifelse()
conditioning
statement to create a new indicator variable (\(age30\)) based on this threshold value.
mlb$age30 <- ifelse(mlb$Age>=30, 1, 0)
fit3 <- lm(Weight ~ Team+Position+Age+age30+Height, data=mlb)
summary(fit3)
##
## Call:
## lm(formula = Weight ~ Team + Position + Age + age30 + Height,
## data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.313 -11.166 -0.916 10.044 73.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -159.8884 19.8862 -8.040 2.54e-15 ***
## TeamARZ 7.4096 4.2627 1.738 0.082483 .
## TeamATL -1.4379 3.9765 -0.362 0.717727
## TeamBAL -5.3393 4.0187 -1.329 0.184284
## TeamBOS -0.1985 4.0034 -0.050 0.960470
## TeamCHC 0.4669 3.9947 0.117 0.906976
## TeamCIN 2.2124 3.9939 0.554 0.579741
## TeamCLE -1.1624 4.0371 -0.288 0.773464
## TeamCOL -3.6842 4.0290 -0.914 0.360717
## TeamCWS 4.1920 4.1025 1.022 0.307113
## TeamDET 2.4708 3.9746 0.622 0.534314
## TeamFLA 2.8563 4.1352 0.691 0.489903
## TeamHOU -0.4964 4.0659 -0.122 0.902846
## TeamKC -4.7138 4.0238 -1.171 0.241692
## TeamLA 2.9194 4.0814 0.715 0.474586
## TeamMIN 2.2885 4.0965 0.559 0.576528
## TeamMLW 4.4749 4.0269 1.111 0.266731
## TeamNYM -1.8173 3.9510 -0.460 0.645659
## TeamNYY 1.7074 4.1229 0.414 0.678867
## TeamOAK -0.3388 3.9707 -0.085 0.932012
## TeamPHI -6.6192 3.9993 -1.655 0.098220 .
## TeamPIT 4.6716 4.0332 1.158 0.247029
## TeamSD 2.8600 4.0965 0.698 0.485243
## TeamSEA -1.0121 4.0518 -0.250 0.802809
## TeamSF 1.0244 4.0545 0.253 0.800587
## TeamSTL -1.1094 4.1187 -0.269 0.787703
## TeamTB -2.4485 4.0980 -0.597 0.550312
## TeamTEX -0.6112 4.0300 -0.152 0.879485
## TeamTOR 1.3959 4.0674 0.343 0.731532
## TeamWAS -1.4189 4.0139 -0.354 0.723784
## PositionDesignated_Hitter 9.2378 4.4621 2.070 0.038683 *
## PositionFirst_Baseman 2.6074 3.0096 0.866 0.386501
## PositionOutfielder -6.0408 2.2863 -2.642 0.008367 **
## PositionRelief_Pitcher -7.5100 2.2072 -3.403 0.000694 ***
## PositionSecond_Baseman -12.8870 2.9683 -4.342 1.56e-05 ***
## PositionShortstop -16.8912 3.0406 -5.555 3.56e-08 ***
## PositionStarting_Pitcher -7.0825 2.3099 -3.066 0.002227 **
## PositionThird_Baseman -4.4307 3.1719 -1.397 0.162773
## Age 0.6904 0.2153 3.207 0.001386 **
## age30 2.2636 1.9749 1.146 0.251992
## Height 4.7113 0.2563 18.380 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.77 on 993 degrees of freedom
## Multiple R-squared: 0.3866, Adjusted R-squared: 0.3619
## F-statistic: 15.65 on 40 and 993 DF, p-value: < 2.2e-16
This model performs worse than the quadratic model in terms of \(R^2\). Moreover, age30
does
not appear as a significant predictor of weight. Therefore, such a
pseudo factor does not contribute to explaining the observed variability
in the dataset. However, if including such a binary indicator (dummy variable, or one-hot-encoding feature) improved the model (e.g., increased \(R^2\)), then we could keep the binary feature in the model and interpret its coefficient estimate as a difference of expectations, i.e., the expected difference in weight between players aged 30 or older and younger players, holding the other covariates fixed.
So far, we only accounted for the individual, independent effects of each variable included in the linear model. It is also possible that pairs of features jointly affect the dependent outcome variable. Interactions represent the combined effects of two features on the outcome. If we are uncertain whether two variables interact, we could include them along with their interaction term in the model and then test the significance of the interaction. If the interaction is significant, it remains in the model; if not, the term can be dropped.
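A sketch of fitting such an interaction model; the formula mirrors the Call shown in the output below, and the object name fit_int is hypothetical:
fit_int <- lm(Weight ~ Team + Height + Age*Position + age2, data = mlb)
summary(fit_int)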
##
## Call:
## lm(formula = Weight ~ Team + Height + Age * Position + age2,
## data = mlb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.761 -11.049 -0.761 9.911 75.533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -199.15403 29.87269 -6.667 4.35e-11 ***
## TeamARZ 8.10376 4.26339 1.901 0.0576 .
## TeamATL -0.81743 3.97899 -0.205 0.8373
## TeamBAL -4.64820 4.03972 -1.151 0.2502
## TeamBOS 0.37698 4.00743 0.094 0.9251
## TeamCHC 0.33104 3.99507 0.083 0.9340
## TeamCIN 2.56023 3.99603 0.641 0.5219
## TeamCLE -0.66254 4.03154 -0.164 0.8695
## TeamCOL -3.72098 4.03759 -0.922 0.3570
## TeamCWS 4.63266 4.10884 1.127 0.2598
## TeamDET 3.21380 3.98231 0.807 0.4199
## TeamFLA 3.56432 4.14902 0.859 0.3905
## TeamHOU -0.38733 4.07249 -0.095 0.9242
## TeamKC -4.66678 4.02384 -1.160 0.2464
## TeamLA 3.51766 4.09400 0.859 0.3904
## TeamMIN 2.31585 4.10502 0.564 0.5728
## TeamMLW 4.34793 4.02501 1.080 0.2803
## TeamNYM -0.28505 3.98537 -0.072 0.9430
## TeamNYY 1.87847 4.12774 0.455 0.6491
## TeamOAK -0.23791 3.97729 -0.060 0.9523
## TeamPHI -6.25671 3.99545 -1.566 0.1177
## TeamPIT 4.18719 4.01944 1.042 0.2978
## TeamSD 2.97028 4.08838 0.727 0.4677
## TeamSEA -0.07220 4.05922 -0.018 0.9858
## TeamSF 1.35981 4.07771 0.333 0.7388
## TeamSTL -1.23460 4.11960 -0.300 0.7645
## TeamTB -1.90885 4.09592 -0.466 0.6413
## TeamTEX -0.31570 4.03146 -0.078 0.9376
## TeamTOR 1.73976 4.08565 0.426 0.6703
## TeamWAS -1.43933 4.00274 -0.360 0.7192
## Height 4.70632 0.25646 18.351 < 2e-16 ***
## Age 3.32733 1.37088 2.427 0.0154 *
## PositionDesignated_Hitter -44.82216 30.68202 -1.461 0.1444
## PositionFirst_Baseman 23.51389 20.23553 1.162 0.2455
## PositionOutfielder -13.33140 15.92500 -0.837 0.4027
## PositionRelief_Pitcher -16.51308 15.01240 -1.100 0.2716
## PositionSecond_Baseman -26.56932 20.18773 -1.316 0.1884
## PositionShortstop -27.89454 20.72123 -1.346 0.1786
## PositionStarting_Pitcher -2.44578 15.36376 -0.159 0.8736
## PositionThird_Baseman -10.20102 23.26121 -0.439 0.6611
## age2 -0.04201 0.02170 -1.936 0.0531 .
## Age:PositionDesignated_Hitter 1.77289 1.00506 1.764 0.0780 .
## Age:PositionFirst_Baseman -0.71111 0.67848 -1.048 0.2949
## Age:PositionOutfielder 0.24147 0.53650 0.450 0.6527
## Age:PositionRelief_Pitcher 0.30374 0.50564 0.601 0.5482
## Age:PositionSecond_Baseman 0.46281 0.68281 0.678 0.4981
## Age:PositionShortstop 0.38257 0.70998 0.539 0.5901
## Age:PositionStarting_Pitcher -0.17104 0.51976 -0.329 0.7422
## Age:PositionThird_Baseman 0.18968 0.79561 0.238 0.8116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.73 on 985 degrees of freedom
## Multiple R-squared: 0.3945, Adjusted R-squared: 0.365
## F-statistic: 13.37 on 48 and 985 DF, p-value: < 2.2e-16
In this example, we see that the overall \(R^2\) improved by including a \(Age\times Position\) interaction and we can interpret its significance level.
In Chapter 5, we will discuss decision trees built by multiple conditional logical decisions that lead to natural classifications of all observations. We could also add regression into decision tree modeling to make numerical predictions.
Numeric prediction trees are built in the same way as classification trees. In Chapter 5 we will show how data are partitioned first by a divide-and-conquer strategy based on features. The homogeneity of the resulting classification trees is measured by various metrics, e.g., entropy. In regression-tree prediction, node homogeneity (which is used to determine if a node needs to be split) is measured by various statistics such as variance, standard deviation, or absolute deviation from the mean. A common splitting criterion for decision trees is the standard deviation reduction (SDR).
\[SDR=sd(T)-\sum_{i=1}^c \frac{|T_i|}{|T|} \times sd(T_i),\]
where \(sd(T)\) is the standard deviation of the original data, the sum runs over the \(c\) segments produced by the split, \(\frac{|T_i|}{|T|}\) is the proportion of observations falling in the \(i^{th}\) segment relative to the total number of observations, and \(sd(T_i)\) is the standard deviation of the \(i^{th}\) segment.
Let’s look at a simple example
\[{\text{Original data}}:\{1, 2, 3, 3, 4, 5, 6, 6, 7, 8\}\] \[{\text{Split method 1}}:\left \{\underbrace{1, 2, 3}_{T_1}\ |\ \underbrace{3, 4, 5, 6, 6, 7, 8}_{T_2}\right \}\] \[{\text{Split method 2}}:\left \{\underbrace{1, 2, 3, 3, 4, 5}_{T_1'}\ | \ \underbrace{6, 6, 7, 8}_{T_2'}\right \}.\]
In split method 1, \(T_1=\{1, 2, 3\}\), \(T_2=\{3, 4, 5, 6, 6, 7, 8\}\). In split method 2, \(T_1'=\{1, 2, 3, 3, 4, 5\}\), \(T_2'=\{6, 6, 7, 8\}\).
ori<-c(1, 2, 3, 3, 4, 5, 6, 6, 7, 8)
at1<-c(1, 2, 3)
at2<-c(3, 4, 5, 6, 6, 7, 8)
bt1<-c(1, 2, 3, 3, 4, 5)
bt2<-c(6, 6, 7, 8)
sdr_a<-sd(ori)-(length(at1)/length(ori)*sd(at1)+length(at2)/length(ori)*sd(at2))
sdr_b<-sd(ori)-(length(bt1)/length(ori)*sd(bt1)+length(bt2)/length(ori)*sd(bt2))
sdr_a
sdr_b
## [1] 0.7702557
## [1] 1.041531
The method length()
is used above to get the number of
elements in a specific vector.
Larger SDR indicates greater reduction in standard deviation after splitting. Here we have split method 2 yielding greater SDR, so the tree splitting decision would prefer the second method, which is expected to produce more homogeneous subsets (children nodes), compared to method 1.
Now, the tree will be split under bt1
and
bt2
following the same rules (greater SDR wins). Assume we
cannot split further (bt1
and bt2
are terminal
nodes). The observations classified into bt1
will be
predicted with \(mean(bt1)=3\) and
those classified as bt2
with \(mean(bt2)=6.75\).
Bayesian Additive Regression Trees (BART) are sum-of-trees models that rely on boosting an ensemble of constituent Bayesian-regularized regression trees.
The R
packages BayesTree
and
BART
provide computational implementation of fitting BART
models to data. In a supervised setting, where \(x\) and \(y\) represent the predictors and the outcome, the BART model is mathematically represented as:
\[y=f(x)+\epsilon = \sum_{j=1}^m{f_j(x)} +\epsilon.\]
More specifically,
\[y_i=f(x_i)+\epsilon = \sum_{j=1}^m{f_j(x_i)} +\epsilon,\ \forall 1\leq i\leq n.\]
The residuals are typically assumed to be white noise, \(\epsilon_i \sim N(0,\sigma^2), \ iid\). The function \(f\) represents the boosted ensemble of weaker regression trees, \(f_j=g( \cdot | T_j, M_j)\), where \(T_j\) and \(M_j\) represent the \(j^{th}\) tree and the set of values, \(\mu_{k,j}\), assigned to each terminal node \(k\) in \(T_j\), respectively.
The BART model may be estimated via Gibbs sampling, e.g., a Bayesian backfitting Markov chain Monte Carlo (MCMC) algorithm. For instance, we iteratively sample \((T_j ,M_j)\) and \(\sigma\), conditional on all other variables and the data \((x,y)\), for each \(j\), until a certain convergence criterion is met. Given \(\sigma\), the conditional sampling of \((T_j ,M_j)\) may be accomplished via the partial residuals
\[\epsilon_{i,j_o} = y_i - \sum_{j\not= j_o}^m{g( x_i | T_j, M_j)}, \quad \forall\ 1\leq i\leq n.\]
For prediction on new data, \(X\), data-driven priors on \(\sigma\) and the parameters defining \((T_j ,M_j)\) may be used to allow sampling a model from the posterior distribution:
\[p\left (\sigma, \{(T_1 ,M_1), (T_2 ,M_2), \cdots, (T_m ,M_m)\} | X, Y \right )= p(\sigma) \prod_{(tree)\ j=1}^m{\left ( \left ( \prod_{(node)\ k} {p(\mu_{k,j}|T_j)} \right )p(T_j) \right )} .\] In this posterior factorization, model regularization is achieved by four criteria:
Using the BART model to forecast the response corresponding to newly observed data \(x\) is achieved by using one individual prediction model (or by ensembling/averaging multiple models) near algorithmic convergence.
The BART algorithm involves three steps:
This example illustrates a simple 1D BART simulation, based on the analytical process model \(h(x)=x^3\sin(x)\), where we can easily track the performance of the BART estimator.
# simulate training data
sig = 0.2 # sigma
func = function(x) {
return (sin(x) * (x^(3)))
}
set.seed(1234)
n = 300
x = sort(2*runif(n)-1) # define the input
y = func(x) + sig * rnorm(n) # define the output
# xtest: values we want to estimate func(x) at; this is also our prior prediction for y.
xtest = seq(-pi, pi, by=0.2)
##plot simulated data
# plot(x, y, cex=1.0, col="gray")
# points(xtest, rep(0, length(xtest)), col="blue", pch=1, cex=1.5)
# legend("top", legend=c("Data", "(Conjugate) Flat Prior"),
# col=c("gray", "blue"), lwd=c(2,2), lty=c(3,3), bty="n", cex=0.9, seg.len=3)
plot_ly(x=x, y=y, type="scatter", mode="markers", name="data") %>%
add_trace(x=xtest, y=rep(0, length(xtest)), mode="markers", name="prior") %>%
layout(title='(Conjugate) Flat Prior',
xaxis = list(title="X", range = c(-1.3, 1.3)),
yaxis = list(title="Y", range = c(-0.6, 1.6)),
legend = list(orientation = 'h'))
# run the weighted BART (BART::wbart) on the simulated data
# install.packages("BART")
library(BART)
set.seed(1234) # set seed for reproducibility of MCMC
# nskip=Number of burn-in MCMC iterations; ndpost=number of posterior draws to return
model_bart <- wbart(x.train=as.data.frame(x), y.train=y, x.test=as.data.frame(xtest),
nskip=300, ndpost=1000, printevery=1000)
## *****Into main of wbart
## *****Data:
## data:n,p,np: 300, 1, 32
## y1,yn: 0.572918, 1.073126
## x1,x[n*p]: -0.993435, 0.997482
## xp1,xp[np*p]: -3.141593, 3.058407
## *****Number of Trees: 200
## *****Number of Cut Points: 100 ... 100
## *****burn and ndpost: 300, 1000
## *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,0.032889,3.000000,0.018662
## *****sigma: 0.309524
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,1,0
## *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
## *****printevery: 1000
## *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
##
## MCMC
## done 0 (out of 1300)
## done 1000 (out of 1300)
## time: 1s
## check counts
## trcnt,tecnt,temecnt,treedrawscnt: 1000,1000,1000,1000
## [1] "sigma" "yhat.train.mean" "yhat.train" "yhat.test.mean"
## [5] "yhat.test" "varcount" "varprob" "treedraws"
## [9] "proc.time" "mu" "varcount.mean" "varprob.mean"
## [13] "rm.const"
## [1] 1000 32
# The (l,j) element of the matrix `yhat.test` represents the l^{th} draw of `func` evaluated at the j^{th} value of x.test
# A matrix with ndpost rows and nrow(x.train) columns. Each row corresponds to a draw f* from the posterior of f
# and each column corresponds to a row of x.train. The (i,j) value is f*(x) for the i^th kept draw of f
# and the j^th row of x.train
# # plot the data, the BART Fit, and the uncertainty
# plot(x, y, cex=1.2, cex.axis=0.8, cex.lab=0.8, mgp=c(1.3,.3,0), tcl=-0.2, col="gray")
# lines(xtest, func(xtest), col="blue", lty=1, lwd=2)
# lines(xtest, apply(model_bart$yhat.test, 2, mean), col="green", lwd=2, lty=2) # show the mean of f(x_j)
# quant_marg = apply(model_bart$yhat.test, 2, quantile, probs=c(0.025, 0.975)) # plot the 2.5% and 97.5% quantiles
# lines(xtest, quant_marg[1,], col="red", lty=1, lwd=2)
# lines(xtest, quant_marg[2,], col="red", lty=1,lwd=2)
# legend("top", legend=c("Data", "True Signal","Posterior Mean","95% CI"),
# col=c("black", "blue","green","red"), lwd=c(2, 2,2,2), lty=c(3, 1,1,1), bty="n", cex=0.9, seg.len=3)
quant_marg = apply(model_bart$yhat.test, 2, quantile, probs=c(0.025, 0.975)) # plot the 2.5% and 97.5% quantiles
plot_ly(x=x, y=y, type="scatter", mode="markers", name="Data") %>%
add_trace(x=xtest, y=func(xtest), mode="markers", name="True Signal") %>%
add_trace(x=xtest, y=apply(model_bart$yhat.test, 2, mean), mode="lines", name="Posterior Mean") %>%
add_trace(x=xtest, y=apply(model_bart$yhat.test, 2, quantile, probs=c(0.025, 0.975)),
mode="markers", name="95% CI") %>%
add_trace(x=xtest, y=quant_marg[1,], mode="lines", name="Lower Band") %>%
add_trace(x=xtest, y=quant_marg[2,], mode="lines", name="Upper Band") %>%
layout(title='BART Model (n=300)',
xaxis = list(title="X", range = c(-1.3, 1.3)),
yaxis = list(title="Y", range = c(-0.6, 1.6)),
legend = list(orientation = 'h'))
## [1] "sigma" "yhat.train.mean" "yhat.train" "yhat.test.mean"
## [5] "yhat.test" "varcount" "varprob" "treedraws"
## [9] "proc.time" "mu" "varcount.mean" "varprob.mean"
## [13] "rm.const"
## [1] 1000 300
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.441e-16 -8.327e-17 6.939e-18 -8.130e-18 9.714e-17 5.551e-16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.776e-16 1.535e-16 3.331e-16 2.661e-16 4.441e-16 4.441e-16
If we increase the sample size, \(n\), the computational complexity increases and the BART model bounds should get tighter.
n = 3000 # 300 --> 3,000
set.seed(1234)
x = sort(2*runif(n)-1)
y = func(x) + sig*rnorm(n)
model_bart_2 <- wbart(x.train=as.data.frame(x), y.train=y, x.test=as.data.frame(xtest),
nskip=300, ndpost=1000, printevery=1000)
## *****Into main of wbart
## *****Data:
## data:n,p,np: 3000, 1, 32
## y1,yn: 1.021931, 0.626872
## x1,x[n*p]: -0.999316, 0.998606
## xp1,xp[np*p]: -3.141593, 3.058407
## *****Number of Trees: 200
## *****Number of Cut Points: 100 ... 100
## *****burn and ndpost: 300, 1000
## *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,0.035536,3.000000,0.017828
## *****sigma: 0.302529
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,1,0
## *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
## *****printevery: 1000
## *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
##
## MCMC
## done 0 (out of 1300)
## done 1000 (out of 1300)
## time: 8s
## check counts
## trcnt,tecnt,temecnt,treedrawscnt: 1000,1000,1000,1000
# plot(x, y, cex=1.2, cex.axis=0.8, cex.lab=0.8, mgp=c(1.3,.3,0), tcl=-0.2, col="gray")
# lines(xtest, func(xtest), col="blue", lty=1, lwd=2)
# lines(xtest, apply(model_bart_2$yhat.test, 2, mean), col="green", lwd=2, lty=2) # show the mean of f(x_j)
# quant_marg = apply(model_bart_2$yhat.test, 2, quantile, probs=c(0.025, 0.975)) # plot the 2.5% and 97.5% CI
# lines(xtest, quant_marg[1,], col="red", lty=1, lwd=2)
# lines(xtest, quant_marg[2,], col="red", lty=1,lwd=2)
# legend("top",legend=c("Data (n=3,000)", "True Signal", "Posterior Mean","95% CI"),
# col=c("gray", "blue","green","red"), lwd=c(2, 2,2,2), lty=c(3, 1,1,1), bty="n", cex=0.9, seg.len=3)
quant_marg2 = apply(model_bart_2$yhat.test, 2, quantile, probs=c(0.025, 0.975)) # plot the 2.5% and 97.5% CI
plot_ly(x=x, y=y, type="scatter", mode="markers", name="Data") %>%
add_trace(x=xtest, y=func(xtest), mode="markers", name="True Signal") %>%
add_trace(x=xtest, y=apply(model_bart_2$yhat.test, 2, mean), mode="lines", name="Posterior Mean") %>%
add_trace(x=xtest, y=apply(model_bart_2$yhat.test, 2, quantile, probs=c(0.025, 0.975)),
mode="markers", name="95% CI") %>%
add_trace(x=xtest, y=quant_marg2[1,], mode="lines", name="Lower Band") %>%
add_trace(x=xtest, y=quant_marg2[2,], mode="lines", name="Upper Band") %>%
layout(title='BART Model (n=3,000)',
xaxis = list(title="X", range = c(-1.3, 1.3)),
yaxis = list(title="Y", range = c(-0.6, 1.6)),
legend = list(orientation = 'h'))
In this second BART example, we will simulate \(n=5,000\) cases with \(p=20\) features.
# simulate data
set.seed(1234)
n=5000; p=20
beta = 3*(1:p)/p
sig=1.0
X = matrix(rnorm(n*p), ncol=p) # design matrix)
y = 10 + X %*% matrix(beta, ncol=1) + sig*rnorm(n) # outcome
y=as.double(y)
np=100000
Xp = matrix(rnorm(np*p), ncol=p)
set.seed(1234)
t1 <- system.time(model_bart_MD <-
wbart(x.train=as.data.frame(X), y.train=y, x.test=as.data.frame(Xp),
nkeeptrain=200, nkeeptest=100, nkeeptestmean=500, nkeeptreedraws=100, printevery=1000)
)
## *****Into main of wbart
## *****Data:
## data:n,p,np: 5000, 20, 100000
## y1,yn: -6.790191, 5.425931
## x1,x[n*p]: -1.207066, -3.452091
## xp1,xp[np*p]: -0.396587, -0.012973
## *****Number of Trees: 200
## *****Number of Cut Points: 100 ... 100
## *****burn and ndpost: 100, 1000
## *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,1.049775,3.000000,0.189323
## *****sigma: 0.985864
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,20,0
## *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 200,100,500,100
## *****printevery: 1000
## *****skiptr,skipte,skipteme,skiptreedraws: 5,10,2,10
##
## MCMC
## done 0 (out of 1100)
## done 1000 (out of 1100)
## time: 87s
## check counts
## trcnt,tecnt,temecnt,treedrawscnt: 200,100,500,100
## [1] 200 5000
## [1] 100 100000
## [1] "cutpoints" "trees"
# str(model_bart_MD$treedraws$trees)
# The trees are stored in a long character string and there are 100 draws each consisting of 200 trees.
# To predict using the Multi-Dimensional BART model (MD)
t2 <- system.time({pred_model_bart_MD2 <- predict(model_bart_MD, as.data.frame(Xp), mc.cores=6)})
## *****In main of C++ for bart prediction
## tc (threadcount): 6
## number of bart draws: 100
## number of trees in bart sum: 200
## number of x columns: 20
## from x,np,p: 20, 100000
## ***using parallel code
## [1] 100 100000
## user system elapsed
## 39.28 0.16 86.91
## user system elapsed
## 12.78 0.11 7.89
# pred_model_bart_MD2 has row dimension equal to the number of kept tree draws (100) and
# column dimension equal to the number of rows in Xp (100,000).
# Compare the BART predictions using 1K trees vs. 100 kept trees (very similar results)
# plot(model_bart_MD$yhat.test.mean, apply(pred_model_bart_MD2, 2, mean),
# xlab="BART Prediction using 1,000 Trees", ylab="BART Prediction using 100 Kept Trees")
# abline(0,1, col="red", lwd=2)
plot_ly() %>%
add_trace(x = c(-30,50), y = c(-30,50), type="scatter", mode="lines",
line = list(width = 4), name="Consistent BART Prediction (1,000 vs. 100 Trees)") %>%
add_markers(x=model_bart_MD$yhat.test.mean, y=apply(pred_model_bart_MD2, 2, mean),
name="BART Prediction Mean Estimates", type="scatter", mode="markers") %>%
layout(title='Scatter of BART Predictions (1,000 vs. 100 Trees)',
xaxis = list(title="BART Prediction (1,000 Trees)"),
yaxis = list(title="BART Prediction (100 Trees)"),
legend = list(orientation = 'h'))
# Compare BART Prediction to a linear fit
lm_func = lm(y ~ ., data.frame(X,y))
pred_lm = predict(lm_func, data.frame(Xp))
# plot(pred_lm, model_bart_MD$yhat.test.mean, xlab="Linear Model Predictions", ylab="BART Predictions",
# cex=0.5, cex.axis=1.0, cex.lab=0.8, mgp=c(1.3,.3,0), tcl=-0.2)
# abline(0,1, col="red", lwd=2)
plot_ly() %>%
add_markers(x=pred_lm, y=model_bart_MD$yhat.test.mean,
name="LM vs. BART Prediction", type="scatter", mode="markers") %>%
add_trace(x = c(-30,50), y = c(-30,50), type="scatter", mode="lines",
line = list(width = 4), name="Perfectly Consistent LM/BART Prediction") %>%
layout(title='Scatter of Linear Model vs. BART Predictions',
xaxis = list(title="Linear Model Prediction"),
yaxis = list(title="BART Prediction"),
legend = list(orientation = 'h'))
Let’s use BART to model the heart
attack dataset (CaseStudy12_ AdultsHeartAttack_Data.csv). The data
includes about \(150\) observations and
\(8\) features, including hospital
charges (CHARGES
), which will be used as a response
variable.
heart_attack<-read.csv("https://umich.instructure.com/files/1644953/download?download_frd=1",
stringsAsFactors = F)
str(heart_attack)
## 'data.frame': 150 obs. of 8 variables:
## $ Patient : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DIAGNOSIS: int 41041 41041 41091 41081 41091 41091 41091 41091 41041 41041 ...
## $ SEX : chr "F" "F" "F" "F" ...
## $ DRG : int 122 122 122 122 122 121 121 121 121 123 ...
## $ DIED : int 0 0 0 0 0 0 0 0 0 1 ...
## $ CHARGES : chr "4752" "3941" "3657" "1481" ...
## $ LOS : int 10 6 5 2 1 9 15 15 2 1 ...
## $ AGE : int 79 34 76 80 55 84 84 70 76 65 ...
# convert CHARGES (the dependent/response variable) to numerical form.
# NAs are introduced by the coercion, so keep only the complete cases
heart_attack$CHARGES <- as.numeric(heart_attack$CHARGES)
heart_attack <- heart_attack[complete.cases(heart_attack), ]
heart_attack$gender <- ifelse(heart_attack$SEX=="F", 1, 0)
heart_attack <- heart_attack[, -c(1,2,3)]
dim(heart_attack); colnames(heart_attack)
## [1] 148 6
## [1] "DRG" "DIED" "CHARGES" "LOS" "AGE" "gender"
x.train <- as.matrix(heart_attack[ , -3]) # x training, excluding the charges (output)
y.train = heart_attack$CHARGES # y=output for modeling (BART, lm, lasso, etc.)
# Data should be standardized for all model-based predictions (e.g., lm, lasso/glmnet), but
# this is not critical for BART
# We'll just do some random train/test splits and report the out of sample performance of BART and lasso
RMSE <- function(y, yhat) {
return(sqrt(mean((y-yhat)^2)))
}
nd <- 10 # number of train/test splits (ala CV validation)
n <- length(y.train)
ntrain <- floor(0.8*n) # 80:20 train:test split each time
RMSE_BART <- rep(0, nd) # initialize BART and LASSO RMSE vectors
RMSE_LASSO <- rep(0, nd)
pred_BART <- matrix(0.0, n-ntrain,nd) # Initialize the BART and LASSO out-of-sample predictions
pred_LASSO <- matrix(0.0, n-ntrain,nd)
In Chapter
11, we will learn more about LASSO regularized linear
modeling. Now, let’s use the glmnet::glmnet()
method to fit
a LASSO model and compare it to BART using the Heart Attack
hospitalization case-study.
library(glmnet)
for(i in 1:nd) {
set.seed(1234*i)
# train/test split index
train_ind <- sample(1:n, ntrain)
# Outcome (CHARGES)
yTrain <- y.train[train_ind]; yTest <- y.train[-train_ind]
# Features for BART
xBTrain <- x.train[train_ind, ]; xBTest <- x.train[-train_ind, ]
# Features for LASSO (scale)
xLTrain <- apply(x.train[train_ind, ], 2, scale)
xLTest <- apply(x.train[-train_ind, ], 2, scale)
# BART: parallel version of mc.wbart, same arguments as in wbart
# model_BART <- mc.wbart(xBTrain, yTrain, xBTest, mc.cores=6, keeptrainfits=FALSE)
model_BART <- wbart(xBTrain, yTrain, xBTest, printevery=1000)
# LASSO
cv_LASSO <- cv.glmnet(xLTrain, yTrain, family="gaussian", standardize=TRUE)
best_lambda <- cv_LASSO$lambda.min
model_LASSO <- glmnet(xLTrain, yTrain, family="gaussian", lambda=c(best_lambda), standardize=TRUE)
#get predictions on testing data
pred_BART1 <- model_BART$yhat.test.mean
pred_LASSO1 <- predict(model_LASSO, xLTest, s=best_lambda, type="response")[, 1]
#store results
RMSE_BART[i] <- RMSE(yTest, pred_BART1); pred_BART[, i] <- pred_BART1
RMSE_LASSO[i] <- RMSE(yTest, pred_LASSO1); pred_LASSO[, i] <- pred_LASSO1;
}
## *****Into main of wbart
## *****Data:
## data:n,p,np: 118, 5, 30
## y1,yn: 2105.135593, -2641.864407
## x1,x[n*p]: 121.000000, 0.000000
## xp1,xp[np*p]: 122.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 2 ... 1
## *****burn and ndpost: 100, 1000
## *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,290.550176,3.000000,2228477.133405
## *****sigma: 3382.354604
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,5,0
## *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
## *****printevery: 1000
## *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
##
## MCMC
## done 0 (out of 1100)
## done 1000 (out of 1100)
## time: 1s
## check counts
## trcnt,tecnt,temecnt,treedrawscnt: 1000,1000,1000,1000
## ... (analogous wbart console output for the remaining 9 train/test splits omitted) ...
Next, compare the BART and LASSO out-of-sample RMSEs and predictions graphically.
# compare the out of sample RMSE measures
# qqplot(RMSE_BART, RMSE_LASSO)
# abline(0, 1, col="red", lwd=2)
plot_ly() %>%
add_markers(x=RMSE_BART, y=RMSE_LASSO,
name="BART vs. LASSO RMSE", type="scatter", mode="markers") %>%
add_trace(x = c(2800,4000), y = c(2800,4000), type="scatter", mode="lines",
line = list(width = 4), name="Equal RMSE (BART = LASSO)") %>%
layout(title='Scatter of LASSO vs. BART Out-of-Sample RMSE',
xaxis = list(title="RMSE (BART)"),
yaxis = list(title="RMSE (LASSO)"),
legend = list(orientation = 'h'))
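Besides the graphical comparison, we can summarize the average out-of-sample errors across the \(10\) random splits numerically, using the RMSE vectors computed above.
# Compare the mean out-of-sample RMSE of BART and LASSO across the 10 splits
mean(RMSE_BART); mean(RMSE_LASSO)
# Side-by-side summaries of the two RMSE distributions
summary(cbind(BART = RMSE_BART, LASSO = RMSE_LASSO))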
# Next compare the out of sample predictions
model_lm <- lm(B ~ L, data.frame(B=as.double(pred_BART), L=as.double(pred_LASSO)))
x1 <- c(2800,9000)
y1 <- model_lm$coefficients[1] + model_lm$coefficients[2]*x1
# plot(as.double(pred_BART), as.double(pred_LASSO),xlab="BART Predictions",ylab="LASSO Predictions", col="gray")
# abline(0, 1, col="red", lwd=2)
# abline(model_lm$coef, col="blue", lty=2, lwd=3)
# legend("topleft",legend=c("Scatterplot Predictions (BART vs. LASSO)", "Ideal Agreement", "LM (BART~LASSO)"),
# col=c("gray", "red","blue"), lwd=c(2,2,2), lty=c(3,1,1), bty="n", cex=0.9, seg.len=3)
plot_ly() %>%
add_markers(x=as.double(pred_BART), y=as.double(pred_LASSO),
name="BART Predictions vs. Observed Scatter", type="scatter", mode="markers") %>%
add_trace(x = c(2800,9000), y = c(2800,9000), type="scatter", mode="lines",
line = list(width = 4), name="Ideal Agreement") %>%
add_trace(x = x1, y = y1, type="scatter", mode="lines",
line = list(width = 4), name="LM (BART ~ LASSO)") %>%
layout(title='Scatter Plot Predictions (BART vs. LASSO)',
xaxis = list(title="BART Predictions"),
yaxis = list(title="LASSO Predictions"),
legend = list(orientation = 'h'))
If the default prior estimate of the error variance \(\sigma^2\) (sigest, based on the standard conditionally conjugate inverted chi-squared prior) yields reasonable results, we can try longer BART runs (ndpost=5000). Note the stable distribution of the post-burn-in draws of \(\hat{\sigma}\) (\(y\)-axis) with respect to the number of posterior draws (\(x\)-axis).
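The code chunk producing the longer run is not echoed below; a minimal sketch that reproduces its reported settings (assuming default wbart arguments apart from the burn-in, the number of posterior draws, and the printing frequency) might look like this.
# Longer BART run on all 148 cases (no test set), matching the settings reported below:
# 1000 burn-in draws (nskip) and 5000 retained posterior draws (ndpost)
model_BART_long <- wbart(x.train, y.train, nskip=1000, ndpost=5000, printevery=5000)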
## *****Into main of wbart
## *****Data:
## data:n,p,np: 148, 5, 0
## y1,yn: -712.837838, 2893.162162
## x1,x[n*p]: 122.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 2 ... 1
## *****burn and ndpost: 1000, 5000
## *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,290.550176,3.000000,2178098.906176
## *****sigma: 3343.904335
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,5,0
## *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 5000,5000,5000,5000
## *****printevery: 5000
## *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
##
## MCMC
## done 0 (out of 6000)
## done 5000 (out of 6000)
## time: 8s
## check counts
## trcnt,tecnt,temecnt,treedrawscnt: 5000,0,0,5000
# plot(model_BART_long$sigma, xlab="Number of Posterior Draws Returned")
plot_ly() %>%
add_markers(x=c(1:length(model_BART_long$sigma)), y=model_BART_long$sigma,
name="BART vs. LASSO Scatter", type="scatter", mode="markers") %>%
layout(title='Scatter Plot BART Sigma (post burn in draws of sigma)',
xaxis = list(title="Number of Posterior Draws Returned"),
yaxis = list(title="model_BART_long$sigma"),
legend = list(orientation = 'h'))
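As an informal numerical complement to the trace plot (not part of the original analysis), we can compare summaries of the first and second halves of the returned sigma draws; roughly matching summaries suggest the chain has stabilized.
# Informal stability check: compare the two halves of the returned sigma draws
half <- length(model_BART_long$sigma) %/% 2
summary(model_BART_long$sigma[1:half])
summary(model_BART_long$sigma[(half + 1):length(model_BART_long$sigma)])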
# plot(model_BART_long$yhat.train.mean, y.train, xlab="BART Predicted Charges", ylab="Observed Charges",
# main=sprintf("Correlation (Observed,Predicted)=%f",
# round(cor(model_BART_long$yhat.train.mean, y.train), digits=2)))
# abline(0, 1, col="red", lty=2)
# legend("topleft",legend=c("BART Predictions", "LM (BART~LASSO)"),
# col=c("gray", "red","blue"), lwd=c(2,2,2), lty=c(3,1,1), bty="n", cex=0.9, seg.len=3)
plot_ly() %>%
add_markers(x=model_BART_long$yhat.train.mean, y=y.train,
name="BART vs. LASSO Scatter", type="scatter", mode="markers") %>%
add_trace(x = c(2800,9000), y = c(2800,9000), type="scatter", mode="lines",
line = list(width = 4), name="LM (BART~LASSO)") %>%
layout(title=sprintf("Observed vs. BART-Predicted Hospital Charges: Cor(BART,Observed)=%f",
round(cor(model_BART_long$yhat.train.mean, y.train), digits=2)),
xaxis = list(title="BART Predictions"),
yaxis = list(title="Observed Values"),
legend = list(orientation = 'h'))
ind <- order(model_BART_long$yhat.train.mean)
# boxplot(model_BART_long$yhat.train[ , ind], ylim=range(y.train), xlab="case",
# ylab="BART Hospitalization Charge Prediction Range")
caseIDs <- paste0("Case",rownames(heart_attack))
rowIDs <- paste0("", c(1:dim(model_BART_long$yhat.train)[1]))
colnames(model_BART_long$yhat.train) <- caseIDs
rownames(model_BART_long$yhat.train) <- rowIDs
df1 <- as.data.frame(model_BART_long$yhat.train[ , ind])
df2_wide <- as.data.frame(cbind(index=c(1:dim(model_BART_long$yhat.train)[1]), df1))
# colnames(df2_wide); dim(df2_wide)
df_long <- tidyr::gather(df2_wide, case, measurement, Case138:Case8)
# str(df_long )
# 'data.frame': 74000 obs. of 3 variables:
# $ index : int 1 2 3 4 5 6 7 8 9 10 ...
# $ case : chr "Case138" "Case138" "Case138" "Case138" ...
# $ measurement: num 5013 3958 4604 2602 2987 ...
actualCharges <- as.data.frame(cbind(cases=caseIDs, value=y.train))
plot_ly() %>%
add_trace(data=df_long, y = ~measurement, color = ~case, type = "box") %>%
add_trace(x=~actualCharges$cases, y=~actualCharges$value, type="scatter", mode="markers",
name="Observed Charge", marker=list(size=20, color='green', line=list(color='yellow', width=2))) %>%
layout(title="Box-and-whisker Plots across all 148 Cases (Highlighted True Charges)",
xaxis = list(title="Cases"),
yaxis = list(title="BART Hospitalization Charge Prediction Range"),
showlegend=F)
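As a follow-up check (not shown in the original output), we can quantify how often the observed charges fall inside the case-specific \(95\%\) posterior intervals of the BART fit. Note that these intervals describe the mean function \(f(x)\) rather than full predictive intervals that include the noise term, so the empirical coverage of raw observations may fall below the nominal level.
# 95% posterior credible intervals of f(x) for each case (columns of yhat.train)
ci_BART <- apply(model_BART_long$yhat.train, 2, quantile, probs = c(0.025, 0.975))
# Proportion of observed charges falling inside their case's interval
mean(y.train >= ci_BART[1, ] & y.train <= ci_BART[2, ])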