| SOCR ≫ | DSPA ≫ | DSPA2 Topics ≫ |
In this chapter we present the fundamental linear algebra concepts underpinning many modern matrix computing operations, vector-based operations, linear modeling techniques, eigenspectra decompositions, multivariate linear regression modeling, ordinary least squares estimation, machine learning and artificial intelligence algorithms. We will demonstrate these techniques using simulated data and real observed data of baseball players and clinical heart attack patients.
Some students and readers may find it useful to first review some of the fundamental mathematical representations, analytical modeling techniques, and basic concepts. These foundations play critical roles in all subsequent chapters and sections. Examples of core mathematical principles include calculus of differentiation and integration; representation of scalars, vectors, matrices, and tensors; displacement, velocity, and acceleration; polynomials, exponents, and logarithmic functions; Taylor’s series; complex numbers; ordinary and partial differential equations; probability and statistics; statistical moments; and probability distributions.
Linear algebra is a branch of mathematics that studies linear associations using vectors, vector-spaces, linear equations, linear transformations and matrices. Although it is generally challenging to visualize complex data, e.g., large vectors, tensors, and tables in n-dimensional Euclidean spaces (\(n\ge 3\)), linear algebra allows us to represent, model, synthesize and summarize such complex data.
Virtually all natural processes permit first-order linear approximations. This is useful because linear equations are easy (to write, interpret, solve) and these first order approximations may be useful to practically assess the process, determine general trends, identify potential patterns, and suggest associations in the data.
Linear equations represent the simplest type of models for many processes. Higher order models may include additional nonlinear terms, e.g., Taylor-series expansion. Linear algebra provides the foundation for linear representation, analytics, solutions, inference and visualization of first-order affine models. Linear algebra is a small part of the larger mathematics functional analysis field, which is actually the infinite-dimensional version of linear algebra.
Specifically, linear algebra allows us to computationally manipulate, model, solve, and interpret complex systems of equations representing large numbers of dimensions/variables. Arbitrarily large problems can be mathematically transformed into simple matrix equations of the form \(A x = b\) or \(A x = \lambda x\).
In this chapter, we review the fundamentals of linear algebra, matrix manipulation and their applications to representation, modeling, and analysis of real data. Specifically, we will cover (1) construction of matrices and matrix operations, (2) general matrix algebra notations, (3) eigenvalues and eigenvectors of linear operators, (4) least squares estimation, and (5) linear regression and variance-covariance matrices.
The easiest way to create a matrix in R is by using the
matrix() or array() functions, which allow
splicing a long vector into a matrix or a tensor of certain size.
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 0 0 0 0 0
## [2,] 0 2 0 0 0 0
## [3,] 0 0 3 0 0 0
## [4,] 0 0 0 4 0 0
## [5,] 0 0 0 0 5 0
## [6,] 0 0 0 0 0 6
## [,1] [,2] [,3] [,4]
## [1,] 0.42826855 0.78581107 0.6956294 1.2622379
## [2,] 1.69432442 0.03411308 1.3167843 -0.6325525
## [3,] 0.08254209 0.04679539 -0.0180993 -0.2298754
## [4,] 0.90331659 1.54208243 1.7342205 -0.1269655
## [5,] 1.95724419 0.59320691 0.2500012 -0.9664280
The function diag() is very useful. When the object is a
vector, it creates a diagonal matrix with the vector in the principal
diagonal.
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 2 0
## [3,] 0 0 3
When the object is a matrix, diag() returns its
principal diagonal.
## [1] 1 4
When the object is a scalar, diag(k) returns a \(k\times k\) identity matrix.
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
The functions cbind() and rbind() are also
useful for building matrices from vectors by column or row
concatenation.
## c1
## [1,] 0.42826855 0.78581107 0.6956294 1.2622379 1
## [2,] 1.69432442 0.03411308 1.3167843 -0.6325525 2
## [3,] 0.08254209 0.04679539 -0.0180993 -0.2298754 3
## [4,] 0.90331659 1.54208243 1.7342205 -0.1269655 4
## [5,] 1.95724419 0.59320691 0.2500012 -0.9664280 5
## [,1] [,2] [,3] [,4]
## 0.42826855 0.78581107 0.6956294 1.2622379
## 1.69432442 0.03411308 1.3167843 -0.6325525
## 0.08254209 0.04679539 -0.0180993 -0.2298754
## 0.90331659 1.54208243 1.7342205 -0.1269655
## 1.95724419 0.59320691 0.2500012 -0.9664280
## r1 1.00000000 2.00000000 3.0000000 4.0000000
Note that m5 has a row name r1 in the
4th row. We remove row/column names by naming them as
NULL.
## [,1] [,2] [,3] [,4]
## [1,] 0.42826855 0.78581107 0.6956294 1.2622379
## [2,] 1.69432442 0.03411308 1.3167843 -0.6325525
## [3,] 0.08254209 0.04679539 -0.0180993 -0.2298754
## [4,] 0.90331659 1.54208243 1.7342205 -0.1269655
## [5,] 1.95724419 0.59320691 0.2500012 -0.9664280
## [6,] 1.00000000 2.00000000 3.0000000 4.0000000
Each element in a matrix has a location indexed by the corresponding
row and column. A[i, j] stores the element in the
ith row and jth column in the matrix A.
We can also access some specific rows or columns using matrix
subscripts.
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
## [1] 4
## [1] 1 4 7 10
## [1] 4 5 6
## [,1] [,2]
## [1,] 4 7
## [2,] 5 8
## [3,] 6 9
The ordinal scalar operations addition, subtraction, multiplication and division can be generalized as matrix operations.
Matrix addition and subtraction require matrices of the same dimensions. The sum or difference of two matrices are matrices containing elements representing the scalar sum or difference, respectively, of the values in corresponding positions in the two matrices.
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 3 5 7
## [,1] [,2] [,3]
## [1,] 3 7 11
## [2,] 5 9 13
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Element-wise matrix multiplication is valid for matrices of the same sizes. However, matrix multiplication is different from component-wise scalar multiplication, and requires a special match between the dimensions of the multiplied matrices, \(P_{m\times k}=L_{m\times n} \cdot R_{n\times k}\). That is, the number of columns in the left matrix, \(L\),, must equal to the number of rows of the right matrix, \(R\). Then, the \(row\times column\) matrix multiplication rule yields a product matrix of dimensions corresponding to the number of rows (\(m\)) of \(L\) and the number of columns (\(k\)) of \(R\), i.e., \(P_{m\times k}\).
Element-wise matrix multiplication (\(*\)) involves scalar products of the elements in the same positions.
## [,1] [,2] [,3]
## [1,] 2 12 30
## [2,] 6 20 42
Matrix product (\(\%*\%\)) generates an output matrix having the same number of rows as the left matrix and the same number of columns as the right matrix.
## [1] 2 3
## [,1] [,2]
## [1,] 3 6
## [2,] 4 7
## [3,] 5 8
## [1] 3 2
## [,1] [,2]
## [1,] 52 88
## [2,] 64 109
The product of multiplying two matrices \(m^8_{2\times 3}\) * \(m^9_{3\times 2}\) is another matrix \(M_{2\times 2}\) of dimensions \(2\times 2\).
The process of multiplying two vectors is called outer
product. Assume we have two vectors \(u\) and \(v\). Using matrix multiplication, their
outer product is the same as \(u\
\underbrace{\% * \%}_{product}\ \overbrace{t(v)}^{transpose}\) or
mathematically \(uv^t\). In
R the vector outer product operator is %o%,
which generates second order tensors (matrices).
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 5 6 7 8
## [2,] 8 10 12 14 16
## [3,] 12 15 18 21 24
## [4,] 16 20 24 28 32
## [5,] 20 25 30 35 40
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 5 6 7 8
## [2,] 8 10 12 14 16
## [3,] 12 15 18 21 24
## [4,] 16 20 24 28 32
## [5,] 20 25 30 35 40
What are the differences between \(u \%*\% v\), \(u \%*\% t(v)\), \(u * t(v)\), and \(u * v\)?
Element-wise matrix (or scalar) division is well defined for matrices of the same dimensions.
## [,1] [,2] [,3]
## [1,] 2.0 1.333333 1.200000
## [2,] 1.5 1.250000 1.166667
## [,1] [,2] [,3]
## [1,] 1.0 2.0 3.0
## [2,] 1.5 2.5 3.5
However, matrix inversion is different. Recall that the
transpose of a matrix is a matrix with swapped columns and
rows. In R, matrix transposition is done by the function
t().
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 3 5 7
## [,1] [,2]
## [1,] 2 3
## [2,] 4 5
## [3,] 6 7
Notice that the \([1, 2]\) element
in m8[1, 2] is the same as the \([2, 1]\) element in
t(m8)[2, 1].
The right inverse of a matrix, (\(A^{-1}_{m\times n}\)), is a special matrix with the property that multiplying the original matrix (\(A_{n\times m}\)) on the right by this inverse (\(A^{-1}\)) yields the identity matrix, which has \(1\)’s on the main diagonal and \(0\)’s off the diagonal. That is,
\[A_{n\times m}\ A^{-1}_{m\times n} = I_{n\times n}.\] Similarly, a left matrix inverse is defined as a matrix, (\(A^{-1}_{m\times n}\)), with the property that multiplying the original matrix (\(A_{n\times m}\)) on the left by this inverse (\(A^{-1}\)) yields the identity matrix.
\[ A^{-1}_{m\times n}\ A_{n\times m} = I_{m\times m}.\]
A matrix inverse is only defined for square matrices, \(m=n\).
Given four numbers satisfying \(ad-bc \not= 0\), the following \(2\times 2\) matrix.
\[A_{2\times 2}=\left(\begin{array}{cc} a & b \\ c & d \end{array}\right) ,\]
has an inverse matrix given by
\[A^{-1}_{2\times 2} =\frac{1}{ad-bc}\left(\begin{array}{cc} d & -b \\ -c & a \end{array}\right) .\]
It’s easy to validate that \(A_{2\times 2} A^{-1}_{2\times 2} =I_{2\times 2}\).
In higher dimensions, the Cramer’s
rule may be used to compute the matrix inverse. Matrix inversion is
available in R via the solve() function.
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
## [,1] [,2]
## [1,] -2 1.5
## [2,] 1 -0.5
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
Note that only special matrices are invertible, not all. These matrices are square (have the same number of rows and columns) and non-singular.
Another function that can help us to get the inverse of a matrix is
the ginv() function in the MASS package. This
function gives us the Moore-Penrose Generalized Inverse of a matrix.
## Loading required package: MASS
## [,1] [,2]
## [1,] -2 1.5
## [2,] 1 -0.5
In addition, the function solve() can be used to solve
matrix equations. For instance, solve(A, b) returns a
vector \(x\) satisfying the equation
\(b = Ax\), i.e., \(x= A^{-1}b\).
## [1] 0.5 0.5 0.5 0.5
The following table summarizes some of the basic matrix operation functions.
| Expression | Explanation |
|---|---|
t(x) |
transpose |
diag(x) |
diagonal |
%*% |
matrix multiplication |
solve(a, b) |
solves a %*% x = b for x |
solve(a) |
matrix inverse of a |
rowsum(x) |
sum of rows for a matrix-like object. rowSums(x) is a
faster version |
colSums(x), colSums(x) |
id. for columns |
rowMeans(x) |
fast version of row means |
colMeans(x) |
id. for columns |
mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
mat1.inv <- solve(mat1)
mat1.identity <- mat1.inv %*% mat1
mat1.identity## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
## [1] 1.785714 2.357143
Let’s look at the basics of matrix notation and matrix algebra. The
product \(AB\) between matrices \(A\) and \(B\) is defined only if the number of
columns in \(A\) equals the number of
rows in \(B\). That is, we can multiply
an \(m\times n\) matrix \(A\) by an \(n\times k\) matrix \(B\) and the result will be \(AB_{m\times k}\) matrix. Each element of
the product matrix, \((AB_{i, j})\),
represents the product of the \(i\)-th
row in \(A\) and the \(j\)-th column in \(B\), which are of the same size \(n\). Matrix multiplication is
row-by-column.
Linear algebra notation simplifies the mathematical descriptions and
manipulations of linear models, as well as coding in R.
The main point is to show how we can write linear models using matrix notation. Later, we’ll explain how this is useful for solving the least squares problems.
Linear algebra notation enables the mathematical analysis and derivation of solutions of systems of linear equations and provides a generic machinery for solving linear problems.
\[\begin{align*} a + b + 2c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 3 \end{align*} \]
\[\underbrace{\begin{pmatrix} 1&1&2\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}}_{\text{A}} \underbrace{\begin{pmatrix} a\\ b\\ c \end{pmatrix}}_{\text{x}} = \underbrace{\begin{pmatrix} 6\\ 2\\ 3 \end{pmatrix}}_{\text{b}}\]
That is, \(Ax = b\), which implies that:
\[\begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1&1&2\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 3 \end{pmatrix}\]
In other words, \(A^{-1}A x ==x = A^{-1}b\).
Notice that this approach parallels the strategy for solving of simple (univariate) linear equations like: \[\underbrace{2}_{\text{(design matrix) A }} \overbrace{x}^{\text{unknown x }} \underbrace{-3}_{\text{simple constant term}} = \overbrace{5}^{\text{b}}.\]
The constant term, \(-3\), can be moved and integrated into the right-hand-side, \(b\), to form a new term \(b'=5+3=8\). Thus, the shifting factor is mostly ignored in linear models, or linear equations, which simplifies the linear matrix equation to: \[\underbrace{2}_{\text{(design matrix) A }} \overbrace{x}^{\text{unknown x }} = \underbrace{5+3}_{b'}=\overbrace{8}^{b'}.\]
This (simple) linear equation is solved by multiplying both hand sides by the inverse (reciprocal) of the \(x\) multiplier, \(2\).
\[\frac{1}{2} 2 x = \frac{1}{2} 8.\] Thus, the unique solution is: \[x = \frac{1}{2} 8=4.\]
So, let’s use exactly the same strategy to solve the corresponding
matrix equation (linear equation, \(Ax = b\)) using R, where the
unknown is \(x\), and the
design matrix \(A\) and the
constant vector \(b\) are
known.
\[\underbrace{\begin{pmatrix} 1&1&2\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}}_{\text{A}} \underbrace{\begin{pmatrix} a\\ b\\ c \end{pmatrix}}_{\text{x}} = \underbrace{\begin{pmatrix} 6\\ 2\\ 3 \end{pmatrix}}_{\text{b}}.\]
A_matrix_values <- c(1, 1, 2, 3, -2, 1, 2, 1, -1)
# matrix elements are arranged by columns, so, we need to transpose them to arrange them by rows.
A <- t(matrix(A_matrix_values, nrow=3, ncol=3))
b <- c(6, 2, 3)
# to solve Ax = b, x=A^{-1}*b
x <- solve (A, b)
# Ax = b ==> x = A^{-1} * b
x## [1] 1.35 1.75 1.45
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
How about if we want to triple-check the consistency of the
solve() method to provide accurate solutions to
matrix-based systems of linear equations?
We can generate the solution (\(x\)) to the equation \(Ax=b\) by using first principles. \[ x = A^{-1}b.\]
A.inverse <- solve(A) # the inverse matrix A^{-1}
x1 <- A.inverse %*% b
# check if X and x1 are the same
x; x1## [1] 1.35 1.75 1.45
## [,1]
## [1,] 1.35
## [2,] 1.75
## [3,] 1.45
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
The identity matrix is the matrix analog to the multiplicative numeric identity, the number \(1\). Multiplying the identity matrix by any other matrix (\(B\)) does not change the matrix \(B\). This property requires that the multiplicative identity matrix must look like this.
\[\mathbf{I} = \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ \vdots &\vdots & \vdots & \ddots&\vdots&\vdots\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix}.\]
The identity matrix is always a square matrix with diagonal elements \(1\) and \(0\) at the off-diagonal elements.
Following the above matrix multiplication rules, we can see that:
\[\mathbf{X\times I} = \begin{pmatrix} x_{1, 1} & \dots & x_{1, p}\\ & \vdots & \\ x_{n, 1} & \dots & x_{n, p} \end{pmatrix} \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ & & &\vdots & &\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix}=\]
\[\begin{pmatrix} x_{1, 1} & \dots & x_{1, p}\\ & \vdots & \\ x_{n, 1} & \dots & x_{n, p} \end{pmatrix}= \mathbf{X}.\]
In R, we can express the identity matrix as follows.
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## [,1] [,2] [,3]
## [1,] 1 1 2
## [2,] 3 -2 1
## [3,] 2 1 -1
## [,1] [,2] [,3]
## [1,] 1 1 2
## [2,] 3 -2 1
## [3,] 2 1 -1
Let’s look at this notation deeper using the Baseball
players data containing three quantitative variables,
Heights, Weight, and Age. Suppose
the variable Weight is considered as a random
response (outcome vector) denoted by \(Y_1, Y_2, \dots, Y_n\).
We can express each player’s Weight as a function of
Age and Height.
# Data: https://umich.instructure.com/courses/38100/files/folder/data (01a_data.txt)
data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1', as.is=T, header=T)
attach(data)
head(data)## Name Team Position Height Weight Age
## 1 Adam_Donachie BAL Catcher 74 180 22.99
## 2 Paul_Bako BAL Catcher 74 215 34.69
## 3 Ramon_Hernandez BAL Catcher 72 210 30.78
## 4 Kevin_Millar BAL First_Baseman 72 210 35.43
## 5 Chris_Gomez BAL First_Baseman 73 188 35.71
## 6 Brian_Roberts BAL Second_Baseman 69 176 29.39
In matrix form, we can express the outcome using one symbol, \(\mathbf{Y}\). We usually use bold face to distinguish between scalars, vectors, matrices and tensors.
\[\mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix}.\]
In R, the default representation of vector data is as
columns, i.e., our outcome vector dimension is \(n\times 1\), as opposed to \(1 \times n\) used for row vectors (e.g.,
\(\mathbf{Y}^t\)).
Similarly, we can use matrix notation to represent the covariates, or
predictors, Age and Height. In a case with two
predictors, we can represent them like this:
\[\mathbf{X}_1 = \begin{pmatrix} x_{1, 1}\\ \vdots\\ x_{n, 1} \end{pmatrix} \mbox{ and } \mathbf{X}_2 = \begin{pmatrix} x_{1, 2}\\ \vdots\\ x_{n, 2} \end{pmatrix}.\]
In the that for the Baseball players study, \(x_{1, 1}= Age_1\) and \(x_{i, 1}=Age_i\) with \(Age_i\) representing the Age
of the \(i\)-th player, and similarly,
\(x_{i, 2}= Height_i\) is the height of
the \(i\)-th player. These vectors can
be thought of as \(n\times 1\)
matrices. For instance, it is convenient to represent the covariates as
design matrices.
\[\mathbf{X} = [ \mathbf{X}_1 \mathbf{X}_2 ] = \begin{pmatrix} x_{1, 1}&x_{1, 2}\\ \vdots\\ x_{n, 1}&x_{n, 2} \end{pmatrix}. \]
This design matrix has dimension \(n \times 2\).
## Age Height
## [1,] 22.99 74
## [2,] 34.69 74
## [3,] 30.78 72
## [4,] 35.43 72
## [5,] 35.71 73
## [6,] 29.39 69
## [1] 1034 2
We can also use this notation to denote an arbitrary number (\(k\)) of covariates with the following \(n\times k\) matrix.
\[\mathbf{X} = \begin{pmatrix} x_{1, 1}&\dots & x_{1, k} \\ x_{2, 1}&\dots & x_{2, k} \\ & \vdots & \\ x_{n, 1}&\dots & x_{n, k} \end{pmatrix}. \]
You can simulate such a design matrix in R using
matrix(), instead of cbind.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1035 2069 3103 4137
## [2,] 2 1036 2070 3104 4138
## [3,] 3 1037 2071 3105 4139
## [4,] 4 1038 2072 3106 4140
## [5,] 5 1039 2073 3107 4141
## [6,] 6 1040 2074 3108 4142
## [1] 1034 5
By default, matrices are filled column-by-column order,
however using the byrow=TRUE argument allows us to change
the order to row-by-row.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
## [5,] 21 22 23 24 25
## [6,] 26 27 28 29 30
## [1] 1034 5
Scalars are just one-dimensional values, typically numbers, that are different from their higher-dimensional counterparts, vectors, matrices, and tensors, which are usually denoted by bold characters.
To compute the sample average and variance of a dataset, we use the formulas: \[\bar{Y}=\frac{1}{n} \sum_{i=1}^n {Y_i}\]
and
\[\mbox{var}(Y)=\frac{1}{n-1} \sum_{i=1}^n {(Y_i - \bar{Y})}^2, \] which can be represented as matrix multiplications.
Define an \(n \times 1\) matrix made of \(1\)’s.
\[A=\begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix}.\]
This implies that.
\[\frac{1}{n} \mathbf{A}^\top Y = \frac{1}{n} \begin{pmatrix}1&1& \dots&1\end{pmatrix} \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix}= \frac{1}{n} \sum_{i=1}^n {Y_i}= \bar{Y}.\]
Recall that we multiply matrices and scalars, like \(\frac{1}{n}\), by *, whereas
we multiply matrices using the matrix product operator,
%*%.
## [1] 73.69729
## [,1]
## [1,] 73.69729
## [1] 73.69729
Multiplying the transpose of a matrix with another matrix is very
common in linear modeling and statistical computing, so there is an
appropriate function in R, crossprod().
## [,1]
## [1,] 73.69729
There is a similar matrix algebra for computing the variance.
\[\mathbf{Y'}\equiv \begin{pmatrix} Y_1 - \bar{Y}\\ \vdots\\ Y_n - \bar{Y} \end{pmatrix}, \, \, \frac{1}{n-1} \mathbf{Y'}^\top\mathbf{Y'} = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar{Y})^2. \]
A crossprod with only one matrix computes \(Y^\top Y\).
## [,1]
## [1,] 5.316798
## [1] 5.316798
Let’s use the following pair of matrices.
\[\overbrace{\mathbf{Y}}^{outcome} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix} , \underbrace{\mathbf{X}}_{design} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_n \end{pmatrix} , \overbrace{\mathbf{\beta}}^{effects} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \underbrace{\mathbf{\varepsilon}}_{error} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{pmatrix}.\]
Then, we can express the problem as a linear model.
\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1, \dots, n\]
We can also express the complete problem formulation into a corresponding succinct matrix notation formula.
\[\begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_n \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{pmatrix}. \]
And in its matrix form.
\[\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\]
which is a simpler way to write the same model equation.
One way to obtain an optimal solution is by minimizing all residuals (\(\epsilon_i\)). This high-fidelity criterion indicates a good model fit. The least squares (LS) solution represents one way to solve this matrix equation (\(Y=X\beta+\epsilon\)). The LS solution is obtained by minimizing the residual sum square error \[\langle\epsilon^t, \epsilon \rangle = (Y-X\beta)^t \times (Y-X\beta).\]
Let’s define the LS objective function using the cross-product notation.
\[f(\beta) = (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^t (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}). \]
We can determine the effect size estimates, \({\hat {\beta}}\), are obtained by minimizing this expression. Of course, we can derive an analytic solution using calculus to find the minimum of the cost (objective) function, \(f(\beta)\).
There are a number of rules that help with solving partial derivative equations in matrix form. recall that the critical points of the objective functions are either at the domain border or at values where the derivative of a objective function is trivial, \(f'(x)=0\). hence, solving for the unknown parameter \(\beta\) requires identifying the critical points, \({\hat {\beta}}\), which will represent candidate solution(s). The derivative of the above equation is
\[2 \mathbf{X}^t (\mathbf{Y} - \mathbf{X} \boldsymbol{\hat{\beta}})=0,\]
\[\mathbf{X}^t \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^t \mathbf{Y},\]
\[\boldsymbol{\hat{\beta}} = (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{Y}.\]
This estimate \({\hat {\beta}}\) represents the desired LS solution to the linear modeling problem. The hat notation, \({\hat {\cdot}}\), is used to denote estimates. For instance, the solution for the unknown \(\beta\) parameters is denoted by the (data-driven) estimate \(\hat{\beta}\).
The least squares minimization works because minimizing a function corresponds to finding the roots of its (first) derivative. With ordinary least squares (OLS), we square the residuals.
\[(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^t (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}).\]
Notice that the minimum of \(f(x)\) and \(f^2(x)\) are achieved at the same roots of \(f'(x)\), as the derivative of \(f^2(x)\) is \(\frac{d}{dx}f^2(x) = 2f(x)f'(x)\).
Here is how we obtain the Least Square estimation in
R.
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'data':
##
## Position
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#x=cbind(data$Height, data$Age)
x=data$Height
y=data$Weight
X <- cbind(1, x)
beta_hat <- solve( t(X) %*% X ) %*% t(X) %*% y
###or
beta_hat <- solve( crossprod(X) ) %*% crossprod( X, y ) Now we can see the results of this by computing the estimated \(\hat{\beta}_0+\hat{\beta}_1 x\) (fitted model prediction) corresponding to any covariate input value of \(x\).
# newx <- seq(min(x), max(x), len=100)
X <- cbind(1, x)
fitted <- X%*%beta_hat
# or directly: fitted <- lm(y ~ x)$fitted
# plot(x, y, xlab="MLB Player's Height", ylab="Player's Weight")
# lines(x, fitted, col=2)
plot_ly(x = ~x) %>%
add_markers(y = ~y, name="Data Scatter") %>%
add_lines(x = ~x, y = ~fitted[,1], name="(Manual) Linear Model (Weight ~ Height)") %>%
add_lines(x = ~x, y = ~lm(y ~ x)$fitted, name="(Direct) lm(Weight ~ Height)",
line = list(width = 4, dash = 'dash')) %>%
layout(title='Baseball Players: Linear Model of Weight vs. Height',
xaxis = list(title="Height (in)"), yaxis = list(title="Weight (lb)"),
legend = list(orientation = 'h'))The closed-form analytical expression for the LS estimate \[\hat{\boldsymbol{\beta}}=(\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{Y}\] is one of the most widely used results in data analysis. One of the advantages of this approach is that we can use it in many different situations.
RIn R, there is a very convenient function
lm() that fits these linear models. We will learn more
about this function later, but here is a demonstration that it agrees
with a simple manual LS estimation approach we showed above.
Starting in the 18th century, the work of Euler’s on rotational motion and later Lagrange on the study of inertia matrices, the notions of principal axes (eigenvectors) and characteristic roots (eigenvectors) arose. However, it took close to 200 years until Hilbert and others working on integral operators settled on using the terminology eigen, “own”, to denote eigenvalues (proper characteristic values) and eigenvectors (principal axes).
The eigen-spectrum (eigenspace) decomposition of linear operators (matrices) into eigenvalues and eigenvectors enables us to understand linear transformations and characterize their properties. The eigenvectors represent the “axes” (directions) along which a linear transformation acts by stretching, compressing, or flipping.
The eigenvalues represent the amounts of this linear transformation into the specified eigen-vector direction. In higher dimensions, there are more directions along which we need to understand the behavior of the linear transformation. The eigen-spectrum makes it easier to understand the linear transformation especially when many (all?) of the eigenvectors are linearly independent (orthogonal).
For a given matrix \(A\), if we have \(A\vec{v}=\lambda \vec{v}\), then we say that a nonzero vector \(\vec{v}\) is a right eigenvector of the matrix \(A\) and the scale factor \(\lambda\) is the eigenvalue corresponding to that eigenvector.
With some calculations we can show that \(A\vec{v}=\lambda \vec{v}\) is the same as
\((\lambda I_n-A)\vec{v}=\vec{0}\),
where \(I_n\) is the \(n\times n\) identity matrix. So, when we
solve this equation, we get the corresponding eigenvalues and
eigenvectors. As this is a very common operation, we don’t need to do
that by hand - the method eigen() provides this
functionality.
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
## eigen() decomposition
## $values
## [1] 1 1
##
## $vectors
## [,1] [,2]
## [1,] 0 -1
## [2,] 1 0
We can easily validate that \((\lambda I_n-A)\vec{v}=\vec{0}\).
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
As we mentioned earlier, diag(n) creates a \(n\times n\) identity matrix. Thus,
diag(2) is the \(I_2\)
matrix in the above equation. The zero output matrix proves that the
equation \((\lambda
I_n-A)\vec{v}=\vec{0}\) holds true.
Many interesting applications of the eigen-spectrum are shown here.
Other important functions about matrix operation are listed in the following table.
| Functions | Math expression or explanation |
|---|---|
crossprod(A, B) |
\(A^TB\) Where \(A\), \(B\) are matrices |
y<-svd(A) |
the output has the following components |
-y$d |
vector containing the singular values of A, |
-y$u |
matrix with columns contain the left singular vectors of A, |
-y$v |
matrix with columns contain the right singular vectors of A |
k <- qr(A) |
the output has the following components |
-k$qr |
has an upper triangle that contains the decomposition and a lower triangle that contains information on the Q decomposition. |
-k$rank |
is the rank of A. |
-k$qraux |
a vector which contains additional information on Q. |
-k$pivot |
contains information on the pivoting strategy used. |
rowMeans(A)/colMeans(A) |
returns vector of row/column means |
rowSums(A)/colSums(A) |
returns vector of row/column sums |
Some flexible matrix operations can help us save time calculating row or column averages. For example, column averages can be calculated by the following matrix operation.
\[AX = \left(\begin{array}{cccc} \frac{1}{N}&\frac{1}{N}&\cdots&\frac{1}{N} \end{array}\right) \left(\begin{array}{cccc} X_{1, 1}&\cdots&X_{1, p}\\ X_{2, 1}&\cdots&X_{2, p}\\ \vdots&\vdots&\vdots\\ X_{N, 1}&\cdots&X_{N, p} \end{array}\right)= \left(\begin{array}{cccc} \bar{X}_1&\bar{X}_2&\cdots &\bar{X}_N \end{array}\right).\]
The row averages can be calculated similarly.
\[XB = \left(\begin{array}{cccc} X_{1, 1}&\cdots &X_{1, p}\\ X_{2, 1}&\cdots &X_{2, p}\\ \vdots &\vdots &\vdots \\ X_{N, 1}&\cdots&X_{N, p} \end{array}\right) \left(\begin{array}{c} \frac{1}{p}\\ \frac{1}{p}\\ \vdots\\ \frac{1}{p} \end{array}\right)= \left(\begin{array}{c} \bar{X}_1\\ \bar{X}_2\\ \vdots\\ \bar{X}_q \end{array}\right). \]
Expeditious matrix calculations can be done by multiplying a matrix on the left or at the right by another matrix. In general multiplying by a vector on the left can amounts to weight averaging.
\[AX = \left(\begin{array}{cccc} a_1&a_2&\cdots &a_N \end{array}\right) \left(\begin{array}{cccc} X_{1, 1}&\cdots &X_{1, p}\\ X_{2, 1}&\cdots &X_{2, p}\\ \vdots&\vdots &\vdots \\ X_{N, 1}&\cdots &X_{N, p} \end{array}\right)= \left(\begin{array}{cccc} \sum_{i=1}^N a_i \bar{X}_{i, 1}&\sum_{i=1}^N a_i \bar{X}_{i, 2}&\cdots &\sum_{i=1}^N a_i \bar{X}_{i, N} \end{array}\right). \]
Now let’s try this matrix notation to look at genetic expression data including \(8,793\) different genes for \(208\) subjects. These gene expression data represents a microarray experiment - GSE5859 - comparing Gene Expression Profiles from Lymphoblastoid cells. Specifically, the data compares the expression level of genes in lymphoblasts from individuals in three HapMap populations {CEU, CHB, JPT}. The study found that the mean expression levels between the {CEU} and {CHB+JPT} samples were significantly different (\(p < 0.05\)) for more than a thousand genes.
The gene expression profiles data has two components:
gene <- read.csv("https://umich.instructure.com/files/2001417/download?download_frd=1",
header = T) # exprs_GSE5859.csv
info<-read.csv("https://umich.instructure.com/files/2001418/download?download_frd=1",
header=T) # exprs_MetaData_GSE5859.csvRecall that the lapply() function that we talked about
in Chapter
2 and the sapply() function can be used to calculate
column and row averages. Let’s compare the outputs of
sapply and the corresponding matrix algebra process.
colmeans <- sapply(gene[, -1], mean)
gene1 <- as.matrix(gene[, -1])
# can also use built in functions
# colMeans <- colMeans(gene1)
colmeans.matrix <- crossprod(rep(1/nrow(gene1), nrow(gene1)), gene1)
colmeans[1:15]## GSM25581.CEL.gz GSM25681.CEL.gz GSM136524.CEL.gz GSM136707.CEL.gz
## 5.703998 5.721779 5.726300 5.743632
## GSM25553.CEL.gz GSM136676.CEL.gz GSM136711.CEL.gz GSM136542.CEL.gz
## 5.835499 5.742565 5.751601 5.732211
## GSM136535.CEL.gz GSM25399.CEL.gz GSM25552.CEL.gz GSM25542.CEL.gz
## 5.741741 5.618825 5.805147 5.733117
## GSM136544.CEL.gz GSM25662.CEL.gz GSM136563.CEL.gz
## 5.733175 5.716855 5.750600
## [1] 5.703998 5.721779 5.726300 5.743632 5.835499 5.742565 5.751601 5.732211
## [9] 5.741741 5.618825 5.805147 5.733117 5.733175 5.716855 5.750600
The same outputs are generated by both protocols. Note that we uses
rep(1/nrow(gene1), nrow(gene1)) to create the vector
\[\left(\begin{array}{cccc} \frac{1}{N}&\frac{1}{N}&\cdots &\frac{1}{N} \end{array}\right). \]
needed to obtain manually the column averages by matrix algebra. Similarly, we can compute the column means.
colmeans <- as.matrix(colmeans)
h <- hist(colmeans, plot=F)
plot_ly(x = h$mids, y = h$counts, type = "bar", name = "Column Averages") %>%
layout(title='Average Gene Expression Histogram',
xaxis = list(title = "Column Means"),
yaxis = list(title = "Average Expression", side = "left"),
legend = list(orientation = 'h'))The histogram shows that the distribution is symmetric, unimodal and bell-shaped, i.e., roughly normal.
We can also solve harder problems using matrix algebra. For example, let’s calculate the differences between genders for each gene. First, we need to get the gender information for each subject.
Then, we have to reorder the columns to make it consistent with the
feature matrix gene1.
Next, we are going to design the matrix. We want to multiply it with the feature matrix. The plan is to multiply the following two matrices.
\[\left(\begin{array}{cccc} X_{1, 1}&\cdots&X_{1, p}\\ X_{2, 1}&\cdots&X_{2, p}\\ \vdots & \vdots &\vdots \\ X_{N, 1}&\cdots&X_{N, p} \end{array}\right) \left(\begin{array}{cc} \frac{1}{p}&a_1\\ \frac{1}{p}&a_2\\ \vdots & \vdots \\ \frac{1}{p}&a_p \end{array}\right)= \left(\begin{array}{cc} \bar{X}_1&gender.diff_1\\ \bar{X}_2&gender.diff_2\\ \vdots & \vdots \\ \bar{X}_N&gender.diff_N \end{array}\right),\] where \(a_i=-\frac{1}{N_F}\) if the subject is female and \(a_i=\frac{1}{N_M}\) if the subject is male. Thus, we gave each female and male the same weight before the subtraction. We average each gender and get their difference. \(\bar{X}_i\) is the average across both genders and \(gender.diff_i\) represents the gender difference for the i-th gene.
##
## F M
## 86 122
gender$vector <- ifelse(gender$sex=="F", -1/86, 1/122)
vec1 <- as.matrix(data.frame(rowavg=rep(1/ncol(gene1), ncol(gene1)), gender.diff=gender$vector))
gender.matrix <- gene1 %*% vec1
gender.matrix[1:15, ]## rowavg gender.diff
## [1,] 6.383263 -0.003209464
## [2,] 7.091630 -0.031320597
## [3,] 5.477032 0.064806978
## [4,] 7.584042 -0.001300152
## [5,] 3.197687 0.015265502
## [6,] 7.338204 0.078434938
## [7,] 4.232132 0.008437864
## [8,] 3.716460 0.018235650
## [9,] 2.810554 -0.038698101
## [10,] 5.208787 0.020219666
## [11,] 6.498989 0.025979654
## [12,] 5.292992 -0.029988980
## [13,] 7.069081 0.038575442
## [14,] 5.952406 0.030352616
## [15,] 7.247116 0.046020066
As we mentioned earlier, the formula for linear regression can be written as
\[Y_i=\beta_0+X_{i, 1}\beta_1+\cdots+X_{i, p}\beta_p +\epsilon_i, i=1, \cdots, N\] This formula can also be expressed in matrix form.
\[\left(\begin{array}{c} Y_1\\ Y_2\\ \vdots\\ Y_N \end{array}\right)= \left(\begin{array}{c} 1\\ 1\\ \vdots\\ 1 \end{array}\right)\beta_0+ \left(\begin{array}{c} X_{1, 1}\\ X_{2, 1}\\ \vdots\\ X_{N, 1} \end{array}\right)\beta_1+\cdots + \left(\begin{array}{c} X_{1, p}\\ X_{2, p}\\ \vdots \\ X_{N, p} \end{array}\right)\beta_p+ \left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots \\ \epsilon_N \end{array}\right), \]
which can be compressed into a simple matrix equation \(Y=X\beta +\epsilon\).
\[\left(\begin{array}{c} Y_1\\ Y_2\\ \vdots \\ Y_N \end{array}\right)= \left(\begin{array}{cccc} 1&X_{1, 1}&\cdots&X_{1, p}\\ 1&X_{2, 1}&\cdots&X_{2, p}\\ \vdots&\vdots&\vdots&\vdots\\ 1&X_{N, 1}&\cdots&X_{N, p} \end{array}\right) \left(\begin{array}{c} \beta_o\\ \beta_1\\ \vdots\\ \beta_p \end{array}\right)+ \left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_N \end{array}\right). \] As \(Y=X\beta +\epsilon\) implies that \(X^tY \sim X^t(X\beta)=(X^tX)\beta\), and thus, the LS solution for \(\beta\) is obtained by multiplying both hand sides by the inverse of the square cross product matrix \((X^tX)^{-1}\): \[\hat{\beta}=(X^tX)^{-1}X^tY.\]
Matrix calculations are much faster, especially on specialized
computer chips, than fitting a manual regression model. Let’s apply this
to the Lahman
baseball data representing yearly stats and standings. Let’s
download it first via this link baseball.data
and save it in the R working directory. We can use the
load() function to import the local RData. For this
example, we subset the dataset by G==162 and
yearID < 2002. Also, we create a new feature named
Singles that is equal to
H(Hits by batters) - X2B(Doubles) - X3B(Tripples) - HR(Home Runs by batters).
Finally, we only pick four features R (Runs scored),
Singles, HR (Home Runs by batters), and BB
(Walks by batters).
#If you downloaded the .RData locally first, then you can easily load it into the`R`workspace by:
# load("Teams.RData")
# Alternatively you can also download the data in CSV format from https://umich.instructure.com/courses/38100/files/folder/data (teamsData.csv)
Teams <- read.csv('https://umich.instructure.com/files/2798317/download?download_frd=1', header=T)
dat <- Teams[Teams$G==162&Teams$yearID<2002, ]
dat$Singles <- dat$H-dat$X2B-dat$X3B-dat$HR
dat <- dat[, c("R", "Singles", "HR", "BB")]
head(dat)## R Singles HR BB
## 439 505 997 11 344
## 1367 683 989 90 580
## 1368 744 902 189 681
## 1378 652 948 156 516
## 1380 707 1017 92 620
## 1381 632 1020 126 504
In this example, let’s work with R as the response variable and BB as the independent variable. For a full linear model, we need to add another column of \(1\)’s to the design matrix \(X\).
## [,1] [,2]
## [1,] 1 344
## [2,] 1 580
## [3,] 1 681
## [4,] 1 516
## [5,] 1 620
## [6,] 1 504
## [7,] 1 498
## [8,] 1 502
## [9,] 1 493
## [10,] 1 556
We use the LS analytical formula to obtain the beta (effects) estimates
\[\hat{\beta}=(X^t X)^{-1}X^t Y.\]
## [,1]
## [1,] 326.8241628
## [2,] 0.7126402
To confirm this manual calculation, we can refit the linear equation
using the lm() function, and compare the computational
times. In this simple example, is there evidence of higher computational
efficiency using matrix calculations?
fit <- lm(R~BB, data=dat)
# fit <- lm(R~., data=dat)
# '.' indicates all other variables, very useful when fitting models with many predictors
fit##
## Call:
## lm(formula = R ~ BB, data = dat)
##
## Coefficients:
## (Intercept) BB
## 326.8242 0.7126
##
## Call:
## lm(formula = R ~ BB, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -187.788 -53.977 -2.995 55.649 258.614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 326.82416 22.44340 14.56 <2e-16 ***
## BB 0.71264 0.04157 17.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.95 on 661 degrees of freedom
## Multiple R-squared: 0.3078, Adjusted R-squared: 0.3068
## F-statistic: 294 on 1 and 661 DF, p-value: < 2.2e-16
## user system elapsed
## 0 0 0
## user system elapsed
## 0 0 0
For a better model, we can expand the covariates to include multiple predictors and compare the resulting estimates.
## [,1] [,2] [,3] [,4]
## [1,] 1 344 997 11
## [2,] 1 580 989 90
## [3,] 1 681 902 189
## [4,] 1 516 948 156
## [5,] 1 620 1017 92
## [6,] 1 504 1020 126
## [7,] 1 498 1064 167
## [8,] 1 502 937 180
## [9,] 1 493 1048 105
## [10,] 1 556 1073 116
## user system elapsed
## 0 0 0
## user system elapsed
## 0.01 0.00 0.00
## (Intercept) BB Singles HR
## -401.0057242 0.3666606 0.6705535 1.7175775
## [,1] [,2] [,3] [,4]
## [1,] -401.0057 0.3666606 0.6705535 1.717577
A scatter plot would show visually the relationship between the outcome R and one of the predictors BB.
# plot(dat$BB, dat$R, xlab = "BB", ylab = "R", main = "Scatter plot/regression for baseball data")
# abline(beta1[1, 1], beta1[2, 1], lwd=4, col="red")
plot_ly(x = ~dat$BB) %>%
add_markers(y = ~dat$R, name="Data Scatter") %>%
add_lines(x = ~dat$BB, y = ~lm(dat$R ~ dat$BB)$fitted,
name="lm(Runs scored ~ Walks by batters)", line = list(width = 4)) %>%
layout(title='Scatter plot/regression for baseball data',
xaxis = list(title="(BB) Walks by batters"), yaxis = list(title="(R) Runs scored"),
legend = list(orientation = 'h'))Here the color model line represents our regression model estimated using matrix algebra.
The power of matrix algebra becomes more apparent when we use multiple variables. We can add the variable HR to the model.
library(reshape2)
X <- cbind(rep(1, n=nrow(dat)), dat$BB, dat$HR)
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta## [,1]
## [1,] 287.7226756
## [2,] 0.3897178
## [3,] 1.5220448
# #install.packages("scatterplot3d")
# library(scatterplot3d)
# myScatter3D <- scatterplot3d(dat$BB, dat$HR, dat$R)
#
# fit = lm(dat$R ~ dat$BB + dat$HR, data = dat)
# # Plot the linear model
# # get the BB & HR ranges summary(dat$BB); summary(dat$HR)
# cf = fit$coefficients
# pltx = seq(344, 775,length.out = 100)
# plty = seq(11,264,length.out = 100)
# pltz = cf[1] + cf[2]*pltx + cf[3]*plty
# #Add the line to the plot
# myScatter3D$points3d(pltx,plty,pltz, type = "l", col = "red", lwd=3)
# # interactive *rgl* 3D plot
# library(rgl)
# fit <- lm(dat$R ~ dat$BB + dat$HR)
# coefs <- coef(fit)
# a <- coefs["dat$BB"]
# b <- coefs["dat$HR"]
# c <- -1
# d <- coefs["(Intercept)"]
# open3d()
# plot3d(dat$BB, dat$HR, dat$R, type = "s", col = "red", size = 1)
# planes3d(a, b, c, d, alpha = 0.5)
# # planes3d(b, a, -1.5, d, alpha = 0.5)
# # planes3d draws planes using the parametrization a*x + b*y + c*z + d = 0.
# # Multiple planes may be specified by giving multiple values for the normal
# # vector (a, b, c) and the offset parameter d
#
# pca1 <- prcomp(as.matrix(cbind(dat$BB, dat$HR, dat$R)), center = T); summary(pca1)
#
# # Given two vectors PCA1 and PCA2, the cross product V = PCA1 x PCA2
# # is orthogonal to both A and to B, and a normal vector to the
# # plane containing PCA1 and PCA2
# # If PCA1 = (a,b,c) and PCA2 = (d, e, f), then the cross product is
# # PCA1 x PCA2 = (bf - ce, cd - af, ae - bd)
# # PCA1 = pca1$rotation[,1] and PCAS2=pca1$rotation[,2]
# # https://en.wikipedia.org/wiki/Cross_product#Names
# # prcomp$rotation contains the matrix of variable loadings,
# # i.e., a matrix whose columns contain the eigenvectors
# #normVec = c(pca1$rotation[,1][2]*pca1$rotation[,2][3]-
# # pca1$rotation[,1][3]*pca1$rotation[,2][2],
# # pca1$rotation[,1][3]*pca1$rotation[,2][1]-
# # pca1$rotation[,1][1]*pca1$rotation[,2][3],
# # pca1$rotation[,1][1]*pca1$rotation[,2][2]-
# # pca1$rotation[,1][2]*pca1$rotation[,2][1]
# # )
# normVec = c(pca1$rotation[2,1]*pca1$rotation[3,2]-
# pca1$rotation[3,1]*pca1$rotation[2,2],
# pca1$rotation[3,1]*pca1$rotation[1,2]-
# pca1$rotation[1,1]*pca1$rotation[3,2],
# pca1$rotation[1,1]*pca1$rotation[2,2]-
# pca1$rotation[2,1]*pca1$rotation[1,2]
# )
#
# # Plot the PCA Plane
# plot3d(dat$BB, dat$HR, dat$R, type = "s", col = "red", size = 1)
# planes3d(normVec[1], normVec[2], normVec[3], 90, alpha = 0.5)
# myScatter3D <- scatterplot3d(dat$BB, dat$HR, dat$R)
dat$name <- Teams[Teams$G==162&Teams$yearID<2002, "name"]
fit = lm(dat$R ~ dat$BB + dat$HR, data = dat)
# Plot the linear model
# get the BB & HR ranges summary(dat$BB); summary(dat$HR)
cf = fit$coefficients
pltx = seq(344, 775,length.out = length(dat$BB))
plty = seq(11,264,length.out = length(dat$BB))
pltz = cf[1] + cf[2]*pltx + cf[3]*plty
# Plot Scatter and add the LM line to the plot
plot_ly() %>%
add_trace(x = ~pltx, y = ~plty, z = ~pltz, type="scatter3d", mode="lines",
line = list(color = "red", width = 4), name="lm(R ~ BB + HR") %>%
add_markers(x = ~dat$BB, y = ~dat$HR, z = ~dat$R, color = ~dat$name, mode="markers") %>%
layout(scene = list(xaxis = list(title = '(BB) Walks by batters'),
yaxis = list(title = '(HR) Home runs by batters'),
zaxis = list(title = '(R) Runs scored')))# Plot Scatter and add the LM PLANE to the plot
lm <- lm(R ~ 0 + HR + BB, data = dat)
#Setup Axis
axis_x <- seq(min(dat$HR), max(dat$HR), length.out=100)
axis_y <- seq(min(dat$BB), max(dat$BB), length.out=100)
#Sample points
lm_surface <- expand.grid(HR = axis_x, BB = axis_y, KEEP.OUT.ATTRS = F)
lm_surface$R <- predict.lm(lm, newdata = lm_surface)
lm_surface <- acast(lm_surface, HR ~ BB, value.var = "R") #`R`~ 0 + HR + BB
plot_ly(dat, x = ~HR, y = ~BB, z = ~R,
text = ~name, type = "scatter3d", mode = "markers", color = ~dat$name) %>%
add_trace(x = ~axis_x, y = ~axis_y, z = ~lm_surface, type="surface", color="gray", opacity=0.3) %>%
layout(title="3D Plane Regression (R ~ BB + HR); Color=BB Team", showlegend = F,
xaxis = list(title = '(BB) Walks by batters'),
yaxis = list(title = '(HR) Home runs by batters'),
zaxis = list(title = '(R) Runs scored')) %>%
hide_colorbar()We can also express the covariance matrix for our features using matrix operation. Suppose
\[X_{N\times K}=\left(\begin{array}{cccc} X_{1, 1}&\cdots&X_{1, K}\\ X_{2, 1}&\cdots&X_{2, K}\\ \vdots&\ddots&\vdots\\ X_{N, 1}&\cdots&X_{N, K} \end{array}\right) =[X_1 \ X_2\ \cdots\ X_K]. \]
Then the \(K\times K\) (square) covariance matrix is: \[\Sigma_{K\times K} = (\Sigma_{i, j}),\] where \(\Sigma_{i, j}=Cov(X_i, X_j)=E\left( (X_i-\mu_i)(X_j-\mu_j)\right)\), \(1\leq i, j, \leq K\). Note that earlier, we denoted the number of variables by \(p\), whereas here we use \(K\); this is to avoid potential confusion with probabilities, \(p\).
The sample covariance matrix is: \[\Sigma_{i, j}=\frac{1}{N-1}\sum_{m=1}^{N}\left( x_{m, i}-\bar{x}_i \right) \left( x_{m, j}-\bar{x}_j \right) , \]
where \[\bar{x}_{i}=\frac{1}{N}\sum_{m=1}^{N}x_{m, i}, \quad i=1, \cdots, K .\]
In general, \[\Sigma=\frac{1}{n-1}(X-\bar{X})^t(X-\bar{X}).\]
Suppose that we want to get the sample covariance matrix of the following \(5\times 3\) feature matrix \(x\).
x <- matrix(c(4.0, 4.2, 3.9, 4.3, 4.1, 2.0, 2.1, 2.0, 2.1, 2.2, 0.60, 0.59, 0.58, 0.62, 0.63), ncol=3)
x## [,1] [,2] [,3]
## [1,] 4.0 2.0 0.60
## [2,] 4.2 2.1 0.59
## [3,] 3.9 2.0 0.58
## [4,] 4.3 2.1 0.62
## [5,] 4.1 2.2 0.63
Notice that this matrix represents the design matrix of \(3\) features and \(5\) observations. Let’s compute the column means first.
## [,1] [,2] [,3]
## [1,] 4.1 2.08 0.604
## [,1] [,2] [,3]
## [1,] 0.02500 0.00750 0.00175
## [2,] 0.00750 0.00700 0.00135
## [3,] 0.00175 0.00135 0.00043
In the covariance matrix, \(S[i, i]\) is the variance of the i-th feature and \(S[i, j]\) is the covariance of i-th and j-th features.
Compare this to the automated calculation of the variance-covariance matrix.
## [,1] [,2] [,3]
## [1,] 0.02500 0.00750 0.00175
## [2,] 0.00750 0.00700 0.00135
## [3,] 0.00175 0.00135 0.00043
Later, in Chapter 5 (Supervised classification), we will cover some classification methods that use this mathematical framework for model-based and model-free ML/AI prediction. However, let’s start with linear model-based statistical methods providing forecasting and classification functionality. Specifically, we will (1) demonstrate the predictive power of multivariate linear regression, (2) show the foundation of regression trees and model trees, and (3) examine two complementary case-studies (Baseball Players and Heart Attack).
Regression represents a model of a relationship between a dependent variable (value to be predicted) and a group of independent variables (predictors or features). We assume the relationships between the outcome dependent variable and the independent variables is linear.
Earlier, we discussed the straightforward case of regression as simple linear regression, which involves a single predictor \[y=a+bx.\]
In this slope-intercept formula, a is the model
intercept and b is the model slope. Thus,
simple linear regression may be expressed as a bivariate equation. If we
know a and b, for any given x we
can estimate, or predict, y via the regression formula. If
we plot x against y in a 2D coordinate system,
where the two variables are exactly linearly related, the results will
be a straight line.
However, this is the ideal case. Bivariate scatterplots using real
world data may show patterns that are not necessarily precisely linear,
see Chapter
2. Let’s look at a bivariate scatterplot and try to fit a simple
linear regression line using two variables, e.g.,
hospital charges or CHARGES as a dependent
variable, and length of stay in the hospital or
LOS as an independent predictor. The data is available in
the DSPA
Data folder as CaseStudy12_AdultsHeartAttack_Data. We
can remove the pair of observations with missing values using the
command
heart_attack<-heart_attack[complete.cases(heart_attack), ].
library(plotly)
heart_attack <- read.csv("https://umich.instructure.com/files/1644953/download?download_frd=1", stringsAsFactors = F)
heart_attack$CHARGES <- as.numeric(heart_attack$CHARGES)
heart_attack <- heart_attack[complete.cases(heart_attack), ]
fit1 <- lm(CHARGES ~ LOS, data=heart_attack)
# par(cex=.8)
# plot(heart_attack$LOS, heart_attack$CHARGES, xlab="LOS", ylab = "CHARGES")
# abline(fit1, lwd=2, col="red")
plot_ly(heart_attack, x = ~LOS, y = ~CHARGES, type = 'scatter', mode = "markers", name="Data") %>%
add_trace(x=~mean(LOS), y=~mean(CHARGES), type="scatter", mode="markers",
name="(mean(LOS), mean(Charge))", marker=list(size=20, color='blue', line=list(color='yellow', width=2))) %>%
add_lines(x = ~LOS, y = fit1$fitted.values, mode = "lines", name="Linear Model") %>%