SOCR ≫ DSPA ≫ DSPA2 Topics ≫

In this chapter we present the fundamental linear algebra concepts underpinning many modern matrix computing operations, vector-based operations, linear modeling techniques, eigenspectra decompositions, multivariate linear regression modeling, ordinary least squares estimation, machine learning and artificial intelligence algorithms. We will demonstrate these techniques using simulated data and real observed data of baseball players and clinical heart attack patients.

Some students and readers may find it useful to first review some of the fundamental mathematical representations, analytical modeling techniques, and basic concepts. These foundations play critical roles in all subsequent chapters and sections. Examples of core mathematical principles include calculus of differentiation and integration; representation of scalars, vectors, matrices, and tensors; displacement, velocity, and acceleration; polynomials, exponents, and logarithmic functions; Taylor’s series; complex numbers; ordinary and partial differential equations; probability and statistics; statistical moments; and probability distributions.

1 Linear Algebra

Linear algebra is a branch of mathematics that studies linear associations using vectors, vector-spaces, linear equations, linear transformations and matrices. Although it is generally challenging to visualize complex data, e.g., large vectors, tensors, and tables in n-dimensional Euclidean spaces (\(n\ge 3\)), linear algebra allows us to represent, model, synthesize and summarize such complex data.

Virtually all natural processes permit first-order linear approximations. This is useful because linear equations are easy (to write, interpret, solve) and these first order approximations may be useful to practically assess the process, determine general trends, identify potential patterns, and suggest associations in the data.

Linear equations represent the simplest type of models for many processes. Higher order models may include additional nonlinear terms, e.g., Taylor-series expansion. Linear algebra provides the foundation for linear representation, analytics, solutions, inference and visualization of first-order affine models. Linear algebra is a small part of the larger mathematics functional analysis field, which is actually the infinite-dimensional version of linear algebra.

Specifically, linear algebra allows us to computationally manipulate, model, solve, and interpret complex systems of equations representing large numbers of dimensions/variables. Arbitrarily large problems can be mathematically transformed into simple matrix equations of the form \(A x = b\) or \(A x = \lambda x\).

In this chapter, we review the fundamentals of linear algebra, matrix manipulation and their applications to representation, modeling, and analysis of real data. Specifically, we will cover (1) construction of matrices and matrix operations, (2) general matrix algebra notations, (3) eigenvalues and eigenvectors of linear operators, (4) least squares estimation, and (5) linear regression and variance-covariance matrices.

1.1 Building Matrices

The easiest way to create a matrix in R is by using the matrix() or array() functions, which allow splicing a long vector into a matrix or a tensor of certain size.

seq1<-seq(1:6)
m1<-matrix(seq1, nrow=2, ncol=3)
m1
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m2<-diag(seq1)
m2
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    2    0    0    0    0
## [3,]    0    0    3    0    0    0
## [4,]    0    0    0    4    0    0
## [5,]    0    0    0    0    5    0
## [6,]    0    0    0    0    0    6
m3<-matrix(rnorm(20), nrow=5)
m3
##            [,1]       [,2]       [,3]       [,4]
## [1,] 0.42826855 0.78581107  0.6956294  1.2622379
## [2,] 1.69432442 0.03411308  1.3167843 -0.6325525
## [3,] 0.08254209 0.04679539 -0.0180993 -0.2298754
## [4,] 0.90331659 1.54208243  1.7342205 -0.1269655
## [5,] 1.95724419 0.59320691  0.2500012 -0.9664280

The function diag() is very useful. When the object is a vector, it creates a diagonal matrix with the vector in the principal diagonal.

diag(c(1, 2, 3))
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    2    0
## [3,]    0    0    3

When the object is a matrix, diag() returns its principal diagonal.

diag(m1)
## [1] 1 4

When the object is a scalar, diag(k) returns a \(k\times k\) identity matrix.

diag(4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    1    0    0
## [3,]    0    0    1    0
## [4,]    0    0    0    1

The functions cbind() and rbind() are also useful for building matrices from vectors by column or row concatenation.

c1<-1:5
m4<-cbind(m3, c1)
m4
##                                                  c1
## [1,] 0.42826855 0.78581107  0.6956294  1.2622379  1
## [2,] 1.69432442 0.03411308  1.3167843 -0.6325525  2
## [3,] 0.08254209 0.04679539 -0.0180993 -0.2298754  3
## [4,] 0.90331659 1.54208243  1.7342205 -0.1269655  4
## [5,] 1.95724419 0.59320691  0.2500012 -0.9664280  5
r1<-1:4
m5<-rbind(m3, r1)
m5
##          [,1]       [,2]       [,3]       [,4]
##    0.42826855 0.78581107  0.6956294  1.2622379
##    1.69432442 0.03411308  1.3167843 -0.6325525
##    0.08254209 0.04679539 -0.0180993 -0.2298754
##    0.90331659 1.54208243  1.7342205 -0.1269655
##    1.95724419 0.59320691  0.2500012 -0.9664280
## r1 1.00000000 2.00000000  3.0000000  4.0000000

Note that m5 has a row name r1 in the 4th row. We remove row/column names by naming them as NULL.

dimnames(m5)<-list(NULL, NULL)
m5
##            [,1]       [,2]       [,3]       [,4]
## [1,] 0.42826855 0.78581107  0.6956294  1.2622379
## [2,] 1.69432442 0.03411308  1.3167843 -0.6325525
## [3,] 0.08254209 0.04679539 -0.0180993 -0.2298754
## [4,] 0.90331659 1.54208243  1.7342205 -0.1269655
## [5,] 1.95724419 0.59320691  0.2500012 -0.9664280
## [6,] 1.00000000 2.00000000  3.0000000  4.0000000

1.2 Matrix subscripts

Each element in a matrix has a location indexed by the corresponding row and column. A[i, j] stores the element in the ith row and jth column in the matrix A. We can also access some specific rows or columns using matrix subscripts.

m6<-matrix(1:12, nrow=3)
m6
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
m6[1, 2]
## [1] 4
m6[1, ]
## [1]  1  4  7 10
m6[, 2]
## [1] 4 5 6
m6[, c(2, 3)]
##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8
## [3,]    6    9

The ordinal scalar operations addition, subtraction, multiplication and division can be generalized as matrix operations.

1.3 Addition and subtraction

Matrix addition and subtraction require matrices of the same dimensions. The sum or difference of two matrices are matrices containing elements representing the scalar sum or difference, respectively, of the values in corresponding positions in the two matrices.

m7<-matrix(1:6, nrow=2)
m7
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
m8<-matrix(2:7, nrow = 2)
m8
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
m7+m8
##      [,1] [,2] [,3]
## [1,]    3    7   11
## [2,]    5    9   13
m8-m7
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1
m8-1
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

1.4 Multiplication

Element-wise matrix multiplication is valid for matrices of the same sizes. However, matrix multiplication is different from component-wise scalar multiplication, and requires a special match between the dimensions of the multiplied matrices, \(P_{m\times k}=L_{m\times n} \cdot R_{n\times k}\). That is, the number of columns in the left matrix, \(L\),, must equal to the number of rows of the right matrix, \(R\). Then, the \(row\times column\) matrix multiplication rule yields a product matrix of dimensions corresponding to the number of rows (\(m\)) of \(L\) and the number of columns (\(k\)) of \(R\), i.e., \(P_{m\times k}\).

1.4.1 Element-wise multiplication

Element-wise matrix multiplication (\(*\)) involves scalar products of the elements in the same positions.

m8 * m7
##      [,1] [,2] [,3]
## [1,]    2   12   30
## [2,]    6   20   42

1.4.2 Matrix multiplication (Product)

Matrix product (\(\%*\%\)) generates an output matrix having the same number of rows as the left matrix and the same number of columns as the right matrix.

dim(m8)
## [1] 2 3
m9<-matrix(3:8, nrow=3)
m9
##      [,1] [,2]
## [1,]    3    6
## [2,]    4    7
## [3,]    5    8
dim(m9)
## [1] 3 2
M = m8 %*% m9
M
##      [,1] [,2]
## [1,]   52   88
## [2,]   64  109

The product of multiplying two matrices \(m^8_{2\times 3}\) * \(m^9_{3\times 2}\) is another matrix \(M_{2\times 2}\) of dimensions \(2\times 2\).

The process of multiplying two vectors is called outer product. Assume we have two vectors \(u\) and \(v\). Using matrix multiplication, their outer product is the same as \(u\ \underbrace{\% * \%}_{product}\ \overbrace{t(v)}^{transpose}\) or mathematically \(uv^t\). In R the vector outer product operator is %o%, which generates second order tensors (matrices).

u<-c(1, 2, 3, 4, 5)
v<-c(4, 5, 6, 7, 8)
u %o% v
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    5    6    7    8
## [2,]    8   10   12   14   16
## [3,]   12   15   18   21   24
## [4,]   16   20   24   28   32
## [5,]   20   25   30   35   40
u %*% t(v)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    4    5    6    7    8
## [2,]    8   10   12   14   16
## [3,]   12   15   18   21   24
## [4,]   16   20   24   28   32
## [5,]   20   25   30   35   40

What are the differences between \(u \%*\% v\), \(u \%*\% t(v)\), \(u * t(v)\), and \(u * v\)?

1.4.3 Matrix Inversion (Division)

Element-wise matrix (or scalar) division is well defined for matrices of the same dimensions.

m8 / m7
##      [,1]     [,2]     [,3]
## [1,]  2.0 1.333333 1.200000
## [2,]  1.5 1.250000 1.166667
m8 / 2
##      [,1] [,2] [,3]
## [1,]  1.0  2.0  3.0
## [2,]  1.5  2.5  3.5

However, matrix inversion is different. Recall that the transpose of a matrix is a matrix with swapped columns and rows. In R, matrix transposition is done by the function t().

m8
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
t(m8)
##      [,1] [,2]
## [1,]    2    3
## [2,]    4    5
## [3,]    6    7

Notice that the \([1, 2]\) element in m8[1, 2] is the same as the \([2, 1]\) element in t(m8)[2, 1].

The right inverse of a matrix, (\(A^{-1}_{m\times n}\)), is a special matrix with the property that multiplying the original matrix (\(A_{n\times m}\)) on the right by this inverse (\(A^{-1}\)) yields the identity matrix, which has \(1\)’s on the main diagonal and \(0\)’s off the diagonal. That is,

\[A_{n\times m}\ A^{-1}_{m\times n} = I_{n\times n}.\] Similarly, a left matrix inverse is defined as a matrix, (\(A^{-1}_{m\times n}\)), with the property that multiplying the original matrix (\(A_{n\times m}\)) on the left by this inverse (\(A^{-1}\)) yields the identity matrix.

\[ A^{-1}_{m\times n}\ A_{n\times m} = I_{m\times m}.\]

A matrix inverse is only defined for square matrices, \(m=n\).

Given four numbers satisfying \(ad-bc \not= 0\), the following \(2\times 2\) matrix.

\[A_{2\times 2}=\left(\begin{array}{cc} a & b \\ c & d \end{array}\right) ,\]

has an inverse matrix given by

\[A^{-1}_{2\times 2} =\frac{1}{ad-bc}\left(\begin{array}{cc} d & -b \\ -c & a \end{array}\right) .\]

It’s easy to validate that \(A_{2\times 2} A^{-1}_{2\times 2} =I_{2\times 2}\).

In higher dimensions, the Cramer’s rule may be used to compute the matrix inverse. Matrix inversion is available in R via the solve() function.

m10 <- matrix(1:4, nrow=2)
m10
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
solve(m10)
##      [,1] [,2]
## [1,]   -2  1.5
## [2,]    1 -0.5
m10 %*% solve(m10)
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

Note that only special matrices are invertible, not all. These matrices are square (have the same number of rows and columns) and non-singular.

Another function that can help us to get the inverse of a matrix is the ginv() function in the MASS package. This function gives us the Moore-Penrose Generalized Inverse of a matrix.

require(MASS)
## Loading required package: MASS
ginv(m10)
##      [,1] [,2]
## [1,]   -2  1.5
## [2,]    1 -0.5

In addition, the function solve() can be used to solve matrix equations. For instance, solve(A, b) returns a vector \(x\) satisfying the equation \(b = Ax\), i.e., \(x= A^{-1}b\).

s1 <- diag(c(2, 4, 6, 8))
s2 <- c(1, 2, 3, 4)
solve(s1, s2)
## [1] 0.5 0.5 0.5 0.5

The following table summarizes some of the basic matrix operation functions.

Expression Explanation
t(x) transpose
diag(x) diagonal
%*% matrix multiplication
solve(a, b) solves a %*% x = b for x
solve(a) matrix inverse of a
rowsum(x) sum of rows for a matrix-like object. rowSums(x) is a faster version
colSums(x), colSums(x) id. for columns
rowMeans(x) fast version of row means
colMeans(x) id. for columns
mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
mat1.inv <- solve(mat1)

mat1.identity <- mat1.inv %*% mat1
mat1.identity
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
b <- c(1, 2)
x <- solve (mat1, b)
x
## [1] 1.785714 2.357143

2 Matrix Computing

Let’s look at the basics of matrix notation and matrix algebra. The product \(AB\) between matrices \(A\) and \(B\) is defined only if the number of columns in \(A\) equals the number of rows in \(B\). That is, we can multiply an \(m\times n\) matrix \(A\) by an \(n\times k\) matrix \(B\) and the result will be \(AB_{m\times k}\) matrix. Each element of the product matrix, \((AB_{i, j})\), represents the product of the \(i\)-th row in \(A\) and the \(j\)-th column in \(B\), which are of the same size \(n\). Matrix multiplication is row-by-column.

Linear algebra notation simplifies the mathematical descriptions and manipulations of linear models, as well as coding in R.

The main point is to show how we can write linear models using matrix notation. Later, we’ll explain how this is useful for solving the least squares problems.

2.1 Solving Systems of Equations

Linear algebra notation enables the mathematical analysis and derivation of solutions of systems of linear equations and provides a generic machinery for solving linear problems.

\[\begin{align*} a + b + 2c &= 6\\ 3a - 2b + c &= 2\\ 2a + b - c &= 3 \end{align*} \]

\[\underbrace{\begin{pmatrix} 1&1&2\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}}_{\text{A}} \underbrace{\begin{pmatrix} a\\ b\\ c \end{pmatrix}}_{\text{x}} = \underbrace{\begin{pmatrix} 6\\ 2\\ 3 \end{pmatrix}}_{\text{b}}\]

That is, \(Ax = b\), which implies that:

\[\begin{pmatrix} a\\ b\\ c \end{pmatrix} = \begin{pmatrix} 1&1&2\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}^{-1} \begin{pmatrix} 6\\ 2\\ 3 \end{pmatrix}\]

In other words, \(A^{-1}A x ==x = A^{-1}b\).

Notice that this approach parallels the strategy for solving of simple (univariate) linear equations like: \[\underbrace{2}_{\text{(design matrix) A }} \overbrace{x}^{\text{unknown x }} \underbrace{-3}_{\text{simple constant term}} = \overbrace{5}^{\text{b}}.\]

The constant term, \(-3\), can be moved and integrated into the right-hand-side, \(b\), to form a new term \(b'=5+3=8\). Thus, the shifting factor is mostly ignored in linear models, or linear equations, which simplifies the linear matrix equation to: \[\underbrace{2}_{\text{(design matrix) A }} \overbrace{x}^{\text{unknown x }} = \underbrace{5+3}_{b'}=\overbrace{8}^{b'}.\]

This (simple) linear equation is solved by multiplying both hand sides by the inverse (reciprocal) of the \(x\) multiplier, \(2\).

\[\frac{1}{2} 2 x = \frac{1}{2} 8.\] Thus, the unique solution is: \[x = \frac{1}{2} 8=4.\]

So, let’s use exactly the same strategy to solve the corresponding matrix equation (linear equation, \(Ax = b\)) using R, where the unknown is \(x\), and the design matrix \(A\) and the constant vector \(b\) are known.

\[\underbrace{\begin{pmatrix} 1&1&2\\ 3&-2&1\\ 2&1&-1 \end{pmatrix}}_{\text{A}} \underbrace{\begin{pmatrix} a\\ b\\ c \end{pmatrix}}_{\text{x}} = \underbrace{\begin{pmatrix} 6\\ 2\\ 3 \end{pmatrix}}_{\text{b}}.\]

A_matrix_values <- c(1, 1, 2, 3, -2, 1, 2, 1, -1)
# matrix elements are arranged by columns, so, we need to transpose them to arrange them by rows.
A <- t(matrix(A_matrix_values, nrow=3, ncol=3))  

b <- c(6, 2, 3)
# to solve Ax = b, x=A^{-1}*b
x <- solve (A, b)
# Ax = b ==> x = A^{-1} * b
x
## [1] 1.35 1.75 1.45
# Check the Solution x=(1.35 1.75 1.45)
LHS <- A %*% x
round(LHS-b, 6)
##      [,1]
## [1,]    0
## [2,]    0
## [3,]    0

How about if we want to triple-check the consistency of the solve() method to provide accurate solutions to matrix-based systems of linear equations?

We can generate the solution (\(x\)) to the equation \(Ax=b\) by using first principles. \[ x = A^{-1}b.\]

A.inverse <- solve(A) # the inverse matrix A^{-1} 
x1 <- A.inverse %*% b
# check if X and x1 are the same
x; x1
## [1] 1.35 1.75 1.45
##      [,1]
## [1,] 1.35
## [2,] 1.75
## [3,] 1.45
round(x - x1, 6)
##      [,1]
## [1,]    0
## [2,]    0
## [3,]    0

2.2 The identity matrix

The identity matrix is the matrix analog to the multiplicative numeric identity, the number \(1\). Multiplying the identity matrix by any other matrix (\(B\)) does not change the matrix \(B\). This property requires that the multiplicative identity matrix must look like this.

\[\mathbf{I} = \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ \vdots &\vdots & \vdots & \ddots&\vdots&\vdots\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix}.\]

The identity matrix is always a square matrix with diagonal elements \(1\) and \(0\) at the off-diagonal elements.

Following the above matrix multiplication rules, we can see that:

\[\mathbf{X\times I} = \begin{pmatrix} x_{1, 1} & \dots & x_{1, p}\\ & \vdots & \\ x_{n, 1} & \dots & x_{n, p} \end{pmatrix} \begin{pmatrix} 1&0&0&\dots&0&0\\ 0&1&0&\dots&0&0\\ 0&0&1&\dots&0&0\\ & & &\vdots & &\\ 0&0&0&\dots&1&0\\ 0&0&0&\dots&0&1 \end{pmatrix}=\]

\[\begin{pmatrix} x_{1, 1} & \dots & x_{1, p}\\ & \vdots & \\ x_{n, 1} & \dots & x_{n, p} \end{pmatrix}= \mathbf{X}.\]

In R, we can express the identity matrix as follows.

n <- 3 #pick dimensions
I <- diag(n); I
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
A %*% I; I %*% A
##      [,1] [,2] [,3]
## [1,]    1    1    2
## [2,]    3   -2    1
## [3,]    2    1   -1
##      [,1] [,2] [,3]
## [1,]    1    1    2
## [2,]    3   -2    1
## [3,]    2    1   -1

2.3 Vectors, Matrices, and Scalars

Let’s look at this notation deeper using the Baseball players data containing three quantitative variables, Heights, Weight, and Age. Suppose the variable Weight is considered as a random response (outcome vector) denoted by \(Y_1, Y_2, \dots, Y_n\).

We can express each player’s Weight as a function of Age and Height.

# Data: https://umich.instructure.com/courses/38100/files/folder/data   (01a_data.txt)
data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1', as.is=T, header=T)    
attach(data)
head(data)
##              Name Team       Position Height Weight   Age
## 1   Adam_Donachie  BAL        Catcher     74    180 22.99
## 2       Paul_Bako  BAL        Catcher     74    215 34.69
## 3 Ramon_Hernandez  BAL        Catcher     72    210 30.78
## 4    Kevin_Millar  BAL  First_Baseman     72    210 35.43
## 5     Chris_Gomez  BAL  First_Baseman     73    188 35.71
## 6   Brian_Roberts  BAL Second_Baseman     69    176 29.39

In matrix form, we can express the outcome using one symbol, \(\mathbf{Y}\). We usually use bold face to distinguish between scalars, vectors, matrices and tensors.

\[\mathbf{Y} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix}.\]

In R, the default representation of vector data is as columns, i.e., our outcome vector dimension is \(n\times 1\), as opposed to \(1 \times n\) used for row vectors (e.g., \(\mathbf{Y}^t\)).

Similarly, we can use matrix notation to represent the covariates, or predictors, Age and Height. In a case with two predictors, we can represent them like this:

\[\mathbf{X}_1 = \begin{pmatrix} x_{1, 1}\\ \vdots\\ x_{n, 1} \end{pmatrix} \mbox{ and } \mathbf{X}_2 = \begin{pmatrix} x_{1, 2}\\ \vdots\\ x_{n, 2} \end{pmatrix}.\]

In the that for the Baseball players study, \(x_{1, 1}= Age_1\) and \(x_{i, 1}=Age_i\) with \(Age_i\) representing the Age of the \(i\)-th player, and similarly, \(x_{i, 2}= Height_i\) is the height of the \(i\)-th player. These vectors can be thought of as \(n\times 1\) matrices. For instance, it is convenient to represent the covariates as design matrices.

\[\mathbf{X} = [ \mathbf{X}_1 \mathbf{X}_2 ] = \begin{pmatrix} x_{1, 1}&x_{1, 2}\\ \vdots\\ x_{n, 1}&x_{n, 2} \end{pmatrix}. \]

This design matrix has dimension \(n \times 2\).

X <- cbind(Age, Height)
head(X)
##        Age Height
## [1,] 22.99     74
## [2,] 34.69     74
## [3,] 30.78     72
## [4,] 35.43     72
## [5,] 35.71     73
## [6,] 29.39     69
dim(X)
## [1] 1034    2

We can also use this notation to denote an arbitrary number (\(k\)) of covariates with the following \(n\times k\) matrix.

\[\mathbf{X} = \begin{pmatrix} x_{1, 1}&\dots & x_{1, k} \\ x_{2, 1}&\dots & x_{2, k} \\ & \vdots & \\ x_{n, 1}&\dots & x_{n, k} \end{pmatrix}. \]

You can simulate such a design matrix in R using matrix(), instead of cbind.

n <- 1034; k <- 5
X <- matrix(1:(n*k), n, k)
head(X)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1 1035 2069 3103 4137
## [2,]    2 1036 2070 3104 4138
## [3,]    3 1037 2071 3105 4139
## [4,]    4 1038 2072 3106 4140
## [5,]    5 1039 2073 3107 4141
## [6,]    6 1040 2074 3108 4142
dim(X)
## [1] 1034    5

By default, matrices are filled column-by-column order, however using the byrow=TRUE argument allows us to change the order to row-by-row.

n <- 1034; k <- 5
X <- matrix(1:(n*k), n, k, byrow=TRUE)
head(X)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
## [4,]   16   17   18   19   20
## [5,]   21   22   23   24   25
## [6,]   26   27   28   29   30
dim(X)
## [1] 1034    5

Scalars are just one-dimensional values, typically numbers, that are different from their higher-dimensional counterparts, vectors, matrices, and tensors, which are usually denoted by bold characters.

2.4 Sample Statistics

To compute the sample average and variance of a dataset, we use the formulas: \[\bar{Y}=\frac{1}{n} \sum_{i=1}^n {Y_i}\]

and

\[\mbox{var}(Y)=\frac{1}{n-1} \sum_{i=1}^n {(Y_i - \bar{Y})}^2, \] which can be represented as matrix multiplications.

Define an \(n \times 1\) matrix made of \(1\)’s.

\[A=\begin{pmatrix} 1\\ 1\\ \vdots\\ 1 \end{pmatrix}.\]

This implies that.

\[\frac{1}{n} \mathbf{A}^\top Y = \frac{1}{n} \begin{pmatrix}1&1& \dots&1\end{pmatrix} \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix}= \frac{1}{n} \sum_{i=1}^n {Y_i}= \bar{Y}.\]

Recall that we multiply matrices and scalars, like \(\frac{1}{n}\), by *, whereas we multiply matrices using the matrix product operator, %*%.

# Using the Baseball dataset
y <- data$Height
print(mean(y))
## [1] 73.69729
n <- length(y)
Y <- matrix(y, n, 1)
A <- matrix(1, n, 1)
barY = (t(A) %*% Y) / n

print(barY)
##          [,1]
## [1,] 73.69729
# double-check the result
mean(data$Height)
## [1] 73.69729

Multiplying the transpose of a matrix with another matrix is very common in linear modeling and statistical computing, so there is an appropriate function in R, crossprod().

barY = (crossprod(A, Y)) / n
print(barY)
##          [,1]
## [1,] 73.69729

There is a similar matrix algebra for computing the variance.

\[\mathbf{Y'}\equiv \begin{pmatrix} Y_1 - \bar{Y}\\ \vdots\\ Y_n - \bar{Y} \end{pmatrix}, \, \, \frac{1}{n-1} \mathbf{Y'}^\top\mathbf{Y'} = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar{Y})^2. \]

A crossprod with only one matrix computes \(Y^\top Y\).

Y1 <- y - mean(y)
crossprod(Y1)/(n-1)  # Y1.man <- (1/(n-1))* t(Y1) %*% Y1
##          [,1]
## [1,] 5.316798
# Check the result
var(y)
## [1] 5.316798

2.5 Applications of Matrix Algebra in Linear Modeling

Let’s use the following pair of matrices.

\[\overbrace{\mathbf{Y}}^{outcome} = \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix} , \underbrace{\mathbf{X}}_{design} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_n \end{pmatrix} , \overbrace{\mathbf{\beta}}^{effects} = \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} \mbox{ and } \underbrace{\mathbf{\varepsilon}}_{error} = \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{pmatrix}.\]

Then, we can express the problem as a linear model.

\[Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, i=1, \dots, n\]

We can also express the complete problem formulation into a corresponding succinct matrix notation formula.

\[\begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix} = \begin{pmatrix} 1&x_1\\ 1&x_2\\ \vdots\\ 1&x_n \end{pmatrix} \begin{pmatrix} \beta_0\\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{pmatrix}. \]

And in its matrix form.

\[\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon},\]

which is a simpler way to write the same model equation.

One way to obtain an optimal solution is by minimizing all residuals (\(\epsilon_i\)). This high-fidelity criterion indicates a good model fit. The least squares (LS) solution represents one way to solve this matrix equation (\(Y=X\beta+\epsilon\)). The LS solution is obtained by minimizing the residual sum square error \[\langle\epsilon^t, \epsilon \rangle = (Y-X\beta)^t \times (Y-X\beta).\]

Let’s define the LS objective function using the cross-product notation.

\[f(\beta) = (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^t (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}). \]

We can determine the effect size estimates, \({\hat {\beta}}\), are obtained by minimizing this expression. Of course, we can derive an analytic solution using calculus to find the minimum of the cost (objective) function, \(f(\beta)\).

2.6 Finding function extrema (min/max) using calculus

There are a number of rules that help with solving partial derivative equations in matrix form. recall that the critical points of the objective functions are either at the domain border or at values where the derivative of a objective function is trivial, \(f'(x)=0\). hence, solving for the unknown parameter \(\beta\) requires identifying the critical points, \({\hat {\beta}}\), which will represent candidate solution(s). The derivative of the above equation is

\[2 \mathbf{X}^t (\mathbf{Y} - \mathbf{X} \boldsymbol{\hat{\beta}})=0,\]

\[\mathbf{X}^t \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^t \mathbf{Y},\]

\[\boldsymbol{\hat{\beta}} = (\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{Y}.\]

This estimate \({\hat {\beta}}\) represents the desired LS solution to the linear modeling problem. The hat notation, \({\hat {\cdot}}\), is used to denote estimates. For instance, the solution for the unknown \(\beta\) parameters is denoted by the (data-driven) estimate \(\hat{\beta}\).

The least squares minimization works because minimizing a function corresponds to finding the roots of its (first) derivative. With ordinary least squares (OLS), we square the residuals.

\[(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^t (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}).\]

Notice that the minimum of \(f(x)\) and \(f^2(x)\) are achieved at the same roots of \(f'(x)\), as the derivative of \(f^2(x)\) is \(\frac{d}{dx}f^2(x) = 2f(x)f'(x)\).

Here is how we obtain the Least Square estimation in R.

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'data':
## 
##     Position
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#x=cbind(data$Height, data$Age)
x=data$Height
y=data$Weight
X <- cbind(1, x) 
beta_hat <- solve( t(X) %*% X ) %*% t(X) %*% y 
###or
beta_hat <- solve( crossprod(X) ) %*% crossprod( X, y ) 

Now we can see the results of this by computing the estimated \(\hat{\beta}_0+\hat{\beta}_1 x\) (fitted model prediction) corresponding to any covariate input value of \(x\).

# newx <- seq(min(x), max(x), len=100)
X <- cbind(1, x)
fitted <- X%*%beta_hat
# or directly: fitted <- lm(y ~ x)$fitted
# plot(x, y, xlab="MLB Player's Height", ylab="Player's Weight")
# lines(x, fitted, col=2)

plot_ly(x = ~x) %>% 
  add_markers(y = ~y, name="Data Scatter") %>% 
  add_lines(x = ~x, y = ~fitted[,1], name="(Manual) Linear Model (Weight ~ Height)") %>% 
  add_lines(x = ~x, y = ~lm(y ~ x)$fitted, name="(Direct) lm(Weight ~ Height)", 
            line = list(width = 4, dash = 'dash')) %>% 
    layout(title='Baseball Players: Linear Model of Weight vs. Height', 
           xaxis = list(title="Height (in)"), yaxis = list(title="Weight (lb)"),
           legend = list(orientation = 'h'))

The closed-form analytical expression for the LS estimate \[\hat{\boldsymbol{\beta}}=(\mathbf{X}^t \mathbf{X})^{-1} \mathbf{X}^t \mathbf{Y}\] is one of the most widely used results in data analysis. One of the advantages of this approach is that we can use it in many different situations.

2.7 Linear modeling in R

In R, there is a very convenient function lm() that fits these linear models. We will learn more about this function later, but here is a demonstration that it agrees with a simple manual LS estimation approach we showed above.

# X <- cbind(data$Height, data$Age) # more complicated model
X <- data$Height    # simple model
y <- data$Weight
fit <- lm(y ~ X)

2.8 Eigenspectra - Eigenvalues and Eigenvectors

Starting in the 18th century, the work of Euler’s on rotational motion and later Lagrange on the study of inertia matrices, the notions of principal axes (eigenvectors) and characteristic roots (eigenvectors) arose. However, it took close to 200 years until Hilbert and others working on integral operators settled on using the terminology eigen, “own”, to denote eigenvalues (proper characteristic values) and eigenvectors (principal axes).

The eigen-spectrum (eigenspace) decomposition of linear operators (matrices) into eigenvalues and eigenvectors enables us to understand linear transformations and characterize their properties. The eigenvectors represent the “axes” (directions) along which a linear transformation acts by stretching, compressing, or flipping.

The eigenvalues represent the amounts of this linear transformation into the specified eigen-vector direction. In higher dimensions, there are more directions along which we need to understand the behavior of the linear transformation. The eigen-spectrum makes it easier to understand the linear transformation especially when many (all?) of the eigenvectors are linearly independent (orthogonal).

For a given matrix \(A\), if we have \(A\vec{v}=\lambda \vec{v}\), then we say that a nonzero vector \(\vec{v}\) is a right eigenvector of the matrix \(A\) and the scale factor \(\lambda\) is the eigenvalue corresponding to that eigenvector.

With some calculations we can show that \(A\vec{v}=\lambda \vec{v}\) is the same as \((\lambda I_n-A)\vec{v}=\vec{0}\), where \(I_n\) is the \(n\times n\) identity matrix. So, when we solve this equation, we get the corresponding eigenvalues and eigenvectors. As this is a very common operation, we don’t need to do that by hand - the method eigen() provides this functionality.

m11 <- diag(nrow = 2, ncol=2)
m11
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
eigen(m11)
## eigen() decomposition
## $values
## [1] 1 1
## 
## $vectors
##      [,1] [,2]
## [1,]    0   -1
## [2,]    1    0

We can easily validate that \((\lambda I_n-A)\vec{v}=\vec{0}\).

(eigen(m11)$values*diag(2)-m11) %*% eigen(m11)$vectors
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0

As we mentioned earlier, diag(n) creates a \(n\times n\) identity matrix. Thus, diag(2) is the \(I_2\) matrix in the above equation. The zero output matrix proves that the equation \((\lambda I_n-A)\vec{v}=\vec{0}\) holds true.

Many interesting applications of the eigen-spectrum are shown here.

2.9 Other commonly used matrix computing functions

Other important functions about matrix operation are listed in the following table.

Functions Math expression or explanation
crossprod(A, B) \(A^TB\) Where \(A\), \(B\) are matrices
y<-svd(A) the output has the following components
-y$d vector containing the singular values of A,
-y$u matrix with columns contain the left singular vectors of A,
-y$v matrix with columns contain the right singular vectors of A
k <- qr(A) the output has the following components
-k$qr has an upper triangle that contains the decomposition and a lower triangle that contains information on the Q decomposition.
-k$rank is the rank of A.
-k$qraux a vector which contains additional information on Q.
-k$pivot contains information on the pivoting strategy used.
rowMeans(A)/colMeans(A) returns vector of row/column means
rowSums(A)/colSums(A) returns vector of row/column sums

2.10 Matrix notation

Some flexible matrix operations can help us save time calculating row or column averages. For example, column averages can be calculated by the following matrix operation.

\[AX = \left(\begin{array}{cccc} \frac{1}{N}&\frac{1}{N}&\cdots&\frac{1}{N} \end{array}\right) \left(\begin{array}{cccc} X_{1, 1}&\cdots&X_{1, p}\\ X_{2, 1}&\cdots&X_{2, p}\\ \vdots&\vdots&\vdots\\ X_{N, 1}&\cdots&X_{N, p} \end{array}\right)= \left(\begin{array}{cccc} \bar{X}_1&\bar{X}_2&\cdots &\bar{X}_N \end{array}\right).\]

The row averages can be calculated similarly.

\[XB = \left(\begin{array}{cccc} X_{1, 1}&\cdots &X_{1, p}\\ X_{2, 1}&\cdots &X_{2, p}\\ \vdots &\vdots &\vdots \\ X_{N, 1}&\cdots&X_{N, p} \end{array}\right) \left(\begin{array}{c} \frac{1}{p}\\ \frac{1}{p}\\ \vdots\\ \frac{1}{p} \end{array}\right)= \left(\begin{array}{c} \bar{X}_1\\ \bar{X}_2\\ \vdots\\ \bar{X}_q \end{array}\right). \]

Expeditious matrix calculations can be done by multiplying a matrix on the left or at the right by another matrix. In general multiplying by a vector on the left can amounts to weight averaging.

\[AX = \left(\begin{array}{cccc} a_1&a_2&\cdots &a_N \end{array}\right) \left(\begin{array}{cccc} X_{1, 1}&\cdots &X_{1, p}\\ X_{2, 1}&\cdots &X_{2, p}\\ \vdots&\vdots &\vdots \\ X_{N, 1}&\cdots &X_{N, p} \end{array}\right)= \left(\begin{array}{cccc} \sum_{i=1}^N a_i \bar{X}_{i, 1}&\sum_{i=1}^N a_i \bar{X}_{i, 2}&\cdots &\sum_{i=1}^N a_i \bar{X}_{i, N} \end{array}\right). \]

Now let’s try this matrix notation to look at genetic expression data including \(8,793\) different genes for \(208\) subjects. These gene expression data represents a microarray experiment - GSE5859 - comparing Gene Expression Profiles from Lymphoblastoid cells. Specifically, the data compares the expression level of genes in lymphoblasts from individuals in three HapMap populations {CEU, CHB, JPT}. The study found that the mean expression levels between the {CEU} and {CHB+JPT} samples were significantly different (\(p < 0.05\)) for more than a thousand genes.

The gene expression profiles data has two components:

  • The gene expression intensities (exprs_GSE5859.csv): rows represent features on the microarray (e.g., genes), and columns represent different microarray samples, and
  • Meta-data about each of the samples (exprs_MetaData_GSE5859.csv) rows represent samples and columns represent meta-data (e.g., sex, age, treatment status, the date of the sample processing).
gene <- read.csv("https://umich.instructure.com/files/2001417/download?download_frd=1", 
                 header = T) # exprs_GSE5859.csv
info<-read.csv("https://umich.instructure.com/files/2001418/download?download_frd=1", 
               header=T)  # exprs_MetaData_GSE5859.csv

Recall that the lapply() function that we talked about in Chapter 2 and the sapply() function can be used to calculate column and row averages. Let’s compare the outputs of sapply and the corresponding matrix algebra process.

colmeans <- sapply(gene[, -1], mean)
gene1 <- as.matrix(gene[, -1])
# can also use built in functions
# colMeans <- colMeans(gene1)
colmeans.matrix <- crossprod(rep(1/nrow(gene1), nrow(gene1)), gene1)
colmeans[1:15]
##  GSM25581.CEL.gz  GSM25681.CEL.gz GSM136524.CEL.gz GSM136707.CEL.gz 
##         5.703998         5.721779         5.726300         5.743632 
##  GSM25553.CEL.gz GSM136676.CEL.gz GSM136711.CEL.gz GSM136542.CEL.gz 
##         5.835499         5.742565         5.751601         5.732211 
## GSM136535.CEL.gz  GSM25399.CEL.gz  GSM25552.CEL.gz  GSM25542.CEL.gz 
##         5.741741         5.618825         5.805147         5.733117 
## GSM136544.CEL.gz  GSM25662.CEL.gz GSM136563.CEL.gz 
##         5.733175         5.716855         5.750600
colmeans.matrix[1:15]
##  [1] 5.703998 5.721779 5.726300 5.743632 5.835499 5.742565 5.751601 5.732211
##  [9] 5.741741 5.618825 5.805147 5.733117 5.733175 5.716855 5.750600

The same outputs are generated by both protocols. Note that we uses rep(1/nrow(gene1), nrow(gene1)) to create the vector

\[\left(\begin{array}{cccc} \frac{1}{N}&\frac{1}{N}&\cdots &\frac{1}{N} \end{array}\right). \]

needed to obtain manually the column averages by matrix algebra. Similarly, we can compute the column means.

colmeans <- as.matrix(colmeans)
h <- hist(colmeans, plot=F)

plot_ly(x = h$mids, y = h$counts, type = "bar", name = "Column Averages") %>% 
    layout(title='Average Gene Expression Histogram', 
          xaxis = list(title = "Column Means"),
          yaxis = list(title = "Average Expression", side = "left"),
           legend = list(orientation = 'h'))

The histogram shows that the distribution is symmetric, unimodal and bell-shaped, i.e., roughly normal.

We can also solve harder problems using matrix algebra. For example, let’s calculate the differences between genders for each gene. First, we need to get the gender information for each subject.

gender <- info[, c(3, 4)]
rownames(gender) <- gender$filename

Then, we have to reorder the columns to make it consistent with the feature matrix gene1.

gender <- gender[colnames(gene1), ]

Next, we are going to design the matrix. We want to multiply it with the feature matrix. The plan is to multiply the following two matrices.

\[\left(\begin{array}{cccc} X_{1, 1}&\cdots&X_{1, p}\\ X_{2, 1}&\cdots&X_{2, p}\\ \vdots & \vdots &\vdots \\ X_{N, 1}&\cdots&X_{N, p} \end{array}\right) \left(\begin{array}{cc} \frac{1}{p}&a_1\\ \frac{1}{p}&a_2\\ \vdots & \vdots \\ \frac{1}{p}&a_p \end{array}\right)= \left(\begin{array}{cc} \bar{X}_1&gender.diff_1\\ \bar{X}_2&gender.diff_2\\ \vdots & \vdots \\ \bar{X}_N&gender.diff_N \end{array}\right),\] where \(a_i=-\frac{1}{N_F}\) if the subject is female and \(a_i=\frac{1}{N_M}\) if the subject is male. Thus, we gave each female and male the same weight before the subtraction. We average each gender and get their difference. \(\bar{X}_i\) is the average across both genders and \(gender.diff_i\) represents the gender difference for the i-th gene.

table(gender$sex)
## 
##   F   M 
##  86 122
gender$vector <- ifelse(gender$sex=="F", -1/86, 1/122)
vec1 <- as.matrix(data.frame(rowavg=rep(1/ncol(gene1), ncol(gene1)), gender.diff=gender$vector))
gender.matrix <- gene1 %*% vec1
gender.matrix[1:15, ]
##         rowavg  gender.diff
##  [1,] 6.383263 -0.003209464
##  [2,] 7.091630 -0.031320597
##  [3,] 5.477032  0.064806978
##  [4,] 7.584042 -0.001300152
##  [5,] 3.197687  0.015265502
##  [6,] 7.338204  0.078434938
##  [7,] 4.232132  0.008437864
##  [8,] 3.716460  0.018235650
##  [9,] 2.810554 -0.038698101
## [10,] 5.208787  0.020219666
## [11,] 6.498989  0.025979654
## [12,] 5.292992 -0.029988980
## [13,] 7.069081  0.038575442
## [14,] 5.952406  0.030352616
## [15,] 7.247116  0.046020066

3 Linear regression

As we mentioned earlier, the formula for linear regression can be written as

\[Y_i=\beta_0+X_{i, 1}\beta_1+\cdots+X_{i, p}\beta_p +\epsilon_i, i=1, \cdots, N\] This formula can also be expressed in matrix form.

\[\left(\begin{array}{c} Y_1\\ Y_2\\ \vdots\\ Y_N \end{array}\right)= \left(\begin{array}{c} 1\\ 1\\ \vdots\\ 1 \end{array}\right)\beta_0+ \left(\begin{array}{c} X_{1, 1}\\ X_{2, 1}\\ \vdots\\ X_{N, 1} \end{array}\right)\beta_1+\cdots + \left(\begin{array}{c} X_{1, p}\\ X_{2, p}\\ \vdots \\ X_{N, p} \end{array}\right)\beta_p+ \left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots \\ \epsilon_N \end{array}\right), \]

which can be compressed into a simple matrix equation \(Y=X\beta +\epsilon\).

\[\left(\begin{array}{c} Y_1\\ Y_2\\ \vdots \\ Y_N \end{array}\right)= \left(\begin{array}{cccc} 1&X_{1, 1}&\cdots&X_{1, p}\\ 1&X_{2, 1}&\cdots&X_{2, p}\\ \vdots&\vdots&\vdots&\vdots\\ 1&X_{N, 1}&\cdots&X_{N, p} \end{array}\right) \left(\begin{array}{c} \beta_o\\ \beta_1\\ \vdots\\ \beta_p \end{array}\right)+ \left(\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_N \end{array}\right). \] As \(Y=X\beta +\epsilon\) implies that \(X^tY \sim X^t(X\beta)=(X^tX)\beta\), and thus, the LS solution for \(\beta\) is obtained by multiplying both hand sides by the inverse of the square cross product matrix \((X^tX)^{-1}\): \[\hat{\beta}=(X^tX)^{-1}X^tY.\]

Matrix calculations are much faster, especially on specialized computer chips, than fitting a manual regression model. Let’s apply this to the Lahman baseball data representing yearly stats and standings. Let’s download it first via this link baseball.data and save it in the R working directory. We can use the load() function to import the local RData. For this example, we subset the dataset by G==162 and yearID < 2002. Also, we create a new feature named Singles that is equal to H(Hits by batters) - X2B(Doubles) - X3B(Tripples) - HR(Home Runs by batters). Finally, we only pick four features R (Runs scored), Singles, HR (Home Runs by batters), and BB (Walks by batters).

#If you downloaded the .RData locally first, then you can easily load it into the`R`workspace by:
# load("Teams.RData")

# Alternatively you can also download the data in CSV format from https://umich.instructure.com/courses/38100/files/folder/data (teamsData.csv)
Teams <- read.csv('https://umich.instructure.com/files/2798317/download?download_frd=1', header=T)

dat <- Teams[Teams$G==162&Teams$yearID<2002, ]
dat$Singles <- dat$H-dat$X2B-dat$X3B-dat$HR
dat <- dat[, c("R", "Singles", "HR", "BB")]
head(dat)
##        R Singles  HR  BB
## 439  505     997  11 344
## 1367 683     989  90 580
## 1368 744     902 189 681
## 1378 652     948 156 516
## 1380 707    1017  92 620
## 1381 632    1020 126 504

In this example, let’s work with R as the response variable and BB as the independent variable. For a full linear model, we need to add another column of \(1\)’s to the design matrix \(X\).

Y <- dat$R
X <- cbind(rep(1, n=nrow(dat)), dat$BB)
X[1:10, ]
##       [,1] [,2]
##  [1,]    1  344
##  [2,]    1  580
##  [3,]    1  681
##  [4,]    1  516
##  [5,]    1  620
##  [6,]    1  504
##  [7,]    1  498
##  [8,]    1  502
##  [9,]    1  493
## [10,]    1  556

We use the LS analytical formula to obtain the beta (effects) estimates

\[\hat{\beta}=(X^t X)^{-1}X^t Y.\]

beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
##             [,1]
## [1,] 326.8241628
## [2,]   0.7126402

To confirm this manual calculation, we can refit the linear equation using the lm() function, and compare the computational times. In this simple example, is there evidence of higher computational efficiency using matrix calculations?

fit <- lm(R~BB, data=dat)
# fit <- lm(R~., data=dat) 
# '.' indicates all other variables, very useful when fitting models with many predictors
fit
## 
## Call:
## lm(formula = R ~ BB, data = dat)
## 
## Coefficients:
## (Intercept)           BB  
##    326.8242       0.7126
summary(fit)
## 
## Call:
## lm(formula = R ~ BB, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -187.788  -53.977   -2.995   55.649  258.614 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 326.82416   22.44340   14.56   <2e-16 ***
## BB            0.71264    0.04157   17.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.95 on 661 degrees of freedom
## Multiple R-squared:  0.3078, Adjusted R-squared:  0.3068 
## F-statistic:   294 on 1 and 661 DF,  p-value: < 2.2e-16
system.time(fit <- lm(R~BB, data=dat))
##    user  system elapsed 
##       0       0       0
system.time(beta1 <- solve(t(X) %*% X) %*% t(X) %*% Y)
##    user  system elapsed 
##       0       0       0

For a better model, we can expand the covariates to include multiple predictors and compare the resulting estimates.

X <- cbind(rep(1, n=nrow(dat)), dat$BB, dat$Singles, dat$HR)
X[1:10, ]
##       [,1] [,2] [,3] [,4]
##  [1,]    1  344  997   11
##  [2,]    1  580  989   90
##  [3,]    1  681  902  189
##  [4,]    1  516  948  156
##  [5,]    1  620 1017   92
##  [6,]    1  504 1020  126
##  [7,]    1  498 1064  167
##  [8,]    1  502  937  180
##  [9,]    1  493 1048  105
## [10,]    1  556 1073  116
system.time(fit <- lm(R ~ BB+ Singles + HR, data=dat))
##    user  system elapsed 
##       0       0       0
system.time(beta2 <- solve(t(X) %*% X) %*% t(X) %*% Y)
##    user  system elapsed 
##    0.01    0.00    0.00
fit$coefficients; t(beta2)
##  (Intercept)           BB      Singles           HR 
## -401.0057242    0.3666606    0.6705535    1.7175775
##           [,1]      [,2]      [,3]     [,4]
## [1,] -401.0057 0.3666606 0.6705535 1.717577

A scatter plot would show visually the relationship between the outcome R and one of the predictors BB.

# plot(dat$BB, dat$R, xlab = "BB", ylab = "R", main = "Scatter plot/regression for baseball data")
# abline(beta1[1, 1], beta1[2, 1], lwd=4, col="red")

plot_ly(x = ~dat$BB) %>% 
  add_markers(y = ~dat$R, name="Data Scatter") %>% 
  add_lines(x = ~dat$BB, y = ~lm(dat$R ~ dat$BB)$fitted, 
            name="lm(Runs scored ~ Walks by batters)", line = list(width = 4)) %>% 
    layout(title='Scatter plot/regression for baseball data', 
           xaxis = list(title="(BB) Walks by batters"), yaxis = list(title="(R) Runs scored"),
           legend = list(orientation = 'h'))

Here the color model line represents our regression model estimated using matrix algebra.

The power of matrix algebra becomes more apparent when we use multiple variables. We can add the variable HR to the model.

library(reshape2)
X <- cbind(rep(1, n=nrow(dat)), dat$BB, dat$HR)
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
##             [,1]
## [1,] 287.7226756
## [2,]   0.3897178
## [3,]   1.5220448
# #install.packages("scatterplot3d")
# library(scatterplot3d)
# myScatter3D <- scatterplot3d(dat$BB, dat$HR, dat$R)
# 
# fit = lm(dat$R ~ dat$BB + dat$HR, data = dat)
# # Plot the linear model
# # get the BB & HR ranges summary(dat$BB); summary(dat$HR)
# cf = fit$coefficients
# pltx = seq(344, 775,length.out = 100)
# plty = seq(11,264,length.out = 100)
# pltz = cf[1] + cf[2]*pltx + cf[3]*plty
# #Add the line to the plot
# myScatter3D$points3d(pltx,plty,pltz, type = "l", col = "red", lwd=3)

# # interactive *rgl* 3D plot
# library(rgl)
# fit <- lm(dat$R ~ dat$BB + dat$HR)
# coefs <- coef(fit)
# a <- coefs["dat$BB"]
# b <- coefs["dat$HR"]
# c <- -1
# d <- coefs["(Intercept)"]
# open3d()
# plot3d(dat$BB,  dat$HR, dat$R, type = "s", col = "red", size = 1)
# planes3d(a, b, c, d, alpha = 0.5)
# # planes3d(b, a, -1.5, d, alpha = 0.5)
# # planes3d draws planes using the parametrization a*x + b*y + c*z + d = 0.
# # Multiple planes may be specified by giving multiple values for the normal
# # vector (a, b, c) and the offset parameter d
# 
# pca1 <- prcomp(as.matrix(cbind(dat$BB,  dat$HR, dat$R)), center = T); summary(pca1)
# 
# # Given two vectors PCA1 and PCA2, the cross product V = PCA1 x PCA2 
# # is orthogonal to both A and to B, and a normal vector to the 
# # plane containing PCA1 and PCA2
# # If PCA1 = (a,b,c) and PCA2 = (d, e, f), then the cross product is
# # PCA1 x PCA2 =  (bf - ce, cd - af, ae - bd)
# # PCA1 = pca1$rotation[,1] and PCAS2=pca1$rotation[,2]
# # https://en.wikipedia.org/wiki/Cross_product#Names
# # prcomp$rotation contains the matrix of variable loadings, 
# # i.e., a matrix whose columns contain the eigenvectors
# #normVec = c(pca1$rotation[,1][2]*pca1$rotation[,2][3]-
# #              pca1$rotation[,1][3]*pca1$rotation[,2][2],
# #            pca1$rotation[,1][3]*pca1$rotation[,2][1]-
# #              pca1$rotation[,1][1]*pca1$rotation[,2][3],
# #            pca1$rotation[,1][1]*pca1$rotation[,2][2]-
# #              pca1$rotation[,1][2]*pca1$rotation[,2][1]
# #            )
# normVec = c(pca1$rotation[2,1]*pca1$rotation[3,2]-
#                  pca1$rotation[3,1]*pca1$rotation[2,2],
#              pca1$rotation[3,1]*pca1$rotation[1,2]-
#                  pca1$rotation[1,1]*pca1$rotation[3,2],
#              pca1$rotation[1,1]*pca1$rotation[2,2]-
#                  pca1$rotation[2,1]*pca1$rotation[1,2]
# )
# 
# # Plot the PCA Plane
# plot3d(dat$BB,  dat$HR, dat$R, type = "s", col = "red", size = 1)
# planes3d(normVec[1], normVec[2], normVec[3], 90, alpha = 0.5)
# myScatter3D <- scatterplot3d(dat$BB, dat$HR, dat$R)

dat$name <- Teams[Teams$G==162&Teams$yearID<2002, "name"]

fit = lm(dat$R ~ dat$BB + dat$HR, data = dat)
# Plot the linear model
# get the BB & HR ranges summary(dat$BB); summary(dat$HR)
cf = fit$coefficients
pltx = seq(344, 775,length.out = length(dat$BB))
plty = seq(11,264,length.out = length(dat$BB))
pltz = cf[1] + cf[2]*pltx + cf[3]*plty

# Plot Scatter and add the LM line to the plot
plot_ly() %>%
  add_trace(x = ~pltx, y = ~plty, z = ~pltz, type="scatter3d", mode="lines",
        line = list(color = "red", width = 4), name="lm(R ~ BB + HR") %>% 
  add_markers(x = ~dat$BB, y = ~dat$HR, z = ~dat$R, color = ~dat$name, mode="markers") %>% 
  layout(scene = list(xaxis = list(title = '(BB) Walks by batters'),
                        yaxis = list(title = '(HR) Home runs by batters'),
                        zaxis = list(title = '(R) Runs scored')))
# Plot Scatter and add the LM PLANE to the plot
lm <- lm(R ~ 0 + HR + BB, data = dat)

#Setup Axis
axis_x <- seq(min(dat$HR), max(dat$HR), length.out=100)
axis_y <- seq(min(dat$BB), max(dat$BB), length.out=100)

#Sample points
lm_surface <- expand.grid(HR = axis_x, BB = axis_y, KEEP.OUT.ATTRS = F)
lm_surface$R <- predict.lm(lm, newdata = lm_surface)
lm_surface <- acast(lm_surface, HR ~ BB, value.var = "R") #`R`~ 0 + HR + BB

plot_ly(dat, x = ~HR, y = ~BB, z = ~R,
        text = ~name, type = "scatter3d", mode = "markers", color = ~dat$name) %>%
  add_trace(x = ~axis_x, y = ~axis_y, z = ~lm_surface, type="surface", color="gray", opacity=0.3) %>%
  layout(title="3D Plane Regression (R ~ BB + HR); Color=BB Team", showlegend = F,
         xaxis = list(title = '(BB) Walks by batters'),
         yaxis = list(title = '(HR) Home runs by batters'),
         zaxis = list(title = '(R) Runs scored')) %>% 
  hide_colorbar()

3.1 Sample covariance matrix

We can also express the covariance matrix for our features using matrix operation. Suppose

\[X_{N\times K}=\left(\begin{array}{cccc} X_{1, 1}&\cdots&X_{1, K}\\ X_{2, 1}&\cdots&X_{2, K}\\ \vdots&\ddots&\vdots\\ X_{N, 1}&\cdots&X_{N, K} \end{array}\right) =[X_1 \ X_2\ \cdots\ X_K]. \]

Then the \(K\times K\) (square) covariance matrix is: \[\Sigma_{K\times K} = (\Sigma_{i, j}),\] where \(\Sigma_{i, j}=Cov(X_i, X_j)=E\left( (X_i-\mu_i)(X_j-\mu_j)\right)\), \(1\leq i, j, \leq K\). Note that earlier, we denoted the number of variables by \(p\), whereas here we use \(K\); this is to avoid potential confusion with probabilities, \(p\).

The sample covariance matrix is: \[\Sigma_{i, j}=\frac{1}{N-1}\sum_{m=1}^{N}\left( x_{m, i}-\bar{x}_i \right) \left( x_{m, j}-\bar{x}_j \right) , \]

where \[\bar{x}_{i}=\frac{1}{N}\sum_{m=1}^{N}x_{m, i}, \quad i=1, \cdots, K .\]

In general, \[\Sigma=\frac{1}{n-1}(X-\bar{X})^t(X-\bar{X}).\]

Suppose that we want to get the sample covariance matrix of the following \(5\times 3\) feature matrix \(x\).

x <- matrix(c(4.0, 4.2, 3.9, 4.3, 4.1, 2.0, 2.1, 2.0, 2.1, 2.2, 0.60, 0.59, 0.58, 0.62, 0.63), ncol=3)
x
##      [,1] [,2] [,3]
## [1,]  4.0  2.0 0.60
## [2,]  4.2  2.1 0.59
## [3,]  3.9  2.0 0.58
## [4,]  4.3  2.1 0.62
## [5,]  4.1  2.2 0.63

Notice that this matrix represents the design matrix of \(3\) features and \(5\) observations. Let’s compute the column means first.

vec2 <- matrix(c(1/5, 1/5, 1/5, 1/5, 1/5), ncol=5)
#column means
x.bar <- vec2 %*% x
x.bar
##      [,1] [,2]  [,3]
## [1,]  4.1 2.08 0.604
x.bar <- matrix(rep(x.bar, each=5), nrow=5)
S <- 1/4*t(x-x.bar) %*% (x-x.bar)
S
##         [,1]    [,2]    [,3]
## [1,] 0.02500 0.00750 0.00175
## [2,] 0.00750 0.00700 0.00135
## [3,] 0.00175 0.00135 0.00043

In the covariance matrix, \(S[i, i]\) is the variance of the i-th feature and \(S[i, j]\) is the covariance of i-th and j-th features.

Compare this to the automated calculation of the variance-covariance matrix.

autoCov <- cov(x)
autoCov
##         [,1]    [,2]    [,3]
## [1,] 0.02500 0.00750 0.00175
## [2,] 0.00750 0.00700 0.00135
## [3,] 0.00175 0.00135 0.00043

3.2 Linear multivariate linear regression modeling

Later, in Chapter 5 (Supervised classification), we will cover some classification methods that use this mathematical framework for model-based and model-free ML/AI prediction. However, let’s start with linear model-based statistical methods providing forecasting and classification functionality. Specifically, we will (1) demonstrate the predictive power of multivariate linear regression, (2) show the foundation of regression trees and model trees, and (3) examine two complementary case-studies (Baseball Players and Heart Attack).

Regression represents a model of a relationship between a dependent variable (value to be predicted) and a group of independent variables (predictors or features). We assume the relationships between the outcome dependent variable and the independent variables is linear.

3.3 Simple linear regression

Earlier, we discussed the straightforward case of regression as simple linear regression, which involves a single predictor \[y=a+bx.\]

In this slope-intercept formula, a is the model intercept and b is the model slope. Thus, simple linear regression may be expressed as a bivariate equation. If we know a and b, for any given x we can estimate, or predict, y via the regression formula. If we plot x against y in a 2D coordinate system, where the two variables are exactly linearly related, the results will be a straight line.

However, this is the ideal case. Bivariate scatterplots using real world data may show patterns that are not necessarily precisely linear, see Chapter 2. Let’s look at a bivariate scatterplot and try to fit a simple linear regression line using two variables, e.g., hospital charges or CHARGES as a dependent variable, and length of stay in the hospital or LOS as an independent predictor. The data is available in the DSPA Data folder as CaseStudy12_AdultsHeartAttack_Data. We can remove the pair of observations with missing values using the command heart_attack<-heart_attack[complete.cases(heart_attack), ].

library(plotly)
heart_attack <- read.csv("https://umich.instructure.com/files/1644953/download?download_frd=1", stringsAsFactors = F)
heart_attack$CHARGES <- as.numeric(heart_attack$CHARGES)
heart_attack <- heart_attack[complete.cases(heart_attack), ]

fit1 <- lm(CHARGES ~ LOS, data=heart_attack)
# par(cex=.8)
# plot(heart_attack$LOS, heart_attack$CHARGES, xlab="LOS", ylab = "CHARGES")
# abline(fit1, lwd=2, col="red")

plot_ly(heart_attack, x = ~LOS, y = ~CHARGES, type = 'scatter', mode = "markers", name="Data") %>%
    add_trace(x=~mean(LOS), y=~mean(CHARGES), type="scatter", mode="markers",
            name="(mean(LOS), mean(Charge))", marker=list(size=20, color='blue', line=list(color='yellow', width=2))) %>%
    add_lines(x = ~LOS, y = fit1$fitted.values, mode = "lines", name="Linear Model") %>%