| SOCR ≫ | DSPA ≫ | DSPA2 Topics ≫ |
The time-varying (longitudinal) characteristics of large information flows represent a special case of the complexity, dynamic and multi-scale nature of big biomedical data that we discussed in the DSPA Motivation section. Previously, in Chapter 2, we saw space-time (4D) functional magnetic resonance imaging (fMRI) data, and in Chapter 10 we discussed streaming data, which also has a natural temporal dimension.
In this Chapter, we will expand our predictive data analytic strategies specifically for analyzing longitudinal data. We will interrogate datasets that track homologous information, across subjects, units or locations, over a period of time. In the first part of the Chapter, we will present classical approaches including time series analysis, forecasting using autoregressive integrated moving average (ARIMA) models, structural equation models (SEM), and longitudinal data analysis via linear mixed models. In the second part, we will discuss recent AI methods for time-series analysis and forecasting including Recurrent Neural Networks (RNN) and long short-term memory (LSTM) networks.
The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are metrics that quantify the fit of different (typically) regression models. The AIC is calculated by the formula \(AIC = 2k – 2\ln(L)\), where \(k\) is the number of parameters estimated in the model, and \(\ln(L)\) is the log-likelihood of the model. In essence, the AIC quantifies how likely the model is, given the observed data (in terms of the likelihood value). Suppose we have estimated several different regression models, the optimal model corresponds to the smallest AIC value, i.e., the model with the lowest AIC yields the best fit (highest likelihood, relative to the number of model parameters). The sign of the AIC does not matter, the lower the AIC value, the better the model fit.
Similarly, the BIC is calculated by the formula \(BIC = k\ln(n) – 2\ln(L)\equiv \ln\left( \frac{n^k}{L^2}\right )\), where \(n\) is the number of data points, \(k\) is the number of parameters estimated in the model, and \(\ln(L)\) is the log-likelihood of the model. Lower BIC values correspond to higher likelihoods (\(2\ln(L)\)) and lower penalty terms (\(k \ln(n)\)), i.e., better models.
The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are metrics that quantify the fit of different (typically) regression models.
The AIC is calculated by the formula \(AIC = 2k – 2\ln(L)\), where \(k\) is the number of parameters estimated in the model, and \(\ln(L)\) is the log-likelihood of the model. In essence, the AIC quantifies how likely the model is, given the observed data (in terms of the likelihood value). Suppose we have estimated several different regression models, the optimal model corresponds to the smallest AIC value, i.e., the model with the lowest AIC yields the best fit (highest likelihood, relative to the number of model parameters). The sign of the AIC does not matter, the lower the AIC value, the better the model fit.
Similarly, the BIC is calculated by the formula \(BIC = k\times \ln(n) – 2\ln(L)\), where \(n\) is the number of data points, \(k\) is the number of parameters estimated in the model, and \(\ln(L)\) is the log-likelihood of the model. Lower BIC values correspond to higher likelihoods (\(2\ln(L)\)) and lower penalty terms (\(k\times \ln(n)\)); i.e., better models.
Time series analysis relies on models like ARIMA (Autoregressive integrated moving average) that utilize past longitudinal information to predict near future outcomes. Time series data tend to track univariate, sometimes multivariate, processes over a continuous time interval. The stock market, e.g., daily closing value of the Dow Jones Industrial Average index, or electroencephalography (EEG) data provide examples of such longitudinal datasets (time series).
The basic concepts in time series analysis include:
We will use the Beijing air quality PM2.5 dataset as an example to demonstrate the analysis process. This dataset measures air pollutants - PM2.5 particles in micrograms per cubic meter for 8 years (2008-2016). It measures the hourly average of the number of particles that are of size 2.5 microns (PM2.5) once per hour in Beijing, China.
Let’s first import the dataset into R.
beijing.pm25<-read.csv("https://umich.instructure.com/files/1823138/download?download_frd=1")
summary(beijing.pm25)## Index Site Parameter Date..LST.
## Min. : 1 Length:69335 Length:69335 Length:69335
## 1st Qu.:17335 Class :character Class :character Class :character
## Median :34668 Mode :character Mode :character Mode :character
## Mean :34668
## 3rd Qu.:52002
## Max. :69335
## Year Month Day Hour
## Min. :2008 Min. : 1.000 Min. : 1.00 Min. : 0.0
## 1st Qu.:2010 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 5.5
## Median :2012 Median : 6.000 Median :16.00 Median :11.0
## Mean :2012 Mean : 6.407 Mean :15.73 Mean :11.5
## 3rd Qu.:2014 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:17.5
## Max. :2016 Max. :12.000 Max. :31.00 Max. :23.0
## Value Duration QC.Name
## Min. :-999.00 Length:69335 Length:69335
## 1st Qu.: 22.00 Class :character Class :character
## Median : 63.00 Mode :character Mode :character
## Mean : 24.99
## 3rd Qu.: 125.00
## Max. : 994.00
The Value column records PM2.5 AQI (Air Quality Index)
for 8 years. We observe that there are some missing data in the
Value column. By looking at the QC.Name
column, we have about 6.5% (~4,408) of the observations with missing
values. One way of solving the missingness is to replace them by the
corresponding variable mean. Of course, in practical applications, some
of the statistical imputation methods we discussed earlier should be
used.
beijing.pm25[beijing.pm25$Value==-999, 9]<-NA
beijing.pm25[is.na(beijing.pm25$Value), 9]<-floor(mean(beijing.pm25$Value, na.rm = T))Here we first reassign the missing values into NA
labels. Then we replace all NA labels with the
mean computed using all non-missing observations. Note that the
floor() function casts the arithmetic averages as integer
numbers, which is needed as AQI values are expected to be whole
numbers.
Now, let’s observe the trend of hourly average PM2.5 across one day. You can see a significant pattern: The PM2.5 level peaks in the afternoons and is the lowest in the early mornings and exhibits approximate periodic boundary conditions (these patterns oscillate daily).
library(plotly)
library(ggplot2)
id = 1:nrow(beijing.pm25)
mat = matrix(0,nrow=24,ncol=3)
stat = function(x){
c(mean(beijing.pm25[iid,"Value"]),quantile(beijing.pm25[iid,"Value"],c(0.2,0.8)))
}
for (i in 1:24){
iid = which(id%%24==i-1)
mat[i,] = stat(iid)
}
mat <- as.data.frame(mat)
colnames(mat) <- c("mean","20%","80%")
# mat$time = c(1:24)
mat$time <- readr::parse_datetime(beijing.pm25$Date..LST., "%m/%d/%Y %h:%M")[1:24]
require(reshape2)
dt <- melt(mat,id="time")
# colnames(dt)
# ggplot(data = dt,mapping = aes(x=time,y=value,color = variable))+geom_line()+
# scale_x_continuous(breaks = 0:23)+ggtitle("Beijing hour average PM2.5 from 2008-2016")
plot_ly(data = dt, x=~time, y=~value, color = ~variable, type="scatter", mode="markers+lines") %>%
layout(title="Beijing daily 24-hour average PM2.5 (2008-2016)")Are there any daily or monthly trends? We can start the data interrogation by building an ARIMA model and examining detailed patterns in the data.
Step 1 - Plot time series
To begin with, we can visualize the overall trend by plotting PM2.5
values against time. This can be achieved using the plyr
package.
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:plotly':
##
## arrange, mutate, rename, summarise
ts <- ts(beijing.pm25$Value, start=1, end=69335, frequency=1)
# ts.plot(ts)
dateTime <- readr::parse_datetime(beijing.pm25$Date..LST., "%m/%d/%Y %h:%M")
tsValue <- ifelse(beijing.pm25$Value>=0, beijing.pm25$Value, -100)
df <- as.data.frame(cbind(dateTime=dateTime, tsValue=tsValue))
plot_ly(x=~dateTime, y=~tsValue, type="scatter", mode="lines") %>%
layout(
title = "Time Series - Beijing daily 24-hour average PM2.5 (2008-2016)",
xaxis = list(title = "date"), yaxis=list(title="PM2.5 Measurement"))