SOCR ≫ DSPA ≫ DSPA2 Topics ≫

1 Big Longitudinal Data Analysis

The time-varying (longitudinal) characteristics of large information flows represent a special case of the complexity, dynamic and multi-scale nature of big biomedical data that we discussed in the DSPA Motivation section. Previously, in Chapter 2, we saw space-time (4D) functional magnetic resonance imaging (fMRI) data, and in Chapter 10 we discussed streaming data, which also has a natural temporal dimension.

In this Chapter, we will expand our predictive data analytic strategies specifically for analyzing longitudinal data. We will interrogate datasets that track homologous information, across subjects, units or locations, over a period of time. In the first part of the Chapter, we will present classical approaches including time series analysis, forecasting using autoregressive integrated moving average (ARIMA) models, structural equation models (SEM), and longitudinal data analysis via linear mixed models. In the second part, we will discuss recent AI methods for time-series analysis and forecasting including Recurrent Neural Networks (RNN) and long short-term memory (LSTM) networks.

1.1 Classical Time-Series Analytic Approaches

1.1.1 Information theoretic model evaluation criteria

The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are metrics that quantify the fit of different (typically) regression models. The AIC is calculated by the formula \(AIC = 2k – 2\ln(L)\), where \(k\) is the number of parameters estimated in the model, and \(\ln(L)\) is the log-likelihood of the model. In essence, the AIC quantifies how likely the model is, given the observed data (in terms of the likelihood value). Suppose we have estimated several different regression models, the optimal model corresponds to the smallest AIC value, i.e., the model with the lowest AIC yields the best fit (highest likelihood, relative to the number of model parameters). The sign of the AIC does not matter, the lower the AIC value, the better the model fit.

Similarly, the BIC is calculated by the formula \(BIC = k\ln(n) – 2\ln(L)\equiv \ln\left( \frac{n^k}{L^2}\right )\), where \(n\) is the number of data points, \(k\) is the number of parameters estimated in the model, and \(\ln(L)\) is the log-likelihood of the model. Lower BIC values correspond to higher likelihoods (\(2\ln(L)\)) and lower penalty terms (\(k \ln(n)\)), i.e., better models.

1.1.2 Information theoretic model evaluation criteria

The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are metrics that quantify the fit of different (typically) regression models.

The AIC is calculated by the formula \(AIC = 2k – 2\ln(L)\), where \(k\) is the number of parameters estimated in the model, and \(\ln(L)\) is the log-likelihood of the model. In essence, the AIC quantifies how likely the model is, given the observed data (in terms of the likelihood value). Suppose we have estimated several different regression models, the optimal model corresponds to the smallest AIC value, i.e., the model with the lowest AIC yields the best fit (highest likelihood, relative to the number of model parameters). The sign of the AIC does not matter, the lower the AIC value, the better the model fit.

Similarly, the BIC is calculated by the formula \(BIC = k\times \ln(n) – 2\ln(L)\), where \(n\) is the number of data points, \(k\) is the number of parameters estimated in the model, and \(\ln(L)\) is the log-likelihood of the model. Lower BIC values correspond to higher likelihoods (\(2\ln(L)\)) and lower penalty terms (\(k\times \ln(n)\)); i.e., better models.

1.1.3 Time series analysis

Time series analysis relies on models like ARIMA (Autoregressive integrated moving average) that utilize past longitudinal information to predict near future outcomes. Time series data tend to track univariate, sometimes multivariate, processes over a continuous time interval. The stock market, e.g., daily closing value of the Dow Jones Industrial Average index, or electroencephalography (EEG) data provide examples of such longitudinal datasets (time series).

The basic concepts in time series analysis include:

  • The characteristics of (second-order) stationary time series (e.g., the first two moments are stable over time) do not depend on the time at which the series process is observed.
  • Differencing – a transformation applied to time-series data to make it stationary. Differences between consecutive time-observations may be computed by \(\displaystyle y_{t}'=y_{t}-y_{t-1}\). Differencing removes the level changes in the time series, eliminates trend, reduces seasonality, and stabilizes the mean of the time series. Differencing the time series repeatedly may yield a stationary time series. For example, a second order differencing: \[\displaystyle {\begin{aligned}y_{t}^{''}&=y_{t}'-y_{t-1}'\\&=(y_{t}-y_{t-1})-(y_{t-1}-y_{t-2})\\&=y_{t}-2y_{t-1}+y_{t-2}\end{aligned}}.\]
  • Seasonal differencing is computed as a difference between one observation and its corresponding observation in the previous epoch, or season (e.g., annually, there are \(m=4\) seasons in a cycle), like in this example: \[\displaystyle y_{t}'''=y_{t}-y_{t-m}\quad {\text{where }}m={\text{number of seasons}}.\] The differenced data may then be used to estimate an ARMA model.

We will use the Beijing air quality PM2.5 dataset as an example to demonstrate the analysis process. This dataset measures air pollutants - PM2.5 particles in micrograms per cubic meter for 8 years (2008-2016). It measures the hourly average of the number of particles that are of size 2.5 microns (PM2.5) once per hour in Beijing, China.

Let’s first import the dataset into R.

beijing.pm25<-read.csv("https://umich.instructure.com/files/1823138/download?download_frd=1")
summary(beijing.pm25)
##      Index           Site            Parameter          Date..LST.       
##  Min.   :    1   Length:69335       Length:69335       Length:69335      
##  1st Qu.:17335   Class :character   Class :character   Class :character  
##  Median :34668   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :34668                                                           
##  3rd Qu.:52002                                                           
##  Max.   :69335                                                           
##       Year          Month             Day             Hour     
##  Min.   :2008   Min.   : 1.000   Min.   : 1.00   Min.   : 0.0  
##  1st Qu.:2010   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 5.5  
##  Median :2012   Median : 6.000   Median :16.00   Median :11.0  
##  Mean   :2012   Mean   : 6.407   Mean   :15.73   Mean   :11.5  
##  3rd Qu.:2014   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:17.5  
##  Max.   :2016   Max.   :12.000   Max.   :31.00   Max.   :23.0  
##      Value           Duration           QC.Name         
##  Min.   :-999.00   Length:69335       Length:69335      
##  1st Qu.:  22.00   Class :character   Class :character  
##  Median :  63.00   Mode  :character   Mode  :character  
##  Mean   :  24.99                                        
##  3rd Qu.: 125.00                                        
##  Max.   : 994.00

The Value column records PM2.5 AQI (Air Quality Index) for 8 years. We observe that there are some missing data in the Value column. By looking at the QC.Name column, we have about 6.5% (~4,408) of the observations with missing values. One way of solving the missingness is to replace them by the corresponding variable mean. Of course, in practical applications, some of the statistical imputation methods we discussed earlier should be used.

beijing.pm25[beijing.pm25$Value==-999, 9]<-NA
beijing.pm25[is.na(beijing.pm25$Value), 9]<-floor(mean(beijing.pm25$Value, na.rm = T))

Here we first reassign the missing values into NA labels. Then we replace all NA labels with the mean computed using all non-missing observations. Note that the floor() function casts the arithmetic averages as integer numbers, which is needed as AQI values are expected to be whole numbers.

Now, let’s observe the trend of hourly average PM2.5 across one day. You can see a significant pattern: The PM2.5 level peaks in the afternoons and is the lowest in the early mornings and exhibits approximate periodic boundary conditions (these patterns oscillate daily).

library(plotly)
library(ggplot2)
id = 1:nrow(beijing.pm25)
mat = matrix(0,nrow=24,ncol=3)
stat = function(x){
  c(mean(beijing.pm25[iid,"Value"]),quantile(beijing.pm25[iid,"Value"],c(0.2,0.8)))
}
for (i in 1:24){
  iid = which(id%%24==i-1)
  mat[i,] = stat(iid)
}

mat <- as.data.frame(mat)
colnames(mat) <- c("mean","20%","80%")
# mat$time = c(1:24)
mat$time <- readr::parse_datetime(beijing.pm25$Date..LST., "%m/%d/%Y %h:%M")[1:24]

require(reshape2)
dt <- melt(mat,id="time")
# colnames(dt)
# ggplot(data = dt,mapping = aes(x=time,y=value,color = variable))+geom_line()+
#   scale_x_continuous(breaks = 0:23)+ggtitle("Beijing hour average PM2.5 from 2008-2016")

plot_ly(data = dt, x=~time, y=~value, color = ~variable, type="scatter", mode="markers+lines") %>%
  layout(title="Beijing daily 24-hour average PM2.5 (2008-2016)")

Are there any daily or monthly trends? We can start the data interrogation by building an ARIMA model and examining detailed patterns in the data.

Step 1 - Plot time series

To begin with, we can visualize the overall trend by plotting PM2.5 values against time. This can be achieved using the plyr package.

library(plyr)
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:plotly':
## 
##     arrange, mutate, rename, summarise
ts <- ts(beijing.pm25$Value, start=1, end=69335, frequency=1)
# ts.plot(ts)

dateTime <- readr::parse_datetime(beijing.pm25$Date..LST., "%m/%d/%Y %h:%M")
tsValue <- ifelse(beijing.pm25$Value>=0, beijing.pm25$Value, -100)
df <- as.data.frame(cbind(dateTime=dateTime, tsValue=tsValue))

plot_ly(x=~dateTime, y=~tsValue, type="scatter", mode="lines") %>%
  layout(
    title = "Time Series - Beijing daily 24-hour average PM2.5 (2008-2016)",
    xaxis = list(title = "date"), yaxis=list(title="PM2.5 Measurement"))