
1 Imaging Data

Review the 3D/4D MRI imaging data discussion in Chapter 2. Extract the time courses of several time series at different 3D spatial locations, some nearby and some farther apart (distant voxels). Then apply time-series analyses, report your findings, and determine whether nearby or farther-apart voxels tend to be more correlated (see the correlation sketch following the example code below).

Example of extracting time series from 4D fMRI data

# See examples here: https://cran.r-project.org/web/packages/oro.nifti/vignettes/nifti.pdf

library(oro.nifti)
fMRIURL <- "http://socr.umich.edu/HTML5/BrainViewer/data/fMRI_FilteredData_4D.nii.gz"
fMRIFile <- file.path(tempdir(), "fMRI_FilteredData_4D.nii.gz")
download.file(fMRIURL, dest=fMRIFile, quiet=TRUE)
(fMRIVolume <- readNIfTI(fMRIFile, reorient=FALSE))
# dimensions: 64 x 64 x 21 x 180; voxel size: 4mm x 4mm x 6mm; TR = 3 sec

fMRIVolDims <- dim(fMRIVolume); fMRIVolDims
time_dim <- fMRIVolDims[4]; time_dim

hist(fMRIVolume)

# Examine the time course of a specific 3D voxel, e.g., the one at (x=30, y=30, z=10):
plot(fMRIVolume[30, 30, 10, ], type='l', main="Time Series of 3D Voxel (x=30, y=30, z=10)", col="blue")

x1 <- c(1:180)
y1 <- loess(fMRIVolume[30, 30, 10, ] ~ x1, family = "gaussian")
lines(x1, smooth(fMRIVolume[30, 30, 10, ]), col = "red", lwd = 2)
lines(ksmooth(x1, fMRIVolume[30, 30, 10, ], kernel = "normal", bandwidth = 5), col = "green", lwd = 3)
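To address the nearby-vs-distant voxel question, one option is to extract several voxel time courses and compare their pairwise correlations. The following is a minimal sketch, assuming the fMRIVolume object loaded above; the specific voxel coordinates are arbitrary, illustrative choices.

# extract time courses: a reference voxel, an immediate spatial neighbor, and a distant voxel
ts_ref     <- fMRIVolume[30, 30, 10, ]
ts_near    <- fMRIVolume[31, 30, 10, ]
ts_distant <- fMRIVolume[10, 50, 15, ]

# pairwise (Pearson) correlations of the raw time courses
cor(ts_ref, ts_near); cor(ts_ref, ts_distant)

# cross-correlation functions to inspect lagged dependence
ccf(ts_ref, ts_near, lag.max = 20, main = "CCF: reference vs. nearby voxel")
ccf(ts_ref, ts_distant, lag.max = 20, main = "CCF: reference vs. distant voxel")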

2 Tabular Data

Use the Google Web-Search Trends and Stock Market data to:

  • Plot the time series for the variable Job
  • Apply TTR to smooth the original series by month
  • Determine the differencing parameter
  • Determine the auto-regressive (AR) and moving-average (MA) parameters
  • Build an ARIMA model and forecast the time course over the next 365 days (for 2012); see the sketch after this list.
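The following is one possible workflow sketch. The data frame google_trends and its Job column are placeholders for however you load the Google Trends/stock market data; the TTR and forecast packages are assumed to be installed.

library(TTR)        # simple moving-average smoothing
library(forecast)   # ndiffs(), auto.arima(), forecast()

# hypothetical daily series of the "Job" variable observed through the end of 2011
job_ts <- ts(google_trends$Job, frequency = 365)
plot(job_ts, main = "Google Trends: Job", ylab = "Search interest")

# smooth with a ~30-day (monthly) simple moving average
lines(as.numeric(time(job_ts)), SMA(as.numeric(job_ts), n = 30), col = "red", lwd = 2)

# differencing order (d) suggested by unit-root tests
d <- ndiffs(job_ts); d

# ACF/PACF of the differenced series guide the MA (q) and AR (p) orders
acf(diff(job_ts, differences = max(d, 1)))
pacf(diff(job_ts, differences = max(d, 1)))

# fit an ARIMA model (auto.arima() searches over p, d, q) and forecast 365 days ahead (2012)
fit_arima <- auto.arima(job_ts)
summary(fit_arima)
plot(forecast(fit_arima, h = 365))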

3 Latent Variables Model

Use the Handwritten English Letters data to:

  • Explore the data and evaluate the correlations between the covariates
  • Justify the use of a latent variable model
  • Apply appropriate data conversion and scaling
  • Fit a structural equation model (SEM) to the data with an appropriately specified latent variable
  • Summarize and interpret the output
  • Using the model identified above, fit GEE and GLMM models with the latent variable as the response and compare their AICs; see the sketch after this list.
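A minimal sketch of the SEM step with the lavaan package is shown below; the data frame letters_df and indicator names x1-x4 are hypothetical placeholders for whichever handwritten-letter covariates you select, and the single-factor structure is only one possible specification.

library(lavaan)

# inspect covariate correlations to motivate a latent factor (placeholder column names)
round(cor(letters_df[, c("x1", "x2", "x3", "x4")]), 2)

# scale the indicators before fitting
letters_std <- as.data.frame(scale(letters_df[, c("x1", "x2", "x3", "x4")]))

# single latent factor F measured by four standardized indicators
sem_model <- '
  F =~ x1 + x2 + x3 + x4
'
fit_sem <- sem(sem_model, data = letters_std)
summary(fit_sem, fit.measures = TRUE, standardized = TRUE)

# factor scores can then serve as the response when fitting and comparing the GEE/GLMM models
letters_std$F_score <- as.numeric(lavPredict(fit_sem))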

4 Complete the following two end-to-end examples

4.1 Example 1: Temperature Time-series LSTM Modeling

Let's use the 2009-2017 Climate Data from the Max Planck Institute in Jena, Germany, to demonstrate time-series analysis via RNNs. You can find more details about this study here. This sequence data represents a time series recorded at the Weather Station of the Max Planck Institute for Biogeochemistry in Jena, Germany. The data include a date-time stamp and 14 climate measurements/features (e.g., atmospheric pressure, temperature, and humidity), recorded every 10 minutes over 9 years. Suppose we choose a prediction window of 12 months (1 year), a batch size that evenly divides the number of training and testing observations (32 in the code below), time steps = 1 (specifying one lag), and epochs = 100.

# Model inputs
lag_setting  <- 12 # forecast lag (12 periods)
batch_size   <- 32 # must evenly divide the train/test sample sizes and match the batch size used when fitting
train_length <- 10000

# number of time steps (lags) per input sample
tsteps       <- 1
epochs       <- 100

# number of features (predictors)
n_features <- 1
library("dplyr")
library("recipes")
library("tibble")
#install.packages("tibbletime")
library("tibbletime")

df_train <- as.data.frame(cbind(df_Temp_Time$t[1:10000], df_train.std))
# head(df_train)
df_test <- as.data.frame(cbind(df_Temp_Time$t[1:10000], df_test.std))
colnames(df_train) <- c("t", "value"); colnames(df_test) <- c("t", "value")  # keep the time column, name the measurement "value"
colnames(df_Temp_Time) <- c("Z", "index")

df <- bind_rows(
    df_train %>% add_column(key = "training"),
    df_test %>% add_column(key = "testing")
) # %>% 
    # as_tbl_time(index = as.Date(df_Temp_Time$index[1:10000]))
# df


df1 <- as.data.frame(cbind(as.Date(df_Temp_Time$index[1:20000]),
            as.double(df$value), df$key))
colnames(df1) <- c("index", "value", "class")
head(df1)
dates <- as.Date(df_Temp_Time$index[1:20000])
# head(as_tbl_time(df, dates))

df2 <- tibble::tibble(
  index  = as.POSIXct(dates),
  value = as.numeric(as.character(df1$value)),
  class = df1$class
); head(df2)
df3 <- as_tbl_time(df2, index)
head(df3); tail(df3); str(df3$value); head(df3$value)

rec_obj <- recipe(value ~ ., df3) %>%
    step_sqrt(value) %>%
    step_center(value) %>%
    step_scale(value) %>%
    prep()
    
tail(rec_obj$template)

# define time, value and labels for 10,000 timepoints
climate_time_10000 <- climate_time[1:10000]
trainLabels <- rep("training", 10000); testLabels <- rep("testing", 10000)
TrainTestLabels <- rbind(trainLabels, testLabels)
TrainTestValues <- rbind(df_train.std, df_test.std)
tibble(date = as.Date(as.vector(rbind(climate_time_10000, climate_time_10000))), 
       value = TrainTestValues[ , 1], key=as.vector(TrainTestLabels))

# Training Set
lag_train_tbl <- rec_obj$template %>%
    mutate(value_lag = lag(value, n = lag_setting)) %>%
    filter(!is.na(value_lag)) %>%
    filter(class == "training") %>%
    tail(train_length)

lag_train_tbl$value[is.na(lag_train_tbl$value)] <- 0   # replace any missing values with 0

x_train_vec <- lag_train_tbl$value_lag
x_train_arr <- array(data = x_train_vec, dim = c(length(x_train_vec), 1, 1))

y_train_vec <- lag_train_tbl$value
y_train_arr <- array(data = y_train_vec, dim = c(length(y_train_vec), 1))

# Testing Set
lag_test_tbl <- rec_obj$template %>%
    mutate(value_lag = lag(value, n = lag_setting)) %>%
    filter(!is.na(value_lag)) %>%
    filter(class == "testing")

x_test_vec <- lag_test_tbl$value_lag
x_test_arr <- array(data = x_test_vec, dim = c(length(x_test_vec), 1, 1))

y_test_vec <- lag_test_tbl$value
y_test_arr <- array(data = y_test_vec, dim = c(length(y_test_vec), 1))
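As a quick sanity check (not part of the original code), it may help to confirm that the lagged arrays have the Keras-compatible dimensions (samples, time steps, features) before building the model.

dim(x_train_arr); dim(y_train_arr)   # expected: (samples, 1, 1) and (samples, 1)
dim(x_test_arr);  dim(y_test_arr)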
  • Build an LSTM model using a linear stack of layers (a sequential model); try the keras_model_sequential() function.
  • Start with two stacked LSTM layers (e.g., 10 units each; the example code below uses 50). LSTM layer 1 takes the required batch input shape, \([batch\ size, time\ steps, number\ of\ features]\), and returns the full 3D sequence of hidden states as a stateful LSTM (return_sequences = TRUE and stateful = TRUE). LSTM layer 2 takes the layer 1 output and returns a 2D shape (return_sequences = FALSE). Adding layer_dense(units = 1) provides a standard ending to the keras sequential model. Finally, we can use compile() with loss = "mae" and optimizer = "adam".
model.1 <- keras_model_sequential()

model.1 %>%
    layer_lstm(units            = 50, 
               # batch_input_shape = c(batch_size, tsteps, n_features),
               batch_input_shape = c(32, 1, 1),  # (batch size, time steps, features); must match the fitting batch size
               return_sequences = TRUE, 
               stateful         = TRUE) %>% 
    layer_lstm(units            = 50, 
               return_sequences = FALSE, 
               stateful         = TRUE) %>% 
    layer_dense(units = 1)

model.1 %>% 
    compile(loss = 'mae', optimizer = 'adam')

model.1
  • Next, we can fit the stateful LSTM model manually using a for loop, resetting the states between epochs. To preserve the sequence ordering, we specify shuffle = FALSE, which gives us direct control over resetting the states after each epoch via reset_states().
for (i in 1:epochs) {
    # restrict training samples to a multiple of the batch size (166*32 = 5312), as required by the stateful LSTM
    model.1 %>% fit(x          = x_train_arr[1:(166*32), ,], 
                    y          = y_train_arr[1:(166*32), ], 
                    batch_size = 32,
                    epochs     = 1, 
                    verbose    = 1, 
                    shuffle    = FALSE)
    
    model.1 %>% reset_states()
    cat("Epoch: ", i)
}

########################### RStudio Keras Documentation
# https://cran.rstudio.com/web/packages/keras/vignettes/sequential_model.html

# define a model along with the compilation step
model.2 <- keras_model_sequential() 
model.2 %>% 
  layer_dense(units = 32, input_shape = c(100)) %>% 
  layer_activation('relu') %>% 
  layer_dense(units = 10) %>% 
  layer_activation('softmax')

# the compile() function has appropriate arguments for a multi-class/binary classification problem
model.2 %>% compile(
  optimizer = 'rmsprop',
  # loss = 'categorical_crossentropy',
  # metrics = c('accuracy')
  loss = loss_binary_crossentropy,
  metrics = metric_binary_accuracy
)

# Compilation for a mean-squared-error regression problem (generic vignette example; assumes a `model` object is already defined)
model %>% compile(
  optimizer = optimizer_rmsprop(lr = 0.002),
  loss = 'mse'
)

# Train the Keras models on R matrices or higher dimensional arrays of input data and labels using 
# the fit() function, e.g., a single-input model with 2 classes (binary classification)

# Generate dummy data
data <- matrix(runif(1000*100), nrow = 1000, ncol = 100)
labels <- matrix(round(runif(1000, min = 0, max = 1)), nrow = 1000, ncol = 1)

# Train the model, iterating on the data in batches of 32 samples
model.2 %>% fit(data, labels, epochs=10, batch_size=32)

# constants
data_dim <- 16
timesteps <- 8
num_classes <- 10

# define and compile model
# expected input data shape: (batch_size, timesteps, data_dim)
# override the vignette constants for the climate example: 1 feature, 2 time steps, 2 classes
data_dim <- 1
timesteps <- 2
num_classes <- 2

model <- keras_model_sequential() 
model %>% 
  layer_lstm(units = 32, return_sequences = TRUE, input_shape = c(timesteps, data_dim)) %>% 
  layer_lstm(units = 32, return_sequences = TRUE) %>% 
  layer_lstm(units = 32) %>% # return a single vector dimension 32
  layer_dense(units = num_classes, activation = 'softmax') %>%  # one output unit per class
  compile(
    loss = 'categorical_crossentropy',
    optimizer = 'rmsprop',
    metrics = c('accuracy')
  )
  
# training data: dummy placeholders from the vignette ...
x_train <- array(runif(1000 * timesteps * data_dim), dim = c(1000, timesteps, data_dim))
y_train <- matrix(runif(1000 * num_classes), nrow = 1000, ncol = num_classes)
# ... overridden with the standardized climate values (5000 samples x 2 time steps x 1 feature);
# note: the character class labels must be converted to a numeric one-hot matrix before fitting
x_train <- array(rec_obj$template$value[1:10000], dim = c(5000, 2, 1))
y_train <- matrix(rec_obj$template$class[1:10000], nrow = 5000, ncol = 2)


# testing/validation data: dummy placeholders from the vignette ...
x_val <- array(runif(100 * timesteps * data_dim), dim = c(100, timesteps, data_dim))
y_val <- matrix(runif(100 * num_classes), nrow = 100, ncol = num_classes)

# ... overridden with all 20,000 standardized values (10,000 samples x 2 time steps x 1 feature)
x_val <- array(rec_obj$template$value, dim = c(10000, 2, 1))
y_val <- matrix(rec_obj$template$class, nrow = 10000, ncol = 2)

# train
model %>% fit( 
  x_train, y_train, batch_size = 64, epochs = 5, validation_data = list(x_val, y_val)
)
  • Once the LSTM model is estimated, we can use it to predict outcomes for the testing dataset, x_test_arr, using the predict() function. The predictions can then be inversely transformed into the original measurement units using the saved center and scale of the standardizing transformation (see the sketch below for one way to recover these from the prepped recipe).
  • Next, combine the predictions with the original data into a single column, using reduce() and a time_bind_rows() helper function.
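The retransformation below relies on center_history and scale_history, which are not defined earlier in this example. One assumed way to recover them from the prepped recipe (given the step_sqrt/step_center/step_scale order used above) is:

# assumed sketch: pull the centering mean and scaling SD saved by the prepped recipe
center_history <- rec_obj$steps[[2]]$means["value"]   # from step_center (second step)
scale_history  <- rec_obj$steps[[3]]$sds["value"]     # from step_scale (third step)
c(center_history, scale_history)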
# Make Predictions
pred_out <- model %>% 
    predict(x_test_arr, batch_size = batch_size) %>%
    .[,1] 

# Retransform values (undo scaling/centering, then square to invert step_sqrt)
pred_tbl <- tibble(
    index   = lag_test_tbl$index,
    value   = (pred_out * scale_history + center_history)^2
) 

# Combine actual data with predictions
# (df_trn and df_tst are assumed to be the training/testing tibbles with index and value columns defined earlier)
tbl_1 <- df_trn %>%
    add_column(key = "actual")

tbl_2 <- df_tst %>%
    add_column(key = "actual")

tbl_3 <- pred_tbl %>%
    add_column(key = "predict")

# Create time_bind_rows() to solve dplyr issue
time_bind_rows <- function(data_1, data_2, index) {
    index_expr <- enquo(index)
    bind_rows(data_1, data_2) %>%
        as_tbl_time(index = !! index_expr)
}

ret <- list(tbl_1, tbl_2, tbl_3) %>%
    reduce(time_bind_rows, index = index) %>%
    arrange(key, index) %>%
    mutate(key = as_factor(key))

ret
  • The final step involves evaluating the stateful LSTM model. The caret package or the yardstick::rmse() function may be appropriate to quantify the performance. Since the data are in long format, we can write a wrapper function, calc_rmse(), that transforms the data into the format expected by yardstick::rmse().
calc_rmse <- function(prediction_tbl) {
    
    rmse_calculation <- function(data) {
        data %>%
            spread(key = key, value = value) %>%
            select(-index) %>%
            filter(!is.na(predict)) %>%
            rename(
                truth    = actual,
                estimate = predict
            ) %>%
            rmse(truth, estimate)
    }
    
    safe_rmse <- possibly(rmse_calculation, otherwise = NA)
    
    safe_rmse(prediction_tbl)
        
}

# assess the RMSE on the model
calc_rmse(ret)
  • To visualize the model performance, we can use a plotting function, plot_prediction().
# Setup single plot function (requires ggplot2, tidyquant for theme_tq()/scale_color_tq(), and glue)
plot_prediction <- function(data, id, alpha = 1, size = 2, base_size = 14) {
    
    rmse_val <- calc_rmse(data)
    
    g <- data %>%
        ggplot(aes(index, value, color = key)) +
        geom_point(alpha = alpha, size = size) + 
        theme_tq(base_size = base_size) +
        scale_color_tq() +
        theme(legend.position = "none") +
        labs(
            title = glue("{id}, RMSE: {round(rmse_val, digits = 1)}"),
            x = "", y = ""
        )
    
    return(g)
}
  • Test the plotting function.
ret %>% 
    plot_prediction(id = split_id, alpha = 0.65) +   # split_id is a character label for this train/test split, e.g., "Slice1"
    theme(legend.position = "bottom")

4.2 Example 2: Text Generation - Implementing character-level LSTM text generation

Let's put these ideas into practice in a Keras implementation. The first thing we need is a lot of text data that we can use to learn a language model. You could use any sufficiently large text file or set of text files - Wikipedia, the Lord of the Rings, etc. In this example we will use some of the writings of Nietzsche, the late-19th-century German philosopher (translated to English). The language model we learn will thus be specifically a model of Nietzsche's writing style and topics of choice, rather than a more generic model of the English language.

4.2.1 Preparing the data

  • Start by downloading the corpus and converting it to lowercase.
library(keras)
library(stringr)

data.csv <- read.csv("https://umich.instructure.com/files/12554194/download?download_frd=1", header=T)
text <- tolower(data.csv$text)
text <- paste(text, collapse = " ")  # collapse into a single string in case the corpus spans multiple rows
cat("Corpus length:", nchar(text), "\n")
  • Extract partially overlapping sequences of length maxlen, generate dummy indicator variables (one-hot encoding), and pack them in a 3D array x of shape (sequences, maxlen, unique_characters). The array y contains the corresponding targets - the one-hot-encoded characters that come after each extracted sequence.
maxlen <- 60  # Length of extracted character sequences
step <- 3  # We sample a new sequence every `step` characters
  
text_indexes <- seq(1, nchar(text) - maxlen, by = step)
# This holds our extracted sequences
sentences <- str_sub(text, text_indexes, text_indexes + maxlen - 1)
# This holds the targets (the follow-up characters)
next_chars <- str_sub(text, text_indexes + maxlen, text_indexes + maxlen)
cat("Number of sequences: ", length(sentences), "\n")
# List of unique characters in the corpus
chars <- unique(sort(strsplit(text, "")[[1]]))
cat("Unique characters:", length(chars), "\n")
# Dictionary mapping unique characters to their index in `chars`
char_indices <- 1:length(chars) 
names(char_indices) <- chars
# Next, one-hot encode the characters into binary arrays.
cat("Vectorization...\n") 
x <- array(0L, dim = c(length(sentences), maxlen, length(chars)))
y <- array(0L, dim = c(length(sentences), length(chars)))
for (i in 1:length(sentences)) {
  sentence <- strsplit(sentences[[i]], "")[[1]]
  for (t in 1:length(sentence)) {
    char <- sentence[[t]]
    x[i, t, char_indices[[char]]] <- 1
  }
  next_char <- next_chars[[i]]
  y[i, char_indices[[next_char]]] <- 1
}
  • Build the LSTM network, consisting of a single LSTM layer, a dense classifier, and a softmax over all possible characters. Recurrent neural networks represent just one way to generate sequence data. The targets are dummy indicator variables (i.e., one-hot-encoded characters), so we can use categorical_crossentropy as the loss to train the model.
model <- keras_model_sequential() %>% 
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>% 
  layer_dense(units = length(chars), activation = "softmax")
optimizer <- optimizer_rmsprop(lr = 0.01)
model %>% compile(
  loss = "categorical_crossentropy", 
  optimizer = optimizer
)   
  • Training the language model and sampling from it. Given a trained model and a seed text, we can generate new synth-text by repeatedly:
    • Drawing from the model a probability distribution over the next character, given the text available so far
    • Re-weighting the distribution to a certain “temperature”
    • Sampling the next character at random according to the re-weighted distribution
    • Appending the new character to the end of the synth-text.
  • The sample_next_char() function re-weights the original probability distribution from the current model and draws a new character index (a quick sanity check follows the function definition below).
sample_next_char <- function(preds, temperature = 1.0) {
  preds <- as.numeric(preds)
  preds <- log(preds) / temperature
  exp_preds <- exp(preds)
  preds <- exp_preds / sum(exp_preds)
  which.max(t(rmultinom(1, 1, preds)))
}
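As a quick illustration of the temperature re-weighting (a toy check, not part of the original assignment), lower temperatures concentrate the draws on the most likely character, while higher temperatures diversify them:

p <- c(a = 0.7, b = 0.2, c = 0.1)   # toy next-character distribution
table(replicate(1000, names(p)[sample_next_char(p, temperature = 0.2)]))   # draws dominated by "a"
table(replicate(1000, names(p)[sample_next_char(p, temperature = 1.2)]))   # more diverse draws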
  • At each learning epoch, train the model and generate synth-text with a range of different temperatures.
for (epoch in 1:50) {
  
  cat("Epoch", epoch, "\n")
  
  # Fit the LSTM model for 1 epoch on the available training data
  model %>% fit(x, y, batch_size = 64, epochs = 1) 
  
  # Select a random seed
  start_index <- sample(1:(nchar(text) - maxlen - 1), 1)  
  seed_text <- str_sub(text, start_index, start_index + maxlen - 1)
  
  cat("--- Synth-Text Generation with seed:", seed_text, "\n\n")
  
  for (temperature in c(0.2, 0.5, 1.0, 1.2)) {
    
    cat("----------- temperature:", temperature, "\n")
    cat(seed_text, "\n")
    
    generated_text <- seed_text
    
    # Generate text with 500 characters
    for (i in 1:500) {
      
      sampled <- array(0, dim = c(1, maxlen, length(chars)))
      generated_chars <- strsplit(generated_text, "")[[1]]
      for (t in 1:length(generated_chars)) {
        char <- generated_chars[[t]]
        sampled[1, t, char_indices[[char]]] <- 1
      }
        
      preds <- model %>% predict(sampled, verbose = 0)
      next_index <- sample_next_char(preds[1,], temperature)
      next_char <- chars[[next_index]]
      
      generated_text <- paste0(generated_text, next_char)
      generated_text <- substring(generated_text, 2)
      
      cat(next_char)
    }
    cat("\n")
  }
}
  • Report results for temperature=0.2, temperature=0.5, and temperature=1.0, and inspect the corresponding model convergence. Lower temperatures may yield extremely repetitive and predictable text that nevertheless has realistic English-language structure. Higher temperatures may yield more elaborate synthetic text; however, new plausible-looking words/phrases/tokens may sometimes be invented, breaking the local sentence structure or producing semi-random strings of characters. Experiment with multiple sampling strategies to identify an optimal balance between learned structure and randomness.

Training larger models is expected to be more computationally intensive and yield more realistic textual output.
