
1 Imaging Data

Review the 3D/4D MRI imaging data discussion in Chapter 2. Extract time courses at several different 3D spatial locations (voxels), some near-by and some farther apart (distant voxels). Then apply time-series analyses, report your findings, and determine whether near-by or farther-apart voxels are more strongly correlated (a correlation sketch follows the extraction example below).

Example of extracting timeseries from 4D fMRI data

# See examples here: https://cran.r-project.org/web/packages/oro.nifti/vignettes/nifti.pdf

library(oro.nifti)
fMRIURL <- "http://socr.umich.edu/HTML5/BrainViewer/data/fMRI_FilteredData_4D.nii.gz"
fMRIFile <- file.path(tempdir(), "fMRI_FilteredData_4D.nii.gz")
download.file(fMRIURL, dest=fMRIFile, quiet=TRUE)
(fMRIVolume <- readNIfTI(fMRIFile, reorient=FALSE)) # dimensions: 64 x 64 x 21 x 180 ; 4mm x 4mm x 6mm x 3 sec

fMRIVolDims <- dim(fMRIVolume); fMRIVolDims
time_dim <- fMRIVolDims[4]; time_dim

hist(fMRIVolume)

# To examine the time course of a specific 3D voxel (say the one at x=30, y=30, z=15):
plot(fMRIVolume[30, 30, 15, ], type='l',
     main="Time Series of 3D Voxel (x=30, y=30, z=15)", col="blue")

x1 <- c(1:180)
y1 <- loess(fMRIVolume[30, 30, 15, ] ~ x1, family="gaussian")
lines(x1, predict(y1), col="purple", lwd=2)   # loess smoother
lines(x1, smooth(fMRIVolume[30, 30, 15, ]), col="red", lwd=2)
lines(ksmooth(x1, fMRIVolume[30, 30, 15, ], kernel="normal", bandwidth=5), col="green", lwd=3)
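
Below is a minimal sketch (not an official solution) comparing a reference voxel against a near-by and a distant voxel, assuming fMRIVolume has been loaded as above; the specific voxel coordinates are arbitrary choices for illustration.

# extract three voxel time courses: a reference, an adjacent neighbor, and a distant voxel
ts_ref     <- fMRIVolume[30, 30, 15, ]   # reference voxel
ts_near    <- fMRIVolume[31, 30, 15, ]   # adjacent voxel (one 4mm step away)
ts_distant <- fMRIVolume[10, 50, 5, ]    # spatially remote voxel
# compare the pairwise correlations of the signals
cor(ts_ref, ts_near)                     # near-by voxels are typically more correlated
cor(ts_ref, ts_distant)                  # distant voxels are typically less correlated
# cross-correlation across time lags
ccf(ts_ref, ts_near, lag.max = 20, main = "Cross-correlation: reference vs. near-by voxel")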

2 Tabular data

Use Google Web-Search Trends and Stock Market Data to:

  • Plot the time series for the variable Job
  • Apply TTR to smooth the original graph by month
  • Determine the differencing parameter
  • Identify the auto-regressive (AR) and moving-average (MA) orders
  • Build an ARIMA model and forecast the time course over the next 365 days (for 2012); a sketch of these steps follows this list.
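
A minimal sketch of these steps is shown below, assuming the Google Trends data has already been loaded into a data frame named trends with a monthly column Job (the object name, column name, and start date are placeholders, not the actual dataset layout).

library(TTR)
library(forecast)
# monthly time series for the "Job" variable (start date is a placeholder)
job_ts <- ts(trends$Job, start = c(2004, 1), frequency = 12)
plot(job_ts, main = "Google Trends: Job")
lines(SMA(job_ts, n = 3), col = "red")              # TTR smoothing by month
d <- ndiffs(job_ts); d                              # differencing parameter
acf(diff(job_ts, differences = d))                  # suggests the MA order (q)
pacf(diff(job_ts, differences = d))                 # suggests the AR order (p)
fit <- auto.arima(job_ts)                           # or arima(job_ts, order = c(p, d, q))
plot(forecast(fit, h = 12))                         # forecast the next 12 months (~365 days, 2012)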

3 Latent variables model

Use Handwritten English Letters data to:

  • Explore the data and evaluate the correlations between the covariates
  • Justify the use of a latent variable model
  • Apply appropriate data conversion and scale the data
  • Fit a SEM to the data, adding a suitable latent variable
  • Summarize and interpret the outputs
  • Use the model identified above to fit GEE and GLMM models with the latent variable as the response, and compare the AICs (see the sketch after this list).
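
The sketch below outlines one possible workflow, assuming the letters data is loaded as letters_df with numeric covariates V1-V4 and a grouping column group (all of these names are placeholders; substitute the actual column names).

library(lavaan)
library(geepack)
library(lme4)
# explore correlations between covariates
round(cor(letters_df[, c("V1", "V2", "V3", "V4")]), 2)
# convert/scale the data
letters_std <- as.data.frame(scale(letters_df[, c("V1", "V2", "V3", "V4")]))
letters_std$group <- letters_df$group                 # hypothetical cluster/grouping variable
# SEM with one hypothetical latent variable loading on the covariates
sem_model <- ' shape =~ V1 + V2 + V3 + V4 '
fit_sem <- sem(sem_model, data = letters_std)
summary(fit_sem, fit.measures = TRUE, standardized = TRUE)
# use the estimated factor scores as the response in GEE and GLMM models
letters_std$shape <- as.numeric(predict(fit_sem))
fit_gee  <- geeglm(shape ~ V1 + V2, id = group, data = letters_std)
fit_glmm <- lmer(shape ~ V1 + V2 + (1 | group), data = letters_std)
AIC(fit_glmm)      # GLMM AIC; for the GEE fit use geepack::QIC(fit_gee) as the analogue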

4 Complete the following two end-to-end examples

4.1 Example 1: Temperature Time-series LSTM Modeling

Let’s use the 2009-2017 Climate Data from the Max Planck Institute in Jena, Germany to demonstrate time-series analysis via RNNs. You can find more details about this study here. This sequence data represents a time series recorded at the Weather Station of the Max Planck Institute for Biogeochemistry in Jena, Germany. The data include a date-time stamp and 14 climate measurements/features (e.g., atmospheric pressure, temperature and humidity), recorded every 10 minutes over 9 years. Suppose we choose a prediction window of 12 months (1 year), a batch size of 4 units (evenly dividing the number of testing and training observations), time steps = 1 (specifying one lag), and epochs = 100.
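
The code below assumes the climate measurements have already been loaded and split; since the original snippet does not show this step, here is a minimal data-preparation sketch (an assumption about the intended workflow, not the official solution). It uses the standard Jena file jena_climate_2009_2016.csv, assumed to be downloaded locally, and the positive-valued potential temperature column Tpot (K) so that the square-root step in the recipe below is well defined; despite the .std suffix used downstream, df_train.std and df_test.std hold raw values here, because the recipe performs the sqrt/center/scale standardization.

library(readr)
library(tibble)
climate <- read_csv("jena_climate_2009_2016.csv")                     # local copy of the Jena data
climate_time <- as.POSIXct(climate$`Date Time`, format = "%d.%m.%Y %H:%M:%S")
temperature  <- climate$`Tpot (K)`                                    # positive-valued temperature series

# time/temperature frame referenced below as df_Temp_Time
df_Temp_Time <- tibble(t = temperature, index = climate_time)

# first 10,000 records for training, next 10,000 for testing
df_train.std <- matrix(temperature[1:10000], ncol = 1)
df_test.std  <- matrix(temperature[10001:20000], ncol = 1)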

# Model inputs
lag_setting  <- 12   # prediction window: 12 months (1 year)
batch_size   <- 4    # must evenly divide the train/test sample counts and match the stateful batch_input_shape below
train_length <- 10000

# number of lagged time steps per sample (one lag)
tsteps       <- 1
epochs       <- 100

# how many features = predictors we have
n_features <- 1
library("dplyr")
library("recipes")
library("tibble")
#install.packages("tibbletime")
library("tibbletime")

df_train <- as.data.frame(cbind(df_Temp_Time$t[1:10000], df_train.std))
# head(df_train)
df_test <- as.data.frame(cbind(df_Temp_Time$t[10001:20000], df_test.std))
# name both columns (the modeling below uses the "value" column)
colnames(df_train) <- c("t", "value"); colnames(df_test) <- c("t", "value")
colnames(df_Temp_Time) <- c("t", "index")

df <- bind_rows(
    df_train %>% add_column(key = "training"),
    df_test %>% add_column(key = "testing")
) # %>% 
    # as_tbl_time(index = as.Date(df_Temp_Time$index[1:10000]))
# df


df1 <- as.data.frame(cbind(as.Date(df_Temp_Time$index[1:20000]),
            as.double(df$value), df$key))
colnames(df1) <- c("index", "value", "class")
head(df1)
dates <- df_Temp_Time$index[1:20000]  # keep the full 10-minute timestamps (as.Date would create many duplicate indices)
# head(as_tbl_time(df, dates))

df2 <- tibble::tibble(
  index  = as.POSIXct(dates),
  value = as.numeric(as.character(df1$value)),
  class = df1$class
); head(df2)
df3 <- as_tbl_time(df2, index)
head(df3); tail(df3); str(df3$value); head(df3$value)

rec_obj <- recipe(value ~ ., df3) %>%
    step_sqrt(value) %>%
    step_center(value) %>%
    step_scale(value) %>%
    prep()
    
tail(rec_obj$template)
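
The back-transformation of the LSTM predictions (further below) references center_history and scale_history, which are not defined in the original snippet; a short sketch capturing them from the prepped recipe (assuming the three-step recipe above) is:

# save the centering/scaling statistics from the prepped recipe for the inverse transform
center_history <- rec_obj$steps[[2]]$means["value"]
scale_history  <- rec_obj$steps[[3]]$sds["value"]
c("center" = center_history, "scale" = scale_history)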

# define time, value and labels for the 10,000 training + 10,000 testing timepoints
climate_time_10000 <- climate_time[1:10000]
trainLabels <- rep("training", 10000); testLabels <- rep("testing", 10000)
TrainTestLabels <- c(trainLabels, testLabels)           # stack the labels to match the stacked values
TrainTestValues <- rbind(df_train.std, df_test.std)
tibble(date = as.Date(climate_time[1:20000]), 
       value = TrainTestValues[ , 1], key = TrainTestLabels)

# Training Set
lag_train_tbl <- rec_obj$template %>%
    mutate(value_lag = lag(value, n = lag_setting)) %>%
    filter(!is.na(value_lag)) %>%
    filter(class == "training") %>%
    tail(train_length)

# replace any NAs introduced by the preprocessing with 0
lag_train_tbl$value[is.na(lag_train_tbl$value)] <- 0

x_train_vec <- lag_train_tbl$value_lag
x_train_arr <- array(data = x_train_vec, dim = c(length(x_train_vec), 1, 1))

y_train_vec <- lag_train_tbl$value
y_train_arr <- array(data = y_train_vec, dim = c(length(y_train_vec), 1))

# Testing Set
lag_test_tbl <- rec_obj$template %>%
    mutate(value_lag = lag(value, n = lag_setting)) %>%
    filter(!is.na(value_lag)) %>%
    filter(class == "testing")

x_test_vec <- lag_test_tbl$value_lag
x_test_arr <- array(data = x_test_vec, dim = c(length(x_test_vec), 1, 1))

y_test_vec <- lag_test_tbl$value
y_test_arr <- array(data = y_test_vec, dim = c(length(y_test_vec), 1))
  • Build an LSTM model using a linear stack of layers (a sequential model); try the keras_model_sequential() function.
  • Start with two LSTM layers (the example below uses 50 units each). LSTM layer 1 takes the required input shape, \([batch\ size, time\ steps, number\ of\ features]\), and returns a 3D, stateful LSTM sequence (return_sequences = TRUE and stateful = TRUE). LSTM layer 2 takes the layer 1 output and returns a 2D shape (return_sequences = FALSE). Adding layer_dense(units = 1) provides a standard ending to the keras sequential model. At the end, we can use compile() with loss = "mae" and optimizer = "adam".
model.1 <- keras_model_sequential()

model.1 %>%
    layer_lstm(units            = 50, 
               # batch size x time steps x number of features (4 x 1 x 1)
               batch_input_shape = c(batch_size, tsteps, n_features),
               return_sequences = TRUE, 
               stateful         = TRUE) %>% 
    layer_lstm(units            = 50, 
               return_sequences = FALSE, 
               stateful         = TRUE) %>% 
    layer_dense(units = 1)

model.1 %>% 
    compile(loss = 'mae', optimizer = 'adam')

model.1
  • Next, we can fit the stateful LSTM model manually, using a for loop in which we reset the states. To preserve the sequence ordering, we specify shuffle = FALSE, which gives us direct control over resetting the states after each epoch via reset_states().
for (i in 1:epochs) {
    # use a number of samples divisible by the batch size; drop = FALSE preserves the array shapes
    model.1 %>% fit(x          = x_train_arr[1:(166*32), , , drop = FALSE], 
                    y          = y_train_arr[1:(166*32), , drop = FALSE], 
                    batch_size = batch_size,
                    epochs     = 1, 
                    verbose    = 1, 
                    shuffle    = FALSE)
    
    model.1 %>% reset_states()
    cat("Epoch: ", i, "\n")
}

########################### RStudio Documentation
# https://cran.rstudio.com/web/packages/keras/vignettes/sequential_model.html

# define a model along with the compilation step
model.2 <- keras_model_sequential() 
model.2 %>% 
  layer_dense(units = 32, input_shape = c(100)) %>% 
  layer_activation('relu') %>% 
  layer_dense(units = 10) %>% 
  layer_activation('softmax')

# the compile() function has appropriate arguments for a multi-class/binary classification problem
model.2 %>% compile(
  optimizer = 'rmsprop',
  # loss = 'categorical_crossentropy',
  # metrics = c('accuracy')
  loss = loss_binary_crossentropy,
  metrics = metric_binary_accuracy
)

#  Compilation for mean squared error regression problem
model %>% compile(
  optimizer = optimizer_rmsprop(lr = 0.002),
  loss = 'mse'
)

# Train the Keras models on R matrices or higher dimensional arrays of input data and labels using 
# the fit() function, e.g., a single-input model with 2 classes (binary classification)

# Generate dummy data
data <- matrix(runif(1000*100), nrow = 1000, ncol = 100)
labels <- matrix(round(runif(1000, min = 0, max = 1)), nrow = 1000, ncol = 1)

# Train the model, iterating on the data in batches of 32 samples
model.2 %>% fit(data, labels, epochs=10, batch_size=32)

# constants (from the RStudio sequential-model vignette)
# data_dim <- 16; timesteps <- 8; num_classes <- 10

# define and compile model
# expected input data shape: (batch_size, timesteps, data_dim)
data_dim <- 1
timesteps <- 2
num_classes <- 2

model <- keras_model_sequential() 
model %>% 
  layer_lstm(units = 32, return_sequences = TRUE, input_shape = c(timesteps, data_dim)) %>% 
  layer_lstm(units = 32, return_sequences = TRUE) %>% 
  layer_lstm(units = 32) %>% # return a single vector dimension 32
  layer_dense(units = num_classes, activation = 'softmax') %>%   # one output per class
  compile(
    loss = 'categorical_crossentropy',
    optimizer = 'rmsprop',
    metrics = c('accuracy')
  )
  
# training data (dummy example)
x_train <- array(runif(1000 * timesteps * data_dim), dim = c(1000, timesteps, data_dim))
y_train <- matrix(runif(1000 * num_classes), nrow = 1000, ncol = num_classes)
# alternatively, reuse the recipe output: 1000 samples x 2 timesteps x 1 feature,
# with the training/testing label one-hot encoded as a 2-class target
x_train <- array(rec_obj$template$value[1:2000], dim = c(1000, 2, 1))
y_train <- to_categorical(as.integer(rec_obj$template$class[1:1000] == "testing"), num_classes)


# testing/validation data (dummy example)
x_val <- array(runif(100 * timesteps * data_dim), dim = c(100, timesteps, data_dim))
y_val <- matrix(runif(100 * num_classes), nrow = 100, ncol = num_classes)

# alternatively, from the recipe output: 2000 samples x 2 timesteps x 1 feature
x_val <- array(rec_obj$template$value[1:4000], dim = c(2000, 2, 1))
y_val <- to_categorical(as.integer(rec_obj$template$class[1:2000] == "testing"), num_classes)

# train
model %>% fit( 
  x_train, y_train, batch_size = 64, epochs = 5, validation_data = list(x_val, y_val)
)
  • Once the LSTM model is estimated, we can use it to predict outcomes for the testing dataset, x_test_arr, using the predict() function. The predictions can be inversely transformed into the original measuring units using the saved mean and SD of the standardizing transformation.
  • Next, combine the predictions with the original data into a single long-format table, using reduce() and a custom time_bind_rows() function.
# Make Predictions
pred_out <- model.1 %>%    # use the stateful temperature LSTM (model.1) fit above
    predict(x_test_arr, batch_size = batch_size) %>%
    .[,1] 

# Retransform values
pred_tbl <- tibble(
    index   = lag_test_tbl$index,
    value   = (pred_out * scale_history + center_history)^2
) 

# Combine actual data with predictions
tbl_1 <- df3 %>%                                 # time-indexed actual training values
    filter(class == "training") %>% select(index, value) %>%
    add_column(key = "actual")

tbl_2 <- df3 %>%                                 # time-indexed actual testing values
    filter(class == "testing") %>% select(index, value) %>%
    add_column(key = "actual")

tbl_3 <- pred_tbl %>%
    add_column(key = "predict")

# Create time_bind_rows() to solve dplyr issue
time_bind_rows <- function(data_1, data_2, index) {
    index_expr <- enquo(index)
    bind_rows(data_1, data_2) %>%
        as_tbl_time(index = !! index_expr)
}

library("purrr"); library("forcats")    # for reduce() and as_factor()
ret <- list(tbl_1, tbl_2, tbl_3) %>%
    reduce(time_bind_rows, index = index) %>%
    arrange(key, index) %>%
    mutate(key = as_factor(key))

ret
  • The final step involves evaluating the stateful LSTM model. The yardstick::rmse() function (or the caret package) may be appropriate to quantify the performance. As the data is in long format, we can write a wrapper function calc_rmse() that transforms the data into the format expected by yardstick::rmse().
library("tidyr")    # for spread()
calc_rmse <- function(prediction_tbl) {
    
    rmse_calculation <- function(data) {
        data %>%
            spread(key = key, value = value) %>%
            select(-index) %>%
            filter(!is.na(predict)) %>%
            rename(
                truth    = actual,
                estimate = predict
            ) %>%
            rmse(truth, estimate) %>% pull(.estimate)    # return a single numeric RMSE
    }
    
    safe_rmse <- possibly(rmse_calculation, otherwise = NA)
    
    safe_rmse(prediction_tbl)
        
}

# assess the RMSE on the model
calc_rmse(ret)
  • To visualize the model performance, we can use a plotting function, plot_prediction().
# Setup single plot function
library("ggplot2"); library("tidyquant"); library("glue")   # for ggplot(), theme_tq(), scale_color_tq(), glue()
plot_prediction <- function(data, id, alpha = 1, size = 2, base_size = 14) {
    
    rmse_val <- calc_rmse(data)
    
    g <- data %>%
        ggplot(aes(index, value, color = key)) +
        geom_point(alpha = alpha, size = size) + 
        theme_tq(base_size = base_size) +
        scale_color_tq() +
        theme(legend.position = "none") +
        labs(
            title = glue("{id}, RMSE: {round(rmse_val, digits = 1)}"),
            x = "", y = ""
        )
    
    return(g)
}
  • Test the plotting function.
ret %>% 
    plot_prediction(id = "Jena temperature LSTM", alpha = 0.65) +   # any descriptive label works for `id`
    theme(legend.position = "bottom")

4.2 Example 2: Text Generation - Implementing character-level LSTM text generation

Let’s put these ideas into practice with a Keras implementation. The first thing we need is a lot of text data that we can use to learn a language model. You could use any sufficiently large text file or set of text files – Wikipedia, the Lord of the Rings, etc. In this example we will use some of the writings of Nietzsche, the late-19th-century German philosopher (translated into English). The language model we learn will thus be a model of Nietzsche’s writing style and topics of choice, rather than a more generic model of the English language.

4.2.1 Preparing the data

  • Start by downloading the corpus and converting it to lowercase.
library(keras)
library(stringr)

data.csv <- read.csv("https://umich.instructure.com/files/12554194/download?download_frd=1", header=T)
# collapse all text rows into a single lowercase string
text <- paste(tolower(data.csv$text), collapse = "\n")
cat("Corpus length:", nchar(text), "\n")
  • Extract partially overlapping sequences of length maxlen, generate dummy indicator variables (one-hot encoding), and pack them in a 3D array x of shape (sequences, maxlen, unique_characters). The array y contains the corresponding targets - the one-hot-encoded characters that come after each extracted sequence.
maxlen <- 60  # Length of extracted character sequences
step <- 3  # We sample a new sequence every `step` characters
  
text_indexes <- seq(1, nchar(text) - maxlen, by = step)
# This holds our extracted sequences
sentences <- str_sub(text, text_indexes, text_indexes + maxlen - 1)
# This holds the targets (the follow-up characters)
next_chars <- str_sub(text, text_indexes + maxlen, text_indexes + maxlen)
cat("Number of sequences: ", length(sentences), "\n")
# List of unique characters in the corpus
chars <- unique(sort(strsplit(text, "")[[1]]))
cat("Unique characters:", length(chars), "\n")
# Dictionary mapping unique characters to their index in `chars`
char_indices <- 1:length(chars) 
names(char_indices) <- chars
# Next, one-hot encode the characters into binary arrays.
cat("Vectorization...\n") 
x <- array(0L, dim = c(length(sentences), maxlen, length(chars)))
y <- array(0L, dim = c(length(sentences), length(chars)))
for (i in 1:length(sentences)) {
  sentence <- strsplit(sentences[[i]], "")[[1]]
  for (t in 1:length(sentence)) {
    char <- sentence[[t]]
    x[i, t, char_indices[[char]]] <- 1
  }
  next_char <- next_chars[[i]]
  y[i, char_indices[[next_char]]] <- 1
}
  • Build the LSTM network consisting of a single LSTM layer, a dense classifier, and a softmax over all possible characters. Recurrent neural networks represent just one way to generate sequence data. The targets are dummy indicator (one-hot) encoded characters, so the network is trained with a categorical cross-entropy loss; a sketch of this network follows.
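
A minimal sketch of this network is shown below (the layer size, optimizer settings, and single training epoch are illustrative choices, not values prescribed by the assignment):

# single LSTM layer followed by a dense softmax classifier over the character set
model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(maxlen, length(chars))) %>%
  layer_dense(units = length(chars), activation = "softmax")

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(lr = 0.01)
)

# one pass over the one-hot encoded sequences as a quick sanity check
model %>% fit(x, y, batch_size = 128, epochs = 1)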