This DSPA2 module is Part 2 of the DSPA2 Chapter 14 (Deep Learning, Neural Networks). Learners are encouraged to complete Part 1 of the DSPA2 Chapter 14 (Deep Learning, Neural Networks) before continuing with transfer learning in this Part 2.

Part 3 of the DSPA2 Chapter 14 (Deep Learning, Neural Networks) and Part 4 of the DSPA2 Chapter 14 (Deep Learning, Neural Networks) build on this Part 2 and cover the Torch and TensorFlow Image Pre-processing Image Classification Pipelines.

1 Transfer Learning

Humans learn complex tasks by capitalizing on their prior experiences, no matter how remote these previous encounters may appear to be. By the age of 5, most kids can learn how to ride a bicycle in a couple of training sessions. This riding ability is acquired after they have already mastered the arts of running (also known as controlled falling), navigating complex 3D environments, and anticipating dynamic 4D spatio-temporal events. In effect, before kids start pedaling, their many prior holistic training experiences ensure that they “know” the basics of bike balancing. Children’s formative years include a very large number of trial-and-error attempts, parental guidance sessions, and societal cues. These events provide the basic building blocks necessary to learn bicycle riding well in advance of the actual “bicycle training” experience, which we typically associate with bicycle riding.

This learning process is very different for machines. It’s extremely difficult to train a machine (a robot) to ride a bike, because the prior experiences kids go through are missing and cannot be easily built and transferred to the new task of learning how to balance a bike. In a way, humans learn new tasks easily because (1) they have already mastered a large collection of skills, and (2) they can transfer, mix, match, integrate, and harness their prior experiences in the process of learning a new task. Transfer machine learning attempts to replicate this human transfer learning process in the domain of artificial intelligence. The goals are to expedite the ML training process by capitalizing on prior knowledge, expand the realm of ML/AI applications, and enable “last mile” training of models so that they generate “reasonable decisions and actions” without starting from blank-slate de novo learning.

1.1 Deep Network Transfer Learning in Text Classification

One of the main challenges of AI/ML interpretation of free text is the extreme heterogeneity of the information and the unstructured format of the text content. This problem can be resolved by structurizing the input text and establishing homologies between multiple text samples (e.g., clinical notes). In a nutshell, transfer learning facilitates this process and enables (1) synthetic text generation (new data) that simulates realistic textual content (non-human data); and (2) transformation of unstructured text to structured data elements. For instance, if an \(input=clinical\ notes\), a DNN model generates \(output=vector\) representing a quantitative signature vector of the input text; think of it as a vector of principal components associated with the specific free text.

The result of this AI process is that independently of the text length or type, DNN always generates a numeric vector of a fixed size (say 128 values). This canonical representation establishes homologies between any given set of strings (character arrays).
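To make the fixed-size idea concrete, here is a toy sketch in base R. It is not the pre-trained DNN embedding used in this chapter; it assumes a simple hashing trick, and `hash_token()` and `embed_text()` are hypothetical helper names introduced only for this illustration. The point is only that notes of very different lengths map to vectors of the same length (128 here).

```r
# Toy illustration (NOT the pre-trained DNN embedding): map text of any length
# to a numeric vector of fixed size using a simple hashing trick.
hash_token <- function(token, dim = 128) {
  # Hypothetical helper: deterministic bucket index in 1..dim from character codes
  codes <- utf8ToInt(token)
  (sum(codes * seq_along(codes)) %% dim) + 1
}

embed_text <- function(text, dim = 128) {
  # Hypothetical helper: lower-case, split on non-letters, count tokens per bucket
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  v <- numeric(dim)
  for (tk in tokens) {
    idx <- hash_token(tk, dim)
    v[idx] <- v[idx] + 1
  }
  # Normalize by the number of tokens so long and short notes are comparable
  if (length(tokens) > 0) v <- v / length(tokens)
  v
}

short_note <- "Consult for laparoscopic gastric bypass."
long_note  <- paste(rep("A 23-year-old white female presents with complaint of allergies.", 20),
                    collapse = " ")

v_short <- embed_text(short_note)
v_long  <- embed_text(long_note)
length(v_short)   # same length for both notes
length(v_long)
```

A learned embedding differs in that nearby vectors correspond to semantically similar text, which a hash-based count vector does not guarantee.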

Let’s demonstrate transfer machine learning using the medical-specialty text-mining example of clinical notes that we saw in Chapter 5. From these data we derive a binary outcome indicating whether the medical specialty unit (there are 40 such units) is a surgical unit or not. We’ll split the 4,999 cases, each containing 6 data elements, including the medical-specialty unit and clinical notes, into training and testing sets.

The key will be to use keras to build and train an ML model that predicts surgical vs. non-surgical units from the content of the corresponding medical notes, using a previously trained text-mining DNN that converts text of any size into a fixed-length numeric vector. One also needs to install the tfhub package (TensorFlow Hub), which provides access to reusable pre-trained machine learning models.

venv_name <- "r-tensorflow"
reticulate::use_virtualenv(virtualenv = venv_name, required = TRUE)
library(reticulate)

# load the necessary libraries
# May need some installations first, e.g., 
#    in conda%> pip install tensorflow_datasets
#    install.packages("remotes")
#    remotes::install_github("rstudio/tfds")
#    tfds::install_tfds()

library(keras)
# library(reticulate)

# Install package TFhub: https://github.com/rstudio/tfhub
# devtools::install_github("rstudio/tfhub")
library(tfhub)
# library(tfds)

# Install TFdatasets: https://cran.r-project.org/web/packages/tfdatasets/vignettes/introduction.html
# devtools::install_github("rstudio/tfdatasets")
library(tfdatasets)
library(utf8)

# specify r-reticulate or r-tensorflow python Anaconda environment
# use_condaenv("r-tensorflow")
# use_condaenv("r-reticulate", required = TRUE)
# there are many ways of finding your conda environments and using the reticulate package to set them

# conda_list()[[1]][1] %>%  use_condaenv(required = TRUE)

# Check tensorflow install configuration
tensorflow::tf_config()

## TensorFlow v2.15.1 (C:\Users\IvoD\DOCUME~1\VIRTUA~1\R-TENS~1\lib\site-packages\tensorflow_hub\__init__.p)
## Python v3.10 (C:/Users/IvoD/Documents/.virtualenvs/r-tensorflow/Scripts/python.exe)

# py_module_available("tensorflow_hub")

# py_install("tensorflow_hub", pip = TRUE) # py_install("tensorflow_hub")
# py_install("tfds", pip = TRUE) # py_install("tfds")
# py_install("tensorflow_datasets", pip = TRUE)

# py_module_available("tensorflow_datasets")
# py_module_available("tfds")  # tensorflow_datasets

1.1.1 Binary Transfer Learning Label-Classification of Clinical Text

Let’s now design a full DNN binary-classification model composed of 4 layers stacked sequentially. The first (transfer learning) layer is the pre-trained TensorFlow Hub layer (prior model), which is loaded as the a priori left-most base layer of the full DNN and maps clinical notes (description sentences) into embedding vectors (canonical signature vectors). There are a number of pre-trained text embedding models we can choose from in this transfer-learning example. For instance, google/tf2-preview/gnews-swivel-20dim/1 splits the sentences into tokens, embeds each token, and then combines the embeddings, yielding an output of dimensions (num_examples, embedding_dimension). The fixed-length output vector of this prior-model layer is fed into the fully-connected (Dense) layer-2 with 16 hidden units. Layer-2 output feeds into the (dense) layer-3 with 6 nodes. Finally, layer-3 output goes into the last layer-4, which is also a densely connected layer, with a single output (class label). Using the sigmoid activation function, this output is a probability value between 0 and 1 indicating the model-predicted chance, or confidence level, that the medical note was written in a hospital surgical unit. The models fitted below follow the same pattern but use a slightly deeper stack of dense layers (64, 32, 16, and 1 units) on top of the nnlm-en-dim128 prior model.
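The data flow through the dense part of this four-layer stack can be sketched in base R. This is only an illustration under stated assumptions: the weights are random placeholders rather than trained values, the 20-dimensional input stands in for the hub-layer embedding, the hidden layers are assumed to use ReLU activations, and `dense()` is a hypothetical helper, not a keras function.

```r
# Minimal forward pass: embedding (20) -> dense (16) -> dense (6) -> dense (1, sigmoid).
# All weights are random placeholders, not trained values.
set.seed(1)
sigmoid <- function(z) 1 / (1 + exp(-z))
relu    <- function(z) pmax(z, 0)

# Hypothetical helper: affine map followed by an activation function
dense <- function(x, W, b, activation) {
  activation(x %*% W + matrix(b, nrow(x), length(b), byrow = TRUE))
}

n <- 5; d <- 20                              # 5 notes, 20-dimensional embeddings (assumed)
embedding <- matrix(rnorm(n * d), n, d)      # stand-in for the hub-layer output

W2 <- matrix(rnorm(d * 16, sd = 0.1), d, 16);  b2 <- rep(0, 16)
W3 <- matrix(rnorm(16 * 6, sd = 0.1), 16, 6);  b3 <- rep(0, 6)
W4 <- matrix(rnorm(6 * 1,  sd = 0.1), 6, 1);   b4 <- 0

h2 <- dense(embedding, W2, b2, relu)         # layer-2: 16 hidden units
h3 <- dense(h2, W3, b3, relu)                # layer-3: 6 nodes
p  <- dense(h3, W4, b4, sigmoid)             # layer-4: one probability per note

dim(p)      # one row per note, one column
range(p)    # all values lie strictly between 0 and 1
```

Because the last activation is a sigmoid, every output is a valid probability regardless of the weight values; training only changes which notes receive high or low probabilities.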

Examples of pre-trained text-embedding models that can be used for transfer learning include:

google/tf2-preview/gnews-swivel-20dim/1, the 20-dimensional model described above.
google/tf2-preview/gnews-swivel-20dim-with-oov/1, similar to google/tf2-preview/gnews-swivel-20dim/1, but with 2.5% of the vocabulary converted to OOV (out-of-vocabulary) buckets, which helps when the training and testing vocabularies are not fully overlapping.
google/tf2-preview/nnlm-en-dim50/1, a much larger pre-trained model with a vocabulary of size 1M and 50 dimensions.
google/tf2-preview/nnlm-en-dim128/1, another large model with a vocabulary of size 1M and 128 dimensions.

We will demonstrate NN-augmentation (transfer learning) by extending a base model, pre-trained on the English Google News 200B corpus, with 4 extra layers at the end, which will be tuned for our specific clinical text (medical notes). Of course, any of the other pre-trained models can be used as alternatives.

Download the clinical dataset and split it into training:testing (80:20).

Note that in this clinical-notes example, the input data consists of medical text transcriptions stored as string sentences. In the first demonstration, we will try to predict a binary integer label, 0 or 1, representing a non-surgical or surgical clinical unit where the clinical note was transcribed. To structurize the free text as a computable data object (a matrix), we will automatically convert sentences into embedding vectors. This can be accomplished using text2vec or keras::layer_text_vectorization() transformations, or by including a pre-trained text embedding as the first layer. The latter takes care of the text preprocessing, facilitates transfer learning, and makes the text-to-matrix conversion independent of the content and the size of the clinical note.

# install.packages("SnowballC")

library(keras)
library(SnowballC)
dataCT <- read.csv('https://umich.instructure.com/files/21152999/download?download_frd=1', header=T) 
str(dataCT)

## 'data.frame':    4999 obs. of  6 variables:
##  $ Index            : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ description      : chr  " A 23-year-old white female presents with complaint of allergies." " Consult for laparoscopic gastric bypass." " Consult for laparoscopic gastric bypass." " 2-D M-Mode. Doppler.  " ...
##  $ medical_specialty: chr  " Allergy / Immunology" " Bariatrics" " Bariatrics" " Cardiovascular / Pulmonary" ...
##  $ sample_name      : chr  " Allergic Rhinitis " " Laparoscopic Gastric Bypass Consult - 2 " " Laparoscopic Gastric Bypass Consult - 1 " " 2-D Echocardiogram - 1 " ...
##  $ transcription    : chr  "SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies w"| __truncated__ "PAST MEDICAL HISTORY:, He has difficulty climbing stairs, difficulty with airline seats, tying shoes, used to p"| __truncated__ "HISTORY OF PRESENT ILLNESS: , I have seen ABC today.  He is a very pleasant gentleman who is 42 years old, 344 "| __truncated__ "2-D M-MODE: , ,1.  Left atrial enlargement with left atrial diameter of 4.7 cm.,2.  Normal size right and left "| __truncated__ ...
##  $ keywords         : chr  "allergy / immunology, allergic rhinitis, allergies, asthma, nasal sprays, rhinitis, nasal, erythematous, allegr"| __truncated__ "bariatrics, laparoscopic gastric bypass, weight loss programs, gastric bypass, atkin's diet, weight watcher's, "| __truncated__ "bariatrics, laparoscopic gastric bypass, heart attacks, body weight, pulmonary embolism, potential complication"| __truncated__ "cardiovascular / pulmonary, 2-d m-mode, doppler, aortic valve, atrial enlargement, diastolic function, ejection"| __truncated__ ...

# 'data.frame': 4999 obs. with  6 variables
colnames(dataCT)

## [1] "Index"             "description"       "medical_specialty"
## [4] "sample_name"       "transcription"     "keywords"

# Binarize the 40 hospital units as Surgery-type and Non-Surgery types
dataCT$surgLabel <- ifelse(grepl('Surg', dataCT$medical_specialty), 1, 0)
table(grepl('Surg', dataCT$medical_specialty))

## 
## FALSE  TRUE 
##  3869  1130

# Fix the descriptions to UTF-8 encoding
library(stringi)
# table(stri_enc_mark(dataCT$description)) # ASCII native #  4994      5
dataCT$description <- stri_encode(dataCT$description, "", "UTF-8") 
dataCT$transcription <- stri_encode(dataCT$transcription, "", "UTF-8") 

dataCT$clinicalNotes <- paste(dataCT$description, dataCT$transcription)

# Clean the clinical notes
library(tm)
## Vectorize the text
train_corpus <- VCorpus(VectorSource(dataCT$clinicalNotes))
## Remove Punctuation
train_corpus <- tm_map(train_corpus, content_transformer(removePunctuation))
## Remove numbers
train_corpus <- tm_map(train_corpus, removeNumbers)
## Convert text to lower case
train_corpus <- tm_map(train_corpus, content_transformer(tolower))
## Remove stop words
train_corpus <- tm_map(train_corpus, content_transformer(removeWords), stopwords("english"))
## Stemming
train_corpus <- tm_map(train_corpus, stemDocument)
## Remove multiple whitespaces
train_corpus <- tm_map(train_corpus, stripWhitespace)
# Extract only the simplified text from the complex train_corpus object
dataCT$clinicalNotes <- unlist(lapply(train_corpus, `[[`, 1))

# Split the data 80:20
train_set_ind <- sample(nrow(dataCT), floor(nrow(dataCT)*0.8)) # 80:20 split training:testing
train_data <- dataCT[train_set_ind , ]
test_data <- dataCT[-train_set_ind , ]  

num_words <- 10000
max_length <- 300
text_vectorization <- layer_text_vectorization(max_tokens = num_words, output_sequence_length = max_length)

# `adapt()` the clinical-notes text vectorization layer. Calling adapt() lets the layer learn
# the vocabulary of the medical text in the training set and assign an integer index to each word
text_vectorization %>% adapt(train_data$clinicalNotes)

# Optionally, confirm that the medical-notes vocabulary is stored in the text vectorization layer
# get_vocabulary(text_vectorization)

# Optionally, pre-compute numeric inputs and one-hot labels (not needed below, where
# the text vectorization layer is part of the model and transforms its inputs)
# trainDataX <- text_vectorization(matrix(train_data$clinicalNotes, ncol = 1))
# trainDataY_one_hot_labels <- to_categorical(train_data$surgLabel, num_classes = 2)

# Define and fit the model - the input data consists of an array of word-indices. 
# The predicted labels are either 0 or 1.
# The classifier is based on sequentially stacking the network layers
# The first embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. 
# These vectors are learned as the model trains. 
# The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).
# A global_average_pooling_1d layer returns a fixed-length output vector for each example by averaging over the sequence dimension. 
# This allows the model to handle *variable-length* inputs
# The fixed-length output vector is piped through a fully-connected (dense) layer with 16 hidden units.
# The last output layer is densely connected with a single output node. 
# Sigmoid activation function yields a probability between 0 and 1 indicating the confidence of the binary label.
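The pooling step described in these comments can be illustrated in base R. This is a simplified sketch, not the keras implementation: the embedding table is random, `pool_note()` is a hypothetical helper, and the padding to a common sequence length that the text vectorization layer performs is ignored.

```r
# Global average pooling over the sequence dimension: each note, whatever its
# number of tokens, is reduced to one vector of length `embed_dim`.
set.seed(2)
embed_dim  <- 32
vocab_size <- 100
# Random stand-in for a learned embedding table (one row per word index)
embedding_table <- matrix(rnorm(vocab_size * embed_dim), vocab_size, embed_dim)

pool_note <- function(word_indices) {
  # Look up one embedding row per token: matrix of shape (sequence, embedding)
  token_vectors <- embedding_table[word_indices, , drop = FALSE]
  colMeans(token_vectors)   # average over the sequence dimension
}

note_a <- c(5, 17, 42)                              # a 3-token note
note_b <- sample(vocab_size, 50, replace = TRUE)    # a 50-token note

length(pool_note(note_a))   # both notes yield vectors of length embed_dim
length(pool_note(note_b))
```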

1.1.1.1 Define a fresh new `model1` de novo

# 1. Define a new fresh model1 de novo
# input <- layer_input(shape = c(300))  # for numerical input, e.g., trainDataX

input <- layer_input(shape = c(1), dtype = "string")   # raw text input as string; must match what the next layer expects
output <- input %>% 
  text_vectorization() %>% 
  layer_embedding(input_dim = num_words + 1, output_dim = 32) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dropout(0.5) %>% 
  layer_dense(units = 1, activation = "sigmoid")
model1 <- keras_model(input, output)

model1 %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = list('accuracy')
)

history <- model1 %>% fit(train_data$clinicalNotes,
          as.numeric(train_data$surgLabel),  
          epochs = 10, batch_size = 512, validation_split = 0.2, verbose=0)

# Evaluate the model1 performance
results <- model1 %>% evaluate(test_data$clinicalNotes, as.numeric(test_data$surgLabel), verbose = 0)
results

##      loss  accuracy 
## 0.5366074 0.7750000

1.1.1.2 Naive - out-of-the-box prior-model assessment (without retraining)

In a naive approach, we can even evaluate the performance of the prior model (English Google News 200B), i.e., assess transfer learning without any additional add-on training using the new problem-specific data. Remember that we have a univariate (binary) outcome and if we use dataset_batch(32), the output will include a vector of 32 probability estimates.

We will see next that Keras knows how to extract elements from TensorFlow Datasets automatically making it a much more memory efficient alternative than loading the entire dataset to RAM before passing to Keras.

To build the DNN model, we need to specify the network topology as a stack of network layers that include (1) a schema representing the unstructured text data (clinical note descriptions), and (2) the number and complexity of each subsequent layer in the model. For simplicity, in this example we convert the 40 different medical units into binary “surgical” unit labels, coded as 0 or 1.

The unstructured text can be converted into embedding vectors of a fixed size, which simplifies the text processing. We use the transfer-learning prior model, which includes a pre-trained text embedding, as the first DNN layer. This allows us to outsource the text preprocessing and its transformation into a quantitative information tensor. This is the key step illustrating the benefits of add-on based transfer learning in fine-tuning previously trained models.

The result of using this transfer-learning prior is that the model is invariant with respect to the length of the input clinical text - the output shape of the embeddings is \((num\_examples\times embedding\_dimension)\).

#  2. Naive - out-of-the-box prior-model assessment (without any retraining)
# Transfer Learning based on nnlm-en-dim128 (prior model) Define only output layer structure
library(tfhub)
library(keras) 

# Clear the TF Hub cache (outputs cached from prior runs may need to be removed).
# The path below is machine-specific; adjust it to the tfhub_modules folder on your system
tfhub_cache <- path.expand("C:/Users/IvoD/AppData/Local/Temp/tfhub_modules")
if (dir.exists(tfhub_cache)) unlink(tfhub_cache, recursive = TRUE)
model2 <- keras_model_sequential() %>% 
  layer_hub(
    handle = "https://www.kaggle.com/models/google/nnlm/tensorFlow2/tf2-preview-en-dim128/1",
    # handle = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1",
    input_shape = list(),
    dtype = tf$string,
    trainable = FALSE   # Set to TRUE for full model retraining, we use FALSE for quick transfer learning
  ) %>%
  
  layer_dense(units = 1, activation = "sigmoid")  # add the binary labeling output layer format
summary(model2)

## Model: "sequential"
## ________________________________________________________________________________
##  Layer (type)                  Output Shape               Param #    Trainable  
## ================================================================================
##  keras_layer (KerasLayer)      (None, 128)                12464268   N          
##                                                           8                     
##  dense_2 (Dense)               (None, 1)                  129        Y          
## ================================================================================
## Total params: 124642817 (475.47 MB)
## Trainable params: 129 (516.00 Byte)
## Non-trainable params: 124642688 (475.47 MB)
## ________________________________________________________________________________

model2 %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = list('accuracy')
)
# Just estimate the final 128+1 coefficients of the final layer
history <- model2 %>% fit(
  train_data$clinicalNotes, train_data$surgLabel, 
  epochs = 5, ### increase epochs for better performance
  batch_size = 128
)

## Epoch 1/5
## 32/32 - 1s - loss: 0.7686 - accuracy: 0.3688 - 1s/epoch - 36ms/step
## Epoch 2/5
## 32/32 - 0s - loss: 0.6069 - accuracy: 0.7504 - 403ms/epoch - 13ms/step
## Epoch 3/5
## 32/32 - 0s - loss: 0.5589 - accuracy: 0.7734 - 413ms/epoch - 13ms/step
## Epoch 4/5
## 32/32 - 0s - loss: 0.5382 - accuracy: 0.7737 - 381ms/epoch - 12ms/step
## Epoch 5/5
## 32/32 - 0s - loss: 0.5217 - accuracy: 0.7737 - 356ms/epoch - 11ms/step

# Assess performance
score <- model2 %>% evaluate(test_data$clinicalNotes, test_data$surgLabel)

## 32/32 - 0s - loss: 0.5153 - accuracy: 0.7750 - 305ms/epoch - 10ms/step

print(score)

##      loss  accuracy 
## 0.5153179 0.7750000

y_pred <- ifelse((model2 %>% predict(test_data$clinicalNotes)) >0.4, 1, 0)

## 32/32 - 0s - 262ms/epoch - 8ms/step

table(y_pred, test_data$surgLabel)

##       
## y_pred   0   1
##      0 731 209
##      1  44  16

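These surgical-unit predictions are not very reliable, as the model is not yet fine-tuned to respond specifically to clinical text. Note that the 77.5% test accuracy equals the share of non-surgical notes in the test set (775 of 1,000 in the table above), i.e., what a classifier that always predicts “non-surgical” would achieve.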
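Accuracy alone hides how poorly the surgical class is detected. The following base-R sketch recomputes a few standard metrics from the model2 confusion table shown above (threshold 0.4); the counts are copied from that table, and the variable names are chosen here for illustration.

```r
# Confusion matrix reported above for model2 (rows = predicted, columns = true label)
cm <- matrix(c(731, 44, 209, 16), nrow = 2,
             dimnames = list(y_pred = c("0", "1"), truth = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # (731 + 16) / 1000
sensitivity <- cm["1", "1"] / sum(cm[, "1"])    # share of surgical notes flagged as surgical
specificity <- cm["0", "0"] / sum(cm[, "0"])    # share of non-surgical notes labeled non-surgical
baseline    <- max(colSums(cm)) / sum(cm)       # always predicting the majority class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, baseline = baseline), 3)
```

Only 16 of the 225 surgical notes are identified at this threshold, so the sensitivity is about 7% even though the accuracy is close to 75%.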

The next step is to compile the transfer-learning model by specifying a loss function and an optimizer to facilitate the transfer learning during the iterative network model fitting (fine-tuning). In this binary classification problem, we use the binary_crossentropy loss function. The model outputs a probability value from the final DNN layer (the right-most single-unit layer with a sigmoid activation).

Another possible loss function for a binary outcome is mean_squared_error, which is more commonly used in regression settings. However, binary_crossentropy is often better for dealing with probabilities, as it measures the “distance” between the probability distributions representing the predicted outcome and the ground truth in supervised problems. We also employ Adaptive Moment Estimation (Adam), an effective and widely used optimizer.
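The difference between the two losses can be seen on a small hand-made example. The sketch below defines both losses in base R (the function names mirror the keras losses but are plain R re-implementations written for this illustration, and the labels and probabilities are made up).

```r
# Binary cross-entropy and mean squared error for labels y in {0, 1}
# and predicted probabilities p in (0, 1)
binary_crossentropy <- function(y, p, eps = 1e-7) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
mse <- function(y, p) mean((y - p)^2)

y      <- c(1, 0, 1, 0)
p_good <- c(0.9, 0.1, 0.8, 0.2)   # confident and correct
p_bad  <- c(0.1, 0.9, 0.2, 0.8)   # confident and wrong

c(bce = binary_crossentropy(y, p_good), mse = mse(y, p_good))
c(bce = binary_crossentropy(y, p_bad),  mse = mse(y, p_bad))
```

The squared error of a single prediction can never exceed 1, whereas the cross-entropy grows without bound as a wrong prediction becomes more confident, which gives stronger gradients for badly misclassified cases.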

1.1.1.3 Simple Transfer Learning

Let’s use the nnlm-en-dim128 (prior model) to define an expanded DNN model by adding four layers at the end to customize the deep neural network to our specific clinical data.

# 3. Transfer Learning based on the nnlm-en-dim128 (prior model) Define expanded DNN model structure + 4 layers
model3 <- keras_model_sequential() %>% 
  layer_hub(
    handle = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1",
    input_shape = list(),
    dtype = tf$string,
    trainable = FALSE   # Set to TRUE for full model retraining, we use FALSE for quick transfer learning
  ) %>% 
  # modify default pre-trained model by adding 4 extra layers at the end tuned for our clinical text (medical notes)
  layer_dense(units = 64, activation = "sigmoid") %>% 
  #layer_dropout(rate = 0.5) %>% 
  layer_dense(units = 32, activation = "sigmoid") %>% 
  #layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = "sigmoid") %>% 
  #layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = "sigmoid")
#   layer_dense(units = 16, activation = "relu") %>% 
#   layer_dense(units = 6, activation = "relu") %>% 
#   layer_dense(units = 1, activation = "sigmoid")

summary(model3)

## Model: "sequential_1"
## ________________________________________________________________________________
##  Layer (type)                  Output Shape               Param #    Trainable  
## ================================================================================
##  keras_layer_1 (KerasLayer)    (None, 128)                12464268   N          
##                                                           8                     
##  dense_6 (Dense)               (None, 64)                 8256       Y          
##  dense_5 (Dense)               (None, 32)                 2080       Y          
##  dense_4 (Dense)               (None, 16)                 528        Y          
##  dense_3 (Dense)               (None, 1)                  17         Y          
## ================================================================================
## Total params: 124653569 (475.52 MB)
## Trainable params: 10881 (42.50 KB)
## Non-trainable params: 124642688 (475.47 MB)
## ________________________________________________________________________________

model3 %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = list('accuracy')
)

history <- model3 %>% fit(
  train_data$clinicalNotes, train_data$surgLabel, 
  epochs = 10, ### increase epochs for better performance
  batch_size = 128
)

## Epoch 1/10
## 32/32 - 2s - loss: 0.5621 - accuracy: 0.7737 - 2s/epoch - 52ms/step
## Epoch 2/10
## 32/32 - 0s - loss: 0.5328 - accuracy: 0.7737 - 437ms/epoch - 14ms/step
## Epoch 3/10
## 32/32 - 0s - loss: 0.5257 - accuracy: 0.7737 - 440ms/epoch - 14ms/step
## Epoch 4/10
## 32/32 - 0s - loss: 0.5114 - accuracy: 0.7737 - 421ms/epoch - 13ms/step
## Epoch 5/10
## 32/32 - 0s - loss: 0.4846 - accuracy: 0.7737 - 419ms/epoch - 13ms/step
## Epoch 6/10
## 32/32 - 0s - loss: 0.4531 - accuracy: 0.7737 - 388ms/epoch - 12ms/step
## Epoch 7/10
## 32/32 - 0s - loss: 0.4285 - accuracy: 0.7727 - 401ms/epoch - 13ms/step
## Epoch 8/10
## 32/32 - 0s - loss: 0.4146 - accuracy: 0.7714 - 444ms/epoch - 14ms/step
## Epoch 9/10
## 32/32 - 0s - loss: 0.4046 - accuracy: 0.7707 - 404ms/epoch - 13ms/step
## Epoch 10/10
## 32/32 - 0s - loss: 0.3979 - accuracy: 0.7689 - 389ms/epoch - 12ms/step

# Assess performance
score <- model3 %>% evaluate(test_data$clinicalNotes, test_data$surgLabel)

## 32/32 - 0s - loss: 0.4004 - accuracy: 0.7690 - 372ms/epoch - 12ms/step

print(score)

##      loss  accuracy 
## 0.4004216 0.7690000

y_pred <- ifelse((model3 %>% predict(test_data$clinicalNotes)) >0.48, 1, 0)

## 32/32 - 0s - 240ms/epoch - 8ms/step

table(y_pred, test_data$surgLabel)

##       
## y_pred   0   1
##      0 664 121
##      1 111 104

1.1.1.4 Full-scale Transfer learning

Next we will use the structure/topology of the pre-trained model, but estimate all \(124M\) network parameters, not only the final \(11K\) parameters at the end, as we did earlier.
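The parameter counts quoted here can be verified by hand. A dense layer with \(n_{in}\) inputs and \(n_{out}\) units has \(n_{in}\times n_{out}\) weights plus \(n_{out}\) biases. The base-R sketch below applies this to the four dense layers added on top of the 128-dimensional embedding; `dense_params()` is a hypothetical helper, and the embedding parameter count is copied from the model summaries above.

```r
# Number of parameters in a dense layer: weights + biases
dense_params <- function(n_in, n_out) n_in * n_out + n_out

layer_sizes <- c(128, 64, 32, 16, 1)   # embedding width, then the four dense layers
head_params <- mapply(dense_params, head(layer_sizes, -1), tail(layer_sizes, -1))
head_params                            # per-layer counts: 8256 2080 528 17
sum(head_params)                       # 10881, the "11K" trainable parameters of model3

embedding_params <- 124642688          # hub-layer parameters, from summary(model3)
embedding_params + sum(head_params)    # 124653569, all trainable in the full-scale model
```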

# 4. Full-scale Transfer learning using the skeleton of the pre-trained model, but estimating all parameters 
model4 <- keras_model_sequential() %>% 
  layer_hub(
    handle = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1",
    input_shape = list(),
    dtype = tf$string,
    trainable = TRUE   # Set to FALSE for simple TL-model retraining, we use TRUE for full-transfer learning
  ) %>% 
  # modify default pre-trained model by adding 4 extra layers at the end tuned for our clinical text (medical notes)
  layer_dense(units = 64, activation = "sigmoid") %>% 
  #layer_dropout(rate = 0.5) %>% 
  layer_dense(units = 32, activation = "sigmoid") %>% 
  #layer_dropout(rate = 0.5) %>%
  layer_dense(units = 16, activation = "sigmoid") %>% 
  #layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = "sigmoid")
#   layer_dense(units = 16, activation = "relu") %>% 
#   layer_dense(units = 6, activation = "relu") %>% 
#   layer_dense(units = 1, activation = "sigmoid")

summary(model4)

## Model: "sequential_2"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  keras_layer_2 (KerasLayer)         (None, 128)                     124642688   
##  dense_10 (Dense)                   (None, 64)                      8256        
##  dense_9 (Dense)                    (None, 32)                      2080        
##  dense_8 (Dense)                    (None, 16)                      528         
##  dense_7 (Dense)                    (None, 1)                       17          
## ================================================================================
## Total params: 124653569 (475.52 MB)
## Trainable params: 124653569 (475.52 MB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________

model4 %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = list('accuracy')
)

history <- model4 %>% fit(
  train_data$clinicalNotes, train_data$surgLabel, 
  epochs = 10, ### increase epochs for better performance
  batch_size = 128
)

## Epoch 1/10
## 32/32 - 26s - loss: 0.7150 - accuracy: 0.4911 - 26s/epoch - 803ms/step
## Epoch 2/10
## 32/32 - 27s - loss: 0.5517 - accuracy: 0.7737 - 27s/epoch - 840ms/step
## Epoch 3/10
## 32/32 - 30s - loss: 0.5283 - accuracy: 0.7737 - 30s/epoch - 922ms/step
## Epoch 4/10
## 32/32 - 30s - loss: 0.4887 - accuracy: 0.7737 - 30s/epoch - 933ms/step
## Epoch 5/10
## 32/32 - 30s - loss: 0.4375 - accuracy: 0.7737 - 30s/epoch - 939ms/step
## Epoch 6/10
## 32/32 - 30s - loss: 0.4043 - accuracy: 0.7737 - 30s/epoch - 952ms/step
## Epoch 7/10
## 32/32 - 30s - loss: 0.3841 - accuracy: 0.7737 - 30s/epoch - 934ms/step
## Epoch 8/10
## 32/32 - 30s - loss: 0.3708 - accuracy: 0.7752 - 30s/epoch - 946ms/step
## Epoch 9/10
## 32/32 - 30s - loss: 0.3612 - accuracy: 0.7732 - 30s/epoch - 927ms/step
## Epoch 10/10
## 32/32 - 32s - loss: 0.3538 - accuracy: 0.7824 - 32s/epoch - 995ms/step

# Assess performance
score <- model4 %>% evaluate(test_data$clinicalNotes, test_data$surgLabel)

## 32/32 - 3s - loss: 0.4197 - accuracy: 0.7450 - 3s/epoch - 87ms/step

print(score)

##      loss  accuracy 
## 0.4197085 0.7450000

y_pred <- ifelse((model4 %>% predict(test_data$clinicalNotes)) >0.47, 1, 0)

## 32/32 - 3s - 3s/epoch - 83ms/step

table(y_pred, test_data$surgLabel)

##       
## y_pred   0   1
##      0 562  74
##      1 213 151

The final pair of steps includes:

Training. Transfer learning fine-tunes a prior pre-trained model by re-training it on the specific medical text (training) data, and
Validation. The learning process involves repeated model estimation using mini-batches (128 samples in the model2, model3, and model4 fits above, and 512 for model1; see dataset_batch() when feeding a TensorFlow dataset) with 10 (for speed) or more (e.g., 100+, for accuracy and precision) epochs, where each epoch is one pass over all samples in the training set. During the fine-tuning training process, the transfer learner reports the loss value (optimization measure) and accuracy (fidelity measure) after each epoch. When a validation set is supplied (e.g., via validation_split in fit()), the same measures are also reported on the held-out validation samples; with TensorFlow datasets, the sample order can be randomized using a shuffle buffer, e.g., of 10,000 samples (see dataset_shuffle()).

# Evaluate the model
# Examine the model performance.
# Mind the trajectories of the loss (representing the error; lower values are better)
# and the accuracy (higher values are better)
library(plotly)

plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$loss,
        type = "scatter", mode="markers+lines", name="Loss") %>% 
  add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$accuracy,
        type = "scatter", mode="markers+lines", name="Accuracy") %>%
  layout(title="DNN Training Performance", xaxis=list(title="epoch"),
         yaxis=list(title="Metric Value"), legend = list(orientation='h'),
         hovermode = "x unified")

# subplot(pl_loss, pl_acc, nrows=2, shareX = TRUE, titleX = TRUE)

This simple transfer learning approach achieves a test accuracy of about 75-78% in the runs shown above, which is close to the 77.5% share of non-surgical notes in the test set. More model customization and longer training are expected to improve the performance of the fine-tuned transfer-learning DNN model. Additional information about R-based tensorflow DNN modeling is available here and here.

1.1.2 Multinomial Transfer Learning classification of Clinical Text

Load all the appropriate R/Python packages and set up the RStudio environment.

The same clinical data can be used for multinomial classification, where the outcome is the clinical specialty unit (there are 40 hospital units in this case-study) and the input is the given clinical text. Start by defining the specialty labels (clinical units). The prediction of the 40 class labels will depend on the input \(x\) consisting of the string clinicalNotes, representing the concatenated descriptions and transcriptions.
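The label encoding used for multinomial classification can be sketched in base R on a four-note toy example. This is an illustration only: the specialty names are taken from the first rows of the dataset shown earlier, and the one-hot matrix is built with `diag()` rather than with the keras to_categorical() function used in the chapter code.

```r
# Integer coding and one-hot encoding of specialty labels (toy example)
specialty_names <- c(" Allergy / Immunology", " Bariatrics",
                     " Bariatrics", " Cardiovascular / Pulmonary")

keys     <- unique(specialty_names)          # distinct specialties, in order of appearance
label_id <- match(specialty_names, keys)     # integer label per note: 1, 2, 2, 3

# One row per note, one column per specialty, a single 1 marking the class
one_hot <- diag(length(keys))[label_id, , drop = FALSE]
colnames(one_hot) <- trimws(keys)
one_hot
rowSums(one_hot)                             # each row contains exactly one 1
```

With all 40 specialties, the same construction yields a matrix with 4,999 rows and 40 columns, which is the target format expected by a softmax output layer with a categorical cross-entropy loss.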

In this transfer learning example of multiclass text classification, we will utilize the gnews-swivel-20dim model with text embedding trained on English Google News 130GB corpus.

library(stringi)
dataCT <- read.csv('https://umich.instructure.com/files/21152999/download?download_frd=1', header=T) 

dataCT$description <- stri_encode(dataCT$description, "", "UTF-8") 
dataCT$transcription <- stri_encode(dataCT$transcription, "", "UTF-8") 

# Concatenate Transcriptions and Descriptions into one string/character: clinicalNotes
dataCT$clinicalNotes <- paste(dataCT$description, dataCT$transcription)

convert_specialty <- list()
keys <- unique(dataCT$medical_specialty)
medical_specialtyNames <- dataCT$medical_specialty
values <- 1:length(keys)
for(i in 1:length(keys)) { convert_specialty[keys[i]] <- values[i] }
specialty <- c()
for (i in 1:length(dataCT$medical_specialty)){
  specialty[i] <- as.numeric(convert_specialty[dataCT$medical_specialty[i]])
}

dataCT$medical_specialty <- specialty
dataCT$medical_specialty <- matrix(dataCT$medical_specialty,
                                   nrow = length(dataCT$medical_specialty), ncol = 1)

# Convert labels to categorical one-hot encoding
one_hot_SpecialtyLabels <- to_categorical(dataCT$medical_specialty,
               num_classes = length(unique(dataCT$medical_specialty))+1)
one_hot_SpecialtyLabels <- one_hot_SpecialtyLabels[, -1] # remove empty column 1
# library(keras)
# labels <- to_categorical
# sum(one_hot_SpecialtyLabels)  [1] 4999

num_words <- 10000
max_length <- 300
text_vectorization <- layer_text_vectorization(max_tokens = num_words, output_sequence_length = max_length)

train_set_ind <- sample(nrow(dataCT), floor(nrow(dataCT)*0.8)) # 80:20 split training:testing
train_data <- dataCT[train_set_ind, ]
test_data <- dataCT[-train_set_ind, ]
one_hot_SpecialtyLabels_trainY <- one_hot_SpecialtyLabels[train_set_ind, ]
one_hot_SpecialtyLabels_testY  <- one_hot_SpecialtyLabels[-train_set_ind, ]
 
# input <- layer_input(shape = c(1), dtype = "string")   # for raw text input as string, needs to match exp next layer
# output <- input %>% 
#   text_vectorization() %>% 
#   layer_embedding(input_dim = num_words + 1, output_dim = 256) %>%
#   layer_global_average_pooling_1d() %>%
#   layer_dense(units = 256, activation = "relu") %>%
#   layer_dropout(0.25) %>% 
#   layer_dense(units = 128, activation = "relu") %>%
#   layer_dropout(0.25) %>% 
#   layer_dense(units = 64, activation = "relu") %>%
#   # layer_dropout(0.25) %>% 
#   layer_dense(units = length(keys), activation = 'softmax')
# model2 <- keras_model(input, output)
# 
# model2 %>% compile(
#   loss = 'categorical_crossentropy', 
#   optimizer = optimizer_sgd(learning_rate = 0.01, decay = 1e-6, momentum = 0.9, nesterov = TRUE),
#   metrics = list('accuracy')
# )
# 
# history2 <- model2 %>% fit(train_data$clinicalNotes, one_hot_SpecialtyLabels_trainY,  
#                   epochs = 10, batch_size = 512, validation_split = 0.2, verbose=2)
# 
# # Evaluate the model2 performance
# results2 <- model2 %>% evaluate(test_data$clinicalNotes, one_hot_SpecialtyLabels_testY, verbose = 2)
# results2  
# 
# score <- model2 %>% evaluate(test_data$clinicalNotes, one_hot_SpecialtyLabels_testY)
# print(score)
# y_pred <- model2 %>% predict(test_data$clinicalNotes)
# head(apply(y_pred, 1, which.max))  # table(apply(y_pred, 1, which.max))
# # hist(y_pred[,8])
# table(y_pred, test_data$medical_specialty)
# 
# ============================================
  
model3 <- keras_model_sequential() %>% 
  layer_hub(
    handle = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",
    input_shape = list(),
    dtype = tf$string,
    trainable = TRUE
  ) %>% 
  layer_dense(units = 256, activation = "relu") %>%
  layer_dropout(0.25) %>% 
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(0.25) %>% 
  layer_dense(units = 64, activation = "relu") %>%
  # layer_dropout(0.25) %>% 
  layer_dense(units = length(keys), activation = 'softmax')

summary(model3)

## Model: "sequential_3"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  keras_layer_3 (KerasLayer)         (None, 20)                      400020      
##  dense_14 (Dense)                   (None, 256)                     5376        
##  dropout_2 (Dropout)                (None, 256)                     0           
##  dense_13 (Dense)                   (None, 128)                     32896       
##  dropout_1 (Dropout)                (None, 128)                     0           
##  dense_12 (Dense)                   (None, 64)                      8256        
##  dense_11 (Dense)                   (None, 40)                      2600        
## ================================================================================
## Total params: 449148 (1.71 MB)
## Trainable params: 449148 (1.71 MB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________

model3 %>% compile(
  loss = 'categorical_crossentropy', 
  optimizer = optimizer_sgd(learning_rate = 0.01, momentum = 0.9, nesterov = TRUE),
  metrics = list('accuracy')
)

history3 <- model3 %>% fit(train_data$clinicalNotes, one_hot_SpecialtyLabels_trainY,  
                  epochs = 100, batch_size = 512, validation_split = 0.2, verbose=0)

results3 <- model3 %>% evaluate(test_data$clinicalNotes, one_hot_SpecialtyLabels_testY, verbose = 0)
print(paste0("Mind that the testing-case performance metrics (Loss=", round(results3["loss"], 3), 
             " and Accuracy=", round(results3["accuracy"], 3),
             ") of the DNN text classification reflect results of ",
             length(keys), " medical specialties (classes), not a binary classification!"))

## [1] "Mind that the testing-case performance metrics (Loss=2.208 and Accuracy=0.355) of the DNN text classification reflect results of 40 medical specialties (classes), not a binary classification!"

score <- model3 %>% evaluate(test_data$clinicalNotes, one_hot_SpecialtyLabels_testY)

## 32/32 - 0s - loss: 2.2077 - accuracy: 0.3550 - 254ms/epoch - 8ms/step

print(score)

##     loss accuracy 
##  2.20774  0.35500
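To put the 40-class loss and accuracy in context, it may help to compare them against an uninformed classifier that assigns equal probability to every class. The sketch below is a minimal base-R illustration under that assumption. The function `categorical_cross_entropy()` and the toy objects are hypothetical and only mimic the definition of the categorical cross-entropy loss.

```r
# Hypothetical helper: mean categorical cross-entropy for one-hot labels
categorical_cross_entropy <- function(y_onehot, p, eps = 1e-7) {
  p <- pmin(pmax(p, eps), 1)            # avoid log(0)
  -mean(rowSums(y_onehot * log(p)))     # average over observations
}

n_classes <- 40

# Three toy observations with arbitrary true classes (assumed for illustration)
y_onehot <- diag(n_classes)[c(1, 8, 38), , drop = FALSE]

# Uninformed predictions: equal probability 1/40 for every class
uniform_p <- matrix(1 / n_classes, nrow = nrow(y_onehot), ncol = n_classes)

chance_loss <- categorical_cross_entropy(y_onehot, uniform_p)
chance_accuracy <- 1 / n_classes

print(c(chance_loss = chance_loss, chance_accuracy = chance_accuracy))
```

Under these assumptions the uniform-guess reference values are \(\log(40)\approx 3.689\) for the loss and \(0.025\) for the accuracy, so the reported test loss (2.208) and accuracy (0.355) are both better than uninformed guessing. When the classes are imbalanced, a majority-class baseline may be a stricter reference.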

y_pred <- model3 %>% predict(test_data$clinicalNotes)

## 32/32 - 0s - 355ms/epoch - 11ms/step

head(apply(y_pred, 1, which.max))  # table(apply(y_pred, 1, which.max))

## [1] 38 38 13  8  8 13

y_pred_class <- apply(y_pred, 1, which.max)
# hist(y_pred[,8])
table(y_pred_class, test_data$medical_specialty[,1])

##             
## y_pred_class   1   2   3   4   5   6   7   8   9  10  11  13  14  15  16  17
##           3    0   0  15   0   0   0   0  10   0   0   0   3   0   0   0   0
##           4    0   0   1   5   0   0   0   0   0   0   0   4   0   0   0   0
##           7    1   0   3   0   0   1   3   0   0   1   0   0   1   0   0   2
##           8    0   0  32   8   2  18   2 197   0   1   0   1   0   4   0   2
##           10   0   0   2   1   0   1   8   0   0  11   0   0   1   0   0   2
##           13   0   0  12  10   0   2   1   0   0   1   4  43   0   2   0   0
##           14   0   0   0   3   0   0   0   0   0   0   0   0   2   0   0   0
##           16   0   0   0   1   0   0   0   0   0   1   0   0   0   0   1   0
##           19   0   0   1   4   0   0   1   3   1   2   0   2   0   0   2   1
##           27   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##           30   0   0   2   0   0   0   0   1   0   0   0   2   1   0   0   1
##           34   0   0   3   0   0   1   6   2   0   1   0   0   0   0   0   3
##           38   2   2  12  15   1   6  31   1   1  13   2   3   2   1   1   7
##             
## y_pred_class  18  19  20  21  22  23  24  25  27  29  30  31  32  33  34  35
##           3    0   0   0   0   0   2   0   0   0   0   1   1   0   0   0   0
##           4    0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##           7    0   3   0   3   0   0   0   0   0   0   3   1   0   2   2   0
##           8   16  36  18   1  27  20   9   1   0   4  26  11   0   1   1   0
##           10   0   0   1   0   1   0   1   1   0   0   0   1   1   0   0   3
##           13   0   6   0   0   2   0   5   0   0   1   2   0   0   0   0   0
##           14   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0
##           16   0   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##           19   0   6   1   1   0   0   0   1   0   0   0   0   0   2   2   0
##           27   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##           30   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0
##           34   0   1   0   0   1   0   1   0   0   2   2   1   1   1   8   0
##           38   0   9   2   9   1   0   3   1   1   8   8   3   1   9   7   0
##             
## y_pred_class  36  37  38  39  40
##           3    0   0   2   0   0
##           4    0   0   0   0   0
##           7    1   0  14   0   0
##           8    4   6   3   0   0
##           10   0   0   4   0   0
##           13   0   0   2   1   0
##           14   0   0   6   0   0
##           16   0   0   0   1   0
##           19   0   1   4   2   1
##           27   0   0   0   0   0
##           30   0   0   1   0   0
##           34   0   0   0   0   0
##           38   3   0  63   1   0

# DT::datatable(matrix(table(y_pred_class, test_data$medical_specialty[,1]),40,40) )

# Count matrix: rows = predicted class (y-axis), columns = actual class (x-axis),
# since plotly heatmaps map the rows of z to y and the columns of z to x
heat <- matrix(0, length(keys), length(keys))
for (i in 1:length(test_data$clinicalNotes)) { 
  heat[y_pred_class[i], test_data$medical_specialty[i, 1]] <-  
    heat[y_pred_class[i], test_data$medical_specialty[i, 1]] + 1 
}

plot_ly(x = ~keys, y = ~keys, z = ~heat, name="Model Performance",
        hovertemplate = paste('<i>Matching</i>: %{z:.0f}', 
                              '<br><b>True</b>: %{x}<br>', '<b>Pred</b>: %{y}'),
        colors = 'Reds', type = "heatmap") %>% 
  layout(title="Predicted Classes vs. True Clinical Units", 
         xaxis=list(title="Actual Class"), yaxis=list(title="Predicted Class"))
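The cross-tabulation of predicted and actual classes can also be summarized numerically. The following base-R sketch, using small hypothetical vectors `toy_true` and `toy_pred` with 3 classes, derives the overall accuracy and the per-class recall and precision from a square confusion matrix. Fixing the factor levels is an assumption made here so that classes that are never predicted still get a row, which keeps the matrix square.

```r
# Hypothetical toy labels for a 3-class problem
toy_true <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 1)
toy_pred <- c(1, 2, 2, 2, 3, 3, 3, 1, 3, 1)
k <- 3

# Square confusion matrix: rows = predicted, columns = actual
conf_mat <- table(Predicted = factor(toy_pred, levels = 1:k),
                  Actual    = factor(toy_true, levels = 1:k))

accuracy  <- sum(diag(conf_mat)) / sum(conf_mat)   # overall fraction correct
recall    <- diag(conf_mat) / colSums(conf_mat)    # per actual class
precision <- diag(conf_mat) / rowSums(conf_mat)    # per predicted class

print(conf_mat)
print(round(rbind(recall, precision), 3))
print(accuracy)
```

Per-class recall is often more informative than overall accuracy when a few large classes dominate the predictions.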

1.1.3 Binary Classification of Film Reviews

All readers are encouraged to try text-based transfer learning using alternative datasets, e.g., the 50,000 movie reviews dataset. The code skeleton below illustrates the basic pipeline workflow for binary classification of the movie reviews.

# Load Movie Reviews (50K)
# split the entire dataset into a list of 3 objects:
# imdb[[1]]=training_set, imdb[[2]]=validation_set, imdb[[3]]=testing_set
imdb <-
    tfds::tfds_load (
    "imdb_reviews:1.0.0",
    split = list("train[:60%]", "train[-40%:]", "test"),
    as_supervised = TRUE
)

# Install the keras package if you haven't already: install.packages("keras")


# Load the keras package
# library(keras)
# 
# # Load the IMDb dataset
# imdb <- dataset_imdb(num_words = 10000)
# 
# # Split the dataset into train, validation, and test sets
# train_split <- 0.6
# validation_split <- 0.4
# 
# # Calculate the number of samples for each split
# total_samples <- length(imdb$train$x)[1]
# train_samples <- round(train_split * total_samples)
# validation_samples <- round(validation_split * total_samples)
# 
# # Create train, validation, and test sets
# train_dataset <- list(x = imdb$train$x[1:train_samples], y = imdb$train$y[1:train_samples])
# validation_dataset <- list(x = imdb$train$x[(train_samples + 1):(train_samples + validation_samples)], y = imdb$train$y[(train_samples + 1):(train_samples + validation_samples)])
# test_dataset <- imdb$test
# 
# # Save train, validation, and test datasets into imdb[[1]], imdb[[2]], and imdb[[3]], respectively
# imdb[[1]] <- train_dataset
# imdb[[2]] <- validation_dataset
# imdb[[3]] <- test_dataset




# imdb <- tfds_load(
#   "imdb_reviews:1.0.0",
#   split = c("train[:60%]", "train[-40%:]", "test"),
#   as_supervised = TRUE
# )
# summary(imdb)

# tfds_load returns a TensorFlow Dataset, an abstraction representing a list
# of elements, in which each element consists of one or more components.
# To access individual elements of a Dataset:
# 
# library(tfds)
# library(magrittr)
firstBatch <- imdb[[1]] %>%
  dataset_batch(1) %>% # Used to get only the first example
  reticulate::as_iterator() %>%
  reticulate::iter_next()
str(firstBatch)

## List of 2
##  $ :<tf.Tensor: shape=(1), dtype=string, numpy=…>
##  $ :<tf.Tensor: shape=(1), dtype=int64, numpy=array([0], dtype=int64)>

# imdb_train_iterator <- as_iterator(imdb[[1]])
# 
# # Retrieve the first example from the iterator
# firstBatch <- iter_next(imdb_train_iterator)

# library(magrittr)
#   firstBatch <- list(
#     x = imdb$train$x[[1]],
#     y = imdb$train$y[[1]]
#   )

review1 <- as_utf8(as.character(firstBatch[1][[1]]$numpy()[1][[1]])) # get text-review (string)
label1 <- as.numeric(firstBatch[2][[1]]$numpy())    # get binary class (0/1)

embedding_layer <- layer_hub(
  handle ="https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1")  

embedding_layer(firstBatch[[1]])

## tf.Tensor(
## [[ 9.01966274e-01 -4.83913347e-03  1.17907055e-01  3.81319046e-01
##    6.57222793e-02 -3.01581532e-01  8.90584365e-02 -2.69034863e-01
##   -8.51345584e-02  1.08877886e-02 -6.66372627e-02 -3.73063087e-01
##   -2.76447266e-01 -1.87254980e-01  5.67507632e-02  9.09779966e-02
##   -6.24961555e-02 -3.28687276e-03 -3.08512092e-01  3.78482223e-01
##    7.62880966e-02  1.43733576e-01 -1.12897493e-01  9.59761534e-03
##   -2.38938913e-01  2.93743908e-02  7.28663057e-02 -2.48727947e-02
##   -8.16893280e-02  6.68320432e-02 -5.62225394e-02  2.47078985e-01
##    1.17681175e-01  3.17581035e-02  2.65932620e-01 -1.37706831e-01
##   -1.50708258e-01 -1.63614675e-01 -1.51269153e-01  2.34616160e-01
##   -9.12236273e-02 -4.22684886e-02 -1.01224177e-01 -2.12229744e-01
##    6.74503446e-02  1.85163647e-01  3.62982228e-02 -3.50210071e-01
##   -5.92576079e-02 -9.54059511e-02 -9.65666175e-02  3.79339904e-02
##   -2.36725271e-01  2.67956525e-01 -2.22367734e-01 -1.80506572e-01
##   -1.13724798e-01  4.91059460e-02 -1.19525626e-01 -2.27335095e-03
##   -1.81468800e-01 -4.74342071e-02  9.61481929e-02  4.93341237e-02
##    2.69693173e-02  2.66610924e-02 -8.21918398e-02 -2.03230649e-01
##    2.25084737e-01  7.74206817e-02 -1.10149167e-01  1.33730099e-01
##    1.08389042e-01 -2.49691661e-02  3.02257799e-02  2.03551911e-02
##   -1.39646962e-01 -1.77291587e-01 -1.31853789e-01  1.65671393e-01
##   -4.72507323e-04 -9.78293121e-02 -1.64517537e-01  6.93127662e-02
##   -7.20646083e-02 -1.01133175e-02 -4.18493431e-03  2.48376504e-01
##    7.00922966e-01  6.45013988e-01 -2.46314004e-01  2.48779714e-01
##    5.55042960e-02 -1.72061652e-01  5.44746453e-03  2.16645315e-01
##    1.24983951e-01 -1.32985115e-02 -9.09600873e-03  8.74783769e-02
##   -2.72958595e-02  5.59117980e-02  2.11243659e-01  2.08114520e-01
##    1.86446942e-02 -2.44881704e-01 -2.11568519e-01  6.63717464e-02
##   -1.52921677e-01  9.16463733e-02 -1.56010687e-01  4.47210558e-02
##   -1.58450484e-01 -1.72194898e-01 -5.40404953e-02 -2.69618005e-01
##    1.23170123e-01  2.13364601e-01 -6.43658787e-02  3.61668468e-02
##    2.14489356e-01 -1.19912423e-01 -4.83419979e-04  2.64609545e-01
##    5.51236942e-02 -3.29729654e-02  3.31326015e-02  2.97882948e-02]], shape=(1, 128), dtype=float32)

# build the complete model
model <- keras_model_sequential() %>% 
  layer_hub(
    handle = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1",
    input_shape = list(),
    dtype = tf$string,
    trainable = TRUE
  ) %>% 
  layer_dense(units = 16, activation = "relu") %>% 
  layer_dense(units = 8, activation = "relu") %>% 
  layer_dense(units = 1, activation = "sigmoid")
summary(model)

## Model: "sequential_4"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  keras_layer_5 (KerasLayer)         (None, 128)                     124642688   
##  dense_17 (Dense)                   (None, 16)                      2064        
##  dense_16 (Dense)                   (None, 8)                       136         
##  dense_15 (Dense)                   (None, 1)                       9           
## ================================================================================
## Total params: 124644897 (475.48 MB)
## Trainable params: 124644897 (475.48 MB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________

# compile model
model %>% 
  compile(optimizer="adam", loss="binary_crossentropy", metrics="accuracy")

# model training
history <- model %>% 
  fit(
    imdb[[1]] %>% dataset_shuffle(10000) %>% dataset_batch(512),
    epochs = 4,  # for convergence, use larger number of epochs (e.g., 20+)
    validation_data = imdb[[2]] %>% dataset_batch(512), verbose = 0)

library(plotly)
# plot performance
pl_loss <- plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$loss,
        type = "scatter", mode="markers+lines", name="Loss") %>% 
  add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$val_loss,
        type = "scatter", mode="markers+lines", name="Validation Loss") %>%
  layout(title="DNN Training/Validation Performance", xaxis=list(title="epoch"),
         yaxis=list(title="Metric Value"), legend = list(orientation='h'),
         hovermode = "x unified")

pl_acc <- plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$accuracy,
        type = "scatter", mode="markers+lines", name="Accuracy") %>% 
  add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$val_accuracy,
        type = "scatter", mode="markers+lines", name="Validation Accuracy") %>%
  layout(title="DNN Training/Validation Performance", xaxis=list(title="epoch"),
         yaxis=list(title="Metric Value"), legend = list(orientation='h'),
         hovermode = "x unified")

subplot(pl_loss, pl_acc, nrows=2, shareX = TRUE, titleX = TRUE)

# model evaluation on testing data
model %>% 
  evaluate(imdb[[3]] %>% dataset_batch(512), verbose = 0)

##      loss  accuracy 
## 0.3188951 0.8667600

2 Image classification

Similar to the unstructured text-mining (film review case) we illustrated above, we can use DNN transfer learning for image classification.

2.1 Performance Metrics

2.1.1 Binary Cross-Entropy Measure

The cross-entropy measure of dissimilarity between two discrete probability distributions \(p\) (true state) and \(q\) (predicted state) with identical support \(X\) is defined as

\[H(p,q) = -\sum _{x_i\in X}{p(x_{i})\log q(x_{i})}.\] For binary outcomes, logistic regression minimizes the log-loss over all training observations, i.e., it optimizes the average cross-entropy in the sample.

For a sample indexed by \(n = 1, \cdots, N\), the expected (average) loss function is:

\[J(w) ={\frac{1}{N}} \sum _{n=1}^{N}H(p_{n},q_{n})\ =\ -{\frac {1}{N}}\sum _{n=1}^{N}\ {\bigg [}y_{n}\log {\hat {y}}_{n}+(1-y_{n})\log(1-{\hat {y}}_{n}){\bigg ]},\]

where \({\hat {y}}_{n}\equiv g(w \cdot x_{n})=\frac{1}{1+e^{-w \cdot x_{n}}}\) and \(g(z)\) is the logistic function. The logistic loss is also known as the cross-entropy loss or log-loss, and binary refers to the situation of binary outcome labels \(y_n \in \{0,1\}\), as required by the terms \(y_n\) and \(1-y_n\) in \(J(w)\).

Hence, the binary cross-entropy (BCE) is simply

\[H(p,q) = -\sum _{x_i\in X}{p(x_{i})\log q(x_{i})} = -y\log {\hat {y}}-(1-y)\log(1-{\hat {y}}),\] where \(p \in \{ y , 1 - y \}\) and \(q \in \{ \hat {y}, 1 - \hat {y} \}\) represent the probability of the true and predicted binary outcomes, respectively.

High BCE values indicate “bad” model performance and low BCE values indicate “good” model performance, with a perfect model having \(BCE\approx 0\).
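As a minimal sketch of the average binary cross-entropy defined above, the following base-R function evaluates \(J\) for 0/1 labels and predicted probabilities. The function name `binary_cross_entropy()`, the clipping constant `eps`, and the toy vectors are assumptions introduced for illustration. Clipping is a common numerical safeguard against \(\log(0)\), not part of the mathematical definition.

```r
# Hypothetical helper: average binary cross-entropy (log-loss)
binary_cross_entropy <- function(y, y_hat, eps = 1e-7) {
  y_hat <- pmin(pmax(y_hat, eps), 1 - eps)   # keep probabilities inside (0, 1)
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

y_true <- c(1, 0, 1, 0)              # toy binary labels
y_good <- c(0.9, 0.1, 0.8, 0.3)      # fairly confident, mostly correct predictions
y_bad  <- c(0.2, 0.7, 0.4, 0.9)      # mostly wrong predictions

bce_good    <- binary_cross_entropy(y_true, y_good)
bce_bad     <- binary_cross_entropy(y_true, y_bad)
bce_perfect <- binary_cross_entropy(y_true, y_true)

print(c(good = bce_good, bad = bce_bad, perfect = bce_perfect))
```

In this toy example, the better-calibrated predictions yield the lower loss, and the perfect predictions yield a loss near zero (not exactly zero because of the clipping).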

2.1.2 Dice Coefficient

The Sørensen–Dice coefficient (Dice Coefficient) is another measure to assess the similarity between two sets, samples, or distributions. In our case we are applying the dice coefficient to track the overlap between the true brain-tumor masks, and the DCNN-derived mask-estimate (prediction) of the tumor based on the raw brain image.

Discrete sets \(X\) and \(Y\): \(D=\frac{2 \|X\cap Y\|}{\|X\|+\|Y\|}\), where \(\|\cdot \|\) is the set cardinality.

(Boolean) binary data: \(D=\frac {2TP}{2TP+FP+FN}\), where TP=true positive, FP=false positive, FN=false negative.

Probabilities (e.g., quantiles): \(D=\frac {2\|{\bf{p}}\cdot {\bf {q}}\|}{\|{\bf{p}}\|^{2}+\|{\bf {q}}\|^{2}}\).
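The binary-data form of the Dice coefficient can be sketched in base R for a pair of small masks. The function `dice_coefficient()` and the toy matrices are hypothetical. Returning 1 when both masks are empty is a convention assumed here, since the formula is undefined in that case.

```r
# Hypothetical helper: Dice coefficient D = 2TP / (2TP + FP + FN) for binary masks
dice_coefficient <- function(truth, pred) {
  truth <- as.logical(truth)
  pred  <- as.logical(pred)
  tp <- sum(truth & pred)     # pixels labeled tumor in both masks
  fp <- sum(!truth & pred)    # predicted tumor, truly background
  fn <- sum(truth & !pred)    # missed tumor pixels
  if (2 * tp + fp + fn == 0) return(1)   # assumed convention: two empty masks agree
  2 * tp / (2 * tp + fp + fn)
}

# Toy 3x4 masks (1 = tumor, 0 = background)
toy_truth <- matrix(c(0, 1, 1, 0,
                      0, 1, 1, 0,
                      0, 0, 0, 0), nrow = 3, byrow = TRUE)
toy_pred  <- matrix(c(0, 1, 1, 1,
                      0, 1, 0, 0,
                      0, 0, 0, 0), nrow = 3, byrow = TRUE)

dice_toy <- dice_coefficient(toy_truth, toy_pred)
print(dice_toy)
```

Here TP=3, FP=1, and FN=1, so \(D = 6/8 = 0.75\), which agrees with the set form \(2\cdot 3/(4+4)\).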

2.2 `Torch` Deep Convolutional Neural Network (CNN)

The U-Net: Convolutional Networks for Biomedical Image Segmentation, shown in the image below, is an example of a DCNN.

[Figure: U-Net Architecture]

The U-shaped CNN (U-Net) consists of successive convolutional layers with max-pooling. During the encoding phase (left, downhill branch) the U-Net reduces the image resolution (downsampling), whereas during the subsequent decoding phase (right, uphill branch) it upsamples the images to arrive at an output of the same size as the original input. The information analysis (encoding) and synthesis (decoding) facilitate the labeling of each output image pixel by feeding each decoding layer information from the encoding layer with matching resolution.

Each upsampling (decoding) step concatenates the output from the previous layer with that from its counterpart in the compression (encoding) step. The final decoding output is a mask of the same size as the original image, derived by a \(1\times 1\)-convolution; no dense layer is required at the end because the output convolutional layer represents a single filter. Below we show how to load, train, and use a U-Net for transfer learning in 2D image segmentation. Note that this model has over \(31M\) trainable parameters. An R example of a U-Net model for input-output tensors of shape=c(128,128) is also available, see lines 73-183.
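The resolution bookkeeping described above can be sketched in base R. The function `unet_shapes()` is hypothetical and rests on stated assumptions: each encoding block halves the spatial size (2x2 max-pooling with padded convolutions), each decoding block doubles it, and the number of filters doubles with each halving. The defaults (4 blocks, 64 starting filters) follow the classical U-Net design and may differ from the defaults of any particular package.

```r
# Hypothetical helper: spatial size and channel count at each U-Net stage
unet_shapes <- function(input_size = 256, n_blocks = 4, n_filters = 64) {
  enc_levels <- 0:n_blocks                 # level n_blocks is the bottleneck
  enc <- data.frame(stage    = paste0("enc", enc_levels),
                    size     = input_size / 2^enc_levels,
                    channels = n_filters * 2^enc_levels)
  dec_levels <- rev(seq_len(n_blocks)) - 1 # n_blocks-1, ..., 0
  dec <- data.frame(stage    = paste0("dec", dec_levels),
                    size     = input_size / 2^dec_levels,
                    channels = n_filters * 2^dec_levels)
  rbind(enc, dec)
}

shapes <- unet_shapes(input_size = 256)
print(shapes)
```

Each decoding stage `dec`k has the same spatial size as the encoding stage `enc`k, which is what makes the skip concatenation possible, and the last stage recovers the input resolution, so a final \(1\times 1\)-convolution can emit one mask value per input pixel.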

# If necessary, download the U-Net package, before you load it into R
# remotes::install_github("r-tensorflow/unet")
library(tfdatasets)
library(tfds)
library(tfhub)
library(tfruns)
library(torch)
# torch::install_torch()

# remotes::install_github("r-tensorflow/unet")
library(unet)
library(tibble)

# The u-Net call takes additional parameters, e.g., number of downsizing blocks, number of filters to start with, 
# number of classes to identify; # ?unet provides details. For instance, we can specify the shape
# of the input images we will be segmenting tumors for: 256*256 3-channel RGB images.
model <- unet(input_shape = c(256, 256, 3))

# to print the model as text output, run:
# model
# Results: # Trainable params: 31,031,745

2.2.0.1 Data Import

Let’s first download and load in the Brain Tumor Imaging dataset. These data come from a 2019 study on Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm. The 2D brain MR images are paired with 2D tumor masks, which are trivial for controls and non-trivial for patients, using The Cancer Imaging Archive (TCIA). The data represent 110 patients with lower-grade glioma and include fluid-attenuated inversion recovery (FLAIR) MRI scans. There are 3-channels of the MRI data; pre-contrast, FLAIR, and post-contrast. The corresponding tumor masks were obtained by manual-delineations on the FLAIR images by board-certified radiologists.

# If you need to start a clean fresh run, remove all old files first! Be careful with this! set eval=T in all R-blocks!
##### First check > list.files("/data/")
##### do.call(file.remove, list(list.files("/data", full.names = TRUE)))
##### unlink("/data/*", recursive=TRUE, force=TRUE)

library(httr)
pathToZip <- tempfile()
pathToZip<-paste0(pathToZip,".zip")
# 
# 
url <- "https://umich.instructure.com/files/21813670/download?download_frd=1"
response <- GET(url)

content_type <- http_type(response)
print(content_type)


if (content_type == "application/zip" || content_type == "application/x-zip-compressed") {
  content <- content(response, "raw")
  writeBin(content, pathToZip)
} else {
  stop("Unexpected content type received.")
}

# download.file("https://umich.instructure.com/files/21813670/download?download_frd=1", pathToZip, mode = "wb")
zip::unzip(pathToZip, files=NULL, exdir = paste0(getwd(),'/data'))

library(tibble)
library(rsample)

train_dir <- file.path(getwd(),"data","data")
valid_dir <- file.path(getwd(),"data","mri_valid")

library(magick)   # Needed for TIFF --> PNG image conversion and other image processing tasks

Create the necessary directories to store the training and validation imaging data (brain MRIs and tumor masks).

# check if ReadMe file is accessible
# file.rename("/data/ReadMe_TCGA_MRI_Segmentation_Data_Phenotypes.txt", train_dir)

# Import the meta-data
# meta_data <- read.csv(paste0(getwd(),"//data//TCGA_MRI_Segmentation_Data_Phenotypes.csv"))


file_path <- file.path(getwd(), "data", "TCGA_MRI_Segmentation_Data_Phenotypes.csv")

# Read the CSV file
meta_data <- read.csv(file_path)

# note that these are relative file/directory names. To see the complete local path
# tempdir(); getwd()

# Create a validation folder
dir.create(valid_dir)

# Check all n=110 patients are accessible
patients <- list.dirs(train_dir, recursive = FALSE)
length(patients)

# Randomly select 20 Patients for validation, remaining 90=110-20 are for training the DNN model
valid_indices <- sample(1:length(patients), 20)
valid_indices
patients[valid_indices] # prints the actual folders where the validation participants' data is

# Extract and Relocate the Validation cases (separate them from training data)
for (i in valid_indices) {
  dir.create(file.path(valid_dir, basename(patients[i])))
  for (f in list.files(patients[i])) {    
    file.rename(file.path(train_dir, basename(patients[i]), f), file.path(valid_dir, basename(patients[i]), f))    
  }
  unlink(file.path(train_dir, basename(patients[i])), recursive = TRUE) # clean
}

# Confirm that only 90 patients are left in the training data folder
# list all training data imaging files: list.dirs(train_dir, recursive = FALSE)
length(list.dirs(train_dir, recursive = FALSE))

# and 20 validation cases are in the validation folder
length(list.dirs(valid_dir, recursive = FALSE))

# and check validation data
length(list.files(valid_dir, recursive = T)) # [1] 1268

Define data-frames containing the file-names for all training and validation data.

# Identify the TRAINING and VALIDATION data objects (raw images + tumor masks) as filenames
data_train <- tibble(
  img = grep(list.files(train_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),
        pattern = 'mask', invert = TRUE, value = TRUE),
  mask = grep(list.files(train_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),
        pattern = 'mask', value = TRUE)
)
data_valid <- tibble(
  img = grep(list.files(valid_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),
        pattern = 'mask', invert = TRUE, value = TRUE),
  mask = grep(list.files(valid_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),
        pattern = 'mask', value = TRUE)
)
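Because the image and mask columns are built from two separately filtered file listings, their row-wise correspondence depends on the sort order of the file names. The following base-R sketch shows one way to pair images and masks explicitly by name. The function `pair_images_with_masks()`, the `_mask` suffix convention, and the toy file names are assumptions made for illustration.

```r
# Hypothetical helper: pair each image with the mask named <image>_mask.<ext>
pair_images_with_masks <- function(files, mask_tag = "_mask") {
  is_mask <- grepl(mask_tag, files, fixed = TRUE)
  imgs  <- files[!is_mask]
  masks <- files[is_mask]
  # Expected mask name: insert the tag before the file extension
  expected_mask <- sub("(\\.[A-Za-z]+)$", paste0(mask_tag, "\\1"), imgs)
  data.frame(img = imgs, mask = masks[match(expected_mask, masks)],
             stringsAsFactors = FALSE)
}

# Toy listing in which slice 10 sorts differently among images and masks
toy_files <- c("p1/s_1.tif", "p1/s_10.tif", "p1/s_10_mask.tif",
               "p1/s_1_mask.tif", "p1/s_2.tif", "p1/s_2_mask.tif")

toy_pairs <- pair_images_with_masks(toy_files)
print(toy_pairs)
anyNA(toy_pairs$mask)   # TRUE would flag an image without a matching mask
```

A check of this kind can be run on any image-mask table before training to confirm that every row holds a matching pair.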

(Optionally) convert all 2D TIFF images to PNG RGB format! This may be necessary to ensure the input images are 3-channels, and are correctly interpreted as tensorflow objects.

print(grepl("\\.tif$", data_train$img))

# If all training + testing data are in one folder, split them by:
#  data <- initial_split(data_train, prop = 0.8)

# Helper: convert one TIFF file to PNG (same file name with a .png extension),
# for easier TF processing downstream
tif_to_png <- function(x) {
  a <- image_convert(image_read(x), format = "png")
  image_write(a, path = gsub("\\.tif$", ".png", x), format = "png")
}

# Convert all Training Data: TIFF images and masks
files_img_tif <- data_train$img[grepl("\\.tif$", data_train$img)]
data_train_img_png <- lapply(files_img_tif, tif_to_png)

files_mask_tif <- data_train$mask[grepl("\\.tif$", data_train$mask)]
data_train_mask_png <- lapply(files_mask_tif, tif_to_png)

# Similarly convert all Validation Data: TIFF images and masks
files_valid_img_tif <- data_valid$img[grepl("\\.tif$", data_valid$img)]
data_valid_img_png <- lapply(files_valid_img_tif, tif_to_png)

files_valid_mask_tif <- data_valid$mask[grepl("\\.tif$", data_valid$mask)]
data_valid_mask_png <- lapply(files_valid_mask_tif, tif_to_png)

# Check that the TIF --> PNG conversion worked, inspect the folder of one training case
head(list.files(dirname(data_train$img[1])))
# data_valid  # check root directory

# Inspect some of the images/masks
# image_info(image_read(data_train_img_png[[3]]))
# image_write(image_read(data_train$img[3]), format = "tiff")
# image_write(image_read(data_train$img[3]), path = paste0(data_train$img[3], ".png"), format = "png")
# a <- image_read(paste0(data_train$img[3], ".png"))

# list.files(train_dir) 
# To clean previous file references
# # delete a directory -- must add recursive = TRUE
# unlink("/data", recursive = TRUE); # Clean space # gc(full=T)

Derive a binary class label - cancer (for non-trivial tumor masks) or control (for empty tumor masks).

# Compute a new binary outcome variable 1=Brain Tumor (mask has at least 1 white pixel), 0=Normal Brain, no white pixels in the mask
pos_neg_diagnosis <- sapply(data_train$mask,
     function(x) {   value = max(imager::magick2cimg(image_read(x)))
         ifelse (value > 0, 1, 0)  }
  )
table(pos_neg_diagnosis)   #; head(data_train)

## pos_neg_diagnosis
##    0    1 
## 2164 1153

# pos_neg_diagnosis
#    0    1 
# 2046 1103 

# Add the normal vs. cancer label to the training and validation datasets
data_train$label <- pos_neg_diagnosis

pos_neg_diagnosis_valid <- sapply(data_valid$mask,
     function(x) {   value = max(imager::magick2cimg(image_read(x)))
         ifelse (value > 0, 1, 0)  }
  )
table(pos_neg_diagnosis_valid)

## pos_neg_diagnosis_valid
##   0   1 
## 392 220

data_valid$label <- pos_neg_diagnosis_valid
# head(data_valid)

2.2.0.2 Torch-based Transfer Learning

Next we will ingest the 3-channel (RGB) imaging data and the corresponding tumor masks (binary images) for each participant. The method torch::dataset() allows specifying initialize() and .getitem() methods for complex computable data objects. The first method, initialize(), creates the archive of imaging and mask file names, which the second method, .getitem(), utilizes for iterating over all cases. The method .getitem() returns ordered input-output pairs and performs weighted sampling, with preference for images with large lesions, which helps account for imbalanced classes during DNN training.
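The idea of weighted sampling that favors large-lesion images can be sketched in base R without any deep learning package. The lesion fractions, the additive constant `0.01`, and all object names below are hypothetical. The weighting rule is only one plausible choice and is not claimed to be the rule used by the actual dataset definition.

```r
set.seed(1)

# Hypothetical fraction of tumor pixels in each of five masks (0 = empty mask)
lesion_fraction <- c(0, 0, 0.01, 0.05, 0.20)

# Sampling weights: proportional to lesion size plus a small constant,
# so that empty-mask images can still be drawn occasionally
raw_weights <- lesion_fraction + 0.01
weights <- raw_weights / sum(raw_weights)

# Draw 10,000 image indices with replacement according to the weights
draws <- sample(seq_along(weights), size = 10000, replace = TRUE, prob = weights)
freq <- table(factor(draws, levels = seq_along(weights))) / length(draws)

print(round(weights, 3))
print(round(freq, 3))
```

The empirical sampling frequencies track the weights, so the image with the largest lesion is drawn far more often than the images with empty masks.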

The training sets can be enhanced by data augmentation – a process expanding the set of training images and masks via operations such as flipping, resizing, and rotating based on certain specifications.
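A key point of augmentation for segmentation is that the image and its mask must undergo the same geometric transformation. The base-R sketch below illustrates this with plain matrices. The helpers `flip_horizontal()` and `rotate90()` and the toy image are hypothetical stand-ins for the tensor operations a deep learning library would provide.

```r
# Hypothetical helpers operating on 2D matrices (rows = height, cols = width)
flip_horizontal <- function(m) m[, ncol(m):1, drop = FALSE]    # mirror left-right
rotate90 <- function(m) t(m[nrow(m):1, , drop = FALSE])        # rotate clockwise

# Toy 3x4 image with one bright "lesion" pixel and its binary mask
toy_img <- matrix(0, nrow = 3, ncol = 4)
toy_img[2, 1] <- 255
toy_mask <- (toy_img > 0) * 1

# Apply the SAME transformation to both the image and the mask
img_flip  <- flip_horizontal(toy_img)
mask_flip <- flip_horizontal(toy_mask)
img_rot   <- rotate90(toy_img)
mask_rot  <- rotate90(toy_mask)

print(mask_flip)
print(mask_rot)
```

After each transformation, the nonzero pixels of the mask still coincide with the lesion pixels of the image, and the lesion area is unchanged.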

Below we use PyTorch to define a brain_dataset method providing a larger augmented training dataset, new size length(train_ds) ~ 2K, and a larger validation set, new size length(valid_ds)~1K. In practice, we can use any alternative transfer-learning strategy including pytorch, tensorflow, theano, etc.

Note that U-Net training takes significant computational time; training for 20 epochs took a total of about 600 compute hours, which translates into a couple of days of computing on a 20-core server. We have provided several precomputed/pre-trained *.pt models on Canvas.

Next, after completing this Part 2, go to Part 3 of the DSPA2 Chapter 14 (Deep Learning, Neural Networks) and finally Part 4 of the DSPA2 Chapter 14 (Deep Learning, Neural Networks), which cover the Torch and Tensorflow Image Pre-processing Image Classification Pipelines.

DSPA2: Data Science and Predictive Analytics (UMich HS650)

Deep Learning, Neural Networks (Part 2)

SOCR/MIDAS (Ivo Dinov)

May 2026

1 Transfer Learning

1.1 Deep Network Transfer Learning in Text Classification

1.1.1 Binary Transfer Learning Label-Classification of Clinical Text

1.1.1.1 Define a fresh new `model1` de novo

1.1.1.2 Naive - out-of-the-box prior-model assessment (without retraining)

1.1.1.3 Simple Transfer Learning

1.1.1.4 Full-scale Transfer learning

1.1.2 Multinomial Transfer Learning classification of Clinical Text

1.1.3 Binary Classification of Film Reviews

2 Image classification

2.1 Performance Metrics

2.1.1 Binary Cross-Entropy Measure

2.1.2 Dice Coefficient

2.2 `Torch` Deep Convolutional Neural Network (CNN)

2.2.0.1 Data Import

2.2.0.2 Torch-based Transfer Learning
