#' ---	
#' title: "DSPA2: Data Science and Predictive Analytics (UMich HS650)"	
#' subtitle: "
Deep Learning, Neural Networks
"	
#' author: "SOCR/MIDAS (Ivo Dinov)
"	
#' date: "`r format(Sys.time(), '%B %Y')`"	
#' tags: [DSPA, SOCR, MIDAS, Big Data, Predictive Analytics] 	
#' output:	
#'   html_document:	
#'     theme: spacelab	
#'     highlight: tango	
#'     includes:	
#'       before_body: SOCR_header.html	
#'     toc: true	
#'     number_sections: true	
#'     toc_depth: 4	
#'     toc_float:	
#'       collapsed: false	
#'       smooth_scroll: true	
#'     code_folding: show	
#'     self_contained: yes	
#' ---	
#' 	
#' 	
#' # Deep Learning Neural Networks	
#' 	
#' [Deep learning](https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_%0Anetwork_architectures) is a special branch of machine learning using a collage of algorithms to model high-level motifs in data. Deep learning resembles the biological communications between brain neurons in the central nervous system (CNS), where synthetic graphs represent the CNS network as nodes/states and connections/edges between them. For instance, in a simple synthetic network consisting of a pair of connected nodes, an output sent by one node is received by the other as an input signal. When more nodes are present in the network, they may be arranged in multiple levels (like a multiscale object) where the $i^{th}$ layer output serves as the input of the next $(i+1)^{st}$ layer. The signal is manipulated at each layer and sent as a layer output downstream and interpreted as an input to the next, $(i+1)^{st}$ layer, and so forth. Deep learning relies on multiple layers of nodes and many edges linking the nodes forming input/output (I/O) layered grids representing a multiscale processing network. At each layer, linear and non-linear transformations are converting inputs into outputs.	
#' 	
#' In this chapter, we explore the R-based deep neural network learning and demonstrate state-of-the-art deep learning models utilizing CPU and GPU for fast training (learning) and testing (validation). Other powerful deep learning frameworks include *TensorFlow*, *Theano*, *Caffe*, *Torch*, *CNTK* and *Keras*.	
#' 	
#' *Neural Networks vs. Deep Learning*: Deep Learning is a machine learning strategy that learns a deep multi-level hierarchical representation of the affinities and motifs in the dataset. Machine learning Neural Nets tend to use shallower network models. Although there are no formal restrictions on the depth of the layers in a Neural Net, few layers are commonly utilized. Recent methodological, algorithmic, computational, infrastructure and service advances overcome previous limitations. In addition, the rise of *Big Data* accelerated the evolution of *classical Neural Nets* to *Deep Neural Nets*, which can now handle lots of layers and many hidden nodes per layer. The former is a precursor to the latter, however, there are also *non-neural* deep learning techniques. For example, *syntactic pattern recognition methods* and *grammar induction discover hierarchies*.	
#' 	
#' # Deep Learning Training	
#' 	
#' Review [Chapter 6 (Black Box Machine-Learning Methods: Neural Networks, Support Vector Machines, and Random Forests)](https://socr.umich.edu/DSPA2/DSPA2_notes/06_ML_NN_SVM_RF_Class.html) prior to proceeding.	
#' 	
#' ## Perceptrons	
#' 	
#' A **perceptron** is an artificial analogue of a neuronal brain cell that calculates a *weighted sum of the input values* and *outputs a thresholded version of that result*. For a bivariate perceptron, $P$, let’s denote the weights of the two inputs, $(X,Y)$ by $A$ and $B$, respectively. Then, the weighted sum could be represented as: $$W = A X + B Y.$$	
#' 	
#' At each layer $l$, the weight matrix, $W^{(l)}$, has the following properties:	
#' 	
#'  - The number of rows of $W^{(l)}$ equals the number of nodes/units in the previous $(l-1)^{st}$ layer, and 	
#'  - The number of columns of $W^{(l)}$ equals the number of units in the next $(l+1)^{st}$ layer. 	
#' 	
#' Neuronal cells fire depending on the presynaptic inputs to the cell which causes constant fluctuations of the neuronal membrane - depolarizing or hyperpolarizing, i.e., the cell membrane potential rises or falls. Similarly, perceptrons rely on thresholding of the weight-averaged input signal, which for biological cells corresponds to voltage increases passing a critical threshold. Perceptrons output non-zero values only when the weighted sum exceeds a certain threshold $C$. In terms of its input vector, $(X,Y)$, we can describe the output of each perceptron ($P$) by:	
#' 	
#' $$Output(P) = 	
#' \left\{	
#' \begin{array} {}	
#'     1, & if\ A X + B Y > C \\          	
#'     0, & if\ A X + B Y \leq C	
#' \end{array}	
#' \right. .$$	
#' 	
#' Feed-forward networks are constructed as layers of perceptrons where the first layer ingests the inputs and the last layer generates the network outputs. The intermediate (internal) layers are not directly connected to the external world, and are called hidden layers. In *fully connected networks*, each perceptron in one layer is connected to every perceptron on the next layer enabling information "fed forward" from one layer to the next. There are no connections between perceptrons in the same layer. 	
#' 	
#' Multilayer perceptrons (fully-connected feed-forward neural networks) consist of several fully-connected layers representing an `input matrix` $X_{n, m}$ and a generated `output matrix` $Y_{n, k}$. The input $X_{n,m}$ is a matrix encoding the $n$ cases and $m$ features per case. The weight matrix $W_{m,k}^{(l)}$ for layer $l$ has rows ($i$) corresponding to the weights leading from all the units $i$ in the previous layer to all of the units $j$ in the current layer.	
#' 	
#' The hidden size parameter $k$, the weight matrix $W_{m , k}$, and the bias vector $b_{k}$ are used to compute the outputs at each layer:	
#' 	
#' $$Y_{n, k}^{(l)} =f_k^{(l)}\left ( X_{n, m}^{(l)} W_{m , k}^{(l)} +b_{k}^{(l)} \right ).$$	
#' 	
#' The role of the bias parameter is similar to the intercept term in linear regression and helps improve the accuracy of prediction by shifting the decision boundary along the $Y$ axis. The outputs are fully-connected layers that feed into an `activation layer` to perform element-wise operations. Examples of **activation functions** that transform real numbers to probability-like values include:	
#' 	
#'  - the [sigmoid function]( https://en.wikipedia.org/wiki/Sigmoid_function), a special case of the [logistic function](https://en.wikipedia.org/wiki/Logistic_function), which converts real numbers to probabilities, 	
#'  - the rectifier (`relu`, [Rectified Linear Unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))) function, which outputs the $\max(0, input)$,	
#'  - the [tanh (hyperbolic tangent function)](https://en.wikipedia.org/wiki/Hyperbolic_function#Hyperbolic_tangent).	
#' 	
#' 	
#' 	
library(plotly)	
sigmoid <- function(x) { 1 / (1 + exp(-x)) }	
relu <- function(y) { pmin(pmax(0, y), 1) }	
x_Llimit <- -5; x_Rlimit <- 5; y_Llimit <- 0; y_Rlimit <- 1	
x <- seq(x_Llimit, x_Rlimit, 0.01)	
# plot(c(x_Llimit,x_Rlimit), c(y_Llimit,y_Rlimit), type="n", xlab="Input/Domain", ylab="Output/Range" ) 	
# lines(x, sigmoid(x), col="blue", lwd=3)	
# lines(x, (1/2)*(tanh(x)+1), col="green", lwd=3)	
# lines(x, relu(x), col="red", lwd=3)	
# legend("left", title="Probability Transform Functions",	
#        c("sigmoid","tanh","relu"), fill=c("blue", "green", "red"),	
#       ncol = 1, cex = 0.75)	
y <- sigmoid(x)	
	
plot_ly(type="scatter")  %>%	
  add_trace(x = ~x, y=~y, name = 'Sigmoid()', mode = 'lines') %>%	
  add_trace(x = ~x, y = ~(1/2)*(tanh(x)+1), name = 'Tanh()', mode = 'lines') %>%	
  add_trace(x = ~x, y = ~relu(x+1/2), name = 'Relu()', mode = 'lines')%>%	
  layout(legend = list(orientation = 'h'), title="Probability Transformation Functions")	
	
#' 	
#' 	
#' The final fully-connected layer may be hidden of a size equal to the number of classes in the dataset and may be followed by a `softmax` layer mapping the input into a probability score. For example, if a size ${n\times m}$ input is denoted by $X_{n\times m}$, then the probability scores may be obtained by the `softmax` transformation function, which maps real valued vectors to vectors of probabilities:	
#' 	
#' $$\left ( \frac{e^{x_{i,1}}}{\displaystyle\sum_{j=1}^m e^{x_{i,j}}},\ldots, \frac{e^{x_{i,m}}}{\displaystyle\sum_{j=1}^m e^{x_{i,j}}}\right ).$$	
#' 	
#' Below is a schematic of a fully-connected feed-forward neural network of nodes 	
#' $$ \{ a_{j=node\ index, l=layer\ index} \}_{j=1, l=1}^{n_j, 4}.$$	
#' 	
#' 	
# install.packages("GGally"); install.packages("sna"); install.packages("network")	
# library("GGally"); library(network); library(sna); library(ggplot2)	
	
library('igraph')	
#define node names	
nodes <- c('a(1,1)','a(2,1)','a(3,1)','a(4,1)', 'a(5,1)', 'a(6,1)', 'a(7,1)',	
           'a(1,2)','a(2,2)','a(3,2)','a(4,2)', 'a(5,2)',	
           'a(1,3)','a(2,3)','a(3,3)',	
           'a(1,4)','a(2,4)'	
           )	
# define node x,y coordinates	
x <- c(rep(0,7), rep(2,5), rep(4,3), rep(6,2)	
       )	
y <- c(1:7, 2:6, 3:5, 4:5)	
# x;y  # print x & y coordinates of all nodes	
	
#define edges	
from <- c('a(1,1)','a(2,1)','a(3,1)','a(4,1)', 'a(5,1)', 'a(6,1)', 'a(7,1)',	
          'a(1,1)','a(2,1)','a(3,1)','a(4,1)', 'a(5,1)', 'a(6,1)', 'a(7,1)',	
          'a(1,1)','a(2,1)','a(3,1)','a(4,1)', 'a(5,1)', 'a(6,1)', 'a(7,1)',	
          'a(1,1)','a(2,1)','a(3,1)','a(4,1)', 'a(5,1)', 'a(6,1)', 'a(7,1)',	
          'a(1,1)','a(2,1)','a(3,1)','a(4,1)', 'a(5,1)', 'a(6,1)', 'a(7,1)',	
          'a(1,2)','a(2,2)','a(3,2)','a(4,2)', 'a(5,2)',	
          'a(1,2)','a(2,2)','a(3,2)','a(4,2)', 'a(5,2)',	
          'a(1,2)','a(2,2)','a(3,2)','a(4,2)', 'a(5,2)',	
          'a(1,3)','a(2,3)','a(3,3)',	
          'a(1,3)','a(2,3)','a(3,3)'	
          )	
to <- c('a(1,2)','a(1,2)','a(1,2)','a(1,2)', 'a(1,2)', 'a(1,2)', 'a(1,2)',	
        'a(2,2)','a(2,2)','a(2,2)','a(2,2)', 'a(2,2)', 'a(2,2)', 'a(2,2)',	
        'a(3,2)','a(3,2)','a(3,2)','a(3,2)', 'a(3,2)', 'a(3,2)', 'a(3,2)',	
        'a(4,2)','a(4,2)','a(4,2)','a(4,2)', 'a(4,2)', 'a(4,2)', 'a(4,2)',	
        'a(5,2)','a(5,2)','a(5,2)','a(5,2)', 'a(5,2)', 'a(5,2)', 'a(5,2)',	
        'a(1,3)','a(1,3)','a(1,3)','a(1,3)', 'a(1,3)',	
        'a(2,3)','a(2,3)','a(2,3)','a(2,3)', 'a(2,3)',	
        'a(3,3)','a(3,3)','a(3,3)','a(3,3)', 'a(3,3)',	
        'a(1,4)','a(1,4)','a(1,4)',	
        'a(2,4)','a(2,4)', 'a(2,4)'	
        )	
edge_names <- c("w(i=1,k=1,l=2)","","","", "", "", "",	
        "","","","", "", "", "",	
        "","","","w(i=4,k=3,l=2)", "", "", "",	
        "","","","", "", "", "",	
        "","","","", "", "", "w(i=5,k=7,l=2)",	
        "","","","", "",	
        "","","","", "",	
        "","w(i=2,k=3,l=3)","", "", "",	
        "w(i=1,k=1,l=4)","","",	
        "","", "w(i=2,k=3,l=4)"	
        )	
  	
NodeList <- data.frame(nodes, x ,y)	
EdgeList <- data.frame(from, to, edge_names)	
nn.graph <- graph_from_data_frame(vertices = NodeList, d= EdgeList, directed = TRUE) %>% set_edge_attr("label", value = edge_names)	
# plot(nn.graph)	
#' 	
#' 	
#' 	
# igraph examples: http://michael.hahsler.net/SMU/LearnROnYourOwn/code/igraph.html	
map <- function(x, range = c(0,1), from.range=NA) {	
    if(any(is.na(from.range))) from.range <- range(x, na.rm=TRUE)	
    	
    ## check if all values are the same	
    if(!diff(from.range)) return(	
	    matrix(mean(range), ncol=ncol(x), nrow=nrow(x), 	
		    dimnames = dimnames(x)))	
    	
    ## map to [0,1]	
    x <- (x-from.range[1])	
    x <- x/diff(from.range)	
    ## handle single values	
    if(diff(from.range) == 0) x <- 0 	
    	
    ## map from [0,1] to [range]	
    if (range[1]>range[2]) x <- 1-x	
    x <- x*(abs(diff(range))) + min(range)	
    x[xmax(range)] <- NA	
    	
    return(x)	
}	
	
col <- rep("gray",length(V(nn.graph))) # input layer	
col[c(1:7, 16:17)] <- "lightblue"      # hidden layer	
col[16:17] <- "lightgreen"             # output layer	
plot(nn.graph, vertex.color=col, vertex.size=map(betweenness(nn.graph),c(25,30)), ylab="Input Layer", xlab="A schematic of fully-connected feed-forward neural network")	
	
# plot(nn.graph, vertex.size=map(betweenness(nn.graph),c(20,30)), edge.width=map(edge.betweenness(nn.graph), c(1,5)))	
#' 	
#' 	
#' 	
col <- rep("gray",length(V(nn.graph))) # input layer	
col[c(1:7, 16:17)] <- "lightblue"      # hidden layer	
col[16:17] <- "lightgreen"             # output layer	
plot(nn.graph, vertex.color=col)	
title(main="A schematic of fully-connected feed-forward neural network")	
mtext("Input Layer (l=1)", side=2, line=-1, col="blue")	
mtext("     Output Layer (l=4)", side=4, line=-2, col="green")	
mtext("Hidden Layers (l=2,3)", side=1, line=0, col="gray")	
#' 	
#' 	
#' The plot above illustrates the key elements in the action potential, or activation function, calculations and the corresponding training parameters:	
#' 	
#' $${a}_{\textrm{node}=k,\textrm{layer}=l}=f\left(\sum \limits_i{w}_{k,i}^{(l)}\times {a}_i^{(l-1)}+{b}_k^{(l)}\right),$$	
#' where:	
#' 	
#'  - $f$ is the *activation function*, e.g., [logistic function](https://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/17_RegularizedLinModel_KnockoffFilter.html) $f(x) = \frac{1}{1+e^{-x}}$. It converts the aggregate weights at each node to probability values,	
#'  - $w_{k,i}^l$ is the weight carried from the $i^{th}$ element of the $(l-1)^{th}$ layer to the $k^{th}$ element of the current $l^{th}$ layer,	
#'  - $b_{k}^l$ is the (residual) bias present in the $k^{th}$ element in the $l^{th}$ layer. This is the information not explained by the training model.	
#' 	
#' These parameters may be estimated using different techniques (e.g., using least squares, or stochastically using steepest descent methods) based on the training data.	
#' 	
#' # Biological Relevance	
#' 	
#' There are parallels between biology (neuronal cells) and the mathematical models (perceptrons) for neural network representation. The human brain contains about $10^{11}$ neuronal cells connected by approximately $10^{15}$ synapses forming the basis of our functional phenotypes. The schematic below illustrates some of the parallels between brain biology and the mathematical representation using synthetic neural nets. Every neuronal cell receives multi-channel (afferent) input from its dendrites, generates output signals and disseminates the results via its (efferent) axonal and synaptic connections to dendrites of other neurons. 	
#' 	
#' The perceptron is a mathematical model of a neuronal cell that allows us to explicitly determine algorithmic and computational protocols for transforming input signals into output actions. For instance, a signal arriving through an axon $x_0$ is modulated by some prior weight, e.g., synaptic strength, $w_0\times x_0$. Internally, within the neuronal cell, this input is aggregated (summed, or weight-averaged) with inputs from all other axons. Brain plasticity suggests that `synaptic strengths` (weight coefficients $w$) are enhanced by training and prior experience. This learning process controls the direction and influence of neurons on other neurons. Either excitatory ($w>0$) or inhibitory ($w\leq 0$) influences are possible. Dendrites and axons carry signals to and from neurons, where the aggregate responses are computed and transmitted downstream. Neuronal cells only fire if action potentials exceed a certain threshold. In this situation, a signal is transmitted downstream through its axons. The neuron remains silent, if the summed signal is below the critical threshold. 	
#' 	
#' Timing of events is important in biological networks. In the computational perceptron model, a first order approximation may ignore the timing of neuronal firing (spike events) and only focus on the frequency of the firing. The firing rate of a neuron with an activation function $f$ represents the frequency of the spikes along the axon. We saw some examples of activation functions earlier.	
#' 	
#' The diagram below illustrates the parallels between the brain network-synaptic organization and an artificial synthetic neural network.	
#' 	
#' [](https://wiki.socr.umich.edu/images/a/a6/DSPA_22_DeepLearning_Fig1.png)	
#' 	
#' 	
#' # Simple Neural Net Examples 	
#' 	
#' Before we look at examples of deep learning algorithms applied to model observed natural phenomena, we will develop a couple of simple networks for computing fundamental Boolean operations.	
#' 	
#' ## Exclusive OR (XOR) Operator	
#' The [exclusive OR (XOR) operator](https://en.wikipedia.org/wiki/Exclusive_or) works as a bivariate binary-outcome function, mapping pairs of false (0) and true (1) values to dichotomous false (0) and true (1) outcomes.	
#' 	
#' We can design a simple two-layer neural network that calculates `XOR`. The **values within each neuron represent its explicit threshold**, which can be normalized so that all neurons utilize the same threshold, typically $1$. The **value labels associated with network connections (edges) represent the weights of the inputs**. When the threshold is not reached, output is $0$, and when the threshold is reached, the output is $1$.	
#' 	
#' 	
# install.packages("network")	
library('igraph')	
#define node names	
nodes <- c('X', 'Y',                            # Inputs (0,1)	
           'H_1_1_th1','H_1_2_th2','H_1_3_th1', # Hidden Layer	
           'Z'                                  # Output Layer (0,1)	
           )	
# define node x,y coordinates	
x <- c(rep(0,2), rep(2,3), 4)	
y <- c(c(1.5, 2.5), 1:3, 2)	
##x;y  # print x & y coordinates of all nodes	
	
#define edges	
from <- c('X', 'X', 	
          'Y', 'Y',	
          'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',	
          'Z'	
          )	
to <- c('H_1_1_th1', 'H_1_2_th2',	
        'H_1_2_th2', 'H_1_3_th1', 	
        'Z', 'Z', 'Z', 'Z'	
        )	
edge_names <- c(	
        '1', '1',	
        '1', '1',	
        '1', '-2', '1', '1'	
        )	
  	
NodeList <- data.frame(nodes, x ,y)	
EdgeList <- data.frame(from, to, edge_names)	
nn.graph <- graph_from_data_frame(vertices = NodeList, d= EdgeList, directed = TRUE) %>% set_edge_attr("label", value = edge_names)	
# plot(nn.graph)	
	
# See the iGraph specs here: http://kateto.net/networks-r-igraph	
	
col <- rep("gray",length(V(nn.graph))) # input layer	
col[c(3:5)] <- "lightblue"             # hidden layer	
col[6] <- "lightgreen"                 # output layer	
plot(nn.graph, vertex.color=col, vertex.shape="sphere", vertex.size=c(35, 35,   80, 80, 80, 35), edge.label.cex=1.5)	
title(main="XOR Operator")	
mtext("Input Layer: Bivariate (0,1)", side=2, line=-1, col="blue")	
mtext("     Output Layer (XOR)", side=4, line=-2, col="green")	
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")	
#' 	
#' 	
#' Let's work out manually the 4 possibilities:	
#' 	
#' InputX|InputY|XOR Output(Z)    	
#' ------|------|-------------   	
#' 0 | 0 |	0	
#' 0 | 1 | 1	
#' 1 | 0 | 1	
#' 1 | 1 | 0	
#' 	
#' We can validate that this network indeed represents an XOR operator by plugging in all four possible input combinations and confirming the expected results at the end.	
#' 	
#' 	
par(mfrow=c(2,2))    # 2 x 2 design	
nodes <- c('X=0', 'Y=0',                        # Inputs 	
           'H_1_1_th1','H_1_2_th2','H_1_3_th1', # Hidden Layer	
           'Z=0'                                # Output Layer (0,1)	
           )	
# define node x,y coordinates	
x <- c(rep(0,2), rep(2,3), 4)	
y <- c(c(1.5, 2.5), 1:3, 2)	
# x;y  # print x & y coordinates of all nodes	
	
######### (0,0)	
#define edges	
from <- c('X=0', 'X=0', 	
          'Y=0', 'Y=0',	
          'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',	
          'Z=0'	
          )	
to <- c('H_1_1_th1', 'H_1_2_th2',	
        'H_1_2_th2', 'H_1_3_th1', 	
        'Z=0', 'Z=0', 'Z=0', 'Z=0'	
        )	
edge_names <- c(	
        '1', '1',	
        '1', '1',	
        '1', '-2', '1', '1'	
        )	
  	
NodeList <- data.frame(nodes, x ,y)	
EdgeList <- data.frame(from, to, edge_names)	
nn.graph <- graph_from_data_frame(vertices = NodeList, d= EdgeList, directed = TRUE) %>% set_edge_attr("label", value = edge_names)	
	
col <- rep("gray",length(V(nn.graph))) # input layer	
col[c(3:5)] <- "lightblue"             # hidden layer	
col[6] <- "lightgreen"                 # output layer	
plot(nn.graph, vertex.color=col, vertex.shape="pie", vertex.size=c(30, 30,   40, 40, 40, 30), edge.label.cex=2,	
     edge.arrow.size=0.25)	
title(main="XOR Operator")	
mtext("Input Layer: Bivariate (X=0,Y=0)", side=2,line=-1, col="blue")	
mtext("     Output Layer (XOR), Z=0", side=4, line=-2, col="green")	
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")	
	
######### (0,1)	
nodes <- c('X=0', 'Y=1',                        # Inputs 	
           'H_1_1_th1','H_1_2_th2','H_1_3_th1', # Hidden Layer	
           'Z=1'                                # Output Layer (0,1)	
           )	
#define edges	
from <- c('X=0', 'X=0', 	
          'Y=1', 'Y=1',	
          'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',	
          'Z=1'	
          )	
to <- c('H_1_1_th1', 'H_1_2_th2',	
        'H_1_2_th2', 'H_1_3_th1', 	
        'Z=1', 'Z=1', 'Z=1',	
        'Z=1'	
        )	
	
NodeList <- data.frame(nodes, x ,y)	
EdgeList <- data.frame(from, to, edge_names)	
nn.graph <- graph_from_data_frame(vertices = NodeList, d= EdgeList, directed = TRUE) %>% set_edge_attr("label", value = edge_names)	
	
col <- rep("gray",length(V(nn.graph))) # input layer	
col[c(3:5)] <- "lightblue"             # hidden layer	
col[6] <- "lightgreen"                 # output layer	
plot(nn.graph, vertex.color=col, vertex.shape="pie", vertex.size=c(30, 30,   40, 40, 40, 30), edge.label.cex=1,	
     edge.arrow.size=0.25)	
title(main="XOR Operator")	
mtext("Input Layer: Bivariate (X=0,Y=1)",side=2,line=-1, col="blue")	
mtext("     Output Layer (XOR), Z=1", side=4, line=-2, col="green")	
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")	
	
######### (1,0)	
nodes <- c('X=1', 'Y=0',                        # Inputs 	
           'H_1_1_th1','H_1_2_th2','H_1_3_th1', # Hidden Layer	
           'Z=1'                                # Output Layer (0,1)	
           )	
#define edges	
from <- c('X=1', 'X=1', 	
          'Y=0', 'Y=0',	
          'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',	
          'Z=1'	
          )	
to <- c('H_1_1_th1', 'H_1_2_th2',	
        'H_1_2_th2', 'H_1_3_th1', 	
        'Z=1', 'Z=1', 'Z=1',	
        'Z=1'	
        )	
	
NodeList <- data.frame(nodes, x ,y)	
EdgeList <- data.frame(from, to, edge_names)	
nn.graph <- graph_from_data_frame(vertices = NodeList, d= EdgeList, directed = TRUE) %>% set_edge_attr("label", value = edge_names)	
	
col <- rep("gray",length(V(nn.graph))) # input layer	
col[c(3:5)] <- "lightblue"             # hidden layer	
col[6] <- "lightgreen"                 # output layer	
plot(nn.graph, vertex.color=col, vertex.shape="pie", vertex.size=c(30, 30,   40, 40, 40, 30), edge.label.cex=1,	
     edge.arrow.size=0.25)	
title(main="XOR Operator")	
mtext("Input Layer: Bivariate (X=1,Y=0)", side=2,line=-1, col="blue")	
mtext("     Output Layer (XOR), Z=1", side=4, line=-2, col="green")	
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")	
	
######### (1,1)	
nodes <- c('X=1', 'Y=1',                        # Inputs 	
           'H_1_1_th1','H_1_2_th2','H_1_3_th1', # Hidden Layer	
           'Z=0'                                # Output Layer (1,1)	
           )	
#define edges	
from <- c('X=1', 'X=1', 	
          'Y=1', 'Y=1',	
          'H_1_1_th1', 'H_1_2_th2', 'H_1_3_th1',	
          'Z=0'	
          )	
to <- c('H_1_1_th1', 'H_1_2_th2',	
        'H_1_2_th2', 'H_1_3_th1', 	
        'Z=0', 'Z=0', 'Z=0',	
        'Z=0'	
        )	
	
NodeList <- data.frame(nodes, x ,y)	
EdgeList <- data.frame(from, to, edge_names)	
nn.graph <- graph_from_data_frame(vertices = NodeList, d= EdgeList, directed = TRUE) %>% set_edge_attr("label", value = edge_names)	
	
col <- rep("gray",length(V(nn.graph))) # input layer	
col[c(3:5)] <- "lightblue"             # hidden layer	
col[6] <- "lightgreen"                 # output layer	
plot(nn.graph, vertex.color=col, vertex.shape="pie", vertex.size=c(30, 30,   40, 40, 40, 30), edge.label.cex=1.5,	
     edge.arrow.size=0.25)	
title(main="XOR Operator")	
mtext("Input Layer: Bivariate (X=1,Y=1)",side=2,line=-1, col="blue")	
mtext("     Output Layer (XOR), Z=0", side=4, line=-2, col="green")	
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")	
#' 	
#' 	
#' 	
#' ## NAND Operator	
#' Another binary operator is `NAND` ([negative AND, Sheffer stroke](https://en.wikipedia.org/wiki/NAND_gate)) that produces a false (0) output if and only if both of its operands are true (1), and generates true (1), otherwise. Below is the `NAND` input-output table.	
#' 	
#' InputX|InputY|NAND Output(Z)	
#' ------|------|------------- 	
#' 0 | 0 |	1 	
#' 0 | 1 | 1	
#' 1 | 0 | 1	
#' 1 | 1 | 0	
#' 	
#' Similar to the `XOR` operator, we can also design a one-layer neural network that calculates `NAND`. The **values within each neurons represent its explicit threshold**, which can be normalized so that all neurons utilize the same threshold, typically $1$. The **value labels associated with network connections (edges) represent the weights of the inputs**. When the threshold is not reached, the output is trivial ($0$) and when the threshold is reached, the output is correspondingly $1$. Here is a shorthand analytic expression for the `NAND` calculation:	
#' 	
#' $$NAND(X,Y) = 1.3 - (1\times X + 1\times Y).$$	
#' Check that $NAND(X,Y)=0$ if and only if $X=1$ and $Y=1$, otherwise it equals $1$.	
#' 	
#' 	
# install.packages("network")	
library('igraph')	
#define node names	
nodes <- c('X', 'Y',                            # Inputs (0,1)	
           'H_th=1.3-(X+Y)',            # Hidden Layer	
           'Z'                                  # Output Layer (0,1)	
           )	
# define node x,y coordinates	
x <- c(rep(0,2), 2, 4)	
y <- c(c(1.5, 2.5), 2, 2)	
#x;y  # print x & y coordinates of all nodes	
	
#define edges	
from <- c('X', 'Y', 	
          'H_th=1.3-(X+Y)'	
          )	
to <- c('H_th=1.3-(X+Y)', 'H_th=1.3-(X+Y)',	
        'Z'	
        )	
edge_names <- c(	
        '-1', '-1',	
        '1'	
        )	
  	
NodeList <- data.frame(nodes, x ,y)	
EdgeList <- data.frame(from, to, edge_names)	
nn.graph <- graph_from_data_frame(vertices = NodeList, d= EdgeList, directed = TRUE) %>% set_edge_attr("label", value = edge_names)	
# plot(nn.graph)	
	
# See the iGraph specs here: http://kateto.net/networks-r-igraph	
	
col <- rep("gray",length(V(nn.graph))) # input layer	
col[c(3)] <- "lightblue"             # hidden layer	
col[4] <- "lightgreen"                 # output layer	
plot(nn.graph, vertex.color=col, vertex.shape="sphere", vertex.size=c(35, 35,   80,    35), edge.label.cex=2)	
title(main="NAND Operator")	
mtext("Input Layer: Bivariate (X,Y)", side=2, line=-1, col="blue")	
mtext("     Output Layer (NAND)", side=4, line=-2, col="green")	
mtext("Hidden Layers (Neurons)", side=1, line=0, col="gray")	
#' 	
#' 	
#' ## Complex networks designed using simple building blocks 	
#' 	
#' Observe that stringing some of these primitive networks together, or/and increasing the number of hidden layers, allows us to model problems with exponentially increasing complexity. For instance, constructing a 4-input `NAND` function would simply require repeating several of our 2-input `NAND` operators. This will increase the space of possible outcomes from $2^2$ to $2^4$. Of course, introducing more depth in the **hidden layers** further expands the complexity of the problems that can be modeled using neural nets.	
#' 	
#' You can interactively manipulate the [Google's TensorFlow Deep Neural Network Webapp](https://playground.tensorflow.org) to gain additional intuition and experience with the various components of deep learning networks. 	
#' 	
#' The [ConvnetJS demo provides another hands-on example using 2D classification with 2-layer neural network]( https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html).	
#' 	
#' # Neural Network Modeling using `Keras`	
#' 	
#' There are many different neural-net and deep-learning frameworks. The table below summarizes some of the main deep learning `R` packages.	
#' 	
#' Package     | Description	
#' ----------  | --------------------------------------------------------------------	
#' nnet 	      |   Feed-forward neural networks using 1 hidden layer	
#' neuralnet 	| 	Training backpropagation neural networks	
#' tensorflow  |   Google TensorFlow used in TensorBoard (see [SOCR UKBB Demo](https://socr.umich.edu/HTML5/SOCR_TensorBoard_UKBB/))	
#' deepnet     |  	Deep learning toolkit	
#' darch       | 	Deep Architectures based on Restricted Boltzmann Machines	
#' rnn         |   Recurrent Neural Networks (RRNs)	
#' rcppDL      |   Multi-layer machine learning methods including dA (Denoising Autoencoder), SdA (Stacked Denoising Autoencoder), RBM (Restricted Boltzmann machine), and DBN (Deep Belief Nets)	
#' deepr       |   DL training, fine-tuning and predicting processes using darch and deepnet	
#' MXNetR 	    | 	Flexible and efficient ML/DL tools utilizing CPU/GPU computing	
#' kerasR      |   RStudio's keras DL implementation wrapping C++/Python executable libraries	
#' Keras       |   Python based neural networks API, connecting Google TensorFlow, Microsoft Cognitive Toolkit (CNTK), and Theano	
#' 	
#' ## Iterations - Samples, Batches and Epochs	
#' 	
#' Iteration    |  Description	
#' ------------ | -----------------------------------------------------------------------	
#' Sample       | A singleton from the dataset, i.e., one element such as a patient, case, image, file, etc.	
#' Batch        | An n-tuple, a set of $n$ samples. All samples in a batch are typically processed independently, e.g., in parallel.  AI training on a batch yields a single model (or model update). Dataset batches should accurately represent the underlying input data distribution, whereas a single sample represents one input. Larger batch sizes correspond to better model fits; however, they require significantly more computing processing power (cf. algorithmic complexity).	
#' Epoch        | A user-specified iterator that controls the number of passes over the entire dataset. Epochs separate training into independent model estimators and provide mechanisms for performance tracking and algorithmic evaluation.	
#' 	
#' Most DL/ML `R` packages provide interfaces (APIs) to libraries that are built using foundational languages like C/C++ and Java. Most of the Python libraries also act as APIs to lower-level executables compiled for specific platforms (Mac, Linux, PC).	
#' 	
#' The `keras` package uses the `magrittr` package `pipe` operator (`%>%`) to join multiple functions or operators, which streamlines the readability of the script protocol. Also, the library `zeallot` supplies the reverse piping function "%<-%", used in multiple assignment operators, see `brain_dataset()` function later.	
#' 	
#' The `kerasR` package contains functions analogous to the ones in `keras` and utilizes the $\$$ operator to create models. There are parallels between the core `Python` methods and their `keras` counterparts: `compile()` and `keras_compile()`, `fit()` and `keras_fit()`, `predict()` and `keras_predict()`.	
#' 	
#' Below we will demonstrate the utilization of the `Keras` package for deep neural network analytics. This will require installation of `keras` and `TensorFlow` via `R` `devtools::install_github("rstudio/keras")`. For additional details, see the `keras` [installation reference](https://tensorflow.rstudio.com/reference/keras/install_keras/), [user guide and FAQs](https://tensorflow.rstudio.com/guide/keras/faq/).	
#' 	
#' 	
# devtools::install_github("rstudio/keras")	
library("keras")	
# install_keras()	
	
#install.packages("tensorflow")	
library(tensorflow)	
#install_tensorflow()	
#' 	
#' 	
#' The [`Keras` package](https://keras.rstudio.com/) includes built-in datasets with `load()` functions, e.g., `mnist.load_data()` and `imdb.load_data()`.	
#' 	
#' 	
mnist <- dataset_mnist()	
imdb <- dataset_imdb()	
#' 	
#' 	
#' ## Use-Case: Predicting Titanic Passengers Survival	
#' 	
#' Instead of using the default data provided in the `keras` package, we will utilize one of the datasets on the [DSPA Case-Studies Website](https://umich.instructure.com/courses/38100/files/folder/Case_Studies), which can be loaded much like [we did earlier in Chapter 2](https://socr.umich.edu/DSPA2/DSPA2_notes/02_Visualization.html). Below, we download the [Titanic Passengers Dataset](https://umich.instructure.com/courses/38100/files/folder/data/16_TitanicPassengerSurvivalDataset) and perform some preprocessing steps.	
#' 	
#' 	
library(reshape)	
library(caret)	
dat <- read.csv("https://umich.instructure.com/files/9372716/download?download_frd=1")	
	
# Inspect for missing values (empty or NA):	
dat.miss <- melt(apply(dat[, -2], 2, function(x) sum(is.na(x) | x=="")))	
cbind(row.names(dat.miss)[dat.miss$value>0], dat.miss[dat.miss$value>0,])	
	
# We can exclude the "Cabin" feature which includes 80% missing values. 	
	
# Impute the few missing Embarked values using the most common value (S)	
table(dat$embarked)	
dat$embarked[which(is.na(dat$embarked) | dat$embarked=="")] <- "S"	
	
# Some "fare"" values may represent total cost of group purchases	
# We can derive a new variable "price" representing fare per person	
# Update missing fare value with 0	
dat$fare[which(is.na(dat$fare))] <- 0 	
# calculate ticket Price (Fare per person)	
ticket.count <- aggregate(dat$ticket, by=list(dat$ticket), function(x) sum( !is.na(x) ))	
dat$price <- apply(dat, 1, function(x) as.numeric(x["fare"]) /	
            ticket.count[which(ticket.count[, 1] == x["ticket"]), 2])	
	
# Impute missing prices (price=0) using the median price per passenger class	
pclass.price<-aggregate(dat$price, by = list(dat$pclass), FUN = function(x) median(x, na.rm = T))	
dat[which(dat$price==0), "price"] <- 	
  apply(dat[which(dat$price==0), ] , 1, function(x)	
    pclass.price[pclass.price[, 1]==x["pclass"], 2])	
	
# Define a new variable "ticketcount" coding the number of passengers sharing the same ticket number	
dat$ticketcount <- 	
  apply(dat, 1, function(x) ticket.count[which(ticket.count[, 1] ==	
                                                 x["ticket"]), 2])	
	
# Capture the passenger title	
dat$title <- 	
  regmatches(as.character(dat$name),	
             regexpr("\\,[A-z ]{1,20}\\.", as.character(dat$name)))	
dat$title <-	
  unlist(lapply(dat$title,	
            FUN=function(x) substr(x, 3, nchar(x)-1)))	
table(dat$title)	
	
# Bin the 17 alternative title groups into 4 common 4 titles (factors)	
dat$title[which(dat$title %in% c("Mme", "Mlle"))] <- "Miss"	
dat$title[which(dat$title %in% 	
                  c("Lady", "Ms", "the Countess", "Dona"))] <- "Mrs"	
dat$title[which(dat$title=="Dr" & dat$sex=="female")] <- "Mrs"	
dat$title[which(dat$title=="Dr" & dat$sex=="male")] <- "Mr"	
dat$title[which(dat$title %in% c("Capt", "Col", "Don", 	
                  "Jonkheer", "Major", "Rev", "Sir"))] <- "Mr"	
dat$title <- as.factor(dat$title) 	
table(dat$title)	
	
# Impute missing ages using median age for each title group	
title.age <- aggregate(dat$age, by = list(dat$title), 	
                    FUN = function(x) median(x, na.rm = T))	
dat[is.na(dat$age), "age"] <- apply(dat[is.na(dat$age), ] , 1, 	
              function(x) title.age[title.age[, 1]==x["title"], 2])	
#' 	
#' 	
#' ## EDA/Visualization	
#' 	
#' We can start by some [simple EDA plots](https://socr.umich.edu/DSPA2/DSPA2_notes/02_Visualization.html), reporting some numerical summaries, examining pairwise correlations, and showing the distributions of some features in this dataset.	
#' 	
#' 	
library(ggplot2)	
summary(dat)	
	
# cols <- c("red","green")[unclass(dat$survived)]	
# 	
# plot(dat$ticketcount, dat$fare, pch=21, cex=1.5,	
#      bg=alpha(cols, 0.4),	
#      xlab="Number of Tickets per Party", ylab="Passenger Fare", 	
#      main="Titanic Passenger Data (TicketCount vs. Fare) Color Coded by Survival")	
# legend("topright", inset=.02, title="Survival",	
#    c("0","1"), fill=c("red", "green"), horiz=F, cex=0.8)	
	
plot_ly(dat, type="scatter", mode="markers")  %>%	
    add_trace(x = ~ticketcount, y=~fare, mode = 'markers', 	
              color = ~as.character(survived), colors=~survived) %>%	
    layout(legend = list(title=list(text=' Survival '), orientation = 'h'), 	
           title="Titanic Passenger Data (TicketCount vs. Fare) Color Coded by Survival")	
	
# plot_ly(dat, x = ~ticketcount, color = ~survived) %>% add_histogram()	
# 	
# plot_ly(dat, x = ~ticketcount, y = ~fare, color = ~survived, name=~survived) %>% add_bars() %>% layout(barmode = "stack")	
	
	
# ggplot(dat, aes(x=survived, y=fare, fill=sex)) + 	
#   #geom_dotplot(binaxis='y', stackdir='center',	
#   #               position=position_dodge(1)) +	
#   #scale_fill_manual(values=c("#999999", "#E69F00")) +	
#   geom_violin(trim=FALSE) +	
#   theme(legend.position="top")	
	
	
fig <- dat %>% plot_ly(type = 'violin') 	
fig <- fig %>% add_trace(x = ~survived[dat$survived == '1'], y = ~fare[dat$survived == '1'],	
    legendgroup = 'survived',   scalegroup = 'survived',  name = 'survived',  	
    box = list(visible = T),  meanline = list(visible = T ),  color = I("green")) 	
fig <- fig %>% add_trace(x = ~survived[dat$survived == '0'], y = ~fare[dat$survived == '0'],	
    legendgroup = 'died',   scalegroup = 'died',  name = 'died',  	
    box = list(visible = T),  meanline = list(visible = T ),  color = I("red"))	
fig <- fig %>% layout( xaxis = list(title="Survival"), yaxis = list(title="Fare"),	
                       title=' Titanic Passenger Survival vs. Fare ', orientation = 'h')	
fig	
	
# library(GGally)	
# ggpairs(dat[ , c("pclass", "age", "sibsp", "parch", 	
#                  "fare", "price", "ticketcount", "survived")], 	
#         aes(colour = as.factor(survived), alpha = 0.4))	
	
dims <- dplyr::select_if(dat, is.numeric)	
dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))	
plot_ly(type = "splom", dimensions = setNames(dims, NULL), showupperhalf = FALSE, 	
        diagonal = list(visible = FALSE) ) %>% 	
  layout( title=' Titanic Passengers Pairs-Plots ')	
#' 	
#' 	
#' ## Data Preprocessing	
#' 	
#' Before we go into modeling the data, we need to preprocess it, e.g., normalize the numerical values and split it into training and testing sets.	
#' 	
#' 	
dat1 <- dat[ , c("pclass", "age", "sibsp", "parch", "fare", 	
                 "price", "ticketcount", "survived")]	
dat1$pclass <- as.factor(dat1$pclass)	
dat1$age <- as.numeric(dat1$age)	
dat1$sibsp <- as.factor(dat1$sibsp)	
dat1$parch <- as.factor(dat1$parch)	
dat1$fare <- as.numeric(dat1$fare)	
dat1$price <- as.numeric(dat1$price)	
dat1$ticketcount <- as.numeric(dat1$ticketcount)	
dat1$survived <- as.factor(dat1$survived)	
	
# Set the `dimnames` to `NULL`	
# dimnames(dat1) <- NULL	
dim(dat1)	
#' 	
#' 	
#' Use `keras::normalize()` to normalize the numerical data.	
#' 	
#' 	
# devtools::install_github("rstudio/keras")	
# First install Anaconda/Python: https://www.anaconda.com/download/#windows	
# install_keras()	
# reticulate::py_config()	
# library("keras")	
# use_python("C:/Users/Dinov/AppData/Local/Programs/Python/Python37/python.exe")	
# install_keras()	
# install_keras(method = "conda")	
# install_tensorflow()	
library("keras")	
# Normalize the data	
summary(dat1[ , c(2,5,6,7)])	
dat2 <- dat1[ , c(2,5,6,7)]	
dat2 <- as.matrix(dat2)	
dimnames(dat2) <- NULL	
# May be best to avoid normalizing the ordinal variable "ticketcount"	
dat2.norm <- keras::normalize(dat2)	
	
# report the summary`	
summary(dat2.norm)	
colnames(dat2.norm) <- c("age", "fare", "price", "ticketcount")	
#' 	
#' 	
#' Next, we'll partition the raw data into *training* (80%) and *testing* (20%) sets that will be utilized to build the forecasting model (to predict Titanic passenger survival) and assess the model performance, respectively. 	
#' 	
#' 	
train_set_ind <- sample(nrow(dat2.norm), floor(nrow(dat2.norm)*0.8)) # 80:20 plot training:testing	
train_dat2.X <- dat2.norm[train_set_ind, ]	
train_dat2.Y <- dat1[train_set_ind , 8]  # Outcome "survived" column:8	
	
test_dat2.X <- dat2.norm[-train_set_ind, ]	
test_dat2.Y <- dat1[-train_set_ind , 8]  # Outcome "survived" column:8	
	
# double check the size/dimensions of the training and testing data (predictors and responses)	
dim(train_dat2.X); length(train_dat2.Y); dim(test_dat2.X); length(test_dat2.Y)	
#' 	
#' 	
#' ## Keras Modeling	
#' 	
#' For *multi-class classification problems* via NN modeling, the `keras::to_categorical()` function allows us to transform the outcome attribute from a vector of class labels to a matrix of Boolean features, one for each class label. In this case, we have a bivariate (binary classification), passenger survival indicator.	
#' 	
#' Keras modeling starts with first initializing a sequential model using the `keras::keras_model_sequential()` function. 	
#' 	
#' We will try to predict the passenger survival using a fully-connected multi-layer perceptron NN. We will need to choose an activation function, e.g., `relu`, `sigmoid`. A rectifier activation function (relu) may be used in a hidden layer and a `softmax` activation function may be used in the final output layer so that the outputs represent (posterior) probabilities between 0 and 1, corresponding to the odds of survival. In the first layer, we can specify 8 hidden nodes (`units`), an `input_shape` of 4, to reflect the 4 features in the training data *age*, *fare*, *price*, *ticketcount*, and the output layer with 2 output values, one for each of the survival categories. We can also inspect the structure of the NN model using:	
#' 	
#'  - `summary()`: print a summary representation of your model,	
#'  - `get_config()`: return a list that contains the configuration of the model,	
#'  - `get_layer()`: return the layer configuration,	
#'  - `$layers`: NN model attribute retrieves a flattened list of the model's layers,	
#'  - `$inputs`: NN model attribute listing the input tensors,	
#'  - `$outputs`: NN model attribute retrieves the output tensors.	
#' 	
#' 	
model.1 <- keras_model_sequential() 	
	
# Add layers to the model	
model.1 %>% 	
    layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>% 	
    layer_dense(units = 2, activation = 'softmax')	
	
# NN model summary	
summary(model.1)	
# Report model configuration	
get_config(model.1)	
# report layer configuration	
get_layer(model.1, index = 1)	
# Report model layers	
model.1$layers	
# List the input tensors	
model.1$inputs	
# List the output tensors	
model.1$outputs	
#' 	
#' 	
#' Once the model architecture is specified, we need to estimate (fit) the NN model using the `training` data. The adaptive momentum (`ADAM`) optimizer along with `categorical_crossentropy` objective function may be used to **compile** the NN model. Specifying `accuracy` as a metrics argument allows us to inspect the quality of the NN model fit during the training phase (training data validation). The *optimizer* and the *objective* (loss) functions are the pair of required arguments for model compilation. 	
#' 	
#' In addition to *ADAM*, alternative optimization algorithms include Stochastic Gradient Descent (*SGD*) and Root Mean Square proportion (*RMSprop*). ADAM is essentially RMSprop with momentum whereas NADAM is ADAM RMSprop with Nesterov momentum. Following the selection of the optimization algorithm, we need to tune the model parameters, e.g., learning rate or momentum. Choosing an appropriate objective function depends on the classification or regression forecasting task, e.g., regression prediction (continuous outcomes) usually utilizes Mean Squared Error (*MSE*), whereas multi-class classification problems use *categorical_crossentropy* loss function and binary classification problems commonly use *binary_crossentropy* loss function.	
#' 	
#' 	
# "Compile" the model	
model.1 %>% compile(	
     loss = 'binary_crossentropy',	
     optimizer = 'adam',	
     metrics = 'accuracy'	
 )	
#' 	
#' 	
#' ## NN Model Fitting	
#' 	
#' The next step fits the NN model (`model.1`) to the training data using 200 epochs, or iterations over all the samples in `train_dat2.X` (predictors) and `train_dat2.Y` (outcomes), in batches of 10 samples. This process trains the model on a specified number of epochs (iterations or exposures) on the training data. One epoch is a single pass through the whole training set followed by comparing the model prediction results against the verification labels. The batch size defines the number of samples being propagated through the network at once (as a batch). 	
#' 	
#' 	
# convert the labels to categorical values	
train_dat2.Y <- to_categorical(train_dat2.Y)	
test_dat2.Y <- to_categorical(test_dat2.Y)	
	
# Fit the model & Store the fitting history	
track.model.1 <- model.1 %>% fit(	
     train_dat2.X, 	
     train_dat2.Y, 	
     epochs = 200, 	
     batch_size = 10, 	
     validation_split = 0.2	
 )	
#' 	
#' 	
#' ## Convolutional Neural Networks (CNNs)	
#' 	
#' Convolutional Neural Networks represent a specific type of Deep Learning algorithm that incorporates the topological, geometric, spatial and temporal structure of the input data (generally images) and assigns importance by learning the weights and biases of the (image) intensities associated with the objects or affinities present in the data. These important features are then utilized to differentiate between datasets (images) or components within the data (structure and objects in images). CNNs require less pre-processing compared to other DL classification algorithms, which may depend on manually-specified filters. CNNs tend to learn these filters by iteratively extrapolating multi-resolution characteristics in the data objects by convolution methods. See the [DSPA Appendix for the mathematical operation convolution and its applications in image processing](https://socr.umich.edu/DSPA2/DSPA2_notes/DSPA_Appendix_6_ImageFilteringSpectralProcessing.html).	
#' 	
#' Recall that one may attempt to learn the features of an image (or a higher dimensional tensor) by flattening the image array (matrix/tensor) into a 1D vector. This vectorization works well if there are no spatiotemporal dependencies in the data. Most of the time, there are such image intensity correlations that can’t be ignored. The CNN architecture facilitates a mechanism to better model the intrinsic image affinities, reduce the number of DNN parameters, and produce more reliable predictions. Many images are represented as tensors whose modes (dimensions) encode spatial, temporal, color-channel, and other information about the observed image intensity. For instance, an RGB image of size $1,000 \times 1,000=10^6$ pixels, may require 3MB of memory/storage. A CNN learns to encode the image into a higher-dimensional multispectral hierarchical tensor encoding the intrinsic image characteristics that can lead to easy classification of similar images or generation of synthetic images. For instance, ignoring the color-channels and using a stride=10, convolving the original image of dimension with a kernel of size $10\times 10$ would yield another (smoother) lower-resolution image of size $100\times 100$, encoding the convolved features. 	
#' 	
#' The convolution process aims to extract the high-level features such as edges, borders, and contrasts from the input image. CNNs involve both convolutional and dense layers. Much like the [Fourier transform](https://www.socr.umich.edu/TCIU/HTMLs/Chapter3_Kime_Phase_Problem.html), the first convolutional layer captures low-level features such as edges, color, gradient orientation, etc. Subsequent layers progressively add higher-level details and the entire CNN holistically encodes the understanding of the input image structure.	
#' 	
#' Convolution, de-convolution (the reverse process) and padding reduce or increase the image dimensionality. 	
#' Most CNNs mix *convolutional layers* with *pooling layers*. The latter are responsible for reducing the spatial size of the convolved features, which decreases the computational data processing demand. Pooling may be implemented as *Max Pooling* or *Average Pooling*. Max-pooling takes an image patch defined by the kernel and returns the maximum intensity value. It performs noise-suppression as it decimates noisy pixel intensities, denoises the image, and reduces the image dimensions. Average-pooling returns the average of all intensity values covered by the image-kernel and reduces the image dimension.	
#' 	
#' Jointly, the convolutional and pooling processes form the CNN $i$-th layer and the number of layers may reflect the ANN complexity. Fully connected layers are typically added to the ANN architecture to enhance the classification, prediction, or regression performance of DL models. Fully-connected layers provide a mechanism to learn non-linear associations and non-affine characteristics of high-level features captured as outputs of the convolutional layers. 	
#' 	
#' ## Model EDA	
#' 	
#' We can visualize the model fitting process using `keras::plot()` jointly depicting the loss of the objective function and the accuracy of the model, across epochs. Alternatively, we can split the pair of plots - one for the *model loss* and the other for the *model accuracy*. The $\$$ operator is used to access the tensor data and plot it step-by-step. A sign of overfitting may be an accuracy (on training data) that keeps improving while the accuracy (on the validation data) worsens. This may be an indication that the NN model started to *learn* noise in the data instead of learning real patterns or affinities in the data. While the accuracy trends of both datasets are rising towards the final epochs, this may indicate that the model is still in the process of learning on the training dataset (and we can increase the number of epochs).	
#' 	
#' 	
# Plot the history	
# plot(track.model.1)	
# 	
# # NN model loss on the training data	
# plot(track.model.1$metrics$loss, main="Model 1 Loss", 	
#      xlab = "Epoch", ylab="Loss", col="green", type="l", ylim=c(0.54, 0.6))	
# 	
# # NN model loss of the 20% validation data	
# lines(track.model.1$metrics$val_loss, col="blue", type="l")	
# 	
# # Add legend	
# legend("right", c("train", "test"), col=c("green", "blue"), lty=c(1,1))	
# 	
# # Plot the accuracy of the training data 	
# plot(track.model.1$metrics$acc, main="Model 1 Accuracy", 	
#      xlab = "Epoch", ylab="Accuracy", col="blue", type="l", ylim=c(0.65, 0.75))	
# 	
# # Plot the accuracy of the validation data	
# lines(track.model.1$metrics$val_acc, col="green")	
# 	
# # Add Legend	
# legend("bottom", c("Training", "Testing"), col=c("blue", "green"), lty=c(1,1))	
	
## plot_ly	
epochs <- 200	
time <- 1:epochs	
hist_df <- data.frame(time=time, loss=track.model.1$metrics$loss, acc=track.model.1$metrics$acc,	
                      valid_loss=track.model.1$metrics$val_loss, valid_acc=track.model.1$metrics$val_acc)	
	
plot_ly(hist_df, x = ~time)  %>%	
  add_trace(y = ~loss, name = 'training loss', mode = 'lines') %>%	
  add_trace(y = ~acc, name = 'training accuracy', mode = 'lines+markers') %>%	
  add_trace(y = ~valid_loss, name = 'validation loss',mode = 'lines+markers') %>%	
  add_trace(y = ~valid_acc, name = 'validation accuracy', mode = 'lines+markers') %>% 	
  layout(title="Titanic NN Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
#' 	
#' 	
#' ## Passenger Survival Forecasting using New Data	
#' 	
#' Once the model is fit, we can use it to predict the survival of passengers using the testing data, `test_dat2.X`. As we have seen before, `predict()` provides this functionality. Finally, we can evaluate the performance of the NN model by comparing the predicted class labels and `test_dat2.Y` using `table()` or `confusionMatrix()`.	
#' 	
#' 	
# Predict the classes for the test data	
predict.survival <- model.1 %>% predict(test_dat2.X, batch_size = 30) %>%  k_argmax()	
	
# Confusion matrix	
test_dat2.Y <- dat1[-train_set_ind , 8]	
table(test_dat2.Y, predict.survival$numpy())	
caret::confusionMatrix(test_dat2.Y, as.factor(predict.survival$numpy()))	
#' 	
#' 	
#' We can also utilize the `evaluate()` function to assess the model quality using testing data.	
#' 	
#' 	
# Evaluate on test data and labels	
test_dat2.Y <- to_categorical(test_dat2.Y)	
model1.qual <- model.1 %>% evaluate(test_dat2.X, test_dat2.Y, batch_size = 30)	
print(model1.qual)	
#' 	
#' 	
#' ## Fine-tuning the NN Model	
#' 	
#' The main NN model parameters we can adjust to improve the model quality include:	
#' 	
#'  - the *number of layers*,	
#'  - the *number of nodes* within layers (*hidden units*),	
#'  - the *number of epochs*,	
#'  - the *batch size*.	
#'   	
#' Models can be improved by adding additional layers, increasing the number of hidden units, and by tuning the optimization parameters in `compile()`. Let's first try to add another layer to the N model.	
#' 	
#' 	
# Initialize the sequential model	
model.2 <- keras_model_sequential() 	
	
# Add layers to model	
model.2 %>% 	
    layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>% 	
    layer_dense(units = 6, activation = 'relu') %>% 	
    layer_dense(units = 2, activation = 'softmax')	
	
# Compile the model	
model.2 %>% compile(	
     loss = 'binary_crossentropy',	
     optimizer = 'adam',	
     metrics = 'accuracy'	
 )	
	
# Fit NN model to training data & Save the training history	
track.model.2 <- model.2 %>% fit(	
     train_dat2.X, 	
     train_dat2.Y, 	
     epochs = 200, 	
     batch_size = 10, 	
     validation_split = 0.2	
 )	
	
# Evaluate the model	
model2.qual <- model.2 %>% evaluate(test_dat2.X, test_dat2.Y, batch_size = 30)	
print(model2.qual)	
	
# EDA on the loss and accuracy metrics of this model.2	
# Plot the history	
# plot(track.model.2)	
# 	
# # NN model loss on the training data	
# plot(track.model.2$metrics$loss, main="Model Loss", 	
#      xlab = "Epoch", ylab="Loss", col="green", type="l", ylim=c(0.54, 0.6))	
# 	
# # NN model loss of the 20% validation data	
# lines(track.model.2$metrics$val_loss, col="blue", type="l")	
# 	
# # Add legend	
# legend("right", c("Training", "Testing"), col=c("green", "blue"), lty=c(1,1))	
# 	
# # Plot the accuracy of the training data 	
# plot(track.model.2$metrics$acc, main="Model 2 (Extra Layer) Accuracy", 	
#      xlab = "Epoch", ylab="Accuracy", col="blue", type="l", ylim=c(0.65, 0.76))	
# 	
# # Plot the accuracy of the validation data	
# lines(track.model.2$metrics$val_acc, col="green")	
# 	
# # Add Legend	
# legend("top", c("Training", "Testing"), col=c("blue", "green"), lty=c(1,1))	
	
## plot_ly	
epochs <- 200	
time <- 1:epochs	
hist_df2 <- data.frame(time=time, loss=track.model.2$metrics$loss, acc=track.model.2$metrics$acc,	
                      valid_loss=track.model.2$metrics$val_loss, valid_acc=track.model.2$metrics$val_acc)	
	
plot_ly(hist_df2, x = ~time)  %>%	
  add_trace(y = ~loss, name = 'training loss', mode = 'lines') %>%	
  add_trace(y = ~acc, name = 'training accuracy', mode = 'lines+markers') %>%	
  add_trace(y = ~valid_loss, name = 'validation loss',mode = 'lines+markers') %>%	
  add_trace(y = ~valid_acc, name = 'validation accuracy', mode = 'lines+markers') %>% 	
  layout(title="Titanic (model.2) Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
#' 	
#' 	
#' Next we can examine the effects of adding more *hidden units* to the NN model.	
#' 	
#' 	
# Initialize a sequential model	
model.3 <- keras_model_sequential() 	
	
# Add layers and Node-Units to model	
model.3 %>% 	
    layer_dense(units = 30, activation = 'relu', input_shape = c(4)) %>% 	
    layer_dense(units = 15, activation = 'relu') %>% 	
    layer_dense(units = 2, activation = 'softmax')	
	
# Compile the model	
model.3 %>% compile(	
     loss = 'binary_crossentropy',	
     optimizer = 'adam',	
     metrics = 'accuracy'	
 )	
	
# Fit NN model to training data & Save the training history	
track.model.3 <- model.3 %>% fit(	
     train_dat2.X, 	
     train_dat2.Y, 	
     epochs = 200, 	
     batch_size = 10, 	
     validation_split = 0.2	
 )	
	
# Evaluate the model	
model3.qual <- model.3 %>% evaluate(test_dat2.X, test_dat2.Y, batch_size = 30)	
print(model3.qual)	
	
# EDA on the loss and accuracy metrics of this model.2	
# Plot the history	
# plot(track.model.3)	
# 	
# # NN model loss on the training data	
# plot(track.model.3$metrics$loss, main="Model Loss", 	
#      xlab = "Epoch", ylab="Loss", col="green", type="l", ylim=c(0.54, 0.7))	
# 	
# # NN model loss of the 20% validation data	
# lines(track.model.3$metrics$val_loss, col="blue", type="l")	
# 	
# # Add legend	
# legend("top", c("Training", "testing"), col=c("green", "blue"), lty=c(1,1))	
# 	
# # Plot the accuracy of the training data 	
# plot(track.model.3$metrics$acc, main="Model 3 (Extra Layer/More Hidden Units)", 	
#      xlab = "Epoch", ylab="Accuracy", col="blue", type="l", ylim=c(0.65, 0.76))	
# 	
# # Plot the accuracy of the validation data	
# lines(track.model.3$metrics$val_acc, col="green")	
# 	
# # Add Legend	
# legend("top", c("Training", "Testing"), col=c("blue", "green"), lty=c(1,1))	
	
## plot_ly	
epochs <- 200	
time <- 1:epochs	
hist_df3 <- data.frame(time=time, loss=track.model.3$metrics$loss, acc=track.model.3$metrics$acc,	
                      valid_loss=track.model.3$metrics$val_loss, valid_acc=track.model.3$metrics$val_acc)	
	
plot_ly(hist_df3, x = ~time)  %>%	
  add_trace(y = ~loss, name = 'training loss', mode = 'lines') %>%	
  add_trace(y = ~acc, name = 'training accuracy', mode = 'lines+markers') %>%	
  add_trace(y = ~valid_loss, name = 'validation loss',mode = 'lines+markers') %>%	
  add_trace(y = ~valid_acc, name = 'validation accuracy', mode = 'lines+markers') %>% 	
  layout(title="Titanic (model.3) NN Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
#' 	
#' 	
#' Finally, we can attempt to fine-tune the optimization parameters provided to the `compile()` function. For instance, we can experiment with alternative optimization algorithms, like the Stochastic Gradient Descent (SGD), `optimizer_sgd()`, and adjust the *learning rate*, `lr`. In addition, we can specify alternative learning rate to train the NN, typically by 10-fold increase or decrease, which trades algorithmic accuracy, speed of convergence, and avoidance of local minima.	
#' 	
#' 	
model.4 <- keras_model_sequential() 	
	
# Add layers and Node-Units to model	
model.4 %>% 	
    layer_dense(units = 30, activation = 'relu', input_shape = c(4)) %>% 	
    layer_dense(units = 15, activation = 'relu') %>% 	
    layer_dense(units = 2, activation = 'softmax')	
	
# Define an optimizer	
SGD <- optimizer_sgd(lr = 0.001)	
	
# Compile the model	
model.4 %>% compile(	
    optimizer=SGD, 	
    loss = 'binary_crossentropy',	
    metrics = 'accuracy'	
 )	
	
# Fit NN model to training data & Save the training history	
set.seed(1234)	
track.model.4 <- model.4 %>% fit(	
     train_dat2.X, 	
     train_dat2.Y, 	
     epochs = 200, 	
     batch_size = 10, 	
     validation_split = 0.1	
 )	
	
# Evaluate the model	
model4.qual <- model.4 %>% evaluate(test_dat2.X, test_dat2.Y, batch_size = 30)	
print(model4.qual)	
	
# EDA on the loss and accuracy metrics of this model.2	
# Plot the history	
# plot(track.model.4)	
# 	
# # NN model loss on the training data	
# plot(track.model.4$metrics$loss, main="Model 4 Loss", 	
#      xlab = "Epoch", ylab="Loss", col="green", type="l", ylim=c(0.54, 0.7))	
# 	
# # NN model loss of the 20% validation data	
# lines(track.model.4$metrics$val_loss, col="blue", type="l")	
# 	
# # Add legend	
# legend("top", c("Training", "Testing"), col=c("green", "blue"), lty=c(1,1))	
# 	
# # Plot the accuracy of the training data 	
# plot(track.model.4$metrics$acc, main="Model 4 (Extra Layer/More Hidden Units/SGD)", 	
#      xlab = "Epoch", ylab="Accuracy", col="blue", type="l", ylim=c(0.65, 0.76))	
# 	
# # Plot the accuracy of the validation data	
# lines(track.model.4$metrics$val_acc, col="green")	
# 	
# # Add Legend	
# legend("top", c("Training", "Testing"), col=c("blue", "green"), lty=c(1,1))	
	
epochs <- 200	
time <- 1:epochs	
hist_df4 <- data.frame(time=time, loss=track.model.4$metrics$loss, acc=track.model.4$metrics$acc,	
                      valid_loss=track.model.4$metrics$val_loss, valid_acc=track.model.4$metrics$val_acc)	
	
plot_ly(hist_df4, x = ~time)  %>%	
  add_trace(y = ~loss, name = 'training loss', mode = 'lines') %>%	
  add_trace(y = ~acc, name = 'training accuracy', mode = 'lines+markers') %>%	
  add_trace(y = ~valid_loss, name = 'validation loss',mode = 'lines+markers') %>%	
  add_trace(y = ~valid_acc, name = 'validation accuracy', mode = 'lines+markers') %>% 	
  layout(title="Titanic (model.4) NN Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
#' 	
#' 	
#' ## Model Export and Import	
#' 	
#' Intermediate and final NN models may be saved, (re)loaded, and exported using of `save_model_hdf5()` and `load_model_hdf5()` based on the [HDF5 file format (h5)](https://support.hdfgroup.org/HDF5/whatishdf5.html). We can operate on *complete models* on just on the *model weights*. the models can also be exported in [JSON](https://www.json.org) or [YAML](https://yaml.org) formats using `model_to_json()` and `model_to_yaml()`, and their load counterparts `model_from_json()` and `model_from yaml()`.	
#'   	
#' 	
save_model_hdf5(model.4, "model.4.h5")	
model.new <- load_model_hdf5("model.4.h5")	
	
save_model_weights_hdf5("model_weights.h5")	
model.old %>% load_model_weights_hdf5("model_weights.h5")	
	
json_string <- model_to_json(model.old)	
model.new <- model_from_json(json_string)	
	
yaml_string <- model_to_yaml(model.old)	
model.new <- model_from_yaml(yaml_string)	
#' 	
#' 	
#' Let's demonstrate loading several pre-trained models, [(resnet50)](https://keras.io/api/applications/resnet/), [VGG16](https://keras.io/api/applications/vgg/), [VGG19](https://keras.io/api/applications/vgg/), and using them for simple out-of-the-box image classification and automated labeling of an [image of New Zealand's Lake Mapourika](https://upload.wikimedia.org/wikipedia/commons/2/23/Lake_mapourika_NZ.jpeg). This image recognition example will be expanded later. For now, we will simply illustrate the quick and efficient utilization of an existing pretrained neural network to qualitatively describe an image in a narative form.	
#' 	
#' 	
library(keras)	
	
# get info about local version of Python installation	
reticulate::py_config()	
# The first time you run this install Pillow!	
# tensorflow::install_tensorflow(extra_packages='pillow')	
	
# load the image	
download.file("https://upload.wikimedia.org/wikipedia/commons/2/23/Lake_mapourika_NZ.jpeg", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')	
img <- image_load(paste(getwd(),"results/image.png", sep="/"), target_size = c(224,224))	
	
# Preprocess input image	
x <- image_to_array(img)	
# ensure we have a 4d tensor with single element in the batch dimension,	
# the preprocess the input for prediction using resnet50	
x <- array_reshape(x, c(1, dim(x)))	
x <- imagenet_preprocess_input(x)	
	
# Specify and compare Predictions based on different Pre-trained Models	
# Model 1: resnet50	
model_resnet50 <- application_resnet50(weights = 'imagenet')	
	
# make predictions then decode and print them	
preds_resnet50 <- model_resnet50 %>% predict(x)	
imagenet_decode_predictions(preds_resnet50, top = 10)	
	
# Model2: VGG19	
model_vgg19 <- application_vgg19(weights = 'imagenet')	
preds_vgg19 <- model_vgg19 %>% predict(x)	
imagenet_decode_predictions(preds_vgg19, top = 10)[[1]]	
	
# Model 3: VGG16	
model_vgg16 <- application_vgg16(weights = 'imagenet')	
preds_vgg16 <- model_vgg16 %>% predict(x)	
imagenet_decode_predictions(preds_vgg16, top = 10)[[1]]	
#' 	
#' 	
#' 	
#' # Classification examples	
#' 	
#' ## Sonar data example	
#' 	
#' Let's load the `mlbench` packages which includes a [Sonar data](https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/Sonar) `mlbench::Sonar` containing information about sonar signals bouncing off a metal cylinder or a roughly cylindrical rock. Each of 208 observations includes a set of 60 numbers (features) in the range 0.0 to 1.0, and a label M (metal) or R (rock). Each feature represents the energy within a particular frequency band, integrated over a certain period of time. The M and R labels associated with each observation classify the record as rock or mine (metal) cylinder. The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly. 	
#' 	
#' 	
library(mlbench)	
	
data(Sonar, package="mlbench")	
	
table(Sonar[,61])	
	
Sonar[,61] = as.numeric(Sonar[,61])-1 # R = "1", "M" = "0"	
set.seed(123)	
train.ind = sample(1:nrow(Sonar),0.7*nrow(Sonar))	
	
train.x = data.matrix(Sonar[train.ind, 1:60])	
train.y = Sonar[train.ind, 61]	
test.x = data.matrix(Sonar[-train.ind, 1:60])	
test.y = Sonar[-train.ind, 61]	
#' 	
#' 	
#' Let's start by using a **multi-layer perceptron** as a classifier using a general multi-layer neural network that can be utilized to do classification or regression modeling. It relies on the following parameters:	
#' 	
#' * Training data and labels,	
#' * Number of hidden nodes in each hidden layer,	
#' * Number of nodes in the output layer,	
#' * Type of activation,	
#' * Type of output loss.	
#' 	
#' Here is one example using the *training* and *testing* data we defined above:	
#' 	
#' 	
library(plotly)	
dim(train.x)  # [1] 145  60	
dim(test.x)   # [1] 63 60	
	
model <- keras_model_sequential() 	
	
model %>% 	
  layer_dense(units = 256, activation = 'relu', input_shape = ncol(train.x)) %>% 	
  layer_dropout(rate = 0.4) %>% 	
  layer_dense(units = 128, activation = 'relu') %>%	
  layer_dropout(rate = 0.3) %>%	
  layer_dense(units = 2, activation = 'sigmoid')	
	
model %>% compile(	
  loss = 'binary_crossentropy',	
  optimizer = 'adam',	
  metrics = c('accuracy')	
)	
	
one_hot_labels <- to_categorical(train.y, num_classes = 2)	
	
# Train the model, iterating on the data in batches of 25 samples	
history <- model %>% fit(	
  train.x, one_hot_labels, 	
  epochs = 100, 	
  batch_size = 5,	
  validation_split = 0.3	
)	
 	
# Evaluate model	
metrics <- model %>% evaluate(test.x, to_categorical(test.y, num_classes = 2))	
metrics	
	
epochs <- 100	
time <- 1:epochs	
hist_df <- data.frame(time=time, loss=history$metrics$loss, acc=history$metrics$accuracy,	
                      valid_loss=history$metrics$val_loss, valid_acc=history$metrics$val_accuracy)	
	
plot_ly(hist_df, x = ~time)  %>%	
  add_trace(y = ~loss, name = 'training loss', type="scatter", mode = 'lines') %>%	
  add_trace(y = ~acc, name = 'training accuracy', type="scatter", mode = 'lines+markers') %>%	
  add_trace(y = ~valid_loss, name = 'validation loss', type="scatter",mode = 'lines+markers') %>%	
  add_trace(y = ~valid_acc, name = 'validation accuracy', type="scatter", mode = 'lines+markers') %>% 	
  layout(title="Sonar Data NN Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
	
# Finally prediction of binary class labels and Confusion Matrix	
predictions <- model %>% predict(test.x) %>%  k_argmax()	
table(factor(predictions$numpy()),factor(test.y))	
	
# We can also inspect the corresponding probabilities of the automated binary classification labels	
prediction_probabilities <- model %>% predict(test.x)	
prediction_probabilities	
#' 	
#' 	
#' *Note* that you may need to specify `crossval::confusionMatrix()`, in case you also have the `caret` package loaded, as `caret` also has a function called `confusionMatrix()`.	
#' 	
#' 	
library("crossval")	
diagnosticErrors(crossval::confusionMatrix(predictions$numpy(),test.y, negative = 0))	
#' 	
#' 	
#' We can plot the ROC curve and calculate the AUC (Area under the curve). Specifically, we will show computing the *area under the curve (AUC)* and drawing the *receiver operating characteristic (ROC)* curve. Assuming 'positive' ranks higher than 'negative', the [AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve) quantifies the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. For binary classification, interpreting the AUC values, $0\leq AUC\leq 1$, corresponds with (poor) *uninformative classifiers* when $AUC=0.5$ and perfect classifiers when $AUC\to 1^-$. 	
#' 	
#' 	
# install.packages("pROC"); install.packages("plotROC"); install.packages("reshape2")	
library(pROC); library(plotROC); library(reshape2); 	
# compute AUC	
get_roc = function(preds){	
  roc_obj <- roc(test.y, preds, quiet=TRUE)	
  auc(roc_obj)	
}	
get_roc(predictions$numpy())	
predictions <- predictions$numpy() 	
	
#plot roc	
dt <- data.frame(test.y, predictions)	
colnames(dt) <- c("class","scored.probability")	
	
# basicplot <- ggplot(dt, aes(d = class, m = scored.probability)) + 	
#   geom_roc(labels = FALSE, size = 0.5, alpha.line = 0.6, linejoin = "mitre") +  	
#   theme_bw() + coord_fixed(ratio = 1) + style_roc() + ggtitle("ROC CURVE")+	
#   ggplot2::annotate("rect", xmin = 0.4, xmax = 0.9, ymin = 0.1, ymax = 0.5,	
#            alpha = 0.2)+	
#   ggplot2::annotate("text", x = 0.65, y = 0.32, size = 3,	
#            label = paste0("AUC: ", round(get_roc(predictions)[[1]], 3)))	
# basicplot	
	
# Compute the AUC and draw the ROC curve	
roc_curve <- function(df) {	
  x <- c()	
  y <- c()	
  true_class = df[, "class"]	
  probabilities = df[, "scored.probability"]	
  thresholds = seq(0, 1, 0.01)	
  rx <- 0	
  ry <- 0	
  for (threshold in thresholds) {	
    predicted_class <- c()	
    for (val in probabilities) {	
      if (val > threshold) {	
        predicted_class <- c(predicted_class, 1)	
      }	
      else { predicted_class <- c(predicted_class, 0)   }	
    }	
    df2 <- as.data.frame(cbind(true_class, predicted_class))	
    TP <- nrow(filter(df2, true_class == 1 & predicted_class == 1))	
    TN <- nrow(filter(df2, true_class == 0 & predicted_class == 0))	
    FP <- nrow(filter(df2, true_class == 0 & predicted_class == 1))	
    FN <- nrow(filter(df2, true_class == 1 & predicted_class == 0))	
    specm1 <- 1 - ((TN) / (TN + FP))	
    sens <- (TP) / (TP + FN)	
    x <- append(x, specm1)	
    y <- append(y, sens)	
  }	
  dfr <- as.data.frame(cbind(x, y))	
  plot_ly(dfr, x = ~ x, y = ~ y, type = 'scatter', mode = 'lines') %>%	
    layout(title = paste0("ROC Curve and AUC"), annotations = list(	
        text = paste0("Area Under Curve = ", round(get_roc(predictions)[[1]], 3)),	
        x = 0.75, y = 0.25, showarrow = FALSE),	
      xaxis = list(showgrid = FALSE, title = "1-Specificity (false positive rate)"),	
      yaxis = list(showgrid = FALSE, title = "Sensitivity (true positive rate)"),	
      legend = list(orientation = 'h')	
    ) 	
}	
roc_curve(data.frame(class=test.y, scored.probability=prediction_probabilities[,2]))	
#' 	
#' 	
#' # Case-Studies	
#' 	
#' Let's demonstrate deep neural network regression-modeling and classification-prediction using several biomedical case-studies.	
#' 	
#' ## Schizophrenia Neuroimaging Study	
#' 	
#' [The SOCR Schizo Dataset is available here](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Oct2009_ID_NI).	
#' 	
#' 	
library("XML"); library("xml2")	
library("rvest");	
	
# Schizophrenia Data	
# UCLA Data is available here: 	
# wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Oct2009_ID_NI")	
# html_nodes(wiki_url, "#content")	
# SchizoData<- html_table(html_nodes(wiki_url, "table")[[2]])	
# UMich Data is available here	
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_Oct2009_ID_NI")	
html_nodes(wiki_url, "#content")	
SchizoData<- html_table(html_nodes(wiki_url, "table")[[1]])	
	
# View (SchizoData): Select an outcome response "DX"(3), "FS_IQ" (5)	
set.seed(1234)	
test.ind = sample(1:63, 10, replace = F)  # select 10/63 of cases for testing, train on remaining (63-10)/63 cases	
train.x = scale(data.matrix(SchizoData[-test.ind, c(2, 4:9)])) #, 11:66)]) # exclude outcome	
train.y = ifelse(SchizoData[-test.ind, 3] < 2, 0, 1) # Binarize the outcome, Controls=0	
test.x = scale(data.matrix(SchizoData[test.ind, c(2, 4:9)])) #, 11:66)])	
test.y = ifelse(SchizoData[test.ind, 3] < 2, 0, 1)	
	
# View(data.frame(test.x, test.y))	
# View(data.frame(train.x, train.y))	
	
model <- keras_model_sequential() 	
	
model %>% 	
  layer_dense(units = 256, activation = 'relu', input_shape = ncol(train.x)) %>% 	
  layer_dropout(rate = 0.4) %>% 	
  layer_dense(units = 128, activation = 'relu') %>%	
  layer_dropout(rate = 0.3) %>%	
  layer_dense(units = 64, activation = 'relu') %>%	
  layer_dropout(rate = 0.1) %>%	
  layer_dense(units = 32, activation = 'relu') %>%	
  layer_dropout(rate = 0.1) %>%	
  layer_dense(units = 2, activation = 'sigmoid')	
	
model %>% compile(	
  loss = 'binary_crossentropy',	
  optimizer = 'adam',	
  metrics = c('accuracy')	
)	
	
one_hot_labels <- to_categorical(train.y[,1])	
	
# Train the model, iterating on the data in batches of 25 samples	
history <- model %>% fit(	
  train.x, one_hot_labels, 	
  epochs = 100, 	
  batch_size = 5,	
  validation_split = 0.1	
)	
 	
# Evaluate model	
metrics <- model %>% evaluate(test.x, to_categorical(test.y[ , 1]))	
metrics	
	
plot(history)	
	
epochs <- 100	
time <- 1:epochs	
hist_df <- data.frame(time=time, loss=history$metrics$loss, acc=history$metrics$accuracy,	
                      valid_loss=history$metrics$val_loss, valid_acc=history$metrics$val_accuracy)	
	
plot_ly(hist_df, x = ~time)  %>%	
  add_trace(y = ~loss, name = 'training loss', type="scatter", mode = 'lines') %>%	
  add_trace(y = ~acc, name = 'training accuracy', type="scatter", mode = 'lines+markers') %>%	
  add_trace(y = ~valid_loss, name = 'validation loss', type="scatter",mode = 'lines+markers') %>%	
  add_trace(y = ~valid_acc, name = 'validation accuracy', type="scatter", mode = 'lines+markers') %>% 	
  layout(title="Schizophrenia Study NN Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
	
# Finally prediction of binary class labels and Confusion Matrix	
predictions <- model %>% predict(test.x) %>%  k_argmax()	
	
# We can also inspect the corresponding probabilities of the automated binary classification labels	
# prediction_probabilities <- model %>% predict(test.x) %>% `>`(0.5) %>% k_cast("int32")	
# prediction_probabilities$numpy()	
prediction_probabilities <- model %>% predict(test.x)	
prediction_probabilities	
	
table(factor(predictions$numpy()),factor(test.y))	
#' 	
#' 	
#' To get a visual representation of the deep learning network we can display the computation graph (this code is suppressed).	
#' 	
#' 	
# devtools::install_github("andrie/deepviz")	
library(deepviz)	
model %>% plot_model()	
#' 	
#' 	
#' 	
#' ## ALS regression example	
#' 	
#' The second example demonstrates a deep learning regression using the [ALS data]("https://umich.instructure.com/files/1789624/download?download_frd=1) to predict `ALSFRS_slope`. Note that in this case the clinical feature we are predicting, $Y=ALSFRS_{slope}$ (ALS functional rating scale progression over time), is a continuous outcome. Hence, we have a *regression problem*, which requires a different `keras` network formulation from the *categorical or binary classification problem* above.	
#' 	
#' In general, normalizing all data features ensures the model is scale- and range-invariant. Feature normalization may not be always necessary, but it helps with improving the network training and ensures the resulting network prediction is more robust. The function `tfdatasets::feature_spec()` provides tensorflow data normalization for tabular data.	
#' 	
#' 	
library(tfdatasets)	
	
als <- read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")	
ALSFRS_slope <- als[,7]	
als <- as.data.frame(als[,-c(1,7,94)])	
colnames(als)	
	
spec <- feature_spec(als, ALSFRS_slope ~ . ) %>% 	
  step_numeric_column(all_numeric(), normalizer_fn = scaler_standard()) %>% 	
  fit()	
spec	
#' 	
#' 	
#' 	
#' The `feature_spec` output *spec* is used together with the `keras::layer_dense_features()` method to directly perform pre-processing in the TensorFlow graph. We can take a look at the output of a dense-features layer created by the `feature_spec`, which is a matrix (2D tensor) with scaled values.	
#' 	
#' 	
layer <- layer_dense_features(feature_columns = dense_features(spec), dtype = tf$float32)	
# layer(als)	
#' 	
#' 	
#' Next, we design the network architecture model using the `feature_spec` API by passing the `dense_features` from the new *spec* object.	
#' 	
#' 	
input <- layer_input_from_dataset(als)	
	
output <- input %>% 	
  layer_dense_features(dense_features(spec)) %>% 	
  layer_dense(units = 256, activation = "relu") %>%	
  layer_dense(units = 128, activation = "relu") %>%	
  layer_dense(units = 64, activation = "relu") %>%	
  layer_dense(units = 16, activation = "relu") %>%	
  layer_dense(units = 1) 	
	
model <- keras_model(input, output)	
	
# summary(model)	
#' 	
#' 	
#' It's time to compile the deep network model and wrap it into a function `build_model()` that can be reused for different experiments. Remember that `keras::fit()` modifies the model in-place.	
#' 	
#' 	
model %>% compile(loss = "mse", optimizer = optimizer_rmsprop(), metrics = list("mean_absolute_error"))	
	
build_model <- function() {	
  input <- layer_input_from_dataset(als)	
  output <- input %>% 	
  layer_dense_features(dense_features(spec)) %>% 	
  layer_dense(units = 256, activation = "relu") %>%	
  layer_dense(units = 128, activation = "relu") %>%	
  layer_dense(units = 64, activation = "relu") %>%	
  layer_dense(units = 16, activation = "relu") %>%	
  layer_dense(units = 1) 	
	
  model <- keras_model(input, output)	
  	
  model %>% compile(loss = "mse", optimizer = optimizer_rmsprop(), metrics = list("mean_absolute_error"))	
  	
  model	
}	
#' 	
#' 	
#' Model training follows with 200 epochs where we record the training and validation accuracy in a *keras_training_history* object. For tracking the learning progress, we use a custom callback to replace the default training output at each epoch by a single dot (period) printed in the console .	
#' 	
#' 	
# Display training progress by printing a single dot for each completed epoch.	
print_dot_callback <- callback_lambda(	
  on_epoch_end = function(epoch, logs) {	
    if (epoch %% 80 == 0) cat("\n")	
    cat(".")	
  }	
)    	
	
model <- build_model()	
	
history <- model %>% fit(	
  x = als,	
  y = ALSFRS_slope,	
  epochs = 200,	
  validation_split = 0.2,	
  verbose = 0,	
  callbacks = list(print_dot_callback)	
)	
#' 	
#' 	
#' Let's visualize the model’s training data performance using the metrics stored in the history object. This graph provides clues to determine training duration and confirm model performance convergence.	
#' 	
#' This graph shows little improvement in the model after about 200 epochs. Let’s update the fit method to automatically stop training when the validation score doesn’t improve. We’ll use a callback that tests a training condition for every epoch. If a set number of epochs elapses without showing improvement, it automatically stops the training.	
#' 	
#' 	
#plot(history)	
	
epochs <- 200	
time <- 1:epochs	
hist_df <- data.frame(time=time, loss=history$metrics$loss, mae=history$metrics$mean_absolute_error,	
                      valid_loss=history$metrics$val_loss, valid_mae=history$metrics$val_mean_absolute_error)	
	
plot_ly(hist_df, x = ~time)  %>%	
  add_trace(y = ~loss, name = 'training loss', type="scatter", mode = 'lines') %>%	
  add_trace(y = ~mae, name = 'training MAE', type="scatter", mode = 'lines+markers') %>%	
  add_trace(y = ~valid_loss, name = 'validation loss', type="scatter", mode = 'lines+markers') %>%	
  add_trace(y = ~valid_mae, name = 'validation MAE', type="scatter", mode = 'lines+markers') %>% 	
  layout(title="ALS Study NN Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
	
# # Note that this graph shows little improvement in the model after about 100 epochs. 	
# # hence, we can update the fit method to automatically *stop training* when the validation score 	
# # doesn’t improve. The callback that tests a training condition for every epoch. 	
# # If a set amount of epochs elapses without showing improvement, it automatically stops the training.	
# 	
# # The patience parameter is the amount of epochs to check for improvement.	
# early_stop <- callback_early_stopping(monitor = "val_loss", patience = 20)	
# 	
# model <- build_model()	
# 	
# history <- model %>% fit(	
#   x = train_df %>% select(-label),	
#   y = train_df$label,	
#   epochs = 500,	
#   validation_split = 0.2,	
#   verbose = 0,	
#   callbacks = list(early_stop)	
# )	
	
# library(deepviz)	
# model %>% plot_model()	
#' 	
#' 	
#' 	
#' Have a look at the [Google TensorFlow API](http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle®Dataset=reg-plane&learningRate=0.03®ularizationRate=0&noise=0&networkShape=4,2&seed=0.81083&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false). It shows the importance of *learning rate* and the *number of rounds*. You should test different sets of parameters. 	
#' 	
#'  - Too small *learning rate* may lead to long computations.	
#'  - Too large *learning rate* may cause the algorithm to fail to converge, as large step size (learning rate) may by-pass the optimal solution and then oscillate or even diverge.	
#' 	
#' Finally, we can forecast and predict the ALSFRS_slope using data in the testing set:	
#' 	
#' 	
cases <- 100	
test_sample <- sample(1:dim(als)[1], size=cases)	
test_predictions <- model %>% predict(als[test_sample, ])	
# test_predictions[ , 1]	
	
print(paste0("Corr(real_ALSFRS_Slope, predicted_ALSFRS_Slope)=", 	
             round(cor(test_predictions, ALSFRS_slope[test_sample]), 3)))	
	
cases <- 1:cases	
hist_df <- data.frame(cases=cases, real=ALSFRS_slope[test_sample], predicted=test_predictions)	
corr1 <- round(cor(hist_df$real, hist_df$predicted), 2)	
	
plot_ly(hist_df, x = ~real)  %>%	
  add_trace(y = ~predicted, name = 'Scatter (Real vs. Predicted ALSFRS_Slope)', type="scatter", mode = 'markers') %>%	
  add_lines(x = ~real, y = ~fitted(lm(predicted ~ real, hist_df)), name="LM(Pred ~ Real)") %>% 	
  layout(title=paste0("ALS Study NN Model Prediction (ALSFRS-slope correlation=", corr1,")"),	
           legend = list(orientation = 'h'), yaxis=list(title="predicted"))	
#' 	
#' 	
#' 	
#' ## IBS Study	
#' Let's try another example using the [IBS NI Study data](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_April2011_NI_IBS_Pain). Again, we will use deep neural network learning to predict a categorical/binary classification label (diagnosis, DX).	
#' 	
#' 	
# IBS NI Data	
library(xml2)	
library(rvest)	
	
# UCLA Data	
wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_April2011_NI_IBS_Pain")	
IBSData<- html_table(html_nodes(wiki_url, "table")[[2]]) # table 2	
set.seed(1234)	
test.ind = sample(1:354, 50, replace = F)  # select 50/354 of cases for testing, train on remaining (354-50)/354 cases	
	
# UMich Data (includes MISSING data): use `mice` to impute missing data with mean: newData <- mice(data,m=5,maxit=50,meth='pmm',seed=500); summary(newData)	
# wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_April2011_NI_IBS_Pain")	
# IBSData<- html_table(html_nodes(wiki_url, "table")[[1]]) # load Table 1	
# set.seed(1234)	
# test.ind = sample(1:337, 50, replace = F)  # select 50/337 of cases for testing, train on remaining (337-50)/337 cases	
# summary(IBSData); IBSData[IBSData=="."] <- NA; newData <- mice(IBSData,m=5,maxit=50,meth='pmm',seed=500); summary(newData)	
	
html_nodes(wiki_url, "#content")	
	
# View (IBSData); dim(IBSData): Select an outcome response "DX"(3), "FS_IQ" (5)	
	
# scale/normalize all input variables	
IBSData <- na.omit(IBSData) 	
IBSData[,4:66] <- scale(IBSData[,4:66])  # scale the entire dataset	
train.x = data.matrix(IBSData[-test.ind, c(4:66)]) # exclude outcome	
train.y = IBSData[-test.ind, 3]-1	
test.x = data.matrix(IBSData[test.ind, c(4:66)])	
test.y = IBSData[test.ind, 3]-1	
	
train.y <- train.y$Group	
test.y <- test.y$Group	
	
# View(data.frame(test.x, test.y))	
# View(data.frame(train.x, train.y))	
	
model <- keras_model_sequential() 	
	
model %>% 	
  layer_dense(units = 256, activation = 'relu', input_shape = ncol(train.x)) %>% 	
  layer_dropout(rate = 0.4) %>% 	
  layer_dense(units = 128, activation = 'relu') %>%	
  layer_dropout(rate = 0.3) %>%	
  layer_dense(units = 64, activation = 'relu') %>%	
  layer_dropout(rate = 0.1) %>%	
  layer_dense(units = 32, activation = 'relu') %>%	
  layer_dropout(rate = 0.1) %>%	
  layer_dense(units = 16, activation = 'relu') %>%	
  layer_dropout(rate = 0.1) %>%	
  layer_dense(units = 2, activation = 'sigmoid')	
	
model %>% compile(	
  loss = 'binary_crossentropy',	
  optimizer = 'adam',	
  metrics = c('accuracy')	
)	
	
one_hot_labels <- to_categorical(train.y, num_classes = 2)	
	
# Train the model, iterating on the data in batches of 25 samples	
history <- model %>% fit(	
  train.x, one_hot_labels, 	
  epochs = 100, 	
  batch_size = 5,	
  validation_split = 0.1	
)	
 	
# Evaluate model	
metrics <- model %>% evaluate(test.x, to_categorical(test.y, num_classes = 2))	
metrics	
	
plot(history)	
	
epochs <- 100	
time <- 1:epochs	
hist_df <- data.frame(time=time, loss=history$metrics$loss, acc=history$metrics$accuracy,	
                      valid_loss=history$metrics$val_loss, valid_acc=history$metrics$val_accuracy)	
	
plot_ly(hist_df, x = ~time)  %>%	
  add_trace(y = ~loss, name = 'training loss', type="scatter", mode = 'lines') %>%	
  add_trace(y = ~acc, name = 'training accuracy', type="scatter", mode = 'lines+markers') %>%	
  add_trace(y = ~valid_loss, name = 'validation loss', type="scatter",mode = 'lines+markers') %>%	
  add_trace(y = ~valid_acc, name = 'validation accuracy', type="scatter", mode = 'lines+markers') %>% 	
  layout(title="IBS Study NN Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
	
# Finally prediction of binary class labels and Confusion Matrix	
predictions <- model %>% predict(test.x) %>%  k_argmax()	
table(factor(predictions$numpy()),factor(test.y))	
	
# We can also inspect the corresponding probabilities of the automated binary classification labels	
# prediction_probabilities <- model %>% predict(test.x)	
#' 	
#' 	
#' These results suggest that the DNN classification of IBS diagnosis is not good (at least under the specific network topology and training conditions).	
#' 	
#' 	
#' ## Country QoL Ranking Data	
#' 	
#' Another case-study we have seen before is the [country quality of life (QoL) dataset](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_2008_World_CountriesRankings). Let's try to fit a network model and use it to predict the overall QoL. This is another binary classification problem categorizing countries as either *developed* or *developing.*	
#' 	
#' 	
wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_2008_World_CountriesRankings")	
html_nodes(wiki_url, "#content")	
CountryRankingData<- html_table(html_nodes(wiki_url, "table")[[2]])	
	
# View (CountryRankingData); dim(CountryRankingData): Select an appropriate	
# outcome "OA": Overall country ranking (13)	
# Dichotomize outcome, Top-countries OA<20, bottom countries OA>=20	
set.seed(1234)	
test.ind = sample(1:100, 30, replace = F)  # select 15/100 of cases for testing, train on remaining 85/100 cases	
	
CountryRankingData[,c(8:12,14)] <- scale(CountryRankingData[,c(8:12,14)])	
# scale/normalize all input variables	
train.x = data.matrix(CountryRankingData[-test.ind, c(8:12,14)]) # exclude outcome	
train.y = ifelse(CountryRankingData[-test.ind, 13] < 50, 1, 0)	
test.x = data.matrix(CountryRankingData[test.ind, c(8:12,14)])	
test.y = ifelse(CountryRankingData[test.ind, 13] < 50, 1, 0) # developed (high OA rank) country	
	
model <- keras_model_sequential() 	
	
model %>% 	
  layer_dense(units = 16, activation = 'relu', input_shape = ncol(train.x)) %>% 	
  layer_dropout(rate = 0.4) %>% 	
  layer_dense(units = 4, activation = 'relu') %>%	
  layer_dropout(rate = 0.1) %>%	
  layer_dense(units = 2, activation = 'sigmoid')	
	
model %>% compile(	
  loss = 'binary_crossentropy',	
  optimizer = 'adam',	
  metrics = c('accuracy')	
)	
	
one_hot_labels <- to_categorical(train.y, num_classes = 2)	
	
# Train the model, iterating on the data in batches of 25 samples	
history <- model %>% fit(	
  train.x, one_hot_labels, 	
  epochs = 50, 	
  batch_size = 5,	
  validation_split = 0.1	
)	
 	
# Evaluate model	
metrics <- model %>% evaluate(test.x, to_categorical(test.y, num_classes = 2))	
metrics	
	
# plot(history)	
	
epochs <- 50	
time <- 1:epochs	
hist_df <- data.frame(time=time, loss=history$metrics$loss, acc=history$metrics$accuracy,	
                      valid_loss=history$metrics$val_loss, valid_acc=history$metrics$val_accuracy)	
	
plot_ly(hist_df, x = ~time)  %>%	
  add_trace(y = ~loss, name = 'training loss', type="scatter", mode = 'lines') %>%	
  add_trace(y = ~acc, name = 'training accuracy', type="scatter", mode = 'lines+markers') %>%	
  add_trace(y = ~valid_loss, name = 'validation loss', type="scatter", mode = 'lines+markers') %>%	
  add_trace(y = ~valid_acc, name = 'validation accuracy', type="scatter", mode = 'lines+markers') %>% 	
  layout(title="Country QoL Ranking NN Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
	
# Finally prediction of binary class labels and Confusion Matrix	
predictions <- model %>% predict(test.x) %>%  k_argmax()	
table(factor(predictions$numpy()),factor(test.y))	
	
# We can also inspect the corresponding probabilities of the automated binary classification labels	
# prediction_probabilities <- model %>% predict(test.x)	
#' 	
#' 	
#' Note that even a simple DNN network rapidly converges to an accurate model. 	
#' 	
#' ## Handwritten Digits Classification	
#' 	
#' In [Chapter 6 (ML, NN, SVM Classification)](https://socr.umich.edu/DSPA2/DSPA2_notes/06_ML_NN_SVM_RF_Class.html) we discussed Optical Character Recognition (OCR). Specifically, we analyzed handwritten notes (unstructured text) and converted it to printed text.	
#' 	
#' The [Modified National Institute of Standards and Technology (MNIST) database](https://yann.lecun.com/exdb/mnist/) includes a large handwritten digits imaging dataset with human annotated labels. Every digit is represented by a $28\times 28$ thumbnail image. You can download the training and testing data from [Kaggle](https://www.kaggle.com/c/digit-recognizer/data).	
#' 	
#' The `train.csv` and `test.csv` data files contain gray-scale images of hand-drawn digits, $0, 1, 2, \ldots, 9$. Each 2D image is $28\times 28$ in size and each of the $784$ pixels has a single pixel-intensity representing the lightness or darkness of that pixel (stored as a 1 byte integer $[0,255]$). Higher intensities correspond to darker pixels.  	
#' 	
#' The training data, `train.csv`, has 785 columns, where the first column, **label**, codes the actual digit drawn by the user. The remaining  $784$ columns contain the $28\times 28=784$ pixel-intensities of the associated 2D image. Columns in the training set have $pixel_K$ names, where $0\leq K\leq 783$. To reconstruct a 2D image out of each row in the training data we use this relation between pixel-index ($K$) and $X,Y$ image coordinates: 	
#' 	
#' $$K = Y \times 28 + X,$$	
#' where $0\leq X, Y\leq 27$. Thus, $pixel_K$ is located on row $Y$ and column $X$ of the corresponding 2D Image of size $28\times 28$. For instance, $pixel_{60=(2\times 28 + 4)} \longleftrightarrow (X=4,Y=2)$ represents the pixel on the 3-rd row and 5-th column in the image. Diagrammatically, omitting the "pixel" prefix, the pixels may be ordered to reconstruct the 2D image as follows:	
#' 	
#' Row  | Col0 | Col1 | Col2 | Col3 | Col4 | ... | Col26 | Co27	
#' -----|------|------|------|------|------|------|------|-----	
#' Row0 | 000 | 001 | 002 | 003 | 004 | ... | 026 | 027	
#' Row1 | 028 | 029 | 030 | 031 | 032 | ... | 054 | 055	
#' **Row2** | 056 | 057 | 058 | 059 | **060** | ... | 082 | 083	
#' RowK | ...|...|...|...|...|...|...|...| ...	
#' Row26| 728 | 729 | 730 | 731 | 732 | ... | 754 | 755	
#' Row27| 756 | 757 | 758 | 759 | 760 | ... | 782 | 783	
#' 	
#' 	
#' Note that the point-to-pixelID transformation ($K = Y \times 28 + X$) may easily be inverted as a pixelID-to-point mapping: $X= K\ mod\ 28$ (remainder of the integer division ($K/28$) and $Y=K %/%28$ (integer part of the division $K/28$)). For example:	
#' 	
#' 	
K <- 60	
X <- K %% 28 # X= K mod 28, remainder of integer division 60/28	
Y <- K%/%28  # integer part of the division	
# This validates that the application of both, the back and forth transformations, leads to an identity	
K; X; Y; Y * 28 + X	
#' 	
#' 	
#' The `test data` (test.csv) has the same organization as the training data, except that it does not contain the first **label** column. It includes 28,000 images and we can predict image labels that can be stored as $ImageId,Label$ pairs, which can be visually compared to the 2D images for validation/inspection.	
#' 	
#' 	
# train.csv	
pathToZip <- tempfile()	
download.file("https://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TrainingData.zip", pathToZip)	
train <- read.csv(unzip(pathToZip))	
dim(train)	
unlink(pathToZip)	
	
# test.csv	
pathToZip <- tempfile()	
download.file("https://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/DigitRecognizer_TestingData.zip", pathToZip)	
test <- read.csv(unzip(pathToZip))	
dim(test)	
unlink(pathToZip)	
	
train <- data.matrix(train)	
test <- data.matrix(test)	
	
train.x <- train[,-1]	
train.y <- train[,1]	
	
# Scaling will be discussed below	
train.x <- t(train.x/255)	
test <- t(test/255)	
	
# Note that you can also load the MNIST dataset (training & testing directly from keras)	
# mnist <- dataset_mnist()	
# train.x <- mnist$train$x	
# train.y <- mnist$train$y	
# test <- mnist$test$x	
# test.y <- mnist$test$y	
#' 	
#' 	
#' Let's look at some of these example images:	
#' 	
#' 	
library("imager")	
# first convert the CSV data (one row per image, 28,000 rows)	
array_3D <- array(test, c(28, 28, 28000))	
mat_2D <- matrix(array_3D[,,1], nrow = 28, ncol = 28)	
plot(as.cimg(mat_2D))	
	
# extract all N=28,000 images	
N <- 28000	
img_3D <- as.cimg(array_3D[,,], 28, 28, N)	
	
# plot the k-th image (1<=k<=N)	
k <- 5	
plot(img_3D, k)	
	
image_2D <- function(img,index)	
{	
    img[,,index,,drop=FALSE]	
}	
	
plot(image_2D(img_3D, 1))	
	
# Plot a collage of the first 4 images	
imappend(list(image_2D(img_3D, 1), image_2D(img_3D, 2), image_2D(img_3D, 3), image_2D(img_3D, 4)),"y") %>% plot	
	
# img <- image_2D(img_3D, 1)	
# for (i in 10:20)  { imappend(list(img, image_2D(img_3D, i)),"x") }	
#' 	
#' 	
#' In these CSV data files, each $28\times 28$ image is represented as a single row. Gray-scale images are $1$ byte, in the range $[0, 255]$, which we linearly transformed into $[0,1]$. Note that we only scale the $X$ input, not the output (labels). Also, we don't have manual gold-standard validation labels for the testing data, i.e., `test.y` is not available for the handwritten digits data.	
#' 	
#' 	
# We already scaled earlier	
# train.x <- t(train.x/255)	
# test <- t(test/255)	
#' 	
#' 	
#' Next, we can transpose the input matrix to $n\ (pixels) \times  m\ (examples)$, as the column major format required by the classifiers. The image labels are evenly distributed:	
#' 	
#' 	
table(train.y); prop.table(table(train.y))	
#' 	
#' 	
#' The majority class (`1`) in the training set includes 11.2% of the observations.	
#' 	
#' ### Configuring the Neural Network	
#' 	
#' The neural network model is trained by feeding the training data, i.e., *training images (train.x)* and *training labels (train.y)*. The network learns to associate specific images with concrete labels. Then, the network generates label predictions for (new) *testing images* that can be compared to the true labels of test-images (if these are available) or visually inspected to confirm correct auto-classification.	
#' 	
#' The `magrittr` package pipe operator, `%>%`, is commonly used for short-hand notation to allow left-and-right feed-and-assignment that can be interpreted as “do  then feed into and do ”.	
#' 	
#' Nodes and layers represent the basic building-blocks of all artificial neural networks. Both are generalizations of brain neuron and network data processing that effectively transform, compress, or filter the input data. *Inputs* go in a node or a layer, and *outputs* come out. Network layers learn to extract effective representations of the inputs that are coded as meaningful outputs that can be connected by chaining together multiple layers that progressive distill the information into compressed generic knowledge (patterns) that can be used to predict, forecast, classify, or model the mechanistic relations in the process. That is, we use data as a proxy of observable processes for which we don;t have explicit closed-form probability distribution models (typically complex multivariate processes).	
#' 	
#' The network below chains a pair of layers densely connected (i.e., fully connected neural layers). The `keras_model_sequential()` method specifies the network architecture before we start with training (i.e., estimating the weights). The *loss function* specifies how the network measures its performance on the	
#' training data to adjust the network weights (using train.x and train.y) to optimize the loss. The *optimizer* specifies the mechanism for updating the network weight coefficients using the training data relative to the specified loss function. Different metrics can be used to track the performance during the iterative training and testing process. For instance, accuracy represents the fraction of the images that were correctly classified. The second layer is a *10-way softmax layer* that returns a vector of 10 probability scalars (all positive and summing to 1) each representing the probability that the current hand-written image represents any of the 10 digits ($0, 1, 2, \cdots,9$). The `compile()` function modifies the network in place to specify the optimization strategy, the loss function and the assessment metric that will be used in the learning process.	
#' 	
#' In this network example, we chain two dense layers to each layer and apply simple tensor operations (tensor-dot-product/matrix multiplication and tensor addition) to the input data to estimate the weight parameter tensors, i.e., attributes of the layers encoding the persistent knowledge of the network. The `categorical_crossentropy` is the specific loss function that is optimized in the training phase to provide a feedback signal for learning the weight tensors. The loss optimization relies on mini-batch stochastic gradient descent, which is defined by the `rmsprop` optimizer argument.	
#' 	
#' 	
#' 	
network <- keras_model_sequential() %>%	
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%	
  layer_dense(units = 10, activation = "softmax")	
	
network %>% compile(optimizer = "rmsprop", loss = "categorical_crossentropy", metrics = c("accuracy"))	
#' 	
#' 	
#' Concatenating dense layers allows us to build a neural network whose depth is determined by the number of layers that are specified by a version of `layer_dense(units = 512, activation = "relu")`, which represents a function of the input (2D tensor) and output a different 2D tensor that may be fed as input tensor to the next layer. Let $ReLu(x)=\max(x, 0)$. $W$ and $b$ represent two of the attributes of the layer (trainable weight parameters of the layer), i.e., the 2D kernel tensor and the bias vector. Then the layer-output $O$ is:	
#' 	
#' $$O= ReLu( W\times Input + b).$$	
#' 	
#' Deep learning network models are represented as directed, acyclic graphs of layers. Often, these networks constitute a linear stack of layers mapping a single input to a single output. Different types of network layers are appropriate for different kinds of data tensors:	
#' 	
#'  - Simple vector data, stored in *2D tensors* of shape $(samples,features)$, are often modeled using densely connected layers, i.e., fully connected dense layers (`keras::layer_dense function()`). 	
#'  - Sequence data, stored in *3D tensors* of shape $(samples, timesteps, features)$, are typically modeled by recurrent layers such as `keras::layer_lstm()`. 	
#'  - Image data, stored in *4D tensors*, is usually processed by 2D convolution layers (`keras::layer_conv_2d()`).	
#'  	
#' 	
#' ### Training	
#' 	
#' We are ready to start the network training process. At the initialization step of the learning process, the weight matrices are filled with random values (random initialization). At the start, when $W$ and $b$ are random, the output `relu(W*Input) + b` is likely going to be meaningless. However, the subsequent iterative	
#' process optimizing the objective (loss) function will gradually adapt to these weights (training process) by repeating the following steps until certain stopping criterion is met:	
#' 	
#'  - Draw a (random) batch of training samples $x$ and their corresponding targets $y$.	
#'  - Forward pass: Run the network on $x$ to obtain predictions $y_{pred}$. 	
#'  - Estimate the loss of the network on the batch data assessing the mismatch between $y$ and $y_{pred}$.	
#'  - Update all weights ($W$ and $b$) of the network to reduce the overall loss on this batch.	
#' 	
#' Iterating this process eventually yields a network that has a low loss on its training data indicating good fidelity (match between predictions $y_{pred}$ and expected targets $y$). This reflects the network learning process progressed and accurately maps inputs to correct targets. See the [BPAD Mathematical Foundations Chapter](https://socr.umich.edu/BPAD/BPAD_notes/Biophysics430_Chap01_MathFoundations.html) for more information on calculus of differentiation and integration, gradient minimization, and loss optimization. 	
#' 	
#' **Stochastic gradient descent (SGD)** is a powerful function optimization strategy for differentiable multivariate functions. Recall that function's extrema are attained at points where the derivative (gradient, $\nabla (f)$) is trivial ($0$) or at the domain boundary. Hence, to minimize the loss, we need to find all points (parameter vectors/tensors) that correspond to trivial derivatives/gradients of the objective function $f$. Then, pick the parameter vectors/tensors/points leading to the smallest values of the loss. In neural network learning, this means analytically finding the combination of weight values corresponding to the smallest possible loss values. 	
#' 	
#' This optimization is achieved at $W_o$ when $\nabla (f)(W_o) = 0$. Often, this gradient equation is a polynomial equation of $N$ parameters (variables) corresponding to the number of coefficients ($W$ and $b$) in the network. For large networks (with millions of parameters), this optimization is difficult. An approximate solution may be derived using alternative numerical solutions. This involves incrementally modifying the parameters and assuming the loss function is differentiable. We can then compute its gradient, which points the direction of the fastest growth or decay of the objective function.	
#' 	
#'  - Draw a (random) batch of training samples $x$ and their corresponding targets $y$.	
#'  - Forward pass: Run the network on $x$ to obtain predictions $y_{pred}$. 	
#'  - Estimate the loss of the network on the batch data assessing the mismatch between $y$ and $y_{pred}$.	
#'  - Compute the gradient of the loss $\nabla (f)$ with regard to the network’s parameters (a backward pass).	
#'  - Slightly update/adjust the parameters in the opposite direction of the gradient, e.g., $W = W-(step \times gradient)$, which reduces the loss function value on the batch data.	
#'  	
#' In practice, neural network learning depends on chaining many tensor operations. For instance, a network $f$ composed of three tensor operations $a$, $b$, and $c$, with weight matrices $W_1$, $W_2$, and $W_3$ can be expressed as $f(W_1, W_2, W_3) = a(W_1, b(W_2, c(W_3)))$. The chain-rule for differentiation yields that  $f(g(x)) = f'(g(x)) \times g'(x)$ leads to a corresponding neural network optimization algorithm (*backpropagation*), which starts with the final loss value and works backward from the top layers to the bottom layers, sequentially applying the chain rule to compute the contribution of each parameter to the aggregate loss value.	
#' 	
#' 	
#' 	
train_images <- t(train.x)  # (42000, 28 * 28))	
test_images <- t(test)    # (28000, 28 * 28))	
# Note: we can also use the `array_reshape()` tensor-reshaping function to reshape the array.	
	
# categorically encode the training-image labels	
train_labels <- to_categorical(train.y)	
#' 	
#' 	
#' Let's now train (fit or estimate) the neural network model using `keras`. In general, the first mode (axis) in the data tensors is typically the sample axis (sample dimension). Often, it's difficult to process all data at the same time, breaking the data into small **batches**, e.g., batch_size=128, allows for more effective, efficient, and tractable processing (learning). The MNIST tensor consists of images, saved as 3D color arrays indexed by height, width, and depth, where gray-scale images (like the MNIST digits) have only one color channel. In general, image tensors are always 3D. Hence, a batch of 128 gray-scale images of size $256\times 256$ is stored in a tensor of *shape* $(128, 256, 256, 1)$, whereas a batch of 128 color (RGB) images is stored as a $(128, 256, 256, 3)$ tensor.	
#' 	
#' 	
network %>% fit(train_images, train_labels, epochs = 10, batch_size = 128)	
#' 	
#' 	
#' Invoking the method `fit()` launches the iterative network learning on the training data using mini-batches	
#' of 128 samples. Each iteration over all the training data is called an **epoch**. Here we use epoch=10 to indicate looping 10 times over. At each iteration, the network computes the gradients of the weights with regard to the loss on the batch and updates the tensor weights. After completing 10 epochs, the network learning performed 3,290 gradient updates (329 per epoch), which progressively reduced the loss of the network from $10^{-2}$ to $10^{-4}$. This low loss indicates the network learned to classify handwritten digits with high accuracy (0.99).	
#' 	
#' During the training process, two graphs are dynamically shown that illustrate the parity between the network loss function (expected to decrease) and the accuracy of the network, using the training data.	
#' Note that the accuracy approaches 0.99, but remember, this is training-data sample-accuracy, which is biased. 	
#' To get a more realistic performance estimate, we can test the model on an independent set of 10,000 testing data images.	
#' 	
#' 	
# Load and preprocess the testing data	
mnist <- dataset_mnist()	
test_images <- mnist$test$x	
test_labels <- mnist$test$y	
dim(test_images)    # [1] 10000    28    28	
length(test_labels) # [1] 10000	
test_images <- array_reshape(test_images, c(10000, 28 * 28))	
test_images <- test_images / 255	
test_labels <- to_categorical(test_labels)	
	
metrics <- network %>% evaluate(test_images, test_labels)	
metrics	
# metrics 	
#       loss   accuracy 	
# 0.04423446 0.98860002 	
#' 	
#' 	
#' The testing data accuracy is 0.9886, on par with the training data performance, which indicates no evidence of overfitting.	
#' 	
#' 	
#' ### Forecasting	
#' 	
#' Next, we will generate forecasting using the model on testing data and evaluate the prediction performance. The `preds` matrix has $28,000$ rows and $10$ columns, containing the desired classification probabilities from the `output layer` of the neural net. To extract the maximum label for each row, we can use the `max.col`:	
#' 	
#' 	
pred.label <- network %>% predict(test_images) %>%  k_argmax()	
heat <- table(factor(pred.label$numpy()),factor(mnist$test$y))	
keys = c(0:9)	
plot_ly(x =~keys, y = ~keys, z = ~matrix(heat, 10,10), name="NN Model Performance",	
        hovertemplate = paste('Matching: %{z:.0f}', 	
                              '
True: %{x}
', 'Pred: %{y}'),	
        colors = 'Reds', type = "heatmap") %>% 	
  layout(title="MNIST Predicated Number Classes vs. True Labels", 	
         xaxis=list(title="Actual Class"), yaxis=list(title="Predicted Class"))	
#' 	
#' 	
#' The predictions are stored in a 1D $28,000(rows)$ vector, including the predicted classification labels generated by the network output layer.	
#' 	
#' 	
# For example, the ML-classification labels assigned to the first 7 images (from the 28,000 testing data collection) are:	
pred.label <- pred.label$numpy()	
	
head(pred.label, n = 7L)	
library(knitr)	
kable(head(pred.label, n = 7L), format = "markdown", align='c')	
	
label.names <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")	
	
#initialize a list of m=7 images from the N=28,000 available images	
m_start <- 4	
m_end <- 10	
if (m_end <= m_start) 	
  { m_end = m_start+1 }  # check that m_end > m_start	
	
label_Ypositons <- vector()  # initialize the array of label positions on the plot	
for (i in m_start:m_end) {	
  if (i==m_start) { 	
    img1 <- as.cimg(test_images[m_start,], 28, 28)	
    # img1 <- image_2D(img_3D, m_start) 	
  }	
  else img1 <- imappend(list(img1, as.cimg(test_images[i,], 28, 28)),"y")	
  # else img1 <- imappend(list(img1, image_2D(img_3D, i)),"y")	
  label.names[i+1-m_start] <- pred.label[i]	
  label_Ypositons[i+1-m_start] <- 15 + 28*(i-m_start)	
}	
	
plot(img1, axes=FALSE)	
text(40, label_Ypositons, labels=label.names[1:(m_end-m_start)], cex= 1.2, col="blue")	
mtext(paste((m_end+1-m_start), " Random Images \n Indices (m_start=", m_start, " : m_end=", m_end, ")"), side=2, line=-6, col="black")	
mtext("NN Classification Labels", side=4, line=-5, col="blue") 	
#' 	
#' 	
#' 	
#' ### Examining the Network Structure	
#' 	
#' There are a variety of network topologies, e.g., two-branch networks, multi-head networks, inception blocks, etc., that encode the *a priori* hypothesis space of predefined possibilities. Specifying the network topology constrains the space of possibilities to a specific series of tensor operations that map input data onto outputs. Then, the learning only searches for a good set of network parameter values (the weight tensors involved in these tensor operations). Specifying the network architecture in advance is as much an art as it is science.	
#' 	
#' There are two main strategies to define an a priori network topology model. Linear stacks of network layers are specified using the `keras::keras_model_sequential()` method, whereas functional API's provide interfaces for specifying directed acyclic graph (DAG) layer networks with more flexible architectures. Functional network APIs facilitate managing the data tensors processed by the model as well as applying layers to tensors just as though the layers are abstract functions. In the compilation step below, we configure the learning process by specifying the model optimizer and loss functions, along with the metrics for tracking the iterative learning process.	
#' 	
#' 	
#' 	
#  Linear stacks of network layers	
network <- keras_model_sequential() %>%	
  layer_dense(units = 512, activation = "relu", input_shape = c(28 * 28)) %>%	
  layer_dense(units = 10, activation = "softmax")	
	
# vs. functional API network (DAG)	
input_tensor  <- layer_input(shape = c(784))	
output_tensor <- input_tensor %>%	
  layer_dense(units = 32, activation = "relu") %>%	
  layer_dense(units = 10, activation = "softmax")	
model <- keras_model(inputs = input_tensor, outputs = output_tensor)	
	
model %>% compile(	
    optimizer = optimizer_adam(),	
    loss = "mse",	
    metrics = c("accuracy")	
)	
	
model %>% fit(train_images, train_labels, epochs = 10, batch_size = 128)	
#' 	
#' 	
#' ### Model Validation	
#' 	
#' We can use *accuracy* to track the performance of the NN training during the learning process on new (prospective) data.	
#' 	
#' 	
val_indices <- sample(1:dim(train_images)[1], size=10000)  # randomly choose 10K images	
x_val <- train_images[val_indices, ]	
y_val <- to_categorical(train.y[val_indices])	
	
partial_x_train <- train_images[-val_indices, ]	
partial_y_train <- to_categorical(train.y[-val_indices])	
	
# train the model for 20 iterations over all samples in the x_train and y_train tensors (20 epochs), 	
# using mini-batches of 512 samples and track the loss and accuracy on the 10,000 validation samples 	
	
model %>% compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = c("accuracy"))	
history <- model %>% fit(partial_x_train, partial_y_train, epochs = 20, batch_size = 512, 	
                         validation_data = list(x_val, y_val) )	
tic <- proc.time()	
print(paste0("Total Compute Time: ", proc.time() - tic))	
	
library(plotly)	
epochs <- 20	
time <- 1:epochs	
hist_df <- data.frame(time=time, loss=history$metrics$loss, acc=history$metrics$accuracy, 	
                      valid_loss=history$metrics$val_loss, valid_acc=history$metrics$val_accuracy)	
	
plot_ly(hist_df, x = ~time)  %>% 	
  add_trace(y = ~loss, name = 'training loss', mode = 'lines') %>% 	
  add_trace(y = ~acc, name = 'training accuracy', mode = 'lines+markers') %>% 	
  add_trace(y = ~valid_loss, name = 'validation loss',mode = 'lines+markers') %>% 	
  add_trace(y = ~valid_acc, name = 'validation accuracy', mode = 'lines+markers') %>% 	
  layout(title="MNIST Digits NN Model Performance",	
           legend = list(orientation = 'h'), yaxis=list(title="metric"))	
	
# Finally prediction of MNIST testing image classification (auto-labeling):	
pred.label <- model %>% predict(t(test))	
for (i in 1:9) {	
  print(sprintf("NN predicted Label for image %d is %s", i, which.max(pred.label[i,])-1))	
}	
	
array_3D <- array(t(test), c(28000, 28, 28))	
plot(as.cimg(array_3D[1,,], nrow = 28, ncol = 28))	
	
#initialize a list of m=9 testing images from the N=28,000 available images	
m_start <- 1	
m_end <- 9	
label_Ypositons <- vector()  # initialize the array of label positions on the plot	
for (i in m_start:m_end) {	
  if (i==m_start) { 	
    img1 <- as.cimg(array_3D[1,,], nrow = 28, ncol = 28)	
  }	
  else img1 <- imappend(list(img1, as.cimg(array_3D[i,,], nrow = 28, ncol = 28)), "y")	
  label.names[i] <- which.max(pred.label[i,])-1	
  label_Ypositons[i+1-m_start] <- 15 + 28*(i-m_start)	
}	
	
plot(img1, axes=FALSE)	
text(40, label_Ypositons, labels=label.names[1:(m_end-m_start+1)], cex= 1.2, col="blue")	
mtext(paste((m_end+1-m_start), " Random Images \n Indices (m_start=", m_start, " : m_end=", m_end, ")"), side=2, line=-6, col="black")	
mtext("NN Classification Labels", side=4, line=-5, col="blue") 	
#' 	
#' 	
#' Note that the `keras::predict()` method only works with *Sequential* network models. However, when using the functional API network model we need to use the `keras::predict()` method to obtain a vector of probabilities and then get the *argmax* of this vector to find the most likely class label for the image.	
#' 	
#' # Classifying Real-World Images using *Tensorflow* and *Keras*	
#' 	
#' A real-world example of deep learning is the classification of 2D images (pictures) or 3D volumes (e.g., neuroimages).	
#' 	
#' We will demonstrate the use of **pre-trained** network models ([resnet50](https://keras.io/api/applications/resnet/), [vgg16](https://keras.io/api/applications/vgg/), and [vgg19](https://keras.io/api/applications/vgg/)) to predict the class-labels of real world images. There are [dozens of pre-trained models that are made available to the entire community](https://keras.io/api/applications/). These advanced Deep Network models yield state-of-the-art predictions that accurately label different types of 2D images. We will use the `keras` and `tensorflow` packages to load the pre-trained network models and classify the images, along with the `imager` package to load and preprocess raw images in `R`.	
#' 	
#' 	
# install.packages("imager")	
library(imager)	
#' 	
#' 	
#' ## Load the Pre-trained Model	
#' 	
#' You can [download, unzip. and examine this pre-trained model](https://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/data/Inception.zip). There are many different types of pre-trained deep neural network models, e.g., 	
#' 	
#'  - [Resnet50: Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385),	
#'  - [VGG16: Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556), by Oxford's Visual Geometry Group, and	
#'  - [VGG19: Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556).	
#' 	
#' The VGG's are deep convolutional networks, trained to classify images, with VGG19 model layers comprised of:	
#' 	
#'  -  Conv3x3 (64)	
#'  -  Conv3x3 (64)	
#'  -  *MaxPool*	
#'  -  Conv3x3 (128)	
#'  -  Conv3x3 (128)	
#'  -  *MaxPool*	
#'  -  Conv3x3 (256)	
#'  -  Conv3x3 (256)	
#'  -  Conv3x3 (256)	
#'  -  Conv3x3 (256)	
#'  -  *MaxPool*	
#'  -  Conv3x3 (512)	
#'  -  Conv3x3 (512)	
#'  -  Conv3x3 (512)	
#'  -  Conv3x3 (512)	
#'  -  *MaxPool*	
#'  -  Conv3x3 (512)	
#'  -  Conv3x3 (512)	
#'  -  Conv3x3 (512)	
#'  -  Conv3x3 (512)	
#'  -  *MaxPool*	
#'  -  Fully Connected (4096)	
#'  -  Fully Connected (4096)	
#'  -  Fully Connected (1000)	
#'  -  *SoftMax*	
#' 	
#' More information about the [VGG architecture is available online](https://iq.opengenus.org/vgg19-architecture/).	
#' 	
#' 	
#' ## Load and Preprocess a New Image	
#' 	
#' To classify a new image, start with selecting and importing the image into R. Below, we show the classifications of several different types of images.	
#' 	
#' 	
library("imager")	
library("EBImage")	
library("keras")	
	
# Check system python config	
# reticulate::py_config()	
	
# One should be able to load the image directly from the web (but sometimes there may be problems, in which case, we need to first download the image and then load it in R:	
# im <- imager::load.image("https://wiki.socr.umich.edu/images/6/69/DataManagementFig1.png")	
	
# download file to local working directory, use "wb" mode to avoid problems	
download.file("https://wiki.socr.umich.edu/images/6/69/DataManagementFig1.png", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')	
	
# report download image path	
paste(getwd(),"results/image.png", sep="/")	
	
img <- image_load(paste(getwd(),"results/image.png", sep="/"), target_size = c(224,224)) 	
# dim(image_to_array(img))  # [1] 1084 1875    3	
img <- rgbImage(red=t(image_to_array(img)[,,1]/255), 	
                          green=t(image_to_array(img)[,,2]/255), blue=t(image_to_array(img)[,,3]/255))	
display(img)	
#' 	
#' 	
#' Before feeding the image to the deep learning network for classification, we *may* need to do some preprocessing to make it fit the network input requirements. This image preprocessing (e.g., cropping, intensity mean-centralization and scaling, etc.) can be done manually in `R`. For example below is an instance of an image-preprocessing function. In practice, we can also use the function `keras::imagenet_preprocess_input()`.	
#' 	
#' 	
preproc.image <-function(im) {	
  # crop the image	
     mean.img <- mean(im)	
     shape <- dim(im)	
     # short.edge <- min(shape[1:2])	
     # xx <- floor((shape[1] - short.edge) / 2)	
     # yy <- floor((shape[2] - short.edge) / 2)	
     # resize to 224 x 224, needed by input of the model.	
     resized <- resize(im, 224, 224)	
     plot(resized)	
     # Reshape to format (width, height, channel, num)	
     dim(resized) <- c(224, 224, 3, 1)	
     return(resized)	
}	
#' 	
#' 	
#' Here is an example of calling the preprocessing function to obtain a conforming (normalized) image ready for auto-classification.	
#' 	
#' 	
normed <- preproc.image(img)	
# plot(normed)	
#' 	
#' 	
#' ## Image Classification	
#' 	
#' Use the `predict()` function to get the probability estimates over all (learned) classes and classify the image type.	
#' 	
#' 	
# get info about local version of Python installation	
reticulate::py_config()	
# The first time you run this install Pillow!	
# tensorflow::install_tensorflow(extra_packages='pillow')	
	
# Preprocess input image	
x <- image_to_array(img)	
# ensure we have a 4d tensor with single element in the first (batch) dimension,	
# then preprocess the input for prediction using resnet50	
x <- array_reshape(x, c(1, dim(x)))	
x <- imagenet_preprocess_input(x)   # note centralization: display(100*(x[1,,,1]+103))	
	
# Specify and compare Predictions based on different Pre-trained Models	
# Model 1: resnet50	
model_resnet50 <- application_resnet50(weights = 'imagenet')	
	
# make predictions then decode and print them	
preds_resnet50 <- model_resnet50 %>% predict(x)	
imagenet_decode_predictions(preds_resnet50, top = 10)	
	
# Model2: VGG19	
model_vgg19 <- application_vgg19(weights = 'imagenet')	
preds_vgg19 <- model_vgg19 %>% predict(x)	
imagenet_decode_predictions(preds_vgg19, top = 10)[[1]]	
	
# Model 3: VGG16	
model_vgg16 <- application_vgg16(weights = 'imagenet')	
preds_vgg16 <- model_vgg16 %>% predict(x)	
imagenet_decode_predictions(preds_vgg16, top = 10)[[1]]	
	
dim(preds_vgg16)	
#' 	
#' 	
#' The `prob` prediction generates a $1000 \times 1$ array representing the (vector) of probabilities reflecting the likelihood that the input image resembles (is classified as) each of the top 1,000 known image categories. We can report the indices of the top-10 closest image classes to the input image.	
#' 	
#' Clearly, this US weather pattern image is not well classified by either of the three different deep networks. The optimal predictions include *television*, *digital_clock*, *theater_curtain*, etc., however, the prediction confidence is very low, $Prob< 0.052$. None of the other top-10 class-labels capture the type of this weather-pattern image.	
#' 	
#' ## Additional Image Classification Examples	
#' 	
#' The machine learning image classification results won't always be this poor. Let's try classifying several alternative images.	
#' 	
#' ###  Lake Mapourika, New Zealand	
#' 	
#' Let's try the automated image classification of this lakeside panorama.	
#' 	
#' 	
# load the image	
download.file("https://upload.wikimedia.org/wikipedia/commons/2/23/Lake_mapourika_NZ.jpeg", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')	
img <- image_load(paste(getwd(),"results/image.png", sep="/"), target_size = c(224,224))	
	
imgRGB <- rgbImage(red=t(image_to_array(img)[,,1]/255), 	
                          green=t(image_to_array(img)[,,2]/255), blue=t(image_to_array(img)[,,3]/255))	
display(imgRGB)	
	
# Preprocess input image	
x <- image_to_array(img)	
# ensure we have a 4d tensor with single element in the batch dimension,	
# the preprocess the input for prediction using resnet50	
x <- array_reshape(x, c(1, dim(x)))	
x <- imagenet_preprocess_input(x)	
	
# Specify and compare Predictions based on different Pre-trained Models	
# Model 1: resnet50	
model_resnet50 <- application_resnet50(weights = 'imagenet')	
	
# make predictions then decode and print them	
preds_resnet50 <- model_resnet50 %>% predict(x)	
imagenet_decode_predictions(preds_resnet50, top = 10)	
	
# Model2: VGG19	
model_vgg19 <- application_vgg19(weights = 'imagenet')	
preds_vgg19 <- model_vgg19 %>% predict(x)	
imagenet_decode_predictions(preds_vgg19, top = 10)[[1]]	
	
# Model 3: VGG16	
model_vgg16 <- application_vgg16(weights = 'imagenet')	
preds_vgg16 <- model_vgg16 %>% predict(x)	
imagenet_decode_predictions(preds_vgg16, top = 10)[[1]]	
	
dim(preds_vgg16)	
#' 	
#' 	
#' This photo does represent a lakeside, which is reflected by the top three class labels:	
#' 	
#' * Model 1 (resnet50): lakeside,  boathouse, dock, breakwater.	
#' * Model 1 (VGG19): lakeside, breakwater, boathouse, dock.	
#' * Model 1 (VGG16): lakeside, breakwater, dock, canoe.	
#' 	
#' ###  Beach 	
#' 	
#' Another coastal boundary between water and land is represented in this [beach image](https://upload.wikimedia.org/wikipedia/commons/9/90/Holloways_beach_1920x1080.jpg).	
#' 	
#' 	
download.file("https://upload.wikimedia.org/wikipedia/commons/9/90/Holloways_beach_1920x1080.jpg", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')	
img <- image_load(paste(getwd(),"results/image.png", sep="/"), target_size = c(224,224))	
	
imgRGB <- rgbImage(red=t(image_to_array(img)[,,1]/255), 	
                          green=t(image_to_array(img)[,,2]/255), blue=t(image_to_array(img)[,,3]/255))	
display(imgRGB)	
	
# Preprocess input image	
x <- image_to_array(img)	
# ensure we have a 4d tensor with single element in the batch dimension,	
# the preprocess the input for prediction using resnet50	
x <- array_reshape(x, c(1, dim(x)))	
x <- imagenet_preprocess_input(x)	
	
# Specify and compare Predictions based on different Pre-trained Models	
# Model 1: resnet50	
model_resnet50 <- application_resnet50(weights = 'imagenet')	
	
# make predictions then decode and print them	
preds_resnet50 <- model_resnet50 %>% predict(x)	
imagenet_decode_predictions(preds_resnet50, top = 10)	
	
# Model2: VGG19	
model_vgg19 <- application_vgg19(weights = 'imagenet')	
preds_vgg19 <- model_vgg19 %>% predict(x)	
imagenet_decode_predictions(preds_vgg19, top = 10)[[1]]	
	
# Model 3: VGG16	
model_vgg16 <- application_vgg16(weights = 'imagenet')	
preds_vgg16 <- model_vgg16 %>% predict(x)	
imagenet_decode_predictions(preds_vgg16, top = 10)[[1]]	
#' 	
#' 	
#' This photo was classified appropriately and with high-confidence as:	
#' 	
#' * sandbar.                                      	
#' * lakeside.                                 	
#' * seashore.	
#'  	
#' ###  Volcano	
#' 	
#' Here is another natural image representing the [Mount St. Helens Vocano](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg/1200px-MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg).	
#' 	
#' 	
download.file("https://upload.wikimedia.org/wikipedia/commons/thumb/d/dc/MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg/1200px-MSH82_st_helens_plume_from_harrys_ridge_05-19-82.jpg", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')	
img <- image_load(paste(getwd(),"results/image.png", sep="/"), target_size = c(224,224))	
	
imgRGB <- rgbImage(red=t(image_to_array(img)[,,1]/255), 	
                          green=t(image_to_array(img)[,,2]/255), blue=t(image_to_array(img)[,,3]/255))	
display(imgRGB)	
	
# Preprocess input image	
x <- image_to_array(img)	
# ensure we have a 4d tensor with single element in the batch dimension,	
# the preprocess the input for prediction using resnet50	
x <- array_reshape(x, c(1, dim(x)))	
x <- imagenet_preprocess_input(x)	
	
# Specify and compare Predictions based on different Pre-trained Models	
# Model 1: resnet50	
model_resnet50 <- application_resnet50(weights = 'imagenet')	
	
# make predictions then decode and print them	
preds_resnet50 <- model_resnet50 %>% predict(x)	
imagenet_decode_predictions(preds_resnet50, top = 10)	
	
# Model2: VGG19	
model_vgg19 <- application_vgg19(weights = 'imagenet')	
preds_vgg19 <- model_vgg19 %>% predict(x)	
imagenet_decode_predictions(preds_vgg19, top = 10)[[1]]	
	
# Model 3: VGG16	
model_vgg16 <- application_vgg16(weights = 'imagenet')	
preds_vgg16 <- model_vgg16 %>% predict(x)	
imagenet_decode_predictions(preds_vgg16, top = 10)[[1]]	
#' 	
#' 	
#' The predicted top class labels for this image are perfect:	
#' 	
#'  - volcano, alps.	
#'  - mountain_tent.	
#'  - geyser.	
#' 	
#' ### Brain Surface	
#' 	
#' The next image represents a 2D snapshot of 3D shape reconstruction of a brain cortical surface. This image is particularly difficult to automatically classify because (1) few people have ever seen a real brain, (2) the mathematical and computational models used to obtain the 2D manifold representing the brain surface do vary, and (3) the patterns of sulcal folds and gyral crests are quite inconsistent between people.	
#' 	
#' 	
download.file("https://wiki.socr.umich.edu/images/e/ea/BrainCortex2.png", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')	
im <- load.image(paste(getwd(),"results/image.png", sep="/"))	
	
download.file("https://wiki.socr.umich.edu/images/e/ea/BrainCortex2.png", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')	
img <- image_load(paste(getwd(),"results/image.png", sep="/"), target_size = c(224,224))	
	
imgRGB <- rgbImage(red=t(image_to_array(img)[,,1]/255), 	
                          green=t(image_to_array(img)[,,2]/255), blue=t(image_to_array(img)[,,3]/255))	
display(imgRGB)	
	
# Preprocess input image	
x <- image_to_array(img)	
# ensure we have a 4d tensor with single element in the batch dimension,	
# the preprocess the input for prediction using resnet50	
x <- array_reshape(x, c(1, dim(x)))	
x <- imagenet_preprocess_input(x)	
	
# Specify and compare Predictions based on different Pre-trained Models	
# Model 1: resnet50	
model_resnet50 <- application_resnet50(weights = 'imagenet')	
	
# make predictions then decode and print them	
preds_resnet50 <- model_resnet50 %>% predict(x)	
imagenet_decode_predictions(preds_resnet50, top = 10)	
	
# Model2: VGG19	
model_vgg19 <- application_vgg19(weights = 'imagenet')	
preds_vgg19 <- model_vgg19 %>% predict(x)	
imagenet_decode_predictions(preds_vgg19, top = 10)[[1]]	
	
# Model 3: VGG16	
model_vgg16 <- application_vgg16(weights = 'imagenet')	
preds_vgg16 <- model_vgg16 %>% predict(x)	
imagenet_decode_predictions(preds_vgg16, top = 10)[[1]]	
#' 	
#' 	
#' The top class labels for the brain image are:	
#' 	
#'  - brain, coral.	
#'  - mask, knot.	
#'  - cauliflower.	
#'  - acorn.	
#' 	
#' Imagine if we can train a brain image classifier that labels individuals (volunteers or patients) solely based on their brain scans into different classes reflecting their development, clinical phenotypes, disease states, or aging profiles. This will require a substantial amount of expert-labeled brain scans, model training and extensive validation. However any progress in this direction will lead to effective computational clinical decision support systems that can assist physicians with diagnosis, tracking, and prognostication of brain growth and aging in health and disease. 	
#' 	
#' ## Face mask: synthetic face image	
#' 	
#' We can also try the deep learning methods to see if they can uncover (recover) the core deterministic model or structure used to generate designed, synthetic, or simulated images.	
#' 	
#' This example is a synthetic computer-generated image representing a cartoon face or a mask.	
#' 	
#' 	
download.file("https://wiki.socr.umich.edu/images/f/fb/FaceMask1.png", paste(getwd(),"results/image.png", sep="/"), mode = 'wb')	
img <- image_load(paste(getwd(),"results/image.png", sep="/"), target_size = c(224,224))	
	
imgRGB <- rgbImage(red=t(image_to_array(img)[,,1]/255), 	
                          green=t(image_to_array(img)[,,2]/255), blue=t(image_to_array(img)[,,3]/255))	
display(imgRGB)	
	
# Preprocess input image	
x <- image_to_array(img)	
# ensure we have a 4d tensor with single element in the batch dimension,	
# the preprocess the input for prediction using resnet50	
x <- array_reshape(x, c(1, dim(x)))	
x <- imagenet_preprocess_input(x)	
	
# Specify and compare Predictions based on different Pre-trained Models	
# Model 1: resnet50	
model_resnet50 <- application_resnet50(weights = 'imagenet')	
	
# make predictions then decode and print them	
preds_resnet50 <- model_resnet50 %>% predict(x)	
imagenet_decode_predictions(preds_resnet50, top = 10)	
	
# Model2: VGG19	
model_vgg19 <- application_vgg19(weights = 'imagenet')	
preds_vgg19 <- model_vgg19 %>% predict(x)	
imagenet_decode_predictions(preds_vgg19, top = 10)[[1]]	
	
# Model 3: VGG16	
model_vgg16 <- application_vgg16(weights = 'imagenet')	
preds_vgg16 <- model_vgg16 %>% predict(x)	
imagenet_decode_predictions(preds_vgg16, top = 10)[[1]]	
#' 	
#' 	
#' The top class labels for the face mask are:	
#' 	
#'  - comic_book, mask.	
#'  - analog_clock.	
#'  - shield, abaya.	
#' 	
#' You can easily test the same image classifier on your own images and identify classes of pictures that are either well or poorly classified by the deep learning based machine learning model.	
#' 	
#' # Data Generation:  simulating synthetic data	
#' 	
#' ## Fractal shapes	
#' 	
#' One way to design fractal shapes is using [iterated function systems (IFS)](https://en.wikipedia.org/wiki/Iterated_function_system). each IFS is represented by finite set of [contraction mappings](https://en.wikipedia.org/wiki/Contraction_mapping) acting on complete metric spaces:	
#' 	
#' $$\{f_i : X \rightarrow X ∣ i=1,2,... , N\} , N \in \mathbb {N}.$$	
#' In the case of 2D sets and images, linear and contracting IFS's can be represented as linear operators:	
#' $$f(x,y) = A \begin{bmatrix} x \\ 	
#'   y \end{bmatrix}	
#'   +	
#' \begin{bmatrix} e \\ 	
#'   f 	
#' \end{bmatrix}	
#' =	
#' \begin{bmatrix} 	
#'   a & b \\ 	
#'   c & d 	
#'   \end{bmatrix} 	
#'   \begin{bmatrix} 	
#'   x \\ y 	
#'   \end{bmatrix} + 	
#'   \begin{bmatrix} e \\ 	
#'   f \end{bmatrix}.$$	
#' 	
#' Computationally, these linear IFS contraction maps can be expressed as $N\times 7$ matrices, where $N$ is the number of maps and $7$ is the number of parameters needed to describe an affine transformation in $R^2$.	
#' 	
#' map | $A_{1,1}$ | $A_{1,2}$ | $A_{2,1}$ | $A_{2,2}$ | $B_{1}$ | $B_{2}$ | probability	
#' ----|-----------|-----------|-----------|-----------|---------|---------|-----------	
#' w |	a |	b |	c |	d |	e |	f |	p	
#' 	
#' Let's take as an example [Barnsley's fern](https://en.wikipedia.org/wiki/Barnsley_fern), which is designed to model [real lady ferns (*athyrium filix-femina*)](https://en.wikipedia.org/wiki/Athyrium_filix-femina). It can be defined by a set of $N=4$ IFS contraction maps:	
#' 	
#' map | $A_{1,1}$ | $A_{1,2}$ | $A_{2,1}$ | $A_{2,2}$ | $B_{1}$ | $B_{2}$ | probability | Fern Portion	
#' ----|-----------|-----------|-----------|-----------|---------|---------|-------------|-------------	
#' $f_1$ | 	0 |	0 |	0 |	0.16 |	0 |	0 |	0.01 |	Stem 	
#' $f_2$ | 	0.85 |	0.02 |	−0.02 |	0.85 |	0 |	1.60 |	0.85 |	Successively smaller leaflets	
#' $f_3$ | 	0.20 |	−0.26 |	0.23 	| 0.22 |	0 |	1.60 |	0.1 |	Largest left-hand leaflet	
#' $f_4$ | −0.15 |	0.28 |	0.26 |	0.24 |	0 |	0.44 |	0.05 |	Largest right-hand leaflet 	
#' 	
#' Here is how the *Barnsley Fern* can be generated in R.	
#' 	
#' 	
# Barnsley's Fern	
# (1) create the 4-IFS functions of the probability and the current point	
fractal_BarnsleyFern <- function(x, p){	
    if (p <= 0.01) {	
        A <- matrix(c(0, 0, 0, 0.16), 2, 2)	
        B <- c(0, 0)	
    } else if (p <= 0.86) {	
        A <- matrix(c(.85, -.02, .02, .85), 2, 2)	
        B <- c(0, 1.6)	
    } else if (p <= 0.95) {	
        A <- matrix(c(.2, .23, -.26, .22), 2, 2)	
        B <- c(0, 1.6)	
    } else {	
        A <- matrix(c(-.15, .26, .28, .24), 2, 2)	
        B <- c(0, .44)	
    }	
    return(A %*% x + B)	
}	
	
# Fern Resolution depends on the number of iterative applications of the IFS system	
reps <- 100000	
	
# create a vector with probability values, and a matrix to store coordinates	
p <- runif(reps)	
	
# initialize a point at the origin	
init_coords <- c(0, 0)	
	
# compute the list of reps fractal coordinates: (X,Y) pairs	
A <- Reduce(fractal_BarnsleyFern, p, accumulate = T, init = init_coords)	
A <- t(do.call(cbind, A)) # unwind the list of (X,Y) pairs as (reps * 2) array	
	
# Plot	
# plot(A, type = "p", cex = 0.1, col = "darkgreen",	
#      xlim = c(-3, 3), ylim = c(0, 15), 	
#      xlab = NA, ylab = NA, axes = FALSE)	
plot_ly(x=~A[,1], y=~A[,2], type="scatter", mode="markers", name="Barnsley's Fern",	
        marker = list(color = 'rgb(157, 255, 157)', size = 1))	
	
# export Fern as JPG image	
jpeg(paste(getwd(), sep="/", "results/FernPlot.jpg"))	
plot(A, type = "p", cex = 0.1, col = "darkgreen", xlim = c(-3, 3), ylim = c(0, 15), 	
     xlab = NA, ylab = NA, axes = FALSE)	
dev.off()	
	
# Load the image back in and test the DNN classification	
img <- image_load(paste(getwd(),"results/FernPlot.jpg", sep="/"), target_size = c(224,224))	
# Plot Fern	
	
# Preprocess the Fern image and predict its class (label)	
imgRGB <- rgbImage(red=t(image_to_array(img)[,,1]/255), 	
                          green=t(image_to_array(img)[,,2]/255), blue=t(image_to_array(img)[,,3]/255))	
display(imgRGB)	
	
# Preprocess input image	
x <- image_to_array(img)	
# ensure we have a 4d tensor with single element in the batch dimension,	
# the preprocess the input for prediction using resnet50	
x <- array_reshape(x, c(1, dim(x)))	
x <- imagenet_preprocess_input(x)	
	
# Specify and compare Predictions based on different Pre-trained Models	
# Model 1: resnet50	
model_resnet50 <- application_resnet50(weights = 'imagenet')	
	
# make predictions then decode and print them	
preds_resnet50 <- model_resnet50 %>% predict(x)	
imagenet_decode_predictions(preds_resnet50, top = 10)	
	
# Model2: VGG19	
model_vgg19 <- application_vgg19(weights = 'imagenet')	
preds_vgg19 <- model_vgg19 %>% predict(x)	
imagenet_decode_predictions(preds_vgg19, top = 10)[[1]]	
	
# Model 3: VGG16	
model_vgg16 <- application_vgg16(weights = 'imagenet')	
preds_vgg16 <- model_vgg16 %>% predict(x)	
imagenet_decode_predictions(preds_vgg16, top = 10)[[1]]	
#' 	
#' 	
#' ## Fake images	
#' 	
#' You can also try to use `TensorFlow` and `Keras` to generate some "fake" *synthetic images* that can be then classified. This can be accomplished by using a [Generative Adversarial network (GAN)](https://en.wikipedia.org/wiki/Generative_adversarial_network) to synthetically sample from a collection of images like the MNIST image sets, e.g., `keras::dataset_fashion_mnist`, `keras::cifar10`, and `keras::dataset_mnist`. See [tutorial 1](https://blogs.rstudio.com/tensorflow/posts/2018-08-26-eager-dcgan/), [tutorial 2](https://jjallaire.github.io/deep-learning-with-r-notebooks/notebooks/8.5-introduction-to-gans.nb.html#), and [this R code](https://github.com/rstudio/keras/blob/master/vignettes/examples/eager_dcgan.R) for examples.	
#' 	
#' 	
library(keras)	
latent_dim <- 32	
height <- 32	
width <- 32	
channels <- 3	
generator_input <- layer_input(shape = c(latent_dim))	
generator_output <- generator_input %>%	
  layer_dense(units = 128 * 16 * 16) %>%                             	
  layer_activation_leaky_relu() %>%                                  	
  layer_reshape(target_shape = c(16, 16, 128)) %>%                   	
  layer_conv_2d(filters = 256, kernel_size = 5,	
                padding = "same") %>%	
  layer_activation_leaky_relu() %>%	
  layer_conv_2d_transpose(filters = 256, kernel_size = 4,            	
                          strides = 2, padding = "same") %>%         	
  layer_activation_leaky_relu() %>%                                  	
  layer_conv_2d(filters = 256, kernel_size = 5,	
                padding = "same") %>%	
  layer_activation_leaky_relu() %>%	
  layer_conv_2d(filters = 256, kernel_size = 5,	
                padding = "same") %>%	
  layer_activation_leaky_relu() %>%	
  layer_conv_2d(filters = channels, kernel_size = 7,                 	
                activation = "tanh", padding = "same")               	
generator <- keras_model(generator_input, generator_output)          	
	
discriminator_input <- layer_input(shape = c(height, width, channels))	
discriminator_output <- discriminator_input %>%	
  layer_conv_2d(filters = 128, kernel_size = 3) %>%	
  layer_activation_leaky_relu() %>%	
  layer_conv_2d(filters = 128, kernel_size = 4, strides = 2) %>%	
  layer_activation_leaky_relu() %>%	
  layer_conv_2d(filters = 128, kernel_size = 4, strides = 2) %>%	
  layer_activation_leaky_relu() %>%	
  layer_conv_2d(filters = 128, kernel_size = 4, strides = 2) %>%	
  layer_activation_leaky_relu() %>%	
  layer_flatten() %>%	
  layer_dropout(rate = 0.4) %>%                                            	
  layer_dense(units = 1, activation = "sigmoid")                           	
discriminator <- keras_model(discriminator_input, discriminator_output)    	
discriminator_optimizer <- optimizer_rmsprop(	
  lr = 0.0008,	
  clipvalue = 1.0,                                                         	
  decay = 1e-8                                                             	
)	
discriminator %>% compile(	
  optimizer = discriminator_optimizer,	
  loss = "binary_crossentropy"	
)	
	
# freeze_weights(discriminator)                         1	
gan_input <- layer_input(shape = c(latent_dim))	
gan_output <- discriminator(generator(gan_input))	
gan <- keras_model(gan_input, gan_output)	
gan_optimizer <- optimizer_rmsprop(	
  lr = 0.0004,	
  clipvalue = 1.0,	
  decay = 1e-8	
)	
gan %>% compile(	
  optimizer = gan_optimizer,	
  loss = "binary_crossentropy"	
)	
	
# mnist_fashion <- keras::dataset_fashion_mnist()  # shape (num_samples, 28, 28) 	
cifar10 <- dataset_cifar10()  # shape (num_samples, 3, 32, 32)                                          	
c(c(x_train, y_train), c(x_test, y_test)) %<-% cifar10	
x_train <- x_train[as.integer(y_train) == 6,,,]                           	
x_train <- x_train / 255                                                  	
iterations <- 100	
batch_size <- 20	
save_dir <- getwd()                                                   	
start <- 1	
for (step in 1:iterations) {	
  random_latent_vectors <- matrix(rnorm(batch_size * latent_dim),         	
                                  nrow = batch_size, ncol = latent_dim)	
  generated_images <- generator %>% predict(random_latent_vectors)        	
  stop <- start + batch_size - 1                                          	
  real_images <- x_train[start:stop,,,]                                   	
  rows <- nrow(real_images)                                               	
  combined_images <- array(0, dim = c(rows * 2, dim(real_images)[-1]))    	
  combined_images[1:rows,,,] <- generated_images                          	
  combined_images[(rows+1):(rows*2),,,] <- real_images                    	
  labels <- rbind(matrix(1, nrow = batch_size, ncol = 1),                 	
                  matrix(0, nrow = batch_size, ncol = 1))                 	
  labels <- labels + (0.5 * array(runif(prod(dim(labels))),               	
                                  dim = dim(labels)))                     	
  d_loss <- discriminator %>% train_on_batch(combined_images, labels)     	
  random_latent_vectors <- matrix(rnorm(batch_size * latent_dim),         	
                                  nrow = batch_size, ncol = latent_dim)   	
  misleading_targets <- array(0, dim = c(batch_size, 1))                  	
  a_loss <- gan %>% train_on_batch(                                       	
    random_latent_vectors,                                                	
    misleading_targets                                                    	
  )                                                                       	
  start <- start + batch_size	
  if (start > (nrow(x_train) - batch_size))	
    start <- 1	
  if (step %% 10 == 0) {                                                 	
    # save_model_weights_hdf5(gan, "gan.h5")                                	
    # Status Reporting	
    cat("Completion Status: ", round((100*step)/iterations,0), "% \n")	
    cat("\t discriminator loss:", d_loss, "\n")	
    cat("\t adversarial loss:", a_loss, "\n")	
	
    # Optionally save the real/generated images	
    # image_array_save(generated_images[1,,,]*255,path=file.path(save_dir,paste0("generated_img",step,".png")))	
    # image_array_save(real_images[1,,,]*255,path = file.path(save_dir, paste0("real_img", step, ".png")))	
  }	
}	
	
# Generated images are: generated_images[batch_size=20, x=32, y=32, channels=3]	
# Upscale the last generated image 32*32 -> 128*128*	
normed <- EBImage::resize(generated_images[10,,,2]*255, w = 224, h = 224)	
# image(normed, asp = 1, xaxt='n', yaxt='n', ann=FALSE, frame.plot=F)	
# title(main = "Synthetic Image", font.main = 10)	
plot_ly(z=~generated_images[15,,,1], type="contour", showscale=F)	
# plot_ly(z=~generated_images[10,,,2], type="image")	
	
#convert the image to 4D	
normed4D <- rbind (normed, normed, normed)	
dim(normed4D) <- c(224, 224, 3, 1)	
	
# predict class label of synth image (normed4D)	
x <- image_to_array(img)	
# ensure we have a 4d tensor with single element in the batch dimension,	
# the preprocess the input for prediction using resnet50	
x <- array_reshape(x, c(1, dim(x)))	
x <- imagenet_preprocess_input(x)	
	
# Specify and compare Predictions based on different Pre-trained Models	
# Model 1: resnet50	
model_resnet50 <- application_resnet50(weights = 'imagenet')	
	
# make predictions then decode and print them	
preds_resnet50 <- model_resnet50 %>% predict(x)	
imagenet_decode_predictions(preds_resnet50, top = 10)	
#' 	
#' 	
#' Clearly this is a very simple image and the DNN classification is not expected to be very informative. The results reported above will vary with the draw of the randomly generated synthetic image.	
#' 	
#' 	
#' # Generative Adversarial Networks (GANs)	
#' 	
#' The articles [Generating Sequences With Recurrent Neural Networks, by Alex Graves](https://arxiv.org/abs/1308.0850) and [Generative Adversarial Nets, by Goodfellow and colleagues](https://papers.nips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf), introduced a novel strategy to use recurrent neural networks to generate realistic signals, including audio generation (music, speech, dialogue), image generation, text-synthesis, and molecule design, etc.  GANs represent an alternative strategy to [variational auto-encoders (VAE)](https://ieeexplore.ieee.org/document/8285168) to generate synthetic data.	
#' 	
#' GAN frameworks estimate generative models using an adversarial process that simultaneously trains a pair of network models - a *generative model* $G$ that captures the data distribution, and a separate *discriminative model* $D$ that estimates the probability that a previously generated (synthetic) sample was real, i.e., came from the training data, rather than a synthetic $G$ output. 	
#' 	
#' For a binary classification, the $G$ training maximizes the probability of $D$ making a mistake (adversity), which corresponds to a *mini-max* optimization of a two-player game. The state space of all potential $G$ and $D$ permits a unique solution where $G$ recovers the training data distribution and $D=\frac{1}{2}$ is a constant, which corresponds to 50-50 change (largest entropy). Often, the $G$ and $D$ networks are defined as multilayer perceptrons (MLP) that can be jointly fit using [backpropagation](https://en.wikipedia.org/wiki/Backpropagation).	
#' 	
#' GAN learning requires iterative estimation of the generator’s distribution $p_g$ using the training data $x$, subject to some prior on noisy input latent variables $Z\sim p_Z(z)$. Denote a generator mapping to the data space as $G(z; \theta_g)$ where $G$ is a differentiable function representing a multilayer perceptron network with parameters $\theta_g$, and let the second multilayer perceptron network $D(x; \theta_g)$ represents the output scalar probability that the input $x$ came from the training data rather than from generator’s distribution $p_g$.	
#' 	
#' The iterative NN modeling fitting (learning) involves:	
#' 	
#'  - $D$ maximization of the probability of assigning the correct labels (true=real or false=synthetic) to both types of inputs $x$, either from training examples of synthetic samples from $G$, and	
#'  - Simultaneous training $G$ to minimize $\log(1 −D(G(z)))$. 	
#'  	
#' This dual optimization process for $D$ and $G$ corresponds to a two-player *mini-max* game with an objective value function $V (G,D)$:	
#' 	
#' $$\min_G \max_D V (D,G) = \mathbb{E}_{x∼p_{data}(x)} [\log D(x)] + \mathbb{E}_{x∼p_{Z}(z)} [\log(1 −D(G(z)))]. $$	
#' 	
#' This training approach enables recovering the data generating distribution using numerical iterative approaches. Note that for finite datasets, a perfect optimization of $D$ in the inner loop of training is computationally impractical and in general may result in overfitting. Therefore, the algorithm alternates the estimation process by performing $k$ steps of optimizing $D$ followed by $1$ step of optimizing $G$. When $G$ updates change slowly, repeating this process yields $D$ estimation near its optimal solution. In practice, direct gradient optimization of the objective value function $V (G,D)$ may be insufficient to learn/estimate $G$. Therefore, early in the learning process when $G$ may be poorly estimated, $D$ can reject samples with higher confidence because these early generations are expected to be obviously simple and fake, i.e., different from the training data and unrealistic, as for the early initial iterations, $\log(1 −D(G(z)))$ may saturate. Therefore, in the early training process, rather than training $G$ to minimize $\log(1 −D(G(z)))$, the $G$ training may focus on maximizing $\log D(G(z))$. Eventually, we transition to minimizing the correct cost $\log(1 −D(G(z)))$ and the final result of this dynamic optimization still has the same fixed point for $G$ and $D$, but provides stronger gradients early in the learning process.	
#' 	
#' ## CIFAR10 Archive	
#' 	
#' We show GAN training using [CIFAR10 images, dataset of 50K $32\times 32$ RGB images, representing 10 classes (5K images per class)](https://en.wikipedia.org/wiki/CIFAR-10). Let's focus on `birds` (label=2). All (low-resolution) images are of dimension $\left (\underbrace{32, 32}_{pixels}, \underbrace{3}_{RGB\ colors} \right )$. Note the 3-channel RGB intensities. Below is a $10\times 10$ collage of the first 100 bird images in the CIFAR10 archive.	
#' 	
#' 	
library(keras)	
library(tensorflow)	
# install_tensorflow(version = "gpu")	
# install_keras(tensorflow = "gpu")	
# library(EBImage)	
	
# CIFAR10 original labels: https://www.cs.toronto.edu/~kriz/cifar.html 	
# The label data is just a list of 10,000 numbers ranging from 0 to 9, which corresponds to each of the 10 classes in CIFAR-10.	
#     airplane : 0	
#     automobile : 1	
#     bird : 2	
#     cat : 3	
#     deer : 4	
#     dog : 5	
#     frog : 6	
#     horse : 7	
#     ship : 8	
#     truck : 9	
# Focus on CIFAR10 BIRD images(label 2)!	
	
# Loads CIFAR10 data	
cifar10 <- dataset_cifar10()                                              	
c(c(x_train, y_train), c(x_test, y_test)) %<-% cifar10	
	
# Selects bird images (class 2)	
x_train <- x_train[as.integer(y_train) == 2,,,]                           	
	
# Normalizes image intensities (bytes [0,255] --> [0,1])	
x_train <- x_train / 255                                                  	
	
# Display a grid of 10*10 Bird images	
# img_real <- list()	
# bird_images <- x_train[1:(10*10),,,]	
# for (i in 1:10) {	
#   for (j in 1:10) {	
#     img_real[[i+ (j-1)*10]] <- rotate(rgbImage(   # normalize the RGB values to [0,1] to use EBImage::display()	
#       normalize(bird_images[i+ (j-1)*10,,,1]), normalize(bird_images[i+ (j-1)*10,,,2]), 	
#       normalize(bird_images[i+ (j-1)*10,,,3])), 90) 	
#   }	
# } 	
# img_comb = EBImage::combine(img_real)	
# # Display the Bird images	
# EBImage::display(img_comb, method="raster", all = TRUE)	
# 	
# plt_list <- list()	
# N=100 	
# for (i in 1:10) {	
#   for (j in 1:10) {	
#     plt_list[[i+(j-1)*10]] <- 	
#       plot_ly(z=bird_images[i+ (j-1)*10,,,1], type="heatmap", showscale=FALSE) %>% 		
#       layout(showlegend=FALSE,  # hovermode = "y unified",	
#              xaxis=list(zeroline=F, showline=F, showticklabels=F, showgrid=F),	
#              yaxis=list(autorange = "reversed", zeroline=F,showline=F,showticklabels=F,showgrid=F)) #,		
#              # yaxis = list(scaleratio = 1, scaleanchor = 'x'))		
#   }	
# }	
# 	
# # plt_list[[2]] 		
# plt_list %>%		
#   subplot(nrows = 10, margin = 0.001, which_layout=1) %>%		
#   layout(title="CIFAR-10 - a collage of random birds")		
	
bird_images <- x_train[1:(10*10),,,]	
plt_list <- list()	
N=100 	
for (i in 1:10) {	
  for (j in 1:10) {	
    plt_list[[i+(j-1)*10]] <- 	
      plot_ly(z=255*bird_images[i+ (j-1)*10,,,], type="image", showscale=FALSE) %>% 		
      layout(showlegend=FALSE,  # hovermode = "y unified",	
             xaxis=list(zeroline=F, showline=F, showticklabels=F, showgrid=F),	
             yaxis=list(zeroline=F,showline=F,showticklabels=F,showgrid=F)) #,	
             # yaxis = list(scaleratio = 1, scaleanchor = 'x'))	
  }	
}	
	
# plt_list[[2]] 		
plt_list %>%		
  subplot(nrows = 10, margin = 0.0001, which_layout=1) %>%		
  layout(title="CIFAR-10 - a collage of random birds")	
#' 	
#' 	
#' 	
#' ## Generator ($G$)	
#' 	
#' Recall that the GAN represents a forger (adversarial) network $G$ and an expert network $D$ duking it out for superiority. Let's first examine $G$ and experiment with a generator network which takes a random vector input (a stochastic point in the latent space) and outputs a decoded synthetic image that is sent to the expert (discriminator) for auto-labeling.	
#' 	
#' We will demonstrate a [keras](https://keras.io/examples/generative/dcgan_overriding_train_step/) implementation of GAN modeling using deep convolutional GAN (DCGAN). Both the generator $G$ and discriminator $D$ will be [deep convnets](https://en.wikipedia.org/wiki/Convolutional_neural_network). 	
#' 	
#' The method `layer_conv_2d_transpose()` is used for image upsampling in the generator. GAN model includes:	
#' 	
#'  - A generator network $G$ mapping vectors of shape (*latent_dim*) to (fake) RGB images of dimension $\left (\underbrace{32, 32}_{pixels}, \underbrace{3}_{RGB\ colors} \right )$.	
#'  - A discriminator network $D$ mapping images of the same dimension to a binary score estimating the probability that the image is real.	
#'  - A GAN network concatenating the generator and the discriminator together: `gan(x) <- discriminator(generator(x))` to map latent space vectors $x$ to the discriminator decoding real/fake and an assessment of the realism of generator output images.	
#'  - $D$ is trained using examples of *real* and *synthetic* $G$-output images along with their corresponding “real”/”synth” labels.	
#'  - $G$ training using the gradients of the generator’s weights reflecting the loss of the GAN objective function. At each iteration, these $G$ weights are updated to optimize the cost function in a direction to improve $D$ performance to correctly ID  “real” and "synth" images supplied by the generator.	
#'  - To avoid getting the generator stuck with generating purely noisy images, we use dropout on both the discriminator and the generator.	
#' 	
#' 	
#' 	
latent_dim <- 32	
height <- 32	
width <- 32	
channels <- 3	
generator_input <- layer_input(shape = c(latent_dim))	
generator_output <- generator_input %>%	
  # transforms the input into a 16 × 16, 128 channel feature	
  layer_dense(units = 128 * 16 * 16) %>%                             	
  layer_activation_leaky_relu() %>%                                  	
  layer_reshape(target_shape = c(16, 16, 128)) %>%    	
  layer_conv_2d(filters = 256, kernel_size = 5,	
                padding = "same") %>%	
  layer_activation_leaky_relu() %>%	
  # Upsample images 16 x 16 --> 32 × 32	
  layer_conv_2d_transpose(filters = 256, kernel_size = 4,            	
                          strides = 2, padding = "same") %>%         	
  layer_activation_leaky_relu() %>%                                  	
  layer_conv_2d(filters = 256, kernel_size = 5,	
                padding = "same") %>%	
  layer_activation_leaky_relu() %>%	
  # generate a 3-channel RGB feature-map 32 × 32 reflecting the dimensions of all CIFAR10 images	
  layer_conv_2d(filters = 256, kernel_size = 5, padding = "same") %>%	
  layer_activation_leaky_relu() %>%	
  # instantiate the generator model, mapping the input of latent_dim into an image of dimension (32, 32, 3)	
  layer_conv_2d(filters = channels, kernel_size = 7,                 	
                activation = "tanh", padding = "same")               	
generator <- keras_model(generator_input, generator_output)   	
#' 	
#' 	
#' 	
#' ## Discriminator	
#' 	
#' The expert (Discriminator) network takes as input a real or synthetic image and outputs a label (or probability prediction) about the chance that the image came from the real training set or was synthetically created by the generator network ($G$).	
#' 	
#' $G$ is trained to confuse the discriminator network $D$ and evolve toward generating increasingly more realistic output images. As the number of training epochs increases, the artificially created images become similar to the real training data images.	
#' 	
#' Thus, $D$ continuously adapts its neural network to increase the probability of catching fake images. However, this process also gradually improves the $G$ capability to generate highly realistic output images. As the dual optimization process stabilizes and the training terminates, the generator is producing realistic images from random points in the state space, and the discriminator improves with detection of fakes.	
#' 	
#' Below is an implementation of a discriminator model taking a real or synthetic candidate image as input and outputting a classification label "generated (synth) image" or "real image from the training set".	
#' 	
#' 	
discriminator_input <- layer_input(shape = c(height, width, channels))	
discriminator_output <- discriminator_input %>%	
  layer_conv_2d(filters = 128, kernel_size = 3) %>%	
  layer_activation_leaky_relu() %>%	
  layer_conv_2d(filters = 128, kernel_size = 4, strides = 2) %>%	
  layer_activation_leaky_relu() %>%	
  layer_conv_2d(filters = 128, kernel_size = 4, strides = 2) %>%	
  layer_activation_leaky_relu() %>%	
  layer_conv_2d(filters = 128, kernel_size = 4, strides = 2) %>%	
  layer_activation_leaky_relu() %>%	
  layer_flatten() %>%	
  # one dropout layer, see text	
  layer_dropout(rate = 0.4) %>%     	
  # classification layer	
  layer_dense(units = 1, activation = "sigmoid")                           	
# instantiate the discriminator model, taking a (32, 32, 3) input into a binary classification decision (fake/real)	
discriminator <- keras_model(discriminator_input, discriminator_output)    	
discriminator_optimizer <- optimizer_rmsprop(	
  lr = 0.0008,	
  # gradient clipping (by value) in the optimizer	
  clipvalue = 1.0,  	
  # to stabilize training, specify a learning-rate decay	
  decay = 1e-8                                                             	
)	
discriminator %>% compile(	
  optimizer = discriminator_optimizer,	
  loss = "binary_crossentropy"	
)	
#' 	
#' 	
#' ## The adversarial network 	
#' 	
#' 	
# For the GAN model, set the discriminator weights to non-trainable	
freeze_weights(discriminator)                         	
gan_input <- layer_input(shape = c(latent_dim))	
gan_output <- discriminator(generator(gan_input))	
gan <- keras_model(gan_input, gan_output)	
gan_optimizer <- optimizer_rmsprop(	
  lr = 0.0004,	
  clipvalue = 1.0,	
  decay = 1e-8	
)	
gan %>% compile(	
  optimizer = gan_optimizer,	
  loss = "binary_crossentropy"	
)	
#' 	
#' 	
#' 	
#' ## Training the DCGAN	
#' 	
#' Just like most other deep learning processes, the DCGAN *design*, *training*, and *tuning* involve significant scientific rigorous and artistry.  The theoretical foundations are intertwined with heuristic approaches translating some intuition and human-intelligence into computational modeling. Some of the exemplary heuristics involved in DCGAN modeling  and the implementation of the GAN *generator* and *discriminator* include:	
#' 	
#'  - Using the $\tanh()$ function in the last activation of the generator, as opposed to the more standard `sigmoid` function commonly employed in other types of DL models.	
#'  - Random sampling points from the latent space rely on *Gaussian (normal)* distribution, rather than a high-entropy *uniform* distribution. This randomness and stochasticity during training yields more reliability and robustness in the final models. 	
#'  - DCGAN training aims to achieve a dynamic equilibrium (tug-of-war between $G$ and $D$ nets). To ensure the GAN models avoid local minima (sub-optimal solutions), ), randomness is embedded in the training process. Stochasticity is introduced by using dropout in the discriminator (omitting or dropping out the feedback of each discriminator in the framework with some probability at the end of each batch) and by adding random noise to the labels for the discriminator.	
#'  - Sparse gradients can negatively impact the GAN training process. Sparsity is often a desirable property in DL as it makes many theoretically intractable computational problems solvable in practice. Gradient sparsity in DCGANs is the result of (1) *max-pooling* operations for calculating the largest value in each patch of each feature map, i.e., down sampling or pooled feature maps to highlight the most salient feature in the patch (instead of averaging the signal as is the case of average pooling); or (2) *ReLU activations* (rectified linear activation function, ReLU, is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero). `Max-pooling` can be swapped with strided convolutions for downsampling. Whereas `ReLU` activation can be replaced by `layer_activation_leaky_relu`, which is similar to ReLU, but it relaxes sparsity constraints by allowing small negative activation values.	
#'  - The $G$ generated output images may exhibit checkerboard artifacts caused by unequal coverage of the pixel space in the generator. This problem may be addressed by employing a kernel of size divisible by the image stride size whenever we use a strided `layer_conv_2d_transpose` or `layer_conv_2d` in both the generator and the discriminator. The stride, or pitch, is the number of bytes from one row of pixels in memory to the next row of pixels in memory; the presence of padding bytes widens the stride relative to the width of the image.	
#' 	
#' Recall that stochastic gradient descent optimization facilitates iterative learning using a training dataset to update the learnt model at each iteration:	
#'     	
#'  - The [*batch size* is a hyperparameter](https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/) of gradient descent that controls the number of training samples to work through before the model’s internal parameters are updated.	
#'  - The *number of epochs* is a hyperparameter of gradient descent that controls the number of complete passes through the training dataset.	
#' 	
#' Let's demonstrate the synthetic image generation using the [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) imaging data archive of 10K images labeled in 10 different categories (e.g., airplanes, horses).	
#' 	
#' The example below just does 2 epochs. Increasing the *iterations* parameter ($k\times 100$) would generate more, and increasingly accurate synthetic images (in this case we are focusing on birds, label 2).	
#' 	
#' ## Elements of the DCGAN Training	
#' 	
#' The DCGAN training involves looping (iterating over each epoch) the following steps:	
#' 	
#'  - Randomly traverse the latent space (introduce random noise by random sampling).	
#'  - Use $G$ to generate images based on the random noise in the previous step.	
#'  - Mix the synth-generated images with real real-images (from training data).	
#'  - Train $D$ to discriminate (label) these mixed images, outputting “real” or “fake” class labels.	
#'  - Again, randomly traverse the latent space drawing new random points (in the latent space).	
#'  - Train the DCGAN model using these random vectors, with a fixed target label="real" for all images. This process updates only the network weights of the generator! The discriminator is static inside the GAN. Hence, these updates force the discriminator to predict “real images” for synthetically-generated images. This is the adversarial phase, as it trains the generator to fool the (frozen) discriminator.	
#'  - In the experiment below, we use a low number of `iterations <- 200`. To generate more realistic results, this number needs to be much higher (e.g., $10,000$).	
#'  	
#' GPU computing Note: these DCGAN models are very computationally intensive. The performance is enhanced by installing [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) and [NVIDIA cuDNN](https://developer.nvidia.com/cudnn), which allow you to run the calculations on the GPU, instead of the default, CPU.	
#' 	
#' 	
iterations <- 200 # 10000 increase  to get better results (improved synthetic images)	
# build the layout matrix with additional separating cells	
nx <- 10 # number of images in a row	
ny <- 10 # number of images in a column	
batch_size <- 20	
start <- 1	
	
# List of nx*ny synthetic and real images (as matrices)	
img_gan <- list()	
#img_gan[[1]] <- x_train[1,,,1]; image(img_gan[[1]])	
img_real <- list()	
plt_list <- list()	
	
for (step in 1:iterations) {	
  # First-tier sampling of random points in the latent space	
  random_latent_vectors <- matrix(rnorm(batch_size*latent_dim),  nrow=batch_size, ncol=latent_dim)	
  	
  # str(generated_images) (Batch-size, 2D-grid, RGB-colors)	
  # Array[1:20, 1:32, 1:32, 1:3]	
  # decode/generate and save synth-generated images	
  generated_images <- generator %>% predict(random_latent_vectors)  	
  	
  # Combine synth and real images	
  stop <- start + batch_size - 1                                          	
  real_images <- x_train[start:stop,,,]                                   	
  rows <- nrow(real_images)                                               	
  combined_images <- array(0, dim = c(rows * 2, dim(real_images)[-1]))    	
  combined_images[1:rows,,,] <- generated_images                          	
  combined_images[(rows+1):(rows*2),,,] <- real_images                    	
  	
  # Assemble the labels discriminating synth from real images	
  labels <- rbind(matrix(1, nrow = batch_size, ncol = 1),                 	
                  matrix(0, nrow = batch_size, ncol = 1))                 	
  	
  # Add random noise to the image-labels to avoid trapping in local minima	
  labels <- labels + (0.5 * array(runif(prod(dim(labels))),               	
                                  dim = dim(labels)))                     	
  	
  # First, train the discriminator, $D$	
  d_loss <- discriminator %>% train_on_batch(combined_images, labels)     	
  	
  # Second-tier sampling of random points in the latent space	
  random_latent_vectors <- matrix(rnorm(batch_size*latent_dim), nrow = batch_size, ncol = latent_dim)   	
  	
  # Assemble labels = "real images” (fake incorrect labels)	
  misleading_targets <- array(0, dim = c(batch_size, 1))                  	
  	
  # Second, train the generator $G$ using the GAN model. Note that the discriminator weights are fixed (static) during this optimization	
  a_loss <- gan %>% train_on_batch(                                       	
    random_latent_vectors,                                                	
    misleading_targets                                                    	
  )                                                                       	
  start <- start + batch_size	
  if (start > (nrow(x_train) - batch_size))   start <- 1	
  	
  # Display some of the generated images for visual inspection (number = iterations/(nx*ny))	
  if (step %% (nx*ny) == 0) {                                                 	
    # Save the GAN model weights in h5 format	
    save_model_weights_hdf5(gan, "gan.h5")                                	
    	
    # Report metrics	
    cat("Step=", step, "; discriminator loss=", d_loss, "\n")                              	
    cat("Step=", step, "; adversarial loss=", a_loss, "\n")                                	
	
    # # Save one synth-generated image, Need to normalize the RGB values to [0,1] 	
    # img_gan[[step/(nx*ny)]] <- rotate(rgbImage(generated_images[1,,,1],	
    #                           generated_images[1,,,2], generated_images[1,,,3]), 90)                         # Save one real (bird) image	
    # img_real[[step/(nx*ny)]] <- rotate(rgbImage(real_images[1,,,1], 	
    #                           real_images[1,,,2], real_images[1,,,3]), 90) 	
    	
    # plot_ly rendering	
    pl1 <- plot_ly(z=255*real_images[step/(nx*ny), , , ], type="image") %>% 		
        layout(showlegend=FALSE,  # hovermode = "y unified",	
             xaxis=list(zeroline=F, showline=F, showticklabels=F, showgrid=F),	
             yaxis=list(zeroline=F,showline=F,showticklabels=F,showgrid=F))	
    pl2 <- plot_ly(z=255*generated_images[step/(nx*ny), , , ], type="image") %>% 		
        layout(showlegend=FALSE,  # hovermode = "y unified",	
             xaxis=list(zeroline=F, showline=F, showticklabels=F, showgrid=F),	
             yaxis=list(zeroline=F,showline=F,showticklabels=F,showgrid=F))	
    plt_list[[step/(nx*ny)]] <- subplot(pl1, pl2, nrows = 1, margin = 0.0001, which_layout=1)	
  }	
}	
	
# img_comb = EBImage::combine(c(img_real, img_gan))	
# # Display the Bird images	
# EBImage::display(img_comb, method="raster", all = TRUE)	
	
# plt_list[[2]] 		
plt_list %>%		
  subplot(nrows = 2, margin = 0.0001, which_layout=1) %>%		
  layout(title="Observed (left) and Synthetic (right) Bird Images")	
#' 	
#' 	
#' 	
#' The **generator** transforms *random latent vectors* into *images*. The **discriminator** attempts to correctly identify the real and synthetically-generated images. The generator is trained to fool the discriminator. 	
#' 	
#' **Iterative Protocol**:	
#' 	
#'  - *Inputs*: Random vector from the latent space (`random_latent_vectors <- matrix(rnorm(batch_size * latent_dim), nrow = batch_size, ncol = latent_dim)`) and Real Images (`real_images[1,,,] * 255`);	
#'  - Generator (decoder) - receives *Inputs* and *training feedback* from Discriminator including real and synth images and their discriminated labels (real or synthetic);	
#'  - Generator outputs new synth (decoded) image that is sent along with another real image as input to the Discriminator below for another real vs. synth labeling	
#'  - Discriminator receives a pair of (real and synth) images as inputs, and outputs labels (real or synth) to them and forwards the results to generator	
#'  - This iterative process continues until a certain stopping criterion is reached.	
#'  	
#' The GAN (generator network) is iteratively trained and tuned to fool the discriminator network (i.e., pass synth images as real). This training cycle continues and the neural network evolves toward generating increasingly realistic images. Simulated artificial images begin to look indistinguishable from their real counterparts. The discriminator network becomes less effective in telling the two types of images apart. In this iterative process, the discriminator is constantly adapting to the gradually improving capabilities of the generator. This constant reinforcement yields realistic versions of synthetic computer-generated images. At the end of the training process, which is highly non-linear and discontinuous, the generator churns out input latent space points into realistic-looking images.	
#'  	
#' 	
#' # Transfer Learning	
#' 	
#' Humans learn complex tasks by capitalizing on their prior experiences, no matter how remote these previous encounters may appear to be. By the age of 5, most kids can learn how to ride a bicycle in a couple of training sessions. This riding ability is acquired after they have already mastered the arts of *running*, also known as  *controlled falling*, navigating complex 3D environments, and anticipating dynamic 4D spatio-temporal events. In effect, before kids start pedaling, their many prior holistic training experiences ensure that they “know” the basics of bike balancing. Children's formative years include a very large number of trial-and-errors, parental guidance sessions, and societal cues. These events already provide the basic building blocks necessary to learn bicycle riding. And this is well in advance of the actual "bicycle training" experience, which we typically associate with bicycle riding. 	
#' 	
#' This learning process is very different for machines. It's extremely difficult to train a machine (a robot) to ride a bike, because these prior experiences kids go through are missing and can not be easily built and transferred to complete the new task of learning how to balance a bike. In a way, humans learn new tasks easily as (1) they already have a large collection of skills they have already mastered, and (2) they can *transfer*, mix, match, integrate, and harness their prior experiences to the process of learning a *new task*. **Transfer machine learning** attempts to replicate this human transfer learning process into the domain of artificial intelligence. The goals are to expedite the ML training process by capitalizing on prior knowledge, expand the realm of ML/AI applications, and enable their “last mile” training to ensure they generate "reasonable decisions and actions" without starting with blank slate *de novo* learning.	
#' 	
#' ## Deep Network Transfer Learning in Text Classification	
#' 	
#' One of the main challenges of AI/ML interpretation of free text is the extreme heterogeneity of the information and the unstructured format of the text content. This problem can be resolved by structurizing the input text and establishing homologies between multiple text samples (e.g., clinical notes). In a nutshell, transfer learning facilitates this process and enables (1) synthetic text generation (new data) that simulates realistic textual content (non-human data); and (2) transformation of unstructured text to structured data elements. For instance, if an $input=clinical\ notes$, a DNN model generates $output=vector$ representing a quantitative signature vector of the input text; think of it as a vector of principal components associated with the specific free text. 	
#' 	
#' The result of this AI process is that independently of the text length or type, DNN always generates a numeric vector of a fixed size (say 128 values). This canonical representation establishes *homologies* between any given set of strings (character arrays).	
#' 	
#' Let's demonstrate *transfer machine learning* using the [medical specialty text-mining example of clinical notes example that we saw in Chapter 5](https://socr.umich.edu/DSPA2/DSPA2_notes/05_SupervisedClassification.html). This data includes a binary outcome indicating whether  the medical specialty unit (there are 40 such units) is a **surgical** unit or not. We’ll split the 4,999 cases, each containing 6 data elements, including the medical-specialty unit and clinical notes, into training and testing sets. 	
#' 	
#' The key will be to use `keras` to build and train a ML model for predicting surgical vs. non-surgical units from the content in the corresponding medical notes by using a previously trained text-mining DNN that quantizes text of any size.	
#' 	
#' 	
# load the necessary libraries	
# May need some installations first, e.g., 	
#    in conda%> pip install tensorflow_datasets	
#    install.packages("remotes")	
#    remotes::install_github("rstudio/tfds")	
#    tfds::install_tfds()	
	
library(keras)	
library(reticulate)	
library(tfhub)	
library(tfds)	
library(tfdatasets)	
library(utf8)	
	
# specify r-reticulate or r-tensorflow python Anaconda environment	
# use_condaenv("r-tensorflow")	
# use_condaenv("r-reticulate", required = TRUE)	
# there are many ways to "finding" your conda environments, and using the reticulate package to set them	
conda_list()[[1]][1] %>%  use_condaenv(required = TRUE)	
	
# Check tensorflow install configuration	
tensorflow::tf_config()	
py_module_available("tensorflow_hub")	
	
# py_install("tensorflow_hub", pip = TRUE) # py_install("tensorflow_hub")	
# py_install("tfds", pip = TRUE) # py_install("tfds")	
# py_install("tensorflow_datasets", pip = TRUE)	
	
# py_module_available("tensorflow_datasets")	
# py_module_available("tfds")  # tensorflow_datasets	
#' 	
#' 	
#' ### Binary Transfer Learning Label-Classification of Clinical Text	
#' 	
#' Let’s now **design a full DNN binary-classification model** composed of 4 layers stacked sequentially. The first *transfer learning* layer represents the pre-trained TensorFlow Hub layer (prior model), which is loaded as the a priori left-most base layer in the full DNN and maps clinical notes (description sentences) into its embedding vector (canonical signature vector). There are a number of *pre-trained text embedding models* we can choose in this transfer-learning example. For instance, we can use [google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1), which splits the sentences into tokens, embeds each token, and then combines the embedding yielding an output of dimensions: (num_examples, embedding_dimension). The output of this initial *transfer learning prior model layer* is a fixed-length output vector, which is fed into the next fully-connected (Dense) *layer-2* with 16 hidden units. Layer-2 output feeds into the next (dense) *layer-3* with 6 nodes. Finally, Layer-3 output goes into the last *layer-4*, which also is a densely connected layer with a single output (class label). Using the `sigmoid` activation function, this output represents a probability value between 0 and 1 indicating the model predicted chance, or confidence level, that the medical note text was written in a hospital *surgical unit*.	
#'  	
#' Other examples of pre-trained text mining models that can be used for transfer learning include:	
#' 	
#'  - [google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1),	
#'  - [google/tf2-preview/nnlm-en-dim128/1](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1), and	
#'  - [google/tf2-preview/gnews-swivel-20dim-with-oov/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1), similar to [google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1), but with 2.5% vocabulary converted to OOV buckets, which helps when the training and testing vocabularies are not fully overlapping.	
#'  - [google/tf2-preview/nnlm-en-dim50/1](https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1) is a much larger pre-trained model with vocabulary of size 1M and 50 dimensions.	
#'   - [google/tf2-preview/nnlm-en-dim128/1](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1) is another large model, vocabulary of size 1M, and 128 dimensions.	
#' 	
#' We will demonstrate NN-augmentation (transfer learning) modifying the base-model using the [pre-trained NN English Google News 200B corpus](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1) by adding 4 extra layers at the end, which will be tuned for our specific clinical text (medical notes). Of course, similarly, any of the other pre-trained models can be used as alternatives.	
#' 	
#' Download the [clinical dataset](http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/07_NaiveBayesianClass.html#64_Medical_Specialty_Text-Notes_Classification) and split it into training:training (80:20).	
#' 	
#' Note that in this clinical-notes example, the input data consists of medical text transcriptions stored as string sentences. In the first demonstration, we will try to predict a binary integer label, 0 or 1, representing a non-surgical or surgical clinical unit where the clinical note was transcribed. To structurize the free-text as a computable data object (a matrix), we will automatically convert sentences into embedding vectors. This can be accomplished using [text2vec](https://github.com/dselivanov/text2vec) or [keras::layer_text_vectorization()](https://www.rdocumentation.org/packages/keras/versions/2.4.0/topics/layer_text_vectorization) transformations, or by including a pre-trained text embedding as the first layer. This takes care of the text preprocessing, facilitates transfer learning, and makes the text-to-matrix independent of the text and the size of the clinical note.	
#' 	
#' 	
dataCT <- read.csv('https://umich.instructure.com/files/21152999/download?download_frd=1', header=T) 	
str(dataCT)	
	
# 'data.frame':	4999 obs. with  6 variables	
colnames(dataCT)	
	
# Binarize the 40 hospital units as Surgery-type and Non-Surgery types	
dataCT$surgLabel <- ifelse(grepl('Surg', dataCT$medical_specialty), 1, 0)	
table(grepl('Surg', dataCT$medical_specialty))	
	
# Fix the descriptions to UTF-8 encoding	
library(stringi)	
# table(stri_enc_mark(dataCT$description)) # ASCII native #  4994      5	
dataCT$description <- stri_encode(dataCT$description, "", "UTF-8") 	
dataCT$transcription <- stri_encode(dataCT$transcription, "", "UTF-8") 	
	
dataCT$clinicalNotes <- paste(dataCT$description, dataCT$transcription)	
	
# Clean the clinical notes	
library(tm)	
## Vectorize the text	
train_corpus <- VCorpus(VectorSource(dataCT$clinicalNotes))	
## Remove Punctuation	
train_corpus <- tm_map(train_corpus, content_transformer(removePunctuation))	
## Remove numbers	
train_corpus <- tm_map(train_corpus, removeNumbers)	
## Convert text to lower case	
train_corpus <- tm_map(train_corpus, content_transformer(tolower))	
## Remove stop words	
train_corpus <- tm_map(train_corpus, content_transformer(removeWords), stopwords("english"))	
## Stemming	
train_corpus <- tm_map(train_corpus, stemDocument)	
## Remove multiple whitespaces	
train_corpus <- tm_map(train_corpus, stripWhitespace)	
# Extract only the simplified text from the complex train_corpus object	
dataCT$clinicalNotes <- unlist(lapply(train_corpus, `[[`, 1))	
	
# Split the data 80:20	
train_set_ind <- sample(nrow(dataCT), floor(nrow(dataCT)*0.8)) # 80:20 split training:testing	
train_data <- dataCT[train_set_ind , ]	
test_data <- dataCT[-train_set_ind , ]  	
	
num_words <- 10000	
max_length <- 300	
text_vectorization <- layer_text_vectorization(max_tokens = num_words, output_sequence_length = max_length)	
	
# # `adapt()` the Clinical Notes Text Vectorization layer. Calling adapt allows the input layer to learn about 	
# # the unique Medical Text in this dataset and assign an integer value for each word	
# text_vectorization %>%  adapt(train_data$clinicalNotes)	
# 	
# # Confirm the Medical Notes vocabulary is in the text vectorization layer.	
# get_vocabulary(text_vectorization)	
# 	
# # Input Layer shape - the text vectorization layer transforms it’s inputs	
# trainDataX <- text_vectorization(matrix(train_data$clinicalNotes, ncol = 1))	
# trainDataY_one_hot_labels <- to_categorical(train_data$surgLabel, num_classes = 2)	
	
text_vectorization %>% adapt(train_data$clinicalNotes)	
	
# Define and fit the model - the input data consists of an array of word-indices. 	
# The predicted labels are either 0 or 1.	
# The classifier is based on sequentially stacking the network layers	
# The first embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. 	
# These vectors are learned as the model trains. 	
# The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).	
# A global_average_pooling_1d layer returns a fixed-length output vector for each example by averaging over the sequence dimension. 	
# This allows the model to handle *variable-length* inputs	
# The fixed-length output vector is piped through a fully-connected (dense) layer with 16 hidden units.	
# The last output layer is densely connected with a single output node. 	
# Sigmoid activation function yields a probability between 0 and 1 indicating the confidence of the binary level.	
#' 	
#' 	
#' ####  Define a fresh new `model1` *de novo*	
#' 	
#' 	
# 1. Define a new fresh model1 de novo	
# input <- layer_input(input_shape = input_shape = c(300))  # For numerical input, e.g., trainDataX	
input <- layer_input(shape = c(1), dtype = "string")   # for raw text input as string, needs to match exp next layer	
output <- input %>% 	
  text_vectorization() %>% 	
  layer_embedding(input_dim = num_words + 1, output_dim = 32) %>%	
  layer_global_average_pooling_1d() %>%	
  layer_dense(units = 16, activation = "relu") %>%	
  layer_dropout(0.5) %>% 	
  layer_dense(units = 1, activation = "sigmoid")	
model1 <- keras_model(input, output)	
	
model1 %>% compile(	
  optimizer = 'adam',	
  loss = 'binary_crossentropy',	
  metrics = list('accuracy')	
)	
	
history <- model1 %>% fit(train_data$clinicalNotes,	
          as.numeric(train_data$surgLabel),  	
          epochs = 10, batch_size = 512, validation_split = 0.2, verbose=2)	
	
# Evaluate the model1 performance	
results <- model1 %>% evaluate(test_data$clinicalNotes, as.numeric(test_data$surgLabel), verbose = 0)	
results	
#' 	
#' 	
#' #### Naive - out-of-the-box prior-model assessment (without retraining)	
#' 	
#' In a naive approach, we can even evaluate the performance of the *prior model* ([English Google News 200B](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1)), i.e., assess transfer learning without any additional add-on training using the new problem-specific data. Remember that we have a univariate (binary) outcome and if we use `dataset_batch(32)`, the output will include a vector of 32 probability estimates.	
#' 	
#' We will see next that Keras knows how to extract elements from TensorFlow Datasets automatically making it a much more memory efficient alternative than loading the entire dataset to RAM before passing to Keras.	
#' 	
#' To build the DNN model, we need to specify the network topology as a stack of network layers that include (1) schema representing the unstructured text data (clinical note descriptions), and (2) Number and complexity of each subsequent layer in the model. For simplicity, in this example we will convert the 40 different medical units into binary "surgical" unit labels; 0 or 1 factors.	
#' 	
#' The unstructured text can be converted into embedding vectors of a fixed size, which simplifies the text processing. Using the transfer learning *prior* model, which includes a pre-trained text embedding and appears as the first DNN layer. This allows us to outsource the text preprocessing and transformation into quantitative information tensor. This is the key step illustrating the benefits of add-on based transfer-learning in fine-tuning previously trained models.	
#' 	
#' The result of using this transfer-learning prior is that the model is *invariant* with respect to the length of the input clinical text - the output shape of the embeddings is $(num\_examples\times embedding\_dimension)$.	
#' 	
#' 	
#  2. Naive - out-of-the-box prior-model assessment (without any retraining)	
# Transfer Learning based on nnlm-en-dim128 (prior model) Define only output layer structure	
library(keras)	
	
####### May have to remove outputs from prior runs!!!!! ########################	
# remove folders here: C:\Users\Dinov\AppData\Local\Temp\tfhub_modules ....#####	
model2 <- keras_model_sequential() %>% 	
  layer_hub(	
    handle = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1",	
    input_shape = list(),	
    dtype = tf$string,	
    trainable = FALSE   # Set to TRUE for full model retraining, we use FALSE for quick transfer learning	
  ) %>%	
  layer_dense(units = 1, activation = "sigmoid")  # add the binary labeling output layer format	
summary(model2)	
model2 %>% compile(	
  optimizer = 'adam',	
  loss = 'binary_crossentropy',	
  metrics = list('accuracy')	
)	
# Just estimate the final 128+1 coefficients of the final layer	
history <- model2 %>% fit(	
  train_data$clinicalNotes, train_data$surgLabel, 	
  epochs = 5, ### increase epochs for better performance	
  batch_size = 128	
)	
	
# Assess performance	
score <- model2 %>% evaluate(test_data$clinicalNotes, test_data$surgLabel)	
print(score)	
y_pred <- ifelse((model2 %>% predict(test_data$clinicalNotes)) >0.4, 1, 0)	
table(y_pred, test_data$surgLabel)	
#' 	
#' 	
#' 	
#' Clearly these surgical unit predictions can't be expected to be very reliable, as the model is not fine-tuned yet to respond specifically to *clinical text*.	
#' 	
#' The next step is to **compile the transfer-learning model** by specifying a *loss function* and an *optimizer* to facilitate the transfer-learning during the iterative network model fitting (fine-tuning). In this binary classification problem, we will use the `binary_crossentropy()` loss function. The model results in generating a probability value, which is presented as the output of the final DNN layer (the right-most single-unit layer with a sigmoid activation).	
#' 	
#' Another possible loss function for binary outcome is `mean_squared_error()`. However, binary_crossentropy is often better for dealing with probabilities as it measures the “distances” between probability distributions representing the predicted outcome and the ground-truth in supervised problems. Yet, `mean_squared_error()` is also applicable in a regression model setting. We will also employ [Adaptive Moment Estimation (ADAM)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam) as it's an effective optimizer.	
#' 	
#' #### Simple Transfer Learning 	
#' 	
#' Let's use the `nnlm-en-dim128` (prior model) to define an expanded DNN model by adding additional four layers at the end to customize the deep neural network to our specific clinical data.	
#' 	
#' 	
# 3. Transfer Learning based on the nnlm-en-dim128 (prior model) Define expanded DNN model structure + 4 layers	
model3 <- keras_model_sequential() %>% 	
  layer_hub(	
    handle = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1",	
    input_shape = list(),	
    dtype = tf$string,	
    trainable = FALSE   # Set to TRUE for full model retraining, we use FALSE for quick transfer learning	
  ) %>% 	
  # modify default pre-trained model by adding 4 extra layers at the end tuned for our clinical text (medical notes)	
  layer_dense(units = 64, activation = "sigmoid") %>% 	
  #layer_dropout(rate = 0.5) %>% 	
  layer_dense(units = 32, activation = "sigmoid") %>% 	
  #layer_dropout(rate = 0.5) %>%	
  layer_dense(units = 16, activation = "sigmoid") %>% 	
  #layer_dropout(rate = 0.5) %>%	
  layer_dense(units = 1, activation = "sigmoid")	
#   layer_dense(units = 16, activation = "relu") %>% 	
#   layer_dense(units = 6, activation = "relu") %>% 	
#   layer_dense(units = 1, activation = "sigmoid")	
	
summary(model3)	
model3 %>% compile(	
  optimizer = 'adam',	
  loss = 'binary_crossentropy',	
  metrics = list('accuracy')	
)	
	
history <- model3 %>% fit(	
  train_data$clinicalNotes, train_data$surgLabel, 	
  epochs = 10, ### increase epochs for better performance	
  batch_size = 128	
)	
	
# Assess performance	
score <- model3 %>% evaluate(test_data$clinicalNotes, test_data$surgLabel)	
print(score)	
y_pred <- ifelse((model3 %>% predict(test_data$clinicalNotes)) >0.48, 1, 0)	
table(y_pred, test_data$surgLabel)	
#' 	
#' 	
#' #### Full-scale Transfer learning 	
#' 	
#' Next we will use the structure/topology of the pre-trained model, but estimate all $124M$ network parameters, not only the final $11K$ parameters at the end, as we did earlier.	
#' 	
#' 	
# 4. Full-scale Transfer learning using the skeleton of the pre-trained model, but estimating all parameters 	
model4 <- keras_model_sequential() %>% 	
  layer_hub(	
    handle = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1",	
    input_shape = list(),	
    dtype = tf$string,	
    trainable = TRUE   # Set to FALSE for simple TL-model retraining, we use TRUE for full-transfer learning	
  ) %>% 	
  # modify default pre-trained model by adding 4 extra layers at the end tuned for our clinical text (medical notes)	
  layer_dense(units = 64, activation = "sigmoid") %>% 	
  #layer_dropout(rate = 0.5) %>% 	
  layer_dense(units = 32, activation = "sigmoid") %>% 	
  #layer_dropout(rate = 0.5) %>%	
  layer_dense(units = 16, activation = "sigmoid") %>% 	
  #layer_dropout(rate = 0.5) %>%	
  layer_dense(units = 1, activation = "sigmoid")	
#   layer_dense(units = 16, activation = "relu") %>% 	
#   layer_dense(units = 6, activation = "relu") %>% 	
#   layer_dense(units = 1, activation = "sigmoid")	
	
summary(model4)	
model4 %>% compile(	
  optimizer = 'adam',	
  loss = 'binary_crossentropy',	
  metrics = list('accuracy')	
)	
	
history <- model4 %>% fit(	
  train_data$clinicalNotes, train_data$surgLabel, 	
  epochs = 10, ### increase epochs for better performance	
  batch_size = 128	
)	
	
# Assess performance	
score <- model4 %>% evaluate(test_data$clinicalNotes, test_data$surgLabel)	
print(score)	
y_pred <- ifelse((model4 %>% predict(test_data$clinicalNotes)) >0.47, 1, 0)	
table(y_pred, test_data$surgLabel)	
#' 	
#' 	
#' The final pair of steps include:	
#' 	
#'  - **Training**. *Transfer learning* involving fine-tuning the model starting with a prior pre-trained model, which is re-trained on the specific medical text (training) data,  and 	
#'  - **Validation**. The learning process involves repeated model estimation using mini-batches of 512 samples (see `dataset_batch()`) with 10 (for speed) or more (e.g., 100+, for accuracy and precision) epochs. This process involves 10 (or 100+) iterations over all samples in the dataset. During the fine-tuning training process, the transfer learner will report the initial and each subsequent model *loss-value* (optimization measure) and *accuracy* (fidelity measure) on sets of 10,000 samples from the validation set (see `dataset_shuffle()`).	
#' 	
#' 	
# Evaluate the model	
# Examine the model performance. 	
# mind the trajectories of the Loss (representing the error), 	
# lower values are better), and accuracy, high values are better	
library(plotly)	
	
plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$loss,	
        type = "scatter", mode="markers+lines", name="Loss") %>% 	
  add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$accuracy,	
        type = "scatter", mode="markers+lines", name="Accuracy") %>%	
  layout(title="DNN Training Performance", xaxis=list(title="epoch"),	
         yaxis=list(title="Metric Value"), legend = list(orientation='h'),	
         hovermode = "x unified")	
	
# subplot(pl_loss, pl_acc, nrows=2, shareX = TRUE, titleX = TRUE)	
#' 	
#' 	
#' This simple transfer learning approach achieves an accuracy of about 73-76%. More model customization and longer training are expected to significantly improve the performance of the fine-tuned transfer-learning DNN model. Additional information about [`R`-based tensorflow DNN modeling is available here](https://tensorflow.rstudio.com/guide/tensorflow/eager_execution/) and [here](https://tensorflow.rstudio.com/learn/resources/).	
#' 	
#' 	
#' ### Multinomial Transfer Learning classification of Clinical Text	
#' 	
#' Load all the appropriate R/Python packages and set up the RStudio environment.	
#' 	
#' 	
library(keras)	
library(reticulate)	
library(tfhub)	
library(tfds) 	
library(tfdatasets)	
library(utf8)	
	
use_condaenv("r-reticulate", required = TRUE)	
tensorflow::tf_config()	
py_module_available("tensorflow_hub")	
#' 	
#' 	
#' The same clinical data can be used for multinomial classification, where the *outcome* is the clinical specialty unit (there are 40 hospital units in this case-study), the *input* is the given clinical text. Start by defining the special labels (clinical units). The prediction of the 40-class labels will depend on the input $x$ consisting of the string `clinicalNotes`, representing the concatenated *transcriptions* and *descriptions*.	
#' 	
#' In this transfer learning example of multiclass text classification, we will utilize the [gnews-swivel-20dim model with text embedding trained on English Google News 130GB corpus](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1).	
#' 	
#' 	
library(stringi)	
dataCT <- read.csv('https://umich.instructure.com/files/21152999/download?download_frd=1', header=T) 	
	
dataCT$description <- stri_encode(dataCT$description, "", "UTF-8") 	
dataCT$transcription <- stri_encode(dataCT$transcription, "", "UTF-8") 	
	
# Concatenate Transcriptions and Descriptions into one string/character: clinicalNotes	
dataCT$clinicalNotes <- paste(dataCT$description, dataCT$transcription)	
	
convert_specialty <- list()	
keys <- unique(dataCT$medical_specialty)	
medical_specialtyNames <- dataCT$medical_specialty	
values <- 1:length(keys)	
for(i in 1:length(keys)) { convert_specialty[keys[i]] <- values[i] }	
specialty <- c()	
for (i in 1:length(dataCT$medical_specialty)){	
  specialty[i] <- as.numeric(convert_specialty[dataCT$medical_specialty[i]])	
}	
	
dataCT$medical_specialty <- specialty	
dataCT$medical_specialty <- matrix(dataCT$medical_specialty,	
                                   nrow = length(dataCT$medical_specialty), ncol = 1)	
	
# Convert labels to categorical one-hot encoding	
one_hot_SpecialtyLabels <- to_categorical(dataCT$medical_specialty,	
               num_classes = length(unique(dataCT$medical_specialty))+1)	
one_hot_SpecialtyLabels <- one_hot_SpecialtyLabels[, -1] # remove empty column 1	
# library(keras)	
# labels <- to_categorical	
# sum(one_hot_SpecialtyLabels)  [1] 4999	
	
num_words <- 10000	
max_length <- 300	
text_vectorization <- layer_text_vectorization(max_tokens = num_words, output_sequence_length = max_length)	
	
train_set_ind <- sample(nrow(dataCT), floor(nrow(dataCT)*0.8)) # 80:20 plot training:testing	
train_data <- dataCT[train_set_ind, ]	
test_data <- dataCT[-train_set_ind, ]	
one_hot_SpecialtyLabels_trainY <- one_hot_SpecialtyLabels[train_set_ind, ]	
one_hot_SpecialtyLabels_testY  <- one_hot_SpecialtyLabels[-train_set_ind, ]	
 	
# input <- layer_input(shape = c(1), dtype = "string")   # for raw text input as string, needs to match exp next layer	
# output <- input %>% 	
#   text_vectorization() %>% 	
#   layer_embedding(input_dim = num_words + 1, output_dim = 256) %>%	
#   layer_global_average_pooling_1d() %>%	
#   layer_dense(units = 256, activation = "relu") %>%	
#   layer_dropout(0.25) %>% 	
#   layer_dense(units = 128, activation = "relu") %>%	
#   layer_dropout(0.25) %>% 	
#   layer_dense(units = 64, activation = "relu") %>%	
#   # layer_dropout(0.25) %>% 	
#   layer_dense(units = length(keys), activation = 'softmax')	
# model2 <- keras_model(input, output)	
# 	
# model2 %>% compile(	
#   loss = 'categorical_crossentropy', 	
#   optimizer = optimizer_sgd(learning_rate = 0.01, decay = 1e-6, momentum = 0.9, nesterov = TRUE),	
#   metrics = list('accuracy')	
# )	
# 	
# history2 <- model2 %>% fit(train_data$clinicalNotes, one_hot_SpecialtyLabels_trainY,  	
#                   epochs = 10, batch_size = 512, validation_split = 0.2, verbose=2)	
# 	
# # Evaluate the model2 performance	
# results2 <- model2 %>% evaluate(test_data$clinicalNotes, one_hot_SpecialtyLabels_testY, verbose = 2)	
# results2  	
# 	
# score <- model2 %>% evaluate(test_data$clinicalNotes, one_hot_SpecialtyLabels_testY)	
# print(score)	
# y_pred <- model2 %>% predict(test_data$clinicalNotes)	
# head(apply(y_pred, 1, which.max))  # table(apply(y_pred, 1, which.max))	
# # hist(y_pred[,8])	
# table(y_pred, test_data$medical_specialty)	
# 	
# ============================================	
  	
model3 <- keras_model_sequential() %>% 	
  layer_hub(	
    handle = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1",	
    input_shape = list(),	
    dtype = tf$string,	
    trainable = TRUE	
  ) %>% 	
  layer_dense(units = 256, activation = "relu") %>%	
  layer_dropout(0.25) %>% 	
  layer_dense(units = 128, activation = "relu") %>%	
  layer_dropout(0.25) %>% 	
  layer_dense(units = 64, activation = "relu") %>%	
  # layer_dropout(0.25) %>% 	
  layer_dense(units = length(keys), activation = 'softmax')	
	
summary(model3)	
	
model3 %>% compile(	
  loss = 'categorical_crossentropy', 	
  optimizer = optimizer_sgd(learning_rate = 0.01, decay = 1e-6, momentum = 0.9, nesterov = TRUE),	
  metrics = list('accuracy')	
)	
	
history3 <- model3 %>% fit(train_data$clinicalNotes, one_hot_SpecialtyLabels_trainY,  	
                  epochs = 100, batch_size = 512, validation_split = 0.2, verbose=2)	
	
results3 <- model3 %>% evaluate(test_data$clinicalNotes, one_hot_SpecialtyLabels_testY, verbose = 2)	
print(paste0("Mind that the testing-case performance metrics (Loss=", round(results3["loss"], 3), 	
             " and Accuracy=", round(results3["accuracy"], 3),	
             ") of the DNN text classification reflect results of ",	
             length(keys), " medical specialties (classes), not a binary classification!")) 	
	
score <- model3 %>% evaluate(test_data$clinicalNotes, one_hot_SpecialtyLabels_testY)	
print(score)	
y_pred <- model3 %>% predict(test_data$clinicalNotes)	
head(apply(y_pred, 1, which.max))  # table(apply(y_pred, 1, which.max))	
y_pred_class <- apply(y_pred, 1, which.max)	
# hist(y_pred[,8])	
table(y_pred_class, test_data$medical_specialty[,1])	
# DT::datatable(matrix(table(y_pred_class, test_data$medical_specialty[,1]),40,40) )	
	
heat <- matrix(0, 40, 40)	
for ( i in 1:length(test_data$clinicalNotes)) { 	
  heat[test_data$medical_specialty[i, 1], y_pred_class[i]] =  	
    heat[test_data$medical_specialty[i, 1], y_pred_class[i]] + 1 	
}	
	
plot_ly(x =~keys, y = ~keys, z = ~heat, name="Model Performance",	
        hovertemplate = paste('Matching: %{z:.0f}', 	
                              '
True: %{x}
', 'Pred: %{y}'),	
        colors = 'Reds', type = "heatmap") %>% 	
  layout(title="Predicated Classes vs. True Clinical Units", 	
         xaxis=list(title="Actual Class"), yaxis=list(title="Predicted Class"))	
#' 	
#' 	
#' 	
#' 	
#' 	
#' 	
#' 	
#' ### Binary Classification of Film Reviews	
#' 	
#' All readers are encouraged to try text-based transfer learning using [alternative datasets](https://umich.instructure.com/courses/38100/files/folder/Case_Studies), e.g., the [50,000 movie reviews dataset](https://github.com/tensorflow/datasets/blob/master/docs/datasets.md#imdb_reviews). The code skeleton below illustrates the basic pipeline workflow for the movie review's binary classifications.	
#' 	
#' 	
# Load Movie Reviews (50K)	
# split the entire dataset into a list of 3 objects:	
# imdb[[1]]=training_set, imdb[[2]]=testing_set, imdb[[3]]=validation_set	
imdb <-	
    tfds::tfds_load	(	
    "imdb_reviews:1.0.0",	
    split = list("train[:60%]", "train[-40%:]", "test"),	
    as_supervised = TRUE	
)	
# summary(imdb)	
	
# tfds_load returns a TensorFlow Dataset, an abstraction representing a list	
# of elements, in which each element consists of one or more components.	
# To access individual elements of a Dataset:	
# 	
firstBatch <- imdb[[1]] %>%	
  dataset_batch(1) %>% # Used to get only the first example	
  reticulate::as_iterator() %>%	
  reticulate::iter_next()	
str(firstBatch)	
	
review1 <- as_utf8(as.character(firstBatch[1][[1]]$numpy()[1][[1]])) # get text-review (string)	
label1 <- as.numeric(firstBatch[2][[1]]$numpy())    # get binary class (0/1)	
	
embedding_layer <- layer_hub(	
  handle ="https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1")  	
	
embedding_layer(firstBatch[[1]])	
	
# build the complete model	
model <- keras_model_sequential() %>% 	
  layer_hub(	
    handle = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1",	
    input_shape = list(),	
    dtype = tf$string,	
    trainable = TRUE	
  ) %>% 	
  layer_dense(units = 16, activation = "relu") %>% 	
  layer_dense(units = 8, activation = "relu") %>% 	
  layer_dense(units = 1, activation = "sigmoid")	
summary(model)	
	
# compile model	
model %>% 	
  compile(optimizer="adam", loss="binary_crossentropy", metrics="accuracy")	
	
# model training	
history <- model %>% 	
  fit(	
    imdb[[1]] %>% dataset_shuffle(10000) %>% dataset_batch(512),	
    epochs = 4,  # for convergence, use larger number of epochs (e.g., 20+)	
    validation_data = imdb[[2]] %>% dataset_batch(512), verbose = 2)	
	
# plot performance	
pl_loss <- plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$loss,	
        type = "scatter", mode="markers+lines", name="Loss") %>% 	
  add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$val_loss,	
        type = "scatter", mode="markers+lines", name="Validation Loss") %>%	
  layout(title="DNN Training/Validation Performance", xaxis=list(title="epoch"),	
         yaxis=list(title="Metric Value"), legend = list(orientation='h'),	
         hovermode = "x unified")	
	
pl_acc <- plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$accuracy,	
        type = "scatter", mode="markers+lines", name="Accuracy") %>% 	
  add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$val_accuracy,	
        type = "scatter", mode="markers+lines", name="Validation Accuracy") %>%	
  layout(title="DNN Training/Validation Performance", xaxis=list(title="epoch"),	
         yaxis=list(title="Metric Value"), legend = list(orientation='h'),	
         hovermode = "x unified")	
	
subplot(pl_loss, pl_acc, nrows=2, shareX = TRUE, titleX = TRUE)	
# model evaluation on testing data	
model %>% 	
  evaluate(imdb[[3]] %>% dataset_batch(512), verbose = 0)	
#' 	
#' 	
#' ## Image classification	
#' 	
#' Similar to the unstructured text-mining (film review case) we illustrated above, we can use DNN transfer learning for *image classification.* 	
#' 	
#' ### Performance Metrics	
#' 	
#' #### Binary Cross-Entropy Measure	
#' 	
#' The *cross-entropy* measure of dissimilarity between two discrete probability distributions $p$ (true state) and $q$ (predicted state) with identical support $X$ is defined as	
#' 	
#' $$H(p,q) = -\sum _{x_i\in X}{p(x_{i})\log q(x_{i})}.$$	
#' For binary outcomes, logistic regression transforms the log-loss over all training observations, i.e., it optimizes the average cross-entropy in the sample. 	
#' 	
#' For a sample indexed by $n = 1, \cdots, N$, the expected (average) loss function is:	
#' 	
#' $$J(w) ={\frac{1}{N}} \sum _{n=1}^{N}H(p_{n},q_{n})\ =\ -{\frac {1}{N}}\sum _{n=1}^{N}\ {\bigg [}y_{n}\log {\hat {y}}_{n}+(1-y_{n})\log(1-{\hat {y}}_{n}){\bigg ]},$$	
#' 	
#' where ${\hat {y}}_{n}\equiv g(w \cdot x_{n})=\frac{1}{1+e^{-w \cdot x_{n}}}$ and $g(z)$ is the logistic function. The logistic loss is the *cross-entropy loss* or *log-loss*, and binary refers to the situation of  binary outcome labels $\{-1,+1\}$.	
#' 	
#' Hence, the *binary cross-entropy* (*BCE*) is simply	
#' 	
#' $$H(p,q) = -\sum _{x_i\in X}{p(x_{i})\log q(x_{i})}	
#' = -y\log {\hat {y}}-(1-y)\log(1-{\hat {y}}),$$	
#' where $p \in \{ y , 1 − y \}$ and $q \in \{ \hat {y}, 1 − \hat {y} \}$ represent the probability of the *true* and *predicted* binary outcomes, respectively.	
#' 	
#' High or low BCE values indicate "bad" or "good" model performance, respectively, with a perfect model having a $BSE\approx 0$.	
#' 	
#' #### Dice Coefficient	
#' 	
#' The [Sørensen–Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) (*Dice Coefficient*) is another measure to assess the similarity between two sets, samples, or distributions. In our case we are applying the dice coefficient to track the overlap between the true brain-tumor masks, and the DCNN-derived mask-estimate (prediction) of the tumor based on the raw brain image.	
#' 	
#' | Discrete sets $X$ and $Y$ | (Boolean) Binary Data | Probabilities (e.g., quantiles) |	
#' |----------------|---------------|-------------------|	
#' | $D=\frac{2 |X\cap Y|}{|X|+|Y|}$, $|\cdot |$ is set cardinality |  TP=true positive, FP=false positive, FN=false negative, $D=\frac {2TP}{2TP+FP+FN}$ | $D=\frac {2|{\bf{p}}\cdot {\bf {q}}|}{|{\bf{p}}|^{2}+|{\bf {q}}|^{2}}$ |	
#' 	
#' 	
#' ### `Torch` Deep Convolutional Neural Network (CNN)	
#' 	
#' The [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/), shown on the image below, is an example of a DCNN.	
#' 	
#' The U-shaped CNN (U-Net) represents successive convolutional layers with max-pooling. During the auto-encoding (left-down-hill branch) the U-Net reduces image resolution (downsampling), whereas during the subsequent decoding phase (right-uphill branch) upsamples the images to arrive at an output of the same size as the original input. The information analysis (encoding) and synthesis (decoding) facilitate the labeling of each output image pixel by feeding information in each decoding layer from the corresponding encoding layer with matching resolution in the downsizing encoding layer.	
#' 	
#' Each upsampling (decoding) step concatenates the output from the previous layer with that from its counterpart in the compression (encoding) step. The final decoding output is a mask of the same size as the original image, derived by a $1\times 1$-convolution, which does not require a dense layer at the end as the output convolutional layer represents a single filter. Below we show how to load, train, and use a U-Net for transfer learning in 2D image segmentation. Note that this model has over $3M$ trainable parameters. You can see an [R example of a Unet model for input-output tensors of shape=c(128,128)](https://github.com/rstudio/keras/blob/master/vignettes/examples/unet.R), see lines 73-183.	
#' 	
#' 	
# If necessary, download the U-Net package, before you load it into R	
# remotes::install_github("r-tensorflow/unet")	
library(unet)	
library(tibble)	
	
# The u-Net call takes additional parameters, e.g., number of downsizing blocks, number of filters to start with, 	
# number of classes to identify; # ?unet provides details. For instance, we can specify the shape	
# of the input images we will be segmenting tumors for: 256*256 3-channel RGB images.	
model <- unet(input_shape = c(256, 256, 3))	
	
# to print the model as text output, run:	
# model	
# Results: # Trainable params: 31,031,745	
#' 	
#' 	
#' 	
#' #### Data Import	
#' 	
#' Let's first download and load in the [Brain Tumor Imaging dataset](https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes). These data come from a 2019 study on [Association of genomic subtypes of lower-grade gliomas with shape features automatically extracted by a deep learning algorithm](https://doi.org/10.1016/j.compbiomed.2019.05.002). The 2D brain MR images are paired with 2D tumor masks, which are trivial for controls and non-trivial for patients, using [The Cancer Imaging Archive (TCIA)](https://wiki.cancerimagingarchive.net/display/Public/TCGA-LGG). The data represent 110 patients with lower-grade glioma and include fluid-attenuated inversion recovery (FLAIR) MRI scans. There are 3-channels of the MRI data; *pre-contrast, FLAIR*, and *post-contrast*. The corresponding tumor masks were obtained by manual-delineations on the FLAIR images by board-certified radiologists. 	
#' 	
#' 	
#' 	
# If you need to start a clean fresh run, remove all old files first! Be careful with this! set eval=T in all R-blocks!	
##### First check > list.files("/data/")	
##### do.call(file.remove, list(list.files("/data", full.names = TRUE)))	
##### unlink("/data/*", recursive=TRUE, force=TRUE)	
pathToZip <- tempfile() 	
download.file("https://umich.instructure.com/files/21813670/download?download_frd=1", pathToZip, mode = "wb")	
zip::unzip(pathToZip, files=NULL, exdir = "/data")	
#' 	
#' 	
#' 	
library(tibble)	
library(rsample)	
	
train_dir <- "/data/data"	
valid_dir <- "/data/mri_valid"	
	
library(magick)   # Needed for TIFF --> PNG image conversion and other image processing tasks	
#' 	
#' 	
#' Create the necessary directories to store the training and validation imaging data (brain MRIs and tumor masks).	
#' 	
#' 	
# check if ReadMe file is accessible	
# file.rename("/data/ReadMe_TCGA_MRI_Segmentation_Data_Phenotypes.txt", train_dir)	
	
# Import the meta-data	
meta_data <- read.csv("/data/TCGA_MRI_Segmentation_Data_Phenotypes.csv")	
	
# note that these are relative file/directory names. To see the complete local path	
# tempdir(); getwd()	
	
# Create a validation folder	
dir.create(valid_dir)	
	
# Check all n=110 patients are accessible	
patients <- list.dirs(train_dir, recursive = FALSE)	
length(patients)	
	
# Randomly select 20 Patients for validation, remaining 90=110-20 are for training the DNN model	
valid_indices <- sample(1:length(patients), 20)	
valid_indices	
patients[valid_indices] # prints the actual folders where the validation participants' data is	
	
# Extract and Relocate the Validation cases (separate them from training data)	
for (i in valid_indices) {	
  dir.create(file.path(valid_dir, basename(patients[i])))	
  for (f in list.files(patients[i])) {    	
    file.rename(file.path(train_dir, basename(patients[i]), f), file.path(valid_dir, basename(patients[i]), f))    	
  }	
  unlink(file.path(train_dir, basename(patients[i])), recursive = TRUE) # clean	
}	
	
# Confirm that only 80 patients are left in the standard data folder	
# list all training data imaging files: list.dirs(train_dir, recursive = FALSE)	
length(list.dirs(train_dir, recursive = FALSE))	
	
# and 30-60 validation cases are in the validation folder	
length(list.dirs(valid_dir, recursive = FALSE))	
	
# and check validation data	
length(list.files(valid_dir, recursive = T)) # [1] 1268	
#' 	
#' 	
#' Define data-frames containing the file-names for all training and validation data.	
#' 	
#' 	
# Identify the TRAINING and VALIDATION data objects (raw images + tumor masks) as filenames	
data_train <- tibble(	
  img = grep(list.files(train_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),	
        pattern = 'mask', invert = TRUE, value = TRUE),	
  mask = grep(list.files(train_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),	
        pattern = 'mask', value = TRUE)	
)	
data_valid <- tibble(	
  img = grep(list.files(valid_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),	
        pattern = 'mask', invert = TRUE, value = TRUE),	
  mask = grep(list.files(valid_dir, full.names = TRUE, pattern = "tif", recursive = TRUE),	
        pattern = 'mask', value = TRUE)	
)	
#' 	
#' 	
#' (Optionally) convert all 2D TIFF images to PNG RGB format; this may be necessary to ensure the input images are 3-channels, and are correctly interpreted as tensorflow objects.	
#' 	
#' 	
# If all training + testing data are in one folder, split them by:	
#  data <- initial_split(data_train, prop = 0.8)	
	
# convert all Training Data: TIFF images and masks to PNG format (for easier TF processing downstream)	
files_img_tif <- data_train$img[grepl("\\.tif$", data_train$img), drop = TRUE]	
data_train_img_png <- lapply(files_img_tif,	
      function(x) { 	
        # image_write(image_read(x), path = gsub(".tif$", ".png", x), format = "png") 	
        a = image_convert(image_read(x),  format = "png")	
        image_write(a, path = gsub(".tif$", ".png", x), format = "png") 	
    }	
  )	
	
files_mask_tif <- data_train$mask[grepl("\\.tif$", data_train$mask), drop = TRUE]	
data_train_mask_png <- lapply(files_mask_tif, 	
     function(x) { 	
       # image_write(image_read(x), path = gsub(".tif$", ".png", x), format = "png") 	
       a = image_convert(image_read(x),  format = "png")	
       image_write(a, path = gsub(".tif$", ".png", x), format = "png") 	
    }	
  )	
	
# Similarly convert all Validation Data	
# convert all TIFF images and masks to PNG format (for easier TF processing downstream)	
files_valid_img_tif <- data_valid$img[grepl("\\.tif$", data_valid$img), drop = TRUE]	
data_valid_img_png <- lapply(files_valid_img_tif,	
      function(x) { 	
        # image_write(image_read(x), path = gsub(".tif$", ".png", x), format = "png") 	
        a = image_convert(image_read(x),  format = "png")	
        image_write(a, path = gsub(".tif$", ".png", x), format = "png") 	
    }	
  )	
	
files_valid_mask_tif <- data_valid$mask[grepl("\\.tif$", data_valid$mask), drop = TRUE]	
data_valid_mask_png <- lapply(files_valid_mask_tif, 	
     function(x) { 	
       # image_write(image_read(x), path = gsub(".tif$", ".png", x), format = "png") 	
       a = image_convert(image_read(x),  format = "png")	
       image_write(a, path = gsub(".tif$", ".png", x), format = "png") 	
    }	
  )	
	
# Check that the TIF --> PNG conversion worked, inspect one case	
head(list.files("/data/data/TCGA_HT_A61A_20000127"))	
# data_valid  # check root directory	
	
# Inspect some of the images/masks	
# image_info(image_read(data_train_img_png[[3]]))	
# image_write(image_read(data_train$img[3]), format = "tiff")	
# image_write(image_read(data_train$img[3]), path = paste0(data_train$img[3], ".png"), format = "png")	
# a <- image_read(paste0(data_train$img[3], ".png"))	
	
# list.files(train_dir) 	
# To clean previous file references	
# # delete a directory -- must add recursive = TRUE	
# unlink("/data", recursive = TRUE); # Clean space # gc(full=T)	
#' 	
#' 	
#' Derive a binary class label - cancer (for non-trivial tumor masks) or control (for empty tumor masks).	
#' 	
#' 	
# Compute a new binary outcome variable 1=Brain Tumor (mask has at least 1 white pixel), 0=Normal Brain, no white pixels in the mask	
pos_neg_diagnosis <- sapply(data_train$mask,	
     function(x) {   value = max(imager::magick2cimg(image_read(x)))	
         ifelse (value > 0, 1, 0)  }	
  )	
table(pos_neg_diagnosis)   #; head(data_train)	
# pos_neg_diagnosis	
#    0    1 	
# 2046 1103 	
	
# Add the normal vs. cancer label to training and testing datasets	
data_train$label <- pos_neg_diagnosis	
	
pos_neg_diagnosis_valid <- sapply(data_valid$mask,	
     function(x) {   value = max(imager::magick2cimg(image_read(x)))	
         ifelse (value > 0, 1, 0)  }	
  )	
table(pos_neg_diagnosis_valid)	
data_valid$label <- pos_neg_diagnosis_valid	
# head(data_valid)	
#' 	
#' 	
#' #### Torch-based Transfer Learning	
#' 	
#' Next we will ingest the *3-channel (RGB) imaging data* and the corresponding *tumor masks* (binary images) for each participant. The method `torch::dataset()` allows specifying `initialize()` and `.getitem()` methods for complex computable data objects. The first method `initialize()` creates the archive of *imaging* and *mask* file names that can be utilized by the second method `.getitem()` for iterating over all cases. The method `.getitem()` returns ordered input-output pairs and performs weighted sampling, with prevalence to large lesion images, which is useful for accounting for DNN training with imbalanced classes.	
#' 	
#' The training sets can be enhanced by *data augmentation* -- a process expanding the set of training images and masks via operations such as *flipping, resizing, and rotating* based on certain specifications.	
#' 	
#' Below we use [PyTorch](https://blog.rstudio.com/2020/09/29/torch/) to define a **brain_dataset** method providing a larger *augmented* *training* dataset, new size `length(train_ds) ~ 2K`, and a larger *validation* set, new size `length(valid_ds)~1K`. In practice, we can use any alternative transfer-learning strategy including `pytorch`, `tensorflow`, `theano`, etc.	
#' 	
#' Note that `unet` training takes significant computational time; training 20-epochs took a total of 600 compute hours, which translates into a couple of days of computing on a 20-core server. We have provided [several precomputed/pre-trained *.pt models on Canvas](https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes).	
#' 	
#' 	
# First check that all brain imaging data is already downloaded and unzipped, if not, see the "Data Import" section above	
	
# # train_dir <- "/data/data"	
# # valid_dir <- "/data/mri_valid"	
# # 	
# # library(magick)   # Needed for TIFF --> PNG image conversion and other image processing tasks	
# # 	
# # # check if ReadMe file is accessible	
# # #file.rename("/data/ReadMe_TCGA_MRI_Segmentation_Data_Phenotypes.txt", train_dir)	
# # # Import the meta-data	
# # meta_data <- read.csv("/data/TCGA_MRI_Segmentation_Data_Phenotypes.csv")	
# 	
# # note that these are relative file/directory names. To see the complete local path	
# # tempdir(); getwd()	
# 	
# # Create a validation folder	
# # dir.create(valid_dir)	
# 	
# # Check all n=110 = 90+20 patients are accessible	
# patients <- list.dirs(train_dir, recursive = FALSE)	
# length(patients)	
# 	
# # Confirm that only 80 patients are left in the standard data folder	
# # list all training data imaging files: list.dirs(train_dir, recursive = FALSE)	
# length(list.dirs(train_dir, recursive = FALSE))	
# 	
# # and 20-70 validation cases are in the validation folder	
# length(list.dirs(valid_dir, recursive = FALSE))	
# 	
# # and check validation data	
# length(list.files(valid_dir, recursive = T)) 	
# 	
# library(tibble)	
# # Identify the TRAINING and VALIDATION data objects (raw images + tumor masks) as filenames	
# data_train <- tibble(	
#   img = grep(list.files(train_dir, full.names = TRUE, pattern = "png", recursive = TRUE),	
#         pattern = 'mask', invert = TRUE, value = TRUE),	
#   mask = grep(list.files(train_dir, full.names = TRUE, pattern = "png", recursive = TRUE),	
#         pattern = 'mask', value = TRUE)	
# )	
# data_valid <- tibble(	
#   img = grep(list.files(valid_dir, full.names = TRUE, pattern = "png", recursive = TRUE),	
#         pattern = 'mask', invert = TRUE, value = TRUE),	
#   mask = grep(list.files(valid_dir, full.names = TRUE, pattern = "png", recursive = TRUE),	
#         pattern = 'mask', value = TRUE)	
# )	
# 	
# # library(rsample)	
# #	
# # # Compute a new binary outcome variable 1=Brain Tumor (mask has at least 1 white pixel), 0=Normal Brain, no white pixels in the mask	
# pos_neg_diagnosis <- sapply(data_train$mask,	
#      function(x) {   value = max(imager::magick2cimg(image_read(x)))	
#          ifelse (value > 0, 1, 0)  }	
#   )	
# table(pos_neg_diagnosis)   #; head(data_train)	
# # pos_neg_diagnosis	
# #    0    1	
# # 2046 1103	
# 	
# # Add the normal vs. cancer label to training and testing datasets	
# data_train$label <- pos_neg_diagnosis	
# 	
# pos_neg_diagnosis_valid <- sapply(data_valid$mask,	
#      function(x) {   value = max(imager::magick2cimg(image_read(x)))	
#          ifelse (value > 0, 1, 0)  }	
#   )	
# table(pos_neg_diagnosis_valid)	
# data_valid$label <- pos_neg_diagnosis_valid	
# head(data_valid)	
#' 	
#' 	
#' Below, we define a function `brain_dataset()` required for iterative retrieval of pytorch datasets. All such datasets need to have a method called `initialize()` instantiating the inventory of *imaging* and *mask* file names that will be used by the second method, `.getitem()`, to ingest the imaging data from these files, return input-image plus mask-target pairs, and perform data-augmentation. The parameter `random_sampling=TRUE`, `.getitem()` controls the weighted sample loading of image-mask pairs with larger in size tumors. This option is used with the training set to counter any class-label imbalances.	
#' 	
#' The training sets, but not the validation sets, use of *data augmentation* to ensure DCNN-invariance to specific spatiotemporal and intensity transformations. During imaging-data augmentation, the training images/masks may be *flipped, resized*, and *rotated* with specifiable probabilities for each type of augmentation transformation.	
#' 	
#' 	
#' 	
library(torch)	
library(torchvision)	
	
# data wrangling	
library(tidyverse)	
library(zeallot)   # needed for the piping function "%<-%" in "brain_dataset()"	
	
# image processing and visualization	
library(magick)	
#library(cowplot)	
	
# dataset loading 	
library(pins)	
library(zip)	
	
torch_manual_seed(1234)	
set.seed(1234)	
	
brain_dataset <- dataset(	
  name = "brain_dataset",	
  # 1. Initialize	
  initialize = function(img_dir, augmentation_params = NULL, random_sampling = FALSE) {	
    self$images <- tibble(img = grep(list.files(img_dir, full.names = TRUE, pattern = "png", recursive = TRUE),	
                                     pattern = 'mask', invert = TRUE, value = TRUE),	
                          mask = grep(list.files(img_dir, full.names = TRUE,pattern = "png", recursive = TRUE),	
                                      pattern = 'mask',value = TRUE)	
                          )	
    self$slice_weights <- self$calc_slice_weights(self$images$mask)	
    self$augmentation_params <- augmentation_params	
    self$random_sampling <- random_sampling	
  },	
  	
  # 2. Load and transform images from files into TF object elements (tensors)	
  .getitem = function(i) {index <- if (self$random_sampling == TRUE) sample(1:self$.length(), 1, prob = self$slice_weights)	
    else i	
    img <- self$images$img[index] %>% image_read() %>% transform_to_tensor() 	
    mask <- self$images$mask[index] %>% image_read() %>% transform_to_tensor() %>% transform_rgb_to_grayscale() %>%	
      torch_unsqueeze(1)	
    img <- self$min_max_scale(img)	
    	
    if (!is.null(self$augmentation_params)) {	
      scale_param <- self$augmentation_params[1]	
      c(img, mask) %<-% self$resize(img, mask, scale_param)	
      	
      rot_param <- self$augmentation_params[2]	
      c(img, mask) %<-% self$rotate(img, mask, rot_param)	
      	
      flip_param <- self$augmentation_params[3]	
      c(img, mask) %<-% self$flip(img, mask, flip_param)	
    }	
    list(img = img, mask = mask)	
  },	
  	
  # 3. Save the total number of imaging files	
  .length = function() { nrow(self$images)  },	
  	
  # 4. Estimate 2D image weights: Bigger tumor-masks correspond to higher weights	
  calc_slice_weights = function(masks) {	
    weights <- map_dbl(masks, function(m) {	
      img <- as.integer(magick::image_data(image_read(m), channels = "gray"))	
      sum(img / 255)	
    })	
    	
    sum_weights <- sum(weights)	
    num_weights <- length(weights)	
    	
    weights <- weights %>% map_dbl(function(w) {	
      w <- (w + sum_weights * 0.1 / num_weights) / (sum_weights * 1.1)	
    })	
    weights	
  },	
  	
  # 5. Estimate the image intensity range (min, max)	
  min_max_scale = function(x) {	
    min = x$min()$item()	
    max = x$max()$item()	
    x$clamp_(min = min, max = max)	
    x$add_(-min)$div_(max - min + 1e-5)	
    x	
  },	
  	
  # 6. Image tensor shape resizing (when necessary)	
  resize = function(img, mask, scale_param) {	
    img_size <- dim(img)[2]	
    rnd_scale <- runif(1, 1 - scale_param, 1 + scale_param)	
    img <- transform_resize(img, size = rnd_scale * img_size)	
    mask <- transform_resize(mask, size = rnd_scale * img_size)	
    diff <- dim(img)[2] - img_size	
    if (diff > 0) {	
      top <- ceiling(diff / 2)	
      left <- ceiling(diff / 2)	
      img <- transform_crop(img, top, left, img_size, img_size)	
      mask <- transform_crop(mask, top, left, img_size, img_size)	
    } else {	
      img <- transform_pad(img,padding = -c(ceiling(diff/2),floor(diff/2),ceiling(diff/2),floor(diff/2)))	
      mask <- transform_pad(mask, padding = -c(ceiling(diff/2), floor(diff/2),ceiling(diff/2),floor(diff/2)))	
    }	
    list(img, mask)	
  },	
  	
  # 7. Rotation (if/when augmentation is requested)	
  rotate = function(img, mask, rot_param) {	
    rnd_rot <- runif(1, 1 - rot_param, 1 + rot_param)	
    img <- transform_rotate(img, angle = rnd_rot)	
    mask <- transform_rotate(mask, angle = rnd_rot)	
    list(img, mask)	
  },	
  	
  # 8. Flipping (if/when augmentation is requested)	
  flip = function(img, mask, flip_param) {	
    rnd_flip <- runif(1)	
    if (rnd_flip > flip_param) {	
      img <- transform_hflip(img)	
      mask <- transform_hflip(mask)	
    }	
    list(img, mask)	
  }	
)	
#' 	
#' 	
#' Next, using the `brain_dataset()` method, we actually generate the (training and validation) imaging datasets as computable objects using the raw filenames in the training and validation data-frames defined above.	
#' 	
#' 	
train_ds <- brain_dataset(	
  train_dir,	
  augmentation_params = c(0.05, 15, 0.5),	
  random_sampling = TRUE	
)	
	
length(train_ds)	
# ~3K	
	
valid_ds <- brain_dataset(	
  valid_dir,	
  augmentation_params = NULL,	
  random_sampling = FALSE	
)	
length(valid_ds)	
# ~700	
#' 	
#' 	
#' Let's visualize one testing and one validation image-mask pairs using `plot_ly()`.	
#' 	
#' 	
library (plotly)	
rasterPlotly <- function (image, name="", hovermode = NULL) {	
  myPlot <- plot_ly(type="image", z=image, name=name, hoverlabel=name, text=name, 	
            hovertext=name, hoverinfo="name+x+y") %>% 	
    layout(hovermode = hovermode, xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
  return(myPlot)	
}	
	
# Training Case 20	
img_and_mask <- train_ds[20]	
img <- img_and_mask[[1]]  # 3-channel image	
img <- img_and_mask[[1]]$permute(c(2, 3, 1)) %>% as.array()  # 3-channel image	
mask <- img_and_mask[[2]]$squeeze() %>% as.array() # tumor mask	
mask <- EBImage::rgbImage(255*mask, 255*mask, 255*mask)  # manually generate a gray scale mask image	
	
p1 <- rasterPlotly(image=255*img, name = "RGB Image", hovermode = "y unified")	
p2 <- rasterPlotly(mask, name = "Tumor Mask", hovermode = "y unified")	
subplot(p1,p2, shareY = TRUE) %>% layout(title="Training Case 20: 3-Channel Image (Left) & Tumor Mask (Right)")	
	
# Validation Case 18	
img_and_mask <- valid_ds[18]	
img <- img_and_mask[[1]]	
mask <- img_and_mask[[2]]	
img <- img_and_mask[[1]]$permute(c(2, 3, 1)) %>% as.array()  # 3-channel image	
mask <- img_and_mask[[2]]$squeeze() %>% as.array() # tumor mask	
mask <- EBImage::rgbImage(255*mask, 255*mask, 255*mask)  # manually generate a gray scale mask image	
	
p1 <- rasterPlotly(image=255*img, name = "RGB Image", hovermode = "y unified")	
p2 <- rasterPlotly(mask, name = "Tumor Mask", hovermode = "y unified")	
subplot(p1,p2, shareY = TRUE) %>% layout(title="Validation Case 18: 3-Channel Image (Left) & Tumor Mask (Right)")	
#' 	
#' 	
#' Demonstrate training-data image-augmentation by using one image to generate 7 rows of 4 augmented images.	
#' 	
#' 	
N <- 7*4	
img_and_mask <- valid_ds[77]	
img <- img_and_mask[[1]]	
mask <- img_and_mask[[2]]	
	
imgs <- map (1:N, function(i) {	
  # spatial-scale factor	
  c(img, mask) %<-% train_ds$resize(img, mask, 0.25) 	
  c(img, mask) %<-% train_ds$flip(img, mask, 0.3)	
  c(img, mask) %<-% train_ds$rotate(img, mask, 45) 	
  	
  img %>%	
    transform_rgb_to_grayscale() %>%	
    as.array() %>%	
    plot_ly(z=., type="heatmap", showscale = FALSE, name=paste0("AugmImg=", i)) %>% 	
    layout(showlegend=FALSE,  hovermode = "y unified") #,	
          # yaxis = list(scaleratio = 1, scaleanchor = 'x'))	
})	
	
# imgs[[2]] 	
imgs %>%	
  subplot(nrows = 7, shareX = TRUE, shareY = TRUE, which_layout=1) %>%	
  layout(title="Validation Case 77: Random Rotation/Flop/Scale Image Augmentation")	
#' 	
#' 	
#' Instantiate training/validation data loaders using `torch::dataloader()`, which combines a data-set and a sampler-iterator into a single/multi-process iterator over the entire dataset.	
#' 	
#' 	
batch_size <- 25	
train_dl <- dataloader(train_ds, batch_size)	
valid_dl <- dataloader(valid_ds, batch_size)	
#' 	
#' 	
#' Next, we specify the `Unet` model indicating the number of “down” or encoding (analysis) depth for shrinking the input images and incrementing the number of filters, as well as specify how do we go “up” again during the decoding (synthesis) phase.	
#' 	
#' 	
# forward(), keeps track of layer outputs seen going “down,” to be added back in going “up.”	
unet <- nn_module(name="unet",	
  initialize = function(channels_in = 3, n_classes = 1, depth = 5, n_filters = 6) {	
    self$down_path <- nn_module_list()	
    prev_channels <- channels_in	
    for (i in 1:depth) {	
      self$down_path$append(down_block(prev_channels, 2^(n_filters+i-1)))	
      prev_channels <- 2^(n_filters+i-1)	
    }	
    	
    self$up_path <- nn_module_list()	
    	
    for (i in ((depth - 1):1)) {	
      self$up_path$append(up_block(prev_channels, 2^(n_filters+i-1)))	
      prev_channels <- 2^(n_filters+i-1)	
    }	
    	
    self$last = nn_conv2d(prev_channels, n_classes, kernel_size = 1)	
  },	
  	
  forward = function(x) {	
    blocks <- list()	
    for (i in 1:length(self$down_path)) {	
      x <- self$down_path[[i]](x)	
      if (i != length(self$down_path)) {	
        blocks <- c(blocks, x)	
        x <- nnf_max_pool2d(x, 2)	
      }	
    }	
    	
    for (i in 1:length(self$up_path)) {  	
      x <- self$up_path[[i]](x, blocks[[length(blocks)-i+1]]$to(device=device))	
    }	
    	
    torch_sigmoid(self$last(x))	
  }	
)	
	
# unet utilizes `down_block` and `up_block`	
# down_block delegates to its own workhorse, conv_block, and up_block “bridges” the UNET	
	
down_block <- nn_module(	
  classname="down_block",	
  initialize = function(in_size, out_size) {	
    self$conv_block <- conv_block(in_size, out_size)	
  },	
  	
  forward = function(x) {   self$conv_block(x)  }	
)	
	
up_block <- nn_module(	
  classname="up_block",	
  initialize = function(in_size, out_size) {	
    self$up = nn_conv_transpose2d(in_size, out_size, kernel_size=2, stride=2)	
    self$conv_block = conv_block(in_size, out_size)	
  },	
  	
  forward = function(x, bridge) {	
    up <- self$up(x)	
    torch_cat(list(up, bridge), 2) %>% self$conv_block()	
  }	
)	
	
conv_block <- nn_module( 	
  classname="conv_block",	
  initialize = function(in_size, out_size) {	
    self$conv_block <- nn_sequential(	
      nn_conv2d(in_size, out_size, kernel_size = 3, padding = 1),	
      nn_relu(),	
      nn_dropout(0.6),	
      nn_conv2d(out_size, out_size, kernel_size = 3, padding = 1),	
      nn_relu()	
    )	
  },	
  	
  forward = function(x){	
    self$conv_block(x)	
  }	
)	
#' 	
#' 	
#' Before we start the `unet` model training, we need to specify CPU/GPU device and optimization scheme.	
#' 	
#' 	
# Initialize the model with appropriate CPU/GPU	
device <- torch_device(if(cuda_is_available()) "cuda" else "cpu")	
model <- unet(depth = 5)$to(device = device)	
	
# OPTIMIZATION	
	
# DCNN model training using cross_entropy and dice_loss.	
calc_dice_loss <- function(y_pred, y_true) {	
  smooth <- 1	
  y_pred <- y_pred$view(-1)	
  y_true <- y_true$view(-1)	
  intersection <- (y_pred * y_true)$sum()	
  1 - ((2*intersection+smooth)/(y_pred$sum()+y_true$sum()+smooth))	
}	
	
dice_weight <- 0.3	
	
optimizer <- optim_sgd(model$parameters, lr = 0.1, momentum = 0.9)	
#' 	
#' 	
#' On a multi-core machine, training the `UNET` model will require ~2-hrs per epoch. This step can be skipped if you load in a previously pre-trained model (`model`) that is saved (`torch_save(model, "/path/model.pt")`) and available for import (`torch_load("/path/model.ptt")`), [see several *.pt model here](https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes).	
#' 	
#' 	
num_epochs <- 1  # for knitting, otherwise increase to 5+	
	
scheduler <- lr_one_cycle(optimizer, max_lr = 0.1, steps_per_epoch = length(train_dl), epochs = num_epochs)	
	
# TRAINING	
train_batch <- function(b) {	
  optimizer$zero_grad()	
  output <- model(b[[1]]$to(device = device))	
  target <- b[[2]]$to(device = device)	
  	
  bce_loss <- nnf_binary_cross_entropy(output, target)	
  dice_loss <- calc_dice_loss(output, target)	
  loss <-  dice_weight*dice_loss + (1-dice_weight)*bce_loss	
  	
  loss$backward()	
  optimizer$step()	
  scheduler$step()	
	
  list(bce_loss$item(), dice_loss$item(), loss$item())	
}	
	
valid_batch <- function(b) {	
  output <- model(b[[1]]$to(device = device))	
  target <- b[[2]]$to(device = device)	
	
  bce_loss <- nnf_binary_cross_entropy(output, target)	
  dice_loss <- calc_dice_loss(output, target)	
  loss <-  dice_weight * dice_loss + (1-dice_weight)*bce_loss	
  	
  list(bce_loss$item(), dice_loss$item(), loss$item())	
}	
#' 	
#' 	
#' Start the `Unet` training process (this is suspended here to enable real-time knitting of the HTML doc).	
#' 	
#' 	
for (epoch in 1:num_epochs) {	
  model$train()	
  train_bce <- c()	
  train_dice <- c()	
  train_loss <- c()	
  	
  coro::loop(for (b in train_dl) {	
    c(bce_loss, dice_loss, loss) %<-% train_batch(b)	
    train_bce <- c(train_bce, bce_loss)	
    train_dice <- c(train_dice, dice_loss)	
    train_loss <- c(train_loss, loss)	
  })	
  	
  torch_save(model, paste0("model_", epoch, ".pt"))	
  	
  cat(sprintf("\nEpoch %d, training: loss:%3f, bce: %3f, dice: %3f\n",	
              epoch, mean(train_loss), mean(train_bce), mean(train_dice)))	
  	
  model$eval()	
  valid_bce <- c()	
  valid_dice <- c()	
  valid_loss <- c()	
  	
  i <- 0	
  coro::loop(for (b in valid_dl) {	
    	
    i <<- i + 1	
    c(bce_loss, dice_loss, loss) %<-% valid_batch(b)	
    valid_bce <- c(valid_bce, bce_loss)	
    valid_dice <- c(valid_dice, dice_loss)	
    valid_loss <- c(valid_loss, loss)	
    	
  })	
  	
  cat(sprintf("\n Epoch %d, validation: loss: %3f, bce: %3f, dice: %3f\n",	
              epoch, mean(valid_loss), mean(valid_bce), mean(valid_dice)))	
}	
	
# note that DCNN model mask prediction will be based on the model complexity!	
	
# Epoch 1, training: loss:0.495716, bce: 0.307674, dice: 0.934480	
# Epoch 1, validation: loss: 0.336556, bce: 0.070743, dice: 0.956785	
# 	
# Epoch 2, training: loss:0.290850, bce: 0.114878, dice: 0.701452	
# Epoch 2, validation: loss: 0.213417, bce: 0.046690, dice: 0.602447	
# 	
# Epoch 3, training: loss:0.204054, bce: 0.102973, dice: 0.439909	
# Epoch 3, validation: loss: 0.218209, bce: 0.051383, dice: 0.607468	
# 	
# Epoch 4, training: loss:0.199831, bce: 0.101564, dice: 0.429123	
# Epoch 4, validation: loss: 0.243201, bce: 0.068757, dice: 0.650236	
# 	
# Epoch 5, training: loss:0.201025, bce: 0.101525, dice: 0.433192	
# Epoch 5, validation: loss: 0.258278, bce: 0.077821, dice: 0.679345	
# 	
# Epoch 6, training: loss:0.188058, bce: 0.094279, dice: 0.406877	
# Epoch 6, validation: loss: 0.206198, bce: 0.038769, dice: 0.596865	
# 	
# Epoch 7, training: loss:0.187285, bce: 0.094894, dice: 0.402866	
# Epoch 7, validation: loss: 0.216684, bce: 0.050536, dice: 0.604361	
# 	
# Epoch 8, training: loss:0.184097, bce: 0.093609, dice: 0.395237	
# Epoch 8, validation: loss: 0.235675, bce: 0.060247, dice: 0.645006	
# 	
# Epoch 9, training: loss:0.179406, bce: 0.090458, dice: 0.386951	
# Epoch 9, validation: loss: 0.238004, bce: 0.055283, dice: 0.664353	
# 	
# Epoch 10, training: loss:0.175083, bce: 0.089144, dice: 0.375608	
# Epoch 10, validation: loss: 0.258206, bce: 0.080033, dice: 0.673942	
# 	
# Epoch 11, training: loss:0.175821, bce: 0.089795, dice: 0.376549	
# Epoch 11, validation: loss: 0.213895, bce: 0.037936, dice: 0.624464	
# 	
# Epoch 12, training: loss:0.174915, bce: 0.087026, dice: 0.379987	
# Epoch 12, validation: loss: 0.264673, bce: 0.083884, dice: 0.686514	
# 	
# Epoch 13, training: loss:0.171973, bce: 0.087210, dice: 0.369754	
# Epoch 13, validation: loss: 0.236396, bce: 0.065432, dice: 0.635311	
# 	
# Epoch 14, training: loss:0.167253, bce: 0.084660, dice: 0.359970	
# Epoch 14, validation: loss: 0.210641, bce: 0.047418, dice: 0.591495	
# 	
# Epoch 15, training: loss:0.170835, bce: 0.083572, dice: 0.374449	
# Epoch 15, validation: loss: 0.246905, bce: 0.067056, dice: 0.666552	
# 	
# Epoch 16, training: loss:0.167938, bce: 0.083285, dice: 0.365460	
# Epoch 16, validation: loss: 0.228558, bce: 0.057135, dice: 0.628546	
# 	
# Epoch 17, training: loss:0.161383, bce: 0.080367, dice: 0.350421	
# Epoch 17, validation: loss: 0.221012, bce: 0.055709, dice: 0.606719	
# 	
# Epoch 18, training: loss:0.158193, bce: 0.078068, dice: 0.345150	
# Epoch 18, validation: loss: 0.222881, bce: 0.052606, dice: 0.620188	
# 	
# Epoch 19, training: loss:0.158190, bce: 0.078865, dice: 0.343280	
# Epoch 19, validation: loss: 0.223159, bce: 0.054643, dice: 0.616361	
# 	
# Epoch 20, training: loss:0.161496, bce: 0.079807, dice: 0.352102	
# Epoch 20, validation: loss: 0.221487, bce: 0.051793, dice: 0.617438	
	
# > proc.time()  # In Seconds! About 2,154,523 sec ~ 598 hours	
#      user    system   elapsed	
# 2154523.7  776791.3  141279.2	
#' 	
#' 	
#' Evaluate the trained DCNN model using `model_6.pt` and a batch of $n=10$ validation datasets (2D brain imaging scans).	
#' 	
#' 	
# EVALUATION	
# without random sampling, we'd mainly see lesion-free patches	
N <- 10 # number of cases to predict the masks for	
	
eval_ds <- brain_dataset(valid_dir, augmentation_params = NULL, random_sampling = TRUE)	
eval_dl <- dataloader(eval_ds, batch_size = N)	
	
batch <- eval_dl %>% dataloader_make_iter() %>% dataloader_next()	
	
# Load a previously save torch model	
# epoch = 1	
# torch_load(model, paste0("model_", epoch, ".pt"))	
	
# torch_save(model, "C:/Users/Dinov/Desktop/model_epoch_1.pt")	
	
##### Requirements for loading a pre-trained model .....	
# load torch packages above	
# The pre-computed model files are available on Canvas:	
# https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes	
localModelsFolder <- "C:/Users/dinov/Desktop/Ivo.dir/Research/UMichigan/Proposals/2014/UMich_MINDS_DataSciInitiative_2014/2016/MIDAS_HSXXX_DataScienceAnalyticsCourse_Porposal/MOOC_Implementation/appendix/"	
	
model <- torch_load(paste0(localModelsFolder, "model_6.pt"))	
# train_dir <- "/data/data"	
# valid_dir <- "/data/mri_valid"	
# brain_dataset ...	
# train_ds; valid_ds	
# eval_ds <- brain_dataset(valid_dir,augmentation_params=NULL,random_sampling=TRUE)	
# eval_dl <- dataloader(eval_ds, batch_size = 8)	
# batch <- eval_dl %>% dataloader_make_iter() %>% dataloader_next()	
# device <- torch_device(if(cuda_is_available()) "cuda" else "cpu")	
# calc_dice_loss	
# library(plotly)	
	
imgsTruePred <- map (1:N, function(i) {	
  # Get the 3 images	
  img <- batch[[1]][i, .., drop = FALSE]                            # Image to predict the tumor mask for	
  inferred_mask <- model(img$to(device = device))                   # Predicted MASK	
  true_mask <- batch[[2]][i, .., drop = FALSE]$to(device = device)  # True manual mask delineation	
  	
  # compute/report the BCE/Dice performance metrics	
  bce <- nnf_binary_cross_entropy(inferred_mask, true_mask)$to(device = "cpu") %>% as.numeric()	
  dc <- calc_dice_loss(inferred_mask, true_mask)$to(device = "cpu") %>% as.numeric()	
  cat(sprintf("\nSample %d, bce: %3f, dice: %3f\n", i, bce, dc))	
  	
  # extract the inferred predicted mask as a 2D image/array of probability values per voxel!	
  inferred_mask <- inferred_mask$to(device = "cpu") %>% as.array() %>% .[1, 1, , ]	
  # Binarize the probability tumor prediction to binary mask	
  inferred_mask <- ifelse(inferred_mask > 0.48, 1, 0)  # In a real run, use "inferred_mask > 0.5, 1, 0"	
  # hist(as.matrix(inferred_mask[1,1,,]))	
  	
  imgs <- img[1, 1, ,] %>% as.array() %>% as.array() %>% 	
    plot_ly(z=., type="heatmap", showscale = FALSE, name="Image")  %>% 	
      layout(showlegend=FALSE,  yaxis = list(scaleanchor = "x", scaleratio = 1))    # hovermode = "y unified"	
  masks <- true_mask$to(device = "cpu")[1, 1, ,] %>% as.array() %>% as.array() %>% 	
    plot_ly(z=., type="heatmap", showscale = FALSE, name="True Mask") %>% 	
      layout(showlegend=FALSE,  yaxis = list(scaleanchor = "x", scaleratio = 1))    # hovermode = "y unified",	
  predMasks <- inferred_mask %>% as.array() %>% 	
    plot_ly(z=., type="heatmap", showscale = FALSE, name="Unet-Derived Mask") %>% 	
        layout(showlegend=FALSE,  yaxis = list(scaleanchor = "x", scaleratio = 1))  # hovermode = "y unified",	
  rowSubPlots <- subplot(imgs, masks, predMasks, nrows = 1, shareY = TRUE) %>%	
            # , widths = c(0.3, 0.3, 0.3))	
    layout(hovermode = "y unified") #, yaxis = list(scaleanchor = "x", scaleratio = 1))	
})	
#' 	
#' 	
#' Finally, we can assess the `unet model` performance on the independent validation images, report the DCNN-derived tumor masks (as images) and some quantitative measures (e.g., Binary Cross-Entropy, Dice coefficient). The figure displays the results - brain image, true mask, and DCNN-estimated mask.	
#' 	
#' 	
# imgsTruePred[[2]] 	
# p1 <- imgsTruePred[c(1:2)] %>% subplot(nrows = 2)	
# p2 <- imgsTruePred[c(3:4)] %>% subplot(nrows = 2)	
# p3 <- imgsTruePred[c(5:6)] %>% subplot(nrows = 2)	
# p4 <- imgsTruePred[c(7:8)] %>% subplot(nrows = 2)	
# p5 <- imgsTruePred[c(9:10)] %>% subplot(nrows = 2)	
# subplot(p1, p2, p3, p4, p5, nrows = 5) %>%  layout(title="Model Validation using N=10 Cases")	
	
imgsTruePred %>% subplot(nrows = N) %>%  layout(title="Model Validation using N=10 Cases")	
#' 	
#' 	
#' #### Torch-based DNN-based Image Reconstruction and Synthetic Image Generation	
#' 	
#' ##### Data Reparametrization	
#' 	
#' This image-normalization (*standardization*) is necessary to avoid extreme differences in image intensities (e.g., sMRI, fMRI, PET, SPECT, dMDI/DTI, MRS, and other imaging modalities tend to have vastly different scales).	
#' 	
#' 	
reparametrize <- function(mu, logvar){	
    std = torch_exp(0.5*logvar) 	
    eps = torch_randn_like(std)	
    z = mu + std * eps	
    return (z)	
}	
#' 	
#' 	
#' ##### Define the Architecture of the UNet *Encoder* Branch	
#' 	
#' 	
h_dim <- c(32, 64, 128, 256, 512)	
	
net_encoder <- nn_module(	
        initialize = function(latent_size=1000){	
        	
        self$latent_size <- latent_size  # Z	
	
        # h_dim <- c(32, 64, 128, 256, 512)	
        in_channels = 3	
        	
        # Encoder	
        	
        self$encoder <- nn_sequential(	
                nn_conv2d(in_channels, out_channels=h_dim[1],	
                          kernel_size= 3, stride= 2, padding = 1),	
                nn_batch_norm2d(h_dim[1]),	
                nn_leaky_relu(),	
                	
                nn_conv2d(h_dim[1], out_channels=h_dim[2],	
                          kernel_size= 3, stride= 2, padding = 1),	
                nn_batch_norm2d(h_dim[2]),	
                nn_leaky_relu(),	
                	
                nn_conv2d(h_dim[2], out_channels=h_dim[3],	
                          kernel_size= 3, stride= 2, padding = 1),	
                nn_batch_norm2d(h_dim[3]),	
                nn_leaky_relu(),	
                	
                nn_conv2d(h_dim[3], out_channels=h_dim[4],	
                          kernel_size= 3, stride= 2, padding = 1),	
                nn_batch_norm2d(h_dim[4]),	
                nn_leaky_relu(),	
                	
                nn_conv2d(h_dim[4], out_channels=h_dim[5],	
                          kernel_size= 3, stride= 2, padding = 1),	
                nn_batch_norm2d(h_dim[5]),	
                nn_leaky_relu()	
                )	
        	
        self$mu_layer <- nn_linear(h_dim[5]*64, latent_size)	
        self$logvar_layer <- nn_linear(h_dim[5]*64, latent_size)	
        },	
	
        forward = function(x){	
            	
            list_output = c()	
            hidden = self$encoder(x)	
            hidden = torch_flatten(hidden, start_dim=2)	
            mu = self$mu_layer(hidden)	
            logvar = self$logvar_layer(hidden)	
            z = reparametrize(mu, logvar)	
            list_output = append(list_output, z)	
            list_output = append(list_output, mu)	
            list_output = append(list_output, logvar)	
            	
            return (list_output)	
        }	
)	
#' 	
#' 	
#' ##### Define the Architecture of the UNet *Decoder* Branch	
#' 	
#' 	
net_decoder <- nn_module(	
        initialize = function(latent_size=1000){	
        	
        self$latent_size <- latent_size  # Z	
	
        self$decoder_input <- nn_linear(latent_size, h_dim[5]*64)	
        	
        hidden_dims = c(32, 64, 128, 256, 512)	
        	
        # decoder	
        self$decoder <- nn_sequential(	
                nn_conv_transpose2d(hidden_dims[5],	
                                   hidden_dims[4],	
                                   kernel_size=3,	
                                   stride = 2,	
                                   padding=1,	
                                   output_padding=1),	
                nn_batch_norm2d(hidden_dims[4]),	
                nn_leaky_relu(),	
                	
                nn_conv_transpose2d(hidden_dims[4],	
                                   hidden_dims[3],	
                                   kernel_size=3,	
                                   stride = 2,	
                                   padding=1,	
                                   output_padding=1),	
                nn_batch_norm2d(hidden_dims[3]),	
                nn_leaky_relu(),	
                	
                nn_conv_transpose2d(hidden_dims[3],	
                                   hidden_dims[2],	
                                   kernel_size=3,	
                                   stride = 2,	
                                   padding=1,	
                                   output_padding=1),	
                nn_batch_norm2d(hidden_dims[2]),	
                nn_leaky_relu(),	
                	
                nn_conv_transpose2d(hidden_dims[2],	
                                   hidden_dims[1],	
                                   kernel_size=3,	
                                   stride = 2,	
                                   padding=1,	
                                   output_padding=1),	
                nn_batch_norm2d(hidden_dims[1]),	
                nn_leaky_relu()	
                )	
	
        self$final_layer <- nn_sequential(	
                            nn_conv_transpose2d(hidden_dims[1],	
                                               hidden_dims[1],	
                                               kernel_size=3,	
                                               stride=2,	
                                               padding=1,	
                                               output_padding=1),	
                            nn_batch_norm2d(hidden_dims[1]),	
                            nn_leaky_relu(),	
                            nn_conv2d(hidden_dims[1], out_channels= 3,	
                                      kernel_size= 3, padding= 1),	
                            nn_sigmoid())	
        },	
	
        forward = function(z){	
            	
            z = self$decoder_input(z)	
            z = z$view(c(-1, 512, 8, 8))	
            x_hat = self$final_layer(self$decoder(z))	
            	
            return (x_hat)	
        }	
)	
#' 	
#' 	
#' ##### Define the Loss Function & Optimizer	
#' 	
#' 	
loss_function <- function(x_hat, x, mu, logvar){	
    N = mu$shape[1]	
    Z = mu$shape[2]	
    loss = nnf_binary_cross_entropy(x_hat, x, reduction='sum') -0.5*torch_sum(logvar-mu^2-torch_exp(logvar))-0.5*Z*N	
    return (loss/N)	
}	
#' 	
#' 	
#' ##### Train DNN UNet (Encoder & Decoder Branches)	
#' 	
#' 	
model_encoder = net_encoder()	
model_decoder = net_decoder()	
# model = model$to(device='cuda')	
optimizer <- optim_adam(model_decoder$parameters)	
# optimizer <- optim_adam(model$parameters)	
# train_data <- train_data / 255 # torch tensor with shape (N, C, H, W)	
	
# Epochs	
capture.output(     # Capture/Suppress Pytorch output, which overwhelms the knitted HTML report	
  for (epoch in 1:5) {   # could increase the number of epochs for better results	
    l <- c()	
    coro::loop(for (b in train_dl) {	
      optimizer$zero_grad()	
      gnd = b[[1]]	
      list_output <- model_encoder(gnd)	
      z = list_output[[1]]	
      mu = list_output[[2]]	
      logvar = list_output[[3]]	
      output <- model_decoder(z)	
      	
      loss <- loss_function(output, gnd, mu, logvar)	
      print(loss)	
      loss$backward()	
      optimizer$step()	
      l <- c(l, loss$item())	
    })	
  	
    cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))	
  })	
	
# capture.output(     # Capture/Suppress Pytorch output, which overwhelms the knitted HTML report	
#   for (epoch in 1:5) {   # could increase the number of epochs for better results	
#     l <- c()	
#     coro::loop(for (b in train_dl) {	
#       # optimizer$zero_grad()	
#       encoderOptimizer$zero_grad()	
#       decoderOptimizer$zero_grad()	
#       gnd = b[[1]]	
#       list_output <- model_encoder(gnd)	
#       z = list_output[[1]]	
#       mu = list_output[[2]]	
#       logvar = list_output[[3]]	
#       output <- model_decoder(z)	
#       	
#       loss <- loss_function(output, gnd, mu, logvar)	
#       print(loss)	
#       loss$backward()	
#       # optimizer$step()	
#       encoderOptimizer$step()	
#       decoderOptimizer$step()	
#       l <- c(l, loss$item())	
#     })	
#   	
#     cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))	
#   })	
#' 	
#' 	
#' ##### Obtain the DNN-derived Image Reconstructions and Random Synthetic images	
#' 	
#' The synthetic images are derived by randomly sampling the latent space of the DNN (bottom of network) and using solely the *decoding branch* of the UNet.	
#' 	
#' 	
img <- (train_ds[1][1]$img)$view(c(1, 3, 256, 256))	
z = (model_encoder(img))[[1]]	
reconstructed <-  model_decoder(z)	
reconstructed <- (reconstructed$view(c(3, 256, 256)))$permute(c(2, 3, 1)) %>% as.array()	
	
# Generate a random tensor of numbers in the Latent space using a uniform distribution on the interval [0,1)	
# these will be used to seed the VAE decoder branch of the Unet and synthetically generate 2D brain images	
z <- torch_rand(c(1, 1000)) 	
generated_images <- model_decoder(z)	
generated_images <- (generated_images$view(c(3, 256, 256)))$permute(c(2, 3, 1)) %>% as.array()	
#' 	
#' 	
#' ##### Result Visualization	
#' 	
#' Finally, plot examples of the DNN-derived image reconstruction and synthetic image generation.	
#' 	
#' 	
library (plotly)	
rasterPlotly <- function (image, name="", hovermode = NULL) {	
  myPlot <- plot_ly(type="image", z=image, name=name, hoverlabel=name, text=name, 	
            hovertext=name, hoverinfo="name+x+y+z") %>% 	
    layout(title=name, hovermode = hovermode, xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
  return(myPlot)	
}	
	
p1 <- rasterPlotly(image=255*reconstructed, name = "DNN-reconstructed RGB Image")	
# rasterPlotly(image=255*generated_images, name = "Synthetic DNN-generated RGB Image")	
p2 <- rasterPlotly(image=20/(0.1+generated_images), name = "Synthetic DNN-generated RGB Image")	
# subplot(p1, p2) %>% layout(title="DNN (Unet) Image Reconstruction and Synthetic Generation")	
p1	
p2	
#' 	
#' 	
#' ### Tensorflow Image Pre-processing Pipeline	
#' 	
#' The image prepossessing steps above use `torch` syntax. [This blog](https://blog.rstudio.com/2020/09/29/torch/) provides context on similarities and differences between `torch/pytorch/libtorch` and `tensorflow/keras`.	
#' 	
#' Next, we will use `tensorflow` to again ingest independently the images employing `tf$image` functions, e.g., `decode_png()`. This [introduction to RStudio-Tensorflow](https://tensorflow.rstudio.com/guide/tfdatasets/introduction/) is helpful to understand the `tf` syntax.	
#' 	
#' Assuming the data is already loaded, see the previous section (**Torch Data Import**), we will first preload all the necessary `tensorflow` libraries.	
#' 	
#' 	
# clean environment	
rm(list = ls())	
gc()	
	
library(tensorflow)	
library(tfdatasets)	
library(rsample)  # for training() method	
library(reticulate)	
library(purrr)	
library(keras)	
library(unet)	
library(tibble)	
library(plotly)	
#' 	
#' 	
#' Remember that we already have the ZIP data (*training=data* and *validation=mri_valid*) downloaded and expanded in a local partition `/data/`. We will split the data into training:testing $80:20$ and read the imaging/masking data from the PNG files into tensorflow data objects `training_dataset` and `testing_dataset` (this is the separate *validation set*).	
#' 	
#' 	
train_dir <- "/data/data"	
valid_dir <- "/data/mri_valid"	
	
# Load PNG images	
data_train <- tibble(	
  img = grep(list.files(train_dir, full.names = TRUE, pattern = "\\.png$", recursive = TRUE),	
        pattern = 'mask', invert = TRUE, value = TRUE),	
  mask = grep(list.files(train_dir, full.names = TRUE, pattern = "\\.png$", recursive = TRUE),	
        pattern = 'mask', value = TRUE)	
)	
length(data_train$mask)  # [1] 2520	
data_train$img[[1]]  # [1] "/data/data/TCGA_CS_4941_19960909/TCGA_CS_4941_19960909_1.tif"	
data_train$img[[2520]]; data_train$mask[[2520]]	
	
# # Or Load the TIFF images	
# data_train <- tibble(	
#   img = grep(list.files(train_dir, full.names = TRUE, pattern = "*\\d\\.tif$", recursive = TRUE),	
#              pattern = 'mask', invert = TRUE, value = TRUE),	
#   mask = grep(list.files(train_dir, full.names = TRUE, pattern = "*mask\\.tif$", recursive = TRUE),	
#             pattern = 'mask', value = TRUE)	
# )	
# length(data_train$mask)  # [1] 3216	
# data_train$img[[1]]  # [1] "/data/data/TCGA_CS_4941_19960909/TCGA_CS_4941_19960909_1.tif"	
# data_train$img[[3216]]; data_train$mask[[3216]]	
	
data_train_valid <- initial_split(data_train, prop = 0.8)	
	
# data_train_valid$data$img[1111];  data_train_valid$data$mask[1111]	
# # "/data/data/TCGA_DU_7299_19910417/TCGA_DU_7299_19910417_17_mask.png"	
# tarfile <- 'C:/Users/Dinov/Desktop/BrainTumorImagingDataZipArchive.tgz'	
# tar(tarfile,'/data/',compression='gzip')	
# untar(tarfile,'/data/',compression='gzip')	
	
# Check the tensor shape	
# tf$image$decode_png(tf$io$read_file(data_train$img[[1]]))	
# Inspect the structure of  data_train_valid	
# str(data_train_valid)	
	
training_dataset <- training(data_train_valid) %>%  	
  tensor_slices_dataset() %>% 	
  dataset_map(~.x %>% list_modify(	
    # decode_jpeg yields a 3d tensor of shape (256, 256, 3)	
    # Check tensor shapes!	
    img = tf$image$decode_png(tf$io$read_file(.x$img), channels = 3),	
    # img = ifelse(as.character(tf$image$decode_png(tf$io$read_file(.x$img)))$shape=="(256, 256, 1)",	
    #                    fix3ChannelImg(tf$image$decode_png(tf$io$read_file(.x$img))),	
    #                    tf$image$decode_png(tf$io$read_file(.x$img))),	
    # Note that decode_gif yields a 4d tensor of shape (1, 256, 256, 3),	
    # so we remove the unneeded batch dimension and all but one 	
    # of the 3 (identical) channels: tf$image$decode_gif(tf$io$read_file(.x$mask))[1,,,][,,1,drop=FALSE]	
    mask = tf$image$decode_png(tf$io$read_file(.x$mask))	
    # wrong_tensor_shape = (as.character(.x$shape) =="(256, 256, 1)")	
    # if (as.character(img$shape) == "(256, 256, 1)") 	
    #   img = tf$image$grayscale_to_rgb(img)	
  )) 	
  # %>%	
  # tfdatasets::dataset_filter(	
  #      function(record) {	
  #         tf$equal(as.character(record$img$shape), "(256, 256, 3)")	
  #      }	
  #   )	
	
testing_dataset <- testing(data_train_valid) %>%  	
  tensor_slices_dataset() %>% 	
  dataset_map(~.x %>% list_modify(	
    # decode_jpeg yields a 3d tensor of shape (256, 256, 3)	
    # Check tensor shapes!	
    img = tf$image$decode_png(tf$io$read_file(.x$img), channels = 3),	
    # img = ifelse(as.character(tf$image$decode_png(tf$io$read_file(.x$img)))$shape=="(256, 256, 1)",	
    #                    fix3ChannelImg(tf$image$decode_png(tf$io$read_file(.x$img))),	
    #                    tf$image$decode_png(tf$io$read_file(.x$img))),	
    # Note that decode_gif yields a 4d tensor of shape (1, 256, 256, 3),	
    # so we remove the unneeded batch dimension and all but one 	
    # of the 3 (identical) channels: tf$image$decode_gif(tf$io$read_file(.x$mask))[1,,,][,,1,drop=FALSE]	
    mask = tf$image$decode_png(tf$io$read_file(.x$mask))	
  )) 	
	
# Check the size of the entire (training & testing) TF datasets	
tf$data$experimental$cardinality(training_dataset)$numpy()  # [1] 2572	
tf$data$experimental$cardinality(testing_dataset)$numpy()   # [1] 644	
	
example <- training_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next(); example	
	
# Check to confirm all brain-images (tensors) are of the same RGB 3-channel shape (256, 256, 3)	
capture.output(	
  tryCatch({	
    ind2 <- list(); i <- 1	
    iter <- make_iterator_one_shot(training_dataset)	
    until_out_of_range({	
        case <- iterator_get_next(iter)	
        ind2[[i]] <- ifelse (as.character(case[[1]]$shape)=="(256, 256, 1)", "Incorrect", "Correct")	
        if (ind2[[i]] =="Incorrect") print(".")	
        i <- i+1 	
        # print("."); # str(case)	
    })	
    print("Check if any brain-images (tensors) are NOT of the correct RGB 3-channel shape (256, 256, 3)")	
    table(unlist(ind2))	
  }, error = function(e) e, finally = print("Done!")))	
#' 	
#' 	
#' In practice, the *uint8* data type of the RGB values in PNG/TIFF files are human-interpretable, the U-net expects floating point tensor elements. Thus, we will convert the 3-channel input images to real numbers and scale them to values in the interval $[0,1)$.	
#' 	
#' For improving computational efficiency, sometimes it may be necessary to reduce the computational burden by down-sizing the images, e.g., to $128\times 128$ or even $32\times 32$. In principle, we want to protect the native aspect ratio in the images, keep the pixels isotropic, and avoid distortion. In this case we won't reduce the image size, but it may be useful sometimes.	
#' 	
#' The Tensorflow protocol also includes augmentation of the imaging data. Earlier, we used `torch` augmentation, now we use `tensorflow`. Of course, we will apply the same augmentation-transformations (scale, rotate, flip) to the tumor-mask as well as the brain images, see these three methods `resize()`, `flip()`, `rotate()` in the `create_dataset()` function above. In addition, we can augment the data by intensity-transformations that preserve the spatial image structure but alter the contrast, brightness, and image saturation, see this `random_bsh()` method. 	
#' 	
#' 	
# For data augmentation - contrast, brightness & saturation perturbations	
random_bsh <- function(img) {	
  img %>% 	
    tf$image$random_brightness(max_delta = 0.2) %>% 	
    tf$image$random_contrast(lower = 0.3, upper = 0.6) %>% 	
    tf$image$random_saturation(lower = 0.5, upper = 0.6) %>% 	
    # ensure image intensities are between 0 and 1	
    tf$clip_by_value(0, 1) 	
}	
# Test Brightness-Saturation-Contrast Hue intensity augmentation	
myIterator <- training_dataset %>% reticulate::as_iterator() 	
example <- myIterator$get_next(); example <- myIterator$get_next(); example <- myIterator$get_next() # example	
bshExample <- random_bsh(tf$image$convert_image_dtype(example$img, dtype = tf$float32))	
	
arr1 <- array(as.numeric(unlist(example$img)), dim=c(256, 256, 3))	
p1 <- plot_ly(type="heatmap", z=~arr1[,,1], name="Image")  	
arr2 <- array(as.numeric(unlist(example$mask)), dim=c(256, 256, 1))	
p2 <- plot_ly(type="heatmap", z=~arr2[,,1], name="Mask") 	
arr3 <- 255*array(as.numeric(unlist(bshExample)), dim=c(256, 256, 3))	
p3 <- plot_ly(type="heatmap", z=~arr3[,,1], name="BSC-Image")  	
subplot(p1, p2, p3, nrows=1) %>% layout(title="Brightness-Saturation-Contrast Hue Image Intensity Augmentation") %>% hide_colorbar()	
	
# Random Rotation	
# Rotation requires TF-Addon package	
# reticulate::py_install(c('tensorflow-addons'), pip = TRUE)	
# tfaddons::install_tfaddons() 	
library(tfaddons)	
# random_rotate = function(img, mask, rot_param=15) {	
#     rnd_rot <- runif(1, -rot_param, rot_param)  # random rotation in degrees +/- rot_param	
#     img  <- img_rotate(img, angle = rnd_rot)    # tfa$image$rotate(img, angle = rnd_rot)	
#     mask <- img_rotate(mask, angle = rnd_rot)   # tfa$image$rotate(mask, angle = rnd_rot)	
#     # c(img, mask)	
#     return (list(img=img, mask=mask))	
# }	
	
#### This is a work-around for the data-augmentation rotation problem 	
#### related to the zeallot right operator `%->%` and the `magrittr` pipe operator `%>%`	
rot_param <- 15 # degrees of rotation	
rnd_rot <- runif(1000, -rot_param, rot_param)	
randomRotationPaired <- c(rbind(rnd_rot, rnd_rot)) # double the random rotations	
currentRotationIndex <- 1	
	
random_rotate = function(img, mask) {	
    img  <- img_rotate(img, angle=randomRotationPaired[currentRotationIndex])    # tfa$image$rotate(img, angle = rnd_rot)	
    mask <- img_rotate(mask, angle=randomRotationPaired[currentRotationIndex])   # tfa$image$rotate(mask, angle = rnd_rot)	
    # c(img, mask)	
    currentRotationIndex <- (currentRotationIndex + 1) %% 1000	
    return (list(img=img, mask=mask))	
}	
	
# Test rotator	
myIterator <- training_dataset %>% reticulate::as_iterator() 	
example <- myIterator$get_next(); example <- myIterator$get_next(); example <- myIterator$get_next() # example	
arr1 <- array(as.numeric(unlist(example$img)), dim=c(256, 256, 3))	
p1 <- plot_ly(type="heatmap", z=~arr1[,,1], name="Raw Img")  	
arr2 <- array(as.numeric(unlist(example$mask)), dim=c(256, 1, 256))	
p2 <- plot_ly(type="heatmap", z=~arr2[,1,], name="Raw Mask") 	
subplot(p1, p2, nrows=1) %>% layout(title="Original Image + Mask") %>% hide_colorbar()	
	
rotExample <- random_rotate(example$img, example$mask)	
rotExample$img$shape; rotExample$mask$shape	
arr1 <- array(as.numeric(unlist(rotExample$img)), dim=c(256, 256, 3))	
p1 <- plot_ly(type="heatmap", z=~arr1[,, 1], name="Rotated Img")  	
arr2 <- array(as.numeric(unlist(rotExample$mask)), dim=c(256, 256, 1))	
p2 <- plot_ly(type="heatmap", z=~arr2[,, 1], name="Rotated Mask") 	
subplot(p1, p2, nrows=1) %>% layout(title="Augmented/Rotated Image + Mask") %>% hide_colorbar()	
	
currentRotationIndex <- 1 # reset the currentRotationIndex	
	
# c(img, mask) %<-% random_rotate(example$img, example$mask)	
# img$shape; mask$shape	
# arr1 <- array(as.numeric(unlist(img)), dim=c(256, 256, 3))	
# p1 <- plot_ly(type="heatmap", z=~arr1[,, 1])  	
# arr2 <- array(as.numeric(unlist(mask)), dim=c(256, 256, 1))	
# p2 <- plot_ly(type="heatmap", z=~arr2[,, 1]) 	
# subplot(p1, p2, nrows=1) %>% layout(title="Augmented/Rotated Image + Mask") %>% hide_colorbar()	
	
# Inspect one case	
# example <- training_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
# example$img %>% as.array() %>% as.raster() %>% plot()	
#' 	
#' 	
#' The core functions we built above can be integrated into a new function, `create_dataset()`, which represents the complete end-to-end image preprocessing pipeline, The *input* to this pipeline is a dataframe of filenames containing the brain-images and tumor-masks, and the corresponding output contains the `training_dataset` and the `validation_dataset` as tensorflow dataset-objects that will be used in the model-fitting phase.	
#' 	
#' 	
library(zeallot)   # for the reverse-piping assignment operator: %<-%	
	
create_dataset <- function(data, train, batch_size = 32L) {	
  dataset <- data %>%   # load all PNG files as images	
    tensor_slices_dataset() %>% 	
    dataset_map(~.x %>% list_modify(	
      img = tf$image$decode_png(tf$io$read_file(.x$img), channels = 3),	
      mask = tf$image$decode_png(tf$io$read_file(.x$mask))	
    )) %>%     # convert image intensities from int8 to 32-bit float	
    dataset_map(~.x %>% list_modify(	
      img = tf$image$convert_image_dtype(.x$img, dtype = tf$float32),	
      mask = tf$image$convert_image_dtype(.x$mask, dtype = tf$float32)	
    )) %>%     # reshape all tensor-shape to (256 * 256) to ensure spatial-index homologies	
    dataset_map(~.x %>% list_modify(	
      img = tf$image$resize(.x$img, size = shape(256, 256)),	
      mask = tf$image$resize(.x$mask, size = shape(256, 256))	
    ))	
  	
  # data augmentation performed on training set only	
  if (train) {	
    dataset <- dataset %>%	
      dataset_map(~.x %>% list_modify(	
        # random_rotate(img=.x$img, mask=.x$mask) %->% c(img=img, mask=mask) 	
        # c(img=img, mask=mask) %<-% random_rotate(img=.x$img, mask=.x$mask) # <- Check this issue, it should work like this:	
          # example <- myIterator$get_next(); example <- myIterator$get_next(); example <- myIterator$get_next()	
          # c(img, mask) %<-% random_rotate(example[[1]], example[[2]])	
          # random_rotate(img=example[[1]], mask=example[[2]]) %->% c(img, mask)	
          # img$shape; mask$shape	
          # 	
          # Alternative Solutions	
          # c(img, mask) %<-% random_rotate(.x)	
          # Syntax: ListOfLists <- c(a, b) %<-% list(a=list(1,2,3), b=list(4,5,6,7)); ListOfLists$b	
          # c(img, mask) %<-% random_rotate(.x$img, .x$mask)	
          #	
          # This works, but we need joint, not separate, rotation of img + mask	
          img = random_rotate(img=.x$img, mask=.x$mask)$img,	
          mask= random_rotate(img=.x$img, mask=.x$mask)$mask	
          # c(img, mask) %<-% unlist(random_rotate(img=.x$img, mask=.x$mask))	
          # # mask = random_rotate(.x$img, .x$mask)[[2]]	
      )) %>%	
      dataset_map(~.x %>% list_modify(	
        img = random_bsh(.x$img)	
      ))	
  }	
  	
  # shuffling on training set only	
  if (train) {	
    dataset <- dataset %>% 	
      dataset_shuffle(buffer_size = batch_size*4)	
  }	
  	
  # train in batches; batch size might need to be adapted depending on	
  # available memory	
  dataset <- dataset %>% 	
    dataset_batch(batch_size)	
  	
  dataset %>% 	
    # output needs to be unnamed as required by Keras	
    dataset_map(unname) # makes example$img -> example[[1]]	
}	
	
# Generate the Training and Testing data at iterated TF-Data objects	
training_dataset <- create_dataset(training(data_train_valid), train = TRUE)	
validation_dataset <- create_dataset(testing(data_train_valid), train = FALSE)	
# tf$data$experimental$cardinality(training_dataset)$numpy()	
myIterator <- training_dataset %>% reticulate::as_iterator() 	
example <- myIterator$get_next()  # ; example <- myIterator$get_next()  # shape=(32, 256, 256, 3 or 1)	
c(img,mask) %<-% random_rotate(img=example[[1]][1,,,1], mask=example[[2]][1,,,1])	
img$shape; mask$shape	
	
arr1 <- array(as.numeric(img), dim=c(256, 256))	
p1 <- plot_ly(type="heatmap", z=~arr1, name="Rotated Img")	
arr2 <- array(as.numeric(mask), dim=c(256, 256))	
p2 <- plot_ly(type="heatmap", z=~arr2, name="Rotated Mask")	
subplot(p1, p2, nrows=1) %>% layout(title="Augmented/Rotated Image + Mask") %>% hide_colorbar()	
#' 	
#' 	
#' The UNet DCNN model summary includes:	
#' 	
#'  - Column 1: Layer type and specifications	
#'  - Column 2: “output shape” contains the expected network U-shape parameters. Note that during the encoding phase, the image *Width* and *Height* sizes decrease initially (until the middle of the "U", reaching a minimum resolution of $8\times 8$). Then, they start increasing again during the decoding phase, until reaching the sizes of the original image resolution. Similarly, the number of filters increases during encoding and then decreases during the decoding phase terminating with an output layer having a single filter. Finally, the model architecture includes *concatenation* layers in the decoding phase that aggregate information from “below” with information that comes “laterally” from the parallel nodes in the encoding phase.	
#'  - Column 3: shows the number of parameters used in each layer of the DCNN.	
#'  - Column 4: for each layer (row), column 4 shows the parent (previous) layer in the network. 	
#' 	
#' In this *image-segmentation problem*, the loss function needs to account for the result of labeling ALL pixel intensities in the brain images. Hence, every pixel location contributes equally to the total loss measure. As Binary classification yields a $1$ (tumor) or $0$ (normal brain tissue), the `binary_crossentropy()` method is one appropriate choice of a loss function. During the iterative transfer learning process, we can track the classification accuracy, dice coefficient or other evaluation metrics that capture the proportion of correctly classified pixels.	
#' 	
#' Next we will define, compile and test the base (pre-training default) model, prior to model fitting. This is a naive prediction without model tuning or transfer learning.	
#' 	
#' 	
# Model Training - Starting with a pre-trained U-net model and then expanding it with Transfer-learning	
library(unet)	
model <- unet::unet(input_shape = c(256, 256, 3))  # for RGB use 3-channels: input_shape = c(256, 256, 3)	
summary(model)	
	
# define a custom loss function (dice coefficient)	
dice <- custom_metric("dice", function(y_true, y_pred, smooth = 1.0) {	
  y_true_f <- k_flatten(y_true)	
  y_pred_f <- k_flatten(y_pred)	
  intersection <- k_sum(y_true_f * y_pred_f)	
  (2 * intersection + smooth) / (k_sum(y_true_f) + k_sum(y_pred_f) + smooth)	
})	
	
# Compile the DCNN model, packaging an optimizer, loss and performance metrics	
model %>% compile(	
  optimizer = optimizer_rmsprop(learning_rate = 1e-5),	
  loss = "binary_crossentropy",	
  metrics = list(dice, metric_binary_accuracy)	
)	
	
# Naive Prediction using the default model (prior to Re-Training or Transfer Learning)	
batch <- validation_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
predictions <- predict(model, batch[[1]])	
images <- tibble(	
  image = batch[[1]] %>% array_branch(1),	
  predicted_mask = predictions[,,,1] %>% array_branch(1),	
  mask = batch[[2]][,,,1]  %>% array_branch(1)	
) 	
	
# Check performance of the default base model on one case (# 22)	
i=22	
z1 <- ifelse(as.matrix(as.data.frame(images$predicted_mask[i])) > 0.5, 1, 0)	
pl1 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$mask[i])[,1:256]), type="heatmap", name=paste0("Tumor Mask ", i))	
pl2 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$image[i])[,1:256]), type="heatmap", name=paste0("Brain Image ", i))	
pl3 <- plot_ly(z = ~ z1, type="heatmap", name=paste0("Pred Mask ", i))	
subplot(pl1, pl2, pl3, nrows=1) %>% hide_colorbar()	
	
# training_dataset <- create_dataset(training(data_train_valid), train = TRUE)	
# validation_dataset <- create_dataset(testing(data_train_valid), train = FALSE)	
# 	
# library(unet)	
# 	
# model <- unet::unet(input_shape = c(256, 256, 3))  # for RGB use 3-channels: input_shape = c(256, 256, 3)	
# # summary(model)	
# 	
# library(keras)	
# dice <- custom_metric("dice", function(y_true, y_pred, smooth = 1.0) {	
#   y_true_f <- k_flatten(y_true)	
#   y_pred_f <- k_flatten(y_pred)	
#   intersection <- k_sum(y_true_f * y_pred_f)	
#   (2 * intersection + smooth) / (k_sum(y_true_f) + k_sum(y_pred_f) + smooth)	
# })	
# 	
# model %>% compile(	
#   optimizer = optimizer_rmsprop(learning_rate = 1e-5),	
#   loss = "binary_crossentropy",	
#   metrics = list(dice, metric_binary_accuracy)	
# )	
#' 	
#' 	
#' The part below shows the UNet/DCNN model fitting for a single epoch. This block of code is not run (`eval=F`) as it takes hours to complete, but below we show the results from offline DCNN trainings with different settings.	
#' 	
#' 	
# MODEL FITTING	
epochs = 1	
history <- model %>% fit(training_dataset, epochs = epochs, validation_data = validation_dataset)	
	
# ASSESSMENT - Actual Model Evaluation (after Transfer-learning retraining)	
# Naive Prediction	
batch <- validation_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
predictions <- predict(model, batch[[1]])	
images <- tibble(	
  image = batch[[1]] %>% array_branch(1),	
  predicted_mask = predictions[,,,1] %>% array_branch(1),	
  mask = batch[[2]][,,,1]  %>% array_branch(1)	
) 	
i=22	
pl1 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$mask[i])[,1:256]), type="heatmap", name=paste0("Tumor Mask ", i))	
pl2 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$image[i])[,1:256]), type="heatmap", name=paste0("Brain Image ", i))	
pl3 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$predicted_mask[i])[,1:256]), type="heatmap", name=paste0("Pred Mask ", i))	
subplot(pl1, pl2, pl3, nrows=1)	
	
library(keras)	
dice <- custom_metric("dice", function(y_true, y_pred, smooth = 1.0) {	
  y_true_f <- k_flatten(y_true)	
  y_pred_f <- k_flatten(y_pred)	
  intersection <- k_sum(y_true_f * y_pred_f)	
  (2 * intersection + smooth) / (k_sum(y_true_f) + k_sum(y_pred_f) + smooth)	
})	
	
model %>% compile(	
  optimizer = optimizer_rmsprop(learning_rate = 1e-5),	
  loss = "binary_crossentropy",	
  metrics = list(dice, metric_binary_accuracy)	
)	
#' 	
#' 	
#' When training the DCNN model, keep in mind that this step is very computationally intensive (each epoch can take hours to complete). Running a small number of epochs may be feasible in an interactive RStudio session, but dozens of epochs are necessary to train the network to perform well (e.g., yield high Dice coefficient values).	
#' 	
#' 	
# MODEL FITTING	
history <- model %>% fit(training_dataset, epochs = 1, validation_data = validation_dataset, verbose = 2)	
	
# 81/81 [==============================] - 3313s 41s/step - 	
# loss: 0.3121 - dice: 0.0106 - binary_accuracy: 0.9604 - val_loss: 0.1762 - val_dice: 1.7162e-04 - val_binary_accuracy: 0.9900	
	
# EVALUATION	
pl_loss <- plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$loss,	
        type = "scatter", mode="markers+lines", name="Loss") %>% 	
  add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$val_loss,	
        type = "scatter", mode="markers+lines", name="Validation Loss") %>%	
  layout(title="DNN Training/Validation Performance", xaxis=list(title="epoch"),	
         yaxis=list(title="Metric Value"), legend = list(orientation='h'))	
	
pl_acc <- plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$binary_accuracy,	
        type = "scatter", mode="markers+lines", name="Accuracy") %>% 	
  add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$binary_accuracy,	
        type = "scatter", mode="markers+lines", name="Validation Binary Accuracy") %>%	
  layout(title="DNN Training/Validation Performance", xaxis=list(title="epoch"),	
         yaxis=list(title="Metric Value"), legend = list(orientation='h'))	
	
subplot(pl_loss, pl_acc, nrows=2, shareX = TRUE, titleX = TRUE)	
	
# model %>% evaluate(training_dataset %>% dataset_batch(25), verbose = 0)	
model %>% evaluate(validation_dataset, verbose = 0)	
#           loss            dice    binary_accuracy 	
#   0.1762175262    0.0001716204        0.9899647236 	
	
# Save/Load the pretrained model in HDF5 format	
save_model_hdf5(model, "C:/Users/Dinov/Desktop/model_TF_Brain_epoch_1.h5", 	
                overwrite = TRUE, include_optimizer = TRUE)	
# Mind that loading a model with custom layers/functions, e.g., dice() method, requires special import using a list	
# https://github.com/rstudio/keras/issues/1240	
#' 	
#' 	
#' Once we have pre-trained (`history <- model %>% fit()`) and saved (`save_model_hdf5()`) the model (`model`), we can load it back as `mod1` in the interactive session and use it for prediction. Several [pre-trained models (e.g., 50 and 100 epochs) are available on the Canvas site](https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes). Below, we are using one of the pre-trained Unet models (`model_TF_Brain_epoch_100.h5`, 250MB) to illustrate the prediction performance, i.e., automated brain-tumor segmentation using 2D neuroimaging data.	
#' 	
#' Notice the steady model improvement (accuracy, binary tumor-labeling, and dice-coefficient) with respect to the epoch index.	
#' 	
#' 	
# # This HDF5 model is available on Canvas: 	
# # https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes	
# # https://umich.instructure.com/files/22559456/download?download_frd=1	
# mod1 <- 	
#   load_model_hdf5("E:/Ivo.dir/Research/UMichigan/Proposals/2014/UMich_MINDS_DataSciInitiative_2014/2016/MIDAS_HSXXX_DataScienceAnalyticsCourse_Porposal/MOOC_Implementation/appendix/model_TF_Brain_epoch_50.h5", 	
#                         custom_objects = list("dice" = dice), compile = TRUE)	
# 	
# # Update the model's last layer: https://tensorflow.rstudio.com/guide/keras/faq/ 	
# # mod2 <- mod1 %>% 	
# #   # get_layer(index=33) %>%	
# #   pop_layer() %>%  # remove the output layer first, then plug in a new output layer	
# #   # layer_conv_2d(filters = 64, kernel_size = c(3, 3), padding = "same") 	
# #   layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = "relu")	
# input <- mod1$input                            # UNet input layer	
# base <- (mod1 %>% get_layer(index=33))$output  # Get the last output layer of "mod1"	
# target <- base %>%                             # Replace the output layer by a new conv2D layer outputting a 3-channel 2D brain image	
#   layer_conv_2d(filters = 3, kernel_size = c(1,1), activation = "relu")	
# transfer_model <- keras_model(input, target)   # Update mod1 for transfer learning 	
# summary(transfer_model)                        	
# # Mind the final layer change: 	
# # Layer (type)        Output Shape         Param #    Connected to  	
# # conv2d_23 (Conv2D)  (None, 256, 256, 3)  6          conv2d_18[0][0] 	
	
# # Define the Dice coefficient	
# dice <- custom_metric("dice", function(y_true, y_pred, smooth = 1.0) {	
#   y_true_f <- k_flatten(y_true)	
#   y_pred_f <- k_flatten(y_pred)	
#   intersection <- k_sum(y_true_f * y_pred_f)	
#   (2 * intersection + smooth) / (k_sum(y_true_f) + k_sum(y_pred_f) + smooth)	
# })	
# 	
# # Recompile the model with our  custom dice() loss function	
# mod1 %>% compile(	
#   optimizer = optimizer_rmsprop(learning_rate = 1e-5),	
#   loss = "binary_crossentropy",	
#   metrics = list(dice, metric_binary_accuracy)	
# )	
# # Check re-loaded mod1 model and evaluate performance metrics on validation dataset	
# # summary(mod1)	
# mod1 %>% evaluate(validation_dataset)	
# 	
# # Finally display some of the results -- ASSESSMENT -- Actual Model Evaluation (after Transfer-learning retraining)	
# batch <- validation_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
# predictions <- predict(mod1, batch[[1]])	
# images <- tibble(	
#   image = batch[[1]] %>% array_branch(1),	
#   predicted_mask = predictions[,,,1] %>% array_branch(1),	
#   mask = batch[[2]][,,,1]  %>% array_branch(1)	
# ) 	
# i=16	
# pl1 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$mask[i])[,1:256]), type="heatmap", name=paste0("Tumor Mask ", i)) %>% 	
#     layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
# pl2 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$image[i])[,1:256]), type="heatmap", name=paste0("Brain Image ", i)) %>% 	
#     layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
# pl3 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$predicted_mask[i])[,1:256]), type="heatmap", name=paste0("Pred Mask ", i)) %>% 	
#     layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
# subplot(pl1, pl2, pl3, nrows=1, shareY = TRUE) %>%	
#   layout(title=paste0("Case ", i))	
	
# batch <- validation_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
# predictions <- predict(mod1, batch)	
# str(predictions)  # num [1:32, 1:256, 1:256, 1] 0.466 0.463 0.48 0.449 0.494 ...	
# hist(predictions)	
#  	
# images <- tibble(	
#   image = batch[[1]] %>% array_branch(1),	
#   predicted_mask = predictions[,,,1] %>% array_branch(1),	
#   mask = batch[[2]][,,,1]  %>% array_branch(1)	
# )	
	
# Display the Transfer-Learning Model Performance across epochs	
# These metrics are precomputed on SOCR-Lighthouse server, as the transfer-learning fine-tuning training	
# takes days to go through 50 epochs and dozens of iterations per epoch 	
## https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes	
## 	
# epochs <- 48	
# # Import the metrics: Loss,	Dice,	Binary_Accuracy,	Validation_Loss,	Validation_Dice,	Validation_Binary_Accuracy	
# historyMetrics_epochs48 <- read.csv("https://umich.instructure.com/files/22559501/download?download_frd=1", stringsAsFactors = F)		
# hist_df <- data.frame(epoch=historyMetrics_epochs48$Epoch, loss=historyMetrics_epochs48$Loss, 	
#                       dice=historyMetrics_epochs48$Dice, binAccuracy=historyMetrics_epochs48$Binary_Accuracy,	
#                       valid_loss=historyMetrics_epochs48$Validation_Loss, valid_dice=historyMetrics_epochs48$Validation_Dice,	
#                       valid_binAccuracy=historyMetrics_epochs48$Validation_Binary_Accuracy)	
# 	
# p_loss <- plot_ly(hist_df, x = ~epoch)  %>%	
#   add_trace(y = ~loss, name = 'Training Loss', type="scatter", mode = 'lines+markers') %>%	
#   add_trace(y = ~valid_loss, name = 'Validation Loss', type="scatter", mode = 'lines+markers') 	
# p_acc <- plot_ly(hist_df, x = ~epoch)  %>%	
#   add_trace(y = ~dice, name = 'Training Dice', type="scatter", mode = 'lines+markers') %>%       # DICE	
#   add_trace(y = ~valid_dice, name = 'Validation Accuracy', type="scatter", mode = 'lines+markers') %>% 	
#   add_trace(y = ~binAccuracy, name = 'Training Binary-Accuracy', type="scatter", mode = 'lines+markers') %>%    # ACCURACY	
#   add_trace(y = ~valid_binAccuracy, name = 'Validation Binary-Accuracy', type="scatter", mode = 'lines+markers')	
# subplot(p_loss, p_acc, nrows=2) %>%	
#   layout(legend = list(orientation = 'h'), title="Tensorflow Transfer-Learning Performance (Brain Imaging), Epochs=48")	
	
# # MODEL FITTING on Lighthouse SOCR Compute Server	
# epochs = 100	
# history <- model %>% fit(training_dataset, epochs = epochs, validation_data = validation_dataset)	
# 	
# # ASSESSMENT - Actual Model Evaluation (after Transfer-learning retraining)	
# # Naive Prediction	
# batch <- validation_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
# predictions <- predict(model, batch[[1]])	
# images <- tibble(	
#   image = batch[[1]] %>% array_branch(1),	
#   predicted_mask = predictions[,,,1] %>% array_branch(1),	
#   mask = batch[[2]][,,,1]  %>% array_branch(1)	
# ) 	
# # i=22	
# # pl1 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$mask[i])[,1:256]), type="heatmap", name=paste0("Tumor Mask ", i))	
# # pl2 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$image[i])[,1:256]), type="heatmap", name=paste0("Brain Image ", i))	
# # pl3 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$predicted_mask[i])[,1:256]), type="heatmap", name=paste0("Pred Mask ", i))	
# # subplot(pl1, pl2, pl3, nrows=1)	
# 	
# 	
# # MODEL FITTING	
# # history <- model %>% fit(training_dataset, epochs = 1, validation_data = validation_dataset, verbose = 2)	
# 	
# # 81/81 [==============================] - 3313s 41s/step - 	
# # loss: 0.3121 - dice: 0.0106 - binary_accuracy: 0.9604 - val_loss: 0.1762 - val_dice: 1.7162e-04 - val_binary_accuracy: 0.9900	
# 	
# # EVALUATION	
# # pl_loss <- plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$loss,	
# #                    type = "scatter", mode="markers+lines", name="Loss") %>% 	
# #   add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$val_loss,	
# #             type = "scatter", mode="markers+lines", name="Validation Loss") %>%	
# #   layout(title="DNN Training/Validation Performance", xaxis=list(title="epoch"),	
# #          yaxis=list(title="Metric Value"), legend = list(orientation='h'))	
# # 	
# # pl_acc <- plot_ly(x = ~c(1:history$params$epochs),  y = ~history$metrics$binary_accuracy,	
# #                   type = "scatter", mode="markers+lines", name="Accuracy") %>% 	
# #   add_trace(x = ~c(1:history$params$epochs),  y = ~history$metrics$binary_accuracy,	
# #             type = "scatter", mode="markers+lines", name="Validation Binary Accuracy") %>%	
# #   layout(title="DNN Training/Validation Performance", xaxis=list(title="epoch"),	
# #          yaxis=list(title="Metric Value"), legend = list(orientation='h'))	
# # 	
# # subplot(pl_loss, pl_acc, nrows=2, shareX = TRUE, titleX = TRUE)	
# 	
# # model %>% evaluate(training_dataset %>% dataset_batch(25), verbose = 0)	
# model %>% evaluate(validation_dataset, verbose = 0)	
# #           loss            dice    binary_accuracy 	
# #   0.1762175262    0.0001716204        0.9899647236 	
# 	
# # Save/Load the pretrained model in HDF5 format	
# save_model_hdf5(model, "/home/dinov/DSPA/test_code/Chap22_DCNN_UNet_modeling/model_TF_Brain_epoch_100.h5", 	
#                 overwrite = TRUE, include_optimizer = TRUE)	
# # Save History object tracking model performance across epochs, 	
# # see https://cran.r-project.org/web/packages/keras/vignettes/training_visualization.html	
# history_df <- as.data.frame(history)	
# # str(history_df)	
# # 'data.frame':   120 obs. of  4 variables:	
# #   $ epoch : int  1 2 3 4 5 6 7 8 9 10 ...	
# # $ value : num  0.87 0.941 0.954 0.962 0.965 ...	
# # $ metric: Factor w/ 2 levels "acc","loss": 1 1 1 1 1 1 1 1 1 1 ...	
# # $ data  : Factor w/ 2 levels "training","validation": 1 1 1 1 1 1 1 1 1 1 ...	
# save(history_df, file="/home/dinov/DSPA/test_code/Chap22_DCNN_UNet_modeling/historyModel_TF_Brain_Epochs50.Rda")	
# # Then load it with:	
# #	
	
# REPORT Training and Validation Performance: load the performance metrics (history_df) over 100 epochs	
# https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes | historyModel_TF_Brain_Epochs100.Rda	
load(url("https://umich.instructure.com/files/22647027/download?download_frd=1"))	
lineWidth <- ifelse(history_df$data=='training', 1, 4)	
lineNames <- paste0(history_df$data, " ", history_df$metric)	
plot_ly(history_df, x = ~epoch, y=~value, color=~metric, type="scatter", mode="lines+markers", name=~lineNames,	
        line = list(color = ~metric, width = lineWidth))  %>%	
  layout(legend = list(orientation = 'h'), title="Tensorflow Transfer-Learning Performance (Brain Imaging), Epochs=100")	
	
# PREDICTION	
# Mind that loading a model with custom layers/functions, e.g., dice() method, requires special import using a list	
# https://github.com/rstudio/keras/issues/1240	
# Once we have trained (`history <- model %>% fit(()`) and saved (`save_model_hdf5()`) the model (`model`), we can load it back as `mod1`  the interactive session and use it for prediction.	
# Define the Dice coefficient	
dice <- custom_metric("dice", function(y_true, y_pred, smooth = 1.0) {	
  y_true_f <- k_flatten(y_true)	
  y_pred_f <- k_flatten(y_pred)	
  intersection <- k_sum(y_true_f * y_pred_f)	
  (2 * intersection + smooth) / (k_sum(y_true_f) + k_sum(y_pred_f) + smooth)	
})	
	
# The pre-computed model_TF_Brain_epoch_100.h5 model file is available in Canvas:	
# https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes	
localModelsFolder <- "C:/Users/dinov/Desktop/Ivo.dir/Research/UMichigan/Proposals/2014/UMich_MINDS_DataSciInitiative_2014/2016/MIDAS_HSXXX_DataScienceAnalyticsCourse_Porposal/MOOC_Implementation/appendix/"	
mod1 <- load_model_hdf5(paste0(localModelsFolder, "model_TF_Brain_epoch_100.h5"), compile = FALSE)	
                        #  custom_objects = list("dice" = dice), compile = TRUE)	
	
# # Recompile the model with our  custom dice() loss function	
mod1 %>% compile(	
  optimizer = optimizer_rmsprop(learning_rate = 1e-5),	
  loss = "binary_crossentropy",	
  metrics = list(dice, metric_binary_accuracy)	
)	
# Check re-loaded mod1 model and evaluate performance metrics on validation dataset	
# summary(mod1)	
mod1 %>% evaluate(validation_dataset)	
# 16/16 [==============================] - 180s 11s/step - loss: 0.0213 - dice: 0.6842 - binary_accuracy: 0.9932	
#           loss            dice binary_accuracy 	
#     0.02131428      0.68422282      0.99317247 	
	
# Finally display some of the results -- ASSESSMENT -- Actual Model Evaluation (after Transfer-learning retraining)	
batch <- validation_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
predictions <- predict(mod1, batch[[1]])	
images <- tibble(	
  image = batch[[1]] %>% array_branch(1),	
  predicted_mask = predictions[,,,1] %>% array_branch(1),	
  mask = batch[[2]][,,,1]  %>% array_branch(1)	
) 	
# i=16	
# pl1 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$mask[i])[,1:256]), type="heatmap", name=paste0("Tumor Mask ", i)) %>% 	
#   layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
# pl2 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$image[i])[,1:256]), type="heatmap", name=paste0("Brain Image ", i)) %>%	
#   layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
# pl3 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$predicted_mask[i])[,1:256]), type="heatmap", name=paste0("Pred Mask ", i)) %>%	
#   layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
# subplot(pl1, pl2, pl3, nrows=1, shareY = TRUE) %>%	
#   layout(title=paste0("DCNN HDF5 Model (epochs=100) Validation on Case ", i))	
	
pl_list <- list()	
n = 10	
for (i in 1:n) {    # to limit to only 1-channel, restrict the column range to 1:256: images$mask[i])[,1:256]	
  pl1 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$mask[i])), type="heatmap", name=paste0("Tumor Mask ", i)) %>% 	
  layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
pl2 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$image[i])), type="heatmap", name=paste0("Brain Image ", i)) %>%	
  layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
pl3 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$predicted_mask[i])), type="heatmap", name=paste0("Pred Mask ", i)) %>%	
  layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
pl4 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$mask[i+n])), type="heatmap", name=paste0("Tumor Mask ", i+n)) %>% 	
  layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
pl5 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$image[i+n])), type="heatmap", name=paste0("Brain Image ", i+n)) %>%	
  layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
pl6 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$predicted_mask[i+n])), type="heatmap", name=paste0("Pred Mask ", i+n)) %>%	
  layout(hovermode = "y unified", xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
pl_list[[i]] <- subplot(pl1, pl2, pl3, pl4, pl5, pl6, nrows=1, shareY = TRUE) %>%	
  layout(title=paste0("DCNN HDF5 Model (epochs=100) Validation on Case ", i)) %>% hide_colorbar()	
}	
pl_list %>%	
  subplot(nrows = length(pl_list)) %>%	
     layout(title=paste0("DCNN HDF5 Model (epochs=100) Predictions (N=", 2*n, " Cases)"))	
	
	
# batch <- validation_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
# predictions <- predict(mod1, batch)	
# str(predictions)  # num [1:32, 1:256, 1:256, 1] 0.466 0.463 0.48 0.449 0.494 ...	
# hist(predictions)	
#  	
# images <- tibble(	
#   image = batch[[1]] %>% array_branch(1),	
#   predicted_mask = predictions[,,,1] %>% array_branch(1),	
#   mask = batch[[2]][,,,1]  %>% array_branch(1)	
# )	
	
# Save Predictions on validation data	
# batch <- validation_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
# predictions <- predict(mod1, batch[[1]])	
PredictedMasks <- list(predictions=predictions, images=images)	
save(PredictedMasks, file="C:/Users/Dinov/Desktop/predictionsModel_TF_Brain_Epochs100.Rda")	
# Then load it with:	
# load("C:/Users/Dinov/Desktop/predictionsModel_TF_Brain_Epochs100.Rda")	
# https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes	
# load(url("https://umich.instructure.com/files/22648924/download?download_frd=1"))	
#' 	
#' 	
#' Finally, we can explicate some of the *Transfer Learning* process by modifying the pretrained model `mod2` and obtaining a new model, `modTransferLearning`, which demonstrates synthetic image generation. This new modified `modTransferLearning` DCNN will output 3-channel brain-images, not tumor masks (predicted by `mod2`). Note that 31M parameters part of the original pretrained `mod2` are fixed and we are only estimating the remaining $16K$ parameters in this transfer-learning tuning process.	
#' 	
#' 	
# This HDF5 model (model_TF_Brain_epoch_100.h5) is available on Canvas: 	
# https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes	
# https://umich.instructure.com/files/22647026/download?download_frd=1	
mod2 <- load_model_hdf5(paste0(localModelsFolder, "model_TF_Brain_epoch_100.h5"), compile=FALSE)	
          # custom_objects = list("dice" = dice), compile = TRUE)	
	
#  mod2 <- load_model_hdf5("E:/Ivo.dir/Research/UMichigan/Proposals/2014/UMich_MINDS_DataSciInitiative_2014/2016/MIDAS_HSXXX_DataScienceAnalyticsCourse_Porposal/MOOC_Implementation/appendix/model_TF_Brain_epoch_100.h5", custom_objects = list("dice" = dice), compile = TRUE)	
	
# Freeze the first 32 layers and inspect the number of trainable and non-trainable (frozen DCNN parameters)	
for (layer in mod2$layers) layer$trainable <- FALSE	
	
# summary(mod2); length(mod2$layers)	
# Update the model's last layer: https://tensorflow.rstudio.com/guide/keras/faq/ 	
# More about Keras layers: https://keras.rstudio.com/reference/index.html#section-core-layers 	
# mod2 <- mod1 %>% 	
#   # get_layer(index=33) %>%	
#   pop_layer() %>%  # remove the output layer first, then plug in a new output layer	
#   # layer_conv_2d(filters = 64, kernel_size = c(3, 3), padding = "same") 	
#   layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = "relu")	
input <- mod2$input                            # Pretrained UNet input layer	
base <- (mod2 %>% get_layer(index=32))$output  # Get the prior-to-last output layer of "mod2", total #layers =33	
target <- base %>%                             # Replace the output layer by a new conv2D layer outputting a 3-channel 2D brain image, not a mask	
  layer_conv_2d(filters = 64, kernel_size = c(1,1), activation = "relu") %>% 	
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 	
  layer_conv_2d(filters = 64, kernel_size = c(1,1), activation = "relu") %>% 	
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 	
  layer_conv_2d_transpose(filters = 32, kernel_size = c(2,2), strides = 2, padding = "same", activation = "relu") %>%  # deconvolution layers	
  layer_conv_2d_transpose(filters = 3, kernel_size = c(2,2), strides = 2, padding = "same", activation = "relu")	
modTransferLearning <- keras_model(input, target)   # Update mod2 for transfer learning 	
summary(modTransferLearning); length(modTransferLearning$layers) 	
	
# 31M parameters are fixed and we are only estimating 16K parameters in the Transfer-Learning tuning.	
	
# Inspect the number of trainable and non-trainable (frozen DCNN parameters)	
# ....	
# ...	
# conv2d_17 (Conv2D)    (None, 256, 256, 64)              36928              conv2d_16[0][0]   # Last Layers of base model mod2   	
# ___________________________________________________________________________________________	
# conv2d_44 (Conv2D)    (None, 256, 256, 64)              4160               conv2d_17[0][0]   # First new TF layer                                 	
# ___________________________________________________________________________________________	
# max_pooling2d_25 (MaxPooling2D)   (None, 128, 128, 64)    0                 conv2d_44[0][0]                                    	
# ___________________________________________________________________________________________	
# conv2d_43 (Conv2D)    (None, 128, 128, 64)              4160         max_pooling2d_25[0][0]                             	
# ___________________________________________________________________________________________	
# max_pooling2d_24 (MaxPooling2D)  (None, 64, 64, 64)       0                  conv2d_43[0][0]                                    	
# ___________________________________________________________________________________________	
# conv2d_transpose_17 (Conv2DTranspose)  (None, 128, 128, 32) 8224      max_pooling2d_24[0][0]                             	
# ____________________________________________________________________________________________	
# conv2d_transpose_16 (Conv2DTranspose)  (None, 256, 256, 3)  387    v2d_transpose_17[0][0]   # New Last Output Layer of TL model  modTransferLearning                      	
# ============================================================================================	
# Total params: 31,048,611	
# Trainable params: 16,931	
# Non-trainable params: 31,031,680	
# ________________________________	
# [1] 38 # Layers of new modTransferLearning model	
	
	
## MODEL RE-TRAINING (for Transfer Learning)	
# Compile the DCNN model, packaging an optimizer, loss and performance metrics	
modTransferLearning %>% compile(	
  optimizer = optimizer_rmsprop(learning_rate = 1e-5),	
  # loss = 'loss_kullback_leibler_divergence', 	
  loss = 'mse',   # as we are looking at minimizing ||OrigImage - SynthImage||, i.e., the RMSE	
  metrics = list(metric_mean_absolute_error, metric_poisson)	
)	
	
currentRotationIndex <- 1 # reset	
	
### PREP New Data (replace masks by the native brain images)	
create_TL_dataset <- function(data, train, batch_size = 32L) {	
  dataset <- data %>%   # load all PNG files as images	
    tensor_slices_dataset() %>% 	
    dataset_map(~.x %>% list_modify(	
      img = tf$image$decode_png(tf$io$read_file(.x$img), channels = 3),  # <------ IMG	
      mask= tf$image$decode_png(tf$io$read_file(.x$img), channels = 3)   # <------ MASK == IMG	
    )) %>%     # convert image intensities from int8 to 32-bit float	
    dataset_map(~.x %>% list_modify(	
      img = tf$image$convert_image_dtype(.x$img, dtype = tf$float32),	
      mask= tf$image$convert_image_dtype(.x$mask, dtype = tf$float32)	
    )) %>%     # reshape all tensor-shape to (256 * 256) to ensure spatial-index homologies	
    dataset_map(~.x %>% list_modify(	
      img = tf$image$resize(.x$img, size = shape(256, 256)),	
      mask= tf$image$resize(.x$mask, size = shape(256, 256))	
    ))	
  	
  # data augmentation performed on training set only	
  if (train) {	
    dataset <- dataset %>% 	
      dataset_map(~.x %>% list_modify(	
        img  = random_bsh(.x$img) # ,	
        # mask = img	
      )) %>%	
      dataset_map(~.x %>% list_modify(  # see discussion in create_dataset()	
        img = random_rotate(img=.x$img, mask=.x$mask)$img,	
        mask= random_rotate(img=.x$img, mask=.x$mask)$mask	
      ))	
  }	
  	
  # shuffling on training set only	
  if (train) {	
    dataset <- dataset %>% 	
      dataset_shuffle(buffer_size = batch_size*4)	
  }	
  	
  # train in batches; batch size might need to be adapted depending on	
  # available memory	
  dataset <- dataset %>% 	
    dataset_batch(batch_size)	
  	
  dataset %>% 	
    # output needs to be unnamed	
    dataset_map(unname) 	
}	
	
# Generate the Transfer Learning (TL) Training and Testing data at iterated TF-Data objects	
training_TL_dataset <- create_TL_dataset(training(data_train_valid), train = TRUE)	
validation_TL_dataset <- create_TL_dataset(testing(data_train_valid), train = FALSE)	
# tf$data$experimental$cardinality(training_TL_dataset)$numpy()	
	
# testing_TL_dataset <- testing(data_train_valid) %>%  	
#   tensor_slices_dataset() %>% 	
#   dataset_map(~.x %>% list_modify(	
#     # decode_jpeg yields a 3d tensor of shape (256, 256, 3)	
#     # Check tensor shapes!	
#     img = tf$image$decode_png(tf$io$read_file(.x$img), channels = 3),	
#     mask =tf$image$decode_png(tf$io$read_file(.x$img), channels = 3)	
#   )) 	
	
example_TL <- training_TL_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
# example_TL	
	
rasterPlotly <- function (image, name="", hovermode = NULL) {	
  myPlot <- plot_ly(type="image", z=image, name=name, hoverlabel=name, text=name, 	
            hovertext=name, hoverinfo="name+x+y") %>% 	
    layout(hovermode = hovermode, xaxis = list(hoverformat = '.1f'), yaxis = list(hoverformat = '.1f'))	
  return(myPlot)	
}	
	
# Training Case	
img_and_mask <- training_TL_dataset %>% 	
      reticulate::as_iterator() %>% reticulate::iter_next()  # shape=(batch=32, 256, 256, channel=3)	
img <- img_and_mask[[1]][1,,,1]  %>% as.array()  # 3-channel Input image	
target<- img_and_mask[[2]][1,,,1]  %>% as.array()  # 3-channel Output image	
p1 <- plot_ly(z=~255*img, type="heatmap", name = "Input Image") %>% hide_colorbar()	
p2 <- plot_ly(z=~255*target, type="heatmap", name = "Output Image") %>% hide_colorbar()	
subplot(p1,p2, shareY = TRUE) %>% layout(title="Training Case: 3-Channel Input Image (Left) & Output Image (Right)")	
	
	
# MODEL FITTING - uncomment this fragment to run 6-epochs (2+ hrs)	
# modTransferLearning_History_6Epochs <- modTransferLearning %>% fit(training_TL_dataset, epochs = 6, validation_data = validation_TL_dataset, verbose = 2)	
# 63/63 - 1009s - loss: 1.2179 - mean_absolute_error: 0.4255 - poisson: 1.1990 - kullback_leibler_divergence: 2.0313 - val_loss: 0.4258 - val_mean_absolute_error: 0.3001 - val_poisson: 1.0742 - val_kullback_leibler_divergence: 2.0180	
# Save model HDF5	
# save_model_hdf5(modTransferLearning, "C:/Users/Dinov/Desktop/model_TF_TL_SynthBrainImages_epoch_6.h5", 	
#                 overwrite = TRUE, include_optimizer = TRUE)	
# Model is on Canvas: https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes	
# mod3_TL <- 	
#  load_model_hdf5("E:/Ivo.dir/Research/UMichigan/Proposals/2014/UMich_MINDS_DataSciInitiative_2014/2016/MIDAS_HSXXX_DataScienceAnalyticsCourse_Porposal/MOOC_Implementation/appendix/model_TF_TL_SynthBrainImages_epoch_6.h5", compile = FALSE)	
#' 	
#' 	
#' Once we have trained (`history <- model %>% fit(()`) and saved (`save_model_hdf5()`) the model (`model`), we can load it back as `mod1`  the interactive session and use it for prediction.	
#' 	
#' 	
# mod3_TL <- modTransferLearning	
# mod3_TL <- load_model_hdf5("E:/Ivo.dir/Research/UMichigan/Proposals/2014/UMich_MINDS_DataSciInitiative_2014/2016/MIDAS_HSXXX_DataScienceAnalyticsCourse_Porposal/MOOC_Implementation/appendix/model_TF_TL_SynthBrainImages_epoch_6.h5", compile = F)	
# Load the h5 model from Canvas: https://umich.instructure.com/courses/38100/files/folder/Case_Studies/36_TCGA_MRI_Segmentation_Data_Phenotypes	
mod3_TL <- load_model_hdf5(paste0(localModelsFolder, "/model_TF_TL_SynthBrainImages_epoch_6.h5"), compile = FALSE)	
	
# Recompile the model with our  custom dice() loss function	
mod3_TL %>% compile(	
  optimizer = optimizer_rmsprop(learning_rate = 1e-5),	
  # loss = 'loss_kullback_leibler_divergence', 	
  loss = 'mse',   # as we are looking at minimizing ||OrigImage - SynthImage||, i.e., the RMSE	
  metrics = list(metric_mean_absolute_error, metric_poisson)	
)	
	
# Load the performance metrics of pre-computed 6-epoch model(modTransferLearning_History_6Epochs_DF.csv)	
modTransferLearning_History_6Epochs_DF <- read.csv("https://umich.instructure.com/files/22825191/download?download_frd=1", header = T)	
# modTransferLearning_History_6Epoc_DF <- as.data.frame(modTransferLearning_History_6Epochs$metrics)	
# 	
# modTransferLearning_History_6Epochs_DF <- data.frame(	
#   epoch = c(1:dim(modTransferLearning_History_6Epoc_DF)[1]),	
#   loss = modTransferLearning_History_6Epoc_DF$loss,	
#   mean_absolute_error=modTransferLearning_History_6Epoc_DF$mean_absolute_error,	
#   poisson=modTransferLearning_History_6Epoc_DF$poisson,	
#   val_loss = modTransferLearning_History_6Epoc_DF$val_loss,	
#   val_mean_absolute_error = modTransferLearning_History_6Epoc_DF$val_mean_absolute_error,	
#   val_poisson = modTransferLearning_History_6Epoc_DF$val_poisson)	
	
head(modTransferLearning_History_6Epochs_DF)	
# write.csv(modTransferLearning_History_6Epochs_DF, "C:/Users/Dinov/Desktop/modTransferLearning_History_6Epochs_DF.csv")	
# modTransferLearning_History_6Epochs_DF <- read.csv("https://umich.instructure.com/files/22825191/download?download_frd=1", header = T)	
# modTransferLearning_History_4Epochs_DF <- read.csv("https://umich.instructure.com/files/22704304/download?download_frd=1", header = T)	
	
# EVALUATION	
plot_ly(data=modTransferLearning_History_6Epochs_DF, x = ~epoch,  	
                   y = ~loss, type = "scatter", mode="markers+lines", name="Loss") %>% 	
  add_trace(x = ~epoch, y = ~mean_absolute_error, type = "scatter", mode="markers+lines", name="Mean Abs Error") %>%	
  add_trace(x = ~epoch, y = ~poisson, type = "scatter", mode="markers+lines", name="Poisson") %>% 	
  add_trace(x = ~epoch, y = ~val_loss, type = "scatter", mode="markers+lines", name="Valid Loss") %>%	
  add_trace(x = ~epoch, y = ~val_mean_absolute_error, type = "scatter", mode="markers+lines", name="Valid MAE") %>%	
  add_trace(x = ~epoch, y = ~val_poisson, type = "scatter", mode="markers+lines", name="Valid Poisson") %>% 	
  layout(title="Transfer-Learning Validation Synth Image Generation Performance", xaxis=list(title="epoch"),	
         yaxis=list(title="Metric Value"), legend = list(orientation='h'))	
	
# mod3_TL %>% evaluate(training_TL_dataset %>% dataset_batch(25), verbose = 0)	
mod3_TL %>% evaluate(validation_TL_dataset, verbose = 0)	
#  Epoch4 loss         mean_absolute_error   poisson      kullback_leibler_divergence	
#         0.1845085    0.2612749             0.8180779    1.2870733  	
# Epoch6       loss       mean_absolute_error             poisson 	
#              0.02177707          0.08500605          1.05714786 	
                  	
# TL Model Prediction using the default (prior to Transfer Learning) model	
batch <- validation_TL_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
predictions <- predict(mod3_TL, batch[[1]])	
images <- tibble(	
  image     = batch[[1]][,,,1]  %>% array_branch(1),	
  synth_img = predictions[,,,1] %>% array_branch(1),	
  target    = batch[[2]][,,,1]  %>% array_branch(1)	
) 	
	
# Check performance of the Transfer-Learning model on one case (# 22)	
i=22	
pl1 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$image[i])[,1:256]), type="heatmap", name=paste0("Input Image ", i))	
pl2 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$target[i])[,1:256]), type="heatmap", name=paste0("Target ", i))	
pl3 <- plot_ly(z = ~ 255*as.matrix(as.data.frame(images$synth_img[i])[,1:256]), type="heatmap", name=paste0("Synth Image ", i))	
subplot(pl1, pl2, pl3, nrows=1) %>% hide_colorbar()	
	
pl_list <- list()	
for (i in 1:10) {	
  image = as.matrix(as.data.frame(images$image[i])[,1:256])	
  synth_img = as.matrix(as.data.frame(images$synth_img[i])[,1:256])	
  target = as.matrix(as.data.frame(images$target[i])[,1:256])	
  	
  p1 <- plot_ly(z=~255*image, type="heatmap", name=paste0("Image ", i))	
  p2 <- plot_ly(z=~255*target, type="heatmap", name=paste0("Target=Image ", i))	
  p3 <- plot_ly(z=~255*synth_img, type="heatmap", name=paste0("Synth Img ", i))	
  pl_list[[i]] <- subplot(p1,p2, p3, nrows=1) %>% hide_colorbar() # %>% 	
    # layout(yaxis=list(scaleanchor = "x",  scaleratio = 1),  hovermode = "y unified")	
    # layout(hovermode = "y unified")	
}	
pl_list %>%	
  subplot(nrows = length(pl_list)) %>%	
     layout(title="Input Brain Images, Targets, and Synth Reconstructions for N=10 Cases")	
#' 	
#' 	
#' ### Notes about the Tensorflow pipeline protocol	
#' 	
#' #### Preprocessing	
#' 	
#' The preprocessing pipeline allows inspecting intermediate results using `reticulate::as_iterator()` on the dataset.	
#' 	
#' 	
library(reticulate)	
# example <- training_dataset %>% reticulate::as_iterator() %>% reticulate::iter_next()	
# ex_img <- example$img %>% as.array()  #  %>% as.raster()	
# plot_ly(type="image", z=255*ex_img)       #  255*EBImage::Image(t(ex_img)))	
	
# display one brain image	
batch <- training_dataset %>% reticulate::as_iterator()	
	
pl_list <- list()	
for (i in 1:10) {	
  record <- reticulate::iter_next(batch)	
  image = record[[1]] %>% as.array()	
  mask = record[[2]] %>% as.array()	
  p1 <- plot_ly(z=~image[i,,,3], type="heatmap", name=paste0("Image ", i))	
  p2 <- plot_ly(z=~mask[i,,,1], type="heatmap", name=paste0("Mask ", i))	
  pl_list[[i]] <- subplot(p1,p2, nrows=1) %>% hide_colorbar() # %>% 	
    # layout(yaxis=list(scaleanchor = "x",  scaleratio = 1),  hovermode = "y unified")	
    # layout(hovermode = "y unified")	
}	
pl_list %>%	
  subplot(nrows = length(pl_list)) %>%	
     layout(title="Brain Images and Masks for N=10 Cases")	
#' 	
#' 	
#' For the subsequent transfer-learning, we need to add additional layers to the base-model (`mod1`). The output of every `Conv2D` and `MaxPooling2D` layer is a 3D tensor of shape `(height, width, channels)`. In our-case, the final mask-output is a single-channel image. The *width* and *height* dimensions typically shrink with network layer depth. The number of output channels for each `Conv2D` layer is controlled by the `filters` parameter (e.g., 32 or 64). As the width and height shrink, we can add more output channels in each `Conv2D` layer in the NN.	
#' 	
#' Note that the new transfer-learning model (`TL_model`) has only $17K$ parameters to estimate, as the $31M$ parameters of the base model (`mod`) are now frozen, i.e., they will not be tuned or estimated during the `TL_model` re-fitting, which will be much faster than the estimation of the original model.	
#' 	
#'  - Total params: 31,048,611	
#'  - Trainable params: 16,931	
#'  - Non-trainable params: 31,031,680	
#' 	
#' After we design the DCNN model (Unet), we need to again compile it and estimate the fit to obtain the remaining *trainable parameters.*	
#' 	
#' #### Network Layers	
#' 	
#'  - **Convolutional Layers**: In the late 1990s, LeCun introduced one of the most popular strategies for generating signature (feature) vectors corresponding to single- or multi-channel 2D images. Previously, alternative methods, such as wavelet or spectral decomposition, could be used to map images as features. Then more classical AI/ML techniques, such as support vector machine, knn, logistic regression, among others, may be employed to model, analyze, predict and classify images.	
#' 	
#' Transforming 2D or higher-dimensional images as feature-vectors disregard some of the spatial interaction between pixels, voxels, and tensors. *Convolution layers* tend to capture and protect some of the spatial information from neighboring spatial locations. This is accomplished by down-sampling the image into features by convolving the images with *kernels* (*filters*) and then using the resampled convolution images to predict specific outcomes (images, values, classes, etc.) The use of multiple convolution kernels to "filter" the image involves computing a  product to extract different features from the images.	
#' 	
#' For an image $f(m,n)$, and a kernel $g(k,l)$ defined over an integer grid $\{m,n,k,l\in \mathbb{Z}\}$, the [discrete convolution](https://en.wikipedia.org/wiki/Multidimensional_discrete_convolution) of $f$ and $g$ is:	
#' 	
#' $$\underbrace{(f*g)}_{convolution}[m,n]=\sum_{m=-\infty}^{\infty} {\sum_{n=-\infty}^{\infty} {\left (f[k, l]\times g[m-k, n-l]\right )}},$$	
#' where typically, the support of $f$ and $g$ is compact, e.g., $0\leq m,k\leq M-1$ and $0\leq n,l\leq N-1$.	
#' 	
#' The convolution of two finite sequences is defined by extending the sequences to finitely supported functions on the set of integers. When the sequences are the coefficients of two polynomials, then the coefficients of the ordinary product of the two polynomials are the convolution of the original two sequences. This is known as the Cauchy product of the coefficients of the sequences.	
#' 	
#' For instance, edge detection in an image can be done using a [Sobel kernel matrix](https://en.wikipedia.org/wiki/Sobel_operator) for vertical ($y$) and horizontal ($x$) edges.	
#' 	
#' $${\displaystyle g_{x}={\begin{bmatrix}+1&0&-1\\+2&0&-2\\+1&0&-1\end{bmatrix}} \quad {\mbox{and}}\quad g_{y}={\begin{bmatrix}+1&+2&+1\\0&0&0\\-1&-2&-1\end{bmatrix}}}.$$	
#' 	
#' 	
sobelX = matrix(c(1,2,1, 0,0,0, -1,-2,-1), nrow = 3, ncol = 3); sobelX	
sobelY=t(sobelX); sobelY	
	
library(jpeg)		
img_url <- "https://umich.instructure.com/files/1627149/download?download_frd=1"			
f <- image_read(img_url)		
plot(f)	
# To apply the convolution process manually, we use a convolve() function of the 'magick' package.	
library(magick)	
imgX <- image_convolve(f, sobelX)	
imgY <- image_convolve(f, sobelY)	
# plot(imgX, imgY)	
	
# Rotate 90 degrees	
F <- imager::mirror(imager::imrotate(imager::magick2cimg(f), 90), "x")	
ImgX <- imager::mirror(imager::imrotate(imager::magick2cimg(imgX), 90), "x")	
ImgY <- imager::mirror(imager::imrotate(imager::magick2cimg(imgY), 90), "x")	
p1 <- plot_ly(z=~255*(ImgX)[,,1,], type="image", name="(f*SobelX)")	
p2 <- plot_ly(z=~255*(F)[,,1,], type="image", name="f")	
p3 <- plot_ly(z=~255*(ImgY)[,,1,], type="image", name="(f*SobelY)")	
subplot(p1, p2, p3, nrows=1) %>% 	
  layout(title="Image convolution with Sobel Kernel Filters (Left=f*SobelX, Middle=f, Right=f*SobelY)")	
#' 	
#' 	
#' During the DCNN training process, the encoding phase typically included 2D convolutional layers (`layer_conv_2d()`) paired with pooling layers that reduce the size of the tensor shape and transform images by sliding the kernel filter by a stride (1, 2, 3, etc.) pixels to the right and down. The relation between the resulting feature-vector size and the kernel size is represented by:	
#' 	
#' $$Feature\ size = \frac{Image\ size\ −\ Kernel\ size}{Stride} + 1.$$	
#' For instance, a 10x10 square image, a filter of size 4x4, and a stride of 2 pixels, the $Feature\ size = \frac{10 − 4}{2} + 1 = 3$. Similarly, during the decoding phase, we include deconvolutional layers (`layer_conv_2d_transpose()`) that reverse the process by increasing the grid-size (tensor shape sizes) until we reach the desired output layer shape, usually the same as the input images, but could also be different. More information about [`tensorflow/keras` layers](https://tensorflow.rstudio.com/reference/keras/#section-core-layers), [loss functions](https://tensorflow.rstudio.com/reference/keras/#section-losses), and [DCNN model performance metrics](https://tensorflow.rstudio.com/reference/keras/#section-metrics) is available on the [RStudio Tensorflow/Keras website](https://tensorflow.rstudio.com/reference/keras).	
#' 	
#'  - **Max Pooling Layer**. Max pooling layers shrink the spatial extent of the convolved features and reduce overfitting by providing an abstracted feature representation. Instead of convolving (i.e., dot-product multiplying the image and the kernel) between the input and the kernel, Max-Pooling layers take the maximum value of image intensity over the region covered by the kernel filter. There are many alternatives to max-pooling, e.g., average-pooling, which computes the mean (arithmetic-average) of all image intensities covered by the kernel filter.	
#' 	
#' 	
#'  - **Fully Connected layers**: Input nodes (from the left) in fully connected layers are connected to every node in the subsequent layer to the right. One or several fully connected layers may be common towards the end of a DCNN to provide support for learning non-linear affinities between high-level features generated as outputs of the prior convolutional layer. Good network designs typically include *dropout layers* between two consecutive fully connected layers (to reduce overfitting) and specify *activation functions* to capture non-linearity.	
#' 	
#' The final fully connected layer allows us to control the output tensor shape size which reflects the expected type of classification, prediction, or forecasting. For instance, if the outp[ut is expected to be 6-class labeling, the final layer will output a vector of size 6, i.e., one node for each possible class label. A *softmax* of this 6-feature vector would yield a 6D vector containing probabilities ($0\leq p_i\leq 1$, $\forall 1\leq i\leq 6$) one for each class label. Dropout layers provide a mechanism for regularizing the model and reducing overfitting of the DCNN. Dropout layers may follow fully connected layers or appear after other max-pooling layers to generate image noise augmentation. Dropout layers randomly annihilate (set to zero) some of the connections of the input tensor, according to a Bernoulli distribution; hence some inputs are triaged with probability $p$.	
#' 	
#' #### Model Tracking and Network Visualization	
#' 	
#' One can use `tensorboard` to dynamically track the progression of an ongoing training process as well as to visualize a neural network structure as a graph. See these examples for more details, [example 1](https://www.tensorflow.org/tensorboard/graphs) and [example 2](https://cran.r-project.org/web/packages/tfestimators/vignettes/tensorboard.html). The basic mechanism for this involves the following steps.	
#' 	
#'  - Install `tensorboard` (in a terminal/shell outside R/RStudio): This can be done in different ways and is system/OS dependent. Some examples include:	
#'     + `python -m pip install --user --upgrade pip`	
#'     + `conda install -c conda-forge tensorboard`	
#'     + `conda activate tensorflow`	
#'  - Launch the `tensorboard` UI (from terminal shell): `> tensorboard --logdir logs/run_a`. Alternatively, you can launch tensorboard from RStudio directly via *Rstudio --> Tools --> Shell* and entering this command in the shell `tensorboard --logdir logs/run_a`.	
#'  - Open a local browser and point to this URL address: `http://localhost:6006/`	
#'  - Your code has to use callback mechanism for tracking, see the example above (`print_dot_callback()`) with printing periods or [this example](https://tensorflow.rstudio.com/guide/keras/training_callbacks/).	
#'  - Start the NN training process in the Rmd/R/RStudio environment and observe the tracking metrics dynamically updating in the browser. Note that your python/conda/anaconda shell needs to be open/live for this (*localhost*) process to work and to keep the browser portal active and listening to python updates during the fitting/training process.	
#'  	
#' 	
#' 	
#' # References	
#' 	
#'  - Chollet and Allaire's [Deep Learning with R](https://www.manning.com/books/deep-learning-with-r) textbook (ISBN 9781617295546) and the corresponding [code notebook](https://github.com/jjallaire/deep-learning-with-r-notebooks/tree/master/notebooks)	
#'  - [Deep Neural Networks](https://github.com/ledell/useR-machine-learning-tutorial/blob/master/deep-neural-networks.Rmd)	
#'  - [Google's TensorFlow API](http://playground.tensorflow.org)	
#'  - Olaf Ronneberger, Philipp Fischer, Thomas Brox (2015). [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/), Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol.9351: 234--241, 2015, [arXiv:1505.04597](http://arxiv.org/abs/1505.04597). 	
#' 	
#' 	
#' 	
#' 	
#' 	
#' 	
#' 	
#'     	
	
#' 	
#' 		
#' 			
#' 			
#' 			
#' 		
#' 		
#' 		
#' 			
#' 			
#' 		
#'