This CSCD HTML5 resource demonstrates:
- t-distributed stochastic neighbor embedding (t-SNE) statistical method for manifold dimension reduction,
- The TensorBoard machine learning platform, and
- Hands-on Big Data Analytics using the UK Biobank data.
Chiranjeevi Vegi, Hanbo Sun, Tuo Wang, Chao Gao, Alex Kalinin, Lu Zhao, Arthur Toga, Debra Barton, Ivo Dinov and the CSCD Team
Clustering: the #1 example (and most probably the first one) a machine learning expert will give you if you ask "What are examples of unsupervised learning?".
Clustering is also a closet of shame of machine learning as a scientific domain. Nobody really knows what a good clustering is. There's no algorithmic way to optimally decide on the good initialization of clustering algorithms, the optimal number of clusters, the metric to compare the similarity/dissimilarity of points within one cluster. Only heuristics and advice of kind "try this/try that".
Classification/regression/sequence modeling/reinforcement learning are all living a boom of new discoveries and new problems being solved. Clustering desperately got stuck in the 1980s.
source: Andriy Burkov on LinkedIn
In this article, we suggest TensorBoard interactive visualization as an additional tool to help visualize higher dimensional data and to understand unsupervised models and their results.
With data increasing at an exponential rate, modern datasets contain millions of observations and attributes/features. One might argue that more data is always better, but this is not always the case. Datasets with many dimensions/features are subject to what is colloquially known as the curse of dimensionality. Medical images, for example, generate thousands of features and are subject to the curse of dimensionality. This problem pushed researchers to explore dimensionality reduction procedures such as Principal Component Analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and Linear Discriminant Analysis (LDA). Here, we demonstrate t-SNE.
The math behind some of these dimensionality reduction methods is elaborately explained in the Data Science and Predictive Analytics (DSPA) book, in this article, and more intuitively in this video.
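As a quick illustration of t-SNE itself (a minimal sketch using toy data and placeholder variable names, not the study data), a 2D embedding can be computed with scikit-learn:

# Minimal sketch (illustrative): reduce a high-dimensional matrix to 2D with t-SNE.
# The random matrix and names below are placeholders, not the UK Biobank data.
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 100)                         # 500 observations, 100 features (toy data)
tsne = TSNE(n_components=2, perplexity=30, random_state=123)
X_2d = tsne.fit_transform(X)                         # shape (500, 2), suitable for a scatter plot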
Now that we have established some understanding of visualizing higher dimensional data, let's examine how one could leverage this to understand unsupervised model performance.
Multi-source, heterogeneous and multifaceted data for 11,000 participants were acquired for a large scale study. Imaging, genetics, clinical assessments and demographic information were recorded at multiple time points. The data were preprocessed, derived neuroimaging morphometry measures were computed, and a single computable data object was created by harmonizing and aggregating all the available information. The final sample size was reduced to 9,914, as some cases were removed due to preprocessing errors, extreme missingness, or inconsistencies. The goal of the study was to examine thousands of data elements (3,300), predict specific clinical outcomes, determine the most salient features associated with computable clinical phenotypes, and interpret the joint data holistically, in a lower dimensional space.
Description: For each participant with a structural MRI brain scan, we derived a set of 3,000 neuroimaging biomarkers. These represent a quantitative signature vector of the 3D stereotactic brain anatomy. Additionally, each participant had clinical assessment, demographic, and phenotypic data, which were harmonized and integrated with the derived neuroimaging biomarkers.
Number of Observations: 9,914
Number of features: 3,297
After all the data pre-processing and feature extraction, it's time to find hidden patterns in the data. Since we do not have ground truth labels, unsupervised learning techniques are used. We will not go into depth on these machine learning models in this article.
Before training an unsupervised model, we need to note that the data has 3,297 features, which can result in poor model performance. So the first step is dimensionality reduction using PCA to find the minimum number of components that explains 80% of the variance. As seen in the graph below, approximately 300 components explain 80% of the variance in the data. Hence, the final data fed into the machine learning model has 9,914 observations and 300 features/attributes.
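A rough sketch of this step, assuming the preprocessed data are in a pandas DataFrame named df (an illustrative name, not defined in the original text), could look like this:

# Sketch: find how many principal components explain 80% of the variance,
# then keep only that many components. `df` is assumed to be the cleaned data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(df)                 # standardize the features
pca = PCA().fit(X)                                     # full PCA to inspect the variance spectrum
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.80)) + 1     # ~300 components in this study
X_reduced = PCA(n_components=n_components).fit_transform(X)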
Model Performance
In the absence of ground truth labels, very few metrics are available to evaluate model performance. These metrics are:
Interpretation: The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters. The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster.
Interpretation: The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster. The score is fast to compute.
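For reference, a small sketch of computing the Silhouette Coefficient with scikit-learn, assuming a data matrix X_reduced and predicted cluster labels labels (both illustrative names from the steps above), is shown below:

# Sketch: compute the Silhouette Coefficient for a set of predicted cluster labels.
# `X_reduced` and `labels` are assumed to come from the PCA and clustering steps.
from sklearn.metrics import silhouette_score

score = silhouette_score(X_reduced, labels)    # in [-1, 1]; values near 0 suggest overlapping clusters
print("Silhouette Coefficient:", round(score, 4))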
Note: Since the focus of this article is on understanding the results of unsupervised learning using TensorBoard visualizations, we will not do much hyperparameter tuning of the machine learning models other than K-Means++.
The parameter of interest in this model is the optimal 'K', i.e., the number of clusters. The elbow graph, shown below, is a popular method to estimate the value of 'K'.
We are looking for a sharp bend in the plot of inertia vs. number of clusters, beyond which the change in inertia is very small; that bend point can be considered the optimal number of clusters. In this case, we do not see such a sharp bend. However, we see that after 3 clusters the inertia decreases only gradually. Thus, we will fit our data with 3 clusters.
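A minimal sketch of the elbow heuristic, assuming the PCA-reduced matrix from the previous step is stored in X_reduced (an illustrative name), might look like this:

# Sketch of the elbow heuristic: fit K-Means++ for a range of K and plot the inertia.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(2, 11)
inertias = [KMeans(n_clusters=k, init='k-means++', random_state=123).fit(X_reduced).inertia_
            for k in ks]
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.show()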
Result of the K-Means++ model: As seen from the label distribution plot above, 85% of the observations fall in clusters 1 and 3, with a Silhouette Coefficient of 0.091. Based on this, it can be inferred that the model performed poorly and that there is overlap between the clusters. This can be seen using a t-SNE plot generated in Python. Later we will use TensorBoard to generate a 3D visualization of t-SNE.
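For illustration, fitting K-Means++ with K=3 and viewing a 2D t-SNE scatter colored by the predicted labels could be sketched as follows; X_reduced is again an assumed name for the PCA-reduced data:

# Sketch: fit K-Means++ with K=3, report the silhouette, and color a 2D t-SNE plot by cluster.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=3, init='k-means++', random_state=123).fit(X_reduced)
labels = kmeans.labels_
print("Silhouette Coefficient:", round(silhouette_score(X_reduced, labels), 3))

X_2d = TSNE(n_components=2, random_state=123).fit_transform(X_reduced)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5, cmap='viridis')
plt.title('2D t-SNE colored by K-Means++ labels')
plt.show()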
Note: All the models below are trained with default parameters.
Result: Each observation is a cluster
Result: Silhouette Coefficient = -0.010253807701991864
Result: Silhouette Coefficient = 0.0741
Result: All data observations in one cluster
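The specific models compared above are not named in this extract; purely as an illustration of the workflow, a few common scikit-learn clustering algorithms could be fit with default parameters and compared by Silhouette Coefficient as below (the particular algorithms chosen here are an assumption, not the study's exact models):

# Illustrative sketch: fit several clustering models with default parameters on `X_reduced`
# and compare their Silhouette Coefficients where they are defined.
from sklearn.cluster import DBSCAN, AgglomerativeClustering, Birch
from sklearn.metrics import silhouette_score

for name, model in [('DBSCAN', DBSCAN()),
                    ('Agglomerative', AgglomerativeClustering()),
                    ('Birch', Birch())]:
    labels = model.fit_predict(X_reduced)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # ignore DBSCAN noise label
    if n_clusters > 1:
        print(name, 'Silhouette Coefficient:', round(silhouette_score(X_reduced, labels), 4))
    else:
        print(name, 'produced fewer than two clusters; silhouette is undefined')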
Though we have a metric to evaluate and compare model performance, without ground truth labels we cannot ascertain that a particular model is performing well. One way to address this is to visualize the underlying clusters formed by each model. Such visualizations can put our doubts at ease and also provide meaningful insights into model performance, rather than being limited to the Silhouette Coefficient.
The computations you'll use TensorFlow for - like training a massive deep neural network - can be complex and confusing. To make it easier to understand, debug, and optimize TensorFlow programs, we've included a suite of visualization tools called TensorBoard. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data like images that pass through it.
Of the many features TensorBoard offers, we will use the Embedding Projector. TensorBoard includes the Embedding Projector, a tool that lets you interactively visualize embeddings. This tool can read embeddings from your model and render them in two or three dimensions.
The Embedding Projector has three panels. It also provides three ways to reduce the dimensionality of a data set.
What does TensorBoard visualization look like?
Let's get started generating a t-SNE visualization in TensorBoard with our own data. The steps involved:
Required libraries: TensorFlow, Pandas, NumPy, and sklearn (PCA, StandardScaler). You can also create an environment using the .yml file found here. To use the .yml file, run the command conda env create -f filename.yml in the terminal (Mac) or the conda prompt (Windows).
Before jumping into the code to visualize higher dimensional data, note that running the code below generates the necessary files, such as the data embeddings, metadata, checkpoints, and TensorFlow variables, that TensorBoard reads during startup.
## Importing required libraries
import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Get working directory
PATH = os.getcwd()

## Path to save the embeddings and checkpoints generated
LOG_DIR = PATH + '/project-tensorboard/log-1/'

## Load data
df = pd.read_csv("scaled_data.csv", index_col=0)

## Path to the metadata file. Metadata consists of your labels. This is optional.
## Metadata helps us visualize (color) the different clusters that form in t-SNE.
metadata = os.path.join(LOG_DIR, 'df_labels.tsv')

## Reduce the data with PCA before building the embedding
pca = PCA(n_components=50, random_state=123, svd_solver='auto')
df_pca = pd.DataFrame(pca.fit_transform(df))
df_pca = df_pca.values

## TensorFlow Variable from the data
tf_data = tf.Variable(df_pca)

## Running the TensorFlow session
with tf.Session() as sess:
    saver = tf.train.Saver([tf_data])
    sess.run(tf_data.initializer)
    saver.save(sess, os.path.join(LOG_DIR, 'tf_data.ckpt'))
    config = projector.ProjectorConfig()
    # One can add multiple embeddings.
    embedding = config.embeddings.add()
    embedding.tensor_name = tf_data.name
    # Link this tensor to its metadata (labels) file
    embedding.metadata_path = metadata
    # Saves a config file that TensorBoard will read during startup.
    projector.visualize_embeddings(tf.summary.FileWriter(LOG_DIR), config)
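Note that the script above points embedding.metadata_path at a df_labels.tsv file but does not create it. One possible sketch for producing that file, assuming the predicted cluster labels from the earlier K-Means++ step are available in a variable named labels (an assumption here), is:

# Sketch: write the cluster labels to the metadata TSV referenced by the projector config.
# A single-column metadata file for the Embedding Projector should contain one label per row,
# without a header. `labels` and `LOG_DIR` are assumed to be defined as above.
import os
import pandas as pd

pd.Series(labels).to_csv(os.path.join(LOG_DIR, 'df_labels.tsv'),
                         sep='\t', index=False, header=False)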
Now, open a terminal and run the following command, replacing the path with the folder where the log files are stored (LOG_DIR above):
tensorboard --logdir=path_to_log_directory --port=6006
Result
Let's summarize a few of our observations from the plot. In the visualization above, the different colors come from the metadata (labels) predicted by the unsupervised model, in this case K-Means++. We see four clusters being formed, even though our unsupervised learning model was trained with 3 clusters. We also see that the blue and orange clusters appear to share observations, while the rest share few observations. This suggests that with good parameter tuning and careful study of the observations, we can identify/predict clusters that are nicely separated from one another.
Another important feature is visualizing data points together with their associated images. With minimal effort, a subject matter expert can carefully study the clusters and deduce insights about model performance. Thus, this lets us rely on visual aids alongside the popular unsupervised performance metrics to improve our model.
Try-It-Now experiment: Now, it's your turn to complete a hands-on activity ...
Try-It-Now experiment 2: Interactive activity using user-specified data.
You can also try to run similar analysis to the UK Biobank study above using your own dataset.
This will require you to provide a pair of ASCII text files that can be loaded from your computer.
The first file contains tab-delimited (TSV) data including the predictor vectors (rows = cases, columns = features).
The second file is an optional TSV file including metadata like labels for each case (row), if any.
Then you can run the same dimensionality reduction on your own data in 2D or 3D.
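As a small illustrative sketch (the file names below are placeholders, not files provided by the application), the two TSV files could be checked locally with pandas before loading them:

# Sketch: read a user-supplied predictor TSV and an optional metadata TSV.
import pandas as pd

X_user = pd.read_csv('my_features.tsv', sep='\t')     # rows = cases, columns = features
labels_user = pd.read_csv('my_labels.tsv', sep='\t')  # optional per-case metadata (labels)
print(X_user.shape, labels_user.shape)                # quick sanity check that the rows align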
This link provides direct access to the User-Data-driven application ...
Acknowledgments