This is Part 2 of the larger DSPA Visualization Chapter, which is difficult to render in a single browser window due to extreme memory demands. Visualization Chapter Part 1 includes data handling, statistical measures of centrality and dispersion, understanding categorical and numeric data, uniform and normal distributions, missing data imputation, web page parsing, visualization of tabular HTML data, and cohort-rebalancing (for imbalanced groups).

In this chapter, we present a number of complementary strategies for data wrangling, harmonization, manipulation, aggregation, visualization, and graphical exploration. Specifically, we discuss alternative methods for loading and saving computable data objects, importing and exporting different data structures, computing sample statistics for quantitative variables, plotting sample histograms and model distribution functions, and scraping data from websites. In addition, we cover exploratory data analytics (EDA) techniques, handling of incomplete (missing) data, and cohort-rebalancing of imbalanced groups.

1 Exploratory Data Analytics (EDA)

In this section, we will see a broad range of simulations and hands-on activities to highlight some of the basic data visualization techniques using R. A brief discussion of alternative visualization methods is followed by demonstrations of histograms, density, pie, jitter, bar, line and scatter plots, as well as strategies for displaying trees and graphs and 3D surface plots. Many of these are also used throughout the textbook in the context of addressing the graphical needs of specific case-studies.

It is practically impossible to cover all options of every different visualization routine. Readers are encouraged to experiment with each visualization type, change input data and parameters, explore the function documentation using R-help (e.g., ?plot), and search for new R visualization packages and new functionality, which are continuously being developed.
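As a minimal sketch of this kind of experimentation, the snippet below draws the same simulated sample twice with different `hist()` settings. The simulated data, the seed, and the specific parameter values are illustrative assumptions, not part of the chapter's case-studies.

```r
# Illustrative only: simulated data and parameter choices are assumptions.
set.seed(1234)
x <- rnorm(500, mean = 10, sd = 2)   # 500 simulated normal observations

# In an interactive session, ?hist or help("hist") opens the documentation.

# Same data, two different parameter settings, shown side by side
old_par <- par(mfrow = c(1, 2))
h_coarse <- hist(x, breaks = 10, main = "breaks = 10", xlab = "x")
h_fine   <- hist(x, breaks = 40, col = "lightblue",
                 main = "breaks = 40", xlab = "x")
par(old_par)                          # restore the previous layout
```

Note that `breaks` is treated by `hist()` as a suggestion, so the number of bins actually drawn may differ from the value requested.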

1.1 General Questions Driving Visualization

  • What exploratory visualization techniques are available to visually interrogate my specific data?
  • How can we examine pairwise associations and correlations in a multivariate dataset?
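One possible way to approach the second question is to combine a correlation matrix with a scatterplot matrix. The sketch below uses the built-in `mtcars` dataset and an arbitrary choice of four numeric variables; both are assumptions made for illustration.

```r
# Illustrative only: mtcars and the selected columns are assumptions.
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pairwise Pearson correlations between the selected variables
cor_mat <- cor(vars)
print(round(cor_mat, 2))

# Scatterplot matrix: one panel for each pair of variables
pairs(vars, main = "Pairwise scatterplots (mtcars subset)")
```

The numeric matrix summarizes only the strength of linear association, while the scatterplot matrix can also reveal nonlinear patterns and outliers, so the two views complement each other.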

1.2 Classification of Visualization Methods

Scientific data-driven or simulation-driven visualization methods are hard to classify unambiguously. However, the following criteria can help organize them:

  • Data Type: structured/unstructured, small/large, complete/incomplete, time/space, ASCII/binary, Euclidean/non-Euclidean, etc.
  • Task Type: the analytic task that the visualization supports, which determines how the researcher, the data, and the display software/platform interact
  • Scalability: each visualization technique has practical limits, such as the amount of data it can display effectively
  • Dimensionality: visualization techniques can also be classified according to the number of attributes (variables) they display
  • Positioning and Attributes: the placement of attributes on the chart may affect the interpretation of the display. For example, in correlation analysis the relative distance among the plotted attributes is meaningful
  • Investigative Need: the specific scientific question or exploratory interest may also determine the type of visualization:
      • Examining the composition of the data
      • Exploring the distribution of the data
      • Contrasting or comparing several data elements, relations, or associations
      • Unsupervised exploratory data mining
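To make the investigative needs listed above more concrete, here is a rough sketch that pairs each of them with one base R plot. The choice of the built-in `mtcars` dataset, the variables, and the specific plot types are all assumptions made for illustration; many other pairings are equally reasonable.

```r
# Illustrative only: dataset, variables, and plot types are assumptions.
old_par <- par(mfrow = c(2, 2))

# Composition: share of cars by number of cylinders
cyl_counts <- table(mtcars$cyl)
pie(cyl_counts, main = "Composition: cylinders")

# Distribution: histogram of fuel efficiency
hist(mtcars$mpg, main = "Distribution: mpg", xlab = "Miles per gallon")

# Comparison: fuel efficiency contrasted across cylinder groups
boxplot(mpg ~ cyl, data = mtcars, main = "Comparison: mpg by cylinders",
        xlab = "Cylinders", ylab = "Miles per gallon")

# Unsupervised exploration: hierarchical clustering on scaled features
hc <- hclust(dist(scale(mtcars[, c("mpg", "hp", "wt")])))
plot(hc, cex = 0.6, main = "Unsupervised: dendrogram", xlab = "", sub = "")

par(old_par)   # restore the previous layout
```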

The following table lists common data visualization methods according to task type: