
1 Data Manipulation

Let’s start with simple data importing, plotting, summarizing, and exporting. Load the following two datasets, generate summary statistics for all variables, plot some of the features (e.g., using histograms, box plots, density plots, etc.), and save the data locally as CSV files:
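The import, summarize, plot, and export workflow can be sketched with a stand-in dataset. The built-in `iris` data frame and the temporary output path are assumptions made only for illustration; substitute the exercise datasets and a local path of your choice.

```r
# Stand-in data: the built-in iris data frame (an assumption, not one of the exercise datasets)
dat <- iris

# Summary statistics for all variables
print(summary(dat))

# A few feature plots: histogram, box plot, and density plot
hist(dat$Sepal.Length, main = "Sepal length", xlab = "cm")
boxplot(Sepal.Length ~ Species, data = dat, main = "Sepal length by species")
plot(density(dat$Petal.Width), main = "Petal width density")

# Save the data locally as CSV (a temporary, hypothetical path)
out_file <- file.path(tempdir(), "iris_export.csv")
write.csv(dat, out_file, row.names = FALSE)

# Re-import the file to confirm the round trip
dat2 <- read.csv(out_file, stringsAsFactors = TRUE)
```

Setting `row.names = FALSE` avoids writing an extra index column that would otherwise reappear as a spurious variable on re-import.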

2 Bivariate relations

Next, explore some bivariate relations in the data. Use the ALS case-study data or the SOCR Knee Pain Data to illustrate bivariate relations (e.g., using bivariate plots, correlations, cross tables, etc.).
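A minimal sketch of these bivariate summaries, assuming the built-in `mtcars` data as a stand-in for the ALS or Knee Pain data (the variable choices are illustrative only):

```r
# Stand-in data: built-in mtcars (an assumption; substitute the ALS or Knee Pain data)
dat <- mtcars

# Bivariate plot of two continuous variables with a least-squares line
plot(dat$wt, dat$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(lm(mpg ~ wt, data = dat), col = "blue")

# Pearson and Spearman correlations
r_pearson  <- cor(dat$wt, dat$mpg)
r_spearman <- cor(dat$wt, dat$mpg, method = "spearman")
print(c(pearson = r_pearson, spearman = r_spearman))

# Cross table of two categorical variables, with an exact test of association
tab <- table(Cylinders = dat$cyl, Gears = dat$gear)
print(tab)
print(fisher.test(tab))
```

Fisher's exact test is used here instead of a chi-squared test because several cells of this small table have low counts.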

Use the 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015 data to show the relation between temperature and time. [Hint: use geom_line(), geom_bar(), or plot_ly()].

Sample code for dealing with the temperatures data

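# Supply the dataset URL or local file path in the empty string below
Temp_Data <- as.data.frame(read.csv("", header = TRUE, na.strings = c("", ".", "NA", "NR")))
summary(Temp_Data)
# View(Temp_Data); colnames(Temp_Data)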

# Wide-to-Long transformation: reshape arguments include
# (1) list of variable names that define the different times or metrics (varying),
# (2) the name we wish to give the variable containing these values in our long dataset (v.names),
# (3) the name we wish to give the variable describing the different times or metrics (timevar),
# (4) the values this variable will have (times), and
# (5) the end format for the data (direction)
# Before reshaping make sure all data types are the same, as putting them in 1 column will
# otherwise generate inconsistencies/errors
colN <- colnames(Temp_Data[, -1])
longTempData <- reshape(Temp_Data, varying = colN, v.names = "Temps", timevar = "Months", times = colN, direction = "long")

library(ggplot2)
# View(longTempData)
# Stacked bars of temperature by month
bar2 <- ggplot(longTempData, aes(x = Months, y = Temps, fill = Months)) + geom_bar(stat = "identity")
print(bar2)
# Stacked bars of temperature by year, colored by month
bar3 <- ggplot(longTempData, aes(x = Year, y = Temps, fill = Months)) + geom_bar(stat = "identity")
print(bar3)

# One temperature line per month across years
p <- ggplot(longTempData, aes(x = Year, y = as.integer(Temps), color = Months)) + geom_line()
p

3 Missing data

Using one of the above datasets, introduce (artificially) some missing data, impute the missing values, and examine the differences between the original, incomplete, and imputed datasets.
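One possible version of this workflow is sketched below. It assumes the numeric columns of the built-in `iris` data as a stand-in, removes about 10% of the cells completely at random, and uses simple column-mean imputation. Mean imputation is chosen only for brevity; it is not necessarily the method intended by the exercise, and model-based approaches (e.g., multiple imputation) usually preserve variability better.

```r
set.seed(2024)

# Stand-in data: numeric columns of the built-in iris data (an assumption)
original <- iris[, 1:4]

# Introduce about 10% missing values completely at random
n_cells  <- nrow(original) * ncol(original)
miss_idx <- sample(n_cells, size = round(0.10 * n_cells))
incomplete_mat <- as.matrix(original)
incomplete_mat[miss_idx] <- NA
incomplete <- as.data.frame(incomplete_mat)

# Impute each missing value with the observed mean of its column
imputed <- incomplete
for (v in names(imputed)) {
  imputed[[v]][is.na(imputed[[v]])] <- mean(imputed[[v]], na.rm = TRUE)
}

# Compare column means and standard deviations across the three versions
comparison <- rbind(
  original_mean   = colMeans(original),
  incomplete_mean = colMeans(incomplete, na.rm = TRUE),
  imputed_mean    = colMeans(imputed),
  original_sd     = sapply(original, sd),
  imputed_sd      = sapply(imputed, sd)
)
print(round(comparison, 3))
```

By construction, mean imputation leaves each column mean equal to the mean of the observed values, so the differences to look for are in the spread and in the relations between variables.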

4 Surface plots

Generate a surface plot for the (RF) Knee Pain data illustrating the 2D distribution of locations of the patient reported knee pain (use plot_ly() and kernel density estimation).
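The kernel density step can be sketched as follows. The coordinates are simulated (a hypothetical stand-in for the knee pain locations), `MASS::kde2d()` is one of several possible density estimators, and the plotly call is guarded so the sketch falls back to a static `persp()` plot when plotly is not installed.

```r
library(MASS)  # provides kde2d(); MASS ships with standard R installations

set.seed(42)
# Hypothetical stand-in for knee pain locations: (x, y) coordinates from two clusters
x <- c(rnorm(300, mean = 200, sd = 40), rnorm(300, mean = 600, sd = 50))
y <- c(rnorm(300, mean = 300, sd = 60), rnorm(300, mean = 250, sd = 40))

# 2D kernel density estimate on a 50 x 50 grid
dens <- kde2d(x, y, n = 50)

if (requireNamespace("plotly", quietly = TRUE)) {
  # plotly indexes rows of z by y and columns by x, hence the transpose
  fig <- plotly::plot_ly(x = dens$x, y = dens$y, z = t(dens$z), type = "surface")
  print(fig)
} else {
  # Static fallback when plotly is not available
  persp(dens$x, dens$y, dens$z, theta = 30, phi = 30,
        xlab = "x", ylab = "y", zlab = "density")
}
```

With the real data, the density could be estimated separately for each knee view or side before plotting, depending on how the locations are recorded.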

5 Unbalanced groups

Rebalance the groups of ALS (training data) patients according to \(Age\gt 50\) and \(Age\leq 50\) using synthetic minority oversampling (SMOTE) to ensure approximately equal cohort sizes.
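To make the idea concrete, here is a minimal base-R sketch of SMOTE-style oversampling. The data frame, its column names, and the helper `smote_minimal()` are hypothetical and written only for illustration; packaged SMOTE implementations may differ in details such as neighbor selection and the handling of categorical features.

```r
set.seed(1234)

# Hypothetical stand-in for the ALS training data: 150 patients over 50, 50 aged at most 50
als <- data.frame(
  Age      = c(runif(150, 51, 80), runif(50, 25, 50)),
  Feature1 = rnorm(200),
  Feature2 = rnorm(200)
)
als$Group <- ifelse(als$Age > 50, "over50", "atMost50")

# Minimal SMOTE: each synthetic case is a random point on the segment between
# a minority case and one of its k nearest minority neighbors
smote_minimal <- function(X, n_new, k = 5) {
  X <- as.matrix(X)
  D <- as.matrix(dist(scale(X)))  # Euclidean distances on standardized features
  diag(D) <- Inf                  # a case is not its own neighbor
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                  dimnames = list(NULL, colnames(X)))
  for (s in seq_len(n_new)) {
    i  <- sample(nrow(X), 1)                 # random minority case
    nn <- order(D[i, ])[seq_len(k)]          # indices of its k nearest neighbors
    j  <- nn[sample(length(nn), 1)]          # one neighbor chosen at random
    synth[s, ] <- X[i, ] + runif(1) * (X[j, ] - X[i, ])
  }
  as.data.frame(synth)
}

minority  <- als[als$Group == "atMost50", c("Age", "Feature1", "Feature2")]
n_needed  <- sum(als$Group == "over50") - nrow(minority)
synthetic <- smote_minimal(minority, n_new = n_needed, k = 5)
synthetic$Group <- "atMost50"

balanced <- rbind(als, synthetic)
print(table(balanced$Group))
```

Because every synthetic case is a convex combination of two minority cases, its Age cannot exceed 50, so the new cases stay in the minority cohort.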

6 Common plots

Use the TBI dataset (CaseStudy11_TBI) to display some interactive (SVG) visualization plots - e.g., histograms, density plots, pie charts, heatmaps, barplots, and paired correlation plots.
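A few of these interactive plot types can be sketched with plotly. The data frame below is simulated and its column names are hypothetical, not those of the TBI dataset; the plotly calls are guarded so the block still runs when the package is not installed.

```r
set.seed(11)

# Hypothetical stand-in for the TBI data; all column names are illustrative
tbi <- data.frame(
  age    = round(rnorm(60, mean = 40, sd = 12)),
  score1 = sample(3:15, 60, replace = TRUE),
  score2 = sample(3:15, 60, replace = TRUE),
  injury = sample(c("MVA", "Fall", "Assault"), 60, replace = TRUE)
)

counts <- table(tbi$injury)                       # category frequencies
dens   <- density(tbi$age)                        # kernel density of age
cm     <- cor(tbi[, c("age", "score1", "score2")])  # correlation matrix

if (requireNamespace("plotly", quietly = TRUE)) {
  library(plotly)
  p_hist <- plot_ly(tbi, x = ~age, type = "histogram")
  p_dens <- plot_ly(x = dens$x, y = dens$y, type = "scatter",
                    mode = "lines", fill = "tozeroy")
  p_pie  <- plot_ly(labels = names(counts), values = as.numeric(counts), type = "pie")
  p_bar  <- plot_ly(x = names(counts), y = as.numeric(counts), type = "bar")
  p_heat <- plot_ly(x = colnames(cm), y = rownames(cm), z = cm, type = "heatmap")
  print(p_hist)  # print any of the objects above to render it
}
```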

7 Trees and Graphs

Use the SOCR Resource Hierarchical Dataset (JSON) or the DSPA Dynamic Certificate Map (JSON) to generate tree/graph displays of the corresponding structural information contained in the JSON object.

Processing a JSON hierarchy

library(jsonlite)
library(RCurl)
library(data.tree)
# Supply the JSON URL in the empty string below
url <- ""
raw_data <- getURL(url)
document <- fromJSON(raw_data)
# Build a three-level tree from the nested JSON hierarchy
tree <- Node$new(document$name)
for (i in seq_len(length(document))) {
  tree$AddChild(document$children$name[[i]])
  for (j in seq_len(length(document$children$children[[i]]))) {
    tree$children[[i]]$AddChild(document$children$children[[i]]$name[[j]])
    for (k in seq_len(length(document$children$children[[i]]$children[[j]]))) {
      tree$children[[i]]$children[[j]]$AddChild((document$children$children[[i]]$children[[j]]$name[[k]]))
    }
  }
}
suppressMessages(library(igraph))
plot(as.igraph(tree, directed = TRUE, direction = "climb"))

suppressMessages(library(networkD3))
# Convert the tree to an edge list and render it as an interactive network
treenetwork <- ToDataFrameNetwork(tree, "name")
simpleNetwork(treenetwork, fontSize = 10)

8 Data EDA examples

  • Use the SOCR_OilGasData to generate three individual bar plots for Fossil Fuels, Nuclear Electric Power, and Renewable Energy, respectively (Hint: you may use plot_ly(), or ggplot() and facet_grid()). Include two lines for Production and Consumption. The x-axis should be time (you may use year directly as a numeric type). Draw the Consumption line slightly wider and more noticeable (e.g., in magenta).

  • Use the SOCR_OzoneData to generate a correlation plot with the variables “MTH_1”, “MTH_2”, …, “MTH_12”. (Hint: first compute the correlation matrix, then apply corrplot() or plot_ly(); try multiple chart types, e.g., “circle”, “pie”, “mixed”, etc.)

  • Use the SOCR_CA_OzoneData to generate a 3D surface plot (using the variables Longitude, Latitude, and O3).

  • Generate random numbers from the Cauchy distribution. Draw a histogram and compare it with the histogram of a normal distribution. What do you find? You may try different seeds to re-generate the Cauchy random numbers.

  • Use the SOCR_Data_PD_BiomedBigMetadata to generate a heatplot. Set RowSideColors and ColSideColors and use rainbow colors.

  • Use SOCR_Data_2011_US_JobsRanking to draw a scatter plot of Overall_Score vs. Average_Income(USD). Specify the title, legend, and axis labels. Then try plot_ly() or qplot() to display Overall_Score vs. Average_Income(USD); color the blobs according to Stress_Level, size them according to Hiring_Potential, and label them by Job_Title.

  • Use the SOCR_TurkiyeStudentEvalData to generate trees and graphs using cutree() (use variables Q1 - Q28).
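For the Cauchy item in the list above, a minimal base-R sketch of the comparison might look like this. The sample size, seed, and number of histogram bins are arbitrary choices.

```r
set.seed(100)
n <- 1000
cauchy_x <- rcauchy(n)  # standard Cauchy: location 0, scale 1
normal_x <- rnorm(n)    # standard normal for comparison

# Side-by-side histograms, each on its own sample range
op <- par(mfrow = c(1, 2))
hist(cauchy_x, breaks = 50, main = "Cauchy(0, 1)", xlab = "x")
hist(normal_x, breaks = 50, main = "Normal(0, 1)", xlab = "x")
par(op)

# Numerical view of the tails
print(range(cauchy_x))
print(range(normal_x))
print(c(cauchy_sd = sd(cauchy_x), normal_sd = sd(normal_x)))
```

The Cauchy histogram is typically dominated by a few extreme draws, reflecting its heavy tails and undefined mean and variance. Changing the seed can move the sample range and standard deviation substantially, while the normal sample stays comparatively stable.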

9 Data reports

Use the California Ozone Data to generate a summary report. Make sure to include: a summary of every variable, the structure and types of the data elements, a discussion of the trend in the average ozone concentration, the differences in ozone concentration across regions (you may select year 2006), and the change of ozone concentration by season.
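The numerical core of such a report can be sketched as below. The data are simulated and the column names (Year, Month, Region, O3, Season) are hypothetical, so the real California Ozone Data will need its own variable names and region definitions.

```r
set.seed(7)

# Hypothetical stand-in for the California Ozone Data; column names are illustrative
ozone <- expand.grid(
  Year   = 2000:2010,
  Month  = 1:12,
  Region = c("North", "Central", "South"),
  stringsAsFactors = TRUE
)
# Simulated concentrations with a mid-year peak plus noise
ozone$O3 <- 0.03 + 0.01 * sin(pi * (ozone$Month - 1) / 11) +
  rnorm(nrow(ozone), sd = 0.003)

# Map months to seasons: Dec-Feb, Mar-May, Jun-Aug, Sep-Nov
ozone$Season <- cut(ozone$Month %% 12, breaks = c(-1, 2, 5, 8, 11),
                    labels = c("Winter", "Spring", "Summer", "Fall"))

# Structure and per-variable summaries
str(ozone)
print(summary(ozone))

# Trend of the yearly average, regional differences in 2006, and seasonal change
by_year        <- aggregate(O3 ~ Year, data = ozone, FUN = mean)
by_region_2006 <- aggregate(O3 ~ Region, data = subset(ozone, Year == 2006), FUN = mean)
by_season      <- aggregate(O3 ~ Season, data = ozone, FUN = mean)
print(by_year)
print(by_region_2006)
print(by_season)
```

These tables could then be embedded in an R Markdown document alongside plots and a short written interpretation.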
