
Let’s start with simple data importing, plotting, summarizing, and exporting. Load the following two datasets, generate summary statistics for all variables, plot some of the features (e.g., using histograms, box plots, density plots, etc.), and save the data locally as CSV files:
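As a rough sketch of this import, summarize, plot, and export workflow, the block below uses the built-in `iris` data frame as a stand-in (an assumption; substitute the data frame returned by `read.csv()` for either course dataset). The file names under `tempdir()` are also illustrative.

```r
# Stand-in data: built-in iris (assumption; replace with the imported dataset)
dat <- iris

# Summary statistics for all variables
print(summary(dat))

# Plot one numeric feature three ways, written to a temporary PDF file
pdf(file.path(tempdir(), "feature_plots.pdf"))
hist(dat$Sepal.Length, main = "Histogram", xlab = "Sepal.Length")
boxplot(Sepal.Length ~ Species, data = dat, main = "Box plot by group")
plot(density(dat$Sepal.Length), main = "Kernel density estimate")
dev.off()

# Save the data locally as CSV and read it back to confirm the round trip
csv_path <- file.path(tempdir(), "dat_export.csv")
write.csv(dat, csv_path, row.names = FALSE)
dat_back <- read.csv(csv_path, stringsAsFactors = TRUE)
```

Passing `row.names = FALSE` keeps the exported file free of an extra index column, so the re-imported data has the same columns as the original.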

Next, we can explore some bivariate relations in the data. Use the ALS case-study data or the SOCR Knee Pain Data to explicate some bivariate relations (e.g., using bivariate plots, correlations, cross tables, etc.).
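A minimal sketch of the kinds of bivariate summaries mentioned here, using the built-in `mtcars` data as a stand-in (an assumption; the ALS and Knee Pain variables will have different names):

```r
# Stand-in data: built-in mtcars (assumption; replace with the case-study data)
dat <- mtcars

# Pearson correlation between two numeric variables, with a significance test
r  <- cor(dat$mpg, dat$wt)
ct <- cor.test(dat$mpg, dat$wt)
print(ct)

# Cross table (contingency table) of two categorical variables
xtab <- table(cyl = dat$cyl, gear = dat$gear)
print(xtab)

# Bivariate scatter plot with a least-squares line, written to a temporary PDF
pdf(file.path(tempdir(), "bivariate_plot.pdf"))
plot(dat$wt, dat$mpg, xlab = "wt", ylab = "mpg", main = "mpg vs. wt")
abline(lm(mpg ~ wt, data = dat), col = "red")
dev.off()
```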

Use the 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015 data to show the relations between *temperature* and *time*. [Hint: use `geom_line()`, `geom_bar()`, or `plot_ly()`.]

```
library(ggplot2)

Temp_Data <- as.data.frame(read.csv("https://umich.instructure.com/files/706163/download?download_frd=1",
                                    header = TRUE, na.strings = c("", ".", "NA", "NR")))
summary(Temp_Data)
# View(Temp_Data); colnames(Temp_Data)

# Wide-to-Long transformation: reshape arguments include
# (1) list of variable names that define the different times or metrics (varying),
# (2) the name we wish to give the variable containing these values in our long dataset (v.names),
# (3) the name we wish to give the variable describing the different times or metrics (timevar),
# (4) the values this variable will have (times), and
# (5) the end format for the data (direction)
# Before reshaping make sure all data types are the same as putting them in 1 column will
# otherwise generate inconsistencies/errors
colN <- colnames(Temp_Data[, -1])
longTempData <- reshape(Temp_Data, varying = colN, v.names = "Temps", timevar = "Months",
                        times = colN, direction = "long")

# View(longTempData)
bar2 <- ggplot(longTempData, aes(x = Months, y = Temps, fill = Months)) +
  geom_bar(stat = "identity")
print(bar2)

bar3 <- ggplot(longTempData, aes(x = Year, y = Temps, fill = Months)) +
  geom_bar(stat = "identity")
print(bar3)

p <- ggplot(longTempData, aes(x = Year, y = as.integer(Temps), color = Months)) +
  geom_line()
p
```

Using one of the above datasets, introduce (artificially) some missing data, impute the missing values, and examine the differences between the original, incomplete, and imputed datasets.
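One possible way to set this up is sketched below. It uses the built-in `iris` data as a stand-in and mean imputation as a deliberately simple method; both are assumptions made for illustration, and a model-based imputation approach could be swapped in at the marked step.

```r
set.seed(1234)

# Stand-in for one of the datasets above (assumption): numeric columns of iris
original <- iris[, 1:4]

# Artificially delete about 10% of the values in one column
incomplete <- original
miss_idx <- sample(nrow(incomplete), size = round(0.10 * nrow(incomplete)))
incomplete$Sepal.Length[miss_idx] <- NA

# Impute the missing entries with the mean of the observed values
# (simple illustrative choice; replace with a richer method if desired)
imputed <- incomplete
imputed$Sepal.Length[miss_idx] <- mean(incomplete$Sepal.Length, na.rm = TRUE)

# Compare the original, incomplete, and imputed versions of the column
comparison <- rbind(
  original   = c(mean = mean(original$Sepal.Length),
                 sd   = sd(original$Sepal.Length)),
  incomplete = c(mean = mean(incomplete$Sepal.Length, na.rm = TRUE),
                 sd   = sd(incomplete$Sepal.Length, na.rm = TRUE)),
  imputed    = c(mean = mean(imputed$Sepal.Length),
                 sd   = sd(imputed$Sepal.Length))
)
print(round(comparison, 3))

# Root-mean-square error of the imputed values against the known truth
rmse <- sqrt(mean((imputed$Sepal.Length[miss_idx] -
                   original$Sepal.Length[miss_idx])^2))
print(rmse)
```

Mean imputation leaves the column mean equal to the mean of the observed values but shrinks the standard deviation, which is one of the differences the comparison table makes visible.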

Generate a surface plot for the (`RF`) Knee Pain data illustrating the 2D distribution of locations of the patient-reported knee pain (use `plot_ly()` and kernel density estimation).
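The sketch below shows one way to combine a 2D kernel density estimate with a `plot_ly()` surface. The simulated `pain` data frame and its `x`, `y` columns are hypothetical stand-ins for the knee-pain location coordinates; `MASS::kde2d()` is one option for the density estimate, and the plotting step only runs if the plotly package is installed.

```r
library(MASS)   # provides kde2d()

set.seed(2024)
# Hypothetical stand-in for the pain-location coordinates (assumption)
pain <- data.frame(
  x = c(rnorm(200, mean = 300, sd = 40), rnorm(200, mean = 600, sd = 50)),
  y = c(rnorm(200, mean = 400, sd = 60), rnorm(200, mean = 500, sd = 40))
)

# 2D kernel density estimate on a 50 x 50 grid
kd <- kde2d(pain$x, pain$y, n = 50)

# Interactive surface: plotly expects z rows to index y, so transpose kd$z
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(x = kd$x, y = kd$y, z = t(kd$z), type = "surface")
  # print(fig)   # uncomment in an interactive session to display the surface
}
```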

Rebalance the groups of ALS (training data) patients according to \(Age\gt 50\) and \(Age\leq 50\) using synthetic minority oversampling (SMOTE) to ensure approximately equal cohort sizes.
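To illustrate the idea behind SMOTE, here is a hand-rolled base R sketch, not a packaged implementation. The helper `smote_sketch()`, the simulated data, and the `Age`, `Score`, and `Group` columns are all hypothetical; in practice a maintained SMOTE package would normally be applied to the ALS training data instead.

```r
set.seed(42)

# Hypothetical stand-in for the ALS training data (assumption)
n_major <- 200
n_minor <- 40
dat <- data.frame(
  Age   = c(runif(n_major, 51, 80), runif(n_minor, 20, 50)),
  Score = rnorm(n_major + n_minor, mean = 30, sd = 5)
)
dat$Group <- factor(ifelse(dat$Age > 50, "over50", "50orLess"))

# Minimal SMOTE-style oversampler: each synthetic case is a random point on
# the segment between a minority case and one of its k nearest minority neighbors
smote_sketch <- function(minority, n_new, k = 5) {
  X <- as.matrix(minority)
  D <- as.matrix(dist(scale(X)))   # distances on standardized features
  diag(D) <- Inf                   # a case is not its own neighbor
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                  dimnames = list(NULL, colnames(X)))
  for (s in seq_len(n_new)) {
    i   <- sample(nrow(X), 1)              # random minority case
    nn  <- order(D[i, ])[seq_len(k)]       # its k nearest neighbors
    j   <- nn[sample.int(k, 1)]            # pick one neighbor at random
    gap <- runif(1)                        # interpolation weight in [0, 1]
    synth[s, ] <- X[i, ] + gap * (X[j, ] - X[i, ])
  }
  as.data.frame(synth)
}

minority <- dat[dat$Group == "50orLess", c("Age", "Score")]
synth <- smote_sketch(minority, n_new = n_major - n_minor)
synth$Group <- factor("50orLess", levels = levels(dat$Group))

balanced <- rbind(dat, synth)
print(table(balanced$Group))
```

Because every synthetic case is interpolated between two cases with \(Age\leq 50\), the new rows stay inside the minority cohort by construction.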

Use the TBI dataset (CaseStudy11_TBI) to display some interactive (SVG) visualization plots - e.g., histograms, density plots, pie charts, heatmaps, barplots, and paired correlation plots.

Use the SOCR Resource Hierarchical Dataset (JSON) or the DSPA Dynamic Certificate Map (JSON) to generate tree/graph displays of the corresponding structural information contained in the JSON object.

```
library(jsonlite)
library(RCurl)
library(data.tree)

url <- "https://socr.umich.edu/html/navigators/D3/xml/SOCR_HyperTree.json"
raw_data <- getURL(url)
document <- fromJSON(raw_data)

tree <- Node$new(document$name)
for (i in seq_len(length(document))) {
  tree$AddChild(document$children$name[[i]])
  for (j in seq_len(length(document$children$children[[i]]))) {
    tree$children[[i]]$AddChild(document$children$children[[i]]$name[[j]])
    for (k in seq_len(length(document$children$children[[i]]$children[[j]]))) {
      tree$children[[i]]$children[[j]]$AddChild((document$children$children[[i]]$children[[j]]$name[[k]]))
    }
  }
}

suppressMessages(library(igraph))
plot(as.igraph(tree, directed = T, direction = "climb"))

suppressMessages(library(networkD3))
treenetwork <- ToDataFrameNetwork(tree, "name")
simpleNetwork(treenetwork, fontSize = 10)
```

Use the SOCR_OilGasData to generate three individual bar plots for Fossil Fuels, Nuclear Electric Power, and Renewable Energy, respectively (*Hint*: you may use `plot_ly()`, `ggplot()`, and `facet_grid()`). Include two lines for *Productions* and *Consumption*. The x-axis should be *time* (you may use year as numeric type directly); draw *Consumption* slightly wider and more noticeable (e.g., using magenta color).

Use the SOCR_OzoneData to generate a correlation plot with the variables "MTH_1", "MTH_2", …, "MTH_12". (*Hint*: you need to compute the correlation matrix first, then apply `corrplot()` or `plot_ly()`; try to use multiple chart types, "circle", "pie", "mixed", etc.)

Use the SOCR_CA_OzoneData to generate a 3D surface plot (using the variables *Longitude*, *Latitude*, and *O3*).

Generate random numbers from the `Cauchy` distribution. Draw a histogram and compare it with the histogram of a normal distribution. What do you find? You may try different seeds to re-generate the Cauchy random numbers.

Use the SOCR_Data_PD_BiomedBigMetadata to generate a *heatplot*. Set `RowSideColors` and `ColSideColors` and use rainbow colors.

Use SOCR_Data_2011_US_JobsRanking to draw a scatter plot of *Overall_Score* vs. *Average_Income(USD)*. Specify title, legend, and axes labels. Then try `plot_ly()` or `qplot()` to display *Overall_Score* vs. *Average_Income(USD)*; color the blobs according to *Stress_Level*, size them according to *Hiring_Potential*, and let the blob labels represent *Job_Title*.

Use the SOCR_TurkiyeStudentEvalData to generate trees and graphs using `cutree()` (use variables *Q1*-*Q28*).
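For the `cutree()` exercise, a minimal sketch of the hierarchical clustering workflow might look like the following. The simulated `survey` data frame is a hypothetical stand-in for the *Q1*-*Q28* items, and the Ward linkage and the choice of three clusters are assumptions made only for illustration.

```r
set.seed(7)

# Hypothetical stand-in for the Q1-Q28 survey items (Likert scores 1-5)
n_students <- 60
survey <- as.data.frame(
  matrix(sample(1:5, n_students * 28, replace = TRUE), nrow = n_students)
)
names(survey) <- paste0("Q", 1:28)

# Hierarchical clustering on standardized items
d  <- dist(scale(survey))
hc <- hclust(d, method = "ward.D2")

# Cut the tree into k groups and tabulate the cluster sizes
clusters <- cutree(hc, k = 3)
print(table(clusters))

# Dendrogram with the k groups outlined, written to a temporary PDF file
pdf(file.path(tempdir(), "dendrogram.pdf"))
plot(hc, labels = FALSE, main = "Hierarchical clustering of Q1-Q28")
rect.hclust(hc, k = 3, border = "red")
dev.off()
```

Trying several values of `k` (or cutting by height with the `h` argument of `cutree()`) shows how the grouping changes as the tree is cut at different levels.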

Use the California Ozone Data to generate a summary report. Make sure to include: summary for every variable, structure and type of data elements, discuss the tendency of the ozone average concentration, explore the differences of the ozone concentration for separate regions (you may select year 2006), explore the change of ozone concentration by season.