1 Import, plot, summarize and save data

Load the following two datasets, generate summary statistics for all variables, plot some of the features (e.g., histograms, box plots, density plots, etc.) of several variables, and save the data locally as CSV files:
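
As a minimal sketch of this workflow, the following uses R's built-in mtcars data frame as a stand-in (an assumption; the assigned datasets would be loaded with read.csv instead) and writes the CSV to a temporary file rather than a fixed path.

<code>
  # Stand-in data (assumption): built-in mtcars; replace with the assigned datasets
  dat <- mtcars

  # Structure and summary statistics for all variables
  str(dat)
  summary(dat)

  # A few univariate plots for one variable
  hist(dat$mpg, main = "Histogram of mpg", xlab = "mpg")
  boxplot(dat$mpg, main = "Box plot of mpg", ylab = "mpg")
  plot(density(dat$mpg), main = "Density of mpg")

  # Save locally as CSV, then read the file back to confirm it was written
  out_file <- file.path(tempdir(), "mtcars_copy.csv")
  write.csv(dat, out_file, row.names = FALSE)
  dat_check <- read.csv(out_file)
  dim(dat_check)
</code>

Setting row.names = FALSE keeps the row labels out of the file, so the saved CSV has the same columns as the data frame.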

2 Explore some bivariate relations in the above data

Use the ALS case-study data or the SOCR Knee Pain Data to explore some bivariate relations (e.g., bivariate plots, correlations, and cross-tabulations).
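
The three kinds of bivariate summaries mentioned above might look roughly as follows. This is only a sketch: the built-in mtcars data and its columns wt, mpg, cyl, and gear are stand-ins (assumptions), not variables from the ALS or Knee Pain data.

<code>
  # Stand-in data (assumption): replace with the ALS or Knee Pain data frame
  dat <- mtcars

  # Bivariate scatter plot of two numeric variables
  plot(dat$wt, dat$mpg, xlab = "wt", ylab = "mpg", main = "mpg vs. wt")

  # Pearson correlation and the corresponding test
  r <- cor(dat$wt, dat$mpg, use = "complete.obs")
  r
  cor.test(dat$wt, dat$mpg)

  # Cross-tabulation of two categorical variables, with row proportions
  tab <- table(cyl = dat$cyl, gear = dat$gear)
  tab
  prop.table(tab, margin = 1)
</code>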

Use the 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015 data to show the relation between temperature and time. [Hint: use geom_line or geom_bar.]

Sample code for working with the temperature table:

<code>
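  library(ggplot2)  # provides ggplot(), geom_bar() and geom_line() used below

  # read.csv() already returns a data frame; the listed strings are treated as missing values
  Temp_Data <- read.csv("https://umich.instructure.com/files/706163/download?download_frd=1", header=TRUE, na.strings=c("", ".", "NA", "NR"))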
  summary(Temp_Data)
  # View(Temp_Data); colnames(Temp_Data)
  
  # Wide-to-Long transformation: reshape arguments include 
  # (1) list of variable names that define the different times or metrics (varying), 
  # (2) the name we wish to give the variable containing these values in our long dataset (v.names), 
  # (3) the name we wish to give the variable describing the different times or metrics (timevar), 
  # (4) the values this variable will have (times), and 
  # (5) the end format for the data (direction)
  # Before reshaping make sure all data types are the same as putting them in 1 column will
  # otherwise generate inconsistencies/errors
  colN <- colnames(Temp_Data[,-1])
  longTempData <- reshape(Temp_Data, varying = colN, v.names = "Temps", timevar="Months", times = colN, direction = "long")
 
  # View(longTempData)
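  # Bar chart of temperatures by month (bars stack the values across years)
  bar2 <- ggplot(longTempData, aes(x = Months, y = Temps, fill = Months)) +
    geom_bar(stat = "identity")
  print(bar2)

  # Bar chart of temperatures by year, with the months stacked within each year
  bar3 <- ggplot(longTempData, aes(x = Year, y = Temps, fill = Months)) +
    geom_bar(stat = "identity")
  print(bar3)
  
  # Line plot of temperature over time, one line per month
  # as.numeric() keeps decimal values, whereas as.integer() would truncate them
  p <- ggplot(longTempData, aes(x = Year, y = as.numeric(Temps), colour = Months)) +
    geom_line()
  print(p)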
</code>

3 Missing data

Introduce (artificially) some missing data, impute the missing values and examine the differences between the original, incomplete, and imputed datasets.
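
One way to approach this, sketched under simplifying assumptions: a single numeric variable (mtcars$mpg as a stand-in), values removed completely at random, and plain mean imputation. Mean imputation is used here only because it needs no extra packages; it is not necessarily the method intended by the assignment.

<code>
  set.seed(1234)

  # Stand-in variable (assumption): replace with a variable from the data under study
  original <- mtcars$mpg

  # Artificially remove 6 of the 32 values completely at random
  incomplete <- original
  miss_idx <- sample(length(incomplete), size = 6)
  incomplete[miss_idx] <- NA

  # Simple mean imputation (illustration only)
  imputed <- incomplete
  imputed[is.na(imputed)] <- mean(incomplete, na.rm = TRUE)

  # Compare the original, incomplete, and imputed versions
  comparison <- data.frame(
    version = c("original", "incomplete", "imputed"),
    n_obs   = c(sum(!is.na(original)), sum(!is.na(incomplete)), sum(!is.na(imputed))),
    mean    = c(mean(original), mean(incomplete, na.rm = TRUE), mean(imputed)),
    sd      = c(sd(original), sd(incomplete, na.rm = TRUE), sd(imputed))
  )
  comparison
</code>

Mean imputation leaves the mean of the observed values unchanged but shrinks the standard deviation, which is one reason model-based approaches such as multiple imputation (e.g., the mice package) are often preferred.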

4 Surface plots

Generate a surface plot for the (RF) Knee Pain data illustrating the 2D distribution of locations of the patient reported knee pain (use plot_ly and kernel density estimation).
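
A rough sketch of the kernel density surface follows. The pain locations are simulated here (an assumption) and stored in hypothetical columns x and y; with the real data, the rows for the RF view would be selected first. The density is estimated with kde2d from the MASS package, which is one common choice rather than the only one.

<code>
  library(MASS)    # kde2d()
  library(plotly)  # plot_ly(), add_surface()

  set.seed(1234)
  # Simulated pain locations (assumption): replace with the x, y coordinates of the RF view
  pain <- data.frame(x = rnorm(500, mean = 300, sd = 60),
                     y = rnorm(500, mean = 400, sd = 80))

  # 2D kernel density estimate evaluated on a 50 x 50 grid
  kd <- kde2d(pain$x, pain$y, n = 50)

  # kde2d() returns z with rows indexed by x, while plotly surfaces
  # expect rows indexed by y, hence the transpose
  fig <- plot_ly(x = kd$x, y = kd$y, z = t(kd$z)) %>% add_surface()
  fig
</code>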

5 Unbalanced designs

Rebalance the groups of ALS (training data) patients according to \(Age\gt 50\) and \(Age\leq 50\) using synthetic minority oversampling (SMOTE) to ensure approximately equal cohort sizes.
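
To make the idea concrete, here is a hand-rolled base-R sketch of the SMOTE interpolation step. The function smote_sketch, the simulated data frame als, and its Score column are all hypothetical names introduced for illustration; packaged implementations (for example, in the smotefamily package) would normally be used on the real ALS data.

<code>
  # Generate n_new synthetic rows by interpolating between a randomly chosen
  # minority case and one of its k nearest minority neighbours
  smote_sketch <- function(X, n_new, k = 5) {
    X <- as.matrix(X)
    n <- nrow(X)
    k <- min(k, n - 1)
    D <- as.matrix(dist(X))   # pairwise Euclidean distances
    diag(D) <- Inf            # a case is never its own neighbour
    synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X),
                    dimnames = list(NULL, colnames(X)))
    for (i in seq_len(n_new)) {
      a   <- sample.int(n, 1)                 # random minority case
      nn  <- order(D[a, ])[seq_len(k)]        # its k nearest neighbours
      b   <- nn[sample.int(length(nn), 1)]    # one neighbour at random
      gap <- runif(1)                         # position along the segment a -> b
      synth[i, ] <- X[a, ] + gap * (X[b, ] - X[a, ])
    }
    synth
  }

  set.seed(1234)
  # Simulated, unbalanced cohort (assumption): 40 patients with Age <= 50, 160 with Age > 50
  als <- data.frame(Age   = c(runif(40, 30, 50), runif(160, 50.1, 80)),
                    Score = rnorm(200, mean = 30, sd = 5))
  table(als$Age > 50)

  # Oversample the minority group (Age <= 50) until the cohorts match in size
  minority <- als[als$Age <= 50, ]
  n_needed <- sum(als$Age > 50) - nrow(minority)
  new_rows <- as.data.frame(smote_sketch(minority, n_new = n_needed, k = 5))

  balanced <- rbind(als, new_rows)
  table(balanced$Age > 50)
</code>

Because each synthetic row is a convex combination of two minority cases, its Age stays at or below 50. In practice the features would usually be rescaled before computing distances so that no single variable dominates the neighbour search.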

6 Organize a data report

Use the California Ozone Data to generate a summary report. Make sure to include a summary of every variable and the structure and type of the data elements; discuss the trend in the average ozone concentration; explore the differences in ozone concentration across regions (you may select year 2006); and explore the change in ozone concentration by season.
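
The skeleton of such a report could be sketched as below. The data frame is a simulated toy example, and the column names Region, Year, Month, and Ozone are hypothetical (assumptions); the actual California Ozone Data will have its own layout and may need reshaping first.

<code>
  set.seed(1234)
  # Toy stand-in (assumption): hypothetical columns Region, Year, Month, Ozone
  ozone <- expand.grid(Region = c("North", "Central", "South"),
                       Year = 2004:2006, Month = 1:12,
                       stringsAsFactors = FALSE)
  ozone$Ozone <- round(runif(nrow(ozone), 0.02, 0.08), 3)

  # Structure, data types, and summary of every variable
  str(ozone)
  summary(ozone)

  # Map month numbers (1-12) to seasons
  season_of <- function(m) {
    c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
      "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")[m]
  }
  ozone$Season <- season_of(ozone$Month)

  # Trend of the average concentration over the years
  by_year <- aggregate(Ozone ~ Year, data = ozone, FUN = mean)

  # Regional differences within a single year (2006)
  by_region_2006 <- aggregate(Ozone ~ Region, data = subset(ozone, Year == 2006), FUN = mean)

  # Change of concentration by season
  by_season <- aggregate(Ozone ~ Season, data = ozone, FUN = mean)

  by_year
  by_region_2006
  by_season
</code>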

Data Science and Predictive Analytics (UMich HS650)

Assignment 2: Managing Data in R