Load the following two datasets, generate summary statistics for all variables, plot some features of several variables (e.g., histograms, box plots, density plots), and save the data locally as CSV files:
Use ALS case-study data or SOCR Knee Pain Data to explore some bivariate relations (e.g., bivariate plots, correlations, cross-tabulations).
Use the 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015 data to show the relation between temperature and time. [Hint: use geom_line or geom_bar.]
<code>
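library(ggplot2)  # provides ggplot(), geom_bar(), and geom_line() used below
# read.csv() already returns a data frame; the listed strings are treated as missing values
Temp_Data <- read.csv("https://umich.instructure.com/files/706163/download?download_frd=1", header=TRUE, na.strings=c("", ".", "NA", "NR"))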
summary(Temp_Data)
# View(Temp_Data); colnames(Temp_Data)
# Wide-to-Long transformation: reshape arguments include
# (1) list of variable names that define the different times or metrics (varying),
# (2) the name we wish to give the variable containing these values in our long dataset (v.names),
# (3) the name we wish to give the variable describing the different times or metrics (timevar),
# (4) the values this variable will have (times), and
# (5) the end format for the data (direction)
# Before reshaping make sure all data types are the same as putting them in 1 column will
# otherwise generate inconsistencies/errors
colN <- colnames(Temp_Data[,-1])
longTempData <- reshape(Temp_Data, varying = colN, v.names = "Temps", timevar="Months", times = colN, direction = "long")
# View(longTempData)
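# Stacked bars: temperatures accumulated within each month across all years
bar2 <- ggplot(longTempData, aes(x = Months, y = Temps, fill = Months)) +
  geom_bar(stat = "identity")
print(bar2)
# Stacked bars: the monthly temperatures stacked within each year
bar3 <- ggplot(longTempData, aes(x = Year, y = Temps, fill = Months)) +
  geom_bar(stat = "identity")
print(bar3)
# One line per month over time; as.numeric() keeps fractional temperatures,
# whereas as.integer() would truncate them
p <- ggplot(longTempData, aes(x = Year, y = as.numeric(Temps), colour = Months)) +
  geom_line()
print(p)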
</code>
Introduce (artificially) some missing data, impute the missing values and examine the differences between the original, incomplete, and imputed datasets.
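One possible way to approach the missing-data exercise is sketched below. It is only an illustration under stated assumptions: it uses the built-in `iris` measurements as a stand-in for the course data, removes 15 values per column completely at random, and fills them with simple column-mean imputation (a deliberately naive choice; model-based or multiple imputation would be a reasonable alternative). The object names `original`, `incomplete`, and `imputed` are hypothetical.

```r
set.seed(1234)

# Stand-in data (assumption): the four numeric columns of the built-in iris data
original <- iris[, 1:4]

# Artificially remove 15 values per column, completely at random
incomplete <- original
for (v in names(incomplete)) {
  idx <- sample(nrow(incomplete), 15)
  incomplete[idx, v] <- NA
}

# Naive imputation: replace each missing value with the observed column mean
imputed <- incomplete
for (v in names(imputed)) {
  miss <- is.na(imputed[[v]])
  imputed[[v]][miss] <- mean(imputed[[v]], na.rm = TRUE)
}

# Compare the spread of each variable across the three versions of the data
compare <- rbind(
  original   = sapply(original, sd),
  incomplete = sapply(incomplete, sd, na.rm = TRUE),
  imputed    = sapply(imputed, sd)
)
print(round(compare, 3))
```

Mean imputation leaves the column means of the observed data unchanged but shrinks the standard deviations, which is one of the differences worth examining when comparing the three datasets.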
Generate a surface plot for the (RF) Knee Pain data illustrating the 2D distribution of the locations of patient-reported knee pain (use plot_ly and kernel density estimation).
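The combination of kernel density estimation and `plot_ly` mentioned above could look roughly like the following sketch. It assumes the `MASS` and `plotly` packages are installed, and it uses simulated coordinates in a hypothetical data frame `pain` (columns `x` and `y`) instead of the actual Knee Pain data, whose column names are not given here.

```r
library(MASS)    # kde2d(): two-dimensional kernel density estimate
library(plotly)  # plot_ly(): interactive surface plot

set.seed(42)
# Hypothetical stand-in for the RF-view pain locations (x, y image coordinates)
pain <- data.frame(
  x = c(rnorm(200, mean = 300, sd = 40), rnorm(100, mean = 450, sd = 30)),
  y = c(rnorm(200, mean = 500, sd = 50), rnorm(100, mean = 350, sd = 40))
)

# Estimate the 2D density on a 50 x 50 grid
kd <- kde2d(pain$x, pain$y, n = 50)

# kde2d() stores z[i, j] at (x[i], y[j]); plotly surfaces expect rows to
# index y and columns to index x, hence the transpose
fig <- plot_ly(x = kd$x, y = kd$y, z = t(kd$z), type = "surface")

# Display the figure only in an interactive session
if (interactive()) print(fig)
```

With the real data, one would first subset the rows belonging to the RF view and pass the corresponding location columns to `kde2d()`.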
Rebalance the groups of ALS (training data) patients according to \(Age\gt 50\) and \(Age\leq 50\) using synthetic minority oversampling (SMOTE) to ensure approximately equal cohort sizes.
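To make the idea of SMOTE concrete, here is a minimal hand-rolled sketch in base R. It is not the implementation from any particular package: the function `smote_sketch()`, the simulated data frame `als_like`, and its columns `Feature1` and `Feature2` are all hypothetical, and the simulated ages are assumed to make the \(Age\leq 50\) group the minority. Each synthetic case is a random interpolation between a minority case and one of its `k` nearest minority neighbors.

```r
set.seed(2024)

# Hypothetical stand-in for the ALS training data (numeric features only)
n <- 300
als_like <- data.frame(
  Age      = round(rnorm(n, mean = 58, sd = 9)),
  Feature1 = rnorm(n),
  Feature2 = rnorm(n)
)

# Minimal SMOTE-style oversampling: interpolate between a randomly chosen
# minority case and one of its k nearest minority neighbors
smote_sketch <- function(minority, n_new, k = 5) {
  X <- as.matrix(minority)
  D <- as.matrix(dist(scale(X)))  # pairwise distances on standardized features
  diag(D) <- Inf                  # a case is never its own neighbor
  synth <- matrix(NA_real_, n_new, ncol(X), dimnames = list(NULL, colnames(X)))
  for (s in seq_len(n_new)) {
    i   <- sample(nrow(X), 1)
    nn  <- order(D[i, ])[seq_len(k)]
    j   <- nn[sample.int(k, 1)]
    gap <- runif(1)               # position along the segment from case i to case j
    synth[s, ] <- X[i, ] + gap * (X[j, ] - X[i, ])
  }
  as.data.frame(synth)
}

minority <- als_like[als_like$Age <= 50, ]
majority <- als_like[als_like$Age > 50, ]
n_new    <- nrow(majority) - nrow(minority)  # synthetic cases needed for balance

balanced <- rbind(majority, minority, smote_sketch(minority, n_new))
balanced$Over50 <- balanced$Age > 50
print(table(balanced$Over50))
```

Because every synthetic case lies between two minority cases, its age stays at or below 50, so the two cohorts end up the same size. Categorical variables would need separate handling, which this sketch does not attempt.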
Use the California Ozone Data to generate a summary report. Make sure to include a summary of every variable and the structure and type of the data elements, discuss the trend in the average ozone concentration, explore the differences in ozone concentration across regions (you may select year 2006), and explore how the ozone concentration changes by season.