Load the following two datasets, generate summary statistics for all variables, plot some of the features (e.g., histograms, box plots, density plots, etc.) of some variables, and save the data locally as CSV files:
Use the ALS case-study data and the long-format SOCR Parkinson’s Disease data (extract rows with Time=0) to explore some bivariate relations (e.g., bivariate plots, correlations, tables, cross-tabulations).
Use the 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015 data to show the relation between temperature and time. [Hint: use geom_line and geom_bar.]
Introduce (artificially) some missing data, impute the missing values, and compare the summary statistics of the original, incomplete, and imputed data.
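The missing-data workflow above can be sketched as follows. This is only a minimal, language-agnostic illustration in Python using the standard library. The data are synthetic (not drawn from any SOCR dataset), the 20% missingness rate is an arbitrary assumption, and simple mean imputation stands in for whichever imputation method the exercise calls for.

```python
import random
import statistics

random.seed(0)

# Hypothetical complete variable (synthetic values, not from any SOCR dataset).
original = [random.gauss(50, 10) for _ in range(200)]

# Artificially remove about 20% of the values completely at random.
incomplete = [x if random.random() > 0.2 else None for x in original]

# Mean imputation: replace each missing entry with the mean of the observed values.
observed = [x for x in incomplete if x is not None]
observed_mean = statistics.mean(observed)
imputed = [observed_mean if x is None else x for x in incomplete]


def describe(values):
    """Return (n, mean, standard deviation) of the non-missing values."""
    vals = [v for v in values if v is not None]
    return len(vals), statistics.mean(vals), statistics.stdev(vals)


# Compare the three versions of the variable side by side.
for name, data in [("original", original),
                   ("incomplete", incomplete),
                   ("imputed", imputed)]:
    n, mean, sd = describe(data)
    print(f"{name:>10}: n={n:3d}  mean={mean:6.2f}  sd={sd:5.2f}")
```

Mean imputation preserves the observed mean but shrinks the standard deviation, which is one of the differences the comparison is meant to surface. Model-based or multiple imputation would typically distort the spread less.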
Generate a surface plot for the SOCR Knee Pain Data illustrating the 2D distribution of locations of the patient reported knee pain (use plotly and kernel density estimation).
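As a rough illustration of the kernel density step, the sketch below evaluates a 2D Gaussian product-kernel density estimate on a regular grid using only the Python standard library. The point coordinates are synthetic placeholders for the knee-pain locations, and the bandwidth `h`, the grid limits, and the helper names `kde2d` and `linspace` are assumptions made for this example.

```python
import math
import random

random.seed(1)

# Hypothetical (x, y) pain locations; synthetic stand-ins for the real coordinates.
points = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(150)]


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def kde2d(points, grid_x, grid_y, h):
    """Evaluate a 2D Gaussian kernel density estimate on a grid.

    Returns z with z[i][j] = estimated density at (grid_x[j], grid_y[i]).
    """
    norm = 1.0 / (len(points) * 2.0 * math.pi * h * h)
    z = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            total = sum(
                math.exp(-((gx - px) ** 2 + (gy - py) ** 2) / (2.0 * h * h))
                for px, py in points
            )
            row.append(norm * total)
        z.append(row)
    return z


grid_x = linspace(-6.0, 6.0, 61)
grid_y = linspace(-6.0, 6.0, 61)
z = kde2d(points, grid_x, grid_y, h=0.5)

# Sanity check: the density should integrate to roughly 1 over the grid.
cell_area = (grid_x[1] - grid_x[0]) * (grid_y[1] - grid_y[0])
print(round(sum(map(sum, z)) * cell_area, 3))
```

The resulting `z` matrix, together with `grid_x` and `grid_y`, has the layout that plotly surface traces expect (rows indexed by y, columns by x), so it can be handed to a surface plot directly.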
Rebalance the Parkinson’s Disease data (extract rows with Time=0) according to disease (SWEDD OR PD) and health (HC) using synthetic minority oversampling (SMOTE) to ensure approximately equal cohort sizes. (Note: 1 must be set as the minority class.)
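The core idea of SMOTE is to create new minority-class cases by interpolating between an existing minority case and one of its nearest minority neighbors. The sketch below is a simplified standard-library illustration of that idea, not a packaged SMOTE implementation. The two-feature synthetic cohorts, the cohort sizes, the choice `k=5`, and the function name `smote` are all assumptions for the example.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbors of the chosen base point.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbor)))
    return synthetic


rng = random.Random(42)

# Hypothetical two-feature cohorts: label 0 is the majority, label 1 the minority.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(20)]

# Oversample the minority class until both cohorts are the same size.
new_points = smote(minority, len(majority) - len(minority), k=5, rng=rng)
balanced_minority = minority + new_points

print(len(majority), len(balanced_minority))
```

Because every synthetic case lies on a segment between two real minority cases, the oversampled cohort stays within the range of the original minority data rather than duplicating rows exactly.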
Use the California Ozone Data to generate a summary report. Make sure to include: a summary of every variable, the structure of the data, appropriate data type conversions (if needed), a discussion of the trend in average ozone concentration, based on the yearly average for each location, an exploration of how ozone concentration differs between areas, and an exploration of how ozone concentration changes with the seasons.