We start the DSPA journey with an overview of the mission and objectives of this textbook. Some early examples of driving motivational problems and challenges provide context into the common characteristics of big (biomedical and health) data. We will define data science and predictive analytics and emphasize the importance of their ethical, responsible, and reproducible practical use. This chapter also covers the foundations of R, contrasts R against other languages and computational data science platforms and introduces basic functions and data objects, formats, and simulation.
Let’s start with a quick overview illustrating some common data science challenges, qualitative descriptions of the fundamental principles, and awareness about the power and potential pitfalls of modern data-driven scientific inquiry.
The second edition of this textbook (DSPA2) is based on the HS650: Data Science and Predictive Analytics (DSPA) course I teach at the University of Michigan and on the first DSPA edition. These materials collectively aim to provide learners with a deep understanding of the challenges, an appreciation of the enormous opportunities, and a solid methodological foundation for designing, collecting, managing, processing, interrogating, analyzing, and interpreting complex health and biomedical data. Readers who finish this course of training and successfully complete the examples and assignments included in the book will gain unique skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.
Before diving into the mathematical algorithms, statistical computing methods, software tools, and health analytics covered in the remaining chapters, we will discuss several driving motivational problems. These will ground all the subsequent scientific discussions, data modeling, and computational approaches.
For each of the studies below, we illustrate several clinically-relevant scientific questions, identify appropriate data sources, describe the types of data elements, and pinpoint various complexity challenges.
Data Source | Sample Size/Data Type | Summary |
---|---|---|
ADNI Archive | Clinical data: demographics, clinical assessments, cognitive assessments; Imaging data: sMRI, fMRI, DTI, PiB/FDG PET; Genetics data: Illumina SNP genotyping; Chemical biomarkers: lab tests, proteomics. Each data modality comes with a different number of cohorts. Generally, \(200\le N \le 1200\). For instance, previously conducted ADNI studies with N>500 [doi: 10.3233/JAD-150335, doi: 10.1111/jon.12252, doi: 10.3389/fninf.2014.00041] | ADNI provides interesting data modalities and multiple cohorts (e.g., early-onset, mild, and severe dementia, controls) that allow effective model training and validation |
Data Source | Sample Size/Data Type | Summary |
---|---|---|
PPMI Archive | Demographics: age, medical history, sex; Clinical data: physical, verbal learning and language, neurological and olfactory tests (University of Pennsylvania Smell Identification Test, UPSIT), vital signs, MDS-UPDRS scores (Movement Disorder Society-Unified Parkinson’s Disease Rating Scale), ADL (activities of daily living), Montreal Cognitive Assessment (MoCA), Geriatric Depression Scale (GDS-15); Imaging data: structural MRI; Genetics data: Illumina ImmunoChip (196,524 variants) and NeuroX (covering 240,000 exonic variants) with 100% sample success rate and 98.7% genotype success rate, genotyped for APOE e2/e3/e4. Three cohorts of subjects: Group 1 = {de novo PD subjects with a diagnosis of PD for two years or less who are not taking PD medications}, N1 = 263; Group 2 = {PD subjects with Scans without Evidence of a Dopaminergic Deficit (SWEDD)}, N2 = 40; Group 3 = {Control subjects without PD who are 30 years or older and who do not have a first degree blood relative with PD}, N3 = 127 | The longitudinal PPMI dataset including clinical, biological and imaging data (screening, baseline, 12, 24, and 48 month follow-ups) may be used to conduct model-based predictions as well as model-free classification and forecasting analyses |
Data Source | Sample Size/Data Type | Summary |
---|---|---|
MAWS Data / UMHS EHR / WHO AWS Data | Scores from Alcohol Use Disorders Identification Test-Consumption (AUDIT-C) [49], including dichotomous variables for any current alcohol use (AUDIT-C, question 1), total AUDIT-C score > 8, and any positive history of alcohol withdrawal syndrome (HAWS) | ~1,000 positive cases per year among 10,000 adult medical inpatients, % RAWS screens completed, % positive screens, % entered into MAWS protocol who receive pharmacological treatment for AWS, % entered into MAWS protocol without a completed RAWS screen |
Data Source | Sample Size/Data Type | Summary |
---|---|---|
ProAct Archive | Over 100 clinical variables are recorded for all subjects, including: Demographics: age, race, medical history, sex; Clinical data: Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS), adverse events, onset_delta, onset_site, drug use (riluzole). The PRO-ACT training dataset contains clinical and lab test information for 8,635 patients. Information on 2,424 study subjects with valid gold standard ALSFRS slopes will be used in our processing, modeling and analysis | The time points for all longitudinally varying data elements will be aggregated into signature vectors. This will facilitate the modeling and prediction of ALSFRS slope changes over the first three months (baseline to month 3) |
The SOCR Brain Visualization App has preloaded sMRI, ROI labels, and fiber tract models for a normal brain. It also allows users to drag-and-drop their data into the browser to visualize and navigate through the stereotactic data (including imaging, parcellations, and tractography).
A recent study of Structural Neuroimaging in Alzheimer’s Disease illustrates the Big Data challenges in modeling complex neuroscientific data. Specifically, 808 ADNI subjects were divided into 3 groups: 200 subjects with Alzheimer’s disease (AD), 383 subjects with mild cognitive impairment (MCI), and 225 asymptomatic normal controls (NC). Their sMRI data were parcellated using BrainParser, and the 80 most important neuroimaging biomarkers were extracted using the global shape analysis Pipeline workflow. Using a pipeline implementation of Plink, the authors obtained 80 SNPs highly-associated with the imaging biomarkers. The authors observed significant correlations between genetic and neuroimaging phenotypes in the 808 ADNI subjects. These results suggest that differences between AD, MCI, and NC cohorts may be examined by using powerful joint models of morphometric, imaging and genotypic data.
This HHMI disease detective activity illustrates genetic analysis of sequences of Ebola viruses isolated from patients in Sierra Leone during the Ebola outbreak of 2013-2016. Scientists track the spread of the virus using the fact that most of the genome is identical among individuals of the same species, most similar for genetically related individuals, and increasingly different as the hereditary distance increases. DNA profiling capitalizes on these genetic differences, in particular in regions of noncoding DNA, which is DNA that is not transcribed and translated into a protein. Variations in noncoding regions have less impact on an individual’s traits, and such changes in noncoding regions may be immune to natural selection. DNA variations called short tandem repeats (STRs) consist of short repeat units, typically 2-5 bases long, that repeat multiple times. The repeat units are found at different locations, or loci, throughout the genome. Every STR has multiple alleles. These allele variants are defined by the number of repeat units present or by the length of the repeat sequence. STRs are surrounded by non-variable segments of DNA known as flanking regions. The STR allele in the figure below could be denoted by “6”, because the repeat unit (GATA) repeats 6 times, or as 70 base pairs (bps), because its total length is 70 bases, including the starting/ending flanking regions. Different alleles of the same STR may correspond to different numbers of GATA repeats, with the same flanking regions.
Whole-genome and exome sequencing provide essential clues for identifying genes responsible for simple Mendelian inherited disorders. The methods proposed in this paper can also be applied to complex disorders based on population genetics. Next generation sequencing (NGS) technologies rely on bioinformatics resources to analyze the dense and complex sequence data. The Graphical Pipeline for Computational Genomics (GPCG) performs the computational steps required to analyze NGS data. The GPCG implements flexible workflows for basic sequence alignment, sequence data quality control, single nucleotide polymorphism analysis, copy number variant identification, annotation, and visualization of results. Applications of NGS analysis provide clinical utility for identifying miRNA signatures in diseases. Enabling hypothesis testing about the functional role of variants in the human genome will help pinpoint the genetic risk factors of many diseases (e.g., neuropsychiatric disorders).
A computational infrastructure for high-throughput neuroimaging-genetics (doi: 10.3389/fninf.2014.00041) facilitates the data aggregation, harmonization, processing and interpretation of multisource imaging, genetics, clinical and cognitive data. A unique feature of this architecture is the graphical user interface to the Pipeline environment. Through its client-server architecture, the Pipeline environment provides a graphical user interface for designing, executing, monitoring, validating, and disseminating complex protocols that utilize diverse suites of software tools and web-services. These pipeline workflows are represented as portable XML objects, which transfer the execution instructions and user specifications from the client user machine to remote pipeline servers for distributed computing. Using Alzheimer’s and Parkinson’s data, this study provides examples of translational applications using this infrastructure.
Software developments, student training, utilization of Cloud or IoT service platforms, and methodological advances associated with Big Data Discovery Science all present exciting opportunities for learners, educators, researchers, practitioners and policy makers alike. A review of many biomedical, health informatics and clinical studies suggests that there are indeed common characteristics of complex big data challenges. For instance, imagine analyzing observational data of thousands of Parkinson’s disease patients based on tens-of-thousands of signature biomarkers derived from multi-source imaging, genetics, clinical, physiologic, phenomics and demographic data elements. IBM defined the qualitative characteristics of Big Data as the 4 V’s: Volume, Variety, Velocity and Veracity (there are additional V-qualifiers that can be added).
More recently (PMID:26998309) we defined a constructive characterization of Big Data that clearly identifies the methodological gaps and necessary tools:
BD Dimensions | Tools |
---|---|
Size | Harvesting and management of vast amounts of data |
Complexity | Wranglers for dealing with heterogeneous data |
Incongruency | Tools for data harmonization and aggregation |
Multi-source | Transfer and joint modeling of disparate elements |
Multi-scale | Macro to meso to micro scale observations |
Time | Techniques accounting for longitudinal patterns in the data |
Incomplete | Reliable management of missing data |
Data science is an emerging new field that (1) is extremely transdisciplinary - bridging between the theoretical, computational, experimental, and biosocial areas, (2) deals with enormous amounts of complex, incongruent and dynamic data from multiple sources, and (3) aims to develop algorithms, methods, tools and services capable of ingesting such datasets and generating semi-automated decision support systems. The latter can mine the data for patterns or motifs, predict expected outcomes, suggest clustering or labeling of retrospective or prospective observations, compute data signatures or fingerprints, extract valuable information, and offer evidence-based actionable knowledge. Data science techniques often involve data manipulation (wrangling), data harmonization and aggregation, exploratory or confirmatory data analyses, predictive analytics, validation and fine-tuning.
Predictive analytics is the process of utilizing advanced mathematical formulations, powerful statistical computing algorithms, efficient software tools and services to represent, interrogate and interpret complex data. As its name suggests, a core aim of predictive analytics is to forecast trends, predict patterns in the data, or prognosticate the process behavior either within the range or outside the range of the observed data (e.g., in the future, or at locations where data may not be available). In this context, process refers to a natural phenomenon that is being investigated by examining proxy data. Presumably, by collecting and exploring the intrinsic data characteristics, we can track the behavior and unravel the underlying mechanism of the system.
The fundamental goal of predictive analytics is to identify relationships, associations, arrangements or motifs in the dataset, in terms of space, time, features (variables) that may reduce the dimensionality of the data, i.e., its complexity. Using these process characteristics, predictive analytics may predict unknown outcomes, produce estimations of likelihoods or parameters, generate classification labels, or contribute other aggregate or individualized forecasts. We will discuss how the outcomes of these predictive analytics can be refined, assessed and compared, e.g., between alternative methods. The underlying assumptions of the specific predictive analytics technique determine its usability, affect the expected accuracy, and guide the (human) actions resulting from the (machine) forecasts. In this textbook, we will discuss supervised and unsupervised, model-based and model-free, classification and regression, as well as deterministic, stochastic, classical and machine learning-based techniques for predictive analytics. The type of the expected outcome (e.g., binary, polytomous, probability, scalar, vector, tensor, etc.) determines if the predictive analytics strategy provides prediction, forecasting, labeling, likelihoods, grouping or motifs.
The Pipeline Environment provides a large tool chest of software and services that can be integrated, merged and processed. The Pipeline workflow library and the workflow miner illustrate much of the functionality that is available. Java-based and HTML5 webapp graphical user interfaces provide access to a powerful 4,000 core grid compute server.
There are many sources of data available on the Internet. A number of them provide open-access to the data based on FAIR (Findable, Accessible, Interoperable, Reusable) principles. Below are examples of open-access data sources that can be used to test the techniques presented in the textbook. We demonstrate the tasks of retrieval, manipulation, processing, analytics and visualization using example datasets from these archives.
In addition to being data-literate and skilled artisans, all data scientists, quantitative analysts, and informaticians need to be aware of certain global societal norms and exhibit professional work ethics that ensure the appropriate use, result reproducibility, unbiased reporting, as well as expected and unanticipated interpretations of data, analytical methods, and novel technologies. Examples of this basic etiquette include (1) promoting FAIR (findable, accessible, interoperable, reusable, and reproducible) resource sharing principles, (2) ethical conduct of research, (3) balancing and explicating potential benefits and probable detriments of findings, (4) awareness of relevant legislation, codes of practice, and respect for privacy, security and confidentiality of sensitive data, and (5) documenting provenance, attributions and longevity of resources.
The FAIR (findable, accessible, interoperable, reusable, and reproducible) resource sharing principles provide guiding essentials for appropriate development, deployment, use, management, and stewardship of data, techniques, tools, services, and information dissemination.
Ethical data science and predictive analytics research demands responsible scientific conduct and integrity in all aspects of practical scientific investigation and discovery. All analysts should be aware of, and practice, established professional norms and ethical principles in planning, designing, implementing, executing, and assessing activities related to data-driven scientific research.
Evidence and data-driven discovery is often bound to generate both questions and answers, some of which may be unexpected, undesired, or detrimental. Quantitative analysts are responsible for validating all their results, as well as for balancing and explicating all potential benefits and enumerating all probable detriments of positive and negative findings.
Decisions on security, privacy, and confidentiality of sensitive data collected and managed by data governors, manipulated by quantitative analysts, and interpreted by policy and decision makers are not trivial. The large number of people, devices, algorithms, and services that are within arm's length of the raw data suggests a multi-tier approach for sensible protection and security of sensitive information like personal health, biometric, genetic, and proprietary data. Data security, privacy, and confidentiality of sensitive information should always be protected throughout the data life cycle. This may require preemptive, on-going, and post-hoc analyses to identify and patch potential vulnerabilities. Often, there may be tradeoffs between data value benefits and potential risks of blind automated information interpretation. Neither of the extremes is practical, sustainable, or effective.
The digitalization of human experiences, the growth of data science, and the promise of artificial intelligence have led to enormous societal investments, excitements, and anxieties. There is a strong sentiment and anticipation that the vast amounts of available information will ubiquitously translate into quick insights, useful predictions, optimal risk-estimations, and cost-effective decisions. Proper recording of the data, algorithmic, scientific, computational, and human factors involved in these forecasts represents a critical component of data science and its essential contributions to knowledge.
Each of the complementary spaces of appropriate and inappropriate use of data science and predictive analytics resources is vast. The sections above outlined some of the guiding principles for ethical, respectful, appropriate, and responsible data-driven analytics and factual reporting. Below are some examples illustrating inappropriate use of data, resources, information, or knowledge to intentionally or unintentionally gain unfair advantage, spread fake news, misrepresent findings, or cause detrimental socioeconomic effects.
The heterogeneity of data science makes it difficult to identify a complete and exact list of prerequisites necessary to succeed in learning all the appropriate methods. However, the reader is strongly encouraged to glance over the preliminary prerequisites, the self-assessment pretest and remediation materials, and the outcome competencies. Throughout this journey, it is useful to remember the following points:
The R code (R markdown) for all examples and demonstrations presented in the textbook is available as an electronic supplement. Readers are encouraged to contact us (DSPA.info @ umich.edu) to correct, expand, or polish the resources, accordingly. If you have alternative ideas, suggestions for improvements, optimized code, interesting data and case-studies, or any other refinements, please send these along, as well. All suggestions and critiques will be carefully reviewed and potentially incorporated in revisions and new editions.

In this section, we will start with the foundations of R programming for visualization, statistical computing and scientific inference. Specifically, we will (1) discuss the rationale for selecting R as a computational platform for all DSPA demonstrations; (2) present the basics of installing shell-based R and the RStudio user-interface; (3) show some simple R commands and scripts (e.g., translating long-to-wide data format, data simulation, data stratification and subsetting); (4) introduce variable types and their manipulation; (5) demonstrate simple mathematical functions, statistics, and matrix operators; (6) explore simple data visualization; and (7) introduce optimization and model fitting. The chapter appendix includes references to R introductory and advanced resources, as well as a primer on debugging.
Why R?

There are many different classes of software that can be used for data interrogation, modeling, inference and statistical computing. Among these are R, Python, Java, C/C++, Perl, and many others. The table below compares R to various other statistical analysis software packages, and a more detailed comparison is available online.
Statistical Software | Advantages | Disadvantages |
---|---|---|
R | R is actively maintained (\(\ge 100,000\) developers, \(\ge 15K\) packages). Excellent connectivity to various types of data and other systems. Versatile for solving problems in many domains. Free, open-source code; anybody can access, review, and extend the source code. R is very stable and reliable. If you change or redistribute the R source code, you have to make those changes available for anybody else to use. R runs anywhere (platform agnostic). Extensibility: R supports extensions, e.g., for data manipulation, statistical modeling, and graphics. An active and engaged community supports R, with unparalleled question-and-answer (Q&A) websites. R connects with other languages (Java/C/JavaScript/Python/Fortran), database systems, and other programs, such as SAS and SPSS. Other packages have add-ons to connect with R: SPSS has incorporated a link to R, and SAS has protocols to move data and graphics between the two packages. | Mostly a scripting language. Steeper learning curve |
SAS | Handles large datasets. Commonly used in business and government | Expensive/proprietary. Somewhat dated programming language |
Stata | Easy statistical analyses | Mostly classical statistics |
SPSS | Appropriate for beginners. Simple interfaces | Weak in more cutting-edge statistical procedures; lacking in robust methods and survey methods |
There exist substantial differences between different types of computational environments for data wrangling, preprocessing, analytics, visualization and interpretation. The table below provides some rough comparisons between some of the most popular data computational platforms. With the exception of ComputeTime, higher scores represent better performance within the specific category. Note that these are just estimates and the scales are not normalized between categories.
Language | OpenSource | Speed | ComputeTime | LibraryExtent | EaseOfEntry | Costs | Interoperability |
---|---|---|---|---|---|---|---|
Python | Yes | 16 | 62 | 80 | 85 | 10 | 90 |
Julia | Yes | 2941 | 0.34 | 100 | 30 | 10 | 90 |
R | Yes | 1 | 745 | 100 | 80 | 15 | 90 |
IDL | No | 67 | 14.77 | 50 | 88 | 100 | 20 |
Matlab | No | 147 | 6.8 | 75 | 95 | 100 | 20 |
Scala | Yes | 1428 | 0.7 | 50 | 30 | 20 | 40 |
C | Yes | 1818 | 0.55 | 100 | 30 | 10 | 99 |
Fortran | Yes | 1315 | 0.76 | 95 | 25 | 15 | 95 |
Let’s first look at some real peer-review publication data (1995-2015), specifically comparing all published scientific reports utilizing R, SAS and SPSS, as popular tools for data manipulation and statistical modeling. These data are retrieved using GoogleScholar literature searches.
# library(ggplot2)
# library(reshape2)
library(ggplot2)
library(reshape2)
library(plotly)
Data_R_SAS_SPSS_Pubs <-
read.csv('https://umich.instructure.com/files/2361245/download?download_frd=1', header=T)
df <- data.frame(Data_R_SAS_SPSS_Pubs)
# convert to long format (http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/)
# df <- melt(df , id.vars = 'Year', variable.name = 'Software')
# ggplot(data=df, aes(x=Year, y=value, color=Software, group = Software)) +
# geom_line(size=4) + labs(x='Year', y='Paper Software Citations') +
# ggtitle("Manuscript Citations of Software Use (1995-2015)") +
# theme(legend.position=c(0.1,0.8),
# legend.direction="vertical",
# axis.text.x = element_text(angle = 45, hjust = 1),
# plot.title = element_text(hjust = 0.5))
plot_ly(df, x = ~Year) %>%
  add_trace(y = ~R, name = 'R', type = 'scatter', mode = 'lines+markers') %>%
  add_trace(y = ~SAS, name = 'SAS', type = 'scatter', mode = 'lines+markers') %>%
  add_trace(y = ~SPSS, name = 'SPSS', type = 'scatter', mode = 'lines+markers') %>%
  layout(title="Manuscript Citations of Software Use (1995-2015)", legend = list(orientation = 'h'))
We can also look at a dynamic Google Trends map, which provides longitudinal tracking of the number of web-searches for each of these three statistical computing platforms (R, SAS, SPSS). The figure below shows one example of the evolving software interest over the past 15 years. You can expand this plot by modifying the trend terms, expanding the search phrases, and changing the time period. Static 2004-2018 monthly data of the popularity of SAS, SPSS, and R programming Google searches is saved in the file GoogleTrends_Data_R_SAS_SPSS_Worldwide_2004_2018.csv. The example below shows a dynamic pull of \(\sim 20\) years of Google queries about R, SAS, SPSS, and Python, traced between 2004-01-01 and 2023-06-16.
# require(ggplot2)
# require(reshape2)
# GoogleTrends_Data_R_SAS_SPSS_Worldwide_2004_2018 <-
# read.csv('https://umich.instructure.com/files/9310141/download?download_frd=1', header=T)
# # read.csv('https://umich.instructure.com/files/9314613/download?download_frd=1', header=T) # Include Python
# df_GT <- data.frame(GoogleTrends_Data_R_SAS_SPSS_Worldwide_2004_2018)
#
# # convert to long format
# # df_GT <- melt(df_GT , id.vars = 'Month', variable.name = 'Software')
# #
# # library(scales)
# df_GT$Month <- as.Date(paste(df_GT$Month,"-01",sep=""))
# ggplot(data=df_GT1, aes(x=Date, y=hits, color=keyword, group = keyword)) +
# geom_line(size=4) + labs(x='Month-Year', y='Worldwide Google Trends') +
# scale_x_date(labels = date_format("%m-%Y"), date_breaks='4 months') +
# ggtitle("Web-Search Trends of Statistical Software (2004-2018)") +
# theme(legend.position=c(0.1,0.8),
# legend.direction="vertical",
# axis.text.x = element_text(angle = 45, hjust = 1),
# plot.title = element_text(hjust = 0.5))
#### Pull dynamic Google-Trends data
# install.packages("prophet")
# install.packages("devtools")
# install.packages("ps"); install.packages("pkgbuild")
# devtools::install_github("PMassicotte/gtrendsR")
# Potential 429 Error, see:
# https://github.com/PMassicotte/gtrendsR/issues/431
# https://github.com/trendecon/trendecon/blob/master/R/gtrends_with_backoff.R
library(gtrendsR)
library(ggplot2)
library(prophet)
df_GT1 <- gtrends(c("R", "SAS", "SPSS", "Python"),
gprop = "web", time = "2004-01-01 2023-06-16")[[1]]
# geo = c("US","CN","GB", "EU")
# During repeated requests, to prevent gtrends error message
# "Status code was not 200. Returned status code:429", due to multiple queries
# we used the GoogleTrends online search to for the 4 terms
# https://trends.google.com/trends/explore?date=all&q=R,SAS,Python,SPSS&hl=en
# and saved the data to DSPA Canvas site:
# https://umich.instructure.com/courses/38100/files/folder/Case_Studies/
# https://umich.instructure.com/files/31071103/download?download_frd=1
# df_GT1_wide <- spread(df_GT1, key = keyword, value = hits)
# # colnames(df_GT1_wide)[7] <- "R"
# # colnames(df_GT1_wide) <- gsub(" ", "", colnames(data))
# # dim(df_GT1_wide ) # [1] 212 9
#
# plot_ly(df_GT1_wide, x = ~date) %>%
# add_trace(x = ~date, y = ~R, name = 'R', type = 'scatter', mode = 'lines+markers') %>%
# add_trace(x = ~date, y = ~SAS, name = 'SAS', type = 'scatter', mode = 'lines+markers') %>%
# add_trace(x = ~date, y = ~SPSS, name = 'SPSS', type = 'scatter', mode = 'lines+markers') %>%
# add_trace(x = ~date, y = ~Python, name = 'Python', type = 'scatter', mode = 'lines+markers') %>%
# layout(title="Monthly Web-Search Trends of Statistical Software (2004-2023)",
# legend = list(orientation = 'h'),
# xaxis = list(title = 'Time'),
# yaxis = list (title = 'Relative Search Volume'))
# load the data
df_GT1 <- read.csv(
"https://umich.instructure.com/files/31071103/download?download_frd=1",
header=T, as.is = T) # R_SAS_SPSS_Python_GoogleTrendsSearchDate_July_2023.csv
summary(df_GT1)
## Month R SAS Python
## Length:235 Min. : 52.00 Min. : 5.000 Min. : 9.00
## Class :character 1st Qu.: 58.00 1st Qu.: 8.000 1st Qu.:10.00
## Mode :character Median : 62.00 Median : 9.000 Median :12.00
## Mean : 64.77 Mean : 9.111 Mean :17.58
## 3rd Qu.: 69.00 3rd Qu.:11.000 3rd Qu.:25.00
## Max. :100.00 Max. :13.000 Max. :47.00
## SPSS
## Min. :1.000
## 1st Qu.:2.000
## Median :2.000
## Mean :2.068
## 3rd Qu.:3.000
## Max. :4.000
## Month R SAS Python SPSS
## 1 2004-01 61 11 13 3
## 2 2004-02 59 12 13 4
## 3 2004-03 61 11 12 4
## 4 2004-04 56 11 12 4
## 5 2004-05 53 12 12 4
## 6 2004-06 56 12 12 3
# keywords = c("R", "SAS", "SPSS", "Python")
# time_period = "2004-01-01 2023-06-16"
# geo = c("US","CN","GB", "EU")
# gtrends_data <- data.frame(gtrends(keyword=keywords,
# time=time_period,geo=geo)$interest_over_time)
library(tidyr)
# colnames(df_GT1_wide)[7] <- "R"
# colnames(df_GT1_wide) <- gsub(" ", "", colnames(data))
# dim(df_GT1_wide ) # [1] 212 9
plot_ly(df_GT1, x = ~Month) %>%
add_trace(x = ~Month, y = ~R, name = 'R', type = 'scatter', mode = 'lines+markers') %>%
add_trace(x = ~Month, y = ~SAS, name = 'SAS', type = 'scatter', mode = 'lines+markers') %>%
add_trace(x = ~Month, y = ~SPSS, name = 'SPSS', type = 'scatter', mode = 'lines+markers') %>%
add_trace(x = ~Month, y = ~Python, name = 'Python', type = 'scatter', mode = 'lines+markers') %>%
layout(title="Monthly Web-Search Trends of Statistical Software (2004-2023)",
# legend = list(orientation = 'h'),
xaxis = list(title = 'Monthly', automargin = TRUE),
yaxis = list (title = 'Relative Search Volume'))
R Installation

R is free software that can be installed on any computer. The R website is https://R-project.org. There you can install a shell-based R environment following this protocol:

- Download and install the latest R release (version 4.3, or a higher/newer version) for your specific operating system, e.g., Windows, Linux, or MacOS.

R Invocation (RStudio)

For many readers, it's best to also install and run R via the RStudio graphical user interface. To install RStudio, go to https://www.rstudio.org/ and download and install the RStudio Desktop version appropriate for your operating system.
The RStudio interface consists of several windows.

- Console window (bottom left): here you type commands after the > prompt and R will then execute your command. This is the most important window, because this is where R actually does stuff.
- Script editor window (top left): here collections of commands can be edited and saved as an R script. Just typing a command in the editor window is not enough, it has to get into the command window before R executes the command. If you want to run a line from the script window (or the whole script), you can click Run or press CTRL+ENTER to send it to the command window.
- Workspace / history window (top right): the workspace window shows which data and values R has in its memory. You can view and edit the values by clicking on them. The history window shows what has been typed before.

You can change the size of the windows by dragging the gray bars between the windows.
Updating and upgrading the R environment involves a three-step process:

- Update R itself: install the newest release of R from CRAN, or auto-upgrade to the latest version of R using the R installr package. Type this in the R console: install.packages("installr"); library(installr); updateR(),
- Update RStudio: go to the Help menu and click Check for Updates, and
- Update the R libraries: go to the Tools menu and click Check for Package Updates….

Just like any other software, services, or applications, these R updates should be done regularly; preferably monthly or at least semi-annually.
Quarto
Quarto is a multi-language, next-gen R Markdown from Posit, the rebranded Public Benefit Corporation RStudio. Quarto includes new features and capabilities that expand existing Rmd files without further modification. It's recommended, but not required, to install Quarto after installing R and the RStudio GUI. We still edit code and markdown in the RStudio IDE, just as we normally do with any Rmd computational protocol, and preview the rendered document dynamically in the Viewer tab.
The Quarto markdown documents have the .qmd extension, as opposed to the classical R markdown extension (.rmd). Once knitted, the .qmd source can be rendered into many different formats, PDF, MS Word, HTML5, etc.
Quarto allows including executable expressions within markdown text by enclosing code in inline r expressions. For instance, we can use inline code to report dynamically in the text the number of observations in a dataset, e.g., the dimensions of the mpg dataframe are 234, 11.
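As a minimal sketch (assuming the ggplot2 package, which ships the mpg data frame, is installed), the inline values reported above can be computed as follows; in a .qmd/.Rmd source file the same expressions would be embedded directly in the markdown text as inline r code.

```r
# Inline code in a .qmd/.Rmd file is written as `r <expression>` inside the markdown text,
# e.g.:  The dimensions of the `mpg` dataframe are `r nrow(mpg)`, `r ncol(mpg)`.
# The chunk below evaluates the same expressions directly in R.
library(ggplot2)   # provides the mpg example data frame
nrow(mpg)          # 234 observations
ncol(mpg)          # 11 variables
```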
Manual creation of a new qmd document is accomplished by mouse-clicking \(File\to New\ File\to Quarto Document\) or by using the command palette (shortcut Ctrl+Shift+P), search for Create a new Quarto document and hit return.
Quarto includes native support for Observable JS (OJS), a set of enhancements to raw JavaScript that provides a reactive runtime useful for interactive data exploration and analysis. Observable JS supports a hosted service for creating and publishing Rmd/Qmd/Python notebooks. OJS works in any Quarto document (Rmd, Jupyter, and Knitr documents) via an {ojs} executable code block.

This Posit Quarto video and the accompanying QMD slidedeck offer insights into the incredible power of markdown and interactive content integration across multiple programming languages.
- The basic R environment installation comes with limited core functionality. Everyone eventually will have to install more packages, e.g., reshape2, ggplot2, and we will show how to expand your RStudio library throughout these materials.
- The R environment also has to be upgraded occasionally, e.g., every 3-6 months, to get R patches, to fix known problems, and to add new functionality. This is also easy to do.
- The assignment operator in R is <- (although = may also be used), so to assign a value of \(2\) to a variable \(x\), we can use x <- 2 or equivalently x = 2.

R provides documentation for different R functions using the method help(). Typing help(topic) in the R console will provide detailed explanations for each R topic or function. Another way of doing it is to call ?topic, which is even easier.

For example, if I want to check the function for linear models (i.e., function lm()), I will use the following function.
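A minimal example (presumably what the corresponding chunk runs):

```r
help(lm)   # open the documentation page for linear models
# ?lm      # equivalent shorthand
```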
Below is a simple R script for melting a small dataset that illustrates the R syntax for variable definition, instantiation, function calls, and parameter setting.
rawdata_wide <- read.table(header=TRUE, text='
CaseID Gender Age Condition1 Condition2
1 M 5 13 10.5
2 F 6 16 11.2
3 F 8 10 18.3
4 M 9 9.5 18.1
5 M 10 12.1 19
')
# Make the CaseID column a factor
rawdata_wide$subject <- factor(rawdata_wide$CaseID)
rawdata_wide
## CaseID Gender Age Condition1 Condition2 subject
## 1 1 M 5 13.0 10.5 1
## 2 2 F 6 16.0 11.2 2
## 3 3 F 8 10.0 18.3 3
## 4 4 M 9 9.5 18.1 4
## 5 5 M 10 12.1 19.0 5
library(reshape2)
# Specify id.vars: the variables to keep (don't split apart on!)
melt(rawdata_wide, id.vars=c("CaseID", "Gender"))
## CaseID Gender variable value
## 1 1 M Age 5
## 2 2 F Age 6
## 3 3 F Age 8
## 4 4 M Age 9
## 5 5 M Age 10
## 6 1 M Condition1 13
## 7 2 F Condition1 16
## 8 3 F Condition1 10
## 9 4 M Condition1 9.5
## 10 5 M Condition1 12.1
## 11 1 M Condition2 10.5
## 12 2 F Condition2 11.2
## 13 3 F Condition2 18.3
## 14 4 M Condition2 18.1
## 15 5 M Condition2 19
## 16 1 M subject 1
## 17 2 F subject 2
## 18 3 F subject 3
## 19 4 M subject 4
## 20 5 M subject 5
There are specific options for the reshape2::melt() function, from the reshape2 R package, that control the transformation of the original (wide-format) dataset rawdata_wide into the modified (long-format) object data_long.
data_long <- melt(rawdata_wide,
# ID variables - all the variables to keep but not split apart on
id.vars=c("CaseID", "Gender"),
# The source columns
measure.vars=c("Age", "Condition1", "Condition2" ),
# Name of the destination column that will identify the original
# column that the measurement came from
variable.name="Feature",
value.name="Measurement"
)
data_long
## CaseID Gender Feature Measurement
## 1 1 M Age 5.0
## 2 2 F Age 6.0
## 3 3 F Age 8.0
## 4 4 M Age 9.0
## 5 5 M Age 10.0
## 6 1 M Condition1 13.0
## 7 2 F Condition1 16.0
## 8 3 F Condition1 10.0
## 9 4 M Condition1 9.5
## 10 5 M Condition1 12.1
## 11 1 M Condition2 10.5
## 12 2 F Condition2 11.2
## 13 3 F Condition2 18.3
## 14 4 M Condition2 18.1
## 15 5 M Condition2 19.0
For an elaborate justification, detailed description, and multiple examples of handling long-and-wide data, messy and tidy data, and data cleaning strategies, see the JSS Tidy Data article by Hadley Wickham.
Popular data generation functions include c(), seq(), rep(), and data.frame(). Sometimes we use list() and array() to create data too.
c()
c() creates a (column) vector. With option recursive=T, it descends through lists combining all elements into one vector.
## [1] 1 2 3 5 6 7 10 1 4
## A.Z A.Y B.X C.W C.V C.U
## 1.0 2.0 7.0 7.0 3.0 -1.9
When combined with list(), c() successfully created a vector with all the information in a list with three members A, B, and C.
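Output like the above can be reproduced with the following sketch; the specific numeric values and element names are assumptions chosen to match the printed results.

```r
# Concatenate several numeric vectors into one vector
c(1:3, 5:7, 10, 1, 4)

# Flatten a list with three members (A, B, C) into a single named vector
l <- list(A = c(Z = 1, Y = 2), B = c(X = 7), C = c(W = 7, V = 3, U = -1.9))
c(l, recursive = TRUE)
```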
seq(from, to)
seq(from, to) generates a sequence. Adding the option by= specifies the increment; the option length= specifies the desired length. Also, seq(along=x) generates the sequence 1, 2, ..., length(x). This is used in loops to create an ID for each element in x.
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0
## [16] 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5
## [31] 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0
## [1] 1.000 3.375 5.750 8.125 10.500 12.875 15.250 17.625 20.000
## [1] 1 2 3 4
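A sketch that generates sequences like the ones printed above:

```r
seq(1, 20, by = 0.5)     # increments of 0.5: 1.0, 1.5, ..., 20.0
seq(1, 20, length = 9)   # 9 equally spaced values between 1 and 20
x <- c(5, 12, 13, 40)    # hypothetical vector
seq(along = x)           # 1 2 3 4, one ID per element of x
```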
rep(x, times)
rep(x, times) creates a sequence that repeats x a specified number of times. The option each= allows us to repeat each element of x a certain number of times.
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
## [1] 1 1 1 1 2 2 2 2 3 3 3 3
Compare this to replicating using replicate().
## [,1] [,2] [,3] [,4]
## [1,] 2 2 2 2
## [2,] 3 3 3 3
## [3,] 4 4 4 4
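For instance, output like the above may be produced by:

```r
rep(1:3, times = 4)   # repeat the whole vector:  1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each = 4)    # repeat each element:      1 1 1 1 2 2 2 2 3 3 3 3
replicate(4, 2:4)     # evaluate the expression 4 times, binding results as columns
```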
data.frame()
The function data.frame() creates a data frame object of named or unnamed arguments. We can combine multiple vectors of different types into data frames, with each vector stored as a column. Shorter vectors are automatically recycled to match the length of the longest vector. With data.frame() you can mix numeric and character vectors.
## v ch n
## 1 1 a 10
## 2 2 B 11
## 3 3 C 10
## 4 4 d 11
Note that the operator : generates a sequence and the expression 1:4 yields a vector of integers from \(1\) to \(4\).
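The data frame printed above could have been constructed along these lines (the vector values are read off the output):

```r
v  <- 1:4                    # integer vector generated with the : operator
ch <- c("a", "B", "C", "d")  # character vector
n  <- c(10, 11)              # shorter vector, recycled to length 4
data.frame(v, ch, n)
```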
list()
Much like the column function c(), the function list() creates a list of the named or unnamed arguments, indexed from \(1\) to \(n\), including \(1\) and \(n\). Remember that in R, indexing of vectors, lists, arrays and tensors starts at \(1\), not \(0\), as in some other programming languages.
## $a
## [1] 1 2
##
## $b
## [1] "hi"
##
## $c
## [1] -3+3i
As R uses general objects to represent different constructs, object elements are accessible via $, @, ., and other delimiters, depending on the object type. For instance, we can refer to a member \(a\) and index \(i\) in the list of objects \(l\) containing an element \(a\) by l$a[[i]]. For example,
## [1] 2
## [1] "hi"
array(x, dim=)
array(x, dim=) creates an array with specific dimensions. For example, dim=c(3, 4, 2) means two 3x4 matrices. We use [] to extract specific elements of the array; [2, 3, 1] means the element at the 2nd row, 3rd column of the 1st page. Leaving one of the index positions empty extracts a whole row, column or page; [2, , 1] means the second row of the 1st page.
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
## [1] 8
## [1] 2 5 8 11
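The array output above corresponds to code along these lines:

```r
x <- array(1:24, dim = c(3, 4, 2))  # two 3x4 matrices ("pages")
x
x[2, 3, 1]   # element in row 2, column 3 of page 1  -> 8
x[2, , 1]    # the entire second row of page 1       -> 2 5 8 11
```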
Other useful functions are:

- matrix(x, nrow=, ncol=): creates a matrix with nrow rows and ncol columns from the elements of x.
- factor(x, levels=): encodes a vector x as a factor.
- gl(n, k, length=n*k, labels=1:n): generates levels (factors) by specifying the pattern of their levels; n is the number of levels and k is the number of replications.
- expand.grid(): creates a data frame from all combinations of the supplied vectors or factors.
- rbind(): combines arguments by rows, for matrices, data frames, and others.
- cbind(): combines arguments by columns, for matrices, data frames, and others.
frames, and othersThe first pair of functions we will talk about are
save()
and load()
, which write and import
objects between the current R
environment RAM memory and
long term storage, e.g., hard drive, Cloud storage, SSD, etc. The script
below demonstrates the basic export and import operations with simple
data. Note that we saved the data in Rdata
(Rda
) format.
x <- seq(1, 10, by=0.5)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file="xy.RData")
load("xy.RData")
There are two basic functions, data(x) and library(x), that load specified data sets and R packages, respectively. The R base library is always loaded by default. However, add-on libraries need to be installed first and then imported (loaded) into the working environment before functions and objects in these libraries are accessible.
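The summary below refers to the built-in iris dataset; a minimal sketch generating it:

```r
data(iris)      # load a built-in example dataset into the workspace
summary(iris)   # numeric columns get quartile summaries, factors get counts
```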
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
read.table(file) reads a file in table format and creates a data frame from it. The default separator sep="" is any whitespace. Use header=TRUE to read the first line as a header of column names. Use as.is=TRUE to prevent character vectors from being converted to factors. Use comment.char="" to prevent "#" from being interpreted as a comment. Use skip=n to skip n lines before reading data. See the help for options on row naming, NA treatment, and others.
The example below uses read.table() to parse and load an ASCII text file containing a simple dataset, which is available on the supporting canvas data archive.
data.txt<-read.table("https://umich.instructure.com/files/1628628/download?download_frd=1", header=T, as.is = T) # 01a_data.txt
summary(data.txt)
## Name Team Position Height
## Length:1034 Length:1034 Length:1034 Min. :67.0
## Class :character Class :character Class :character 1st Qu.:72.0
## Mode :character Mode :character Mode :character Median :74.0
## Mean :73.7
## 3rd Qu.:75.0
## Max. :83.0
## Weight Age
## Min. :150.0 Min. :20.90
## 1st Qu.:187.0 1st Qu.:25.44
## Median :200.0 Median :27.93
## Mean :201.7 Mean :28.74
## 3rd Qu.:215.0 3rd Qu.:31.23
## Max. :290.0 Max. :48.52
When using R to access (read/write) data on a Cloud web service, like Instructure/Canvas or GoogleDrive/GDrive, mind that the direct URL reference to the raw file will be different from the URL of the pointer to the file that can be rendered in the browser window. For instance,

- a Canvas file that is viewable in the browser has to be read into R via a separate (direct download) URL, and
- a GDrive file-sharing link has to be rewritten for direct R processing, using the pattern uc?export=download&id=<file_id>, as in the example below.

dataGDrive.txt<-read.table("https://drive.google.com/uc?export=download&id=1Zpw3HSe-8HTDsOnR-n64KoMRWYpeBBek", header=T, as.is = T) # 01a_data.txt
summary(dataGDrive.txt)
## Name Team Position Height
## Length:1034 Length:1034 Length:1034 Min. :67.0
## Class :character Class :character Class :character 1st Qu.:72.0
## Mode :character Mode :character Mode :character Median :74.0
## Mean :73.7
## 3rd Qu.:75.0
## Max. :83.0
## Weight Age
## Min. :150.0 Min. :20.90
## 1st Qu.:187.0 1st Qu.:25.44
## Median :200.0 Median :27.93
## Mean :201.7 Mean :28.74
## 3rd Qu.:215.0 3rd Qu.:31.23
## Max. :290.0 Max. :48.52
read.csv("filename", header=TRUE) is identical to read.table() but with defaults set for reading comma-delimited files.
data.csv<-read.csv("https://umich.instructure.com/files/1628650/download?download_frd=1", header = T) # 01_hdp.csv
summary(data.csv)
## tumorsize co2 pain wound
## Min. : 33.97 Min. :1.222 Min. :1.000 Min. :1.000
## 1st Qu.: 62.49 1st Qu.:1.519 1st Qu.:4.000 1st Qu.:5.000
## Median : 70.07 Median :1.601 Median :5.000 Median :6.000
## Mean : 70.88 Mean :1.605 Mean :5.473 Mean :5.732
## 3rd Qu.: 79.02 3rd Qu.:1.687 3rd Qu.:6.000 3rd Qu.:7.000
## Max. :116.46 Max. :2.128 Max. :9.000 Max. :9.000
## mobility ntumors nmorphine remission
## Min. :1.00 Min. :0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:5.00 1st Qu.:1.000 1st Qu.: 2.000 1st Qu.:0.0000
## Median :6.00 Median :3.000 Median : 3.000 Median :0.0000
## Mean :6.08 Mean :3.066 Mean : 3.624 Mean :0.2957
## 3rd Qu.:7.00 3rd Qu.:5.000 3rd Qu.: 5.000 3rd Qu.:1.0000
## Max. :9.00 Max. :9.000 Max. :18.000 Max. :1.0000
## lungcapacity Age Married FamilyHx
## Min. :0.01612 Min. :26.32 Min. :0.0 Length:8525
## 1st Qu.:0.67647 1st Qu.:46.69 1st Qu.:0.0 Class :character
## Median :0.81560 Median :50.93 Median :1.0 Mode :character
## Mean :0.77409 Mean :50.97 Mean :0.6
## 3rd Qu.:0.91150 3rd Qu.:55.27 3rd Qu.:1.0
## Max. :0.99980 Max. :74.48 Max. :1.0
## SmokingHx Sex CancerStage LengthofStay
## Length:8525 Length:8525 Length:8525 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 5.000
## Mode :character Mode :character Mode :character Median : 5.000
## Mean : 5.492
## 3rd Qu.: 6.000
## Max. :10.000
## WBC RBC BMI IL6
## Min. :2131 Min. :3.919 Min. :18.38 Min. : 0.03521
## 1st Qu.:5323 1st Qu.:4.802 1st Qu.:24.20 1st Qu.: 1.93039
## Median :6007 Median :4.994 Median :27.73 Median : 3.34400
## Mean :5998 Mean :4.995 Mean :29.07 Mean : 4.01698
## 3rd Qu.:6663 3rd Qu.:5.190 3rd Qu.:32.54 3rd Qu.: 5.40551
## Max. :9776 Max. :6.065 Max. :58.00 Max. :23.72777
## CRP DID Experience School
## Min. : 0.0451 Min. : 1.0 Min. : 7.00 Length:8525
## 1st Qu.: 2.6968 1st Qu.:100.0 1st Qu.:15.00 Class :character
## Median : 4.3330 Median :199.0 Median :18.00 Mode :character
## Mean : 4.9730 Mean :203.3 Mean :17.64
## 3rd Qu.: 6.5952 3rd Qu.:309.0 3rd Qu.:21.00
## Max. :28.7421 Max. :407.0 Max. :29.00
## Lawsuits HID Medicaid
## Min. :0.000 Min. : 1.00 Min. :0.1416
## 1st Qu.:1.000 1st Qu.: 9.00 1st Qu.:0.3369
## Median :2.000 Median :17.00 Median :0.5215
## Mean :1.866 Mean :17.76 Mean :0.5125
## 3rd Qu.:3.000 3rd Qu.:27.00 3rd Qu.:0.7083
## Max. :9.000 Max. :35.00 Max. :0.8187
read.delim(“filename”, header=TRUE) is very similar to the first two. However, it has defaults set for reading tab-delimited files.
Also, we have read.fwf(file, widths, header=FALSE, sep="\t", as.is=FALSE) to read a table of fixed-width formatted data into a data frame.
match(x, y) returns a vector of the positions of (first) matches of its first argument in its second. For a specific element of x, if no element matches it in y, then the output would be NA.
## [1] 1 NA 2 4
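For example, a hypothetical pair of vectors yielding the output above:

```r
x <- c("a", "x", "b", "d")
y <- c("a", "b", "c", "d")
match(x, y)   # 1 NA 2 4 : "x" has no match in y
```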
save.image(file) saves all objects in the current workspace.
write.table(x, file="", row.names=TRUE, col.names=TRUE, sep=" ") prints x after converting it to a data frame and stores it into the specified file. If quote is TRUE, character or factor columns are surrounded by quotes ("). sep is the field separator. eol is the end-of-line separator. na is the string for missing values. Use col.names=NA to add a blank column header to get the column headers aligned correctly for spreadsheet input.
Most of the I/O functions have a file argument. This can often be a character string naming a file or a connection; file="" means the standard input or output. Connections can include files, pipes, zipped files, and R variables.

On Windows, the file connection can also be used with description = "clipboard". To read a table copied from Excel, use x <- read.delim("clipboard"). To write a table to the clipboard for Excel, use write.table(x, "clipboard", sep="\t", col.names=NA).
For database interaction, see packages RODBC, DBI, RMySQL, RPgSQL, and ROracle, as well as packages XML, hdf5, netCDF for reading other file formats. We will talk about some of them in later chapters.
Note, an alternative library called rio handles import/export of multiple data types with simple syntax.
The following table summarizes the basic vector indexing operations.
Expression | Explanation |
---|---|
x[n] | nth element |
x[-n] | all but the nth element |
x[1:n] | first n elements |
x[-(1:n)] | elements from n+1 to the end |
x[c(1, 4, 2)] | specific elements |
x["name"] | element named "name" |
x[x > 3] | all elements greater than 3 |
x[x > 3 & x < 5] | all elements between 3 and 5 |
x[x %in% c("a", "and", "the")] | elements in the given set |
Indexing lists are similar but not identical to indexing vectors.
Expression | Explanation |
---|---|
x[n] | list with n elements |
x[[n]] | nth element of the list |
x[["name"]] | element of the list named "name" |
Indexing for matrices and higher dimensional arrays (tensors) derive from vector indexing.
Expression | Explanation |
---|---|
x[i, j] | element at row i, column j |
x[i, ] | row i |
x[, j] | column j |
x[, c(1, 3)] | columns 1 and 3 |
x["name", ] | row named "name" |
The following functions represent simple examples of converting data types: as.array(x), as.data.frame(x), as.numeric(x), as.logical(x), as.complex(x), as.character(x), … Typing methods(as) in the console will generate a complete list of variable conversion functions.
The following functions verify if the input is of a specific data type: is.na(x), is.null(x), is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x), … For a complete list, type methods(is) in the R console. The outputs of these functions are objects, either single values (TRUE or FALSE), or objects of the same dimensions as the inputs containing a Boolean TRUE or FALSE element for each entry in the dataset.
length(x) gives us the number of elements in x.
## [1] 6
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## [1] TRUE
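Output like the above could come from a simple numeric vector, e.g.:

```r
x <- c(2, 3, 5, 7, 11, 13)  # hypothetical vector with 6 elements
length(x)       # 6
is.na(x)        # FALSE for every element
is.numeric(x)   # TRUE
```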
dim(x) retrieves or sets the dimension of an array and length(y) reports the length of a list or a vector.
## [1] 12
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
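A sketch matching the output above:

```r
x <- 1:12           # plain vector of 12 elements
length(x)           # 12
dim(x) <- c(3, 4)   # reshape into a 3x4 matrix (filled column-by-column)
x
```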
dimnames(x) retrieves or sets the dimension names of an object. For higher dimensional objects like matrices or arrays, we can combine dimnames() with a list.
## C1 C2 C3 C4
## R1 1 4 7 10
## R2 2 5 8 11
## R3 3 6 9 12
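Continuing with a 3x4 matrix, the row and column names shown above can be assigned as follows:

```r
x <- matrix(1:12, nrow = 3, ncol = 4)
dimnames(x) <- list(c("R1", "R2", "R3"), c("C1", "C2", "C3", "C4"))
x   # rows R1-R3, columns C1-C4
```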
nrow(x) and ncol(x) report the number of rows and the number of columns of a matrix.
## [1] 3
## [1] 4
class(x) gets or sets the class of \(x\). Note that we can use unclass(x) to remove the class attribute of \(x\).
## [1] "matrix" "array"
## C1 C2 C3 C4
## R1 1 4 7 10
## R2 2 5 8 11
## R3 3 6 9 12
attr(x, which) gets or sets the attribute which of \(x\).
## NULL
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 3 5 7 9 11
## [2,] 2 4 6 8 10 12
The above script shows that applying unclass to \(x\) sets its class to NULL.
attributes(obj) gets or sets the list of attributes of an object.
attributes(x) <- list(mycomment = "really special", dim = 3:4,
dimnames = list(LETTERS[1:3], letters[1:4]), names = paste(1:12))
x
## a b c d
## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12
## attr(,"mycomment")
## [1] "really special"
## attr(,"names")
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
In this section, we will introduce some data manipulation functions. In addition, tools from dplyr provide easy dataset manipulation routines.
which.max(x) returns the index of the greatest element (max) of \(x\), which.min(x) returns the index of the smallest element (min) of \(x\), and rev(x) reverses the elements of \(x\).
## [1] 6
## [1] 1
## [1] 3 40 10 1 2 5 1
sort(x) sorts the elements of \(x\) in increasing order. To sort in decreasing order, we can use rev(sort(x)).
## [1] 1 1 2 3 5 10 40
## [1] 40 10 5 3 2 1 1
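The outputs above are consistent with a small hypothetical vector such as:

```r
x <- c(1, 5, 2, 1, 10, 40, 3)  # hypothetical example vector
which.max(x)   # 6, position of the largest value (40)
which.min(x)   # 1, position of the (first) smallest value (1)
rev(x)         # 3 40 10 1 2 5 1
sort(x)        # increasing order
rev(sort(x))   # decreasing order
```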
cut(x, breaks) divides \(x\) into intervals of the same length (sometimes factors). The optional parameter breaks specifies the number of cut intervals or a vector of cut points. cut divides the range of \(x\) into intervals, coding the values in \(x\) according to the intervals they fall into.
## [1] 1 5 2 1 10 40 3
## [1] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (27,40] (0.961,14]
## Levels: (0.961,14] (14,27] (27,40]
## [1] (0,5] (0,5] (0,5] (0,5] (5,20] <NA> (0,5]
## Levels: (0,5] (5,20] (20,30]
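A sketch consistent with the cut() output above:

```r
x <- c(1, 5, 2, 1, 10, 40, 3)
cut(x, breaks = 3)                 # three equal-width intervals over the range of x
cut(x, breaks = c(0, 5, 20, 30))   # explicit cut points; 40 falls outside -> NA
```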
which(x == a) returns a vector of the indices of \(x\) for which the comparison is true. For example, it returns the value \(i\) if \(x[i]==a\) is TRUE. Thus, the argument of this function (like x==a) must be a Boolean variable.
## [1] 1 5 2 1 10 40 3
## [1] 3
na.omit(x) suppresses the observations with missing data (NA). It suppresses the corresponding line if \(x\) is a matrix or a data frame. na.fail(x) returns an error message if \(x\) contains at least one NA.
## a b
## 1 1 1
## 2 2 3
## 3 3 NA
## 4 4 9
## 5 5 8
## a b
## 1 1 1
## 2 2 3
## 4 4 9
## 5 5 8
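The data frame above and its NA-free version could be generated by:

```r
df <- data.frame(a = 1:5, b = c(1, 3, NA, 9, 8))  # hypothetical data with one missing value
df
na.omit(df)    # drops row 3, which contains the NA
# na.fail(df)  # would raise an error because df contains an NA
```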
unique(x) If \(x\) is a vector or a data frame, it returns a similar object suppressing the duplicate elements.
## a b
## 1 1 1
## 2 1 1
## 3 7 NA
## 4 6 9
## 5 8 8
## a b
## 1 1 1
## 3 7 NA
## 4 6 9
## 5 8 8
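A sketch matching the duplicate-removal output above:

```r
df2 <- data.frame(a = c(1, 1, 7, 6, 8), b = c(1, 1, NA, 9, 8))  # rows 1 and 2 are identical
df2
unique(df2)   # keeps only the first copy of the duplicated row
```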
table(x) returns a table with the different values of \(x\) and their frequencies (typically used for integer or factor variables). The corresponding prop.table() function transforms these raw frequencies to relative frequencies (proportions, marginal mass).
## v
## 1 2 4 5 6 7 8
## 1 3 2 1 1 1 2
## [1] 0.02040816 0.04081633 0.08163265 0.04081633 0.04081633 0.10204082
## [7] 0.12244898 0.08163265 0.14285714 0.16326531 0.16326531
subset(x, …) returns a selection of \(x\) with respect to the specified criteria (…). Typically … are comparisons like x$V1 < 10. If \(x\) is a data frame, the option select= allows using a negative sign \(-\) to indicate variables to keep or drop from the object.
## a b
## 3 7 NA
## 4 6 9
## 5 8 8
## b
## 1 1
## 2 1
## 3 NA
## 4 9
## 5 8
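Using the same hypothetical data frame df2, the two subsets above can be obtained by:

```r
df2 <- data.frame(a = c(1, 1, 7, 6, 8), b = c(1, 1, NA, 9, 8))
subset(df2, a > 5)         # rows where a exceeds 5
subset(df2, select = -a)   # drop column a, keep the rest
```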
## Subsampling
x <- matrix(rnorm(100), ncol = 5)
y <- c(1, seq(19))
z <- cbind(x, y)
z.df <- data.frame(z)
z.df
## V1 V2 V3 V4 V5 y
## 1 -1.89578014 0.1918691 -0.70813104 -0.10852975 0.44002605 1
## 2 0.63940388 0.1635994 1.12767712 -0.90115072 0.74473074 1
## 3 1.68417131 -1.9084592 0.35506625 0.35008846 -2.21866765 2
## 4 -0.90353339 -1.7996781 -0.53256018 2.05079775 -0.94195942 3
## 5 0.18702987 0.2779862 -0.74427552 0.55338980 -0.29256495 4
## 6 -1.82063172 -0.2497984 -0.24664083 2.88078516 -0.13539427 5
## 7 1.62618855 0.6542937 0.39698925 -0.38550863 -0.81405177 6
## 8 0.44584190 -0.4998778 -0.56358972 -0.07237357 -1.32991366 7
## 9 -0.09937767 -0.3484989 -0.62871404 1.15506339 1.58734728 8
## 10 -0.56186721 0.3980888 2.32924377 1.24999897 -0.01480825 9
## 11 -0.76589959 -1.5274742 -0.39412641 -0.85802060 -1.31031579 10
## 12 -0.23928375 -0.8795305 1.93130523 0.65934149 0.20002226 11
## 13 -2.02539426 -0.9383289 1.91016510 -1.17953572 0.11995736 12
## 14 -0.84036269 0.6193404 -1.29736621 -0.71366790 -1.43561738 13
## 15 -0.27641492 -0.8883591 -0.40880676 0.76796354 -0.63535437 14
## 16 1.57890351 -0.3421263 -0.11812953 -0.41099457 -0.62785865 15
## 17 -0.96484897 -1.2308699 -0.14697563 -1.88841290 0.81834206 16
## 18 0.54525994 -0.3997908 -0.07234845 -0.63193905 -0.28165875 17
## 19 0.49027125 2.2613809 -1.81953107 -0.25580313 1.68183079 18
## 20 -0.55580099 0.7075943 -0.80153193 0.69944331 0.90995301 19
## [1] "V1" "V2" "V3" "V4" "V5" "y"
## V1 V2 V3 V4 V5 y
## 4 -0.90353339 -1.7996781 -0.53256018 2.05079775 -0.94195942 3
## 5 0.18702987 0.2779862 -0.74427552 0.55338980 -0.29256495 4
## 6 -1.82063172 -0.2497984 -0.24664083 2.88078516 -0.13539427 5
## 7 1.62618855 0.6542937 0.39698925 -0.38550863 -0.81405177 6
## 8 0.44584190 -0.4998778 -0.56358972 -0.07237357 -1.32991366 7
## 9 -0.09937767 -0.3484989 -0.62871404 1.15506339 1.58734728 8
## 10 -0.56186721 0.3980888 2.32924377 1.24999897 -0.01480825 9
## 16 1.57890351 -0.3421263 -0.11812953 -0.41099457 -0.62785865 15
## 18 0.54525994 -0.3997908 -0.07234845 -0.63193905 -0.28165875 17
## 19 0.49027125 2.2613809 -1.81953107 -0.25580313 1.68183079 18
## V1 V2 V3 V4 V5 y
## 1 -1.8957801 0.1918691 -0.708131 -0.1085297 0.4400261 1
## 2 0.6394039 0.1635994 1.127677 -0.9011507 0.7447307 1
## V1 V2 V3 V4 V5 y
## 1 -1.8957801 0.1918691 -0.7081310 -0.1085297 0.4400261 1
## 2 0.6394039 0.1635994 1.1276771 -0.9011507 0.7447307 1
## 5 0.1870299 0.2779862 -0.7442755 0.5533898 -0.2925650 4
## V1 V2
## 1 -1.89578014 0.1918691
## 2 0.63940388 0.1635994
## 3 1.68417131 -1.9084592
## 4 -0.90353339 -1.7996781
## 5 0.18702987 0.2779862
## 6 -1.82063172 -0.2497984
## 7 1.62618855 0.6542937
## 8 0.44584190 -0.4998778
## 9 -0.09937767 -0.3484989
## 10 -0.56186721 0.3980888
## 11 -0.76589959 -1.5274742
## 12 -0.23928375 -0.8795305
## 13 -2.02539426 -0.9383289
## 14 -0.84036269 0.6193404
## 15 -0.27641492 -0.8883591
## 16 1.57890351 -0.3421263
## 17 -0.96484897 -1.2308699
## 18 0.54525994 -0.3997908
## 19 0.49027125 2.2613809
## 20 -0.55580099 0.7075943
sample(x, size) randomly samples size elements from the vector \(x\) without replacement. The option replace = TRUE allows resampling with replacement.
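For example (a minimal sketch reusing the assumed vector v; the exact values below depend on the random seed):
sample(v, 20, replace = TRUE)   # resample 20 elements from v with replacement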
## [1] 7 1 1 1 1 7 7 1 6 1 1 1 7 7 7 1 1 1 6 8
Many mathematical functions, statistical summaries, and function optimizers will be discussed throughout the book. Below are the very basic functions to keep in mind.
Basic math functions like sin, cos, tan, asin, acos, atan, atan2, log, log10, exp, and "set" functions union(x, y), intersect(x, y), setdiff(x, y), setequal(x, y), is.element(el, set) are available in R. lsf.str("package:base") displays all of the functions built into a specific R package (such as base). The following table summarizes the core functions for basic R calculations.
Expression | Explanation |
---|---|
choose(n, k) | computes the number of combinations of k events among n repetitions; mathematically it equals \(\frac{n!}{[(n-k)!k!]}\) |
max(x) | maximum of the elements of x |
min(x) | minimum of the elements of x |
range(x) | minimum and maximum of the elements of x |
sum(x) | sum of the elements of x |
diff(x) | lagged and iterated differences of vector x |
prod(x) | product of the elements of x |
mean(x) | mean of the elements of x |
median(x) | median of the elements of x |
quantile(x, probs=) | sample quantiles corresponding to the given probabilities (defaults to 0, .25, .5, .75, 1) |
weighted.mean(x, w) | mean of x with weights w |
rank(x) | ranks of the elements of x |
var(x) or cov(x) | variance of the elements of x (computed with denominator n-1); if x is a matrix or a data frame, the variance-covariance matrix is calculated |
sd(x) | standard deviation of x |
cor(x) | correlation matrix of x if it is a matrix or a data frame (1 if x is a vector) |
var(x, y) or cov(x, y) | covariance between x and y, or between the columns of x and those of y if they are matrices or data frames |
cor(x, y) | linear correlation between x and y, or correlation matrix if they are matrices or data frames |
round(x, n) | rounds the elements of x to n decimals |
log(x, base) | computes the logarithm of x with base base |
scale(x) | if x is a matrix, centers and scales the data; use center=FALSE to skip centering and scale=FALSE to skip scaling (by default center=TRUE, scale=TRUE) |
pmin(x, y, ...) | a vector whose i-th element is the minimum of x[i], y[i], ... |
pmax(x, y, ...) | a vector whose i-th element is the maximum of x[i], y[i], ... |
cumsum(x) | a vector whose i-th element is the sum from x[1] to x[i] |
cumprod(x) | id. for the product |
cummin(x) | id. for the minimum |
cummax(x) | id. for the maximum |
Re(x) | real part of a complex number |
Im(x) | imaginary part of a complex number |
Mod(x) | modulus; abs(x) is the same |
Arg(x) | angle in radians of the complex number |
Conj(x) | complex conjugate |
convolve(x, y) | computes several types of convolutions of two sequences |
fft(x) | Fast Fourier Transform of an array |
mvfft(x) | FFT of each column of a matrix |
filter(x, filter) | applies linear filtering to a univariate time series or to each series separately of a multivariate time series |
Note: many math functions have a logical parameter na.rm=TRUE to specify removal of missing data (NA).
The following table summarizes basic operation functions. We will discuss this topic in detail in Chapter 3 (Linear Algebra, Matrix Computing, and Regression Modeling).
Expression | Explanation |
---|---|
t(x) | transpose |
diag(x) | diagonal |
%*% | matrix multiplication |
solve(a, b) | solves a %*% x = b for x |
solve(a) | matrix inverse of a |
rowsum(x) | sum of rows for a matrix-like object; rowSums(x) is a faster version |
colsum(x), colSums(x) | id. for columns |
rowMeans(x) | fast version of row means |
colMeans(x) | id. for columns |
mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
mat1.inv <- solve(mat1)
mat1.identity <- mat1.inv %*% mat1
mat1.identity
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
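solve(a, b) also solves a linear system directly. A minimal sketch (the right-hand side c(1, 2) is an assumption; it reproduces the output below):
solve(mat1, c(1, 2))   # solve mat1 %*% x = c(1, 2) for x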
## [1] 1.785714 2.357143
In optimization functions such as optim(par, fn, ...), par specifies the initial parameter values and fn is the function to optimize (normally minimize). In model-fitting functions such as lm() and glm(), formula is typically of the form response ~ termA + termB + ...; use I(x*y) + I(x^2) for terms made of nonlinear components. The family argument (e.g., in glm()) is a description of the error distribution and link function to be used in the model; see ?family.

Many of the formula-based modeling functions have several common arguments:
- data= the data frame containing the formula variables,
- subset= a subset of observations to be used in the fit,
- na.action= the action for missing values: "na.fail", "na.omit", or a function.
The following generics often apply to model fitting functions:
- predict(fit, ...) computes predictions from fit based on input data.
- df.residual(fit) returns the number of residual degrees of freedom.
- coef(fit) returns the estimated coefficients (sometimes with their standard errors).
- residuals(fit) returns the residuals.
- deviance(fit) returns the deviance.
- fitted(fit) returns the fitted values.
- logLik(fit) computes the logarithm of the likelihood and the number of parameters.
- AIC(fit) computes the Akaike information criterion (AIC).

Other functions include binom.test(), pairwise.t.test(), power.t.test(), prop.test(), t.test(), …; use help.search("test") to see details.
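A minimal sketch illustrating several of these generics on a simple linear model (the simulated data and model below are assumptions, for illustration only):
set.seed(1234)
x1 <- rnorm(50)
y1 <- 2 + 3 * x1 + rnorm(50)                  # simulated response
fit <- lm(y1 ~ x1)                            # fit a simple linear model
coef(fit)                                     # estimated coefficients
df.residual(fit)                              # residual degrees of freedom
deviance(fit)                                 # deviance (residual sum of squares for lm)
AIC(fit)                                      # Akaike information criterion
predict(fit, data.frame(x1 = c(-1, 0, 1)))    # predictions for new inputs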
The Probability Distributome Project provides many details about univariate probability distributions. The SOCR R Shiny Distribution Calculators and the SOCR Bivariate and Trivariate Interactive Graphical Calculators provide additional demonstrations of multivariate probability distributions.
In R, there are four complementary functions supporting each probability distribution. For the Normal distribution, these four functions are dnorm() - density, pnorm() - distribution function, qnorm() - quantile function, and rnorm() - random generation function. For the Poisson distribution, the corresponding functions are dpois(), ppois(), qpois(), and rpois().
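For example, for the standard Normal distribution (a minimal sketch):
dnorm(0)         # density at 0 (about 0.3989)
pnorm(1.96)      # cumulative probability P(X <= 1.96) (about 0.975)
qnorm(0.975)     # 97.5th percentile (about 1.96)
rnorm(3)         # three random draws from N(0, 1)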
The table below shows the invocation syntax for generating random samples from a number of different probability distributions.
Expression | Explanation |
---|---|
rnorm(n, mean=0, sd=1) | Gaussian (normal) |
rexp(n, rate=1) | exponential |
rgamma(n, shape, scale=1) | gamma |
rpois(n, lambda) | Poisson |
rweibull(n, shape, scale=1) | Weibull |
rcauchy(n, location=0, scale=1) | Cauchy |
rbeta(n, shape1, shape2) | beta |
rt(n, df) | Student’s (t) |
rf(n, df1, df2) | Fisher’s (F) (df1, df2) |
rchisq(n, df) | Pearson (chi-squared) |
rbinom(n, size, prob) | binomial |
rgeom(n, prob) | geometric |
rhyper(nn, m, n, k) | hypergeometric |
rlogis(n, location=0, scale=1) | logistic |
rlnorm(n, meanlog=0, sdlog=1) | lognormal |
rnbinom(n, size, prob) | negative binomial |
runif(n, min=0, max=1) | uniform |
rwilcox(nn, m, n), rsignrank(nn, n) | Wilcoxon’s statistics |
Obviously, replacing the first letter r with d, p, or q references the corresponding probability density (dfunc(x, ...)), the cumulative distribution function (pfunc(x, ...)), and the quantile function (qfunc(p, ...), with \(0 < p < 1\)).
In this section, we introduce some functions that are helpful in many data analytic protocols. The family of *apply() functions acts on lists, arrays, vectors, data frames and other objects.

apply(X, INDEX, FUN=) returns a vector, array, or list of values obtained by applying a function FUN to the margins (INDEX=1 means rows, INDEX=2 means columns) of \(X\); in R the margin argument is formally named MARGIN. Additional options may be specified after the FUN argument.
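A minimal sketch reusing the assumed df1 from above (it matches the output below):
df1                                  # print the data frame
apply(df1, 2, mean, na.rm = TRUE)    # column means (second margin), ignoring NAs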
## a b
## 1 1 1
## 2 1 1
## 3 7 NA
## 4 6 9
## 5 8 8
## a b
## 4.60 4.75
lapply(X, FUN) applies FUN to each member of the list \(X\). If \(X\) is a data frame, it will apply FUN to each column and return a list.
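For instance (a hedged sketch reusing the assumed df1; it corresponds to the first output below):
lapply(df1, mean, na.rm = TRUE)   # a list with the mean of each column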
## $a
## [1] 4.6
##
## $b
## [1] 4.75
## $a
## [1] 5
##
## $b
## [1] 90
tapply(X, INDEX, FUN=) applies FUN to each cell of a ragged array given by \(X\), with indexes equal to INDEX. Note that \(X\) is an atomic object, typically a vector.
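A minimal sketch (the vector v is reused from above and the grouping factor fac is assumed; it matches the outputs below):
v                                      # the assumed data vector
fac <- factor(rep(1:3, length = 11))   # assumed grouping factor with 3 levels
table(fac)                             # group sizes
tapply(v, fac, sum)                    # group-wise sums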
## [1] 1 2 4 2 2 5 6 4 7 8 8
## fac
## 1 2 3
## 4 4 3
## 1 2 3
## 17 16 16
by(data, INDEX, FUN) applies FUN to the data frame data, subsetted by INDEX. In this example, we apply the sum function using column 1 (a) as an index.
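A hedged sketch using the assumed df1 (it matches the output below):
by(df1, df1[, 1], sum)   # sum of all values in each subset defined by column a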
## df1[, 1]: 1
## [1] 4
## ------------------------------------------------------------
## df1[, 1]: 6
## [1] 15
## ------------------------------------------------------------
## df1[, 1]: 7
## [1] NA
## ------------------------------------------------------------
## df1[, 1]: 8
## [1] 16
merge(a, b) merges two data frames by common columns or row names. We can use the option by= to specify the index column.
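A minimal sketch (the second data frame df2 is assumed here; it matches the outputs below):
df2 <- data.frame(a = c(1, 1, 7, 6, 8), c = 1:5)   # assumed second data frame
df2
merge(df1, df2, by = "a")                          # join df1 and df2 on the shared column a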
## a c
## 1 1 1
## 2 1 2
## 3 7 3
## 4 6 4
## 5 8 5
## a b c
## 1 1 1 1
## 2 1 1 2
## 3 1 1 1
## 4 1 1 2
## 5 6 9 4
## 6 7 NA 3
## 7 8 8 5
xtabs(a ~ b, data=x) reports specific factorized contingency tables. The example below uses the 1973 UC Berkeley admissions dataset to report gender-by-status breakdown.
DF <- as.data.frame(UCBAdmissions)
## 'DF' is a data frame with a grid of the factors and the counts
## in variable 'Freq'.
DF
## Admit Gender Dept Freq
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
## 7 Admitted Female B 17
## 8 Rejected Female B 8
## 9 Admitted Male C 120
## 10 Rejected Male C 205
## 11 Admitted Female C 202
## 12 Rejected Female C 391
## 13 Admitted Male D 138
## 14 Rejected Male D 279
## 15 Admitted Female D 131
## 16 Rejected Female D 244
## 17 Admitted Male E 53
## 18 Rejected Male E 138
## 19 Admitted Female E 94
## 20 Rejected Female E 299
## 21 Admitted Male F 22
## 22 Rejected Male F 351
## 23 Admitted Female F 24
## 24 Rejected Female F 317
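The cross-tabulations shown below can be produced along these lines (a minimal sketch matching the outputs):
xtabs(Freq ~ Gender + Admit, data = DF)   # gender-by-admission-status table
summary(xtabs(Freq ~ ., data = DF))       # summary of the full three-factor table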
## Admit
## Gender Admitted Rejected
## Male 1198 1493
## Female 557 1278
## Call: xtabs(formula = Freq ~ ., data = DF)
## Number of cases in table: 4526
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 2000.3, df = 16, p-value = 0
aggregate(x, by, FUN) splits the data frame \(x\) into subsets, computes summary statistics for each part, and reports the results. by is a list of grouping elements, each of which has the same length as the variables in \(x\). For example, we can apply the function sum to the data frame df3 subject to the index created by list(rep(1:3, length=7)).
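A minimal sketch (the data frame df3 is assumed here; it matches the outputs below):
df3 <- data.frame(a = c(1, 1, 1, 1, 6, 7, 8),
                  b = c(1, 1, 1, 1, 9, NA, 8),
                  c = c(1, 2, 1, 2, 4, 3, 5))                # assumed example data frame
list(rep(1:3, length = 7))                                   # the grouping index
aggregate(df3, by = list(rep(1:3, length = 7)), FUN = sum)   # group-wise column sums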
## [[1]]
## [1] 1 2 3 1 2 3 1
## Group.1 a b c
## 1 1 10 10 8
## 2 2 7 10 6
## 3 3 8 NA 4
stack(x, …) transforms data stored as separate columns in a data frame, or list, into a single column vector; unstack(x, …) is the inverse of stack().
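A minimal sketch reusing the assumed df3 (it matches the outputs below):
stack(df3)             # long format: a values column plus a source indicator ind
unstack(stack(df3))    # back to the original wide layout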
## values ind
## 1 1 a
## 2 1 a
## 3 1 a
## 4 1 a
## 5 6 a
## 6 7 a
## 7 8 a
## 8 1 b
## 9 1 b
## 10 1 b
## 11 1 b
## 12 9 b
## 13 NA b
## 14 8 b
## 15 1 c
## 16 2 c
## 17 1 c
## 18 2 c
## 19 4 c
## 20 3 c
## 21 5 c
## a b c
## 1 1 1 1
## 2 1 1 2
## 3 1 1 1
## 4 1 1 2
## 5 6 9 4
## 6 7 NA 3
## 7 8 8 5
reshape(x, …) reshapes a data frame between wide format, with repeated measurements in separate columns of the same record, and long format, with the repeated measurements in separate records. We can specify the transformation direction using direction="wide" or direction="long".
df4 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
time = rep(c(1, 1, 2, 2), 3), score = rnorm(12))
wide <- reshape(df4, idvar = c("school", "class"), direction = "wide")
wide
## school class score.1 score.2
## 1 1 9 0.1816010 1.5497968
## 2 1 10 -1.1171385 -0.7183000
## 5 2 9 -0.5870414 0.4728245
## 6 2 10 0.3253032 -0.1950426
## 9 3 9 -0.7122067 1.4221143
## 10 3 10 -0.8927900 0.8552861
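The reverse transformation back to long format can be obtained along these lines (a hedged sketch; the exact call that produced the long-format output below may have differed, e.g., in the name of the value column):
reshape(wide, direction = "long")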
## school class time score.1
## 1.9.1 1 9 1 0.1816010
## 1.10.1 1 10 1 -1.1171385
## 2.9.1 2 9 1 -0.5870414
## 2.10.1 2 10 1 0.3253032
## 3.9.1 3 9 1 -0.7122067
## 3.10.1 3 10 1 -0.8927900
## 1.9.2 1 9 2 1.5497968
## 1.10.2 1 10 2 -0.7183000
## 2.9.2 2 9 2 0.4728245
## 2.10.2 2 10 2 -0.1950426
## 3.9.2 3 9 2 1.4221143
## 3.10.2 3 10 2 0.8552861
Notes: the rnorm call used in the reshape example might generate different results for each call, unless set.seed(1234) is used to ensure reproducibility of the random-number generation.

The following functions are useful for handling strings in R.
paste(…) and paste0(…) concatenate vectors after converting the arguments to a vector of characters. There are several options: sep= to use a string to separate terms (a single space is the default), and collapse= to separate "collapsed" results.
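For example (a minimal sketch; the arguments are assumptions chosen to match the outputs below):
paste("today", "is", "a", "good", "day")       # default separator is a single space
paste("today", "is a good day", sep = ", ")    # custom separator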
## [1] "today is a good day"
## [1] "today, is a good day"
substr(x, start, stop) extracts substrings from a character vector. Using substr(x, start, stop) <- value, it can also assign values (with the same length) to part of a string.
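A hedged sketch (the input string is an assumption; it matches the outputs below):
x <- "When the going gets tough, the tough get going!"   # assumed input string
substr(x, 10, 40)                                         # extract characters 10 through 40
substr(x, 1, 9) <- "........."                            # assignment form overwrites characters 1 through 9
x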
## [1] "going gets tough, the tough get"
## [1] ".........going gets tough, the tough get going!"
Note that characters at start
and stop
indexes are inclusive in the output.
strsplit(x, split) splits \(x\) according to the substring split. Use
fixed=TRUE
for non-regular expressions.
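For instance (a minimal sketch with an assumed input string):
strsplit("a.b.c", split = ".", fixed = TRUE)   # fixed=TRUE treats "." as a literal character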
## [[1]]
## [1] "a" "b" "c"
grep(pattern, x) searches for pattern matches within
\(x\) and returns a vector of the
indices of the elements of \(x\) that
had a match. Use regular expressions for pattern
(unless
fixed=TRUE
), see ?regex
for details.
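A minimal sketch (the pattern is an assumption chosen to match the outputs below):
letters                    # built-in vector of the 26 lowercase letters
grep("[a-z]", letters)     # indices of the elements matching the pattern (all 26 here)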
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26
gsub(pattern, replacement, x) replaces matching patterns in \(x\), allowing for use of regular expression matching; sub() is the same but it only replaces the first occurrence of the matched pattern.
## [1] "letters" "0" "lettersletters" "10"
## [5] ";"
## [1] "letters" "0" "lettersj" "10" ";"
tolower(x) converts strings to lowercase and toupper(x) converts to uppercase.
match(x, table) yields a vector of the positions of
first matches for the elements of \(x\)
among table
, x %in% table
returns a logical
vector.
## [1] 1 NA 2 NA NA
## [1] TRUE FALSE TRUE FALSE FALSE
pmatch(x, table) reports partial matches for the elements of \(x\).
Dates and Times
The class Date
stores calendar dates, without times.
POSIXct()
has dates and times, including time zones.
Comparisons (e.g. \(>\)),
seq()
, and difftime()
are useful to compare
dates. ?DateTimeClasses
gives more information, see also
package chron
.
The functions as.Date(s)
and as.POSIXct(s)
convert to the respective class; format(dt)
converts to a
string representation. The default string format is
2001-02-21
. These accept a second argument to specify a
format for conversion. Some common formats are:
Formats | Explanations |
---|---|
%a, %A | Abbreviated and full weekday name. |
%b, %B | Abbreviated and full month name. |
%d | Day of the month (01 … 31). |
%H | Hours (00 … 23). |
%I | Hours (01 … 12). |
%j | Day of year (001 … 366). |
%m | Month (01 … 12). |
%M | Minute (00 … 59). |
%p | AM/PM indicator. |
%S | Second as a decimal number (00 … 61). |
%U | Week (00 … 53); the first Sunday as day 1 of week 1. |
%w | Weekday (0 … 6, Sunday is 0). |
%W | Week (00 … 53); the first Monday as day 1 of week 1. |
%y | Year without century (00 … 99). Don’t use it. |
%Y | Year with century. |
%z (output only) | Offset from Greenwich; -0800 is 8 hours west of Greenwich. |
%Z (output only) | Time zone as a character string (empty if not available). |
Where leading zeros are shown they will be used on output but are
optional on input. See ?strftime
for details.
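A minimal sketch of basic date handling (the example dates and format strings are assumptions):
d1 <- as.Date("2001-02-21")                     # default ISO format
format(d1, "%A, %B %d, %Y")                     # "Wednesday, February 21, 2001" in an English locale
as.Date("02/21/2001", format = "%m/%d/%Y")      # parse a non-default input format
difftime(Sys.Date(), d1, units = "days")        # elapsed time between two dates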
The following functions represent the basic plotting functions in R. Later, in Chapter 2 (Visualization & EDA), we will discuss more elaborate visualization and exploratory data analytic strategies. A few examples of function-specific options: barplot(x) accepts horiz=TRUE for horizontal bars; stripchart(x) offers an alternative to boxplot() for small sample sizes; interaction.plot(f1, f2, y) uses the option fun to choose the summary statistic of y (by default fun=mean); surface and image plots such as image(x, y, z) and persp(x, y, z) expect dim(z)=c(length(x), length(y)) (x and y may be omitted); and termplot(mod.obj) plots the terms of a fitted regression model (mod.obj).

The following parameters are common to many plotting functions:
Parameters | Explanations |
---|---|
add=FALSE | if TRUE, superposes the plot on the previous one (if it exists) |
axes=TRUE | if FALSE, does not draw the axes and the box |
type="p" | specifies the type of plot: "p" points, "l" lines, "b" points connected by lines, "o" id. but the lines are over the points, "h" vertical lines, "s" steps (the data are represented by the top of the vertical lines), "S" id. but the data are represented at the bottom of the vertical lines |
xlim=, ylim= | specifies the lower and upper limits of the axes, for example with xlim=c(1, 10) or xlim=range(x) |
xlab=, ylab= | annotates the axes; must be variables of mode character |
main= | main title; must be a variable of mode character |
sub= | subtitle (written in a smaller font) |
Let’s look at one simple example, a quantile-quantile (Q-Q) probability plot. Suppose \(X\sim N(0,1)\) and \(Y\sim Cauchy\) represent the observed/raw and simulated/generated data for one feature (variable) in the data.
# This commented-out example illustrates a linear model based approach (below is a more direct Q-Q plot demonstration)
# X_norm1 <- rnorm(1000)
# X_norm2 <- rnorm(1000, m=-75, sd=3.7)
# X_Cauchy <- rcauchy(1000)
#
# # compare X to StdNormal distribution
# # qqnorm(X,
# # main="Normal Q-Q Plot of the data",
# # xlab="Theoretical Quantiles of the Normal",
# # ylab="Sample Quantiles of the X (Normal) Data")
# # qqline(X)
# # qqplot(X, Y)
# fit_norm_norm = lm(X_norm2 ~ X_norm1)
# fit_norm_cauchy = lm(X_Cauchy ~ X_norm1)
#
# # Get model fitted values
# Fitted.Values.norm_norm <- fitted(fit_norm_norm)
# Fitted.Values.norm_cauchy <- fitted(fit_norm_cauchy)
#
# # Extract model residuals
# Residuals.norm_norm <- resid(fit_norm_norm)
# Residuals.norm_cauchy <- resid(fit_norm_cauchy)
#
# # Compute the model standardized residuals from lm() object
# Std.Res.norm_norm <- MASS::stdres(fit_norm_norm)
# Std.Res.norm_cauchy <- MASS::stdres(fit_norm_cauchy)
#
# # Extract the theoretical (Normal) quantiles
# Theoretical.Quantiles.norm_norm <- qqnorm(Residuals.norm_norm, plot.it = F)$x
# Theoretical.Quantiles.norm_cauchy <- qqnorm(Residuals.norm_cauchy, plot.it = F)$x
#
# qq.df.norm_norm <- data.frame(Std.Res.norm_norm, Theoretical.Quantiles.norm_norm)
# qq.df.norm_cauchy <- data.frame(Std.Res.norm_cauchy, Theoretical.Quantiles.norm_cauchy)
#
# qq.df.norm_norm %>%
# plot_ly(x = ~Theoretical.Quantiles.norm_norm) %>%
# add_markers(y = ~Std.Res.norm_norm, name="Normal(0,1) vs. Normal(-75, 3.7) Data") %>%
# add_lines(x = ~Theoretical.Quantiles.norm_norm, y = ~Theoretical.Quantiles.norm_norm,
# mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>%
# layout(title = "Q-Q Normal Plot", legend = list(orientation = 'h'))
#
# # Normal vs. Cauchy
# qq.df.norm_cauchy %>%
# plot_ly(x = ~Theoretical.Quantiles.norm_cauchy) %>%
# add_markers(y = ~Std.Res.norm_cauchy, name="Normal(0,1) vs. Cauchy Data") %>%
# add_lines(x = ~Theoretical.Quantiles.norm_norm, y = ~Theoretical.Quantiles.norm_norm,
# mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>%
# layout(title = "Normal vs. Cauchy Q-Q Plot", legend = list(orientation = 'h'))
# Q-Q plot data (X) vs. simulation(Y)
#
# myQQ <- function(x, y, ...) {
# #rang <- range(x, y, na.rm=T)
# rang <- range(-4, 4, na.rm=T)
# qqplot(x, y, xlim=rang, ylim=rang)
# }
#
# myQQ(X, Y) # where the Y is the newly simulated data for X
# qqline(X)
# Sample different number of observations from all the 3 processes
X_norm1 <- rnorm(500)
X_norm2 <- rnorm(1000, m=-75, sd=3.7)
X_Cauchy <- rcauchy(1500)
# estimate the quantiles (scale the values to ensure measuring-unit invariance of both processes)
qX_norm1 <- quantile(scale(X_norm1), probs = seq(from=0.01, to=0.99, by=0.01))
qX_norm2 <- quantile(scale(X_norm2), probs = seq(from=0.01, to=0.99, by=0.01))
qq.df.norm_norm <- data.frame(qX_norm1, qX_norm2)
# Normal(0,1) vs. Normal(-75, 3.7)
qq.df.norm_norm %>%
plot_ly(x = ~qX_norm1) %>%
add_markers(y = ~qX_norm2, name="Normal(0,1) vs. Normal(-75, 3.7) Data") %>%
add_lines(x = ~qX_norm1, y = ~qX_norm1,
mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>%
layout(title = "Q-Q Normal Plot", legend = list(orientation = 'h'))
# Normal(0,1) vs. Cauchy
qX_norm1 <- quantile(X_norm1, probs = seq(from=0.01, to=0.99, by=0.01))
qX_Cauchy <- quantile(X_Cauchy, probs = seq(from=0.01, to=0.99, by=0.01))
qq.df.norm_cauchy <- data.frame(qX_norm1, qX_Cauchy)
qq.df.norm_cauchy %>%
plot_ly(x = ~qX_norm1) %>%
add_markers(y = ~qX_Cauchy, name="Normal(0,1) vs. Cauchy Data") %>%
add_lines(x = ~qX_norm1, y = ~qX_norm1,
mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>%
layout(title = "Normal vs. Cauchy Q-Q Plot", legend = list(orientation = 'h'))
The following low-level commands annotate or add elements to an existing plot:

Expression | Explanation |
---|---|
points(x, y) | adds points (the option type= can be used) |
text(x, y, labels) | adds text given by labels at coordinates (x, y); a typical use is plot(x, y, type="n"); text(x, y, names) |
mtext(text, side=3, line=0) | adds text in the margin specified by side (see axis() below); line specifies the line from the plotting area |
segments(x0, y0, x1, y1) | draws line segments from points (x0, y0) to points (x1, y1) |
arrows(x0, y0, x1, y1, angle=30, code=2) | id. with arrows; the arrow is at point (x0, y0) if code=2, at point (x1, y1) if code=1, or at both ends if code=3; angle controls the angle from the shaft of the arrow to the edge of the arrow head |
abline(a, b) | draws a line of slope b and intercept a |
abline(lm.obj) | draws the regression line given by lm.obj; abline(h=0, col=2) # color (col) is often used |
legend(x, y, legend) | adds the legend at point (x, y) with the symbols given by legend |
axis(side, vect) | adds an axis at the bottom (side=1), on the left (side=2), at the top (side=3), or on the right (side=4); vect (optional) gives the abscissa (or ordinates) where tick-marks are drawn |
locator(n, type="n", ...) | returns the coordinates (x, y) after the user has clicked n times on the plot with the mouse; also draws symbols (type="p") or lines (type="l") with respect to optional graphic parameters (…); by default nothing is drawn (type="n") |

Graphical parameters control many aspects of these plots. They can be set globally with par(…), and many can also be passed as parameters to plotting commands.
Parameters | Explanations |
---|---|
adj | controls text justification (adj=0 left-justified, adj=0.5 centered, adj=1 right-justified) |
bg | specifies the background color (e.g., bg="red", bg="blue", …); the list of the 657 available colors is displayed with colors() |
bty | controls the type of box drawn around the plot; if bty="n" the box is not drawn |
cex | controls the size of text and plotting symbols relative to the default; companion parameters control the axis annotation (cex.axis), the axis labels (cex.lab), the title (cex.main), and the subtitle (cex.sub) |
col | controls the color of symbols and lines; colors may be given by name (see colors()) or as "#RRGGBB"; see rgb(), hsv(), gray(), and rainbow(); as for cex there are col.axis, col.lab, col.main, col.sub |
font | an integer controlling the style of text (1: plain, 2: bold, 3: italic, 4: bold italic); as for cex there are font.axis, font.lab, font.main, font.sub |
lty | controls the line type; it can be an integer (1: solid, 2: dashed, 3: dotted, 4: dotdash, 5: longdash, 6: twodash) or a string of up to eight digits specifying alternating lengths of drawn and skipped segments; e.g., lty="44" will have the same effect as lty=2 |
mar | a vector of 4 values controlling the margins around the plotting region, of the form c(bottom, left, top, right); the default values are c(5.1, 4.1, 4.1, 2.1) |
mfcol, mfrow | a vector of the form c(nr, nc) which partitions the graphic window as a matrix of nr lines and nc columns; with mfcol the plots are then drawn in columns (mfrow draws them by row) |
tck | specifies the length of axis tick-marks as a fraction of the plot dimensions; if tck=1 a grid is drawn |
tcl | specifies the length of tick-marks as a fraction of the height of a line of text (by default tcl=-0.5) |
xaxt | if xaxt="n" the x-axis is set but not drawn (useful in conjunction with axis(side=1, ...)) |
yaxt | if yaxt="n" the y-axis is set but not drawn (useful in conjunction with axis(side=2, ...)) |

The lattice package (Trellis graphics) provides the following high-level plotting functions.

Expression | Explanation |
---|---|
xyplot(y~x) | bivariate plots (with many functionalities). |
barchart(y~x) | histogram of the values of y with respect to those of x. |
dotplot(y~x) | Cleveland dot plot (stacked plots line-by-line and column-by-column) |
densityplot(~x) | density functions plot |
histogram(~x) | histogram of the frequencies of x |
bwplot(y~x) | “box-and-whiskers” plot |
qqmath(~x) | quantiles of x with respect to the values expected under a theoretical distribution |
stripplot(y~x) | single dimension plot, x must be numeric, y may be a factor |
qq(y~x) | quantiles to compare two distributions, x must be numeric, y may be numeric, character, or factor but must have two “levels” |
splom(~x) | matrix of bivariate plots |
parallel(~x) | parallel coordinates plot |
levelplot(\(z\sim x*y\|g1*g2\)) | colored plot of the values of z at the coordinates given by x and y (x, y and z are all of the same length) |
wireframe(\(z\sim x*y\|g1*g2\)) | 3d surface plot |
cloud(\(z\sim x*y\|g1*g2\)) | 3d scatter plot |
In the normal Lattice formula, y~x|g1*g2, combinations of optional conditioning variables g1 and g2 are plotted on separate panels. Lattice functions take many of the same
arguments as base graphics plus also data=
the data frame
for the formula variables and subset=
for subsetting. Use
panel=
to define a custom panel function (see
apropos("panel")
and ?lines
). Lattice
functions return an object of class trellis and have to be printed to
produce the graph. Use print(xyplot(...))
inside functions
where automatic printing doesn’t work. Use lattice.theme
and lset
to change Lattice defaults.
R Programming

The standard structure of a user-defined function is:
function.name<-function(x) {
expr(an expression)
return(value)
}
Where \(x\) is the parameter in the expression. A simple example of this is:
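A minimal sketch (the particular function below is an assumption; any function returning 15 would match the output shown):
add.five <- function(x) {
  result <- x + 5      # an expression using the parameter x
  return(result)
}
add.five(10)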
## [1] 15
Conditions setting: if(cond) {expr} or if(cond) cons.expr else alt.expr.
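For instance (a hedged sketch with an assumed test value, matching the output below):
val <- -3                                   # assumed value
if (val > 0) print("T") else print("F")     # the condition is FALSE, so "F" is printed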
## [1] "F"
Alternatively, ifelse
represents a vectorized and
extremely efficient conditional mechanism that provides one of the main
advantages of R
.
For loop: for(var in seq) expr.
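A minimal sketch (an assumed loop matching the output below):
out <- NULL
for (i in 1:10) out <- c(out, i)   # accumulate the loop index at each iteration
out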
## [1] 1 2 3 4 5 6 7 8 9 10
Other loops: While loop:
while(cond) expr
, repeat: repeat expr
. Applied
to the innermost of nested loops: break
, next
.
Use braces {}
around statements.
ifelse(test, yes, no) returns a value with the same shape as test, filled with elements from yes or no depending on whether the corresponding element of test is TRUE or FALSE.
do.call(funname, args) executes a function call from the name of the function and a list of arguments to be passed to it.
Before we demonstrate how to synthetically simulate data that closely resembles the characteristics of real observations from the same process, let’s import some observed data for initial exploratory analytics.
Using the SOCR Parkinson’s Disease Case-study available in the Canvas Data Archive, we can import some data and extract some descriptions of the sample data (05_PPMI_top_UPDRS_Integrated_LongFormat1.csv).
PPMI <- read.csv("https://umich.instructure.com/files/330397/download?download_frd=1")
# summary(PPMI)
Hmisc::describe(PPMI)
## PPMI
##
## 31 Variables 1764 Observations
## --------------------------------------------------------------------------------
## FID_IID
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 3534 390.9 3054 3089
## .25 .50 .75 .90 .95
## 3272 3476 3817 4072 4102
##
## lowest : 3001 3002 3003 3004 3006, highest: 4122 4123 4126 4136 4139
## --------------------------------------------------------------------------------
## L_insular_cortex_ComputeArea
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 2255 794.6 808.5 959.5
## .25 .50 .75 .90 .95
## 1976.9 2498.7 2744.1 2962.3 3156.7
##
## lowest : 50.0355 92.941 220.452 225.999 306.361
## highest: 3474.54 3490.82 3580.98 3630.07 3650.81
## --------------------------------------------------------------------------------
## L_insular_cortex_Volume
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 6491 3255 1035 1539
## .25 .50 .75 .90 .95
## 4881 7237 8405 9616 10424
##
## lowest : 22.6262 47.417 116.775 120.79 242.591
## highest: 12172.3 12544.2 12852.2 13148.2 13499.9
## --------------------------------------------------------------------------------
## R_insular_cortex_ComputeArea
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 1711 655.8 562.4 715.0
## .25 .50 .75 .90 .95
## 1345.7 1889.6 2115.8 2329.0 2431.5
##
## lowest : 40.9245 70.1356 86.8461 129.144 159.828
## highest: 2631.22 2631.4 2637.81 2737.28 2791.92
## --------------------------------------------------------------------------------
## R_insular_cortex_Volume
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 3973 2158 592.0 886.9
## .25 .50 .75 .90 .95
## 2652.1 4386.1 5243.2 6368.4 6795.0
##
## lowest : 11.8398 32.4826 48.7982 68.7627 111.06
## highest: 7595.27 7659.64 7671.54 8122.7 8179.4
## --------------------------------------------------------------------------------
## L_cingulate_gyrus_ComputeArea
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 3315 1423 917.9 1226.7
## .25 .50 .75 .90 .95
## 2379.5 3654.1 4198.0 4705.9 5126.2
##
## lowest : 127.779 214.424 267.473 338.737 360.349
## highest: 5530.18 5562.61 5675.14 5694.19 5944.19
## --------------------------------------------------------------------------------
## L_cingulate_gyrus_Volume
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 7949 4613 872.1 1343.3
## .25 .50 .75 .90 .95
## 4173.6 8840.8 10673.2 12718.9 14393.3
##
## lowest : 57.3298 120.742 169.844 225.796 242.425
## highest: 16406.7 16515.7 16659.9 16765.9 17153.2
## --------------------------------------------------------------------------------
## R_cingulate_gyrus_ComputeArea
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 3277 1465 792.9 1144.9
## .25 .50 .75 .90 .95
## 2311.6 3658.9 4211.6 4610.4 4968.7
##
## lowest : 104.135 169.473 190.46 241.615 285.22
## highest: 5581.44 5599.98 5607.32 6076.93 6593.7
## --------------------------------------------------------------------------------
## R_cingulate_gyrus_Volume
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 7896 4624 791.1 1366.9
## .25 .50 .75 .90 .95
## 4639.3 8926.5 10719.4 12226.0 13625.3
##
## lowest : 47.6712 87.2522 102.801 189.956 193.222
## highest: 16076.2 16091.2 17213.7 18046.8 19761.8
## --------------------------------------------------------------------------------
## L_caudate_ComputeArea
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 635.6 437.7 42.06 73.96
## .25 .50 .75 .90 .95
## 277.55 694.18 945.41 1112.96 1212.08
##
## lowest : 1.78156 4.09485 4.09981 4.19613 4.19617
## highest: 1328.71 1328.74 1347.38 1350.86 1453.51
## --------------------------------------------------------------------------------
## L_caudate_Volume
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 952 827.2 15.09 32.66
## .25 .50 .75 .90 .95
## 186.65 975.10 1519.82 1906.58 2192.25
##
## lowest : 0.192801 0.630879 0.633874 0.669029 0.669087
## highest: 2581.29 2582.98 2627.04 2632.19 2746.62
## --------------------------------------------------------------------------------
## R_caudate_ComputeArea
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 869.4 484.1 55.44 112.09
## .25 .50 .75 .90 .95
## 439.06 1034.77 1184.46 1294.08 1401.43
##
## lowest : 1.78156 6.38316 6.38347 11.8365 15.8257
## highest: 1550.8 1568.45 1615.92 1666.42 1684.56
## --------------------------------------------------------------------------------
## R_caudate_Volume
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 1496 1039 21.36 51.32
## .25 .50 .75 .90 .95
## 502.03 1768.10 2135.23 2531.07 2811.08
##
## lowest : 0.192801 1.22382 1.22397 2.42792 4.23963
## highest: 3360.75 3365.66 3451.12 3464.1 3579.37
## --------------------------------------------------------------------------------
## L_putamen_ComputeArea
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 922.6 481.2 95.41 266.46
## .25 .50 .75 .90 .95
## 660.77 993.58 1229.51 1418.33 1539.90
##
## lowest : 6.75987 9.16345 15.8901 16.1646 16.3758
## highest: 1711.69 1734.75 1786.51 2060.12 2129.67
## --------------------------------------------------------------------------------
## L_putamen_Volume
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 1764 1250 54.77 190.98
## .25 .50 .75 .90 .95
## 916.76 1792.46 2507.00 3184.42 3648.16
##
## lowest : 1.2275 1.90048 3.44451 3.89709 4.11645
## highest: 4197.61 4299.28 4331.44 4363.21 4712.66
## --------------------------------------------------------------------------------
## R_putamen_ComputeArea
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 1297 535.6 303.5 457.9
## .25 .50 .75 .90 .95
## 1023.5 1463.7 1627.2 1776.9 1894.2
##
## lowest : 13.9263 18.3367 28.3 29.8511 41.4573
## highest: 2050.44 2068.86 2127.9 2249.68 2251.41
## --------------------------------------------------------------------------------
## R_putamen_Volume
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 441 1 2965 1697 236.2 433.0
## .25 .50 .75 .90 .95
## 1805.2 3380.4 3959.4 4647.8 5179.2
##
## lowest : 3.20741 4.82362 10.4877 11.3539 16.1329
## highest: 5969.68 6022.53 6359 6739.35 7096.58
## --------------------------------------------------------------------------------
## Sex
## n missing distinct Info Mean Gmd
## 1764 0 2 0.673 1.34 0.4491
##
## Value 1 2
## Frequency 1164 600
## Proportion 0.66 0.34
## --------------------------------------------------------------------------------
## Weight
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 276 1 82.05 18.51 55.4 61.8
## .25 .50 .75 .90 .95
## 70.5 81.2 90.9 103.4 110.0
##
## lowest : 43.2 45 46.2 46.7 47.3 , highest: 124 131.2 134.5 134.7 135
## --------------------------------------------------------------------------------
## ResearchGroup
## n missing distinct
## 1764 0 3
##
## Value Control PD SWEDD
## Frequency 512 1092 160
## Proportion 0.290 0.619 0.091
## --------------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 434 1 61.07 11.59 43.20 47.54
## .25 .50 .75 .90 .95
## 54.07 62.15 68.82 73.68 76.20
##
## lowest : 31.1781 31.8849 31.9205 32.2959 32.3534
## highest: 81.8411 81.9452 82.2849 82.7699 83.0329
## --------------------------------------------------------------------------------
## chr12_rs34637584_GT
## n missing distinct Info Sum Mean Gmd
## 1688 76 2 0.028 16 0.009479 0.01879
##
## --------------------------------------------------------------------------------
## chr17_rs11868035_GT
## n missing distinct Info Mean Gmd
## 1688 76 3 0.816 0.6161 0.6983
##
## Value 0 1 2
## Frequency 836 664 188
## Proportion 0.495 0.393 0.111
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## chr17_rs11012_GT
## n missing distinct Info Mean Gmd
## 1688 76 3 0.658 0.346 0.489
##
## Value 0 1 2
## Frequency 1152 488 48
## Proportion 0.682 0.289 0.028
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## chr17_rs393152_GT
## n missing distinct Info Mean Gmd
## 1688 76 3 0.723 0.4265 0.5645
##
## Value 0 1 2
## Frequency 1052 552 84
## Proportion 0.623 0.327 0.050
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## chr17_rs12185268_GT
## n missing distinct Info Mean Gmd
## 1688 76 3 0.707 0.4028 0.5399
##
## Value 0 1 2
## Frequency 1076 544 68
## Proportion 0.637 0.322 0.040
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## chr17_rs199533_GT
## n missing distinct Info Mean Gmd
## 1688 76 3 0.691 0.3791 0.514
##
## Value 0 1 2
## Frequency 1100 536 52
## Proportion 0.652 0.318 0.031
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## UPDRS_part_I
## n missing distinct Info Mean Gmd .05 .10
## 1215 549 13 0.9 1.286 1.638 0 0
## .25 .50 .75 .90 .95
## 0 1 2 3 5
##
## Value 0 1 2 3 4 5 6 7 8 9 10
## Frequency 527 296 182 96 42 33 21 6 4 3 2
## Proportion 0.434 0.244 0.150 0.079 0.035 0.027 0.017 0.005 0.003 0.002 0.002
##
## Value 11 13
## Frequency 2 1
## Proportion 0.002 0.001
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## UPDRS_part_II
## n missing distinct Info Mean Gmd .05 .10
## 1211 553 28 0.994 6.087 5.432 0.0 1.0
## .25 .50 .75 .90 .95
## 2.0 5.0 9.0 13.0 15.5
##
## lowest : 0 1 2 3 4, highest: 23 24 25 27 28
## --------------------------------------------------------------------------------
## UPDRS_part_III
## n missing distinct Info Mean Gmd .05 .10
## 1210 554 58 0.999 19.44 13.04 0 2
## .25 .50 .75 .90 .95
## 12 20 27 35 39
##
## lowest : 0 1 2 3 4, highest: 53 56 59 60 61
## --------------------------------------------------------------------------------
## time_visit
## n missing distinct Info Mean Gmd .05 .10
## 1764 0 12 0.993 23.5 20.09 0.00 3.00
## .25 .50 .75 .90 .95
## 8.25 21.00 37.50 48.00 54.00
##
## Value 0 3 6 9 12 18 24 30 36 42 48
## Frequency 147 147 147 147 147 147 147 147 147 147 147
## Proportion 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083
##
## Value 54
## Frequency 147
## Proportion 0.083
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
# data driven age estimates
m = round (mean(PPMI$Age), 2)
sd = round(sd(PPMI$Age), 2)
x.norm <- rnorm(n=200, m=m, sd=sd)
# hist(x.norm, main='N(10, 20) Histogram')
plot_ly(x = ~x.norm, type = "histogram") %>%
layout(bargap=0.1, title=paste0('N(', m, ', ', sd, ') Histogram'))
## [1] 61.07281
## [1] 10.26669
Next, we will simulate new synthetic data to match the
properties/characteristics of the observed PPMI data (using
Uniform
, Normal
, and Poisson
distributions).
# age m=62, sd=10
# Demographics variables
# Define number of subjects
NumSubj <- 282
NumTime <- 4
# Define data elements
# Cases
Cases <- c(2, 3, 6, 7, 8, 10, 11, 12, 13, 14, 17, 18, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 41, 42, 43, 44, 45, 53, 55, 58, 60, 62, 67, 69, 71, 72, 74, 79, 80, 85, 87, 90, 95, 97, 99, 100, 101, 106, 107, 109, 112, 120, 123, 125, 128, 129, 132, 134, 136, 139, 142, 147, 149, 153, 158, 160, 162, 163, 167, 172, 174, 178, 179, 180, 182, 192, 195, 201, 208, 211, 215, 217, 223, 227, 228, 233, 235, 236, 240, 245, 248, 250, 251, 254, 257, 259, 261, 264, 268, 269, 272, 273, 275, 279, 288, 289, 291, 296, 298, 303, 305, 309, 314, 318, 324, 325, 326, 328, 331, 332, 333, 334, 336, 338, 339, 341, 344, 346, 347, 350, 353, 354, 359, 361, 363, 364, 366, 367, 368, 369, 370, 371, 372, 374, 375, 376, 377, 378, 381, 382, 384, 385, 386, 387, 389, 390, 393, 395, 398, 400, 410, 421, 423, 428, 433, 435, 443, 447, 449, 450, 451, 453, 454, 455, 456, 457, 458, 459, 460, 461, 465, 466, 467, 470, 471, 472, 476, 477, 478, 479, 480, 481, 483, 484, 485, 486, 487, 488, 489, 492, 493, 494, 496, 498, 501, 504, 507, 510, 513, 515, 528, 530, 533, 537, 538, 542, 545, 546, 549, 555, 557, 559, 560, 566, 572, 573, 576, 582, 586, 590, 592, 597, 603, 604, 611, 619, 621, 623, 624, 625, 631, 633, 634, 635, 637, 640, 641, 643, 644, 645, 646, 647, 648, 649, 650, 652, 654, 656, 658, 660, 664, 665, 670, 673, 677, 678, 679, 680, 682, 683, 686, 687, 688, 689, 690, 692)
# Imaging Biomarkers
L_caudate_ComputeArea <- rpois(NumSubj, 600)
L_caudate_Volume <- rpois(NumSubj, 800)
R_caudate_ComputeArea <- rpois(NumSubj, 893)
R_caudate_Volume <- rpois(NumSubj, 1000)
L_putamen_ComputeArea <- rpois(NumSubj, 900)
L_putamen_Volume <- rpois(NumSubj, 1400)
R_putamen_ComputeArea <- rpois(NumSubj, 1300)
R_putamen_Volume <- rpois(NumSubj, 3000)
L_hippocampus_ComputeArea <- rpois(NumSubj, 1300)
L_hippocampus_Volume <- rpois(NumSubj, 3200)
R_hippocampus_ComputeArea <- rpois(NumSubj, 1500)
R_hippocampus_Volume <- rpois(NumSubj, 3800)
cerebellum_ComputeArea <- rpois(NumSubj, 16700)
cerebellum_Volume <- rpois(NumSubj, 14000)
L_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
L_lingual_gyrus_Volume <- rpois(NumSubj, 11000)
R_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_lingual_gyrus_Volume <- rpois(NumSubj, 12000)
L_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3600)
L_fusiform_gyrus_Volume <- rpois(NumSubj, 11000)
R_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_fusiform_gyrus_Volume <- rpois(NumSubj, 10000)
Sex <- ifelse(runif(NumSubj)<.5, 0, 1)
Weight <- as.integer(rnorm(NumSubj, 80, 10))
Age <- as.integer(rnorm(NumSubj, 62, 10))
# Diagnosis
Dx <- c(rep("PD", 100), rep("HC", 100), rep("SWEDD", 82))
# Genetics
chr12_rs34637584_GT <- c(ifelse(runif(100)<.3, 0, 1), ifelse(runif(100)<.6, 0, 1), ifelse(runif(82)<.4, 0, 1)) # NumSubj Bernoulli trials
chr17_rs11868035_GT <- c(ifelse(runif(100)<.7, 0, 1), ifelse(runif(100)<.4, 0, 1), ifelse(runif(82)<.5, 0, 1)) # NumSubj Bernoulli trials
# Clinical # rpois(NumSubj, 15) + rpois(NumSubj, 6)
UPDRS_part_I <- c( ifelse(runif(100)<.7, 0, 1) + ifelse(runif(100) < .7, 0, 1),
ifelse(runif(100)<.6, 0, 1)+ ifelse(runif(100)<.6, 0, 1),
ifelse(runif(82)<.4, 0, 1)+ ifelse(runif(82)<.4, 0, 1) )
UPDRS_part_II <- c(sample.int(20, 100, replace=T), sample.int(14, 100, replace=T),
sample.int(18, 82, replace=T) )
UPDRS_part_III <- c(sample.int(30, 100, replace=T), sample.int(20, 100, replace=T),
sample.int(25, 82, replace=T) )
# Time: VisitTime - done automatically below in aggregator
# Data (putting all components together)
sim_PD_Data <- cbind(
rep(Cases, each= NumTime), # Cases
rep(L_caudate_ComputeArea, each= NumTime), # Imaging
rep(Sex, each= NumTime), # Demographics
rep(Weight, each= NumTime),
rep(Age, each= NumTime),
rep(Dx, each= NumTime), # Dx
rep(chr12_rs34637584_GT, each= NumTime), # Genetics
rep(chr17_rs11868035_GT, each= NumTime),
rep(UPDRS_part_I, each= NumTime), # Clinical
rep(UPDRS_part_II, each= NumTime),
rep(UPDRS_part_III, each= NumTime),
rep(c(0, 6, 12, 18), NumSubj) # Time
)
# Assign the column names
colnames(sim_PD_Data) <- c(
"Cases",
"L_caudate_ComputeArea",
"Sex", "Weight", "Age",
"Dx", "chr12_rs34637584_GT", "chr17_rs11868035_GT",
"UPDRS_part_I", "UPDRS_part_II", "UPDRS_part_III",
"Time"
)
# some QC
summary(sim_PD_Data)
## Cases L_caudate_ComputeArea Sex Weight
## Length:1128 Length:1128 Length:1128 Length:1128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Age Dx chr12_rs34637584_GT chr17_rs11868035_GT
## Length:1128 Length:1128 Length:1128 Length:1128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## Length:1128 Length:1128 Length:1128 Length:1128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## [1] 1128 12
## Cases L_caudate_ComputeArea Sex Weight Age Dx chr12_rs34637584_GT
## [1,] "2" "588" "1" "71" "70" "PD" "1"
## [2,] "2" "588" "1" "71" "70" "PD" "1"
## [3,] "2" "588" "1" "71" "70" "PD" "1"
## [4,] "2" "588" "1" "71" "70" "PD" "1"
## [5,] "3" "610" "1" "82" "76" "PD" "0"
## [6,] "3" "610" "1" "82" "76" "PD" "0"
## chr17_rs11868035_GT UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## [1,] "0" "0" "4" "10" "0"
## [2,] "0" "0" "4" "10" "6"
## [3,] "0" "0" "4" "10" "12"
## [4,] "0" "0" "4" "10" "18"
## [5,] "0" "2" "7" "17" "0"
## [6,] "0" "2" "7" "17" "6"
# hist(PPMI$Age, freq=FALSE, right=FALSE, ylim = c(0,0.05))
# lines(density(as.numeric(as.data.frame(sim_PD_Data)$Age)), lwd=2, col="blue")
# legend("topright", c("Raw Data", "Simulated Data"), fill=c("black", "blue"))
x <- PPMI$Age
fit <- density(as.numeric(as.data.frame(sim_PD_Data)$Age))
plot_ly(x = x, type = "histogram", name = "Histogram (Raw Age)") %>%
add_trace(x = fit$x, y = fit$y, type = "scatter", mode = "lines",
fill = "tozeroy", yaxis = "y2", name = "Density (Simulated Age)") %>%
layout(title='Observed and Simulated Ages', yaxis2 = list(overlaying = "y", side = "right"))
The Tidyverse represents a suite of integrated R packages that provide support for data science and Big Data analytics, including functionality for data import (readr), data manipulation (dplyr), data visualization (ggplot2), expanded data frames (tibble), data tidying (tidyr), and functional programming (purrr). These learning modules provide an introduction to the tidyverse.
R documentation and resources

Useful references include the R Book (Verzani) and additional R Resources. SOCR Datasets can automatically be downloaded into the R environment using the following protocol, which uses the Parkinson’s Disease dataset as an example:
library(rvest)
# Loading required package: xml2
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata") # UMich SOCR Data
# wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_PD_BiomedBigMetadata") # UCLA SOCR Data
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body" role="main">\n\t\t\t<a id="top"></a>\n\ ...
## # A tibble: 2 × 12
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 (default) Deutsch Españ… Fran… Ital… Port… 日本… Бълг… الام… Suomi इस भ… Norge
## 2 한국어 中文 繁体… Русс… Nede… Ελλη… Hrva… Česk… Danm… Pols… Româ… Sver…
## X1 X2 X3 X4
## Length:2 Length:2 Length:2 Length:2
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X5 X6 X7 X8
## Length:2 Length:2 Length:2 Length:2
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## X9 X10 X11 X12
## Length:2 Length:2 Length:2 Length:2
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
Also see the SMHS Simulation Primer.
R Debugging

Most programs that give incorrect results are impacted by logical errors. When errors (bugs, exceptions) occur, we need to explore deeper; this procedure of identifying and fixing bugs is called "debugging".

R tools for debugging include traceback(), debug(), browser(), trace(), and recover().
traceback(): when an R function fails, the run-time error is immediately reported to the screen. Calling traceback() shows the place where the error occurred; it prints the list of functions that were called before the error occurred, with the stacked function calls printed in reverse order.
f1 <- function(x) { r <- x - g1(x); r }
g1 <- function(y) { r <- y * h1(y); r }
h1 <- function(z) { r <- log(z); if (r < 10) r^2 else r^3 }
f1(-1)
## Warning in log(z): NaNs produced
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed
## Error in h(y): could not find function "h"
## Error in g(x): could not find function "g"
## Error in f(-1): could not find function "f"
debug(): traceback() does not tell you where the error is. To find out which line causes the error, we may step through the function using debug().
debug(foo) flags the function foo()
for
debugging. undebug(foo)
unflags the function. When a
function is flagged for debugging, each statement in the function is
executed one at a time. After a statement is executed, the function
suspends and the user can interact with the R
shell. This
allows us to inspect a function line-by-line.
Example: compute sum of squared error SS.
## compute sum of squares
SS <- function(mu, x) {
d<-x-mu;
d2<-d^2;
ss<-sum(d2);
ss
}
set.seed(100)
x <- rnorm(100)
debug(SS)    # flag SS() for debugging; the call below then steps through the function
SS(1, x)
## debugging in: SS(1, x)
## debug at <text>#2: {
## d <- x - mu
## d2 <- d^2
## ss <- sum(d2)
## ss
## }
## debug at <text>#3: d <- x - mu
## debug at <text>#4: d2 <- d^2
## debug at <text>#5: ss <- sum(d2)
## debug at <text>#6: ss
## exiting from: SS(1, x)
## [1] 202.5615
In the debugging shell ("Browse[1]>"), users can enter n (or simply press Enter) to execute the next statement, type ls() to list the objects in the local environment, evaluate any R expression (e.g., print a variable or draw a plot), type where to show the current position in the call stack, type c to continue execution to the end of the function, and type Q to quit debugging. For example, an annotated debugging session for SS(1, x) might look like this:
Browse[1]> n
debug: d <- x - mu                      ## the next command
Browse[1]> ls()                         ## current environment
[1] "mu" "x"                            ## there is no d yet
Browse[1]> n                            ## go one step
debug: d2 <- d^2                        ## the next command
Browse[1]> ls()                         ## current environment
[1] "d" "mu" "x"                        ## d has been created
Browse[1]> d[1:3]                       ## first three elements of d
[1] -1.5021924 -0.8684688 -1.0789171
Browse[1]> hist(d)                      ## histogram of d
Browse[1]> where                        ## current position in call stack
where 1: SS(1, x)
Browse[1]> n
debug: ss <- sum(d2)
Browse[1]> Q                            ## quit the debugger
undebug(SS)                             ## remove the debug flag, stop debugging
SS(1, x)                                ## calling SS again now runs without debugging
You can label a function for debugging while debugging another function.
f <- function(x) {
  r <- x - g(x)
  r
}
g <- function(y) {
  r <- y * h(y)
  r
}
h <- function(z) {
  r <- log(z)
  if (r < 10) r^2
  else r^3
}
debug(f) # ## If you only debug f, you will not go into g
f(-1)
## Warning in log(z): NaNs produced
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed
Browse[1]> n
Browse[1]> n
But we can also label g and h for debugging while we debug f.
f(-1)
Browse[1]> n
Browse[1]> debug(g)
Browse[1]> debug(h)
Browse[1]> n
Inserting a call to browser() in a function will pause the execution of a function at the point where browser() is called. Similar to using debug(), except that you can control where execution gets paused. Here is another example.
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed
Browse[1]> ls() Browse[1]> z
Browse[1]> n
Browse[1]> n
Browse[1]> ls()
Browse[1]> c
Calling trace() on a function allows inserting new code into it.
as.list(body(h))                                    # inspect the body of h to find the step numbers
trace("h", quote(if (is.nan(r)) { browser() }), at = 3, print = FALSE)
f(1)
f(-1)
trace("h", quote(if (z < 0) { z <- 1 }), at = 2, print = FALSE)
f(-1)
untrace("h")                                        # remove the tracing code from h
During the debugging process, recover() allows
checking the status of variables in upper level functions. recover() can
be used as an error handler using options()
(e.g. options(error=recover)
). When functions throw
exceptions, execution stops at the point of failure. Browsing the
function calls and examining the environment may indicate the source of
the problem.