We start the DSPA journey with an overview of the mission and objectives of this textbook. Some early examples of driving motivational problems and challenges provide context into the common characteristics of big (biomedical and health) data. We will define data science and predictive analytics and emphasize the importance of their ethical, responsible, and reproducible practical use. This chapter also covers the foundations of R, contrasts R against other languages and computational data science platforms and introduces basic functions and data objects, formats, and simulation.

1 Motivation

Let’s start with a quick overview illustrating some common data science challenges, qualitative descriptions of the fundamental principles, and awareness about the power and potential pitfalls of modern data-driven scientific inquiry.

1.1 DSPA Mission and Objectives

The second edition of this textbook (DSPA2)is based on the HS650: Data Science and Predictive Analytics (DSPA) course I teach at the University of Michigan and the first DSPA edition. These materials collectively aim to provide learners with a deep understanding of the challenges, appreciation of the enormous opportunities, and a solid methodological foundation for designing, collecting, managing, processing, interrogating, analyzing and interpreting complex health and biomedical data. Readers that finish this course of training and successfully complete the examples and assignments included in the book will gain unique skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.

Vision: Enable active-learning by integrating driving motivational challenges with mathematical foundations, computational statistics, and modern scientific inference
Values: Effective, reliable, reproducible, and transformative data-driven discovery supporting open-science
Strategic priorities: Trainees will develop scientific intuition, computational skills, and data-wrangling abilities to tackle Big biomedical and health data problems. Instructors will provide well-documented R-scripts and software recipes implementing atomic data-filters as well as complex end-to-end predictive big data analytics solutions.

Before diving into the mathematical algorithms, statistical computing methods, software tools, and health analytics covered in the remaining chapters, we will discuss several driving motivational problems. These will ground all the subsequent scientific discussions, data modeling, and computational approaches.

1.2 Examples of driving motivational problems and challenges

For each of the studies below, we illustrate several clinically-relevant scientific questions, identify appropriate data sources, describe the types of data elements, and pinpoint various complexity challenges.

1.2.1 Alzheimer’s Disease

Identify the relation between observed clinical phenotypes and expected behavior;
Prognosticate future cognitive decline (3-12 months prospectively) as a function of imaging data and clinical assessment (both model-based and model-free machine learning prediction methods will be used);
Derive and interpret the classifications of subjects into clusters using the harmonized and aggregated data from multiple sources.

Data Source	Sample Size/Data Type	Summary
ADNI Archive	Clinical data: demographics, clinical assessments, cognitive assessments; Imaging data: sMRI, fMRI, DTI, PiB/FDG PET; Genetics data: Ilumina SNP genotyping; Chemical biomarker: lab tests, proteomics. Each data modality comes with a different number of cohorts. Generally, $200\le N \le 1200$. For instance, previously conducted ADNI studies with N>500 [ doi: 10.3233/JAD-150335, doi: 10.1111/jon.12252, doi: 10.3389/fninf.2014.00041]	ADNI provides interesting data modalities, multiple cohorts (e.g., early-onset, mild, and severe dementia, controls) that allow effective model training and validation NACC Archive

1.2.2 Parkinson’s Disease

Predict the clinical diagnosis of patients using all available data (with and without the UPDRS clinical assessment, which is the basis of the clinical diagnosis by a physician);
Compute derived neuroimaging and genetics biomarkers that can be used to model the disease progression and provide automated clinical decisions support;
Generate decision trees for numeric and categorical responses (representing clinically relevant outcome variables) that can be used to suggest an appropriate course of treatment for specific clinical phenotypes.

Data Source	Sample Size/Data Type	Summary
PPMI Archive	Demographics: age, medical history, sex; Clinical data: physical, verbal learning and language, neurological and olfactory (University of Pennsylvania Smell Identification Test, UPSIT) tests), vital signs, MDS-UPDRS scores (Movement Disorder; Society-Unified Parkinson’s Disease Rating Scale), ADL (activities of daily living), Montreal Cognitive Assessment (MoCA), Geriatric Depression Scale (GDS-15); Imaging data: structural MRI; Genetics data: llumina ImmunoChip (196,524 variants) and NeuroX (covering 240,000 exonic variants) with 100% sample success rate, and 98.7% genotype success rate genotyped for APOE e2/e3/e4. Three cohorts of subjects; Group 1 = {de novo PD Subjects with a diagnosis of PD for two years or less who are not taking PD medications}, N1 = 263; Group 2 = {PD Subjects with Scans without Evidence of a Dopaminergic Deficit (SWEDD)}, N2 = 40; Group 3 = {Control Subjects without PD who are 30 years or older and who do not have a first degree blood relative with PD}, N3 = 127	The longitudinal PPMI dataset including clinical, biological and imaging data (screening, baseline, 12, 24, and 48 month follow-ups) may be used conduct model-based predictions as well as model-free classification and forecasting analyses

1.2.3 Drug and substance use

Is the Risk for Alcohol Withdrawal Syndrome (RAWS) screen a valid and reliable tool for predicting alcohol withdrawal in an adult medical inpatient population?
What is the optimal cut-off score from the AUDIT-C to predict alcohol withdrawal based on RAWS screening?
Should any items be deleted from, or added to, the RAWS screening tool to enhance its performance in predicting the emergence of alcohol withdrawal syndrome in an adult medical inpatient population?

Data Source	Sample Size/Data Type	Summary
MAWS Data / UMHS EHR / WHO AWS Data	Scores from Alcohol Use Disorders Identification Test-Consumption (AUDIT-C) [49], including dichotomous variables for any current alcohol use (AUDIT-C, question 1), total AUDIT-C score > 8, and any positive history of alcohol withdrawal syndrome (HAWS)	~1,000 positive cases per year among 10,000 adult medical inpatients, % RAWS screens completed, % positive screens, % entered into MAWS protocol who receive pharmacological treatment for AWS, % entered into MAWS protocol without a completed RAWS screen

1.2.4 Amyotrophic lateral sclerosis

Identify the most highly-significant variables that have power to jointly predict the progression of ALS (in terms of clinical outcomes like ALSFRS and muscle function)
Provide a decision tree prediction of adverse events based on subject phenotype and 0-3 month clinical assessment changes

Data Source	Sample Size/Data Type	Summary
ProAct Archive	Over 100 clinical variables are recorded for all subjects including: Demographics: age, race, medical history, sex; Clinical data: Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS), adverse events, onset_delta, onset_site, drugs use (riluzole) The PRO-ACT training dataset contains clinical and lab test information of 8,635 patients. Information of 2,424 study subjects with valid gold standard ALSFRS slopes will be used in out processing, modeling and analysis	The time points for all longitudinally varying data elements will be aggregated into signature vectors. This will facilitate the modeling and prediction of ALSFRS slope changes over the first three months (baseline to month 3)

1.2.5 Normal Brain Visualization

The SOCR Brain Visualization App has preloaded sMRI, ROI labels, and fiber track models for a normal brain. It also allows users to drag-and-drop their data into the browser to visualize and navigate through the stereotactic data (including imaging, parcellations and tractography).

1.2.6 Neurodegeneration

A recent study of Structural Neuroimaging in Alzheimer’s Disease illustrates the Big Data challenges in modeling complex neuroscientific data. Specifically, 808 ADNI subjects were divided into 3 groups: 200 subjects with Alzheimer’s disease (AD), 383 subjects with mild cognitive impairment (MCI), and 225 asymptomatic normal controls (NC). Their sMRI data were parcellated using BrainParser, and the 80 most important neuroimaging biomarkers were extracted using the global shape analysis Pipeline workflow. Using a pipeline implementation of Plink, the authors obtained 80 SNPs highly-associated with the imaging biomarkers. The authors observed significant correlations between genetic and neuroimaging phenotypes in the 808 ADNI subjects. These results suggest that differences between AD, MCI, and NC cohorts may be examined by using powerful joint models of morphometric, imaging and genotypic data.

1.2.7 Genomics computing

1.2.7.1 Genetic Forensics - 2013-2016 Ebola Outbreak

This HHMI disease detective activity illustrates genetic analysis of sequences of Ebola viruses isolated from patients in Sierra Leone during the Ebola outbreak of 2013-2016. Scientists track the spread of the virus using the fact that most of the genome is identical among individuals of the same species, most similar for genetically related individuals, and more different as the hereditary distance increases. DNA profiling capitalizes on these genetic differences. In particular, in regions of noncoding DNA, which is DNA that is not transcribed and translated into a protein. Variations in noncoding regions impact less individual’s traits. Such changes in noncoding regions may be immune to natural selection. DNA variations called short tandem repeats (STRs) are short bases, typically 2-5 bases long, that repeat multiple times. The repeat units are found at different locations, or loci, throughout the genome. Every STR has multiple alleles. These allele variants are defined by the number of repeat units present or by the length of the repeat sequence. STR are surrounded by non-variable segments of DNA known as flanking regions. The STR allele in the Figure below could be denoted by “6”, as the repeat unit (GATA) repeats 6 times, or as 70 base pairs (bps) because its length is 70 bases in length, including the starting/ending flanking regions. Different alleles of the same STR may correspond to different numbers of GATA repeats, with the same flanking regions.

1.2.7.2 Next Generation Sequence (NGS) Analysis

Whole-genome and exome sequencing include essential clues for identifying genes responsible for simple Mendelian inherited disorders. This paper proposed methods can be applied to complex disorders based on population genetics. Next generation sequencing (NGS) technologies include bioinformatics resources to analyze the dense and complex sequence data. The Graphical Pipeline for Computational Genomics (GPCG) performs the computational steps required to analyze NGS data. The GPCG implements flexible workflows for basic sequence alignment, sequence data quality control, single nucleotide polymorphism analysis, copy number variant identification, annotation, and visualization of results. Applications of NGS analysis provide clinical utility for identifying miRNA signatures in diseases. Enabling hypotheses testing about the functional role of variants in the human genome will help to pinpoint the genetic risk factors of many diseases (e.g., neuropsychiatric disorders).

1.2.7.3 Neuroimaging-genetics

A computational infrastructure for high-throughput neuroimaging-genetics (doi: 10.3389/fninf.2014.00041) facilitates the data aggregation, harmonization, processing and interpretation of multisource imaging, genetics, clinical and cognitive data. A unique feature of this architecture is the graphical user interface to the Pipeline environment. Through its client-server architecture, the Pipeline environment provides a graphical user interface for designing, executing, monitoring, validating, and disseminating complex protocols that utilize diverse suites of software tools and web-services. These pipeline workflows are represented as portable XML objects, which transfer the execution instructions and user specifications from the client user machine to remote pipeline servers for distributed computing. Using Alzheimer’s and Parkinson’s data, this study provides examples of translational applications using this infrastructure

1.3 Common Characteristics of Big (Biomedical and Health) Data

Software developments, student training, utilization of Cloud or IoT service platforms, and methodological advances associated with Big Data Discovery Science all present existing opportunities for learners, educators, researchers, practitioners and policy makers, alike. A review of many biomedical, health informatics and clinical studies suggests that there are indeed common characteristics of complex big data challenges. For instance, imagine analyzing observational data of thousands of Parkinson’s disease patients based on tens-of-thousands of signature biomarkers derived from multi-source imaging, genetics, clinical, physiologic, phenomics and demographic data elements. IBM had defined the qualitative characteristics of Big Data as 4V’s: Volume, Variety, Velocity and Veracity (there are additional V-qualifiers that can be added).

More recently (PMID:26998309) we defined a constructive characterization of Big Data that clearly identifies the methodological gaps and necessary tools:

BD Dimensions	Tools
Size	Harvesting and management of vast amounts of data
Complexity	Wranglers for dealing with heterogeneous data
Incongruency	Tools for data harmonization and aggregation
Multi-source	Transfer and joint modeling of disparate elements
Multi-scale	Macro to meso to micro scale observations
Time	Techniques accounting for longitudinal patterns in the data
Incomplete	Reliable management of missing data

1.4 Data Science

Data science is an emerging new field that (1) is extremely transdisciplinary - bridging between the theoretical, computational, experimental, and biosocial areas, (2) deals with enormous amounts of complex, incongruent and dynamic data from multiple sources, and (3) aims to develop algorithms, methods, tools and services capable of ingesting such datasets and generating semi-automated decision support systems. The latter can mine the data for patterns or motifs, predict expected outcomes, suggest clustering or labeling of retrospective or prospective observations, compute data signatures or fingerprints, extract valuable information, and offer evidence-based actionable knowledge. Data science techniques often involve data manipulation (wrangling), data harmonization and aggregation, exploratory or confirmatory data analyses, predictive analytics, validation and fine-tuning.

1.5 Predictive Analytics

Predictive analytics is the process of utilizing advanced mathematical formulations, powerful statistical computing algorithms, efficient software tools and services to represent, interrogate and interpret complex data. As its name suggests, a core aim of predictive analytics is to forecast trends, predict patterns in the data, or prognosticate the process behavior either within the range or outside the range of the observed data (e.g., in the future, or at locations where data may not be available). In this context, process refers to a natural phenomenon that is being investigated by examining proxy data. Presumably, by collecting and exploring the intrinsic data characteristics, we can track the behavior and unravel the underlying mechanism of the system.

The fundamental goal of predictive analytics is to identify relationships, associations, arrangements or motifs in the dataset, in terms of space, time, features (variables) that may reduce the dimensionality of the data, i.e., its complexity. Using these process characteristics, predictive analytics may predict unknown outcomes, produce estimations of likelihoods or parameters, generate classification labels, or contribute other aggregate or individualized forecasts. We will discuss how the outcomes of these predictive analytics can be refined, assessed and compared, e.g., between alternative methods. The underlying assumptions of the specific predictive analytics technique determine its usability, affect the expected accuracy, and guide the (human) actions resulting from the (machine) forecasts. In this textbook, we will discuss supervised and unsupervised, model-based and model-free, classification and regression, as well as deterministic, stochastic, classical and machine learning-based techniques for predictive analytics. The type of the expected outcome (e.g., binary, polytomous, probability, scalar, vector, tensor, etc.) determines if the predictive analytics strategy provides prediction, forecasting, labeling, likelihoods, grouping or motifs.

1.6 High-throughput Big Data Analytics

The Pipeline Environment provides a large tool chest of software and services that can be integrated, merged and processed. The Pipeline workflow library and the workflow miner illustrate much of the functionality that is available. A Java-based and an HTML5 webapp based graphical user interfaces provide access to a powerful 4,000 core grid compute server.

1.7 Examples of data repositories, archives and services

There are many sources of data available on the Internet. A number of them provide open-access to the data based on FAIR (Findable, Accessible, Interoperable, Reusable) principles. Below are examples of open-access data sources that can be used to test the techniques presented in the textbook. We demonstrate the tasks of retrieval, manipulation, processing, analytics and visualization using example datasets from these archives.

1.8 Responsible Data Science and Ethical Predictive Analytics

In addition to being data-literate and skilled artisans, all data scientists, quantitative analysts, and informaticians need to be aware of certain global societal norms and exhibit professional work ethics that ensure the appropriate use, result reproducibility, unbiased reporting, as well as expected and unanticipated interpretations of data, analytical methods, and novel technologies. Examples of this basic etiquette include (1) promoting FAIR (findable, accessible, interoperable, reusable, and reproducible resource) sharing principles, (2) ethical conduct of research, (3) balancing and explicating potential benefits and probable detriments of findings, (4) awareness of relevant legislation, codes of practice, and respect for privacy, security and confidentiality of sensitive data, and (5) document provenance, attributions and longevity of resources.

1.8.2 Research ethics

Ethical data science and predictive analytics research demands responsible scientific conduct and integrity in all aspects of practical scientific investigation and discovery. All analysts should be aware of, and practice, established professional norms and ethical principles in planning, designing, implementing, executing, and assessing activities related to data-driven scientific research.

1.8.3 Understanding the benefits and detriments of analytical findings

Evidence and data-driven discovery is often bound to generate both questions and answers, some of which may be unexpected, undesired, or detrimental. Quantitative analysts are responsible for validating all their results, as well as for balancing and explicating all potential benefits and enumerating all probable detriments of positive and negative findings.

1.8.4 Regulatory and practical issues in handling sensitive data

Decisions on security, privacy, and confidentiality of sensitive data collected and managed by data governors, manipulated by quantitative analysts, and interpreted by policy and decision makers is not trivial. The large number of people, devices, algorithms, and services that are within arms-length of the raw data suggests a multi-tier approach for sensible protection and security of sensitive information like personal health, biometric, genetic, and proprietary data. Data security, privacy, and confidentiality of sensitive information should always be protected throughout the data life cycle. This may require preemptive, on-going, and post-hoc analyses to identify and patch potential vulnerabilities. Often, there may be tradeoffs between data value benefits and potential risks of blind automated information interpretation. Neither of the extremes is practical, sustainable, or effective.

1.8.5 Resource provenance and longevity

The digitalization of human experiences, the growth of data science, and the promise of artificial intelligence have led to enormous societal investments, excitements, and anxieties. There is a strong sentiment and anticipation that the vast amounts of available information will ubiquitously translate into quick insights, useful predictions, optimal risk-estimations, and cost-effective decisions. Proper recording of the data, algorithmic, scientific, computational, and human factors involved in these forecasts represents a critical component of data science and its essential contributions to knowledge.

1.8.6 Examples of inappropriate, fake, or malicious use of resources

Each of the complementary spaces of appropriate and inappropriate use of data science and predictive analytics resources are vast. The sections above outlined some of the guiding principles for ethical, respectful, appropriate, and responsible data-driven analytics and factual reporting. Below are some examples illustrating inappropriate use of data, resources, information, or knowledge to intentionally or unintentionally gain unfair advantage, spread fake news, misrepresent findings, or detrimental socioeconomic effects.

Attempts to re-identify sensitive information or circumvent regulatory policies, common sense norms, or agreements. For instance, Big data and advanced analytics were employed to re-identify the Massachusetts Governor William Weld’s medical record using openly released insurance dataset stripped of direct personal identifiers.
Calibrated analytics that report findings relative to status-quo alternatives and level setting expected and computed inference in the context of the application domain. For example, ignoring placebo effects, methodological assumptions, potential conflicts and biases, randomization of unknown effects, and other strategies may significantly impact the efficacy of data-driven studies.
Unintended misuse of resource access may be common practice. In 2014, an Uber employee ignored the company’s policies and used his access to track the location of a journalist who was delayed for an Uber interview. Obviously, tracking people without their explicit consent is unethical, albeit it represents an innovative use of the available technology to answer a good question.
Gaming the system for personal gain represents an intentional misuse of resources. Insider trading and opportunistic wealth management represent such examples. In 2015, an analyst at Morgan Stanley inappropriately downloaded 10% of their account data, which was used for personal enrichment.
There are plenty of examples of misuse of analytical strategies to fake the results or strengthen a point beyond the observed evidence and inappropriate use of information and advanced technology.
Big data is naturally prone to innocent errors, e.g., selection bias, methodological development and applications, computational processing, empirical estimation instability, misunderstanding of data formats and metadata understanding, as well as malicious manipulations.
Collecting, managing and processing irrelevant Big Data may yield unnecessary details, skew the understanding of the phenomenon, or distract from the main discoveries. In these situations, there may be substantial socioeconomic costs, as well as negative returns associated with lost opportunities.

1.9 DSPA Expectations

The heterogeneity of data science makes it difficult to identify a complete and exact list of prerequisites necessary to succeed in learning all the appropriate methods. However, the reader is strongly encouraged to glance over the preliminary prerequisites, the self-assessment pretest and remediation materials, and the outcome competencies. Throughout this journey, it is useful to remember the following points:

You don’t have to satisfy all prerequisites, be versed in all mathematical foundations, have substantial statistical analysis expertise, or be an experienced programmer.
You don’t have to complete all chapters and sections in the order they appear in the DSPA Topics Flowchart. Completing one, or several of the suggested pathways may be sufficient for many readers.
The DSPA textbook aims to expand the trainees’ horizons, improve the understanding, enhance the skills, and provide a set of advanced, validated, and practice-oriented code, scripts, and protocols.
To varying degrees, readers will develop abilities to skillfully utilize the tool chest of resources provided in the DSPA textbook. These resources can be revised, improved, customized and applied to other biomedicine and biosocial studies, as well as to Big Data predictive analytics challenges in other disciplines.
The DSPA materials will challenge most readers. When going gets tough, seek help, engage with fellow trainees, search for help on the web, communicate via DSPA discussion forum/chat, review references and supplementary materials. Be proactive! Remember you will gain, but it will require commitment, prolonged immersion, hard work, and perseverance. If it were easy, its value would be compromised.
When covering some chapters, few readers may be underwhelmed or bored. If you are familiar with certain topics, you can skim over the corresponding chapters/sections and move forward to the next topic. Still, it’s worth reviewing some of the examples and trying the assignment problems to ensure you have a firm grasp of the material and your technical abilities are sound.
Although the return on investment (e.g., time, effort) may vary between readers. Those that complete the DSPA textbook will discover something new, acquire some advanced skills, learn novel data analytic protocols, or conceive of a cutting-edge idea.
The complete R code (R markdown) for all examples and demonstrations presented in the textbook are available as electronic supplement.
The instructor acknowledges that these materials may be improved. If you discover problems, typos, errors, inconsistencies, or other problems, please contact us (DSPA.info @ umich.edu) to correct, expand, or polish the resources, accordingly. If you have alternative ideas, suggestions for improvements, optimized code, interesting data and case-studies, or any other refinements, please send these along, as well. All suggestions and critiques will be carefully reviewed and potentially incorporated in revisions and new editions.

2 Foundations of R

In this section, we will start with the foundations of R programming for visualization, statistical computing and scientific inference. Specifically, we will (1) discuss the rationale for selecting R as a computational platform for all DSPA demonstrations; (2) present the basics of installing shell-based R and RStudio user-interface, (3) show some simple R commands and scripts (e.g., translate long-to-wide data format, data simulation, data stratification and subsetting), (4) introduce variable types and their manipulation; (5) demonstrate simple mathematical functions, statistics, and matrix operators; (6) explore simple data visualization; and (7) introduce optimization and model fitting. The chapter appendix includes references to R introductory and advanced resources, as well as a primer on debugging.

2.1 Why use `R`?

There are many different classes of software that can be used for data interrogation, modeling, inference and statistical computing. Among these are R, Python, Java, C/C++, Perl, and many others. The table below compares R to various other statistical analysis software packages and more detailed comparison is available online.

Statistical Software	Advantages	Disadvantages
R	R is actively maintained ($\ge 100,000$ developers, $\ge 15K$ packages). Excellent connectivity to various types of data and other systems. Versatile for solving problems in many domains. It’s free, open-source code. Anybody can access/review/extend the source code. `R` is very stable and reliable. If you change or redistribute the `R` source code, you have to make those changes available for anybody else to use. `R` runs anywhere (platform agnostic). Extensibility: `R` supports extensions, e.g., for data manipulation, statistical modeling, and graphics. Active and engaged community supports R. Unparalleled question-and-answer (Q&A) websites. `R` connects with other languages (Java/C/JavaScript/Python/Fortran) & database systems, and other programs, SAS, SPSS, etc. Other packages have add-ons to connect with R. SPSS has incorporated a link to R, and SAS has protocols to move data and graphics between the two packages.	Mostly scripting language. Steeper learning curve
SAS	Large datasets. Commonly used in business & Government	Expensive. Somewhat dated programming language. Expensive/proprietary
Stata	Easy statistical analyses	Mostly classical stats
SPSS	Appropriate for beginners Simple interfaces	Weak in more cutting edge statistical procedures lacking in robust methods and survey methods

There exist substantial differences between different types of computational environments for data wrangling, preprocessing, analytics, visualization and interpretation. The table below provides some rough comparisons between some of the most popular data computational platforms. With the exception of ComputeTime, higher scores represent better performance within the specific category. Note that these are just estimates and the scales are not normalized between categories.

Language	OpenSource	Speed	ComputeTime	LibraryExtent	EaseOfEntry	Costs	Interoperability
Python	Yes	16	62	80	85	10	90
Julia	Yes	2941	0.34	100	30	10	90
R	Yes	1	745	100	80	15	90
IDL	No	67	14.77	50	88	100	20
Matlab	No	147	6.8	75	95	100	20
Scala	Yes	1428	0.7	50	30	20	40
C	Yes	1818	0.55	100	30	10	99
Fortran	Yes	1315	0.76	95	25	15	95

Let’s first look at some real peer-review publication data (1995-2015), specifically comparing all published scientific reports utilizing R, SAS and SPSS, as popular tools for data manipulation and statistical modeling. These data are retrieved using GoogleScholar literature searches.

# library(ggplot2)
# library(reshape2)
library(ggplot2)
library(reshape2)
library(plotly)
Data_R_SAS_SPSS_Pubs <- 
  read.csv('https://umich.instructure.com/files/2361245/download?download_frd=1', header=T)
df <- data.frame(Data_R_SAS_SPSS_Pubs) 
# convert to long format (http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/) 
# df <- melt(df ,  id.vars = 'Year', variable.name = 'Software') 
# ggplot(data=df, aes(x=Year, y=value, color=Software, group = Software)) + 
#   geom_line(size=4) + labs(x='Year', y='Paper Software Citations') +
#   ggtitle("Manuscript Citations of Software Use (1995-2015)") +
#   theme(legend.position=c(0.1,0.8), 
#         legend.direction="vertical",
#         axis.text.x = element_text(angle = 45, hjust = 1),
#         plot.title = element_text(hjust = 0.5))

plot_ly(df, x = ~Year)  %>%
  add_trace(y = ~R, name = 'R', mode = 'lines+markers') %>%
  add_trace(y = ~SAS, name = 'SAS', mode = 'lines+markers') %>%
  add_trace(y = ~SPSS, name = 'SPSS', mode = 'lines+markers') %>% 
  layout(title="Manuscript Citations of Software Use (1995-2015)", legend = list(orientation = 'h'))

We can also look at a dynamic Google Trends map, which provides longitudinal tracking of the number of web-searches for each of these three statistical computing platforms (R, SAS, SPSS). The figure below shows one example of the evolving software interest over the past 15 years. You can expand this plot by modifying the trend terms, expanding the search phrases, and changing the time period. Static 2004-2018 monthly data of popularity of SAS, SPSS, and R programming Google searches is saved in this file GoogleTrends_Data_R_SAS_SPSS_Worldwide_2004_2018.csv.

The example below shows a dynamic pull of $\sim20$ years of Google queries about R, SAS, SPSS, and Python, traced between 2004-01-01 and 2023-06-16.

# require(ggplot2)
# require(reshape2)
# GoogleTrends_Data_R_SAS_SPSS_Worldwide_2004_2018 <- 
#   read.csv('https://umich.instructure.com/files/9310141/download?download_frd=1', header=T)
#   # read.csv('https://umich.instructure.com/files/9314613/download?download_frd=1', header=T) # Include Python
# df_GT <- data.frame(GoogleTrends_Data_R_SAS_SPSS_Worldwide_2004_2018) 
# 
# # convert to long format 
# # df_GT <- melt(df_GT ,  id.vars = 'Month', variable.name = 'Software') 
# # 
# # library(scales)
# df_GT$Month <- as.Date(paste(df_GT$Month,"-01",sep=""))
# ggplot(data=df_GT1, aes(x=Date, y=hits, color=keyword, group = keyword)) +
#   geom_line(size=4) + labs(x='Month-Year', y='Worldwide Google Trends') +
#   scale_x_date(labels = date_format("%m-%Y"), date_breaks='4 months') +
#   ggtitle("Web-Search Trends of Statistical Software (2004-2018)") +
#   theme(legend.position=c(0.1,0.8),
#         legend.direction="vertical",
#         axis.text.x = element_text(angle = 45, hjust = 1),
#         plot.title = element_text(hjust = 0.5))


#### Pull dynamic Google-Trends data
# install.packages("prophet")
# install.packages("devtools")
# install.packages("ps"); install.packages("pkgbuild")
# devtools::install_github("PMassicotte/gtrendsR")

# Potential 429 Error, see: 
#      https://github.com/PMassicotte/gtrendsR/issues/431
#      https://github.com/trendecon/trendecon/blob/master/R/gtrends_with_backoff.R 

library(gtrendsR)
library(ggplot2)
library(prophet)
df_GT1 <- gtrends(c("R", "SAS", "SPSS", "Python"), 
                 gprop = "web", time = "2004-01-01 2023-06-16")[[1]]
                 # geo = c("US","CN","GB", "EU")
# During repeated requests, to prevent gtrends error message 
# "Status code was not 200. Returned status code:429", due to multiple queries
# we used the GoogleTrends online search to for the 4 terms
# https://trends.google.com/trends/explore?date=all&q=R,SAS,Python,SPSS&hl=en
# and saved the data to DSPA Canvas site:
# https://umich.instructure.com/courses/38100/files/folder/Case_Studies/
# https://umich.instructure.com/files/31071103/download?download_frd=1

# df_GT1_wide <- spread(df_GT1, key = keyword, value = hits)
# # colnames(df_GT1_wide)[7] <- "R"
# # colnames(df_GT1_wide) <- gsub(" ", "", colnames(data))
# # dim(df_GT1_wide ) # [1] 212   9
# 
# plot_ly(df_GT1_wide, x = ~date)  %>%
#   add_trace(x = ~date, y = ~R, name = 'R', type = 'scatter', mode = 'lines+markers') %>%
#   add_trace(x = ~date, y = ~SAS, name = 'SAS', type = 'scatter', mode = 'lines+markers') %>%
#   add_trace(x = ~date, y = ~SPSS, name = 'SPSS', type = 'scatter', mode = 'lines+markers') %>%
#   add_trace(x = ~date, y = ~Python, name = 'Python', type = 'scatter', mode = 'lines+markers') %>% 
#   layout(title="Monthly Web-Search Trends of Statistical Software (2004-2023)", 
#          legend = list(orientation = 'h'),
#          xaxis = list(title = 'Time'), 
#          yaxis = list (title = 'Relative Search Volume'))

# load the data
df_GT1 <- read.csv(
  "https://umich.instructure.com/files/31071103/download?download_frd=1", 
  header=T, as.is = T) # R_SAS_SPSS_Python_GoogleTrendsSearchDate_July_2023.csv
summary(df_GT1)

##     Month                 R               SAS             Python     
##  Length:235         Min.   : 52.00   Min.   : 5.000   Min.   : 9.00  
##  Class :character   1st Qu.: 58.00   1st Qu.: 8.000   1st Qu.:10.00  
##  Mode  :character   Median : 62.00   Median : 9.000   Median :12.00  
##                     Mean   : 64.77   Mean   : 9.111   Mean   :17.58  
##                     3rd Qu.: 69.00   3rd Qu.:11.000   3rd Qu.:25.00  
##                     Max.   :100.00   Max.   :13.000   Max.   :47.00  
##       SPSS      
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :2.068  
##  3rd Qu.:3.000  
##  Max.   :4.000

head(df_GT1)

##     Month  R SAS Python SPSS
## 1 2004-01 61  11     13    3
## 2 2004-02 59  12     13    4
## 3 2004-03 61  11     12    4
## 4 2004-04 56  11     12    4
## 5 2004-05 53  12     12    4
## 6 2004-06 56  12     12    3

# keywords = c("R", "SAS", "SPSS", "Python")
# time_period = "2004-01-01 2023-06-16"
# geo = c("US","CN","GB", "EU")
# gtrends_data <- data.frame(gtrends(keyword=keywords,
#                                    time=time_period,geo=geo)$interest_over_time)

library(tidyr)
# colnames(df_GT1_wide)[7] <- "R"
# colnames(df_GT1_wide) <- gsub(" ", "", colnames(data))
# dim(df_GT1_wide ) # [1] 212   9

plot_ly(df_GT1, x = ~Month)  %>%
  add_trace(x = ~Month, y = ~R, name = 'R', type = 'scatter', mode = 'lines+markers') %>%
  add_trace(x = ~Month, y = ~SAS, name = 'SAS', type = 'scatter', mode = 'lines+markers') %>%
  add_trace(x = ~Month, y = ~SPSS, name = 'SPSS', type = 'scatter', mode = 'lines+markers') %>%
  add_trace(x = ~Month, y = ~Python, name = 'Python', type = 'scatter', mode = 'lines+markers') %>% 
  layout(title="Monthly Web-Search Trends of Statistical Software (2004-2023)", 
         # legend = list(orientation = 'h'),
         xaxis = list(title = 'Monthly', automargin = TRUE), 
         yaxis = list (title = 'Relative Search Volume'))

2.2 Getting started with `R`

2.2.1 Install Basic Shell-based `R`

R is a free software that can be installed on any computer. The ‘R’ website is https://R-project.org. There you can install a shell-based R-environment following this protocol:

click download CRAN in the left bar (you can go directly to the UMich CRAN Server)
choose a download site
choose your operating system (e.g., Windows, Mac, Linux)
select base
choose the latest version to Download R (4.4, or higher (newer) version for your specific operating system, e.g., Windows, Linux, MacOS).

2.2.2 GUI based `R` Invocation (`RStudio`)

For many readers, its best to also install and run R via RStudio graphical user interface. To install RStudio, go to Posit downloads and do the following:

click Download RStudio
click Download RStudio Desktop
click Recommended For Your System
download the appropriate executable file (e.g., .exe) and run it (choose default answers for all questions).

2.2.3 RStudio GUI Layout

The RStudio interface consists of several windows.

Bottom left: console window (also called command window). Here you can type simple commands after the “>” prompt and R will then execute your command. This is the most important window, because this is where R actually does stuff.
Top left: editor window (also called script window). Collections of commands (scripts) can be edited and saved. When you don’t get this window, you can open it with File > New > R script. Just typing a command in the editor window is not enough, it has to get into the command window before R executes the command. If you want to run a line from the script window (or the whole script), you can click Run or press CTRL+ENTER to send it to the command window.
Top right: workspace / history window. In the workspace window, you can see which data and values R has in its memory. You can view and edit the values by clicking on them. The history window shows what has been typed before.
Bottom right: files / plots / packages / help window. Here you can open files, view plots (also previous plots), install and load packages or use the help function.

You can change the size of the windows by dragging the gray bars between the windows.

2.2.4 Software Updates

Updating and upgrading the R environment involves a three-step process:

Updating the R-core: This can be accomplished either by manually downloading and installing the latest version of R from CRAN or by auto-upgrading to the latest version of R using the R installr package. Type this in the R console: install.packages("installr"); library(installr); updateR(),
Updating RStudio: This installs new versions of RStudio using RStudio itself. Go to the Help menu and click Check for Updates, and
Updating R libraries: Go to the Tools menu and click Check for Package Updates….

Just like any other software, services, or applications, these R updates should be done regularly; preferably monthly or at least semi-annually.

2.2.5 (Optional) Install `Quarto`

Quarto is a multi-language next-gen R Markdown from Posit, the rebranded Public Benefit Corporation RStudio. Quarto includes new features and capabilities expanding existing Rmd files without further modification. It’s recommended, but not required, to install Quarto, after building R and RStudio GUI. We still edit code and markdown in the RStudio IDE, just as we normally do with any Rmd computational protocol, as well as preview the rendered document in the Viewer tab dynamically.

The Quarto markdown documents have the .qmd extension, as opposed to the classical R markdown extension (.rmd). Once knitted, the .qmd source can be rendered into many different formats, PDF, MS Word, HTML5, etc.

Quarto allows including executable expressions within markdown text by enclosing code in r expressions. For instance, we can use inline code to report dynamically in the text the number of observations in a dataset, e.g., the dimensions of the mpg dataframe are 234, 11.

Manual creation of a new qmd document is accomplished by mouse-clicking $File\to New\ File\to Quarto Document$ or by using the command palette (shortcut Ctrl+Shift+P), search for Create a new Quarto document and hit return.

Quarto includes native support for Observable JS, a set of enhancements to raw JavaScript, which provides reactive runtime useful for interactive data exploration and analysis. Observable JS (OJS) supports a hosted service for creating and publication of Rmd/Qmd/Pyton notebooks. OJS works in any Quarto document (Rmd, Jupyter, and Knitr documents) via an {ojs} executable code block.

This Posit Quarto video and the acompaning QMD slidedeck offer insights into the incredible power of markdown and interactive content integration across multiple programming languages.

2.2.6 Some notes

The basic R environment installation comes with limited core functionality. Everyone eventually will have to install more packages, e.g., reshape2, ggplot2, and we will show how to expand your Rstudio library throughout these materials.
The core R environment also has to be upgraded occasionally, e.g., every 3-6 months to get R patches, to fix known problems, and to add new functionality. This is also easy to do.
The assignment operator in R is <- (although = may also be used), so to assign a value of $2$ to a variable $x$, we can use x <- 2 or equivalently x = 2.

2.2.7 Help

R provides documentation for different R functions using the method help(). Typing help(topic) in the R console will provide detailed explanations for each R topic or function. Another way of doing it is to call ?topic, which is even easier.

For example, if I want to check the function for linear models (i.e. function lm()), I will use the following function.

help(lm)
?lm

2.2.8 Simple Wide-to-Long Data format translation

Below is a simple R script for melting a small dataset that illustrates the R syntax for variable definition, instantiation, function calls, and parameter setting.

rawdata_wide <- read.table(header=TRUE, text='
 CaseID Gender Age  Condition1  Condition2
       1   M    5   13          10.5
       2   F    6   16          11.2
       3   F    8   10          18.3
       4   M    9       9.5     18.1
       5   M    10      12.1        19
')
# Make the CaseID column a factor
rawdata_wide$subject <- factor(rawdata_wide$CaseID)

rawdata_wide

##   CaseID Gender Age Condition1 Condition2 subject
## 1      1      M   5       13.0       10.5       1
## 2      2      F   6       16.0       11.2       2
## 3      3      F   8       10.0       18.3       3
## 4      4      M   9        9.5       18.1       4
## 5      5      M  10       12.1       19.0       5

library(reshape2)

# Specify id.vars: the variables to keep (don't split apart on!)
melt(rawdata_wide, id.vars=c("CaseID", "Gender"))

##    CaseID Gender   variable value
## 1       1      M        Age     5
## 2       2      F        Age     6
## 3       3      F        Age     8
## 4       4      M        Age     9
## 5       5      M        Age    10
## 6       1      M Condition1    13
## 7       2      F Condition1    16
## 8       3      F Condition1    10
## 9       4      M Condition1   9.5
## 10      5      M Condition1  12.1
## 11      1      M Condition2  10.5
## 12      2      F Condition2  11.2
## 13      3      F Condition2  18.3
## 14      4      M Condition2  18.1
## 15      5      M Condition2    19
## 16      1      M    subject     1
## 17      2      F    subject     2
## 18      3      F    subject     3
## 19      4      M    subject     4
## 20      5      M    subject     5

There are specific options for the reshape2::melt() function, from the reshape2 R package, that control the transformation of the original (wide-format) dataset rawdata_wide into the modified (long-format) object data_long.

data_long <- melt(rawdata_wide, 
        # ID variables - all the variables to keep but not split apart on
    id.vars=c("CaseID", "Gender"), 
        # The source columns
    measure.vars=c("Age", "Condition1", "Condition2" ), 
        # Name of the destination column that will identify the original
        # column that the measurement came from
    variable.name="Feature", 
    value.name="Measurement"
)
data_long

##    CaseID Gender    Feature Measurement
## 1       1      M        Age         5.0
## 2       2      F        Age         6.0
## 3       3      F        Age         8.0
## 4       4      M        Age         9.0
## 5       5      M        Age        10.0
## 6       1      M Condition1        13.0
## 7       2      F Condition1        16.0
## 8       3      F Condition1        10.0
## 9       4      M Condition1         9.5
## 10      5      M Condition1        12.1
## 11      1      M Condition2        10.5
## 12      2      F Condition2        11.2
## 13      3      F Condition2        18.3
## 14      4      M Condition2        18.1
## 15      5      M Condition2        19.0

Below is another example showing transforming the SOCR Housing Price Dataset from its native long format to wide and back to long, demonstrating the invertible (bijective) data reformatting operations.

# Load required libraries
library(rvest)  # For web scraping the table
library(tidyr)  # For pivot_wider and pivot_longer
library(dplyr)  # For the pipe operator %>%
library(DT)     # for nice datatable reporting

# URL of the SOCR Housing Price Dataset wiki page
url <- "https://wiki.socr.umich.edu/index.php/SOCR_Data_Dinov_010309_HousingPriceIndex"

# Load the data from the HTML table on the page, note the use of the magritr piping operator %>%
page <- read_html(url)
data_long <- page %>% html_table() %>% .[[1]]

# Convert relevant columns to numeric types (they may be read as characters)
data_long <- data_long %>%
  mutate(
    Year = as.integer(Year),
    HPI = as.numeric(HPI),
    UR = as.numeric(UR),
    Pop = as.integer(Pop),
    Percent = as.numeric(Percent)
  )

# Print the first few rows of the original long format data
# print("Original Long Format:")
# head(data_long)
datatable(data_long, caption = "Original Housing Price Data in Long Format")

# Convert from long to wide format
# Each state has one row, with year-tagged columns for HPI, UR, Region, Pop, Percent
data_wide <- data_long %>%
  pivot_wider(
    names_from = Year,
    values_from = c(HPI, UR, Region, Pop, Percent)
  )

# Print the first few rows of the wide format data
# print("Wide Format:")
# head(data_wide)
datatable(data_wide, caption = "Transformed Housing Price Data in Wide Format")

# Convert back from wide to long format
data_long_back <- data_wide %>%
  pivot_longer(
    cols = -State,
    names_to = c(".value", "Year"),
    names_sep = "_",
    values_to = "value"
  ) %>%
  mutate(Year = as.integer(Year))  # Convert Year back to integer

# Print the first few rows of the restored long format data
# print("Restored Long Format:")
# head(data_long_back)
datatable(data_long_back, caption = "Reconstructed Housing Price Data Back in Long Format")

For an elaborate justification, detailed description, and multiple examples of handling long-and-wide data, messy and tidy data, and data cleaning strategies see the JSS Tidy Data article by Hadley Wickham.

2.2.9 Data generation

Popular data generation functions include c(), seq(), rep(), and data.frame(). Sometimes we use list() and array() to create data too.

c()

c() creates a (column) vector. With option recursive=T, it descends through lists combining all elements into one vector.

a<-c(1, 2, 3, 5, 6, 7, 10, 1, 4)
a

## [1]  1  2  3  5  6  7 10  1  4

c(list(A = c(Z = 1, Y = 2), B = c(X = 7), C = c(W = 7, V=3, U=-1.9)), recursive = TRUE)

##  A.Z  A.Y  B.X  C.W  C.V  C.U 
##  1.0  2.0  7.0  7.0  3.0 -1.9

When combined with list(), c() successfully created a vector with all the information in a list with three members A, B, and C.

seq(from, to)

seq(from, to) generates a sequence. Adding option by= can help us specifies increment; Option length= specifies desired length. Also, seq(along=x) generates a sequence 1, 2, ..., length(x). This is used for loops to create ID for each element in x.

seq(1, 20, by=0.5)

##  [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
## [16]  8.5  9.0  9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5
## [31] 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0

seq(1, 20, length=9)

## [1]  1.000  3.375  5.750  8.125 10.500 12.875 15.250 17.625 20.000

seq(along=c(5, 4, 5, 6))

## [1] 1 2 3 4

rep(x, times)

rep(x, times) creates a sequence that repeats x a specified number of times. The option each= also allows us to repeat first over each element of x a certain number of times.

rep(c(1, 2, 3), 4)

##  [1] 1 2 3 1 2 3 1 2 3 1 2 3

rep(c(1, 2, 3), each=4)

##  [1] 1 1 1 1 2 2 2 2 3 3 3 3

Compare this to replicating using replicate().

X <- seq(along=c(1, 2, 3)); replicate(4, X+1)

##      [,1] [,2] [,3] [,4]
## [1,]    2    2    2    2
## [2,]    3    3    3    3
## [3,]    4    4    4    4

data.frame()

The function data.frame() creates a data frame object of named or unnamed arguments. We can combine multiple vectors of different types into data frames with each vector stored as a column. Shorter vectors are automatically wrapped around to match the length of the longest vectors. With data.frame() you can mix numeric and characteristic vectors.

data.frame(v=1:4, ch=c("a", "B", "C", "d"), n=c(10, 11))

##   v ch  n
## 1 1  a 10
## 2 2  B 11
## 3 3  C 10
## 4 4  d 11

Note that the operator : generates a sequence and the expression 1:4 yields a vector of integers, from $1$ to $4$.

list()

Much like the column function c(), the function list() creates a list of the named or unnamed arguments - indexing rule: from $1$ to $n$, including $1$ and $n$. Remember that in R indexing of vectors, lists, arrays and tensors starts at $1$, not $0$, as in some other programming languages.

l<-list(a=c(1, 2), b="hi", c=-3+3i)
l

## $a
## [1] 1 2
## 
## $b
## [1] "hi"
## 
## $c
## [1] -3+3i

# Note Complex Numbers a <- -1+3i; b <- -2-2i; a+b

As R uses general objects to represent different constructs, object elements are accessible via $, @, ., and other delimiters, depending on the object type. For instance, we can refer to a member $a$ and index $i$ in the list of objects $l$ containing an element $a$ by l$a[[i]] . For example,

l$a[[2]]

## [1] 2

l$b

## [1] "hi"

array(x, dim=)

array(x, dim=) creates an array with specific dimensions. For example, dim=c(3, 4, 2) means two 3x4 matrices. We use [] to extract specific elements in the array. [2, 3, 1] means the element at the 2nd row 3rd column in the 1st page. Leaving one number in the dimensions empty would help us to get a specific row, column or page. [2, ,1] means the second row in the 1st page. See this image:

ar<-array(1:24, dim=c(3, 4, 2))
ar

## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

ar[2, 3, 1]

## [1] 8

ar[2, ,1]

## [1]  2  5  8 11

In General, multi-dimensional arrays are called “tensors” (of order=number of dimensions).

Other useful functions are:

matrix(x, nrow=, ncol=): creates matrix elements of nrow rows and ncol columns.
factor(x, levels=): encodes a vector x as a factor.
gl(n, k, length=n*k, labels=1:n): generate levels (factors) by specifying the pattern of their levels. k is the number of levels, and n is the number of replications.
expand.grid(): a data frame from all combinations of the supplied vectors or factors.
rbind() combine arguments by rows for matrices, data frames, and others
cbind() combine arguments by columns for matrices, data frames, and others

2.2.10 Input/Output (I/O)

The first pair of functions we will talk about are save() and load(), which write and import objects between the current R environment RAM memory and long term storage, e.g., hard drive, Cloud storage, SSD, etc. The script below demonstrates the basic export and import operations with simple data. Note that we saved the data in Rdata (Rda) format.

x <- seq(1, 10, by=0.5)
y <- list(a = 1, b = TRUE, c = "oops")
save(x, y, file="xy.RData")
load("xy.RData")

There are two basic functions data(x) and library(x) that load specified data sets and R packages, respectively. The R base library is always loaded by default. However, add-on libraries need to be installed first and then imported (loaded) in the working environment before functions and objects in these libraries are accessible.

data("iris")
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

library(base)

read.table(file) reads a file in table format and creates a data frame from it. The default separator sep="" is any whitespace. Use header=TRUE to read the first line as a header of column names. Use as.is=TRUE to prevent character vectors from being converted to factors. Use comment.char="" to prevent "#" from being interpreted as a comment. Use skip=n to skip n lines before reading data. See the help for options on row naming, NA treatment, and others.

The example below uses read.table() to parse and load an ASCII text file containing a simple dataset, which is available on the supporting canvas data archive.

data.txt<-read.table("https://umich.instructure.com/files/1628628/download?download_frd=1", header=T, as.is = T) # 01a_data.txt
summary(data.txt)

##      Name               Team             Position             Height    
##  Length:1034        Length:1034        Length:1034        Min.   :67.0  
##  Class :character   Class :character   Class :character   1st Qu.:72.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :74.0  
##                                                           Mean   :73.7  
##                                                           3rd Qu.:75.0  
##                                                           Max.   :83.0  
##      Weight           Age       
##  Min.   :150.0   Min.   :20.90  
##  1st Qu.:187.0   1st Qu.:25.44  
##  Median :200.0   Median :27.93  
##  Mean   :201.7   Mean   :28.74  
##  3rd Qu.:215.0   3rd Qu.:31.23  
##  Max.   :290.0   Max.   :48.52

When using R to access (read/write) data on a Cloud web service, like Instructure/Canvas or GoogleDrive/GDrive, mind that the direct URL reference to the raw file will be different from the URL of the pointer to the file that can be rendered in the browser window. For instance,

This GDrive TXT file, 1Zpw3HSe-8HTDsOnR-n64KoMRWYpeBBek (01a_data.txt),
Can be downloaded and ingested in R via this separate URL.
While the file reference is unchanged (1Zpw3HSe-8HTDsOnR-n64KoMRWYpeBBek), note the change of syntax from viewing the file in the browser, open?id=, to auto-downloading the file for R processing, uc?export=download&id=.

dataGDrive.txt<-read.table("https://drive.google.com/uc?export=download&id=1Zpw3HSe-8HTDsOnR-n64KoMRWYpeBBek", header=T, as.is = T) # 01a_data.txt
summary(dataGDrive.txt)

##      Name               Team             Position             Height    
##  Length:1034        Length:1034        Length:1034        Min.   :67.0  
##  Class :character   Class :character   Class :character   1st Qu.:72.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :74.0  
##                                                           Mean   :73.7  
##                                                           3rd Qu.:75.0  
##                                                           Max.   :83.0  
##      Weight           Age       
##  Min.   :150.0   Min.   :20.90  
##  1st Qu.:187.0   1st Qu.:25.44  
##  Median :200.0   Median :27.93  
##  Mean   :201.7   Mean   :28.74  
##  3rd Qu.:215.0   3rd Qu.:31.23  
##  Max.   :290.0   Max.   :48.52

read.csv(“filename”, header=TRUE) is identical to read.table() but with defaults set for reading comma-delimited files.

data.csv<-read.csv("https://umich.instructure.com/files/1628650/download?download_frd=1", header = T)  # 01_hdp.csv
summary(data.csv)

##    tumorsize           co2             pain           wound      
##  Min.   : 33.97   Min.   :1.222   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 62.49   1st Qu.:1.519   1st Qu.:4.000   1st Qu.:5.000  
##  Median : 70.07   Median :1.601   Median :5.000   Median :6.000  
##  Mean   : 70.88   Mean   :1.605   Mean   :5.473   Mean   :5.732  
##  3rd Qu.: 79.02   3rd Qu.:1.687   3rd Qu.:6.000   3rd Qu.:7.000  
##  Max.   :116.46   Max.   :2.128   Max.   :9.000   Max.   :9.000  
##     mobility       ntumors        nmorphine        remission     
##  Min.   :1.00   Min.   :0.000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:5.00   1st Qu.:1.000   1st Qu.: 2.000   1st Qu.:0.0000  
##  Median :6.00   Median :3.000   Median : 3.000   Median :0.0000  
##  Mean   :6.08   Mean   :3.066   Mean   : 3.624   Mean   :0.2957  
##  3rd Qu.:7.00   3rd Qu.:5.000   3rd Qu.: 5.000   3rd Qu.:1.0000  
##  Max.   :9.00   Max.   :9.000   Max.   :18.000   Max.   :1.0000  
##   lungcapacity          Age           Married      FamilyHx        
##  Min.   :0.01612   Min.   :26.32   Min.   :0.0   Length:8525       
##  1st Qu.:0.67647   1st Qu.:46.69   1st Qu.:0.0   Class :character  
##  Median :0.81560   Median :50.93   Median :1.0   Mode  :character  
##  Mean   :0.77409   Mean   :50.97   Mean   :0.6                     
##  3rd Qu.:0.91150   3rd Qu.:55.27   3rd Qu.:1.0                     
##  Max.   :0.99980   Max.   :74.48   Max.   :1.0                     
##   SmokingHx             Sex            CancerStage         LengthofStay   
##  Length:8525        Length:8525        Length:8525        Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 5.000  
##                                                           Mean   : 5.492  
##                                                           3rd Qu.: 6.000  
##                                                           Max.   :10.000  
##       WBC            RBC             BMI             IL6          
##  Min.   :2131   Min.   :3.919   Min.   :18.38   Min.   : 0.03521  
##  1st Qu.:5323   1st Qu.:4.802   1st Qu.:24.20   1st Qu.: 1.93039  
##  Median :6007   Median :4.994   Median :27.73   Median : 3.34400  
##  Mean   :5998   Mean   :4.995   Mean   :29.07   Mean   : 4.01698  
##  3rd Qu.:6663   3rd Qu.:5.190   3rd Qu.:32.54   3rd Qu.: 5.40551  
##  Max.   :9776   Max.   :6.065   Max.   :58.00   Max.   :23.72777  
##       CRP               DID          Experience       School         
##  Min.   : 0.0451   Min.   :  1.0   Min.   : 7.00   Length:8525       
##  1st Qu.: 2.6968   1st Qu.:100.0   1st Qu.:15.00   Class :character  
##  Median : 4.3330   Median :199.0   Median :18.00   Mode  :character  
##  Mean   : 4.9730   Mean   :203.3   Mean   :17.64                     
##  3rd Qu.: 6.5952   3rd Qu.:309.0   3rd Qu.:21.00                     
##  Max.   :28.7421   Max.   :407.0   Max.   :29.00                     
##     Lawsuits          HID           Medicaid     
##  Min.   :0.000   Min.   : 1.00   Min.   :0.1416  
##  1st Qu.:1.000   1st Qu.: 9.00   1st Qu.:0.3369  
##  Median :2.000   Median :17.00   Median :0.5215  
##  Mean   :1.866   Mean   :17.76   Mean   :0.5125  
##  3rd Qu.:3.000   3rd Qu.:27.00   3rd Qu.:0.7083  
##  Max.   :9.000   Max.   :35.00   Max.   :0.8187

read.delim(“filename”, header=TRUE) is very similar to the first two. However, it has defaults set for reading tab-delimited files.

Also we have read.fwf(file, widths, header=FALSE, sep="\t", as.is=FALSE) to read a table of fixed width formatted data into a data frame.

match(x, y) returns a vector of the positions of (first) matches of its first argument in its second. For a specific element in x if no element matches it in y, then the output would be NA.

match(c(1, 2, 4, 5), c(1, 4, 4, 5, 6, 7))

## [1]  1 NA  2  4

save.image(file) saves all objects in the current workspace.

write.table(x, file=““, row.names=TRUE, col.names=TRUE, sep=”“) prints x after converting to a data frame and stores it into a specified file. If quote is TRUE, character or factor columns are surrounded by quotes (”). sep is the field separator. eol is the end-of-line separator. na is the string for missing values. Use col.names=NA to add a blank column header to get the column headers aligned correctly for spreadsheet input.

Most of the I/O functions have a file argument. This can often be a character string naming a file or a connection. file="" means the standard input or output. Connections can include files, pipes, zipped files, and R variables.

On windows, the file connection can also be used with description = "clipboard". To read a table copied from Excel, use x <- read.delim("clipboard")

To write a table to the clipboard for Excel, use write.table(x, "clipboard", sep="\t", col.names=NA)

For database interaction, see packages RODBC, DBI, RMySQL, RPgSQL, and ROracle, as well as packages XML, hdf5, netCDF for reading other file formats. We will talk about some of them in later chapters.

Note, an alternative library called rio handles import/export of multiple data types with simple syntax.

2.2.11 Slicing and extracting data

The following table summarizes the basic vector indexing operations.

Expression	Explanation
`x[n]`	nth element
`x[-n]`	all but the nth element
`x[1:n]`	first n elements
`x[-(1:n)]`	elements from n+1 to the end
`x[c(1, 4, 2)]`	specific elements
`x["name"]`	element named “name”
`x[x > 3]`	all elements greater than 3
`x[x > 3 & x < 5]`	all elements between 3 and 5
`x[x %in% c("a", "and", "the")]`	elements in the given set

Indexing lists are similar but not identical to indexing vectors.

Expression	Explanation
`x[n]`	list with n elements
`x[[n]]`	nth element of the list
`x[["name"]]`	element of the list named “name”

Indexing for matrices and higher dimensional arrays (tensors) derive from vector indexing.

Expression	Explanation
`x[i, j]`	element at row i, column j
`x[i, ]`	row i
`x[, j]`	column j
`x[, c(1, 3)]`	columns 1 and 3
`x["name", ]`	row named “name”

2.2.12 Variable conversion

The following functions represent simple examples of convert data types:

as.array(x), as.data.frame(x), as.numeric(x), as.logical(x), as.complex(x), as.character(x), …

Typing methods(as) in the console will generate a complete list for variable conversion functions.

2.2.13 Variable information

The following functions verify if the input is of a specific data type:

is.na(x), is.null(x), is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x), …

For a complete list, type methods(is) in the R console. The outputs for these functions are objects, either single values (TRUE or FALSE), or objects of the same dimensions as the inputs containing a Boolean TRUE or FALSE element for each entry in the dataset.

length(x) gives us the number of elements in x.

x<-c(1, 3, 10, 23, 1, 3)
length(x)

## [1] 6

is.na(x)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

is.vector(x)

## [1] TRUE

dim(x) retrieves or sets the dimension of an array and length(y) reports the length of a list or a vector.

x<-1:12
length(x)

## [1] 12

dim(x)<-c(3, 4)
x

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

dimnames(x) retrieves or sets the dimension names of an object. For higher dimensional objects like matrices or arrays we can combine dimnames() with a list.

dimnames(x)<-list(c("R1", "R2", "R3"), c("C1", "C2", "C3", "C4"))
x

##    C1 C2 C3 C4
## R1  1  4  7 10
## R2  2  5  8 11
## R3  3  6  9 12

nrow(x) and ncol(x) report the number of rows and number of columns or a matrix.

nrow(x)

## [1] 3

ncol(x)

## [1] 4

class(x) gets or sets the class of $x$. Note that we can use unclass(x) to remove the class attribute of $x$.

class(x)

## [1] "matrix" "array"

class(x)<-"myclass"
x<-unclass(x)
x

##    C1 C2 C3 C4
## R1  1  4  7 10
## R2  2  5  8 11
## R3  3  6  9 12

attr(x, which) gets or sets the attribute which of $x$.

attr(x, "class")

## NULL

attr(x, "dim")<-c(2, 6)
x

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    3    5    7    9   11
## [2,]    2    4    6    8   10   12

The above script shows that applying unclass to $x$ sets its class to NULL.

attributes(obj) gets or sets the list of attributes of an object.

attributes(x) <- list(mycomment = "really special", dim = 3:4, 
   dimnames = list(LETTERS[1:3], letters[1:4]), names = paste(1:12))
x

##   a b c  d
## A 1 4 7 10
## B 2 5 8 11
## C 3 6 9 12
## attr(,"mycomment")
## [1] "really special"
## attr(,"names")
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12"

2.2.14 Data selection and manipulation

In this section, we will introduce some data manipulation functions. In addition, tools from dplyr provide easy dataset manipulation routines.

which.max(x) returns the index of the greatest element (max) of $x$, which.min(x) returns the index of the smallest element (min) of $x$, and rev(x) reverses the elements of $x$.

x<-c(1, 5, 2, 1, 10, 40, 3)
which.max(x)

## [1] 6

which.min(x)

## [1] 1

rev(x)

## [1]  3 40 10  1  2  5  1

sort(x) sorts the elements of $x$ in increasing order. To sort in decreasing order we can use rev(sort(x)).

sort(x)

## [1]  1  1  2  3  5 10 40

rev(sort(x))

## [1] 40 10  5  3  2  1  1

cut(x, breaks) divides $x$ into intervals with the same length (sometimes factors). The optional parameter breaks specifies the number of cut intervals or a vector of cut points. cut divides the range of $x$ into intervals coding the values in $x$ according to the intervals they fall into.

## [1]  1  5  2  1 10 40  3

cut(x, 3)

## [1] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (0.961,14] (27,40]    (0.961,14]
## Levels: (0.961,14] (14,27] (27,40]

cut(x, c(0, 5, 20, 30))

## [1] (0,5]  (0,5]  (0,5]  (0,5]  (5,20] <NA>   (0,5] 
## Levels: (0,5] (5,20] (20,30]

which(x == a) returns a vector of the indices of $x$ if the comparison operation is true. For example, it returns the value $i$, if $x[i]== a$ is TRUE. Thus, the argument of this function (like x==a) must be a Boolean variable.

## [1]  1  5  2  1 10 40  3

which(x==2)

## [1] 3

na.omit(x) suppresses the observations with missing data (NA). It suppresses the corresponding line if $x$ is a matrix or a data frame. na.fail(x) returns an error message if $x$ contains at least one NA.

df<-data.frame(a=1:5, b=c(1, 3, NA, 9, 8))
df

##   a  b
## 1 1  1
## 2 2  3
## 3 3 NA
## 4 4  9
## 5 5  8

na.omit(df)

##   a b
## 1 1 1
## 2 2 3
## 4 4 9
## 5 5 8

unique(x) If $x$ is a vector or a data frame, it returns a similar object suppressing the duplicate elements.

df1<-data.frame(a=c(1, 1, 7, 6, 8), b=c(1, 1, NA, 9, 8))
df1

##   a  b
## 1 1  1
## 2 1  1
## 3 7 NA
## 4 6  9
## 5 8  8

unique(df1)

##   a  b
## 1 1  1
## 3 7 NA
## 4 6  9
## 5 8  8

table(x) returns a table with the different values of $x$ and their frequencies (typically used for integer or factor variables). The corresponding prop.table() function transforms these raw frequencies to relative frequencies (proportions, marginal mass).

v<-c(1, 2, 4, 2, 2, 5, 6, 4, 7, 8, 8)
table(v)

## v
## 1 2 4 5 6 7 8 
## 1 3 2 1 1 1 2

prop.table(v)

##  [1] 0.02040816 0.04081633 0.08163265 0.04081633 0.04081633 0.10204082
##  [7] 0.12244898 0.08163265 0.14285714 0.16326531 0.16326531

subset(x, …) returns a selection of $x$ with respect to the specified criteria .... Typically ... are comparisons like x$V1 < 10. If $x$ is a data frame, the option select= allows using a negative sign $-$ to indicate values to keep or drop from the object.

sub <- subset(df1, df1$a>5)
sub

##   a  b
## 3 7 NA
## 4 6  9
## 5 8  8

sub <- subset(df1, select=-a)
sub

##    b
## 1  1
## 2  1
## 3 NA
## 4  9
## 5  8

## Subsampling
x <- matrix(rnorm(100), ncol = 5)
y <- c(1, seq(19))

z <- cbind(x, y)

z.df <- data.frame(z)
z.df

##            V1          V2          V3          V4          V5  y
## 1   0.6160612 -1.96351204  0.67342247  1.22115252 -0.96725412  1
## 2   3.0006883  0.50574221  1.44705457  1.36376276  1.62524717  1
## 3   1.2239988  0.14980760 -0.01151221 -0.58898420  0.08126494  2
## 4  -0.2365176  0.74068179 -0.14143408 -0.04328428  0.02865983  3
## 5   0.0835121 -1.42089809 -0.37300180 -2.57809044  0.50884660  4
## 6   0.7465206  1.29856789  0.16615770  1.70561418 -0.65975398  5
## 7  -1.5497327  1.34517808 -0.55161858 -0.01960563 -0.33846092  6
## 8  -1.7771287 -0.45393268  0.07118207 -0.90300062  1.82445220  7
## 9  -0.3235453  1.15309506 -0.20338918 -1.00044201 -0.11527707  8
## 10 -0.6286514 -1.32516011 -0.02889094  0.11376413  1.08148501  9
## 11  0.4161752  0.95420692  0.55120643 -0.97427366 -0.94648476 10
## 12 -0.3907337  0.75278645  0.12667915  1.13627322 -0.62782831 11
## 13  2.2366444  1.17235711 -0.54226895 -0.82922849  0.98156136 12
## 14  0.8910452  1.44377867  2.27878594 -0.87751207  2.23772106 13
## 15  0.8021616  0.05075335 -0.90689932  0.60511944  0.97195494 14
## 16 -0.6613050 -0.53301063  0.23673393  1.11931142  0.45583413 15
## 17  0.1224261  0.43137220  0.74983504  0.52351057  0.36694341 16
## 18  1.2776686 -1.32409828 -0.32938833  1.40438807 -1.66208774 17
## 19 -1.0115144 -1.64752776  1.56785920  0.40575379 -1.25551293 18
## 20 -1.2161553 -1.84593073 -0.21539951  1.20872705  0.01911542 19

names(z.df)

## [1] "V1" "V2" "V3" "V4" "V5" "y"

# subsetting rows
z.sub <- subset(z.df, y > 2 & (y<10 | V1>0))
z.sub

##            V1          V2          V3          V4          V5  y
## 4  -0.2365176  0.74068179 -0.14143408 -0.04328428  0.02865983  3
## 5   0.0835121 -1.42089809 -0.37300180 -2.57809044  0.50884660  4
## 6   0.7465206  1.29856789  0.16615770  1.70561418 -0.65975398  5
## 7  -1.5497327  1.34517808 -0.55161858 -0.01960563 -0.33846092  6
## 8  -1.7771287 -0.45393268  0.07118207 -0.90300062  1.82445220  7
## 9  -0.3235453  1.15309506 -0.20338918 -1.00044201 -0.11527707  8
## 10 -0.6286514 -1.32516011 -0.02889094  0.11376413  1.08148501  9
## 11  0.4161752  0.95420692  0.55120643 -0.97427366 -0.94648476 10
## 13  2.2366444  1.17235711 -0.54226895 -0.82922849  0.98156136 12
## 14  0.8910452  1.44377867  2.27878594 -0.87751207  2.23772106 13
## 15  0.8021616  0.05075335 -0.90689932  0.60511944  0.97195494 14
## 17  0.1224261  0.43137220  0.74983504  0.52351057  0.36694341 16
## 18  1.2776686 -1.32409828 -0.32938833  1.40438807 -1.66208774 17

z.sub1 <- z.df[z.df$y == 1, ]
z.sub1

##          V1         V2        V3       V4         V5 y
## 1 0.6160612 -1.9635120 0.6734225 1.221153 -0.9672541 1
## 2 3.0006883  0.5057422 1.4470546 1.363763  1.6252472 1

z.sub2 <- z.df[z.df$y %in% c(1, 4), ]
z.sub2

##          V1         V2         V3        V4         V5 y
## 1 0.6160612 -1.9635120  0.6734225  1.221153 -0.9672541 1
## 2 3.0006883  0.5057422  1.4470546  1.363763  1.6252472 1
## 5 0.0835121 -1.4208981 -0.3730018 -2.578090  0.5088466 4

# subsetting columns
z.sub6 <- z.df[, 1:2]
z.sub6

##            V1          V2
## 1   0.6160612 -1.96351204
## 2   3.0006883  0.50574221
## 3   1.2239988  0.14980760
## 4  -0.2365176  0.74068179
## 5   0.0835121 -1.42089809
## 6   0.7465206  1.29856789
## 7  -1.5497327  1.34517808
## 8  -1.7771287 -0.45393268
## 9  -0.3235453  1.15309506
## 10 -0.6286514 -1.32516011
## 11  0.4161752  0.95420692
## 12 -0.3907337  0.75278645
## 13  2.2366444  1.17235711
## 14  0.8910452  1.44377867
## 15  0.8021616  0.05075335
## 16 -0.6613050 -0.53301063
## 17  0.1224261  0.43137220
## 18  1.2776686 -1.32409828
## 19 -1.0115144 -1.64752776
## 20 -1.2161553 -1.84593073

sample(x, size) resamples randomly, without replacement, size elements in the vector $x$. The option replace = TRUE allows resampling with replacement.

df1 <- data.frame(a=c(1, 1, 7, 6, 8), b=c(1, 1, NA, 9, 8))
sample(df1$a, 20, replace = T)

##  [1] 8 7 6 7 1 1 1 8 7 1 6 1 6 7 7 7 7 7 6 1

2.3 Mathematics, Statistics, and Optimization

Many mathematical functions, statistical summaries, and function optimizers will be discussed throughout the book. Below are the very basic functions to keep in mind.

2.3.1 Math Functions

Basic math functions like sin, cos, tan, asin, acos, atan, atan2, log, log10, exp and “set” functions union(x, y), intersect(x, y), setdiff(x, y), setequal(x, y), is.element(el, set) are available in R.

lsf.str("package:base") displays all base functions built in a specific R package (like base).

This table summarizes the core functions for most basic R for calculations.

Expression	Explanation
`choose(n, k)`	computes the combinations of k events among n repetitions. Mathematically it equals to $\frac{n!}{[(n-k)!k!]}$
`max(x)`	maximum of the elements of x
`min(x)`	minimum of the elements of x
`range(x)`	minimum and maximum of the elements of x
`sum(x)`	sum of the elements of x
`diff(x)`	lagged and iterated differences of vector x
`prod(x)`	product of the elements of x
`mean(x)`	mean of the elements of x
`median(x)`	median of the elements of x
`quantile(x, probs=)`	sample quantiles corresponding to the given probabilities (defaults to 0, .25, .5, .75, 1)
`weighted.mean(x, w)`	mean of x with weights w
`rank(x)`	ranks of the elements of x
`var(x)` or `cov(x)`	variance of the elements of x (calculated on n>1). If x is a matrix or a data frame, the variance-covariance matrix is calculated
`sd(x)`	standard deviation of x
`cor(x)`	correlation matrix of x if it is a matrix or a data frame (1 if x is a vector)
`var(x, y)` or `cov(x, y)`	covariance between x and y, or between the columns of x and those of y if they are matrices or data frames
`cor(x, y)`	linear correlation between x and y, or correlation matrix if they are matrices or data frames
`round(x, n)`	rounds the elements of x to n decimals
`log(x, base)`	computes the logarithm of x with base base
`scale(x)`	if x is a matrix, centers and reduces the data. Without centering use the option `center=FALSE`. Without scaling use `scale=FALSE` (by default center=TRUE, scale=TRUE)
`pmin(x, y, ...)`	a vector whose i-th element is the minimum of x[i], y[i], . . .
`pmax(x, y, ...)`	a vector whose i-th element is the maximum of x[i], y[i], . . .
`cumsum(x)`	a vector which ith element is the sum from x[1] to x[i]
`cumprod(x)`	id. for the product
`cummin(x)`	id. for the minimum
`cummax(x)`	id. for the maximum
`Re(x)`	real part of a complex number
`Im(x)`	imaginary part of a complex number
`Mod(x)`	modulus. `abs(x)` is the same
`Arg(x)`	angle in radians of the complex number
`Conj(x)`	complex conjugate
`convolve(x, y)`	compute several types of convolutions of two sequences
`fft(x)`	Fast Fourier Transform of an array
`mvfft(x)`	FFT of each column of a matrix
`filter(x, filter)`	applies linear filtering to a univariate time series or to each series separately of a multivariate time series

Note: many math functions have a logical parameter na.rm=TRUE to specify missing data (NA) removal.

2.3.2 Matrix Operations

The following table summarizes basic operation functions. We will discuss this topic in detail in Chapter 3 (Linear Algebra, Matrix Computing, and Regression Modeling).

Expression	Explanation
`t(x)`	transpose
`diag(x)`	diagonal
`%*%`	matrix multiplication
`solve(a, b)`	solves `a %*% x = b` for x
`solve(a)`	matrix inverse of a
`rowsum(x)`	sum of rows for a matrix-like object. `rowSums(x)` is a faster version
`colsum(x)`, `colSums(x)`	id. for columns
`rowMeans(x)`	fast version of row means
`colMeans(x)`	id. for columns

mat1 <- cbind(c(1, -1/5), c(-1/3, 1))
mat1.inv <- solve(mat1)

mat1.identity <- mat1.inv %*% mat1
mat1.identity

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

b <- c(1, 2)
x <- solve (mat1, b)
x

## [1] 1.785714 2.357143

2.3.3 Optimization and model fitting

optim(par, fn, method = c(“Nelder-Mead”, “BFGS”, “CG”, “L-BFGS-B”, “SANN”)) general-purpose optimization; par is initial values, fn is a function to optimize (normally minimize).
nlm(f, p) minimize function fusing a Newton-type algorithm with starting values p.
lm(formula) fit linear models; formula is typically of the form response ~ termA + termB + ...; use I(x*y) + I(x^2) for terms made of nonlinear components.
glm(formula, family=) fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution; family is a description of the error distribution and link function to be used in the model; see ?family.
nls(formula) nonlinear least-squares estimates of the nonlinear model parameters.
approx(x, y=) linearly interpolate given data points; $x$ can be an $xy$ plotting structure.
spline(x, y=) cubic spline interpolation.
loess(formula) (locally weighted scatterplot smoothing) fit a polynomial surface using local fitting.

Many of the formula-based modeling functions have several common arguments:

data= the data frame for the formula variables, subset= a subset of variables used in the fit, na.action= action for missing values: "na.fail", "na.omit", or a function.

The following generics often apply to model fitting functions:

predict(fit, ...) predictions from fit based on input data.
df.residual(fit) returns the number of residual degrees of freedom.
coef(fit) returns the estimated coefficients (sometimes with their standard-errors).
residuals(fit) returns the residuals.
deviance(fit) returns the deviance.
fitted(fit) returns the fitted values.
logLik(fit) computes the logarithm of the likelihood and the number of parameters.
AIC(fit) computes the Akaike information criterion (AIC).

2.3.4 Statistics

aov(formula) analysis of variance model.
anova(fit, …) analysis of variance (or deviance) tables for one or more fitted model objects.
density(x) kernel density estimates of x.

Other functions include: binom.test(), pairwise.t.test(), power.t.test(), prop.test(), t.test(), … use help.search("test") to see details.

2.3.5 Distributions

The Probability Distributome Project provides many details about univariate probability distributions. The SOCR R Shiny Distribution Calculators and the SOCR Bivariate and Trivariate Interactive Graphical Calculators provide additional demonstrations of multivariate probability distribution.

In R, there are four complementary functions supporting each probability distribution. For Normal distribution, these four functions are dnorm() - density, pnorm() - distribution function, qnorm() - quantile function, and rnorm() - random generating function. For Poisson distribution, the corresponding functions are dpois(), ppois(), qpois(), and rpois().

The table below shows the invocation syntax for generating random samples from a number of different probability distributions.

Expression	Explanation
`rnorm(n, mean=0, sd=1)`	Gaussian (normal)
`rexp(n, rate=1)`	exponential
`rgamma(n, shape, scale=1)`	gamma
`rpois(n, lambda)`	Poisson
`rweibull(n, shape, scale=1)`	Weibull
`rcauchy(n, location=0, scale=1)`	Cauchy
`rbeta(n, shape1, shape2)`	beta
`rt(n, df)`	Student’s (t)
`rf(n, df1, df2)`	Fisher’s (F) (df1, df2)
`rchisq(n, df)`	Pearson rbinom(n, size, prob) binomial
`rgeom(n, prob)`	geometric
`rhyper(nn, m, n, k)`	hypergeometric
`rlogis(n, location=0, scale=1)`	logistic
`rlnorm(n, meanlog=0, sdlog=1)`	lognormal
`rnbinom(n, size, prob)`	negative binomial
`runif(n, min=0, max=1)`	uniform
`rwilcox(nn, m, n)`, `rsignrank(nn, n)`	Wilcoxon’s statistics

Obviously, replacing the first letter r with d, p or q would reference the corresponding probability density (dfunc(x, ...)), the cumulative probability density (pfunc(x, ...)), and the value of quantile (qfunc(p, ...), with $0 < p < 1$).

2.4 Advanced Data Processing

In this section, we will introduce some useful functions that are useful in many data analytic protocols. The family of *apply() functions act on lists, arrays, vectors, data frames and other objects.

apply(X, INDEX, FUN=) returns a vector or array or list of values obtained by applying a function FUN to margins (INDEX=1 means row, INDEX=2 means column) of $X$. Additional options may be specified after the FUN argument.

df1

##   a  b
## 1 1  1
## 2 1  1
## 3 7 NA
## 4 6  9
## 5 8  8

apply(df1, 2, mean, na.rm=T)

##    a    b 
## 4.60 4.75

lapply(X, FUN) applies FUN to each member of the list $X$. If $X$ is a data frame then it will apply the FUN to each column and return a list.

lapply(df1, mean, na.rm=T)

## $a
## [1] 4.6
## 
## $b
## [1] 4.75

lapply(list(a=c(1, 23, 5, 6, 1), b=c(9, 90, 999)), median)

## $a
## [1] 5
## 
## $b
## [1] 90

tapply(X, INDEX, FUN=) applies FUN to each cell of a ragged array given by $X$ with indexes equals to INDEX. Note that $X$ is an atomic object, typically a vector.

# v<-c(1, 2, 4, 2, 2, 5, 6, 4, 7, 8, 8)
v

##  [1] 1 2 4 2 2 5 6 4 7 8 8

fac <- factor(rep(1:3, length = 11), levels = 1:3)
table(fac)

## fac
## 1 2 3 
## 4 4 3

tapply(v, fac, sum)

##  1  2  3 
## 17 16 16

by(data, INDEX, FUN) applies FUN to data frame data subsetted by INDEX. In this example, we apply the sum function using column 1 (a) as an index.

by(df1, df1[, 1], sum)

## df1[, 1]: 1
## [1] 4
## ------------------------------------------------------------ 
## df1[, 1]: 6
## [1] 15
## ------------------------------------------------------------ 
## df1[, 1]: 7
## [1] NA
## ------------------------------------------------------------ 
## df1[, 1]: 8
## [1] 16

merge(a, b) merges two data frames by common columns or row names. We can use option by= to specify the index column.

df2<-data.frame(a=c(1, 1, 7, 6, 8), c=1:5)
df2

##   a c
## 1 1 1
## 2 1 2
## 3 7 3
## 4 6 4
## 5 8 5

df3<-merge(df1, df2, by="a")
df3

##   a  b c
## 1 1  1 1
## 2 1  1 2
## 3 1  1 1
## 4 1  1 2
## 5 6  9 4
## 6 7 NA 3
## 7 8  8 5

xtabs(a ~ b, data=x) reports specific factorized contingency tables. The example below uses the 1973 UC Berkeley admissions dataset to report gender-by-status breakdown.

DF <- as.data.frame(UCBAdmissions)
##  'DF' is a data frame with a grid of the factors and the counts
## in variable 'Freq'.
DF

##       Admit Gender Dept Freq
## 1  Admitted   Male    A  512
## 2  Rejected   Male    A  313
## 3  Admitted Female    A   89
## 4  Rejected Female    A   19
## 5  Admitted   Male    B  353
## 6  Rejected   Male    B  207
## 7  Admitted Female    B   17
## 8  Rejected Female    B    8
## 9  Admitted   Male    C  120
## 10 Rejected   Male    C  205
## 11 Admitted Female    C  202
## 12 Rejected Female    C  391
## 13 Admitted   Male    D  138
## 14 Rejected   Male    D  279
## 15 Admitted Female    D  131
## 16 Rejected Female    D  244
## 17 Admitted   Male    E   53
## 18 Rejected   Male    E  138
## 19 Admitted Female    E   94
## 20 Rejected Female    E  299
## 21 Admitted   Male    F   22
## 22 Rejected   Male    F  351
## 23 Admitted Female    F   24
## 24 Rejected Female    F  317

## Nice for taking margins ...
xtabs(Freq ~ Gender + Admit, DF)

##         Admit
## Gender   Admitted Rejected
##   Male       1198     1493
##   Female      557     1278

## And for testing independence ...
summary(xtabs(Freq ~ ., DF))

## Call: xtabs(formula = Freq ~ ., data = DF)
## Number of cases in table: 4526 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 2000.3, df = 16, p-value = 0

aggregate(x, by, FUN) splits the data frame $x$ into subsets, computes summary statistics for each part, and reports the results. by is a list of grouping elements, each of which has the same length as the variables in $x$. For example, we can apply the function sum to the data frame df3 subject to the index created by list(rep(1:3, length=7)).

list(rep(1:3, length=7))

## [[1]]
## [1] 1 2 3 1 2 3 1

aggregate(df3, by=list(rep(1:3, length=7)), sum)

##   Group.1  a  b c
## 1       1 10 10 8
## 2       2  7 10 6
## 3       3  8 NA 4

stack(x, …) transforms data stored as separate columns in a data frame, or list, into a single column vector; unstack(x, …) is the inverse of stack().

stack(df3)

##    values ind
## 1       1   a
## 2       1   a
## 3       1   a
## 4       1   a
## 5       6   a
## 6       7   a
## 7       8   a
## 8       1   b
## 9       1   b
## 10      1   b
## 11      1   b
## 12      9   b
## 13     NA   b
## 14      8   b
## 15      1   c
## 16      2   c
## 17      1   c
## 18      2   c
## 19      4   c
## 20      3   c
## 21      5   c

unstack(stack(df3))

##   a  b c
## 1 1  1 1
## 2 1  1 2
## 3 1  1 1
## 4 1  1 2
## 5 6  9 4
## 6 7 NA 3
## 7 8  8 5

reshape(x, …) reshapes a data frame between wide format, with repeated measurements in separate columns of the same record, and long format, with the repeated measurements in separate records. We can specify the transformation direction, direction="wide" or direction="long".

df4 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6), 
                  time = rep(c(1, 1, 2, 2), 3), score = rnorm(12))
wide <- reshape(df4, idvar = c("school", "class"), direction = "wide")
wide

##    school class    score.1    score.2
## 1       1     9 -0.8195732 -0.7498085
## 2       1    10 -0.9259571  1.2348810
## 5       2     9  0.1242512  1.3913972
## 6       2    10 -0.9269607 -1.5788499
## 9       3     9 -0.3396430  3.0950490
## 10      3    10  2.2346850 -0.5567815

long <- reshape(wide, idvar = c("school", "class"), direction = "long")
long

##        school class time    score.1
## 1.9.1       1     9    1 -0.8195732
## 1.10.1      1    10    1 -0.9259571
## 2.9.1       2     9    1  0.1242512
## 2.10.1      2    10    1 -0.9269607
## 3.9.1       3     9    1 -0.3396430
## 3.10.1      3    10    1  2.2346850
## 1.9.2       1     9    2 -0.7498085
## 1.10.2      1    10    2  1.2348810
## 2.9.2       2     9    2  1.3913972
## 2.10.2      2    10    2 -1.5788499
## 3.9.2       3     9    2  3.0950490
## 3.10.2      3    10    2 -0.5567815

Notes

The $x$ in this function has to be longitudinal data.
The call to rnorm used in reshape might generate different results for each call, unless set.seed(1234) is used to ensure reproducibility of random-number generation.

2.4.1 Strings

The following functions are useful for handling strings in R.

paste(…) and paste0(…) concatenate vectors after converting the arguments to a vector of characters. There are several options, sep= to use a string to separate terms (a single space is the default), collapse= to separate “collapsed” results.

a<-"today"
b<-"is a good day"
paste(a, b)

## [1] "today is a good day"

paste(a, b, sep=", ")

## [1] "today, is a good day"

substr(x, start, stop) substrings in a character vector. Using substr(x, start, stop) <- value, it can also assign values (with the same length) to part of a string.

a<-"When the going gets tough, the tough get going!"
substr(a, 10, 40)

## [1] "going gets tough, the tough get"

## [1] "going gets tough, the tough get"
substr(a, 1, 9)<-"........."
a

## [1] ".........going gets tough, the tough get going!"

Note that characters at start and stop indexes are inclusive in the output.

strsplit(x, split) splits $x$ according to the substring split. Use fixed=TRUE for non-regular expressions.

strsplit("a.b.c", ".", fixed = TRUE)

## [[1]]
## [1] "a" "b" "c"

grep(pattern, x) searches for pattern matches within $x$ and returns a vector of the indices of the elements of $x$ that had a match. Use regular expressions for pattern(unless fixed=TRUE), see ?regex for details.

letters

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

grep("[a-z]", letters)

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26

gsub(pattern, replacement, x) replaces matching patterns in $x$, allowing for use of regular expression matching; sub() is the same but it only replaces the first occurrence of the matched pattern.

a<-c("e", 0, "kj", 10, ";")
gsub("[a-z]", "letters", a)

## [1] "letters"        "0"              "lettersletters" "10"            
## [5] ";"

sub("[a-z]", "letters", a)

## [1] "letters"  "0"        "lettersj" "10"       ";"

tolower(x) converts strings to lowercase and toupper(x) converts to uppercase.

match(x, table) yields a vector of the positions of first matches for the elements of $x$ among table, x %in% table returns a logical vector.

x<-c(1, 2, 10, 19, 29)
match(x, c(1, 10))

## [1]  1 NA  2 NA NA

x %in% c(1, 10)

## [1]  TRUE FALSE  TRUE FALSE FALSE

pmatch(x, table) reports partial matches for the elements of $x$.

Dates and Times

The class Date stores calendar dates, without times. POSIXct() has dates and times, including time zones. Comparisons (e.g. $>$), seq(), and difftime() are useful to compare dates. ?DateTimeClasses gives more information, see also package chron.

The functions as.Date(s) and as.POSIXct(s) convert to the respective class; format(dt) converts to a string representation. The default string format is 2001-02-21. These accept a second argument to specify a format for conversion. Some common formats are:

Formats	Explanations
`%a, %A`	Abbreviated and full weekday name.
`%b, %B`	Abbreviated and full month name.
`%d`	Day of the month (01 … 31).
`%H`	Hours (00 … 23).
`%I`	Hours (01 … 12).
`%j`	Day of year (001 … 366).
`%m`	Month (01 … 12).
`%M`	Minute (00 … 59).
`%p`	AM/PM indicator.
`%S`	Second as a decimal number (00 … 61).
`%U`	Week (00 … 53); the first Sunday as day 1 of week 1.
`%w`	Weekday (0 … 6, Sunday is 0).
`%W`	Week (00 … 53); the first Monday as day 1 of week 1.
`%y`	Year without century (00 … 99). Don’t use it.
`%Y`	Year with century.
`%z` (output only.)	Offset from Greenwich; -0800 is 8 hours west of.
`%Z` (output only.)	Time zone as a character string (empty if not available).

Where leading zeros are shown they will be used on output but are optional on input. See ?strftime for details.

2.5 Basic Plotting

The following functions represent the basic plotting functions in R. Later, in Chapter 2 (Visualization & EDA), we will discuss more elaborate visualization in and exploratory data analytic strategies.

plot(x) plot of the values of x (on the y-axis) ordered on the x-axis.
plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis).
hist(x) histogram of the frequencies of x.
barplot(x) histogram of the values of x. Use horiz=FALSE for horizontal bars.
dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column).
pie(x) circular pie-chart.
boxplot(x) ‘box-and-whiskers’ plot.
sunflowerplot(x, y) sunflowers plot with multiple leaves (‘petals’) such that overplotting is visualized instead of accidental and invisible.
stripplot(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes).
coplot(x~y | z) bivariate plot of x and y for each value or interval of values of z.
interaction.plot (f1, f2, y) if f1 and f2 are factors, plots the means of y (on the y-axis) with respect to the values of f1 (on the x-axis) and of f2 (different curves). The option fun allows you to choose the summary statistic of y (by default fun=mean).
matplot(x, y) bivariate plot of the first column of x vs. the first one of y, the second one of x vs. the second one of y, etc.
fourfoldplot(x) visualizes, with quarters of circles, the association between two dichotomous variables for different populations (x must be an array with dim=c(2, 2, k), or a matrix with dim=c(2, 2) if k = 1)
assocplot(x) Cohen’s Friendly graph shows the deviations from independence of rows and columns in a two dimensional contingency table.
mosaicplot(x) “mosaic”” graph of the residuals from a log-linear regression of a contingency table.
pairs(x) if x is a matrix or a data frame, draws all possible bivariate plots between the columns of x.
plot.ts(x) if x is an object of class “ts”, plot of x with respect to time, x may be multivariate but the series must have the same frequency and dates. Detailed examples are in Chapter 17: Big Longitudinal Data Analysis.
ts.plot(x) id. but if x is multivariate the series may have different dates and must have the same frequency.
qqnorm(x) quantiles of x with respect to the values expected under a normal law.
qqplot(x, y) quantiles of y with respect to the quantiles of x.
contour(x, y, z) contour plot (data are interpolated to draw the curves), x and y must be vectors and z must be a matrix so that dim(z)=c(length(x), length(y)) (x and y may be omitted).
filled.contour(x, y, z) areas between the contours are colored, and a legend of the colors is drawn as well.
image(x, y, z) plotting actual data with colors.
persp(x, y, z) plotting actual data in perspective view.
stars(x) if x is a matrix or a data frame, it draws a graph with segments or a star where each row of x is represented by a star and the columns are the lengths of the segments.
symbols(x, y, …) draws, at the coordinates given by x and y, symbols (circles, squares, rectangles, stars, thermometers or “boxplots”“) whose sizes, colors… are specified by supplementary arguments.
termplot(mod.obj) plot of the (partial) effects of a regression model (mod.obj).

The following parameters are common to many plotting functions:

Parameters	Explanations
`add=FALSE`	if TRUE superposes the plot on the previous one (if it exists)
`axes=TRUE`	if FALSE does not draw the axes and the box
`type="p"`	specifies the type of plot, “p”: points, “l”: lines, “b”: points connected by lines, “o”: id. But the lines are over the points, “h”: vertical lines, “s”: steps, the data are represented by the top of the vertical lines, “S”: id. However, the data are represented at the bottom of the vertical lines
`xlim=, ylim=`	specifies the lower and upper limits of the axes, for example with `xlim=c(1, 10)` or `xlim=range(x)`
`xlab=, ylab=`	annotates the axes, must be variables of mode character
`main=`	main title, must be a variable of mode character
`sub=`	subtitle (written in a smaller font)

2.5.1 QQ Normal probability plot

Let’s look at one simple example - quantile-quantile probability plot. Suppose $X\sim N(0,1)$ and $Y\sim Cauchy$ represent the observed/raw and simulated/generated data for one feature (variable) in the data.

# This commended example illustrates a linear model based approach (below is a more direct QQ-plot demonstration)
# X_norm1 <- rnorm(1000)
# X_norm2 <- rnorm(1000, m=-75, sd=3.7)
# X_Cauchy <- rcauchy(1000)
# 
# # compare X to StdNormal distribution
# #     qqnorm(X, 
# #                 main="Normal Q-Q Plot of the data", 
# #                 xlab="Theoretical Quantiles of the Normal", 
# #                 ylab="Sample Quantiles of the X (Normal) Data")
# #     qqline(X)
# #     qqplot(X, Y)
# fit_norm_norm = lm(X_norm2 ~ X_norm1)
# fit_norm_cauchy = lm(X_Cauchy ~ X_norm1)
# 
# # Get model fitted values
# Fitted.Values.norm_norm <-  fitted(fit_norm_norm)
# Fitted.Values.norm_cauchy <-  fitted(fit_norm_cauchy)
#   
# # Extract model residuals
# Residuals.norm_norm <-  resid(fit_norm_norm)
# Residuals.norm_cauchy <-  resid(fit_norm_cauchy)
# 
# # Compute the model standardized residuals from lm() object
# Std.Res.norm_norm <- MASS::stdres(fit_norm_norm)  
# Std.Res.norm_cauchy <- MASS::stdres(fit_norm_cauchy)  
#   
# # Extract the theoretical (Normal) quantiles
# Theoretical.Quantiles.norm_norm <- qqnorm(Residuals.norm_norm, plot.it = F)$x
# Theoretical.Quantiles.norm_cauchy <- qqnorm(Residuals.norm_cauchy, plot.it = F)$x
#   
# qq.df.norm_norm <- data.frame(Std.Res.norm_norm, Theoretical.Quantiles.norm_norm)
# qq.df.norm_cauchy <- data.frame(Std.Res.norm_cauchy, Theoretical.Quantiles.norm_cauchy)
# 
# qq.df.norm_norm %>% 
#   plot_ly(x = ~Theoretical.Quantiles.norm_norm) %>% 
#     add_markers(y = ~Std.Res.norm_norm, name="Normal(0,1) vs. Normal(-75, 3.7) Data") %>%
#     add_lines(x = ~Theoretical.Quantiles.norm_norm, y = ~Theoretical.Quantiles.norm_norm, 
#               mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>% 
#     layout(title = "Q-Q Normal Plot", legend = list(orientation = 'h'))
# 
# # Normal vs. Cauchy
# qq.df.norm_cauchy %>% 
#   plot_ly(x = ~Theoretical.Quantiles.norm_cauchy) %>% 
#     add_markers(y = ~Std.Res.norm_cauchy, name="Normal(0,1) vs. Cauchy Data") %>%
#     add_lines(x = ~Theoretical.Quantiles.norm_norm, y = ~Theoretical.Quantiles.norm_norm, 
#               mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>% 
#     layout(title = "Normal vs. Cauchy Q-Q Plot", legend = list(orientation = 'h'))

# Q-Q plot data (X) vs. simulation(Y)
# 
# myQQ <- function(x, y, ...) {
#   #rang <- range(x, y, na.rm=T)
#   rang <- range(-4, 4, na.rm=T)
#   qqplot(x, y, xlim=rang, ylim=rang)
# }
# 
# myQQ(X, Y) # where the Y is the newly simulated data for X
# qqline(X)

# Sample different number of observations from all the 3 processes
X_norm1 <- rnorm(500)
X_norm2 <- rnorm(1000, m=-75, sd=3.7)
X_Cauchy <- rcauchy(1500)

# estimate the quantiles (scale the values to ensure measuring-unit invariance of both processes)
qX_norm1 <- quantile(scale(X_norm1), probs = seq(from=0.01, to=0.99, by=0.01))
qX_norm2 <- quantile(scale(X_norm2), probs = seq(from=0.01, to=0.99, by=0.01))
qq.df.norm_norm <- data.frame(qX_norm1, qX_norm2)

# Normal(0,1) vs. Normal(-75, 3.7)
qq.df.norm_norm %>% 
  plot_ly(x = ~qX_norm1) %>% 
    add_markers(y = ~qX_norm2, name="Normal(0,1) vs. Normal(-75, 3.7) Data") %>%
    add_lines(x = ~qX_norm1, y = ~qX_norm1, 
              mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>% 
    layout(title = "Q-Q Normal Plot", legend = list(orientation = 'h'))

# Normal(0,1) vs. Cauchy
qX_norm1 <- quantile(X_norm1, probs = seq(from=0.01, to=0.99, by=0.01))
qX_Cauchy <- quantile(X_Cauchy, probs = seq(from=0.01, to=0.99, by=0.01))
qq.df.norm_cauchy <- data.frame(qX_norm1, qX_Cauchy)

qq.df.norm_cauchy %>% 
  plot_ly(x = ~qX_norm1) %>% 
    add_markers(y = ~qX_Cauchy, name="Normal(0,1) vs. Cauchy Data") %>%
    add_lines(x = ~qX_norm1, y = ~qX_norm1, 
              mode = "line", name = "Theoretical Normal", line = list(width = 2)) %>% 
    layout(title = "Normal vs. Cauchy Q-Q Plot", legend = list(orientation = 'h'))

2.5.2 Low-level plotting commands

points(x, y) adds points (the option type= can be used)
lines(x, y) id. but with lines
text(x, y, labels, …) adds text given by labels at coordinates (x, y). Typical use: plot(x, y, type="n"); text(x, y, names)
mtext(text, side=3, line=0, …) adds text given by text in the margin specified by side (see axis() below); line specifies the line from the plotting area.
segments(x0, y0, x1, y1) draws lines from points (x0, y0) to points (x1, y1)
arrows(x0, y0, x1, y1, angle= 30, code=2) id. With arrows at points (x0, y0), if code=2. The arrow is at point (x1, y1), if code=1. Arrows are at both if code=3. Angle controls the angle from the shaft of the arrow to the edge of the arrow head.
abline(a, b) draws a line of slope b and intercept a.
abline(h=y) draws a horizontal line at ordinate y.
abline(v=x) draws a vertical line at abscissa x.
abline(lm.obj) draws the regression line given by lm.obj. abline(h=0, col=2) #color (col) is often used
rect(x1, y1, x2, y2) draws a rectangle whose left, right, bottom, and top limits are x1, x2, y1, and y2, respectively.
polygon(x, y) draws a polygon linking the points with coordinates given by x and y.
legend(x, y, legend) adds the legend at the point (x, y) with the symbols given by legend.
title() adds a title and optionally a subtitle.
axis(side, vect) adds an axis at the bottom (side=1), on the left (side=2), at the top (side=3), or on the right (side=4); vect (optional) gives the abscissa (or ordinates) where tick-marks are drawn.
rug(x) draws the data x on the x-axis as small vertical lines.
locator(n, type=“n”, …) returns the coordinates (x, y) after the user has clicked n times on the plot with the mouse; also draws symbols (type="p") or lines (type="l") with respect to optional graphic parameters (…); by default nothing is drawn (type="n").

2.5.3 General graphics parameters

These can be set globally with par(…). Many can be passed as parameters to plotting commands.

adj controls text justification (adj=0 left-justified, adj=0.5 centered, adj=1 right-justified).
bg specifies the color of the background (ex. : bg="red", bg="blue", …the list of the 657 available colors is displayed with colors()).
bty controls the type of box drawn around the plot. Allowed values are: “o”, “l”, “7”, “c”, “u” ou “]” (the box looks like the corresponding character). If bty="n" the box is not drawn.
cex a value controlling the size of texts and symbols with respect to the default. The following parameters have the same control for numbers on the axes-cex.axis, the axis labels-cex.lab, the title-cex.main, and the subtitle-cex.sub.
col controls the color of symbols and lines. Use color names: “red”, “blue” see colors() or as “#RRGGBB”; see rgb(), hsv(), gray(), and rainbow(); as for cex there are: col.axis, col.lab, col.main, col.sub.
font an integer which controls the style of text (1: normal, 2: italics, 3: bold, 4: bold italics); as for cex there are: font.axis, font.lab, font.main, font.sub.
las an integer which controls the orientation of the axis labels (0: parallel to the axes, 1: horizontal, 2: perpendicular to the axes, 3: vertical).
lty controls the type of lines, can be an integer or string (1: “solid”, 2: “dashed”, 3: “dotted”, 4: “dotdash”, 5: “longdash”, 6: “twodash”, or a string of up to eight characters (between “0” and “9”) which specifies alternatively the length, in points or pixels, of the drawn elements and the blanks, for example lty="44" will have the same effect than lty=2.
lwd a numeric which controls the width of lines, default=1.
mar a vector of 4 numeric values which control the space between the axes and the border of the graph of the form c(bottom, left, top, right), the default values are c(5.1, 4.1, 4.1, 2.1).
mfcol a vector of the form c(nr, nc) which partitions the graphic window as a matrix of nr lines and nc columns, the plots are then drawn in columns.
mfrow plots are drawn by row-by-row.
pch controls the type of symbol, either an integer between 1 and 25, or any single character within ““.
ts.plot(x) id. but if x is multivariate the series may have different dates by x and y.
ps an integer which controls the size in points of texts and symbols.
pty a character, which specifies the type of the plotting region, “s”: square, “m”: maximal.
tck a value which specifies the length of tick-marks on the axes as a fraction of the smallest of the width or height of the plot; if tck=1 a grid is drawn.
tcl a value which specifies the length of tick-marks on the axes as a fraction of the height of a line of text (by default tcl=-0.5).
xaxt if xaxt="n" the x-axis is set but not drawn (useful in conjunction with axis(side=1, ...)).
yaxt if yaxt="n" the y-axis is set but not drawn (useful in conjunction with axis(side=2, ...)).

Expression	Explanation
xyplot(y~x)	bivariate plots (with many functionalities).
barchart(y~x)	histogram of the values of y with respect to those of x.
dotplot(y~x)	Cleveland dot plot (stacked plots line-by-line and column-by-column)
densityplot(~x)	density functions plot
histogram(~x)	histogram of the frequencies of x
bwplot(y~x)	“box-and-whiskers” plot
qqmath(~x)	quantiles of x with respect to the values expected under a theoretical distribution
stripplot(y~x)	single dimension plot, x must be numeric, y may be a factor
qq(y~x)	quantiles to compare two distributions, x must be numeric, y may be numeric, character, or factor but must have two “levels”
splom(~x)	matrix of bivariate plots
parallel(~x)	parallel coordinates plot
levelplot($z\sim xy\\|g1g2$)	colored plot of the values of z at the coordinates given by x and y (x, y and z are all of the same length)
wireframe($z\sim xy\\|g1g2$)	3d surface plot
cloud($z\sim xy\\|g1g2$)	3d scatter plot

In the normal Lattice formula, y~x|g1*g2, combinations of optional conditioning variables g1 and g2 plotted on separate panels. Lattice functions take many of the same arguments as base graphics plus also data= the data frame for the formula variables and subset= for subsetting. Use panel= to define a custom panel function (see apropos("panel") and ?lines). Lattice functions return an object of class trellis and have to be printed to produce the graph. Use print(xyplot(...)) inside functions where automatic printing doesn’t work. Use lattice.theme and lset to change Lattice defaults.

2.6 Basic `R` Programming

The standard setting for our own function is:

function.name<-function(x) { expr(an expression) return(value) }

Where $x$ is the parameter in the expression. A simple example of this is:

adding <- function(x=0, y=0) {
  z<-x+y
  return(z)
}
adding(x=5, y=10)

## [1] 15

Conditions setting: if(cond) {expr} or if(cond) cons.expr else alt.expr.

x<-10
if(x>10) z="T" else z="F"
z

## [1] "F"

Alternatively, ifelse represents a vectorized and extremely efficient conditional mechanism that provides one of the main advantages of R.

For loop: for(var in seq) expr.

x<-c()
for(i in 1:10) x[i]=i
x

##  [1]  1  2  3  4  5  6  7  8  9 10

Other loops: While loop: while(cond) expr, repeat: repeat expr. Applied to the innermost of nested loops: break, next. Use braces {} around statements.

ifelse(test, yes, no) returns a value with the same shape as test, filled with yes or no Boolean values.

do.call(funname, args) executes a function call from the name of the function and a list of arguments to be passed to it.

2.7 Data Simulation Primer

Before we demonstrate how to synthetically simulate data that resembles closely the characteristics of real observations from the same process, let’s import some observed data for initial exploratory analytics.

Using the SOCR Parkinson’s Disease Case-study available in the Canvas Data Archive, we can import some data and extract some descriptions of the sample data (05_PPMI_top_UPDRS_Integrated_LongFormat1.csv).

PPMI <- read.csv("https://umich.instructure.com/files/330397/download?download_frd=1")
# summary(PPMI)
Hmisc::describe(PPMI)

## PPMI 
## 
##  31  Variables      1764  Observations
## --------------------------------------------------------------------------------
## FID_IID 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     3534    390.9     3054     3089 
##      .25      .50      .75      .90      .95 
##     3272     3476     3817     4072     4102 
## 
## lowest : 3001 3002 3003 3004 3006, highest: 4122 4123 4126 4136 4139
## --------------------------------------------------------------------------------
## L_insular_cortex_ComputeArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     2255    794.6    808.5    959.5 
##      .25      .50      .75      .90      .95 
##   1976.9   2498.7   2744.1   2962.3   3156.7 
## 
## lowest : 50.0355 92.941  220.452 225.999 306.361
## highest: 3474.54 3490.82 3580.98 3630.07 3650.81
## --------------------------------------------------------------------------------
## L_insular_cortex_Volume 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     6491     3255     1035     1539 
##      .25      .50      .75      .90      .95 
##     4881     7237     8405     9616    10424 
## 
## lowest : 22.6262 47.417  116.775 120.79  242.591
## highest: 12172.3 12544.2 12852.2 13148.2 13499.9
## --------------------------------------------------------------------------------
## R_insular_cortex_ComputeArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     1711    655.8    562.4    715.0 
##      .25      .50      .75      .90      .95 
##   1345.7   1889.6   2115.8   2329.0   2431.5 
## 
## lowest : 40.9245 70.1356 86.8461 129.144 159.828
## highest: 2631.22 2631.4  2637.81 2737.28 2791.92
## --------------------------------------------------------------------------------
## R_insular_cortex_Volume 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     3973     2158    592.0    886.9 
##      .25      .50      .75      .90      .95 
##   2652.1   4386.1   5243.2   6368.4   6795.0 
## 
## lowest : 11.8398 32.4826 48.7982 68.7627 111.06 
## highest: 7595.27 7659.64 7671.54 8122.7  8179.4 
## --------------------------------------------------------------------------------
## L_cingulate_gyrus_ComputeArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     3315     1423    917.9   1226.7 
##      .25      .50      .75      .90      .95 
##   2379.5   3654.1   4198.0   4705.9   5126.2 
## 
## lowest : 127.779 214.424 267.473 338.737 360.349
## highest: 5530.18 5562.61 5675.14 5694.19 5944.19
## --------------------------------------------------------------------------------
## L_cingulate_gyrus_Volume 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     7949     4613    872.1   1343.3 
##      .25      .50      .75      .90      .95 
##   4173.6   8840.8  10673.2  12718.9  14393.3 
## 
## lowest : 57.3298 120.742 169.844 225.796 242.425
## highest: 16406.7 16515.7 16659.9 16765.9 17153.2
## --------------------------------------------------------------------------------
## R_cingulate_gyrus_ComputeArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     3277     1465    792.9   1144.9 
##      .25      .50      .75      .90      .95 
##   2311.6   3658.9   4211.6   4610.4   4968.7 
## 
## lowest : 104.135 169.473 190.46  241.615 285.22 
## highest: 5581.44 5599.98 5607.32 6076.93 6593.7 
## --------------------------------------------------------------------------------
## R_cingulate_gyrus_Volume 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     7896     4624    791.1   1366.9 
##      .25      .50      .75      .90      .95 
##   4639.3   8926.5  10719.4  12226.0  13625.3 
## 
## lowest : 47.6712 87.2522 102.801 189.956 193.222
## highest: 16076.2 16091.2 17213.7 18046.8 19761.8
## --------------------------------------------------------------------------------
## L_caudate_ComputeArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1    635.6    437.7    42.06    73.96 
##      .25      .50      .75      .90      .95 
##   277.55   694.18   945.41  1112.96  1212.08 
## 
## lowest : 1.78156 4.09485 4.09981 4.19613 4.19617
## highest: 1328.71 1328.74 1347.38 1350.86 1453.51
## --------------------------------------------------------------------------------
## L_caudate_Volume 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1      952    827.2    15.09    32.66 
##      .25      .50      .75      .90      .95 
##   186.65   975.10  1519.82  1906.58  2192.25 
## 
## lowest : 0.192801 0.630879 0.633874 0.669029 0.669087
## highest: 2581.29  2582.98  2627.04  2632.19  2746.62 
## --------------------------------------------------------------------------------
## R_caudate_ComputeArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1    869.4    484.1    55.44   112.09 
##      .25      .50      .75      .90      .95 
##   439.06  1034.77  1184.46  1294.08  1401.43 
## 
## lowest : 1.78156 6.38316 6.38347 11.8365 15.8257
## highest: 1550.8  1568.45 1615.92 1666.42 1684.56
## --------------------------------------------------------------------------------
## R_caudate_Volume 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     1496     1039    21.36    51.32 
##      .25      .50      .75      .90      .95 
##   502.03  1768.10  2135.23  2531.07  2811.08 
## 
## lowest : 0.192801 1.22382  1.22397  2.42792  4.23963 
## highest: 3360.75  3365.66  3451.12  3464.1   3579.37 
## --------------------------------------------------------------------------------
## L_putamen_ComputeArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1    922.6    481.2    95.41   266.46 
##      .25      .50      .75      .90      .95 
##   660.77   993.58  1229.51  1418.33  1539.90 
## 
## lowest : 6.75987 9.16345 15.8901 16.1646 16.3758
## highest: 1711.69 1734.75 1786.51 2060.12 2129.67
## --------------------------------------------------------------------------------
## L_putamen_Volume 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     1764     1250    54.77   190.98 
##      .25      .50      .75      .90      .95 
##   916.76  1792.46  2507.00  3184.42  3648.16 
## 
## lowest : 1.2275  1.90048 3.44451 3.89709 4.11645
## highest: 4197.61 4299.28 4331.44 4363.21 4712.66
## --------------------------------------------------------------------------------
## R_putamen_ComputeArea 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     1297    535.6    303.5    457.9 
##      .25      .50      .75      .90      .95 
##   1023.5   1463.7   1627.2   1776.9   1894.2 
## 
## lowest : 13.9263 18.3367 28.3    29.8511 41.4573
## highest: 2050.44 2068.86 2127.9  2249.68 2251.41
## --------------------------------------------------------------------------------
## R_putamen_Volume 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      441        1     2965     1697    236.2    433.0 
##      .25      .50      .75      .90      .95 
##   1805.2   3380.4   3959.4   4647.8   5179.2 
## 
## lowest : 3.20741 4.82362 10.4877 11.3539 16.1329
## highest: 5969.68 6022.53 6359    6739.35 7096.58
## --------------------------------------------------------------------------------
## Sex 
##        n  missing distinct     Info     Mean      Gmd 
##     1764        0        2    0.673     1.34   0.4491 
##                     
## Value         1    2
## Frequency  1164  600
## Proportion 0.66 0.34
## --------------------------------------------------------------------------------
## Weight 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      276        1    82.05    18.51     55.4     61.8 
##      .25      .50      .75      .90      .95 
##     70.5     81.2     90.9    103.4    110.0 
## 
## lowest : 43.2  45    46.2  46.7  47.3 , highest: 124   131.2 134.5 134.7 135  
## --------------------------------------------------------------------------------
## ResearchGroup 
##        n  missing distinct 
##     1764        0        3 
##                                   
## Value      Control      PD   SWEDD
## Frequency      512    1092     160
## Proportion   0.290   0.619   0.091
## --------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0      434        1    61.07    11.59    43.20    47.54 
##      .25      .50      .75      .90      .95 
##    54.07    62.15    68.82    73.68    76.20 
## 
## lowest : 31.1781 31.8849 31.9205 32.2959 32.3534
## highest: 81.8411 81.9452 82.2849 82.7699 83.0329
## --------------------------------------------------------------------------------
## chr12_rs34637584_GT 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1688       76        2    0.028       16 0.009479  0.01879 
## 
## --------------------------------------------------------------------------------
## chr17_rs11868035_GT 
##        n  missing distinct     Info     Mean      Gmd 
##     1688       76        3    0.816   0.6161   0.6983 
##                             
## Value          0     1     2
## Frequency    836   664   188
## Proportion 0.495 0.393 0.111
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## chr17_rs11012_GT 
##        n  missing distinct     Info     Mean      Gmd 
##     1688       76        3    0.658    0.346    0.489 
##                             
## Value          0     1     2
## Frequency   1152   488    48
## Proportion 0.682 0.289 0.028
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## chr17_rs393152_GT 
##        n  missing distinct     Info     Mean      Gmd 
##     1688       76        3    0.723   0.4265   0.5645 
##                             
## Value          0     1     2
## Frequency   1052   552    84
## Proportion 0.623 0.327 0.050
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## chr17_rs12185268_GT 
##        n  missing distinct     Info     Mean      Gmd 
##     1688       76        3    0.707   0.4028   0.5399 
##                             
## Value          0     1     2
## Frequency   1076   544    68
## Proportion 0.637 0.322 0.040
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## chr17_rs199533_GT 
##        n  missing distinct     Info     Mean      Gmd 
##     1688       76        3    0.691   0.3791    0.514 
##                             
## Value          0     1     2
## Frequency   1100   536    52
## Proportion 0.652 0.318 0.031
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## UPDRS_part_I 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1215      549       13      0.9    1.286    1.638        0        0 
##      .25      .50      .75      .90      .95 
##        0        1        2        3        5 
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency    527   296   182    96    42    33    21     6     4     3     2
## Proportion 0.434 0.244 0.150 0.079 0.035 0.027 0.017 0.005 0.003 0.002 0.002
##                       
## Value         11    13
## Frequency      2     1
## Proportion 0.002 0.001
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## UPDRS_part_II 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1211      553       28    0.994    6.087    5.432      0.0      1.0 
##      .25      .50      .75      .90      .95 
##      2.0      5.0      9.0     13.0     15.5 
## 
## lowest :  0  1  2  3  4, highest: 23 24 25 27 28
## --------------------------------------------------------------------------------
## UPDRS_part_III 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1210      554       58    0.999    19.44    13.04        0        2 
##      .25      .50      .75      .90      .95 
##       12       20       27       35       39 
## 
## lowest :  0  1  2  3  4, highest: 53 56 59 60 61
## --------------------------------------------------------------------------------
## time_visit 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1764        0       12    0.993     23.5    20.09     0.00     3.00 
##      .25      .50      .75      .90      .95 
##     8.25    21.00    37.50    48.00    54.00 
##                                                                             
## Value          0     3     6     9    12    18    24    30    36    42    48
## Frequency    147   147   147   147   147   147   147   147   147   147   147
## Proportion 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083
##                 
## Value         54
## Frequency    147
## Proportion 0.083
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------

# data driven age estimates
m  = round (mean(PPMI$Age), 2)
sd = round(sd(PPMI$Age), 2)

x.norm <- rnorm(n=200, m=m, sd=sd)
# hist(x.norm, main='N(10, 20) Histogram')
plot_ly(x = ~x.norm, type = "histogram") %>% 
  layout(bargap=0.1, title=paste0('N(', m, ', ', sd, ') Histogram'))

mean(PPMI$Age)

## [1] 61.07281

sd(PPMI$Age)

## [1] 10.26669

Next, we will simulate new synthetic data to match the properties/characteristics of the observed PPMI data (using Uniform, Normal, and Poisson distributions).

# age m=62, sd=10

# Demographics variables
# Define number of subjects
NumSubj <- 282
NumTime <- 4

# Define data elements
# Cases
Cases <- c(2, 3, 6, 7, 8, 10, 11, 12, 13, 14, 17, 18, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 41, 42, 43, 44, 45, 53, 55, 58, 60, 62, 67, 69, 71, 72, 74, 79, 80, 85, 87, 90, 95, 97, 99, 100, 101, 106, 107, 109, 112, 120, 123, 125, 128, 129, 132, 134, 136, 139, 142, 147, 149, 153, 158, 160, 162, 163, 167, 172, 174, 178, 179, 180, 182, 192, 195, 201, 208, 211, 215, 217, 223, 227, 228, 233, 235, 236, 240, 245, 248, 250, 251, 254, 257, 259, 261, 264, 268, 269, 272, 273, 275, 279, 288, 289, 291, 296, 298, 303, 305, 309, 314, 318, 324, 325, 326, 328, 331, 332, 333, 334, 336, 338, 339, 341, 344, 346, 347, 350, 353, 354, 359, 361, 363, 364, 366, 367, 368, 369, 370, 371, 372, 374, 375, 376, 377, 378, 381, 382, 384, 385, 386, 387, 389, 390, 393, 395, 398, 400, 410, 421, 423, 428, 433, 435, 443, 447, 449, 450, 451, 453, 454, 455, 456, 457, 458, 459, 460, 461, 465, 466, 467, 470, 471, 472, 476, 477, 478, 479, 480, 481, 483, 484, 485, 486, 487, 488, 489, 492, 493, 494, 496, 498, 501, 504, 507, 510, 513, 515, 528, 530, 533, 537, 538, 542, 545, 546, 549, 555, 557, 559, 560, 566, 572, 573, 576, 582, 586, 590, 592, 597, 603, 604, 611, 619, 621, 623, 624, 625, 631, 633, 634, 635, 637, 640, 641, 643, 644, 645, 646, 647, 648, 649, 650, 652, 654, 656, 658, 660, 664, 665, 670, 673, 677, 678, 679, 680, 682, 683, 686, 687, 688, 689, 690, 692)

# Imaging Biomarkers
L_caudate_ComputeArea <- rpois(NumSubj, 600)
L_caudate_Volume <- rpois(NumSubj, 800)
R_caudate_ComputeArea <- rpois(NumSubj, 893)
R_caudate_Volume <- rpois(NumSubj, 1000)
L_putamen_ComputeArea <- rpois(NumSubj, 900)
L_putamen_Volume <- rpois(NumSubj, 1400)
R_putamen_ComputeArea <- rpois(NumSubj, 1300)
R_putamen_Volume <- rpois(NumSubj, 3000)
L_hippocampus_ComputeArea <- rpois(NumSubj, 1300)
L_hippocampus_Volume <- rpois(NumSubj, 3200)
R_hippocampus_ComputeArea <- rpois(NumSubj, 1500)
R_hippocampus_Volume <- rpois(NumSubj, 3800)
cerebellum_ComputeArea <- rpois(NumSubj, 16700)
cerebellum_Volume <- rpois(NumSubj, 14000)
L_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
L_lingual_gyrus_Volume <- rpois(NumSubj, 11000)
R_lingual_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_lingual_gyrus_Volume <- rpois(NumSubj, 12000)
L_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3600)
L_fusiform_gyrus_Volume <- rpois(NumSubj, 11000)
R_fusiform_gyrus_ComputeArea <- rpois(NumSubj, 3300)
R_fusiform_gyrus_Volume <- rpois(NumSubj, 10000)

Sex <- ifelse(runif(NumSubj)<.5, 0, 1)

Weight <- as.integer(rnorm(NumSubj, 80, 10))

Age <- as.integer(rnorm(NumSubj, 62, 10))

# Diagnosis
Dx <- c(rep("PD", 100), rep("HC", 100), rep("SWEDD", 82))

# Genetics
chr12_rs34637584_GT <- c(ifelse(runif(100)<.3, 0, 1), ifelse(runif(100)<.6, 0, 1), ifelse(runif(82)<.4, 0, 1))                              # NumSubj Bernoulli trials

chr17_rs11868035_GT <- c(ifelse(runif(100)<.7, 0, 1), ifelse(runif(100)<.4, 0, 1), ifelse(runif(82)<.5, 0, 1))                              # NumSubj Bernoulli trials

# Clinical          # rpois(NumSubj, 15) + rpois(NumSubj, 6)
UPDRS_part_I <- c( ifelse(runif(100)<.7, 0, 1) + ifelse(runif(100) < .7, 0, 1), 

ifelse(runif(100)<.6, 0, 1)+ ifelse(runif(100)<.6, 0, 1), 

ifelse(runif(82)<.4, 0, 1)+ ifelse(runif(82)<.4, 0, 1) )

UPDRS_part_II <- c(sample.int(20, 100, replace=T), sample.int(14, 100, replace=T), 

sample.int(18, 82, replace=T) )

UPDRS_part_III <- c(sample.int(30, 100, replace=T), sample.int(20, 100, replace=T), 

           sample.int(25, 82, replace=T) )

# Time: VisitTime - done automatically below in aggregator

# Data (putting all components together)
sim_PD_Data <- cbind(
        rep(Cases, each= NumTime),                 # Cases
        rep(L_caudate_ComputeArea, each= NumTime), # Imaging
        rep(Sex, each= NumTime),                   # Demographics
        rep(Weight, each= NumTime), 
        rep(Age, each= NumTime), 
        rep(Dx, each= NumTime),                    # Dx
        rep(chr12_rs34637584_GT, each= NumTime),   # Genetics
        rep(chr17_rs11868035_GT, each= NumTime), 
        rep(UPDRS_part_I, each= NumTime),          # Clinical
        rep(UPDRS_part_II, each= NumTime), 
        rep(UPDRS_part_III, each= NumTime), 
        rep(c(0, 6, 12, 18), NumSubj)                # Time
)


# Assign the column names

colnames(sim_PD_Data) <- c(
"Cases", 
"L_caudate_ComputeArea", 
"Sex", "Weight", "Age", 
"Dx", "chr12_rs34637584_GT", "chr17_rs11868035_GT", 
"UPDRS_part_I", "UPDRS_part_II", "UPDRS_part_III", 
"Time"
)

# some QC
summary(sim_PD_Data)

##     Cases           L_caudate_ComputeArea     Sex               Weight         
##  Length:1128        Length:1128           Length:1128        Length:1128       
##  Class :character   Class :character      Class :character   Class :character  
##  Mode  :character   Mode  :character      Mode  :character   Mode  :character  
##      Age                 Dx            chr12_rs34637584_GT chr17_rs11868035_GT
##  Length:1128        Length:1128        Length:1128         Length:1128        
##  Class :character   Class :character   Class :character    Class :character   
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character   
##  UPDRS_part_I       UPDRS_part_II      UPDRS_part_III         Time          
##  Length:1128        Length:1128        Length:1128        Length:1128       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

dim(sim_PD_Data)

## [1] 1128   12

head(sim_PD_Data)

##      Cases L_caudate_ComputeArea Sex Weight Age  Dx   chr12_rs34637584_GT
## [1,] "2"   "624"                 "0" "89"   "75" "PD" "0"                
## [2,] "2"   "624"                 "0" "89"   "75" "PD" "0"                
## [3,] "2"   "624"                 "0" "89"   "75" "PD" "0"                
## [4,] "2"   "624"                 "0" "89"   "75" "PD" "0"                
## [5,] "3"   "582"                 "0" "77"   "60" "PD" "1"                
## [6,] "3"   "582"                 "0" "77"   "60" "PD" "1"                
##      chr17_rs11868035_GT UPDRS_part_I UPDRS_part_II UPDRS_part_III Time
## [1,] "0"                 "2"          "14"          "7"            "0" 
## [2,] "0"                 "2"          "14"          "7"            "6" 
## [3,] "0"                 "2"          "14"          "7"            "12"
## [4,] "0"                 "2"          "14"          "7"            "18"
## [5,] "0"                 "0"          "15"          "10"           "0" 
## [6,] "0"                 "0"          "15"          "10"           "6"

# hist(PPMI$Age, freq=FALSE, right=FALSE, ylim = c(0,0.05))
# lines(density(as.numeric(as.data.frame(sim_PD_Data)$Age)), lwd=2, col="blue")
# legend("topright", c("Raw Data", "Simulated Data"), fill=c("black", "blue"))

x <- PPMI$Age
fit <- density(as.numeric(as.data.frame(sim_PD_Data)$Age))

plot_ly(x = x, type = "histogram", name = "Histogram (Raw Age)") %>% 
    add_trace(x = fit$x, y = fit$y, type = "scatter", mode = "lines", 
              fill = "tozeroy", yaxis = "y2", name = "Density (Simulated Age)") %>% 
    layout(title='Observed and Simulated Ages', yaxis2 = list(overlaying = "y", side = "right"))

# Save Results
# Write out (save) the result to a file that can be shared
write.table(sim_PD_Data, "output_data.csv", sep=", ", row.names=FALSE, col.names=TRUE)

3 Appendix

3.1 Tidyverse

The Tidyverse represents a suite of integrated R packages that provide support for data science and Big Data analytics, including functionality for data import (readr), data manipulation (dplyr), data visualization (ggplot2), expanded data frames (tibble), data tidying (tidyr), and functional programming (purrr). These learning modules provide introduction to tidyverse.

3.2 Additional `R` documentation and resources

The Software Carpentry Foundation provides useful Programming with R and R for Reproducible Scientific Analysis materials.
A very gentle stats intro using R Book (Verzani).
Online Quick-R examples (StatsMethods).
R-tutor Introduction.
R project Introduction.
UCLA ITS/IDRE R Resources.

3.3 HTML SOCR Data Import

SOCR Datasets can automatically be downloaded into the R environment using the following protocol, which uses the Parkinson’s Disease dataset as an example:

library(rvest)
# Loading required package: xml2
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_PD_BiomedBigMetadata") # UMich SOCR Data
# wiki_url <- read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_PD_BiomedBigMetadata") # UCLA SOCR Data
html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body" role="main">\n\t\t\t<a id="top"></a>\n\ ...

pd_data <- html_table(html_nodes(wiki_url, "table")[[1]])
head(pd_data); summary(pd_data)

## # A tibble: 6 × 33
##   Cases L_caudate_ComputeArea L_caudate_Volume R_caudate_ComputeArea
##   <int>                 <int>            <int>                 <int>
## 1     2                   597              767                   855
## 2     2                   597              767                   855
## 3     2                   597              767                   855
## 4     2                   597              767                   855
## 5     3                   604              873                   935
## 6     3                   604              873                   935
## # ℹ 29 more variables: R_caudate_Volume <int>, L_putamen_ComputeArea <int>,
## #   L_putamen_Volume <int>, R_putamen_ComputeArea <int>,
## #   R_putamen_Volume <int>, L_hippocampus_ComputeArea <int>,
## #   L_hippocampus_Volume <int>, R_hippocampus_ComputeArea <int>,
## #   R_hippocampus_Volume <int>, cerebellum_ComputeArea <int>,
## #   cerebellum_Volume <int>, L_lingual_gyrus_ComputeArea <int>,
## #   L_lingual_gyrus_Volume <int>, R_lingual_gyrus_ComputeArea <int>, …

##      Cases       L_caudate_ComputeArea L_caudate_Volume R_caudate_ComputeArea
##  Min.   :  2.0   Min.   :525.0         Min.   :719.0    Min.   :795.0        
##  1st Qu.:158.0   1st Qu.:582.0         1st Qu.:784.0    1st Qu.:875.0        
##  Median :363.5   Median :600.0         Median :800.0    Median :897.0        
##  Mean   :346.1   Mean   :600.4         Mean   :800.3    Mean   :894.5        
##  3rd Qu.:504.0   3rd Qu.:619.0         3rd Qu.:819.0    3rd Qu.:916.0        
##  Max.   :692.0   Max.   :667.0         Max.   :890.0    Max.   :977.0        
##  R_caudate_Volume L_putamen_ComputeArea L_putamen_Volume R_putamen_ComputeArea
##  Min.   : 916     Min.   : 815.0        Min.   :1298     Min.   :1198         
##  1st Qu.: 979     1st Qu.: 879.0        1st Qu.:1376     1st Qu.:1276         
##  Median : 998     Median : 897.5        Median :1400     Median :1302         
##  Mean   :1001     Mean   : 898.9        Mean   :1400     Mean   :1300         
##  3rd Qu.:1022     3rd Qu.: 919.0        3rd Qu.:1427     3rd Qu.:1321         
##  Max.   :1094     Max.   :1003.0        Max.   :1507     Max.   :1392         
##  R_putamen_Volume L_hippocampus_ComputeArea L_hippocampus_Volume
##  Min.   :2846     Min.   :1203              Min.   :3036        
##  1st Qu.:2959     1st Qu.:1277              1st Qu.:3165        
##  Median :3000     Median :1300              Median :3200        
##  Mean   :3000     Mean   :1302              Mean   :3198        
##  3rd Qu.:3039     3rd Qu.:1325              3rd Qu.:3228        
##  Max.   :3148     Max.   :1422              Max.   :3381        
##  R_hippocampus_ComputeArea R_hippocampus_Volume cerebellum_ComputeArea
##  Min.   :1414              Min.   :3634         Min.   :16378         
##  1st Qu.:1479              1st Qu.:3761         1st Qu.:16617         
##  Median :1504              Median :3802         Median :16699         
##  Mean   :1504              Mean   :3799         Mean   :16700         
##  3rd Qu.:1529              3rd Qu.:3833         3rd Qu.:16784         
##  Max.   :1602              Max.   :4013         Max.   :17096         
##  cerebellum_Volume L_lingual_gyrus_ComputeArea L_lingual_gyrus_Volume
##  Min.   :13680     Min.   :3136                Min.   :10709         
##  1st Qu.:13933     1st Qu.:3262                1st Qu.:10943         
##  Median :13996     Median :3299                Median :11007         
##  Mean   :14002     Mean   :3300                Mean   :11010         
##  3rd Qu.:14077     3rd Qu.:3333                3rd Qu.:11080         
##  Max.   :14370     Max.   :3469                Max.   :11488         
##  R_lingual_gyrus_ComputeArea R_lingual_gyrus_Volume
##  Min.   :3135                Min.   :11679         
##  1st Qu.:3258                1st Qu.:11935         
##  Median :3294                Median :12001         
##  Mean   :3296                Mean   :12008         
##  3rd Qu.:3338                3rd Qu.:12079         
##  Max.   :3490                Max.   :12324         
##  L_fusiform_gyrus_ComputeArea L_fusiform_gyrus_Volume
##  Min.   :3446                 Min.   :10682          
##  1st Qu.:3554                 1st Qu.:10947          
##  Median :3594                 Median :11016          
##  Mean   :3598                 Mean   :11011          
##  3rd Qu.:3637                 3rd Qu.:11087          
##  Max.   :3763                 Max.   :11394          
##  R_fusiform_gyrus_ComputeArea R_fusiform_gyrus_Volume      Sex        
##  Min.   :3094                 Min.   : 9736           Min.   :0.0000  
##  1st Qu.:3260                 1st Qu.: 9928           1st Qu.:0.0000  
##  Median :3296                 Median : 9994           Median :1.0000  
##  Mean   :3299                 Mean   : 9996           Mean   :0.5851  
##  3rd Qu.:3332                 3rd Qu.:10058           3rd Qu.:1.0000  
##  Max.   :3443                 Max.   :10235           Max.   :1.0000  
##      Weight            Age             Dx            chr12_rs34637584_GT
##  Min.   : 51.00   Min.   :31.00   Length:1128        Min.   :0.000      
##  1st Qu.: 71.00   1st Qu.:54.00   Class :character   1st Qu.:0.000      
##  Median : 78.50   Median :61.00   Mode  :character   Median :1.000      
##  Mean   : 78.45   Mean   :60.64                      Mean   :0.539      
##  3rd Qu.: 84.00   3rd Qu.:68.00                      3rd Qu.:1.000      
##  Max.   :109.00   Max.   :87.00                      Max.   :1.000      
##  chr17_rs11868035_GT  UPDRS_part_I   UPDRS_part_II    UPDRS_part_III 
##  Min.   :0.0000      Min.   :0.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:0.0000      1st Qu.:0.000   1st Qu.: 5.000   1st Qu.: 6.00  
##  Median :0.0000      Median :1.000   Median : 9.000   Median :13.00  
##  Mean   :0.4184      Mean   :0.773   Mean   : 8.879   Mean   :13.02  
##  3rd Qu.:1.0000      3rd Qu.:1.000   3rd Qu.:13.000   3rd Qu.:18.00  
##  Max.   :1.0000      Max.   :2.000   Max.   :20.000   Max.   :30.00  
##       Time     
##  Min.   : 0.0  
##  1st Qu.: 4.5  
##  Median : 9.0  
##  Mean   : 9.0  
##  3rd Qu.:13.5  
##  Max.   :18.0

Also see the SMHS Simulation Primer.

3.4 `R` Debugging

Most programs that give incorrect results are impacted by logical errors. When errors (bugs, exceptions) occur, we need to explore deeper – this procedure to identify and fix bugs is “debugging”.

R tools for debugging: traceback(), debug() browser() trace() recover()

traceback(): Failing R functions report to the screen immediately the run-time errors. Calling traceback() shows the place where the error occurred. The traceback() function prints the list of functions that were called before the error occurred. The stacked function calls are printed in reverse order.

f1 <- function(x) { r<- x-g1(x); `R` }

g1 <- function(y) { r<-y*h1(y); `R` }

h1 <- function(z) { r<-log(z); if(r<10) r^2 else   r^3}

f1(-1)

## Warning in log(z): NaNs produced

## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed

traceback()   
3:  h(y)

## Error in h(y): could not find function "h"

2: g(x)

## Error in g(x): could not find function "g"

1: f(-1)

## Error in f(-1): could not find function "f"

debug() - traceback() does not tell you where the error is. To find out which line causes the error, we may step through the function using debug().

debug(foo) flags the function foo() for debugging. undebug(foo) unflags the function. When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, the function suspends and the user can interact with the R shell. This allows us to inspect a function line-by-line.

Example: compute sum of squared error SS.

## compute sum of squares   
SS <- function(mu, x) { 
  d<-x-mu; 
  d2<-d^2; 
  ss<-sum(d2);  
  ss 
}  
set.seed(100);  
x<-rnorm(100); 
SS(1, x)

## to debug  
debug(SS); SS(1, x)

## debugging in: SS(1, x)
## debug: {
##     d <- x - mu
##     d2 <- d^2
##     ss <- sum(d2)
##     ss
## }
## debug: d <- x - mu
## debug: d2 <- d^2
## debug: ss <- sum(d2)
## debug: ss
## exiting from: SS(1, x)

## [1] 202.5615

In the debugging shell (“Browse[1]>”), users can:

Enter n (next) executes the current line and prints the next one;
Typing c (continue) executes the rest of the function without stopping;
Enter Q quits the debugging;
Enter ls() list all objects in the local environment;
Enter an object name or print(
) tells the current value of an object.
```
debug(SS)
SS(1, x)
```
```
## debugging in: SS(1, x)
## debug: {
##     d <- x - mu
##     d2 <- d^2
##     ss <- sum(d2)
##     ss
## }
## debug: d <- x - mu
## debug: d2 <- d^2
## debug: ss <- sum(d2)
## debug: ss
## exiting from: SS(1, x)
```
```
## [1] 202.5615
```
Browse[1]> n
debug: d <- x - mu ## the next command
Browse[1]> ls() ## current environment [1] “mu” “x” ## there is no d
Browse[1]> n ## go one step debug: d2 <- d^2 ## the next command
Browse[1]> ls() ## current environment [1] “d” “mu” “x” ## d has been created
Browse[1]> d[1:3] ## first three elements of d [1] -1.5021924 -0.8684688 -1.0789171
Browse[1]> hist(d) ## histogram of d
Browse[1]> where ## current position in call stack where 1: SS(1, x)
Browse[1]> n
debug: ss <- sum(d2)
Browse[1]> Q ## quit
```
undebug(SS)         ## remove debug label, stop debugging process  
SS(1, x)                ## now call SS again will without debugging  
```
You can label a function for debugging while debugging another function.
```
f <- function(x) { 
  r<-x-g(x); 
  `R` 
}  
g <- function(y) { 
  r<-y*h(y); 
  `R` 
}  
h <- function(z) { 
  r<-log(z); 
  if(r<10) r^2 
  else r^3 
}  

debug(f)            # ## If you only debug f, you will not go into g
f(-1)
```
```
## Warning in log(z): NaNs produced
```
```
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed
```
Browse[1]> n
Browse[1]> n

But, we can also label g and h for debugging when we debug f

f(-1)
Browse[1]> n
Browse[1]> debug(g)
Browse[1]> debug(h)
Browse[1]> n

Inserting a call to browser() in a function will pause the execution of a function at the point where browser() is called. Similar to using debug(), except that you can control where execution gets paused. Here is another example.
```
h <- function(z) {
  browser()     ## a breakpoint inserted here 
  r<-log(z)
  if(r<10)  r^2 
  else r^3
}

f(-1)
```
```
## Error in if (r < 10) r^2 else r^3: missing value where TRUE/FALSE needed
```
Browse[1]> ls() Browse[1]> z
Browse[1]> n
Browse[1]> n
Browse[1]> ls()
Browse[1]> c

Calling trace() on a function allows inserting new code into a function.

as.list(body(h))
trace(“h”, quote(
if(is.nan(r))
{browser()}), at=3, print=FALSE)
f(1)
f(-1)

trace(“h”, quote(if(z<0) {z<-1}), at=2, print=FALSE)
f(-1)
untrace()

During the debugging process, recover() allows checking the status of variables in upper level functions. recover() can be used as an error handler using options() (e.g. options(error=recover)). When functions throw exceptions, execution stops at the point of failure. Browsing the function calls and examining the environment may indicate the source of the problem.

SOCR Resource Visitor number

Statistical Software	Advantages	Disadvantages
R	R is actively maintained (\(\ge 100,000\) developers, \(\ge 15K\) packages). Excellent connectivity to various types of data and other systems. Versatile for solving problems in many domains. It’s free, open-source code. Anybody can access/review/extend the source code. `R` is very stable and reliable. If you change or redistribute the `R` source code, you have to make those changes available for anybody else to use. `R` runs anywhere (platform agnostic). Extensibility: `R` supports extensions, e.g., for data manipulation, statistical modeling, and graphics. Active and engaged community supports R. Unparalleled question-and-answer (Q&A) websites. `R` connects with other languages (Java/C/JavaScript/Python/Fortran) & database systems, and other programs, SAS, SPSS, etc. Other packages have add-ons to connect with R. SPSS has incorporated a link to R, and SAS has protocols to move data and graphics between the two packages.	Mostly scripting language. Steeper learning curve
SAS	Large datasets. Commonly used in business & Government	Expensive. Somewhat dated programming language. Expensive/proprietary
Stata	Easy statistical analyses	Mostly classical stats
SPSS	Appropriate for beginners Simple interfaces	Weak in more cutting edge statistical procedures lacking in robust methods and survey methods

DSPA2: Data Science and Predictive Analytics (UMich HS650)

Chapter 1: Introduction

SOCR/MIDAS (Ivo Dinov)

September 2025