SOCR ≫ DSPA ≫ DSPA2 Topics ≫

1 Specialized Machine Learning Topics

Depending on the specific hardware, operating system, and versions of R and Python installed on the computer, some of the following settings may be useful to support the subsequent integration of R, Python, and C/C++.

# install.packages("reticulate")
library(reticulate)
library(plotly)
# specify the path of the Python version that you want to use
#py_path = "C:/Users/Dinov/Anaconda3/"  # manual
py_path = Sys.which("python3")       # automated
# use_python(py_path, required = T)
# Sys.setenv(RETICULATE_PYTHON = "C:/Users/Dinov/Anaconda3/")
sys <- import("sys", convert = TRUE)

In this chapter, we will discuss some technical details about data formats, streaming, optimization of computation, and distributed deployment of optimized learning algorithms. Chapter 13 includes a deep dive into the mathematical and computational aspects of function optimization.

The Internet of Things (IoT) leads to a paradigm shift of scientific inference - from static data interrogated in a batch or distributed environment to an on-demand service-based Cloud computing. Here, we will demonstrate how to work with specialized datasets, data-streams, and SQL databases, as well as develop and assess on-the-fly data modeling, classification, prediction and forecasting methods. Important examples to keep in mind throughout this chapter include high-frequency data delivered real time in hospital ICU’s (e.g., microsecond Electroencephalography signals, EEGs), dynamically changing stock market data (e.g., Dow Jones Industrial Average Index, DJI), and weather patterns.

We will present (1) format conversion and working with XML, SQL, JSON, CSV, SAS, SQL, noSQL, Google BigQuery service, and other data objects, (2) visualization of bioinformatics and network data, (3) protocols for managing, classifying and predicting outcomes from data streams, (4) strategies for optimization, improvement of computational performance, parallel (MPI) and graphics (GPU) computing, (5) processing of very large datasets, and (6) electronic R-markdown notebook facilitating R integration with Python, C/C++, Java, and other languages.

1.1 Working with specialized data and databases

Unlike the case-studies we saw in the previous chapters, some real world data may not always be nicely formatted, e.g., as CSV files. We must collect, arrange, wrangle, and harmonize scattered information to generate computable data objects that can be further processed by various techniques. Data wrangling and preprocessing may take over 80% of the time researchers spend interrogating complex multi-source data archives. The following procedures will enhance your skills collecting and handling heterogeneous real world data. Multiple examples of handling long-and-wide data, messy and tidy data, and data cleaning strategies can be found in this JSS Tidy Data article by Hadley Wickham.

1.1.1 Data format conversion

The R package rio imports and exports various types of file formats, e.g., tab-separated (.tsv), comma-separated (.csv), JSON (.json), Stata (.dta), SPSS (.sav and .por), Microsoft Excel (.xls and .xlsx), Weka (.arff), and SAS (.sas7bdat and .xpt) file types.

There are three core functions in the rio package: import(), convert(), and export(). They are intuitive, easy to understand, and efficient to execute. Take Stata (.dta) files as an example. Let’s first download a dataset, 02_Nof1_Data.dta, from our data archive folder.

# install.packages("rio")
# install_formats()
library(rio)
## 
## Attaching package: 'rio'
## The following object is masked from 'package:plotly':
## 
##     export
## The following object is masked from 'package:reticulate':
## 
##     import
# Download the SAS .DTA file first locally
# https://umich.instructure.com/files/1760330/download?download_frd=1
# Local data can be loaded by
# nof1<-import("02_Nof1_Data.dta")
# the data can also be loaded from the server remotely as well:
# nof1<-import("https://umich.instructure.com/files/1760330/download?download_frd=1a") # .dta format
# nof1<-read.csv("https://umich.instructure.com/files/330385/download?download_frd=1")
nof1<-import("https://umich.instructure.com/files/1760330/download?download_frd=1")
str(nof1)
## 'data.frame':    900 obs. of  10 variables:
##  $ id       : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ day      : num  1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ tx       : num  1 1 0 0 1 1 0 0 1 1 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ selfeff  : num  33 33 33 33 33 33 33 33 33 33 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ selfeff25: num  8 8 8 8 8 8 8 8 8 8 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ wpss     : num  0.97 -0.17 0.81 -0.41 0.59 -1.16 0.3 -0.34 -0.74 -0.38 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ scssuppt : num  5 3.87 4.84 3.62 4.62 2.87 4.33 3.69 3.29 3.66 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ pmss     : num  4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ pmss3    : num  1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"
##  $ qhyact   : num  53 73 23 36 21 0 21 0 73 114 ...
##   ..- attr(*, "format.stata")= chr "%12.0g"

The data is automatically stored as a data frame. Note that by default, the package rio uses stingAsFactors=FALSE.

rio can help us export files into any other format we choose. To do this we have to use the export() function.

export(nof1, "C:/Users/IvoD/Desktop/02_Nof1.xlsx")

This line of code exports the Nof1 data in xlsx format located in the R working directory or in a user-provided directory. Mac users may have a problem exporting *.xlsx files using rio because of a lack of a zip tool, but still can output other formats such as “.csv”. An alternative strategy to save an xlsx file is to use package xlsx with default row.name=TRUE.

rio also provides a one-step process to convert-and-save data into alternative formats. The following simple code allows us to convert and save the 02_Nof1_Data.dta file we just downloaded into a CSV file.

# convert("02_Nof1_Data.dta", "02_Nof1_Data.csv")
convert("C:/Users/IvoD/Desktop/02_Nof1.xlsx", "C:/Users/IvoD/Desktop/02_Nof1_Data.csv")

You can see a new CSV file pop-up in the working directory. Similar transformations are available for other data formats and types.

1.1.2 Querying data in SQL databases

Look at the CDC Behavioral Risk Factor Surveillance System (BRFSS) Data, 2013-2015. This file (BRFSS_2013_2014_2015.zip) includes the combined landline and cell phone dataset exported from SAS V9.3 using the XPT transport format. This dataset contains \(330\) variables. The data can be imported into SPSS or STATA, however, some of the variable labels may get truncated in the process of converting to the XPT format.

Caution: The size of this compressed (ZIP) file is over \(315MB\)! Let’s start by ingesting the data for a couple of years and explore some of the information.

# install.packages("Hmisc")
library(Hmisc)

memory.size(max=T)
## [1] Inf
pathToZip <- tempfile()
getOption('timeout')  # Check default timeout: [1] 60
## [1] 60
options(timeout=600)  # extend the timeout from 1 to 10-minutes, to allow lower speed connections.
download.file("https://www.socr.umich.edu/data/DSPA/BRFSS_2013_2014_2015.zip", pathToZip)
# let's just pull two of the 3 years of data (2013 and 2015)
brfss_2013 <- sasxport.get(unzip(pathToZip)[1])
## Processing SAS dataset LLCP2013   ..
brfss_2015 <- sasxport.get(unzip(pathToZip)[3])
## Processing SAS dataset LLCP2015   ..
dim(brfss_2013); object.size(brfss_2013)
## [1] 491773    336
## 685687656 bytes
# summary(brfss_2013[1:1000, 1:10])  # subsample the data

# report the summaries for
summary(brfss_2013$has_plan)
## Length  Class   Mode 
##      0   NULL   NULL
brfss_2013$x.race <- as.factor(brfss_2013$x.race)
summary(brfss_2013$x.race)
##      1      2      3      4      5      6      7      8      9   NA's 
## 376451  39151   7683   9510   1546   2693   9130  37054   8530     25
# clean up
unlink(pathToZip)

Next, we can try to use logistic regression to find out if self-reported race/ethnicity predicts the binary outcome of having a health care plan.

brfss_2013$has_plan <- brfss_2013$hlthpln1 == 1

system.time(
  gml1 <- glm(has_plan ~ as.factor(x.race), data=brfss_2013,
              family=binomial)
)   # report execution time
##    user  system elapsed 
##    0.89    0.16    1.08
summary(gml1)
## 
## Call:
## glm(formula = has_plan ~ as.factor(x.race), family = binomial, 
##     data = brfss_2013)
## 
## Coefficients:
##                     Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)         2.293549   0.005649  406.044   <2e-16 ***
## as.factor(x.race)2 -0.721676   0.014536  -49.647   <2e-16 ***
## as.factor(x.race)3 -0.511776   0.032974  -15.520   <2e-16 ***
## as.factor(x.race)4 -0.329489   0.031726  -10.386   <2e-16 ***
## as.factor(x.race)5 -1.119329   0.060153  -18.608   <2e-16 ***
## as.factor(x.race)6 -0.544458   0.054535   -9.984   <2e-16 ***
## as.factor(x.race)7 -0.510452   0.030346  -16.821   <2e-16 ***
## as.factor(x.race)8 -1.332005   0.012915 -103.138   <2e-16 ***
## as.factor(x.race)9 -0.582204   0.030604  -19.024   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 353371  on 491747  degrees of freedom
## Residual deviance: 342497  on 491739  degrees of freedom
##   (25 observations deleted due to missingness)
## AIC: 342515
## 
## Number of Fisher Scoring iterations: 5

We can also examine the odds (rather the log odds ratio, LOR) of having a health care plan (HCP) by race (R). The LORs are calculated for two-dimensional arrays, separately for each race level (presence of health care plan (HCP) is binary, whereas race (R) has \(9\) levels, \(R_1, R_2, ..., R_9\)). For example, the odds ratio of having a HCP for \(R_1:R_2\) is:

\[ OR(R_1:R_2) = \frac{\frac{P \left( HCP \mid R_1 \right)}{1 - P \left( HCP \mid R_1 \right)}}{\frac{P \left( HCP \mid R_2 \right)}{1 - P \left( HCP \mid R_2 \right)}} .\]

# install.packages("vcd")
# load the vcd package to compute the LOR
library("vcd")

# Note that by default *loddsratio* computes the Log odds ratio (OR). The raw OR = exp(loddsratio)
lor_HCP_by_R <- loddsratio(has_plan ~ as.factor(x.race), data = brfss_2013)
lor_HCP_by_R
## log odds ratios for has_plan and as.factor(x.race) 
## 
##         1:2         2:3         3:4         4:5         5:6         6:7 
## -0.72167619  0.20990061  0.18228646 -0.78984000  0.57487142  0.03400611 
##         7:8         8:9 
## -0.82155382  0.74980101

Now, let’s see an example of querying a database containing structured relational records. A query is a machine instruction (typically represented as text) sent by a user to a remote database requesting a specific database operation (e.g., search or summary). One database communication protocol relies on SQL (Structured query language). MySQL is an instance of a database management system that supports SQL communication that many web applications utilize, e.g., YouTube, Flickr, Wikipedia, biological databases like GO, ensembl, etc. Below is an example of an SQL query using the package RMySQL. An alternative way to interface an SQL database is by using the package RODBC. Let’s look at a couple of DB query examples. The first one uses the UCSC Genomics SQL server (genome-mysql.cse.ucsc.edu) and the second one uses a local client-side database service.

# install.packages("DBI", "RMySQL")
# install.packages("RODBC"); library(RODBC)
library(DBI); library(RMySQL)
library("stringr"); library("dplyr"); library("readr")
library(magrittr)

ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.cse.ucsc.edu')
# dbGetInfo(ucscGenomeConn); dbListResults(ucscGenomeConn)

result <- dbGetQuery(ucscGenomeConn,"show databases;");

# List the DB tables
allTables <- dbListTables(ucscGenomeConn); length(allTables)
## [1] 12725
# Get dimensions of a table, read and report the head
dbListFields(ucscGenomeConn, "affyU133Plus2")
##  [1] "bin"         "matches"     "misMatches"  "repMatches"  "nCount"     
##  [6] "qNumInsert"  "qBaseInsert" "tNumInsert"  "tBaseInsert" "strand"     
## [11] "qName"       "qSize"       "qStart"      "qEnd"        "tName"      
## [16] "tSize"       "tStart"      "tEnd"        "blockCount"  "blockSizes" 
## [21] "qStarts"     "tStarts"
affyData <- dbReadTable(ucscGenomeConn, "affyU133Plus2"); head(affyData)
##   bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert
## 1 585     530          4          0     23          3          41          3
## 2 585    3355         17          0    109          9          67          9
## 3 585    4156         14          0     83         16          18          2
## 4 585    4667          9          0     68         21          42          3
## 5 585    5180         14          0    167         10          38          1
## 6 585     468          5          0     14          0           0          0
##   tBaseInsert strand        qName qSize qStart qEnd tName     tSize tStart
## 1         898      -  225995_x_at   637      5  603  chr1 249250621  14361
## 2       11621      -  225035_x_at  3635      0 3548  chr1 249250621  14381
## 3          93      -  226340_x_at  4318      3 4274  chr1 249250621  14399
## 4        5743      - 1557034_s_at  4834     48 4834  chr1 249250621  14406
## 5          29      -    231811_at  5399      0 5399  chr1 249250621  19688
## 6           0      -    236841_at   487      0  487  chr1 249250621  27542
##    tEnd blockCount
## 1 15816          5
## 2 29483         17
## 3 18745         18
## 4 24893         23
## 5 25078         11
## 6 28029          1
##                                                                   blockSizes
## 1                                                          93,144,229,70,21,
## 2              73,375,71,165,303,360,198,661,201,1,260,250,74,73,98,155,163,
## 3                 690,10,32,33,376,4,5,15,5,11,7,41,277,859,141,51,443,1253,
## 4 99,352,286,24,49,14,6,5,8,149,14,44,98,12,10,355,837,59,8,1500,133,624,58,
## 5                                       131,26,1300,6,4,11,4,7,358,3359,155,
## 6                                                                       487,
##                                                                                                  qStarts
## 1                                                                                    34,132,278,541,611,
## 2                        87,165,540,647,818,1123,1484,1682,2343,2545,2546,2808,3058,3133,3206,3317,3472,
## 3                   44,735,746,779,813,1190,1195,1201,1217,1223,1235,1243,1285,1564,2423,2565,2617,3062,
## 4 0,99,452,739,764,814,829,836,842,851,1001,1016,1061,1160,1173,1184,1540,2381,2441,2450,3951,4103,4728,
## 5                                                     0,132,159,1460,1467,1472,1484,1489,1497,1856,5244,
## 6                                                                                                     0,
##                                                                                                                                      tStarts
## 1                                                                                                             14361,14454,14599,14968,15795,
## 2                                     14381,14454,14969,15075,15240,15543,15903,16104,16853,17054,17232,17492,17914,17988,18267,24736,29320,
## 3                               14399,15089,15099,15131,15164,15540,15544,15549,15564,15569,15580,15587,15628,15906,16857,16998,17049,17492,
## 4 14406,20227,20579,20865,20889,20938,20952,20958,20963,20971,21120,21134,21178,21276,21288,21298,21653,22492,22551,22559,24059,24211,24835,
## 5                                                                         19688,19819,19845,21145,21151,21155,21166,21170,21177,21535,24923,
## 6                                                                                                                                     27542,
# Select a subset, fetch the data, and report the quantiles
subsetQuery <- dbSendQuery(ucscGenomeConn, "select * from affyU133Plus2 where misMatches between 1 and 3")
affySmall <- fetch(subsetQuery); dim(affySmall)
## [1] 500  22
quantile(affySmall$misMatches)
##   0%  25%  50%  75% 100% 
##    1    1    2    2    3
dbClearResult(subsetQuery)
## [1] TRUE
# Another query
# install.packages("magrittr")
bedFile <- "C:/Users/IvoD/Desktop/repUCSC.bed"
subsetQuery1 <- dbSendQuery(ucscGenomeConn,'select genoName,genoStart,genoEnd,repName,swScore, strand,
                  repClass, repFamily from rmsk')
subsetQuery1_df <- dbFetch(subsetQuery1 , n=100) %>%
        dplyr::mutate(genoName =
                        stringr::str_replace(genoName,'chr','')) %>%
        readr::write_tsv(bedFile, col_names=T)
message('saved: ', bedFile)
dbClearResult(subsetQuery1)
## [1] TRUE
# Another DB query: Select a specific DB subset
subsetQuery2 <- dbSendQuery(ucscGenomeConn,
            "select * from affyU133Plus2 where misMatches between 1 and 4")
affyU133Plus2MisMatch <- fetch(subsetQuery2)
quantile(affyU133Plus2MisMatch$misMatches)
##   0%  25%  50%  75% 100% 
##    1    1    2    3    4
affyU133Plus2MisMatchTiny_100x22 <- fetch(subsetQuery2, n=100)
dbClearResult(subsetQuery2)
## [1] TRUE
dim(affyU133Plus2MisMatchTiny_100x22)
## [1] 100  22
summary(affyU133Plus2MisMatchTiny_100x22)
##       bin           matches       misMatches     repMatches     nCount      
##  Min.   : 13.0   Min.   : 397   Min.   :1.00   Min.   :0    Min.   :  0.00  
##  1st Qu.:834.0   1st Qu.: 960   1st Qu.:1.00   1st Qu.:0    1st Qu.:  0.00  
##  Median :844.0   Median :1732   Median :2.00   Median :0    Median :  4.00  
##  Mean   :730.9   Mean   :2249   Mean   :2.12   Mean   :0    Mean   : 14.72  
##  3rd Qu.:863.5   3rd Qu.:2832   3rd Qu.:3.00   3rd Qu.:0    3rd Qu.: 19.00  
##  Max.   :878.0   Max.   :7475   Max.   :4.00   Max.   :0    Max.   :111.00  
##    qNumInsert     qBaseInsert       tNumInsert     tBaseInsert    
##  Min.   : 0.00   Min.   :  0.00   Min.   : 0.00   Min.   :     0  
##  1st Qu.: 0.00   1st Qu.:  0.00   1st Qu.: 1.00   1st Qu.:     5  
##  Median : 1.00   Median :  1.00   Median : 3.00   Median :  4384  
##  Mean   : 2.33   Mean   :  9.39   Mean   : 5.95   Mean   : 20048  
##  3rd Qu.: 3.00   3rd Qu.:  8.00   3rd Qu.: 8.25   3rd Qu.: 26387  
##  Max.   :12.00   Max.   :197.00   Max.   :29.00   Max.   :346335  
##     strand             qName               qSize          qStart      
##  Length:100         Length:100         Min.   : 410   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.: 972   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :1797   Median :  0.00  
##                                        Mean   :2294   Mean   :  5.46  
##                                        3rd Qu.:2840   3rd Qu.:  2.25  
##                                        Max.   :7512   Max.   :331.00  
##       qEnd         tName               tSize               tStart        
##  Min.   : 400   Length:100         Min.   :249250621   Min.   :32256024  
##  1st Qu.: 972   Class :character   1st Qu.:249250621   1st Qu.:33327868  
##  Median :1797   Mode  :character   Median :249250621   Median :35460512  
##  Mean   :2281                      Mean   :249250621   Mean   :35218635  
##  3rd Qu.:2834                      3rd Qu.:249250621   3rd Qu.:36778842  
##  Max.   :7480                      Max.   :249250621   Max.   :38510902  
##       tEnd            blockCount     blockSizes          qStarts         
##  Min.   :32256813   Min.   : 1.00   Length:100         Length:100        
##  1st Qu.:33333986   1st Qu.: 3.00   Class :character   Class :character  
##  Median :35497250   Median : 8.00   Mode  :character   Mode  :character  
##  Mean   :35240948   Mean   : 8.91                                        
##  3rd Qu.:36783220   3rd Qu.:12.25                                        
##  Max.   :38512306   Max.   :38.00                                        
##    tStarts         
##  Length:100        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# Once done, clear and close the connections
# dbClearResult(dbListResults(ucscGenomeConn)[[1]])
dbDisconnect(ucscGenomeConn)
## [1] TRUE

Depending upon the DB server, to complete the above database SQL commands, it may require access and/or specific user credentials. The example below can be done by all users, as it relies only on local DB services.

# install.packages("RSQLite")
library("RSQLite")
## 
## Attaching package: 'RSQLite'
## The following object is masked from 'package:RMySQL':
## 
##     isIdCurrent
# generate an empty DB stored in RAM
myConnection <- dbConnect(RSQLite::SQLite(), ":memory:")
myConnection
## <SQLiteConnection>
##   Path: :memory:
##   Extensions: TRUE
dbListTables(myConnection)
## character(0)
# Add tables to the local SQL DB
data(USArrests); dbWriteTable(myConnection, "USArrests", USArrests)
dbWriteTable(myConnection, "brfss_2013", brfss_2013)
dbWriteTable(myConnection, "brfss_2015", brfss_2015)

# Check again the DB content
# allTables <- dbListTables(myConnection); length(allTables); allTables
head(dbListFields(myConnection, "brfss_2013"))
## [1] "x.state" "fmonth"  "idate"   "imonth"  "iday"    "iyear"
tail(dbListFields(myConnection, "brfss_2013"))
## [1] "rcsrace1" "rchisla1" "rcsbirth" "typeinds" "typework" "has_plan"
dbListTables(myConnection);
## [1] "USArrests"  "brfss_2013" "brfss_2015"
# Retrieve the entire DB table (for the smaller USArrests table)
head(dbGetQuery(myConnection, "SELECT * FROM USArrests"))
##   Murder Assault UrbanPop Rape
## 1   13.2     236       58 21.2
## 2   10.0     263       48 44.5
## 3    8.1     294       80 31.0
## 4    8.8     190       50 19.5
## 5    9.0     276       91 40.6
## 6    7.9     204       78 38.7
# Retrieve just the average of one feature
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests")
head(myQuery)
##   avg(Assault)
## 1       170.76
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests GROUP BY UrbanPop"); myQuery
##    avg(Assault)
## 1         48.00
## 2         81.00
## 3        152.00
## 4        211.50
## 5        271.00
## 6        190.00
## 7         83.00
## 8        109.00
## 9        109.00
## 10       120.00
## 11        57.00
## 12        56.00
## 13       236.00
## 14       188.00
## 15       186.00
## 16       102.00
## 17       156.00
## 18       113.00
## 19       122.25
## 20       229.50
## 21       151.00
## 22       231.50
## 23       172.00
## 24       145.00
## 25       255.00
## 26       120.00
## 27       110.00
## 28       204.00
## 29       237.50
## 30       252.00
## 31       147.50
## 32       149.00
## 33       254.00
## 34       174.00
## 35       159.00
## 36       276.00
# Or do it in batches (for the much larger brfss_2013 and brfss_2015 tables)
myQuery <- dbGetQuery(myConnection, "SELECT * FROM brfss_2013")

# extract data in chunks of 2 rows, note: dbGetQuery vs. dbSendQuery
# myQuery <- dbSendQuery(myConnection, "SELECT * FROM brfss_2013")
# fetch2 <- dbFetch(myQuery, n = 2); fetch2
# do we have other cases in the DB remaining?
# extract all remaining data
# fetchRemaining <- dbFetch(myQuery, n = -1);fetchRemaining
# We should have all data in DB now
# dbHasCompleted(myQuery)

# compute the average (poorhlth) grouping by Insurance (hlthpln1)
# Try some alternatives: numadult nummen numwomen genhlth physhlth menthlth poorhlth hlthpln1
myQuery1_13 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2013 GROUP BY hlthpln1"); myQuery1_13
##   avg(poorhlth)
## 1      56.25466
## 2      53.99962
## 3      58.85072
## 4      66.26757
# Compare 2013 vs. 2015: Health grouping by Insurance
myQuery1_15 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2015 GROUP BY hlthpln1"); myQuery1_15
##   avg(poorhlth)
## 1      55.75539
## 2      55.49487
## 3      61.35445
## 4      67.62125
myQuery1_13 - myQuery1_15
##   avg(poorhlth)
## 1     0.4992652
## 2    -1.4952515
## 3    -2.5037326
## 4    -1.3536797
# reset the DB query
# dbClearResult(myQuery)

# clean up
dbDisconnect(myConnection)

1.1.3 SparQL Queries

The SparQL Protocol and RDF Query Language (SparQL) is a semantic database query language for RDF (Resource Description Framework) data objects. SparQL queries consist of (1) triple patterns, (2) conjunctions, and (3) disjunctions.

The following example uses SparQL to query the prevalence of tuberculosis from the WikiData SparQL server and plot it on a World geographic map.

# install.packages("SPARQL"); install.packages("rworldmap"); install.packages("spam")
library(SPARQL)
library(ggplot2)
library(rworldmap)
library(plotly)

# SparQL Format
# https://www.w3.org/2009/Talks/0615-qbe/

# W3C Turtle - Terse RDF Triple Language:
# https://www.w3.org/TeamSubmission/turtle/#sec-examples

# RDF (Resource Description Framework) is a graphical data model of (subject, predicate, object) triples representing:
# "subject-node to predicate arc to object arc"
# Resources are represented with URIs, which can be abbreviated as prefixed names
# Objects are literals: strings, integers, booleans, etc.
# Syntax
#    URIs: <http://example.com/resource> or prefix:name
#    Literals:
#             "plain string" "13.4""
#             xsd:float, or
#             "string with language" @en
#    Triple: pref:subject other:predicate "object".

wdqs <- "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
query = "PREFIX wd: <http://www.wikidata.org/entity/>
    # prefix declarations
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX p: <http://www.wikidata.org/prop/>
    PREFIX v: <http://www.wikidata.org/prop/statement/>
    PREFIX qualifier: <http://www.wikidata.org/prop/qualifier/>
    PREFIX statement: <http://www.wikidata.org/prop/statement/>

    # result clause
    SELECT DISTINCT ?countryLabel ?ISO3Code ?latlon ?prevalence ?doid ?year

    # query pattern against RDF data
    # Q36956 Hansen's disease, Leprosy https://www.wikidata.org/wiki/Q36956
    # Q15750965 - Alzheimer's disease: https://www.wikidata.org/wiki/Q15750965
    # Influenza - Q2840: https://www.wikidata.org/wiki/Q2840
    # Q12204 - tuberculosis  https://www.wikidata.org/wiki/Q12204
    # P699 Alzheimer's Disease ontology ID
    # P1193 prevalence: https://www.wikidata.org/wiki/Property:P1193
    # P17 country: https://www.wikidata.org/wiki/Property:P17
    # Country ISO-3 code: https://www.wikidata.org/wiki/Property:P298
    # Location: https://www.wikidata.org/wiki/Property:P625

    # Wikidata docs: https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual

    WHERE {
      wd:Q12204 wdt:P699 ?doid ; # tuberculosis P699 Disease ontology ID
      p:P1193 ?prevalencewithProvenance .
      ?prevalencewithProvenance qualifier:P17 ?country ;
      qualifier:P585 ?year ;
      statement:P1193 ?prevalence .
      ?country wdt:P625 ?latlon ;
      rdfs:label ?countryLabel ;
      wdt:P298 ?ISO3Code ;
      wdt:P297 ?ISOCode .
    FILTER (lang(?countryLabel) = \"en\")
    # FILTER constraints use boolean conditions to filter out unwanted query results.
    #    Shortcut: a semicolon (;) can be used to separate two triple patterns that share the same disease (?country is the shared subject above.)
    #     rdfs:label is a common predicate for giving a human-friendly label to a resource.

    }
    # query modifiers
    ORDER BY DESC(?population)
"

# install.packages("WikidataQueryServiceR")
library(WikidataQueryServiceR)
library(mapproj)

results <- query_wikidata(sparql_query=query); head(results)
## # A tibble: 6 × 6
##   countryLabel  ISO3Code latlon             prevalence doid  year               
##   <chr>         <chr>    <chr>                   <dbl> <chr> <dttm>             
## 1 Norway        NOR      Point(10.68333333…  0.0001    DOID… 2014-01-01 00:00:00
## 2 United States USA      Point(-98.5795 39…  0.000029  DOID… 2014-01-01 00:00:00
## 3 France        FRA      Point(2.0 47.0)     0.0000012 DOID… 2014-01-01 00:00:00
## 4 Iceland       ISL      Point(-19.0 65.0)   0.0000043 DOID… 2014-01-01 00:00:00
## 5 Ecuador       ECU      Point(-78.0 -1.0)   0.00078   DOID… 2014-01-01 00:00:00
## 6 Suriname      SUR      Point(-56.0 4.0)    0.00044   DOID… 2014-01-01 00:00:00
# OLD: results <- SPARQL(url=wdqs, query=query); head(results)
# resultMatrix <- as.matrix(results$results)
# View(resultMatrix)
# sPDF <- joinCountryData2Map(results$results, joinCode = "ISO3", nameJoinColumn = "ISO3Code")

# join the data to the geo map
sPDF <- joinCountryData2Map(results, joinCode = "ISO3", nameJoinColumn = "ISO3Code")
## 6 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 237 codes from the map weren't represented in your data
#map the data with no legend
mapParams <- mapCountryData( sPDF
              , nameColumnToPlot="prevalence"
              # Alternatively , nameColumnToPlot="doid"
              , addLegend='FALSE',
              mapTitle="Prevalence of Tuberculosis Worldwide"
              )

#add a modified legend using the same initial parameters as mapCountryData
do.call( addMapLegend, c( mapParams
                        , legendLabels="all"
                        , legendWidth=0.5
                        ))
text(1, -120, "Partial view of Tuberculosis Prevalence in the World", cex=1)

#do.call( addMapLegendBoxes
#        , c(mapParams
#        , list(
#          legendText=c('Chile', 'US','Brazil','Argentina'),
#          x='bottom',title="AD Prevalence",horiz=TRUE)))

# Alternatively: mapCountryData(sPDF, nameColumnToPlot="prevalence",  oceanCol="darkblue", missingCountryCol="white")

View(getMap())
# write.csv(file = "C:/Users/Map.csv", getMap())

# Alternative Plot_ly Geo-map
df_cities <- results
df_cities$popm <- paste(df_cities$countryLabel, df_cities$ISO3Code, "prevalence=", df_cities$prevalence)
df_cities$quart <- with(df_cities, cut(prevalence, quantile(prevalence), include.lowest = T))
levels(df_cities$quart) <- paste(c("1st", "2nd", "3rd", "4th"), "Quantile")
df_cities$quart <- as.ordered(df_cities$quart)

df_cities <- tidyr::separate(df_cities, latlon, into = c("long", "lat"), sep = " ")
df_cities$long <- gsub("Point\\(", "", df_cities$long)
df_cities$lat <- gsub("\\)", "", df_cities$lat)
head(df_cities)
## # A tibble: 6 × 9
##   countryLabel  ISO3Code long   lat   prevalence doid  year                popm 
##   <chr>         <chr>    <chr>  <chr>      <dbl> <chr> <dttm>              <chr>
## 1 Norway        NOR      10.68… 59.9…  0.0001    DOID… 2014-01-01 00:00:00 Norw…
## 2 United States USA      -98.5… 39.8…  0.000029  DOID… 2014-01-01 00:00:00 Unit…
## 3 France        FRA      2.0    47.0   0.0000012 DOID… 2014-01-01 00:00:00 Fran…
## 4 Iceland       ISL      -19.0  65.0   0.0000043 DOID… 2014-01-01 00:00:00 Icel…
## 5 Ecuador       ECU      -78.0  -1.0   0.00078   DOID… 2014-01-01 00:00:00 Ecua…
## 6 Suriname      SUR      -56.0  4.0    0.00044   DOID… 2014-01-01 00:00:00 Suri…
## # ℹ 1 more variable: quart <ord>
ge <- list(scope = 'world', showland = TRUE, landcolor = toRGB("lightgray"),
           subunitwidth = 1, countrywidth = 1, subunitcolor = toRGB("white"), countrycolor = toRGB("white"))

plot_geo(df_cities, lon = ~long, lat = ~lat, text = ~popm, mode="markers",
         marker = ~list(size = 20, line = list(width = 0.1)),
         color = ~quart, locationmode = 'country names') %>%
    layout(geo = ge, title = 'Prevalence of Tuberculosis Worldwide')

Similar geographic maps may be displayed fo other processes, e.g., malaria. Note that such data are sparse as they are pulled dynamically from wikidata.

Below is an example of a geo-map showing the global locations and population-size of various cities in millions.

library(plotly)

df_cities <- world.cities
df_cities$popm <- paste(df_cities$country.etc, df_cities$name, "Pop", round(df_cities$pop/1e6,2), " million")
df_cities$quart <- with(df_cities, cut(pop, quantile(pop), include.lowest = T))
levels(df_cities$quart) <- paste(c("1st", "2nd", "3rd", "4th"), "Quantile")
df_cities$quart <- as.ordered(df_cities$quart)

ge <- list(scope = 'world', showland = TRUE, landcolor = toRGB("lightgray"),
           subunitwidth = 1, countrywidth = 1, subunitcolor = toRGB("white"), countrycolor = toRGB("white"))

plot_geo(df_cities, lon = ~long, lat = ~lat, text = ~popm, mode="markers",
         marker = ~list(size = sqrt(pop/10000) + 1, line = list(width = 0.1)),
         color = ~quart, locationmode = 'country names') %>%
    layout(geo = ge, title = 'City Populations (Worldwide)')