| SOCR ≫ | DSPA ≫ | DSPA2 Topics ≫ |
Depending on the specific hardware, operating system, and versions of
R and Python installed on the computer, some
of the following settings may be useful to support the subsequent
integration of R, Python, and
C/C++.
# install.packages("reticulate")
library(reticulate)
library(plotly)
# specify the path of the Python version that you want to use
#py_path = "C:/Users/Dinov/Anaconda3/" # manual
py_path = Sys.which("python3") # automated
# use_python(py_path, required = T)
# Sys.setenv(RETICULATE_PYTHON = "C:/Users/Dinov/Anaconda3/")
sys <- import("sys", convert = TRUE)In this chapter, we will discuss some technical details about data formats, streaming, optimization of computation, and distributed deployment of optimized learning algorithms. Chapter 13 includes a deep dive into the mathematical and computational aspects of function optimization.
The Internet of Things (IoT) leads to a paradigm shift of scientific inference - from static data interrogated in a batch or distributed environment to an on-demand service-based Cloud computing. Here, we will demonstrate how to work with specialized datasets, data-streams, and SQL databases, as well as develop and assess on-the-fly data modeling, classification, prediction and forecasting methods. Important examples to keep in mind throughout this chapter include high-frequency data delivered real time in hospital ICU’s (e.g., microsecond Electroencephalography signals, EEGs), dynamically changing stock market data (e.g., Dow Jones Industrial Average Index, DJI), and weather patterns.
We will present (1) format conversion and working with XML, SQL,
JSON, CSV, SAS, SQL, noSQL, Google BigQuery service, and other data
objects, (2) visualization of bioinformatics and network data, (3)
protocols for managing, classifying and predicting outcomes from data
streams, (4) strategies for optimization, improvement of computational
performance, parallel (MPI) and graphics (GPU) computing, (5) processing
of very large datasets, and (6) electronic R-markdown
notebook facilitating R integration with Python, C/C++,
Java, and other languages.
Unlike the case-studies we saw in the previous chapters, some real
world data may not always be nicely formatted, e.g., as CSV files. We
must collect, arrange, wrangle, and harmonize scattered information to
generate computable data objects that can be further processed by
various techniques. Data wrangling and preprocessing may take over 80%
of the time researchers spend interrogating complex multi-source data
archives. The following procedures will enhance your skills collecting
and handling heterogeneous real world data. Multiple examples of
handling long-and-wide data, messy and tidy data, and data cleaning
strategies can be found in this JSS
Tidy Data article by Hadley Wickham.
The R package rio imports and exports various types of
file formats, e.g., tab-separated (.tsv), comma-separated
(.csv), JSON (.json), Stata
(.dta), SPSS (.sav and .por),
Microsoft Excel (.xls and .xlsx), Weka
(.arff), and SAS (.sas7bdat and
.xpt) file types.
There are three core functions in the rio package:
import(), convert(), and
export(). They are intuitive, easy to understand, and
efficient to execute. Take Stata (.dta) files as an example. Let’s first
download a dataset, 02_Nof1_Data.dta,
from our data
archive folder.
##
## Attaching package: 'rio'
## The following object is masked from 'package:plotly':
##
## export
## The following object is masked from 'package:reticulate':
##
## import
# Download the SAS .DTA file first locally
# https://umich.instructure.com/files/1760330/download?download_frd=1
# Local data can be loaded by
# nof1<-import("02_Nof1_Data.dta")
# the data can also be loaded from the server remotely as well:
# nof1<-import("https://umich.instructure.com/files/1760330/download?download_frd=1a") # .dta format
# nof1<-read.csv("https://umich.instructure.com/files/330385/download?download_frd=1")
nof1<-import("https://umich.instructure.com/files/1760330/download?download_frd=1")
str(nof1)## 'data.frame': 900 obs. of 10 variables:
## $ id : num 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ day : num 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ tx : num 1 1 0 0 1 1 0 0 1 1 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ selfeff : num 33 33 33 33 33 33 33 33 33 33 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ selfeff25: num 8 8 8 8 8 8 8 8 8 8 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ wpss : num 0.97 -0.17 0.81 -0.41 0.59 -1.16 0.3 -0.34 -0.74 -0.38 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ scssuppt : num 5 3.87 4.84 3.62 4.62 2.87 4.33 3.69 3.29 3.66 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ pmss : num 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ pmss3 : num 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ qhyact : num 53 73 23 36 21 0 21 0 73 114 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
The data is automatically stored as a data frame. Note that by
default, the package rio uses
stingAsFactors=FALSE.
rio can help us export files into any other format we
choose. To do this we have to use the export()
function.
This line of code exports the Nof1 data in xlsx
format located in the R working directory or in a
user-provided directory. Mac users may have a problem exporting
*.xlsx files using rio because of a lack of a
zip tool, but still can output other formats such as “.csv”. An
alternative strategy to save an xlsx file is to use package
xlsx with default row.name=TRUE.
rio also provides a one-step process to convert-and-save
data into alternative formats. The following simple code allows us to
convert and save the 02_Nof1_Data.dta file we just
downloaded into a CSV file.
# convert("02_Nof1_Data.dta", "02_Nof1_Data.csv")
convert("C:/Users/IvoD/Desktop/02_Nof1.xlsx", "C:/Users/IvoD/Desktop/02_Nof1_Data.csv")You can see a new CSV file pop-up in the working directory. Similar transformations are available for other data formats and types.
Look at the CDC Behavioral Risk Factor Surveillance System (BRFSS) Data, 2013-2015. This file (BRFSS_2013_2014_2015.zip) includes the combined landline and cell phone dataset exported from SAS V9.3 using the XPT transport format. This dataset contains \(330\) variables. The data can be imported into SPSS or STATA, however, some of the variable labels may get truncated in the process of converting to the XPT format.
Caution: The size of this compressed (ZIP) file is over \(315MB\)! Let’s start by ingesting the data for a couple of years and explore some of the information.
## [1] Inf
## [1] 60
options(timeout=600) # extend the timeout from 1 to 10-minutes, to allow lower speed connections.
download.file("https://www.socr.umich.edu/data/DSPA/BRFSS_2013_2014_2015.zip", pathToZip)
# let's just pull two of the 3 years of data (2013 and 2015)
brfss_2013 <- sasxport.get(unzip(pathToZip)[1])## Processing SAS dataset LLCP2013 ..
## Processing SAS dataset LLCP2015 ..
## [1] 491773 336
## 685687656 bytes
# summary(brfss_2013[1:1000, 1:10]) # subsample the data
# report the summaries for
summary(brfss_2013$has_plan)## Length Class Mode
## 0 NULL NULL
## 1 2 3 4 5 6 7 8 9 NA's
## 376451 39151 7683 9510 1546 2693 9130 37054 8530 25
Next, we can try to use logistic regression to find out if self-reported race/ethnicity predicts the binary outcome of having a health care plan.
brfss_2013$has_plan <- brfss_2013$hlthpln1 == 1
system.time(
gml1 <- glm(has_plan ~ as.factor(x.race), data=brfss_2013,
family=binomial)
) # report execution time## user system elapsed
## 0.89 0.16 1.08
##
## Call:
## glm(formula = has_plan ~ as.factor(x.race), family = binomial,
## data = brfss_2013)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.293549 0.005649 406.044 <2e-16 ***
## as.factor(x.race)2 -0.721676 0.014536 -49.647 <2e-16 ***
## as.factor(x.race)3 -0.511776 0.032974 -15.520 <2e-16 ***
## as.factor(x.race)4 -0.329489 0.031726 -10.386 <2e-16 ***
## as.factor(x.race)5 -1.119329 0.060153 -18.608 <2e-16 ***
## as.factor(x.race)6 -0.544458 0.054535 -9.984 <2e-16 ***
## as.factor(x.race)7 -0.510452 0.030346 -16.821 <2e-16 ***
## as.factor(x.race)8 -1.332005 0.012915 -103.138 <2e-16 ***
## as.factor(x.race)9 -0.582204 0.030604 -19.024 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 353371 on 491747 degrees of freedom
## Residual deviance: 342497 on 491739 degrees of freedom
## (25 observations deleted due to missingness)
## AIC: 342515
##
## Number of Fisher Scoring iterations: 5
We can also examine the odds (rather the log odds ratio, LOR) of having a health care plan (HCP) by race (R). The LORs are calculated for two-dimensional arrays, separately for each race level (presence of health care plan (HCP) is binary, whereas race (R) has \(9\) levels, \(R_1, R_2, ..., R_9\)). For example, the odds ratio of having a HCP for \(R_1:R_2\) is:
\[ OR(R_1:R_2) = \frac{\frac{P \left( HCP \mid R_1 \right)}{1 - P \left( HCP \mid R_1 \right)}}{\frac{P \left( HCP \mid R_2 \right)}{1 - P \left( HCP \mid R_2 \right)}} .\]
# install.packages("vcd")
# load the vcd package to compute the LOR
library("vcd")
# Note that by default *loddsratio* computes the Log odds ratio (OR). The raw OR = exp(loddsratio)
lor_HCP_by_R <- loddsratio(has_plan ~ as.factor(x.race), data = brfss_2013)
lor_HCP_by_R## log odds ratios for has_plan and as.factor(x.race)
##
## 1:2 2:3 3:4 4:5 5:6 6:7
## -0.72167619 0.20990061 0.18228646 -0.78984000 0.57487142 0.03400611
## 7:8 8:9
## -0.82155382 0.74980101
Now, let’s see an example of querying a database containing
structured relational records. A query is a machine instruction
(typically represented as text) sent by a user to a remote database
requesting a specific database operation (e.g., search or summary). One
database communication protocol relies on SQL (Structured query
language). MySQL is an instance of a database management system that
supports SQL communication that many web applications utilize, e.g.,
YouTube, Flickr, Wikipedia, biological
databases like GO, ensembl, etc. Below is an example
of an SQL query using the package RMySQL. An alternative
way to interface an SQL database is by using the package
RODBC. Let’s look at a couple of DB query examples. The
first one uses the UCSC Genomics
SQL server (genome-mysql.cse.ucsc.edu) and the second one uses a
local client-side database service.
# install.packages("DBI", "RMySQL")
# install.packages("RODBC"); library(RODBC)
library(DBI); library(RMySQL)
library("stringr"); library("dplyr"); library("readr")
library(magrittr)
ucscGenomeConn <- dbConnect(MySQL(),
user='genome',
dbname='hg19',
host='genome-mysql.cse.ucsc.edu')
# dbGetInfo(ucscGenomeConn); dbListResults(ucscGenomeConn)
result <- dbGetQuery(ucscGenomeConn,"show databases;");
# List the DB tables
allTables <- dbListTables(ucscGenomeConn); length(allTables)## [1] 12725
## [1] "bin" "matches" "misMatches" "repMatches" "nCount"
## [6] "qNumInsert" "qBaseInsert" "tNumInsert" "tBaseInsert" "strand"
## [11] "qName" "qSize" "qStart" "qEnd" "tName"
## [16] "tSize" "tStart" "tEnd" "blockCount" "blockSizes"
## [21] "qStarts" "tStarts"
## bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert
## 1 585 530 4 0 23 3 41 3
## 2 585 3355 17 0 109 9 67 9
## 3 585 4156 14 0 83 16 18 2
## 4 585 4667 9 0 68 21 42 3
## 5 585 5180 14 0 167 10 38 1
## 6 585 468 5 0 14 0 0 0
## tBaseInsert strand qName qSize qStart qEnd tName tSize tStart
## 1 898 - 225995_x_at 637 5 603 chr1 249250621 14361
## 2 11621 - 225035_x_at 3635 0 3548 chr1 249250621 14381
## 3 93 - 226340_x_at 4318 3 4274 chr1 249250621 14399
## 4 5743 - 1557034_s_at 4834 48 4834 chr1 249250621 14406
## 5 29 - 231811_at 5399 0 5399 chr1 249250621 19688
## 6 0 - 236841_at 487 0 487 chr1 249250621 27542
## tEnd blockCount
## 1 15816 5
## 2 29483 17
## 3 18745 18
## 4 24893 23
## 5 25078 11
## 6 28029 1
## blockSizes
## 1 93,144,229,70,21,
## 2 73,375,71,165,303,360,198,661,201,1,260,250,74,73,98,155,163,
## 3 690,10,32,33,376,4,5,15,5,11,7,41,277,859,141,51,443,1253,
## 4 99,352,286,24,49,14,6,5,8,149,14,44,98,12,10,355,837,59,8,1500,133,624,58,
## 5 131,26,1300,6,4,11,4,7,358,3359,155,
## 6 487,
## qStarts
## 1 34,132,278,541,611,
## 2 87,165,540,647,818,1123,1484,1682,2343,2545,2546,2808,3058,3133,3206,3317,3472,
## 3 44,735,746,779,813,1190,1195,1201,1217,1223,1235,1243,1285,1564,2423,2565,2617,3062,
## 4 0,99,452,739,764,814,829,836,842,851,1001,1016,1061,1160,1173,1184,1540,2381,2441,2450,3951,4103,4728,
## 5 0,132,159,1460,1467,1472,1484,1489,1497,1856,5244,
## 6 0,
## tStarts
## 1 14361,14454,14599,14968,15795,
## 2 14381,14454,14969,15075,15240,15543,15903,16104,16853,17054,17232,17492,17914,17988,18267,24736,29320,
## 3 14399,15089,15099,15131,15164,15540,15544,15549,15564,15569,15580,15587,15628,15906,16857,16998,17049,17492,
## 4 14406,20227,20579,20865,20889,20938,20952,20958,20963,20971,21120,21134,21178,21276,21288,21298,21653,22492,22551,22559,24059,24211,24835,
## 5 19688,19819,19845,21145,21151,21155,21166,21170,21177,21535,24923,
## 6 27542,
# Select a subset, fetch the data, and report the quantiles
subsetQuery <- dbSendQuery(ucscGenomeConn, "select * from affyU133Plus2 where misMatches between 1 and 3")
affySmall <- fetch(subsetQuery); dim(affySmall)## [1] 500 22
## 0% 25% 50% 75% 100%
## 1 1 2 2 3
## [1] TRUE
# Another query
# install.packages("magrittr")
bedFile <- "C:/Users/IvoD/Desktop/repUCSC.bed"
subsetQuery1 <- dbSendQuery(ucscGenomeConn,'select genoName,genoStart,genoEnd,repName,swScore, strand,
repClass, repFamily from rmsk')
subsetQuery1_df <- dbFetch(subsetQuery1 , n=100) %>%
dplyr::mutate(genoName =
stringr::str_replace(genoName,'chr','')) %>%
readr::write_tsv(bedFile, col_names=T)
message('saved: ', bedFile)
dbClearResult(subsetQuery1)## [1] TRUE
# Another DB query: Select a specific DB subset
subsetQuery2 <- dbSendQuery(ucscGenomeConn,
"select * from affyU133Plus2 where misMatches between 1 and 4")
affyU133Plus2MisMatch <- fetch(subsetQuery2)
quantile(affyU133Plus2MisMatch$misMatches)## 0% 25% 50% 75% 100%
## 1 1 2 3 4
## [1] TRUE
## [1] 100 22
## bin matches misMatches repMatches nCount
## Min. : 13.0 Min. : 397 Min. :1.00 Min. :0 Min. : 0.00
## 1st Qu.:834.0 1st Qu.: 960 1st Qu.:1.00 1st Qu.:0 1st Qu.: 0.00
## Median :844.0 Median :1732 Median :2.00 Median :0 Median : 4.00
## Mean :730.9 Mean :2249 Mean :2.12 Mean :0 Mean : 14.72
## 3rd Qu.:863.5 3rd Qu.:2832 3rd Qu.:3.00 3rd Qu.:0 3rd Qu.: 19.00
## Max. :878.0 Max. :7475 Max. :4.00 Max. :0 Max. :111.00
## qNumInsert qBaseInsert tNumInsert tBaseInsert
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 5
## Median : 1.00 Median : 1.00 Median : 3.00 Median : 4384
## Mean : 2.33 Mean : 9.39 Mean : 5.95 Mean : 20048
## 3rd Qu.: 3.00 3rd Qu.: 8.00 3rd Qu.: 8.25 3rd Qu.: 26387
## Max. :12.00 Max. :197.00 Max. :29.00 Max. :346335
## strand qName qSize qStart
## Length:100 Length:100 Min. : 410 Min. : 0.00
## Class :character Class :character 1st Qu.: 972 1st Qu.: 0.00
## Mode :character Mode :character Median :1797 Median : 0.00
## Mean :2294 Mean : 5.46
## 3rd Qu.:2840 3rd Qu.: 2.25
## Max. :7512 Max. :331.00
## qEnd tName tSize tStart
## Min. : 400 Length:100 Min. :249250621 Min. :32256024
## 1st Qu.: 972 Class :character 1st Qu.:249250621 1st Qu.:33327868
## Median :1797 Mode :character Median :249250621 Median :35460512
## Mean :2281 Mean :249250621 Mean :35218635
## 3rd Qu.:2834 3rd Qu.:249250621 3rd Qu.:36778842
## Max. :7480 Max. :249250621 Max. :38510902
## tEnd blockCount blockSizes qStarts
## Min. :32256813 Min. : 1.00 Length:100 Length:100
## 1st Qu.:33333986 1st Qu.: 3.00 Class :character Class :character
## Median :35497250 Median : 8.00 Mode :character Mode :character
## Mean :35240948 Mean : 8.91
## 3rd Qu.:36783220 3rd Qu.:12.25
## Max. :38512306 Max. :38.00
## tStarts
## Length:100
## Class :character
## Mode :character
##
##
##
# Once done, clear and close the connections
# dbClearResult(dbListResults(ucscGenomeConn)[[1]])
dbDisconnect(ucscGenomeConn)## [1] TRUE
Depending upon the DB server, to complete the above database SQL commands, it may require access and/or specific user credentials. The example below can be done by all users, as it relies only on local DB services.
##
## Attaching package: 'RSQLite'
## The following object is masked from 'package:RMySQL':
##
## isIdCurrent
# generate an empty DB stored in RAM
myConnection <- dbConnect(RSQLite::SQLite(), ":memory:")
myConnection## <SQLiteConnection>
## Path: :memory:
## Extensions: TRUE
## character(0)
# Add tables to the local SQL DB
data(USArrests); dbWriteTable(myConnection, "USArrests", USArrests)
dbWriteTable(myConnection, "brfss_2013", brfss_2013)
dbWriteTable(myConnection, "brfss_2015", brfss_2015)
# Check again the DB content
# allTables <- dbListTables(myConnection); length(allTables); allTables
head(dbListFields(myConnection, "brfss_2013"))## [1] "x.state" "fmonth" "idate" "imonth" "iday" "iyear"
## [1] "rcsrace1" "rchisla1" "rcsbirth" "typeinds" "typework" "has_plan"
## [1] "USArrests" "brfss_2013" "brfss_2015"
# Retrieve the entire DB table (for the smaller USArrests table)
head(dbGetQuery(myConnection, "SELECT * FROM USArrests"))## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 8.1 294 80 31.0
## 4 8.8 190 50 19.5
## 5 9.0 276 91 40.6
## 6 7.9 204 78 38.7
# Retrieve just the average of one feature
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests")
head(myQuery)## avg(Assault)
## 1 170.76
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests GROUP BY UrbanPop"); myQuery## avg(Assault)
## 1 48.00
## 2 81.00
## 3 152.00
## 4 211.50
## 5 271.00
## 6 190.00
## 7 83.00
## 8 109.00
## 9 109.00
## 10 120.00
## 11 57.00
## 12 56.00
## 13 236.00
## 14 188.00
## 15 186.00
## 16 102.00
## 17 156.00
## 18 113.00
## 19 122.25
## 20 229.50
## 21 151.00
## 22 231.50
## 23 172.00
## 24 145.00
## 25 255.00
## 26 120.00
## 27 110.00
## 28 204.00
## 29 237.50
## 30 252.00
## 31 147.50
## 32 149.00
## 33 254.00
## 34 174.00
## 35 159.00
## 36 276.00
# Or do it in batches (for the much larger brfss_2013 and brfss_2015 tables)
myQuery <- dbGetQuery(myConnection, "SELECT * FROM brfss_2013")
# extract data in chunks of 2 rows, note: dbGetQuery vs. dbSendQuery
# myQuery <- dbSendQuery(myConnection, "SELECT * FROM brfss_2013")
# fetch2 <- dbFetch(myQuery, n = 2); fetch2
# do we have other cases in the DB remaining?
# extract all remaining data
# fetchRemaining <- dbFetch(myQuery, n = -1);fetchRemaining
# We should have all data in DB now
# dbHasCompleted(myQuery)
# compute the average (poorhlth) grouping by Insurance (hlthpln1)
# Try some alternatives: numadult nummen numwomen genhlth physhlth menthlth poorhlth hlthpln1
myQuery1_13 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2013 GROUP BY hlthpln1"); myQuery1_13## avg(poorhlth)
## 1 56.25466
## 2 53.99962
## 3 58.85072
## 4 66.26757
# Compare 2013 vs. 2015: Health grouping by Insurance
myQuery1_15 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2015 GROUP BY hlthpln1"); myQuery1_15## avg(poorhlth)
## 1 55.75539
## 2 55.49487
## 3 61.35445
## 4 67.62125
## avg(poorhlth)
## 1 0.4992652
## 2 -1.4952515
## 3 -2.5037326
## 4 -1.3536797
The SparQL Protocol and RDF Query Language (SparQL) is a semantic database query language for RDF (Resource Description Framework) data objects. SparQL queries consist of (1) triple patterns, (2) conjunctions, and (3) disjunctions.
The following example uses SparQL to query the prevalence of tuberculosis from the WikiData SparQL server and plot it on a World geographic map.
# install.packages("SPARQL"); install.packages("rworldmap"); install.packages("spam")
library(SPARQL)
library(ggplot2)
library(rworldmap)
library(plotly)
# SparQL Format
# https://www.w3.org/2009/Talks/0615-qbe/
# W3C Turtle - Terse RDF Triple Language:
# https://www.w3.org/TeamSubmission/turtle/#sec-examples
# RDF (Resource Description Framework) is a graphical data model of (subject, predicate, object) triples representing:
# "subject-node to predicate arc to object arc"
# Resources are represented with URIs, which can be abbreviated as prefixed names
# Objects are literals: strings, integers, booleans, etc.
# Syntax
# URIs: <http://example.com/resource> or prefix:name
# Literals:
# "plain string" "13.4""
# xsd:float, or
# "string with language" @en
# Triple: pref:subject other:predicate "object".
wdqs <- "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
query = "PREFIX wd: <http://www.wikidata.org/entity/>
# prefix declarations
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX qualifier: <http://www.wikidata.org/prop/qualifier/>
PREFIX statement: <http://www.wikidata.org/prop/statement/>
# result clause
SELECT DISTINCT ?countryLabel ?ISO3Code ?latlon ?prevalence ?doid ?year
# query pattern against RDF data
# Q36956 Hansen's disease, Leprosy https://www.wikidata.org/wiki/Q36956
# Q15750965 - Alzheimer's disease: https://www.wikidata.org/wiki/Q15750965
# Influenza - Q2840: https://www.wikidata.org/wiki/Q2840
# Q12204 - tuberculosis https://www.wikidata.org/wiki/Q12204
# P699 Alzheimer's Disease ontology ID
# P1193 prevalence: https://www.wikidata.org/wiki/Property:P1193
# P17 country: https://www.wikidata.org/wiki/Property:P17
# Country ISO-3 code: https://www.wikidata.org/wiki/Property:P298
# Location: https://www.wikidata.org/wiki/Property:P625
# Wikidata docs: https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual
WHERE {
wd:Q12204 wdt:P699 ?doid ; # tuberculosis P699 Disease ontology ID
p:P1193 ?prevalencewithProvenance .
?prevalencewithProvenance qualifier:P17 ?country ;
qualifier:P585 ?year ;
statement:P1193 ?prevalence .
?country wdt:P625 ?latlon ;
rdfs:label ?countryLabel ;
wdt:P298 ?ISO3Code ;
wdt:P297 ?ISOCode .
FILTER (lang(?countryLabel) = \"en\")
# FILTER constraints use boolean conditions to filter out unwanted query results.
# Shortcut: a semicolon (;) can be used to separate two triple patterns that share the same disease (?country is the shared subject above.)
# rdfs:label is a common predicate for giving a human-friendly label to a resource.
}
# query modifiers
ORDER BY DESC(?population)
"
# install.packages("WikidataQueryServiceR")
library(WikidataQueryServiceR)
library(mapproj)
results <- query_wikidata(sparql_query=query); head(results)## # A tibble: 6 × 6
## countryLabel ISO3Code latlon prevalence doid year
## <chr> <chr> <chr> <dbl> <chr> <dttm>
## 1 Norway NOR Point(10.68333333… 0.0001 DOID… 2014-01-01 00:00:00
## 2 United States USA Point(-98.5795 39… 0.000029 DOID… 2014-01-01 00:00:00
## 3 France FRA Point(2.0 47.0) 0.0000012 DOID… 2014-01-01 00:00:00
## 4 Iceland ISL Point(-19.0 65.0) 0.0000043 DOID… 2014-01-01 00:00:00
## 5 Ecuador ECU Point(-78.0 -1.0) 0.00078 DOID… 2014-01-01 00:00:00
## 6 Suriname SUR Point(-56.0 4.0) 0.00044 DOID… 2014-01-01 00:00:00
# OLD: results <- SPARQL(url=wdqs, query=query); head(results)
# resultMatrix <- as.matrix(results$results)
# View(resultMatrix)
# sPDF <- joinCountryData2Map(results$results, joinCode = "ISO3", nameJoinColumn = "ISO3Code")
# join the data to the geo map
sPDF <- joinCountryData2Map(results, joinCode = "ISO3", nameJoinColumn = "ISO3Code")## 6 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 237 codes from the map weren't represented in your data
#map the data with no legend
mapParams <- mapCountryData( sPDF
, nameColumnToPlot="prevalence"
# Alternatively , nameColumnToPlot="doid"
, addLegend='FALSE',
mapTitle="Prevalence of Tuberculosis Worldwide"
)
#add a modified legend using the same initial parameters as mapCountryData
do.call( addMapLegend, c( mapParams
, legendLabels="all"
, legendWidth=0.5
))
text(1, -120, "Partial view of Tuberculosis Prevalence in the World", cex=1)#do.call( addMapLegendBoxes
# , c(mapParams
# , list(
# legendText=c('Chile', 'US','Brazil','Argentina'),
# x='bottom',title="AD Prevalence",horiz=TRUE)))
# Alternatively: mapCountryData(sPDF, nameColumnToPlot="prevalence", oceanCol="darkblue", missingCountryCol="white")
View(getMap())
# write.csv(file = "C:/Users/Map.csv", getMap())
# Alternative Plot_ly Geo-map
df_cities <- results
df_cities$popm <- paste(df_cities$countryLabel, df_cities$ISO3Code, "prevalence=", df_cities$prevalence)
df_cities$quart <- with(df_cities, cut(prevalence, quantile(prevalence), include.lowest = T))
levels(df_cities$quart) <- paste(c("1st", "2nd", "3rd", "4th"), "Quantile")
df_cities$quart <- as.ordered(df_cities$quart)
df_cities <- tidyr::separate(df_cities, latlon, into = c("long", "lat"), sep = " ")
df_cities$long <- gsub("Point\\(", "", df_cities$long)
df_cities$lat <- gsub("\\)", "", df_cities$lat)
head(df_cities)## # A tibble: 6 × 9
## countryLabel ISO3Code long lat prevalence doid year popm
## <chr> <chr> <chr> <chr> <dbl> <chr> <dttm> <chr>
## 1 Norway NOR 10.68… 59.9… 0.0001 DOID… 2014-01-01 00:00:00 Norw…
## 2 United States USA -98.5… 39.8… 0.000029 DOID… 2014-01-01 00:00:00 Unit…
## 3 France FRA 2.0 47.0 0.0000012 DOID… 2014-01-01 00:00:00 Fran…
## 4 Iceland ISL -19.0 65.0 0.0000043 DOID… 2014-01-01 00:00:00 Icel…
## 5 Ecuador ECU -78.0 -1.0 0.00078 DOID… 2014-01-01 00:00:00 Ecua…
## 6 Suriname SUR -56.0 4.0 0.00044 DOID… 2014-01-01 00:00:00 Suri…
## # ℹ 1 more variable: quart <ord>
ge <- list(scope = 'world', showland = TRUE, landcolor = toRGB("lightgray"),
subunitwidth = 1, countrywidth = 1, subunitcolor = toRGB("white"), countrycolor = toRGB("white"))
plot_geo(df_cities, lon = ~long, lat = ~lat, text = ~popm, mode="markers",
marker = ~list(size = 20, line = list(width = 0.1)),
color = ~quart, locationmode = 'country names') %>%
layout(geo = ge, title = 'Prevalence of Tuberculosis Worldwide')Similar geographic maps may be displayed fo other processes, e.g.,
malaria. Note that such data are sparse as they are pulled dynamically
from wikidata.
Below is an example of a geo-map showing the global locations and population-size of various cities in millions.
library(plotly)
df_cities <- world.cities
df_cities$popm <- paste(df_cities$country.etc, df_cities$name, "Pop", round(df_cities$pop/1e6,2), " million")
df_cities$quart <- with(df_cities, cut(pop, quantile(pop), include.lowest = T))
levels(df_cities$quart) <- paste(c("1st", "2nd", "3rd", "4th"), "Quantile")
df_cities$quart <- as.ordered(df_cities$quart)
ge <- list(scope = 'world', showland = TRUE, landcolor = toRGB("lightgray"),
subunitwidth = 1, countrywidth = 1, subunitcolor = toRGB("white"), countrycolor = toRGB("white"))
plot_geo(df_cities, lon = ~long, lat = ~lat, text = ~popm, mode="markers",
marker = ~list(size = sqrt(pop/10000) + 1, line = list(width = 0.1)),
color = ~quart, locationmode = 'country names') %>%
layout(geo = ge, title = 'City Populations (Worldwide)')