Depending on the specific hardware, operating system, and versions of R and Python installed on the computer, some of the following settings may be useful to support the subsequent integration of R, Python, and C/C++.
# install.packages("reticulate")
library(reticulate)
library(plotly)
# specify the path of the Python version that you want to use
#py_path = "C:/Users/Dinov/Anaconda3/" # manual
py_path = Sys.which("python3") # automated
# use_python(py_path, required = T)
# Sys.setenv(RETICULATE_PYTHON = "C:/Users/Dinov/Anaconda3/")
sys <- import("sys", convert = TRUE)
In this chapter, we will discuss some technical details about data formats, streaming, optimization of computation, and distributed deployment of optimized learning algorithms. Chapter 13 includes a deep dive into the mathematical and computational aspects of function optimization.
The Internet of Things (IoT) leads to a paradigm shift of scientific inference - from static data interrogated in a batch or distributed environment to on-demand service-based Cloud computing. Here, we will demonstrate how to work with specialized datasets, data streams, and SQL databases, as well as develop and assess on-the-fly data modeling, classification, prediction and forecasting methods. Important examples to keep in mind throughout this chapter include high-frequency data delivered in real time in hospital ICUs (e.g., microsecond electroencephalography signals, EEGs), dynamically changing stock market data (e.g., the Dow Jones Industrial Average Index, DJI), and weather patterns.
We will present (1) format conversion and working with XML, SQL, JSON, CSV, SAS, noSQL, the Google BigQuery service, and other data objects, (2) visualization of bioinformatics and network data, (3) protocols for managing, classifying and predicting outcomes from data streams, (4) strategies for optimization, improvement of computational performance, parallel (MPI) and graphics (GPU) computing, (5) processing of very large datasets, and (6) electronic R-markdown notebooks facilitating R integration with Python, C/C++, Java, and other languages.
Unlike the case-studies we saw in the previous chapters, some real
world data may not always be nicely formatted, e.g., as CSV files. We
must collect, arrange, wrangle, and harmonize scattered information to
generate computable data objects that can be further processed by
various techniques. Data wrangling and preprocessing may take over 80%
of the time researchers spend interrogating complex multi-source data
archives. The following procedures will enhance your skills collecting
and handling heterogeneous real world data. Multiple examples of
handling long-and-wide data, messy and tidy data, and data cleaning
strategies can be found in the JSS Tidy Data article by Hadley Wickham.
The R package rio imports and exports various file formats, e.g., tab-separated (.tsv), comma-separated (.csv), JSON (.json), Stata (.dta), SPSS (.sav and .por), Microsoft Excel (.xls and .xlsx), Weka (.arff), and SAS (.sas7bdat and .xpt) file types.
There are three core functions in the rio package: import(), convert(), and export(). They are intuitive, easy to understand, and efficient to execute. Take Stata (.dta) files as an example. Let's first download a dataset, 02_Nof1_Data.dta, from our data archive folder.
##
## Attaching package: 'rio'
## The following object is masked from 'package:plotly':
##
## export
## The following object is masked from 'package:reticulate':
##
## import
# Download the Stata .dta file locally first
# https://umich.instructure.com/files/1760330/download?download_frd=1
# Local data can be loaded by
# nof1<-import("02_Nof1_Data.dta")
# the data can also be loaded from the server remotely as well:
# nof1<-import("https://umich.instructure.com/files/1760330/download?download_frd=1a") # .dta format
# nof1<-read.csv("https://umich.instructure.com/files/330385/download?download_frd=1")
nof1<-import("https://umich.instructure.com/files/1760330/download?download_frd=1")
str(nof1)
## 'data.frame': 900 obs. of 10 variables:
## $ id : num 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ day : num 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ tx : num 1 1 0 0 1 1 0 0 1 1 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ selfeff : num 33 33 33 33 33 33 33 33 33 33 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ selfeff25: num 8 8 8 8 8 8 8 8 8 8 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ wpss : num 0.97 -0.17 0.81 -0.41 0.59 -1.16 0.3 -0.34 -0.74 -0.38 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ scssuppt : num 5 3.87 4.84 3.62 4.62 2.87 4.33 3.69 3.29 3.66 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ pmss : num 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ pmss3 : num 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
## $ qhyact : num 53 73 23 36 21 0 21 0 73 114 ...
## ..- attr(*, "format.stata")= chr "%12.0g"
The data is automatically stored as a data frame. Note that by default, the rio package uses stringsAsFactors=FALSE.
rio can also help us export files into any other format we choose using the export() function. The call below (see the sketch after this paragraph) exports the Nof1 data in xlsx format to the R working directory or to a user-provided directory. Mac users may have problems exporting *.xlsx files with rio because of a missing zip tool, but they can still output other formats such as .csv. An alternative strategy for saving an xlsx file is to use the xlsx package with its default row.names=TRUE.
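The omitted export call was likely of the following form (the file name/path is illustrative, not from the original text):
# write the Nof1 data frame to an Excel workbook in the working directory
export(nof1, "02_Nof1_Data.xlsx")
# alternative via the xlsx package (which uses row.names = TRUE by default)
# xlsx::write.xlsx(nof1, "02_Nof1_Data.xlsx")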
rio
also provides a one-step process to convert-and-save
data into alternative formats. The following simple code allows us to
convert and save the 02_Nof1_Data.dta
file we just
downloaded into a CSV file.
# convert("02_Nof1_Data.dta", "02_Nof1_Data.csv")
convert("C:/Users/IvoD/Desktop/02_Nof1.xlsx", "C:/Users/IvoD/Desktop/02_Nof1_Data.csv")
You can see a new CSV file pop-up in the working directory. Similar transformations are available for other data formats and types.
Look at the CDC Behavioral Risk Factor Surveillance System (BRFSS) Data, 2013-2015. This file (BRFSS_2013_2014_2015.zip) includes the combined landline and cell phone dataset exported from SAS V9.3 using the XPT transport format. This dataset contains \(330\) variables. The data can be imported into SPSS or STATA, however, some of the variable labels may get truncated in the process of converting to the XPT format.
Caution: The size of this compressed (ZIP) file is over \(315MB\)! Let’s start by ingesting the data for a couple of years and explore some of the information.
## [1] Inf
## [1] 60
options(timeout=600) # extend the timeout from 1 to 10 minutes, to accommodate slower connections
# pathToZip must point to a local destination file, e.g., pathToZip <- tempfile(fileext = ".zip")
download.file("https://www.socr.umich.edu/data/DSPA/BRFSS_2013_2014_2015.zip", pathToZip)
# let's just pull two of the 3 years of data (2013 and 2015)
library(Hmisc) # provides sasxport.get() for reading SAS XPT (transport) files
brfss_2013 <- sasxport.get(unzip(pathToZip)[1])
# brfss_2015 is loaded analogously from the corresponding extracted LLCP2015 file
## Processing SAS dataset LLCP2013 ..
## Processing SAS dataset LLCP2015 ..
## [1] 491773 336
## 685687656 bytes
# summary(brfss_2013[1:1000, 1:10]) # subsample the data
# report the summaries of the insurance indicator (has_plan, derived below) and the 9-level race variable (x.race)
summary(brfss_2013$has_plan)
## Length Class Mode
## 0 NULL NULL
## 1 2 3 4 5 6 7 8 9 NA's
## 376451 39151 7683 9510 1546 2693 9130 37054 8530 25
Next, we can try to use logistic regression to find out if self-reported race/ethnicity predicts the binary outcome of having a health care plan.
brfss_2013$has_plan <- brfss_2013$hlthpln1 == 1
system.time(
gml1 <- glm(has_plan ~ as.factor(x.race), data=brfss_2013,
family=binomial)
) # report execution time
## user system elapsed
## 0.69 0.05 1.11
##
## Call:
## glm(formula = has_plan ~ as.factor(x.race), family = binomial,
## data = brfss_2013)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.293549 0.005649 406.044 <2e-16 ***
## as.factor(x.race)2 -0.721676 0.014536 -49.647 <2e-16 ***
## as.factor(x.race)3 -0.511776 0.032974 -15.520 <2e-16 ***
## as.factor(x.race)4 -0.329489 0.031726 -10.386 <2e-16 ***
## as.factor(x.race)5 -1.119329 0.060153 -18.608 <2e-16 ***
## as.factor(x.race)6 -0.544458 0.054535 -9.984 <2e-16 ***
## as.factor(x.race)7 -0.510452 0.030346 -16.821 <2e-16 ***
## as.factor(x.race)8 -1.332005 0.012915 -103.138 <2e-16 ***
## as.factor(x.race)9 -0.582204 0.030604 -19.024 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 353371 on 491747 degrees of freedom
## Residual deviance: 342497 on 491739 degrees of freedom
## (25 observations deleted due to missingness)
## AIC: 342515
##
## Number of Fisher Scoring iterations: 5
We can also examine the odds (or rather, the log odds ratio, LOR) of having a health care plan (HCP) by race (R). The LORs are calculated for the two-dimensional array, separately for each pair of race levels (the presence of a health care plan is binary, whereas race has \(9\) levels, \(R_1, R_2, ..., R_9\)). For example, the odds ratio of having an HCP for \(R_1:R_2\) is:
\[ OR(R_1:R_2) = \frac{\frac{P \left( HCP \mid R_1 \right)}{1 - P \left( HCP \mid R_1 \right)}}{\frac{P \left( HCP \mid R_2 \right)}{1 - P \left( HCP \mid R_2 \right)}} .\]
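As an illustrative check (not part of the original chunks), this odds ratio can also be computed directly from the cross-tabulation of the health-plan indicator by race:
# manual computation of OR(R1:R2) from the 2-way table (illustrative sketch)
tab <- table(brfss_2013$has_plan, as.factor(brfss_2013$x.race))
p1 <- tab["TRUE", "1"] / sum(tab[, "1"])   # P(HCP | R1)
p2 <- tab["TRUE", "2"] / sum(tab[, "2"])   # P(HCP | R2)
(p1 / (1 - p1)) / (p2 / (1 - p2))          # OR(R1:R2)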
# install.packages("vcd")
# load the vcd package to compute the LOR
library("vcd")
# Note that by default, *loddsratio* computes the log odds ratio (LOR); the raw OR = exp(LOR)
lor_HCP_by_R <- loddsratio(has_plan ~ as.factor(x.race), data = brfss_2013)
lor_HCP_by_R
## log odds ratios for has_plan and as.factor(x.race)
##
## 1:2 2:3 3:4 4:5 5:6 6:7
## -0.72167619 0.20990061 0.18228646 -0.78984000 0.57487142 0.03400611
## 7:8 8:9
## -0.82155382 0.74980101
Now, let’s see an example of querying a database containing
structured relational records. A query is a machine instruction
(typically represented as text) sent by a user to a remote database
requesting a specific database operation (e.g., search or summary). One
database communication protocol relies on SQL (Structured query
language). MySQL is an instance of a database management system that
supports SQL communication that many web applications utilize, e.g.,
YouTube, Flickr, Wikipedia, biological
databases like GO, ensembl, etc. Below is an example
of an SQL query using the package RMySQL
. An alternative
way to interface an SQL database is by using the package
RODBC
. Let’s look at a couple of DB query examples. The
first one uses the UCSC Genomics
SQL server (genome-mysql.cse.ucsc.edu) and the second one uses a
local client-side database service.
# install.packages("DBI", "RMySQL")
# install.packages("RODBC"); library(RODBC)
library(DBI); library(RMySQL)
library("stringr"); library("dplyr"); library("readr")
library(magrittr)
ucscGenomeConn <- dbConnect(MySQL(),
user='genome',
dbname='hg19',
host='genome-mysql.cse.ucsc.edu')
# dbGetInfo(ucscGenomeConn); dbListResults(ucscGenomeConn)
result <- dbGetQuery(ucscGenomeConn,"show databases;");
# List the DB tables
allTables <- dbListTables(ucscGenomeConn); length(allTables)
## [1] 12652
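The chunks producing the next two outputs are not displayed; presumably they listed the fields of the affyU133Plus2 table and previewed its first rows, e.g.:
# list the column names of the affyU133Plus2 table and preview a few records
dbListFields(ucscGenomeConn, "affyU133Plus2")
head(dbGetQuery(ucscGenomeConn, "select * from affyU133Plus2 limit 6"))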
## [1] "bin" "matches" "misMatches" "repMatches" "nCount"
## [6] "qNumInsert" "qBaseInsert" "tNumInsert" "tBaseInsert" "strand"
## [11] "qName" "qSize" "qStart" "qEnd" "tName"
## [16] "tSize" "tStart" "tEnd" "blockCount" "blockSizes"
## [21] "qStarts" "tStarts"
## bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert
## 1 585 530 4 0 23 3 41 3
## 2 585 3355 17 0 109 9 67 9
## 3 585 4156 14 0 83 16 18 2
## 4 585 4667 9 0 68 21 42 3
## 5 585 5180 14 0 167 10 38 1
## 6 585 468 5 0 14 0 0 0
## tBaseInsert strand qName qSize qStart qEnd tName tSize tStart
## 1 898 - 225995_x_at 637 5 603 chr1 249250621 14361
## 2 11621 - 225035_x_at 3635 0 3548 chr1 249250621 14381
## 3 93 - 226340_x_at 4318 3 4274 chr1 249250621 14399
## 4 5743 - 1557034_s_at 4834 48 4834 chr1 249250621 14406
## 5 29 - 231811_at 5399 0 5399 chr1 249250621 19688
## 6 0 - 236841_at 487 0 487 chr1 249250621 27542
## tEnd blockCount
## 1 15816 5
## 2 29483 17
## 3 18745 18
## 4 24893 23
## 5 25078 11
## 6 28029 1
## blockSizes
## 1 93,144,229,70,21,
## 2 73,375,71,165,303,360,198,661,201,1,260,250,74,73,98,155,163,
## 3 690,10,32,33,376,4,5,15,5,11,7,41,277,859,141,51,443,1253,
## 4 99,352,286,24,49,14,6,5,8,149,14,44,98,12,10,355,837,59,8,1500,133,624,58,
## 5 131,26,1300,6,4,11,4,7,358,3359,155,
## 6 487,
## qStarts
## 1 34,132,278,541,611,
## 2 87,165,540,647,818,1123,1484,1682,2343,2545,2546,2808,3058,3133,3206,3317,3472,
## 3 44,735,746,779,813,1190,1195,1201,1217,1223,1235,1243,1285,1564,2423,2565,2617,3062,
## 4 0,99,452,739,764,814,829,836,842,851,1001,1016,1061,1160,1173,1184,1540,2381,2441,2450,3951,4103,4728,
## 5 0,132,159,1460,1467,1472,1484,1489,1497,1856,5244,
## 6 0,
## tStarts
## 1 14361,14454,14599,14968,15795,
## 2 14381,14454,14969,15075,15240,15543,15903,16104,16853,17054,17232,17492,17914,17988,18267,24736,29320,
## 3 14399,15089,15099,15131,15164,15540,15544,15549,15564,15569,15580,15587,15628,15906,16857,16998,17049,17492,
## 4 14406,20227,20579,20865,20889,20938,20952,20958,20963,20971,21120,21134,21178,21276,21288,21298,21653,22492,22551,22559,24059,24211,24835,
## 5 19688,19819,19845,21145,21151,21155,21166,21170,21177,21535,24923,
## 6 27542,
# Select a subset, fetch the data, and report the quantiles
subsetQuery <- dbSendQuery(ucscGenomeConn, "select * from affyU133Plus2 where misMatches between 1 and 3")
affySmall <- fetch(subsetQuery); dim(affySmall)
## [1] 500 22
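The follow-up calls were omitted; they presumably summarized the mismatch counts and released the result set, e.g.:
quantile(affySmall$misMatches)   # distribution of mismatch counts in the fetched sample
dbClearResult(subsetQuery)       # free the pending result set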
## 0% 25% 50% 75% 100%
## 1 1 2 2 3
## [1] TRUE
# Another query
# install.packages("magrittr")
bedFile <- "C:/Users/IvoD/Desktop/repUCSC.bed"
subsetQuery1 <- dbSendQuery(ucscGenomeConn,'select genoName,genoStart,genoEnd,repName,swScore, strand,
repClass, repFamily from rmsk')
subsetQuery1_df <- dbFetch(subsetQuery1 , n=100) %>%
dplyr::mutate(genoName =
stringr::str_replace(genoName,'chr','')) %>%
readr::write_tsv(bedFile, col_names=T)
message('saved: ', bedFile)
dbClearResult(subsetQuery1)
## [1] TRUE
# Another DB query: Select a specific DB subset
subsetQuery2 <- dbSendQuery(ucscGenomeConn,
"select * from affyU133Plus2 where misMatches between 1 and 4")
affyU133Plus2MisMatch <- fetch(subsetQuery2)
quantile(affyU133Plus2MisMatch$misMatches)
## 0% 25% 50% 75% 100%
## 1 1 2 3 4
## [1] TRUE
## [1] 100 22
## bin matches misMatches repMatches nCount
## Min. : 13.0 Min. : 397 Min. :1.00 Min. :0 Min. : 0.00
## 1st Qu.:834.0 1st Qu.: 960 1st Qu.:1.00 1st Qu.:0 1st Qu.: 0.00
## Median :844.0 Median :1732 Median :2.00 Median :0 Median : 4.00
## Mean :730.9 Mean :2249 Mean :2.12 Mean :0 Mean : 14.72
## 3rd Qu.:863.5 3rd Qu.:2832 3rd Qu.:3.00 3rd Qu.:0 3rd Qu.: 19.00
## Max. :878.0 Max. :7475 Max. :4.00 Max. :0 Max. :111.00
## qNumInsert qBaseInsert tNumInsert tBaseInsert
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 1.00 1st Qu.: 5
## Median : 1.00 Median : 1.00 Median : 3.00 Median : 4384
## Mean : 2.33 Mean : 9.39 Mean : 5.95 Mean : 20048
## 3rd Qu.: 3.00 3rd Qu.: 8.00 3rd Qu.: 8.25 3rd Qu.: 26387
## Max. :12.00 Max. :197.00 Max. :29.00 Max. :346335
## strand qName qSize qStart
## Length:100 Length:100 Min. : 410 Min. : 0.00
## Class :character Class :character 1st Qu.: 972 1st Qu.: 0.00
## Mode :character Mode :character Median :1797 Median : 0.00
## Mean :2294 Mean : 5.46
## 3rd Qu.:2840 3rd Qu.: 2.25
## Max. :7512 Max. :331.00
## qEnd tName tSize tStart
## Min. : 400 Length:100 Min. :249250621 Min. :32256024
## 1st Qu.: 972 Class :character 1st Qu.:249250621 1st Qu.:33327868
## Median :1797 Mode :character Median :249250621 Median :35460512
## Mean :2281 Mean :249250621 Mean :35218635
## 3rd Qu.:2834 3rd Qu.:249250621 3rd Qu.:36778842
## Max. :7480 Max. :249250621 Max. :38510902
## tEnd blockCount blockSizes qStarts
## Min. :32256813 Min. : 1.00 Length:100 Length:100
## 1st Qu.:33333986 1st Qu.: 3.00 Class :character Class :character
## Median :35497250 Median : 8.00 Mode :character Mode :character
## Mean :35240948 Mean : 8.91
## 3rd Qu.:36783220 3rd Qu.:12.25
## Max. :38512306 Max. :38.00
## tStarts
## Length:100
## Class :character
## Mode :character
##
##
##
# Once done, clear and close the connections
# dbClearResult(dbListResults(ucscGenomeConn)[[1]])
dbDisconnect(ucscGenomeConn)
## [1] TRUE
Depending on the DB server, completing the above SQL commands may require access permissions and/or specific user credentials. The example below can be run by all users, as it relies only on local DB services.
##
## Attaching package: 'RSQLite'
## The following object is masked from 'package:RMySQL':
##
## isIdCurrent
# generate an empty DB stored in RAM
myConnection <- dbConnect(RSQLite::SQLite(), ":memory:")
myConnection
## <SQLiteConnection>
## Path: :memory:
## Extensions: TRUE
## character(0)
# Add tables to the local SQL DB
data(USArrests); dbWriteTable(myConnection, "USArrests", USArrests)
dbWriteTable(myConnection, "brfss_2013", brfss_2013)
dbWriteTable(myConnection, "brfss_2015", brfss_2015)
# Check again the DB content
# allTables <- dbListTables(myConnection); length(allTables); allTables
head(dbListFields(myConnection, "brfss_2013"))
## [1] "x.state" "fmonth" "idate" "imonth" "iday" "iyear"
## [1] "rcsrace1" "rchisla1" "rcsbirth" "typeinds" "typework" "has_plan"
## [1] "USArrests" "brfss_2013" "brfss_2015"
# Retrieve the entire DB table (for the smaller USArrests table)
head(dbGetQuery(myConnection, "SELECT * FROM USArrests"))
## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 8.1 294 80 31.0
## 4 8.8 190 50 19.5
## 5 9.0 276 91 40.6
## 6 7.9 204 78 38.7
# Retrieve just the average of one feature
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests")
head(myQuery)
## avg(Assault)
## 1 170.76
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests GROUP BY UrbanPop"); myQuery
## avg(Assault)
## 1 48.00
## 2 81.00
## 3 152.00
## 4 211.50
## 5 271.00
## 6 190.00
## 7 83.00
## 8 109.00
## 9 109.00
## 10 120.00
## 11 57.00
## 12 56.00
## 13 236.00
## 14 188.00
## 15 186.00
## 16 102.00
## 17 156.00
## 18 113.00
## 19 122.25
## 20 229.50
## 21 151.00
## 22 231.50
## 23 172.00
## 24 145.00
## 25 255.00
## 26 120.00
## 27 110.00
## 28 204.00
## 29 237.50
## 30 252.00
## 31 147.50
## 32 149.00
## 33 254.00
## 34 174.00
## 35 159.00
## 36 276.00
# Or do it in batches (for the much larger brfss_2013 and brfss_2015 tables)
myQuery <- dbGetQuery(myConnection, "SELECT * FROM brfss_2013")
# extract data in chunks of 2 rows, note: dbGetQuery vs. dbSendQuery
# myQuery <- dbSendQuery(myConnection, "SELECT * FROM brfss_2013")
# fetch2 <- dbFetch(myQuery, n = 2); fetch2
# do we have other cases in the DB remaining?
# extract all remaining data
# fetchRemaining <- dbFetch(myQuery, n = -1);fetchRemaining
# We should have all data in DB now
# dbHasCompleted(myQuery)
# compute the average (poorhlth) grouping by Insurance (hlthpln1)
# Try some alternatives: numadult nummen numwomen genhlth physhlth menthlth poorhlth hlthpln1
myQuery1_13 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2013 GROUP BY hlthpln1"); myQuery1_13
## avg(poorhlth)
## 1 56.25466
## 2 53.99962
## 3 58.85072
## 4 66.26757
# Compare 2013 vs. 2015: Health grouping by Insurance
myQuery1_15 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2015 GROUP BY hlthpln1"); myQuery1_15
## avg(poorhlth)
## 1 55.75539
## 2 55.49487
## 3 61.35445
## 4 67.62125
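The chunk producing the next output is omitted; given the values, it is presumably the element-wise difference of the two grouped summaries:
# change in average poorhlth between 2013 and 2015, by insurance status (hlthpln1)
myQuery1_13 - myQuery1_15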
## avg(poorhlth)
## 1 0.4992652
## 2 -1.4952515
## 3 -2.5037326
## 4 -1.3536797
The SparQL Protocol and RDF Query Language (SparQL) is a semantic database query language for RDF (Resource Description Framework) data objects. SparQL queries consist of (1) triple patterns, (2) conjunctions, and (3) disjunctions.
The following example uses SparQL to query the prevalence of tuberculosis from the WikiData SparQL server and plot it on a world geographic map.
# install.packages("SPARQL"); install.packages("rworldmap"); install.packages("spam")
library(SPARQL)
library(ggplot2)
library(rworldmap)
library(plotly)
# SparQL Format
# https://www.w3.org/2009/Talks/0615-qbe/
# W3C Turtle - Terse RDF Triple Language:
# https://www.w3.org/TeamSubmission/turtle/#sec-examples
# RDF (Resource Description Framework) is a graphical data model of (subject, predicate, object) triples representing:
# "subject-node to predicate arc to object arc"
# Resources are represented with URIs, which can be abbreviated as prefixed names
# Objects are literals: strings, integers, booleans, etc.
# Syntax
# URIs: <http://example.com/resource> or prefix:name
# Literals:
# "plain string" "13.4""
# xsd:float, or
# "string with language" @en
# Triple: pref:subject other:predicate "object".
wdqs <- "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
query = "PREFIX wd: <http://www.wikidata.org/entity/>
# prefix declarations
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX qualifier: <http://www.wikidata.org/prop/qualifier/>
PREFIX statement: <http://www.wikidata.org/prop/statement/>
# result clause
SELECT DISTINCT ?countryLabel ?ISO3Code ?latlon ?prevalence ?doid ?year
# query pattern against RDF data
# Q36956 Hansen's disease, Leprosy https://www.wikidata.org/wiki/Q36956
# Q15750965 - Alzheimer's disease: https://www.wikidata.org/wiki/Q15750965
# Influenza - Q2840: https://www.wikidata.org/wiki/Q2840
# Q12204 - tuberculosis https://www.wikidata.org/wiki/Q12204
# P699 Alzheimer's Disease ontology ID
# P1193 prevalence: https://www.wikidata.org/wiki/Property:P1193
# P17 country: https://www.wikidata.org/wiki/Property:P17
# Country ISO-3 code: https://www.wikidata.org/wiki/Property:P298
# Location: https://www.wikidata.org/wiki/Property:P625
# Wikidata docs: https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual
WHERE {
wd:Q12204 wdt:P699 ?doid ; # tuberculosis P699 Disease ontology ID
p:P1193 ?prevalencewithProvenance .
?prevalencewithProvenance qualifier:P17 ?country ;
qualifier:P585 ?year ;
statement:P1193 ?prevalence .
?country wdt:P625 ?latlon ;
rdfs:label ?countryLabel ;
wdt:P298 ?ISO3Code ;
wdt:P297 ?ISOCode .
FILTER (lang(?countryLabel) = \"en\")
# FILTER constraints use boolean conditions to filter out unwanted query results.
# Shortcut: a semicolon (;) can be used to separate two triple patterns that share the same disease (?country is the shared subject above.)
# rdfs:label is a common predicate for giving a human-friendly label to a resource.
}
# query modifiers
ORDER BY DESC(?population)
"
# install.packages("WikidataQueryServiceR")
library(WikidataQueryServiceR)
library(mapproj)
results <- query_wikidata(sparql_query=query); head(results)
## # A tibble: 6 × 6
## countryLabel ISO3Code latlon prevalence doid year
## <chr> <chr> <chr> <dbl> <chr> <dttm>
## 1 Norway NOR Point(… 0.0001 DOID… 2014-01-01 00:00:00
## 2 United States of America USA Point(… 0.000029 DOID… 2014-01-01 00:00:00
## 3 France FRA Point(… 0.0000012 DOID… 2014-01-01 00:00:00
## 4 Iceland ISL Point(… 0.0000043 DOID… 2014-01-01 00:00:00
## 5 Suriname SUR Point(… 0.00044 DOID… 2014-01-01 00:00:00
## 6 Ecuador ECU Point(… 0.00078 DOID… 2014-01-01 00:00:00
# OLD: results <- SPARQL(url=wdqs, query=query); head(results)
# resultMatrix <- as.matrix(results$results)
# View(resultMatrix)
# sPDF <- joinCountryData2Map(results$results, joinCode = "ISO3", nameJoinColumn = "ISO3Code")
# join the data to the geo map
sPDF <- joinCountryData2Map(results, joinCode = "ISO3", nameJoinColumn = "ISO3Code")
## 6 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 237 codes from the map weren't represented in your data
#map the data with no legend
mapParams <- mapCountryData( sPDF
, nameColumnToPlot="prevalence"
# Alternatively , nameColumnToPlot="doid"
, addLegend='FALSE',
mapTitle="Prevalence of Tuberculosis Worldwide"
)
#add a modified legend using the same initial parameters as mapCountryData
do.call( addMapLegend, c( mapParams
, legendLabels="all"
, legendWidth=0.5
))
text(1, -120, "Partial view of Tuberculosis Prevalence in the World", cex=1)
#do.call( addMapLegendBoxes
# , c(mapParams
# , list(
# legendText=c('Chile', 'US','Brazil','Argentina'),
# x='bottom',title="AD Prevalence",horiz=TRUE)))
# Alternatively: mapCountryData(sPDF, nameColumnToPlot="prevalence", oceanCol="darkblue", missingCountryCol="white")
View(getMap())
# write.csv(file = "C:/Users/Map.csv", getMap())
# Alternative Plot_ly Geo-map
df_cities <- results
df_cities$popm <- paste(df_cities$countryLabel, df_cities$ISO3Code, "prevalence=", df_cities$prevalence)
df_cities$quart <- with(df_cities, cut(prevalence, quantile(prevalence), include.lowest = T))
levels(df_cities$quart) <- paste(c("1st", "2nd", "3rd", "4th"), "Quantile")
df_cities$quart <- as.ordered(df_cities$quart)
df_cities <- tidyr::separate(df_cities, latlon, into = c("long", "lat"), sep = " ")
df_cities$long <- gsub("Point\\(", "", df_cities$long)
df_cities$lat <- gsub("\\)", "", df_cities$lat)
head(df_cities)
## # A tibble: 6 × 9
## countryLabel ISO3Code long lat prevalence doid year popm
## <chr> <chr> <chr> <chr> <dbl> <chr> <dttm> <chr>
## 1 Norway NOR 11.0 65.0 0.0001 DOID… 2014-01-01 00:00:00 Norw…
## 2 United States… USA -98.… 39.8… 0.000029 DOID… 2014-01-01 00:00:00 Unit…
## 3 France FRA 2.0 47.0 0.0000012 DOID… 2014-01-01 00:00:00 Fran…
## 4 Iceland ISL -19.0 65.0 0.0000043 DOID… 2014-01-01 00:00:00 Icel…
## 5 Suriname SUR -56.0 4.0 0.00044 DOID… 2014-01-01 00:00:00 Suri…
## 6 Ecuador ECU -78.0 -1.0 0.00078 DOID… 2014-01-01 00:00:00 Ecua…
## # ℹ 1 more variable: quart <ord>
ge <- list(scope = 'world', showland = TRUE, landcolor = toRGB("lightgray"),
subunitwidth = 1, countrywidth = 1, subunitcolor = toRGB("white"), countrycolor = toRGB("white"))
plot_geo(df_cities, lon = ~long, lat = ~lat, text = ~popm, mode="markers",
marker = ~list(size = 20, line = list(width = 0.1)),
color = ~quart, locationmode = 'country names') %>%
layout(geo = ge, title = 'Prevalence of Tuberculosis Worldwide')
Similar geographic maps may be displayed for other processes, e.g., malaria. Note that such data are sparse, as they are pulled dynamically from wikidata.
Below is an example of a geo-map showing the global locations and population-size of various cities in millions.
library(plotly)
library(maps)   # provides the world.cities dataset
df_cities <- world.cities
df_cities$popm <- paste(df_cities$country.etc, df_cities$name, "Pop", round(df_cities$pop/1e6,2), " million")
df_cities$quart <- with(df_cities, cut(pop, quantile(pop), include.lowest = T))
levels(df_cities$quart) <- paste(c("1st", "2nd", "3rd", "4th"), "Quantile")
df_cities$quart <- as.ordered(df_cities$quart)
ge <- list(scope = 'world', showland = TRUE, landcolor = toRGB("lightgray"),
subunitwidth = 1, countrywidth = 1, subunitcolor = toRGB("white"), countrycolor = toRGB("white"))
plot_geo(df_cities, lon = ~long, lat = ~lat, text = ~popm, mode="markers",
marker = ~list(size = sqrt(pop/10000) + 1, line = list(width = 0.1)),
color = ~quart, locationmode = 'country names') %>%
layout(geo = ge, title = 'City Populations (Worldwide)')
We are already familiar with (pseudo) random number generation (e.g.,
rnorm(100, 10, 4)
or runif(100, 10,20)
), which
algorithmically generate random values subject to specified
distributions. There are also web-services, e.g., random.org, that can provide true
random numbers based on atmospheric noise, rather than using a
pseudo random number generation protocol. Below is one example
of generating a total of 300 numbers arranged in 3 columns, each of 100 rows of random integers (in decimal format) in the range \([100, 200]\).
siteURL <- "https://random.org/integers/" # base URL
shortQuery<-"num=300&min=100&max=200&col=3&base=10&format=plain&rnd=new"
completeQuery <- paste(siteURL, shortQuery, sep="?") # concat url and submit query string
rngNumbers <- read.table(file=completeQuery) # and read the data
head(rngNumbers); tail(rngNumbers)
## V1 V2 V3
## 1 193 112 161
## 2 108 170 138
## 3 112 191 192
## 4 186 160 189
## 5 121 130 181
## 6 115 163 185
## V1 V2 V3
## 95 164 129 124
## 96 190 162 132
## 97 181 193 150
## 98 187 197 163
## 99 168 182 143
## 100 124 114 119
The package RCurl
provides an amazing tool for
extracting and scraping information from websites. Let’s use it to
demonstrate extracting information from a SOCR website.
# install.packages("RCurl")
library(RCurl)
web <- getURL("https://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)
str(web, nchar.max = 200)
## chr "<!DOCTYPE html>\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n<head>\n<meta charset=\"UTF-8\"/>\n<title>SOCR Data - SOCR</title>\n<script>document.documentElement.className ="| __truncated__
The web
object looks incomprehensible. This is because
most websites are wrapped in XML/HTML hypertext or include JSON
formatted meta-data. RCurl
deals with special HTML tags and
website meta-data.
To work with the web page content only, the httr package may be a better choice than RCurl, as it returns a structured list that is much easier to interpret.
# install.packages("httr")
library(httr)
web<-GET("https://wiki.socr.umich.edu/index.php/SOCR_Data")
str(web[1:3])
## List of 3
## $ url : chr "https://wiki.socr.umich.edu/index.php/SOCR_Data"
## $ status_code: int 200
## $ headers :List of 15
## ..$ date : chr "Sun, 21 Apr 2024 01:26:39 GMT"
## ..$ server : chr "Apache"
## ..$ x-powered-by : chr "PHP/7.3.11"
## ..$ x-content-type-options: chr "nosniff"
## ..$ content-language : chr "en"
## ..$ x-ua-compatible : chr "IE=Edge"
## ..$ link : chr "</local/images/SOCR_3D_Logo_UM.png?55fd6>;rel=preload;as=image"
## ..$ vary : chr "Accept-Encoding,Cookie"
## ..$ expires : chr "Thu, 01 Jan 1970 00:00:00 GMT"
## ..$ cache-control : chr "private, must-revalidate, max-age=0"
## ..$ last-modified : chr "Sat, 21 Oct 2023 19:02:36 GMT"
## ..$ content-encoding : chr "gzip"
## ..$ transfer-encoding : chr "chunked"
## ..$ content-type : chr "text/html; charset=UTF-8"
## ..$ set-cookie : chr "LBSESSIONID=1392759693.47873.0000; path=/; Httponly; Secure; SameSite=none"
## ..- attr(*, "class")= chr [1:2] "insensitive" "list"
XML package
A combination of the RCurl and XML packages can help us extract only the plain text from the desired web pages. This is very helpful for getting information from text-heavy websites.
web<-getURL("https://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)
#install.packages("XML")
library(XML)
web.parsed<-htmlParse(web, asText = T, encoding="UTF-8")
plain.text<-xpathSApply(web.parsed, "//p", xmlValue)
substr(paste(plain.text, collapse = "\n"), start=1, stop=256)
## [1] "The links below contain a number of datasets that may be used for demonstration purposes in probability and statistics education. There are two types of data - simulated (computer-generated using random sampling) and observed (research, observationally or "
Here we extracted all plain text between the starting and ending
paragraph HTML tags, <p>
and
</p>
.
More information about extracting text from XML/HTML to text via XPath is available here.
The process of extracting data from complete web pages and storing it
in structured data format is called scraping
. However,
before starting a data scrape from a website, we need to understand the
underlying HTML structure for that specific website. Also, we have to
check the terms of that website to make sure that scraping from this
site is allowed.
The R package rvest
is a very good place to start
“harvesting” data from websites.
To start with, we use read_html()
to ingest the SOCR
website hypertext metadata into a xmlnode
object.
library(rvest)
# SOCR<-read_html("http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data")
SOCR<-read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data")
SOCR
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
From the summary structure of SOCR
, we can discover that
there are two important hypertext section markups
<head>
and <body>
. Also, notice
that the SOCR data website uses <title>
and
</title>
tags to extract the title in the
<head>
section. Let’s use html_node()
to
extract title information based on this knowledge.
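The extraction chunk itself is not shown; with rvest it presumably resembled the following (the exact CSS selector is an assumption):
# grab the <title> node from the <head> section and strip the markup
SOCR %>% html_node("head title") %>% html_text()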
## [1] "SOCR Data - SOCR"
Here we used the %>% operator (pipe) to connect two
functions, see magrittr
package. The above line of code creates a chain of functions to
operate on the SOCR
object. The first function in the chain
html_node()
extracts the title
from the
head
section. Then, html_text()
translates
HTML formatted hypertext into English. More
on R
piping can be found in the magrittr
package.
Another function, rvest::html_nodes()
can be very
helpful in scraping. Similar to html_node()
,
html_nodes()
can help us extract multiple nodes in an
xmlnode
object. Assume that we want to obtain the meta
elements (usually page description, keywords, author of the document,
last modified, and other metadata) from the SOCR data website. We apply
html_nodes()
to the SOCR
object for lines
starting with <meta
in the <head>
section. It is optional to use html_attrs()
(extracts
attributes, text and tag name from html) to make texts prettier.
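Again, the corresponding chunk is not displayed; a minimal sketch using the standard rvest calls:
# select all <meta ...> nodes in <head> and list their attributes
SOCR %>% html_nodes("head meta") %>% html_attrs()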
## [[1]]
## http-equiv content
## "Content-Type" "text/html; charset=UTF-8"
##
## [[2]]
## charset
## "UTF-8"
##
## [[3]]
## name content
## "ResourceLoaderDynamicStyles" ""
##
## [[4]]
## name content
## "generator" "MediaWiki 1.31.6"
Application Programming Interfaces (APIs) allow web-accessible functions to communicate with each other. Today most APIs exchange data in JSON (JavaScript Object Notation) format.
JSON represents a plain text format used for web applications, data
structures or objects. Online JSON objects could be retrieved by
packages like RCurl
and httr
. Let’s see a JSON
formatted dataset first. We can use 02_Nof1_Data.json
in the class file as an example.
library(httr)
nof1 <- GET("https://umich.instructure.com/files/1760327/download?download_frd=1")
nof1
## Response [https://cdn.inst-fs-iad-prod.inscloudgate.net/0accb6e0-67aa-4ff4-a5a2-24b3c7606698/02_Nof1_Data.json?token=eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCIsImtpZCI6ImNkbiJ9.eyJyZXNvdXJjZSI6Ii8wYWNjYjZlMC02N2FhLTRmZjQtYTVhMi0yNGIzYzc2MDY2OTgvMDJfTm9mMV9EYXRhLmpzb24iLCJ0ZW5hbnQiOiJjYW52YXMiLCJ1c2VyX2lkIjpudWxsLCJpYXQiOjE3MTM2MjU2MDgsImV4cCI6MTcxMzcxMjAwOH0.lzXBORw6dFqeyJF7hihSU6llHNzioAF2eKFsLdnpILe2HsxBaCHKWlmHG7XPl1U-YaJqqactlcT8onvUYE1hxA&download=1&content_type=application%2Fjson]
## Date: 2024-04-21 01:16
## Status: 200
## Content-Type: application/json
## Size: 109 kB
## [{"ID":1,"Day":1,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":0.97,"SocSuppt":5.0...
## {"ID":1,"Day":2,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":-0.17,"SocSuppt":3.8...
## {"ID":1,"Day":3,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS":0.81,"SocSuppt":4.84...
## {"ID":1,"Day":4,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS":-0.41,"SocSuppt":3.6...
## {"ID":1,"Day":5,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":0.59,"SocSuppt":4.62...
## {"ID":1,"Day":6,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":-1.16,"SocSuppt":2.8...
## {"ID":1,"Day":7,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS":0.30,"SocSuppt":4.33...
## {"ID":1,"Day":8,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS":-0.34,"SocSuppt":3.6...
## {"ID":1,"Day":9,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":-0.74,"SocSuppt":3.2...
## {"ID":1,"Day":10,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":-0.38,"SocSuppt":3....
## ...
We can see that JSON objects are very simple. The data structure is
organized using hierarchies marked by square brackets. Each piece of
information is formatted as a {key:value}
pair.
The package jsonlite
is a very useful tool to import
online JSON formatted datasets directly into data frames. Its syntax is very straightforward.
# install.packages("jsonlite")
library(jsonlite)
nof1_lite <- fromJSON("https://umich.instructure.com/files/1760327/download?download_frd=1")
class(nof1_lite)
## [1] "data.frame"
We can convert an xlsx dataset into CSV and use read.csv() to load this kind of dataset. However, R
to load this kind of dataset. However, R
provides an alternative read.xlsx()
function in package
xlsx
to simplify this process. Take our
02_Nof1_Data.xls
data in the class file as an example. We
need to download the file first.
# install.packages("xlsx")
library(xlsx)
nof1 <- read.xlsx("C:/Users/IvoD/Desktop/02_Nof1.xlsx", 1)
str(nof1)
## 'data.frame': 900 obs. of 10 variables:
## $ id : num 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num 1 2 3 4 5 6 7 8 9 10 ...
## $ tx : num 1 1 0 0 1 1 0 0 1 1 ...
## $ selfeff : num 33 33 33 33 33 33 33 33 33 33 ...
## $ selfeff25: num 8 8 8 8 8 8 8 8 8 8 ...
## $ wpss : num 0.97 -0.17 0.81 -0.41 0.59 -1.16 0.3 -0.34 -0.74 -0.38 ...
## $ scssuppt : num 5 3.87 4.84 3.62 4.62 2.87 4.33 3.69 3.29 3.66 ...
## $ pmss : num 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 ...
## $ pmss3 : num 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 ...
## $ qhyact : num 53 73 23 36 21 0 21 0 73 114 ...
The last argument, 1, stands for the first Excel sheet, as any Excel workbook may include multiple sheets. Also, we
can download the xls
or xlsx
file into our R
working directory so that it is easier to find file paths.
Sometimes more complex protocols may be necessary to ingest data from
XLSX documents. For instance, if the XLSX doc is large, includes many
tables and is only accessible via HTTP protocol from a web-server. Below
is an example downloading the second table,
ABIDE_Aggregated_Data
, from the multi-table
Autism/ABIDE XLSX dataset:
# install.packages("openxlsx"); library(openxlsx)
tmp = tempfile(fileext = ".xlsx")
download.file(url = "https://umich.instructure.com/files/3225493/download?download_frd=1", destfile = tmp, mode="wb")
df_Autism <- openxlsx::read.xlsx(xlsxFile = tmp, sheet = "ABIDE_Aggregated_Data", skipEmptyRows = TRUE)
dim(df_Autism)
## [1] 1098 2145
Many powerful Machine Learning methods are applicable in a broad range of disciplines. However, each domain has specific requirements and specialized data formats that often demand customized interfaces to employ existing tools to research-specific data-driven challenges. The important field of biomedical informatics and data science (BIDS) represents one such example.
Genetic data are stored in widely varying formats and usually have more feature variables than observations. They could have 1,000 columns and only 200 rows. One of the commonly used pre-processing steps for such datasets is variable selection. We will talk about this in Chapter 11.
The Bioconductor project created powerful R functionality (packages and tools) for analyzing genomic data, see Bioconductor for more detailed information.
Social network data and graph datasets describe the relations between nodes (vertices) using connections (links or edges) joining the node objects. Assuming we have N objects, we can have up to \(N\times (N-1)\) directed links establishing paired associations between the nodes. Let's use an example with N=4 to demonstrate a simple graph potentially modeling the following linkage table.
objects | 1 | 2 | 3 | 4 |
---|---|---|---|---|
1 | ….. | \(1\rightarrow 2\) | \(1\rightarrow 3\) | \(1\rightarrow 4\) |
2 | \(2\rightarrow 1\) | ….. | \(2\rightarrow 3\) | \(2\rightarrow 4\) |
3 | \(3\rightarrow 1\) | \(3\rightarrow 2\) | ….. | \(3\rightarrow 4\) |
4 | \(4\rightarrow 1\) | \(4\rightarrow 2\) | \(4\rightarrow 3\) | ….. |
If we change the \(a\rightarrow b\) to an indicator variable (0 or 1) capturing whether we have an edge connecting a pair of nodes, then we get the graph adjacency matrix.
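For instance, a fully connected version of the 4-node example above corresponds to the following adjacency matrix (a small illustrative sketch, not from the original text):
# adjacency matrix for the 4-node example; entry [i, j] = 1 indicates an edge from node i to node j
A <- matrix(c(0, 1, 1, 1,
              1, 0, 1, 1,
              1, 1, 0, 1,
              1, 1, 1, 0), nrow = 4, byrow = TRUE)
A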
Edge lists provide an alternative way to represent network connections. Every line in the list contains a connection between two nodes (objects).
Vertex | Vertex |
---|---|
1 | 2 |
1 | 3 |
2 | 3 |
The above edge list is listing three network connections: object 1 is linked to object 2; object 1 is linked to object 3; and object 2 is linked to object 3. Note that edge lists can represent both directed as well as undirected networks or graphs.
We can imagine that if N is very large, e.g., social
networks, the data representation and analysis may be resource intensive
(memory or computation). In R, we have multiple packages that can deal
with social network data. One user-friendly example is provided using
the igraph
package. First, let’s build a toy example and
visualize it using this package.
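The chunk that builds and plots this toy graph is not displayed; a minimal sketch, assuming the igraph package, might look like this:
library(igraph)
# 4 directed edges among 10 nodes; nodes 5-10 remain isolated
toy_graph <- make_graph(edges = c(1,2, 1,3, 2,3, 3,4), n = 10, directed = TRUE)
plot(toy_graph)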
Here c(1, 2, 1, 3, 2, 3, 3, 4)
is an edge list with 4
pairs of node connections and n=10
means we have 10 nodes
(objects) in total. The small arrows in the graph show us directed
network connections. We might notice that 5-10 nodes are scattered out
in the graph. This is because they are not included in the edge list, so
there are no network connections between them and the rest of the
network.
Now let’s examine the co-appearance
network of Facebook circles. The data contains anonymized
circles
(friends lists) from Facebook collected from survey
participants using a
Facebook app. The dataset only includes the 88,234 edges (circle connections) linking pairs of the 4,039 nodes (users) in the ego networks.
The values on the connections represent the number of links/edges
within a circle. We have a huge edge-list made of scrambled Facebook
user IDs. Let’s load this dataset into R first. The data is stored in a
text file. Unlike CSV files, text files in table format need to be
imported using read.table()
. We are using the
header=F
option to let R know that we don’t have a header
in the text file that contains only tab-separated node pairs (indicating
the social connections, edges, between Facebook users).
soc.net.data<-read.table("https://umich.instructure.com/files/2854431/download?download_frd=1", sep=" ", header=F)
head(soc.net.data)
## V1 V2
## 1 0 1
## 2 0 2
## 3 0 3
## 4 0 4
## 5 0 5
## 6 0 6
Now the data is stored in a data frame. To make this dataset ready
for igraph
processing and visualization, we need to convert
soc.net.data
into a matrix object.
By using ncol=2
, we made a matrix with two columns. The
data is now ready and we can apply graph.edgelist()
.
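The conversion chunk is omitted above; based on the description, it likely resembled:
library(igraph)
# coerce the two-column edge data frame into a matrix (the ncol = 2 argument mirrors the description)
soc.net.data.mat <- as.matrix(soc.net.data, ncol = 2)
head(soc.net.data.mat)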
# remove the first 347 edges (to wipe out the degenerate "0" node)
graph_m <- graph.edgelist(soc.net.data.mat[-c(0:347), ], directed = F)
## Warning: `graph.edgelist()` was deprecated in igraph 2.0.0.
## ℹ Please use `graph_from_edgelist()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Before we display the social network graph we may want to examine our model first.
## IGRAPH 3ab72f3 U--- 4038 87887 --
This is an extremely brief yet informative summary. The first line
U--- 4038 87887
includes potentially four letters and two
numbers. The first letter could be U
or D
indicating undirected or directed edges. The second
letter N
would mean that the object set has a “name”
attribute. And the third letter is for the weighted (W
)
graph. Since we didn’t add weight in our analysis the third letter is
empty (“-
”). A fourth letter (B) indicates a bipartite graph, whose vertices can be divided into two disjoint (independent) sets such that every edge connects a vertex from one set to a vertex in the other set.
The two numbers following the 4 letters represent the
number of nodes
and the number of edges
,
respectively. Now let’s render the graph.
# Choose an algorithm to find network communities.
# FastGreedy algorithm is great for large undirected networks
comm_graph_m <- fastgreedy.community(graph_m)
## Warning: `fastgreedy.community()` was deprecated in igraph 2.0.0.
## ℹ Please use `cluster_fast_greedy()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# sizes(comm_graph_m); membership(comm_graph_m)
# Collapse the graph by communities
reduced_comm_graph_m <- simplify(contract(graph_m, membership(comm_graph_m)))
# Plot simplified graph
# plot(reduced_comm_graph_m, vertex.color = adjustcolor("SkyBlue2", alpha.f = .5), vertex.label.color = adjustcolor("black", 0.9), margin=-0.2)
# plot(graph_m, vertex.color = adjustcolor("SkyBlue2", alpha.f = .5), vertex.label.color = adjustcolor("black", 0.9), margin=-0.2)
# plot(graph_m, margin=-0.2, vertex.shape="none", vertex.size=0.01)
plot(graph_m, vertex.size=3, vertex.color=adjustcolor("SkyBlue2", alpha.f = 0.7), vertex.label=NA, margin=-0.2, layout=layout.reingold.tilford)
# simplify graph
# simple_graph_m <- simplify(graph_m, remove.loops=T, remove.multiple=T)
# simple_graph_m <- delete.vertices(simple_graph_m, which(degree(simple_graph_m)<100))
# plot(simple_graph_m, vertex.size=3, vertex.color=adjustcolor("SkyBlue2", alpha.f = .7), vertex.label=NA, margin=-0.6, layout=layout.reingold.tilford(simple_graph_m, circular=T))
# Louvain Clustering of the graph nodes
community <- cluster_louvain(graph_m)
plot(community, graph_m, vertex.label.cex = .5, main = "Community Structure")
We can also use D3
to display a dynamic graph.
# install.packages('networkD3')
library(networkD3)
df <- as_data_frame(graph_m, what = "edges")
# Javascript note indexing starts at zero, not 1, make an artificial index zero root
df1 <- rbind(c(0,1), df)
# Use D3 to display graph
simpleNetwork(df1[1:1000,],fontSize = 12, zoom = T)
This graph is very complicated, yet we can still see that some nodes
(influencers) are surrounded by more nodes than other network members.
To obtain such information we can use the degree()
function
which lists the number of edges for each node.
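The call producing the output below was omitted; it presumably inspected a slice of the degree sequence around the hub node (the index window is an assumption):
# number of edges (degree) for nodes 100 through 110
degree(graph_m)[100:110]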
## [1] 8 18 5 15 31 13 7 1044 12 36 4
Skimming the table we can find that the 107-th
user has
as many as \(1,044\) connections, which
makes the user a highly-connected hub that likely has higher social relevance.
Similarly, some nodes and edges might be more important than others because they serve as bridges between different clusters of nodes. To compare node importance, we can use the betweenness centrality measure. High betweenness centrality for a specific node indicates
influence. betweenness()
can help us to calculate the
betweenness centrality of a given fixed network node \(v_o\)
\[g(v_0)=\sum_{s\neq v_o\neq t}{\frac{\sigma_{st}(v_o)}{\sigma_{st}}},\]
where \({\sigma_{st}}\) is the total number of shortest paths from node \(s\) to node \(t\) and \(\sigma_{st}(v_o)\) is the number of those paths that pass through the fixed network node \(v_o\), which is not a leaf node on the graph.
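The corresponding chunk is not shown; a minimal sketch of how the vertex betweenness could be computed and the most central node located:
# compute vertex betweenness for the Facebook graph and identify the most central node
btwn <- betweenness(graph_m)
which.max(btwn); max(btwn)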
Again, the 107-th
node has the highest betweenness
centrality (\(3.556221e+06\)).
We can try another example using SOCR hierarchical
data, which is also available for dynamic exploration as a tree
graph. Let’s read its JSON data source using the
jsonlite
package.
tree.json <- fromJSON("http://socr.ucla.edu/SOCR_HyperTree.json", simplifyDataFrame = FALSE)
# tree.json<-fromJSON("https://socr.umich.edu/html/navigators/D3/xml/SOCR_HyperTree.json", simplifyDataFrame = FALSE)
# tree.json<-fromJSON("https://raw.githubusercontent.com/SOCR/Navigator/master/data/SOCR_HyperTree.json", simplifyDataFrame = FALSE)
This generates a list
object representing the
hierarchical structure of the network. Note that this is quite different
from edge list. There is one root node, its sub nodes are called
children nodes, and the terminal nodes are called leaf nodes. Instead of presenting the relationships between nodes in pairs, this hierarchical structure captures the level of each node. To draw the social network graph, we need to convert it into a
Node
object. We can utilize the as.Node()
function in the data.tree
package to do so.
# install.packages("data.tree")
library(data.tree)
tree.graph<-as.Node(tree.json, mode = "explicit")
Here we use the mode="explicit"
option to allow
“children” nodes to have their own “children” nodes. Now, the
tree.json
object has been separated into four different
node structures -
"About SOCR", "SOCR Resources", "Get Started",
and
"SOCR Wiki"
. Let’s plot the first one using the
igraph
package.
In the example below, we demonstrate a slightly more complicated scenario where the graph's source data (in this case a JSON file) includes nodes with the same name. In principle, this causes a problem with the graph traversal that may lead to infinite loops. Thus, we will search for nodes with duplicated names and modify their names to make the algorithm more robust.
AreNamesUnique <- function(node) {
mynames <- node$Get("name")
all(duplicated(mynames) == FALSE)
}
# AreNamesUnique(tree.graph$`About SOCR`)
# Find Duplicate Node names: duplicated(tree.graph$`About SOCR`$Get("name"))
# duplicated(tree.graph$Get("name"))
# One branch of the SOCR Tree: About SOCR
# getUniqueNodes(tree.graph$`About SOCR`)
# AreNamesUnique(tree.graph$`About SOCR`)
## extract Graph Nodes with Unique Names (remove duplicate nodes)
getUniqueNodes <- function(node) {
AreNamesUnique(node)
mynames <- node$Get("name")
(names_unique <- ifelse (duplicated(mynames),
sprintf("%s_%d", mynames, sample(1:1000,1)), mynames))
node$Set(name = names_unique)
AreNamesUnique(node)
return(node)
}
# Do this duplicate node renaming until there are no duplicated names
while (length(tree.graph$Get("name")) != length(unique(tree.graph$Get("name")))) {
getUniqueNodes(tree.graph)
AreNamesUnique(tree.graph)
}
length(tree.graph$Get("name"))
## [1] 587
## [1] 587
## D3 plot
df <- as_data_frame(as.igraph(tree.graph$`About SOCR`), what = "edges")
# Javascript note indexing starts at zero, not 1, make an artificial index zero root
df1 <- rbind(c("SOCR", "About SOCR"), df)
# Use D3 to display graph
simpleNetwork(df1, fontSize = 12, zoom = T)
# Louvain Clustering of the graph nodes
graph_m <- as.igraph(tree.graph$`About SOCR`)
community <- cluster_louvain(graph_m)
plot(community, graph_m, vertex.label.cex = .5, main = "SOCR Resource HTML Structure")
In this graph, the node "About SOCR"
, located at the
center of the graph, represents the root of the tree network. Of course,
we can repeat this process starting with the root of the complete
hierarchical structure, SOCR
.
The proliferation of Cloud services and the emergence of modern technology in all aspects of human experience lead to a tsunami of data, much of which is streamed in real time. The interrogation of such voluminous data is an increasingly important area of research. Data streams are ordered, often unbounded, sequences of data points created continuously by a data generator. All of the data mining, interrogation and forecasting methods we discussed for traditional datasets are also applicable to data streams.
Mathematically, a data stream is an ordered sequence of data points: \[Y = \{y_1, y_2, y_3, \cdots, y_t, \cdots \},\] where the (time) index, \(t\), reflects the order of the observation/record, which may be a single number, a vector in multidimensional space, or a structured object, e.g., the structured Ann Arbor Weather (JSON) record and its corresponding structured form. Some data is streamed because it is too large to be downloaded all at once and some because it is continually generated and serviced. This presents the potential problem of dealing with data streams that may be unlimited.
Notes:
- A simulated data stream can be generated in R, e.g., using the package rstream. Real stream data may be piped from financial data providers, the WHO, World Bank, NCAR and other sources.
- Several R packages support streaming analytics, e.g., factas for PCA, and rEMM and birch for clustering. Clustering and classification methods capable of processing data streams have been developed, e.g., Very Fast Decision Trees (VFDT), time window-based Online Information Network (OLIN), On-demand Classification, and the APRIORI streaming algorithm.
- Cloud-based frameworks allow moving R tools to the Cloud.
The stream package
The R stream package provides data stream mining algorithms using the fpc, clue, cluster, clusterGeneration, MASS, and proxy packages. In addition, the package streamMOA provides an rJava interface to the Java-based data stream clustering algorithms available in the Massive Online Analysis (MOA) framework for stream classification, regression and clustering.
If you need a deeper exposure to data streaming in R, we recommend you go over the stream vignettes.
This example shows the creation and loading of a mixture of 5 random 2D Gaussians, centered at (x_coords, y_coords) with paired correlations rho_corr, representing a simulated data stream.
# install.packages("stream")
library("stream")
x_coords <- c(0.2,0.3, 0.5, 0.8, 0.9)
y_coords <- c(0.8,0.3, 0.7, 0.1, 0.5)
p_weight <- c(0.1, 0.9, 0.5, 0.4, 0.3) # a vector of weights determining the likelihood of generating a data point from a particular cluster
set.seed(12345)
stream_5G <- DSD_Gaussians(k = 5, d = 2, mu=cbind(x_coords, y_coords), p=p_weight)
We will now try k-means (Chapter 8) and a density-based data stream clustering algorithm, D-Stream, where micro-clusters are formed by grid cells of size gridsize with the density of a grid cell (Cm) being at least 1.2 times the average cell density. The model is updated with the next \(500\) data points from the stream.
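The D-Stream clusterer referenced below (dstream) is created in a chunk that is not displayed; a minimal sketch, assuming the stream package API and the parameters described above (gridsize=0.1, Cm=1.2, a 500-point update):
# define the grid-based D-Stream micro-clusterer and feed it 500 points from the stream
dstream <- DSC_DStream(gridsize = 0.1, Cm = 1.2)
update(dstream, stream_5G, n = 500)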
First, let’s run the k-means clustering with \(k=5\) clusters and plot the resulting micro and macro clusters.
kmc <- DSC_Kmeans(k = 5)
recluster(kmc, dstream)
plot(kmc, stream_5G, type = "both", xlab="X-axis", ylab="Y-axis")
In this clustering plot, micro-clusters are shown as circles and macro-clusters are shown as crosses and their sizes represent the corresponding cluster weight estimates.
Prior to updating the model with the next 1,000 data points from the stream, we specify the grid-cell micro-clusters via the grid cell size (gridsize=0.1) and the density threshold (Cm=1.2), which defines the minimum density of a grid cell as a multiple of the average cell density.
We can re-cluster the data using k-means with 5 clusters and plot the resulting micro and macro clusters.
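A sketch of the omitted update-and-recluster step (assuming the dstream and kmc objects from above):
update(dstream, stream_5G, n = 1000)  # absorb the next 1,000 stream points into the micro-clusters
recluster(kmc, dstream)               # re-derive the k = 5 macro-clusters from the updated micro-clusters
plot(kmc, stream_5G, type = "both", xlab="X-axis", ylab="Y-axis")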
Note the subtle changes in the clustering results after updating with the new batch of data.
Data streams can also be constructed from existing datasets, e.g., benchmark data from the mlbench package, or from disk-backed objects such as ff::ffdf for data too large to fit in memory.
For DSD
objects, some basic stream functions include
print()
, plot()
and
write_stream()
. These can save part of a data stream to
disk. DSD_Memory
and DSD_ReadCSV
objects also
include member functions like reset_stream()
to reset the
position in the stream to its beginning.
To request a new batch of data points from the stream we use
get_points()
. This chooses a random cluster (based
on the probability weights in p_weight
) and a point is
drawn from the multivariate Gaussian distribution (\(mean=mu, covariance\ matrix=\Sigma\)) of
that cluster. Below, we pull \(n = 10\)
new data points from the stream.
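The omitted call was presumably of this form:
# draw 10 new points from the Gaussian mixture stream (.class records the generating cluster)
get_points(stream_5G, n = 10)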
## X1 X2 .class
## 1 0.8733407 0.46265427 5
## 2 0.2381227 0.23825970 2
## 3 0.7980366 0.09936920 4
## 4 0.3088413 0.33481091 2
## 5 0.8717139 0.47762180 5
## 6 0.7269188 0.02644974 4
## 7 0.2558559 0.24844723 2
## 8 0.2533175 0.22967079 2
## 9 0.2882337 0.27945806 2
## 10 0.2324733 0.22381704 2
## Warning in .nodots(...): Unknown arguments: class = TRUE
## X1 X2 .class
## 1 0.7795674 0.07913567 4
## 2 0.2249437 0.25580476 2
## 3 0.2460355 0.24467360 2
## 4 0.3111685 0.34150150 2
## 5 0.3967347 0.39527809 2
## 6 0.4966072 0.70748454 3
## 7 0.3302468 0.34697587 2
## 8 0.8073032 0.10314599 4
## 9 0.9400014 0.49830257 5
## 10 0.4783992 0.77695101 3
## 11 0.1469569 0.74732867 1
## 12 0.4106327 0.78666642 3
## 13 0.3618811 0.30656516 2
## 14 0.4933261 0.75555620 3
## 15 0.2687940 0.28681053 2
## 16 0.3187406 0.32997154 2
## 17 0.3354780 0.36332733 2
## 18 0.1461205 0.68213193 1
## 19 0.5221782 0.74109176 3
## 20 0.4959186 0.66092866 3
Note that if you add noise to your stream, e.g.,
stream_Noise <- DSD_Gaussians(k = 5, d = 4, noise = .1, p = c(0.1, 0.5, 0.3, 0.9, 0.1))
,
then the noise points won’t be part of any clusters and may have an
NA
class label.
Clusters can be animated over time by animate_data()
.
Use reset_stream()
to start the animation at the beginning
of the stream and note that this method is not
implemented for streams of class DSD_Gaussians
,
DSD_R
, DSD_data.frame
, and DSD
.
We’ll create a new DSD_Benchmark
data stream.
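The creation chunk is not displayed; presumably it was along these lines:
# benchmark generator 1: two clusters moving diagonally toward each other, with 5% noise
stream_Bench <- DSD_Benchmark(1)
stream_Bench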
## Benchmark 1: Two clusters moving diagonally from left to right, meeting
## in the center (d = 2, k = 2, 5% noise).Class: DSD_MG, DSD_R, DSD
## With 3 clusters in 2 dimensions. Time is 1
library("animation")
reset_stream(stream_Bench)
animate_data(stream_Bench, n=10000, horizon=100, xlim = c(0, 1), ylim = c(0, 1))
# Generate a random LIST of images
# img.list <- as.list(NULL)
# for (i in 1:100) img.list[[i]] <- imager::imnoise(x = 200, y = 200, z = 1)
# image(img.list[[1]][,,1,1])
This benchmark generator creates two 2D clusters moving in the plane. One moves from top-left to bottom-right, the other from bottom-left to top-right. When the pair of clusters meet at the center of the domain, they briefly overlap and then split again.
Concept drift in the stream can be depicted by requesting \(10\) consecutive batches of \(300\) data points from the stream and plotting each batch. Fast-forwarding the stream can be accomplished by requesting, but ignoring, \(2,000\) points in between each of the \(10\) plots.
for(i in 1:10) {
plot(stream_Bench, 300, xlim = c(0, 1), ylim = c(0, 1))
tmp <- get_points(stream_Bench, n = 2000)
}
reset_stream(stream_Bench)
# Uncomment this to see the animation
# animate_data(stream_Bench, n=8000, horizon = 120, xlim=c(0, 1), ylim=c(0, 1))
# Animations can be saved as HTML or GIF
#saveHTML(ani.replay(), htmlfile = "stream_Bench_Animation.html")
#saveGIF(ani.replay())
Streams can also be saved locally by
write_stream(stream_Bench, "dataStreamSaved.csv", n = 100, sep=",")
and loaded back in R
by DSD_ReadCSV()
.
These data represent the \(X\) and \(Y\) spatial knee-pain locations for over \(8,000\) patients, along with labels of the knee view: Front (\(F\)), Back (\(B\)), Left (\(L\)) and Right (\(R\)). Let's try to read the SOCR Knee Pain Dataset as a stream.
library("XML"); library("xml2"); library("rvest")
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_KneePainData_041409")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body" role="main">\n\t\t\t<a id="top"></a>\n\ ...
kneeRawData <- html_table(html_nodes(wiki_url, "table")[[2]])
normalize<-function(x){
return((x-min(x))/(max(x)-min(x)))
}
kneeRawData_df <- as.data.frame(cbind(normalize(kneeRawData$x), normalize(kneeRawData$Y), as.factor(kneeRawData$View)))
colnames(kneeRawData_df) <- c("X", "Y", "Label")
# randomize the rows of the DF as RF, RB, LF and LB labels of classes are sequential
set.seed(1234)
kneeRawData_df <- kneeRawData_df[sample(nrow(kneeRawData_df)), ]
summary(kneeRawData_df)
## X Y Label
## Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:0.1331 1st Qu.:0.4566 1st Qu.:2.000
## Median :0.2995 Median :0.5087 Median :3.000
## Mean :0.3382 Mean :0.5091 Mean :2.801
## 3rd Qu.:0.3645 3rd Qu.:0.5549 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :4.000
We can use the stream::DSD_Memory class to get a stream interface for matrix or data frame objects, like the knee-pain location dataset. The number of true clusters in this dataset is \(k=4\).
# use data.frame to create a stream (3rd column contains the label assignment)
kneeDF <- data.frame(x=kneeRawData_df[,1], y=kneeRawData_df[,2],
.class=as.factor(kneeRawData_df[,3]))
head(kneeDF)
## x y .class
## 1 0.69096672 0.4479769 1
## 2 0.88431062 0.7716763 3
## 3 0.67828843 0.4393064 1
## 4 0.88589540 0.4421965 3
## 5 0.68621236 0.4739884 1
## 6 0.09350238 0.5664740 4
# streamKnee <- DSD_Memory(kneeDF[,c("x", "y")], class=kneeDF[,"class"], loop=T)
streamKnee <- DSD_Memory(kneeDF[,c("x", "y", ".class")], loop=T)
streamKnee
## Memorized Stream
## Class: DSD_Memory, DSD_R, DSD
## Contains 8666 data points - currently at position 1 - loop is TRUE
# define the stream classifier.
cl <- DSClassifier_SlidingWindow(class ~ x + y, window=50, rebuild=10)
cl
## Data Stream Classifier on a Sliding Window
## Function: rpart::rpart
## Class: DSClassifier_SlidingWindow, DSClassifier, DST_SlidingWindow, DST
# Each time we get a point from *streamKnee*, the stream pointer moves to the next position (row) in the data.
get_points(streamKnee, n=10)
## x y .class
## 1 0.69096672 0.4479769 1
## 2 0.88431062 0.7716763 3
## 3 0.67828843 0.4393064 1
## 4 0.88589540 0.4421965 3
## 5 0.68621236 0.4739884 1
## 6 0.09350238 0.5664740 4
## 7 0.11568938 0.4971098 4
## 8 0.16164818 0.4566474 4
## 9 0.13153724 0.6242775 4
## 10 0.16323296 0.5780347 4
## Memorized Stream
## Class: DSD_Memory, DSD_R, DSD
## Contains 8666 data points - currently at position 11 - loop is TRUE
# Stream pointer is in position 11 now
# We can redirect the current position of the stream pointer by:
reset_stream(streamKnee, pos = 200)
get_points(streamKnee, n=10)
## x y .class
## 200 0.26465927 0.4161850 2
## 201 0.09033281 0.2543353 4
## 202 0.37717908 0.5693642 2
## 203 0.16798732 0.5375723 4
## 204 0.06180666 0.5202312 4
## 205 0.64817750 0.5520231 1
## 206 0.70206022 0.4566474 1
## 207 0.29001585 0.5693642 2
## 208 0.14104596 0.5289017 4
## 209 0.11885895 0.5202312 4
## Memorized Stream
## Class: DSD_Memory, DSD_R, DSD
## Contains 8666 data points - currently at position 210 - loop is TRUE
Let’s demonstrate clustering using DSC_DStream
, which
assigns points to cells in a grid. First, initialize the clustering as
an empty cluster and then use the update()
function to
implicitly alter the mutable DSC
object.
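The initialization itself is not echoed; a minimal sketch (the parameter values gridsize = 0.1 and Cm = 1.2 are assumptions, consistent with the grid spacing of the cluster centers printed below) would be:
library(stream)
# create an (initially empty) D-Stream grid-based micro-clusterer
dsc_streamKnee <- DSC_DStream(gridsize = 0.1, Cm = 1.2)
dsc_streamKnee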
## D-Stream
## Class: DSC_DStream, DSC_Micro, DSC_R, DSC
## Number of micro-clusters: 0
## Number of macro-clusters: 0
# stream::update
reset_stream(streamKnee, pos = 1)
# update the classifier with 100 points from the stream
update(dsc_streamKnee, streamKnee, n = 500)
# update(cl, streamKnee, n = 500)
head(get_centers(dsc_streamKnee))
## x y
## 1 0.05 0.45
## 2 0.05 0.55
## 3 0.15 0.35
## 4 0.15 0.45
## 5 0.15 0.55
## 6 0.25 0.45
# plot(dsc_streamKnee, streamKnee, grid = TRUE)
# Micro-clusters are plotted in red on top of gray stream data points
# The size of the micro-clusters indicates their weight - it's proportional to the number of data points represented by each micro-cluster.
# Micro-clusters are shown as dense grid cells (density is coded with gray values).
The purity metric represents an external evaluation criterion of cluster quality. It is the proportion of the total number of points that were classified correctly: \(0\leq Purity = \frac{1}{N} \sum_{i=1}^k { \max_j |c_i \cap t_j |} \leq 1\), where \(N\) is the number of observed data points, \(k\) is the number of clusters, \(c_i\) is the set of points in the \(i\)th cluster, and \(t_j\) is the set of points carrying the \(j\)th true class label (for each cluster we count its most frequent true label). High purity suggests that the cluster assignments align well with the true class labels.
Next, we can use K-means clustering.
kMeans_Knee <- DSC_Kmeans(k = 5) # choose 4-5 clusters as we have 4 knee labels
recluster(kMeans_Knee, dsc_streamKnee)
plot(kMeans_Knee, streamKnee, type = "both")
# purity <- animate_cluster(kMeans_Knee, streamKnee, n=2500, type="both", xlim=c(0,1), ylim=c(0,1), evaluationMeasure="purity", horizon=10)
animate_cluster(kMeans_Knee, streamKnee, horizon = 100, n = 5000, measure = "purity", plot.args = list(xlim = c(0, 1), ylim = c(0, 1)))
## points purity
## 1 0 NA
## 2 100 0.9000000
## 3 200 1.0000000
## 4 300 0.9818182
## 5 400 1.0000000
## 6 500 1.0000000
## 7 600 0.9333333
## 8 700 0.9104762
## 9 800 1.0000000
## 10 900 0.9555556
## 11 1000 1.0000000
## 12 1100 1.0000000
## 13 1200 1.0000000
## 14 1300 1.0000000
## 15 1400 1.0000000
## 16 1500 0.9130435
## 17 1600 1.0000000
## 18 1700 1.0000000
## 19 1800 1.0000000
## 20 1900 0.9529412
## 21 2000 1.0000000
## 22 2100 0.9360000
## 23 2200 0.9666667
## 24 2300 1.0000000
## 25 2400 1.0000000
## 26 2500 1.0000000
## 27 2600 0.9750000
## 28 2700 0.8809524
## 29 2800 0.9577381
## 30 2900 0.9846154
## 31 3000 0.9857143
## 32 3100 1.0000000
## 33 3200 1.0000000
## 34 3300 0.9222222
## 35 3400 0.8000000
## 36 3500 1.0000000
## 37 3600 1.0000000
## 38 3700 0.9339181
## 39 3800 0.9238095
## 40 3900 0.9750000
## 41 4000 0.8458333
## 42 4100 0.9062500
## 43 4200 1.0000000
## 44 4300 0.9933333
## 45 4400 0.9230769
## 46 4500 0.9666667
## 47 4600 0.9180451
## 48 4700 1.0000000
## 49 4800 0.9846154
## 50 4900 1.0000000
# Synthetic Gaussian example
# stream <- DSD_Gaussians(k = 3, d = 2, noise = 0.05)
# dstream <- DSC_DStream(gridsize = 0.1)
# update(dstream, stream, n = 2000)
# evaluate(dstream, stream, n = 100)
evaluate_static(dsc_streamKnee, streamKnee,
measure = c("crand", "SSQ", "silhouette"), n = 100,
type = c("auto", "micro", "macro"), assign = "micro",
assignmentMethod=c("auto", "model", "nn")) # , noise=c("class", "exclude"))
## Evaluation results for micro-clusters.
## Points were assigned to micro-clusters.
##
## cRand SSQ silhouette
## 0.28474810 0.53860025 -0.03565456
## attr(,"type")
## [1] "micro"
## attr(,"assign")
## [1] "micro"
clusterEval <- evaluate_stream(dsc_streamKnee, streamKnee, measure = c("numMicroClusters", "purity"), n=5000, horizon=100)
head(clusterEval)
## points numMicroClusters purity
## 1 0 14 0.9619048
## 2 100 15 0.9625000
## 3 200 17 0.9666667
## 4 300 19 0.9777778
## 5 400 20 0.9691358
## 6 500 19 0.9699248
# plot(clusterEval[ , "points"], clusterEval[ , "purity"], type = "l", ylab = "Avg Purity", xlab = "Points")
library(plotly)
plot_ly(x=~clusterEval[ , "points"], y=~clusterEval[ , "purity"], type="scatter", mode="markers+lines") %>%
layout(title="Streaming Data Classification (Knee Data): Average Cluster Purity",
xaxis=list(title="Streaming Points"), yaxis=list(title="Average Purity"))
animate_cluster(dsc_streamKnee, streamKnee, horizon = 100, n = 5000, measure = "purity", plot.args = list(xlim = c(0, 1), ylim = c(0, 1)))
## points purity
## 1 0 NA
## 2 100 0.9764706
## 3 200 0.9682540
## 4 300 0.9833333
## 5 400 0.9629630
## 6 500 0.9625000
## 7 600 0.9761905
## 8 700 0.9736842
## 9 800 0.9625000
## 10 900 0.9803922
## 11 1000 0.9642857
## 12 1100 0.9671053
## 13 1200 0.9736842
## 14 1300 0.9785714
## 15 1400 0.9809524
## 16 1500 0.9647059
## 17 1600 0.9699248
## 18 1700 0.9700000
## 19 1800 0.9722222
## 20 1900 0.9691358
## 21 2000 0.9714286
## 22 2100 0.9841270
## 23 2200 0.9696970
## 24 2300 1.0000000
## 25 2400 0.9649123
## 26 2500 0.9841270
## 27 2600 0.9772727
## 28 2700 0.9800000
## 29 2800 0.9583333
## 30 2900 0.9777778
## 31 3000 0.9868421
## 32 3100 0.9809524
## 33 3200 0.9777778
## 34 3300 0.9714286
## 35 3400 0.9750000
## 36 3500 0.9875000
## 37 3600 0.9750000
## 38 3700 0.9714286
## 39 3800 0.9736842
## 40 3900 0.9714286
## 41 4000 0.9666667
## 42 4100 0.9736842
## 43 4200 0.9826087
## 44 4300 0.9789474
## 45 4400 0.9750000
## 46 4500 0.9800000
## 47 4600 0.9684211
## 48 4700 0.9649123
## 49 4800 0.9800000
## 50 4900 0.9750000
The dsc_streamKnee
includes the clustering results,
where \(n\) represents the data points
taken from streamKnee
. The evaluation measure
can be specified as a vector of character strings. Points are assigned
to clusters in dsc_streamKnee
using
get_assignment()
and can be used to assess the quality of
the classification. By default, points are assigned to
micro-clusters, or can be assigned to macro-cluster
centers by assign = "macro"
. Also, new points can be
assigned to clusters by the rule used in the clustering algorithm by
assignmentMethod = "model"
or using nearest-neighbor
assignment (nn
).
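As an illustration, a call of the following form could assign a handful of fresh stream points to the existing micro-clusters (a sketch; the argument names follow the stream package documentation and should be checked against ?get_assignment):
reset_stream(streamKnee, pos = 1)
newPoints <- get_points(streamKnee, n = 5)
# assign the new points to the nearest micro-clusters
get_assignment(dsc_streamKnee, newPoints[, c("x", "y")], type = "micro", method = "nn")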
Just like we noticed in previous chapters, e.g., Chapter
9, streaming classification in R
may be slow and
memory-inefficient. These problems may become severe, especially for
datasets with millions of records or when using complex functions. There
are packages
for processing large datasets and memory optimization –
bigmemory
, biganalytics
,
bigtabulate
, etc.
dplyr
We have also seen long execution times when running processes that
ingest, store or manipulate huge data.frame
objects. The
dplyr
package, created by Hadley Wickham and Romain
François, provides a faster route to manage such large datasets in R. It
creates an object called tbl
, similar to
data.frame
, which has an in-memory column-like structure. R
reads these objects a lot faster than data frames.
To make a tbl
object we can either convert an existing
data frame to tbl
or connect to an external database.
Converting from data frame to tbl
is quite easy. All we
need to do is call the function as.tbl()
.
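A minimal sketch, assuming the Nof1 data has already been loaded into a data frame named nof1:
library(dplyr)
nof1_tbl <- as.tbl(nof1)    # recent dplyr versions prefer as_tibble(nof1)
nof1_tbl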
## # A tibble: 900 × 10
## id day tx selfeff selfeff25 wpss scssuppt pmss pmss3 qhyact
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 33 8 0.97 5 4.03 1.03 53
## 2 1 2 1 33 8 -0.17 3.87 4.03 1.03 73
## 3 1 3 0 33 8 0.81 4.84 4.03 1.03 23
## 4 1 4 0 33 8 -0.41 3.62 4.03 1.03 36
## 5 1 5 1 33 8 0.59 4.62 4.03 1.03 21
## 6 1 6 1 33 8 -1.16 2.87 4.03 1.03 0
## 7 1 7 0 33 8 0.3 4.33 4.03 1.03 21
## 8 1 8 0 33 8 -0.34 3.69 4.03 1.03 0
## 9 1 9 1 33 8 -0.74 3.29 4.03 1.03 73
## 10 1 10 1 33 8 -0.38 3.66 4.03 1.03 114
## # ℹ 890 more rows
This looks like a normal data frame. If you are using RStudio, viewing nof1_tbl shows the same content as nof1.
Similar to tbl
, the data.table
package
provides another alternative to data frame object representation.
data.table
objects are processed in R much faster compared
to standard data frames. Also, all of the functions that can accept data
frames could be applied to data.table
objects as well. The
function fread()
is able to read a local CSV file directly
into a data.table
.
# install.packages("data.table")
library(data.table)
nof1 <- fread("C:/Users/IvoD/Desktop/02_Nof1_Data.csv")
Another amazing property of data.table
is that we can
use subscripts to access a specific location in the dataset just like
dataset[row, column]
. It also allows the selection of rows
with a Boolean expression and direct application of functions to those
selected rows. Note that column names can be used to call the specific
column in data.table
, whereas with data frames, we have to
use the dataset$columnName
syntax.
## [1] 52.66667
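The exact query that produced the output above is not shown; an illustrative data.table expression of the same flavor, using the Nof1 columns displayed earlier, might be:
library(data.table)
# rows are filtered by a Boolean condition and the summary function is applied to the selected rows
nof1[tx == 1, mean(qhyact)]    # average daily activity (qhyact) on treatment days (tx == 1)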
This useful functionality can also help us run complex operations
with only a few lines of code. One of the drawbacks of using
data.table
objects is that they are still limited by the
available system memory.
ff
The ff
(fast-files) package allows us to overcome the
RAM limitations of finite system memory. For example, it helps with operating on datasets with billions of rows. ff creates objects in ffdf format, which acts like a map pointing to the location of the data on disk. However, this makes ffdf objects incompatible with most standard R functions. The only way to address this problem
inapplicable for most R functions. The only way to address this problem
is to break the huge dataset into small chunks. After processing a batch
of these small chunks, we have to combine the results to reconstruct the
complete output. This strategy is relevant in parallel computing, which
will be discussed in detail in the next section. First, let’s download
one of the large
datasets in our datasets archive, UQ_VitalSignsData_Case04.csv.
# install.packages("ff")
library(ff)
# vitalsigns<-read.csv.ffdf(file="UQ_VitalSignsData_Case04.csv", header=T)
vitalsigns <- read.csv.ffdf(file="https://umich.instructure.com/files/366335/download?download_frd=1", header=T)
As mentioned earlier, we cannot apply most base R functions directly on this ff object.
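For example, the following call (reconstructed from the warning message shown below) does not return a numeric result:
# base R mean() does not understand the ff-backed column
mean(vitalsigns$Pulse)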
## Warning in mean.default(vitalsigns$Pulse): argument is not numeric or logical:
## returning NA
## [1] NA
For basic data calculations, we can install another package, ffbase. It enables operations on ffdf objects, including mathematical operations, query functions, summary statistics, and larger regression models via packages like biglm, which will be mentioned later in this chapter.
# Install RTools: https://cran.r-project.org/bin/windows/Rtools/
# install.packages("ffbase")
## ff vs. ffbase package incompatibility:
# https://forums.ohdsi.org/t/solving-error-object-is-factor-ff-is-not-exported-by-namespace-ff/11745
# Downgrade ff package to 2.2.14
# install.packages("C:/Users/IvoD/Desktop/ff_2.2-14.tar.gz", repos = NULL, type="source")
library(ffbase)
mean(vitalsigns$Pulse)
bigmemory
The previously introduced packages include alternatives to
data.frames
. For instance, the bigmemory
package creates alternative objects to 2D matrices (second-order
tensors). It can store huge datasets that can be divided into small chunks and converted to data frames. However, we cannot directly apply most machine learning methods on these types of objects. More detailed information about the
bigmemory
package is available online.
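A minimal sketch of creating and indexing a large shared-memory matrix with bigmemory (assuming the package is installed; the dimensions below are arbitrary):
library(bigmemory)
x <- big.matrix(nrow = 1e6, ncol = 3, type = "double", init = 0)
dim(x)                                   # 1,000,000 x 3
x[1:5, ] <- matrix(rnorm(15), nrow = 5)  # standard matrix-style indexing works
head(x[, 1])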
In previous chapters, we saw various machine learning techniques applied as serial computing tasks. The traditional protocol involves the following steps. First, applying function 1 to our raw data. Then, using the output from function 1 as an input to function 2. This process is iterated for a series of functions. Finally, we have the terminal output generated by the last function. This serial or linear computing method is straightforward but time consuming and perhaps sub-optimal.
Now we introduce a more efficient approach, parallel computing, which provides a mechanism to work on different tasks at the same time and combine the outputs of all the processes to get the final answer faster. However, parallel algorithms may require special conditions and cannot be applied to all problems. If two tasks have to be run in a specific order (i.e., the second depends on the output of the first), that part of the problem cannot be parallelized.
To measure how much time can be saved for different methods, we can
use the function system.time()
.
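For instance, timing the (attempted) mean computation from above; the exact expression is inferred from the warning that follows:
system.time(mean(vitalsigns$Pulse))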
## Warning in mean.default(vitalsigns$Pulse): argument is not numeric or logical:
## returning NA
## user system elapsed
## 0 0 0
This means that the attempted calculation of the mean of the Pulse column in the vitalsigns dataset takes essentially no measurable time here (the elapsed time is reported as 0 seconds). These timing values will vary between computers, operating systems, and system load.
We will introduce two packages for parallel computing, multicore and snow (their core components are included in the base package parallel). They handle multitasking in different ways. However, to run these packages you need a relatively modern multicore computer. Let's check how many cores your computer has using parallel::detectCores(); since parallel is a base package, there is no need to install it prior to using it.
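For example, the following call produced the count reported below:
library(parallel)
detectCores()    # number of logical cores available on this machine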
## [1] 20
This reports the number of logical cores (here, \(20\)) of the computer used to compile and knit the Rmarkdown source of the DSPA book content. In practice, leave at least one or two cores free for the operating system and never use all available cores for computing, as this may overwhelm the machine.
The multicore
package simply uses the multitasking
capabilities of the kernel, the computer’s operating system, to
“fork” additional R sessions that share the same memory. Imagine that we
open several R sessions in parallel and let each of them do part of the
work. Now, let’s examine how this can save time when running complex
protocols or dealing with large datasets. To start with, we can use the
mclapply()
function, which is similar to
lapply()
, which applies a function to each element of a vector or list and returns a list. Instead of applying the function serially, mclapply() divides the complete computational task and
delegates portions of it to each available core. We will apply a simple,
yet time consuming, task - generating random numbers - for demonstrating
this procedure. Also, we can use the system.time()
to track
the time differences.
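The single-core baseline call is not echoed above; it presumably has a form similar to this sketch:
# default single-core invocation: generate 10,000,000 random numbers serially
system.time(c1 <- rnorm(10000000))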
## user system elapsed
## 0.09 0.00 0.27
# Note the multi core calls may not work on Windows, but will work on Linux/Mac.
# This shows 2-core and 4-core invocations
# system.time(c2<-unlist(mclapply(1:2, function(x){rnorm(5000000)}, mc.cores = 2)))
# system.time(c4<-unlist(mclapply(1:4, function(x){rnorm(2500000)}, mc.cores = 4)))
# And here is a Windows-compatible invocation using a PSOCK cluster
library(parallel)
numWorkers <- 4
cl <-makeCluster(numWorkers, type="PSOCK")
system.time(c2 <- unlist(parLapply(cl,1:10, function(x) {rnorm(10000000)})))
## user system elapsed
## 0.19 0.07 1.72
The unlist() call at the end combines the results from the different cores into a single vector. Each invocation generates a total of \(10,000,000\) random numbers. The default single-core R invocation (c1) uses the most CPU time; the two-core invocation (c2) splits the work so that each core handles \(5,000,000\) (half) of the random number generations and uses less time; and the four-core invocation (c4) reduces the computing time even further. Clearly, using additional cores can significantly shrink the overall execution time.
The snow
package allows parallel computing on multicore
multiprocessor machines or a network of multiple machines. It might be
more difficult to use but it’s also certainly more flexible. First we
can set how many cores we want to use via the makeCluster()
function.
This call might cause a pop-up message warning about access through
the firewall. To do the same task we can use the
parLapply()
function in the snow
package. Note
that we have to call the object we created with the previous
makeCluster()
function.
## user system elapsed
## 0.01 0.02 0.22
When using parLapply(), we have to specify the cluster object, the vector (or list) to iterate over, and the function that will be applied to each element. Remember to stop the cluster after completing the task in order to release the system resources.
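For example, for the PSOCK cluster cl created earlier:
stopCluster(cl)    # shut down the workers created with makeCluster() and free system resources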
foreach
and doParallel
The foreach
package provides another option for parallel computing. It relies on a loop-like construct that applies a specified function to each item in a set, which again is somewhat
similar to apply()
, lapply()
and other regular
functions. The interesting thing is that these loops can be computed in
parallel saving substantial amounts of time. The foreach
package alone cannot provide parallel computing. We have to combine it
with other packages like doParallel
. Let’s reexamine the
task of creating a vector of 10,000,000 random numbers. First, register
the 4 compute cores using registerDoParallel()
.
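The registration step is not echoed in the rendered notes; a typical invocation (a sketch) is:
# install.packages("doParallel")
library(doParallel)
registerDoParallel(cores = 4)    # register 4 worker cores for %dopar%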
Then we can examine the time saving foreach
command.
#install.packages("foreach")
library(foreach)
system.time(c4 <- foreach(i=1:4, .combine = 'c') %dopar% rnorm(2500000) )
## user system elapsed
## 0.04 0.03 0.17
Here we used four items (each item runs on a separate core); the argument .combine = 'c' tells foreach to combine the per-core results using c(), generating the aggregate result vector.
Also, don't forget to shut down the doParallel backend when finished by registering the sequential backend.
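For example:
registerDoSEQ()    # revert foreach to sequential execution when the parallel work is done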
Modern computers have graphics cards (GPUs, Graphics Processing Units) that consist of thousands of cores; however, these cores are highly specialized, unlike the standard CPU. If we can use this hardware for parallel computing, we may reach substantial performance improvements, at the cost of complicating the processing algorithms and increasing the constraints on the data format. Specific disadvantages of GPU computing include reliance on proprietary manufacturer frameworks (e.g., NVIDIA) and the Compute Unified Device Architecture (CUDA) programming model. CUDA allows programming of GPU instructions in a common computing language. This paper provides one example of using GPU computation to significantly improve the performance of advanced neuroimaging and brain mapping processing of multidimensional data.
The R package gputools
was created for parallel computing using NVIDIA CUDA. Detailed GPU computing in R information is available online.
As we mentioned earlier, some tasks can be parallelized easier than others. In real world situations, we can pick the algorithms that lend themselves well to parallelization. Some of the R packages that allow parallel computing using ML algorithms are listed below.
biglm
The R biglm
package allows training regression models with data from SQL databases
or large data chunks obtained from the ff
package. The
output is similar to the standard lm()
function that builds
linear models. However, biglm
operates efficiently on
massive datasets.
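A minimal sketch (assuming the biglm package is installed; the simple linear model below reuses the knee-pain data frame from earlier in this chapter purely for illustration):
library(biglm)
# fit a linear model with biglm's memory-efficient representation
fit <- biglm(Y ~ X, data = kneeRawData_df)
summary(fit)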
bigrf
The bigrf package
can be used to train random forests combining the foreach
and doParallel
packages. In Chapter
9, we presented random forests as machine learners ensembling
multiple tree learners. With parallel computing, we can split the task
of creating thousands of trees into smaller tasks that can be outsourced to each available compute core. We only need to combine the results at the end, obtaining essentially the same forest in a considerably shorter amount of time.
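The bigrf API itself is not shown here; as an illustration of the same split-and-combine idea, the following sketch grows four sub-forests in parallel with foreach/doParallel and merges them with randomForest::combine (the dataset and all parameter values are assumptions for illustration only):
library(randomForest); library(foreach); library(doParallel)
registerDoParallel(cores = 4)
# grow 4 x 125 = 500 trees, 125 trees per worker, then merge the sub-forests
rf_par <- foreach(ntree = rep(125, 4), .combine = randomForest::combine,
                  .packages = "randomForest") %dopar%
  randomForest(Species ~ ., data = iris, ntree = ntree)
rf_par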
caret
Combining the caret package with foreach, we can obtain a powerful method to deal with time-consuming tasks like building a random forest learner. Utilizing the same example we presented in Chapter 9, we can see the time difference when utilizing the foreach package.
#library(caret)
system.time(m_rf <- train(CHARLSONSCORE ~ ., data=qol, method="rf", metric="Kappa", trControl=ctrl, tuneGrid=grid_rf))
## user system elapsed
## 41.02 0.42 91.65
It took about a minute and a half to finish this task in the standard (single-core) execution model, relying purely on the regular caret function. Below, the same model training completes much faster using parallelization, in roughly a third of the time of the standard call above.
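The parallel backend registration is not echoed; a typical setup (a sketch) that would also explain the worker count printed below is:
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
getDoParWorkers()    # confirm the number of registered workers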
## [1] 4
system.time(m_rf <- train(CHARLSONSCORE ~ ., data = qol, method = "rf", metric = "Kappa", trControl = ctrl, tuneGrid = grid_rf))
## user system elapsed
## 1.25 0.00 31.29
Note that the call to train remains the same; there is no need to specify parallelization in the call. It automatically utilizes all registered resources, in this case 4 cores. The elapsed execution time is reduced from about 90 seconds (in the standard single-core run) down to about 30 seconds (in the parallel cluster setting).
The R markdown
notebook allows the user to execute jobs in a number of different
kinds of software platforms. In addition to R
, one can
define Python
, C/C++
, and many other
languages. The complete list of scripting and compiled languages supported by the knitr package is shown below.
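The engine names were presumably generated with:
names(knitr::knit_engines$get())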
## [1] "awk" "bash" "coffee" "gawk" "groovy" "haskell"
## [7] "lein" "mysql" "node" "octave" "perl" "php"
## [13] "psql" "Rscript" "ruby" "sas" "scala" "sed"
## [19] "sh" "stata" "zsh" "asis" "asy" "block"
## [25] "block2" "bslib" "c" "cat" "cc" "comment"
## [31] "css" "ditaa" "dot" "embed" "eviews" "exec"
## [37] "fortran" "fortran95" "go" "highlight" "js" "julia"
## [43] "python" "R" "Rcpp" "sass" "scss" "sql"
## [49] "stan" "targets" "tikz" "verbatim" "glue" "glue_sql"
## [55] "gluesql"
In this section, we will demonstrate the use of Python
within R
and the seamless integration between
R
, Python
, and C/C++
libraries.
This functionality substantially enriches the already comprehensive collection of thousands of existing R libraries and provides a mechanism to significantly improve the performance of any R protocol by outsourcing some of the heavy-duty calculations to external (compiled) C/C++ executables.
RStudio has a quick
demo of the reticulate package, which provides access to
tools and enables the interoperability between Python
and
R
.
We will demonstrate this interoperability by fitting some models using Python’s scikit-learn library.
Users need to first install Python
on their local
machines by downloading the software for the appropriate operating
system:
Under the download heading “Stable Releases”, select any
Python 3
version. Note that certain configurations may
require downloading and installing an earlier Python version \(\leq 3.8\).
NOTE: There may be a temporary incompatibility issue
between the reticulate
package and the latest Python
version (e.g., \(\geq 3.9\)). It may be
safer to download and install a slightly older Python 3 version. If
downloading the LATEST Python 3 release
fails the testing
below, try to reinstall an earlier Python version and try the tests
below again.
Once downloaded, run the installer following the prompts.
reticulate package
We need to load the (pre-installed) reticulate package and
point to the specific directory of the local Python installation on your
local machine. You can either manually type in the PATH
to
Python or use Sys.which("python3")
to find it
automatically, which may not work well if you don’t have the system
environmental variables correctly set.
Python Modules
Additional Python modules can be installed either using a
shell/terminal window for Mac OS system or cmd window
for Windows OS. In the command shell window, type in
pip install
and append it by the names of the modules you
want to install (e.g., pip install pandas
) and press
Enter. The module should be automatically downloaded and
installed on the local machine. Please make sure to install all of the
required modules (e.g., pandas, scikit-learn)
before you move onto the next stage. Some of these additional packages
may be automatically installed by a conda
python installation.
Following a successful installation of the add-on packages, we can
import python and any additional modules into the R environment. Note
the new notebook specification {python}
, instead of
{r}
, in the chunk of executable code.
NOTE: The RStudio version must be \(2023\) or newer to allow passing objects between
R
, Python
, and any other of the languages that
can be invoked in the R
markdown notebook. See this
RStudio Reticulate video.
Let's demonstrate the integration of R with external Python functions, as well as Python code embedded directly within the same Rmarkdown electronic notebook using ```{python ...}``` blocks of code.
We can call external Python
functions directly in
R
via the reticulate::source_python(’*.py’).
In this example, we are using this Python
function, getEvenNumbers(), which takes a vector of numbers
and returns a sub-vector only containing the even numbers from
the input. Mind that this R-Python
integration is done
within an R
block of code, not as we showed earlier in a
separate Python
code block. The input is an R
object, the calculations are done purely by an external
Python
function, and the Python result is
automatically converted to an R object, which at the end is
printed out.
### <Start_Python_Code>
# # Pretest this function on colab before integrating within R/RStudio:
# # https://colab.research.google.com/
# # Mind Python strict code formatting principles: https://peps.python.org/pep-0008/
#
# # Define an external Python function that takes a numerical object
# # and returns an object including only the even numbers from the input
# def getEvenNumbers(numbers):
# even_nums = [num for num in numbers if not num % 2]
# return even_nums
#
# # Uncomment next line to Test the getEvenNumbers() function
# # getEvenNumbers([1, 2, 3, 4, 5, 6, 7, 9, 2.2, 1000000000000000])
# # Result should be: [2, 4, 6, 1000000000000000]
### <End_Python_Code>
# detach("package:reticulate", unload=TRUE)
### R code
# First source the Python code; for local Python files use: reticulate::source_python("/path/evenNumbersPythonFunction.py")
library(devtools)
library(reticulate)
# Local test:
# reticulate::source_python('C:/Users/user/Desktop/evenNumbersPythonFunction.py')
sourceURL <- "https://socr.umich.edu/DSPA2/DSPA2_notes/evenNumbersPythonFunction.py"
localSource <- "evenNumbersPythonFunction.py"
download.file(url=sourceURL, destfile=localSource)
source_python('evenNumbersPythonFunction.py')
# R data object (a numerical vector)
rInputVector <- c(1, 2, 3, 4, 5, 6, 7, 9, 2.2, 1000000000000000)
# Call the python function and retrieve the python result auto-converted as R object!
rOutputResults <- getEvenNumbers(rInputVector)
cat("R-data object ")
## R-data object
## 1 2 3 4 5 6 7 9 2.2 1e+15
## is sent to external Python function, evenNumbersPythonFunction.py,
## which is executed via reticulate package and the result object
## 2 4 6 1e+15
## is accessible/printed in R!
Next, we’ll showcase R-Python
integration directly into
RStudio/Rmarkdown notebook.
# import the necessary python packages (pandas) and sub-packages (sklearn.tree.DecisionTreeClassifier)
# Use the Anaconda shell to install the scikit-learn Python Package
# %> conda install anaconda::scikit-learn
import pandas
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
Let’s load the iris
data in R
, pass it onto
Python
, and split it into training and testing sets using
the sklearn.model_selection.train_test_split()
method.
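The R chunk loading the data is not echoed; it is presumably equivalent to:
# R chunk: load iris; reticulate exposes it to Python chunks as r.iris
data(iris)
head(iris)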
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Note that some of the code in this section of the Rmarkdown
notebook delimited between \[\begin{align*}
&{\text{```{python ...} }}\\
&{\text{... <python code block> ... }}\\
&{\text{```}}\end{align*}\] is python
, not
R
, e.g., train_test_split()
,
DecisionTreeClassifier()
.
# This is a ```{python ...}``` code block!
# report five cases of the data within Python (pandas positional slicing [1:6] returns rows 2-6)
print(r.iris[1:6])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 4.9 3.0 1.4 0.2 setosa
## 2 4.7 3.2 1.3 0.2 setosa
## 3 4.6 3.1 1.5 0.2 setosa
## 4 5.0 3.6 1.4 0.2 setosa
## 5 5.4 3.9 1.7 0.4 setosa
# Split the data in Python (use random seed for reproducibility)
train, test = train_test_split(r.iris, test_size = 0.4, random_state = 4321)
X = train.drop('Species', axis = 1)
y = train.loc[:, 'Species'].values
X_test = test.drop('Species', axis = 1)
y_test = test.loc[:, 'Species'].values
Let's pull back into R the first few training observations
the first 5 training observations
(\(X\)) from the Python
object. Note that \(X\) is a Python
object generated in the Python
chunk that we are now
processing within the R
chunk. Mind the use of the
py$
prefix to the object (py$X
). As the
train_test_split()
method does random selection of rows
(cases) into the training and testing sets, the top 5 cases reported in
the initial ordering of the cases by R
may be different
from the top 5 cases reported after the Python
block
processing.
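The corresponding R call is presumably:
# R chunk: access the Python training features X via the py$ prefix
head(py$X)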
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 41 4.5 2.3 1.3 0.3
## 16 5.4 3.9 1.3 0.4
## 26 5.0 3.4 1.6 0.4
## 99 5.7 2.8 4.1 1.3
## 5 5.4 3.9 1.7 0.4
## 85 6.0 3.4 4.5 1.6
Next, we will fit a simple decision tree model within
Python
using sklearn
on the training data and
evaluate its performance on the independent testing set and visualize
the results in R
.
# Model fitting in Python
tree = DecisionTreeClassifier(random_state=4321)
clf = tree.fit(X, y)
pred = clf.predict(X_test)
pred[1:6]
## array(['setosa', 'virginica', 'virginica', 'setosa', 'virginica'],
## dtype=object)
R
To begin with, we will pull the Python
pandas dataset
into an R
object.
# Store python pandas object as R tibble and identify correct and incorrect predicted labels
library(kableExtra)
library(tibble)
foo <- py$test %>%
as_tibble() %>%
rename(truth = Species) %>%
mutate(predicted = as.factor(py$pred),
correct = (truth == predicted))
foo %>%
head(5) %>%
select(-Petal.Length, -Petal.Width) %>%
kable() %>%
kable_styling()
| Sepal.Length | Sepal.Width | truth | predicted | correct |
|---|---|---|---|---|
| 5.4 | 3.4 | setosa | setosa | TRUE |
| 5.1 | 3.3 | setosa | setosa | TRUE |
| 5.9 | 3.2 | versicolor | virginica | FALSE |
| 6.3 | 3.3 | virginica | virginica | TRUE |
| 5.1 | 3.8 | setosa | setosa | TRUE |
Finally, we can plot in R
the testing-data results and
compare the real iris flower taxa labels (colors) and their
predicted-label counterparts (shapes).
# R
# p1 <- py$test %>%
# ggplot(aes(py$test$Petal.Length, py$test$Petal.Width, color = py$test$Species)) + # Species == py$y
# geom_point(size = 4) +
# labs(x = "Petal Length", y = "Petal Width") +
# theme_bw() +
# theme(legend.position = "none") +
# ggtitle("Raw Testing Data Iris Differences",
# subtitle = str_c("Petal Length and Width vary",
# "\n", "significantly between species"))
#
# p2 <- py$test %>%
# ggplot(aes(py$test$Petal.Length, py$test$Petal.Width,
# color = py$test$Species), shape=as.factor(py$pred)) +
# geom_point(size = 4, aes(shape=as.factor(py$pred), color = py$test$Species)) +
# labs(x = "Petal Length", y = "Petal Width") +
# #theme_bw() +
# theme(legend.position = "right") +
# #scale_shape_manual(name="Species",
# # values=as.factor(py$pred), labels=as.factor(py$pred)) +
# ggtitle("Raw (Colors) vs. Predicted (Shape)\n Iris Differences",
# subtitle = str_c("Petal Length and Width between species"))
#
# grid.arrange(p1, p2, layout_matrix = rbind(c(1,3)))
library(plotly)
plot_ly(py$test, x=~py$test$Petal.Length, y=~py$test$Petal.Width, color = ~py$test$Species,
symbol = ~as.factor(py$pred), type="scatter", marker = list(size = 20), mode="markers") %>%
layout(title="Python Iris Taxa Prediction: Raw (Colors) vs. Predicted (Shape) Species",
xaxis=list(title="Petal Length"), yaxis=list(title="Petal Width"),
legend = list(orientation='h'))
There are many alternative ways to blend R
and
C/C++/Cpp
code. The simplest approach may be to use inline
C++
functions directly in R
via the cppFunction().
Alternatively, we can keep C++
source files completely
independent and sourceCpp()
them into R
for
indirect use. Here is an example of a stand-alone C++
program meanCPP()
computing the mean and standard deviation
of a vector input. To try this, save the C++
code below in
a text file: meanCPP.cpp
and invoke it within
R
. Note that the C++
code can also include
R
method calls, e.g., sdR()!
Note: This R/C++ integration requires the Rtools toolchain (on Windows) and the make utility, as well as a proper PATH environment variable setting, which can be checked and set as shown below:
### <Start_CPP_Code>
# #include <Rcpp.h>
# using namespace Rcpp; // this is a required name-space declaration in the C++ code
#
# /*** functions that will be used within R are prefixed with: `// [[Rcpp::export]]`.
# We can compile the C++ code within R by *sourceCpp("/path_to/meanCPP.cpp")*.
# These compiled functions can be used in R, but can't be saved in a `.Rdata` files
# and need to always be reloaded prior to reuse after `R` restart.
# */
#
# // [[Rcpp::export]]
# double meanCPP(NumericVector vec) {
# int n = vec.size();
# double total = 0;
#
# for(int i = 0; i < n; ++i) { // mind the C++ indexing starts at zero, not 1, as in R
# total += vec[i];
# }
# return total/n;
# }
# /*** R
# # This is R code embedded in C++ to compute the SD of a vector
# sdR <- function (vec) {
# return(sd(vec))
# }
# */
### <End_CPP_Code>
Next, we will demonstrate the R
and C++
integration.
### R code
# First source C++ code: for local C++ files: sourceCpp("/path/meanCPP.cpp")
library(devtools)
library(Rcpp)
# install.packages("miniUI")
# library(miniUI)
# install.packages("pkgbuild")
library(pkgbuild)
sourceURL <- "https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/meanCPP.cpp"
localSource <- "meanCPP.cpp"
download.file(url=sourceURL, destfile=localSource)
sourceCpp("meanCPP.cpp")
##
## > sdR <- function(vec) {
## + return(sd(vec))
## + }
# Call outside C++ meanCPP() method
r_vec <- rnorm(10^8) # generate 100M random values & compare computational complexity
system.time(m1 <- mean(r_vec)) # R solution
## user system elapsed
## 0.06 0.00 0.12
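The corresponding C++ timing call is not echoed; it presumably has the form:
system.time(m2 <- meanCPP(r_vec))   # C++ solution via the sourced Rcpp function
# round(m2, 3)                      # the estimated mean of the N(0,1) sample is ~0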
## user system elapsed
## 0.03 0.00 0.06
## [1] 0
# Compare the sdR() function (R code embedded in the C++ source file) to base::sd()
s1 <- sdR(r_vec); round(s1, 3) # remember the data is N(mean=0, sd=1)
## [1] 1
## [1] 0
Notice that the C++
method meanCPP() is faster
in computing the mean compared to the native R
base::mean().
Try to analyze the co-appearance network in the novel "Les Misérables". The data contains the weighted network of co-appearances of characters in Victor Hugo's novel; nodes represent characters, as indicated by the labels, and edges connect any pair of characters that appear in the same chapter of the book. The values on the edges are the numbers of such co-appearances.
miserablese <- read.table("https://umich.instructure.com/files/330389/download?download_frd=1", sep="", header=F)
head(miserablese)
## V1 V2
## 1 Myriel Napoleon
## 2 Myriel MlleBaptistine
## 3 Myriel MmeMagloire
## 4 MlleBaptistine MmeMagloire
## 5 Myriel CountessDeLo
## 6 Myriel Geborand
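One possible starting point for this exercise is to build and inspect the graph with the igraph package (a sketch, assuming igraph is installed):
library(igraph)
g <- graph_from_data_frame(miserablese, directed = FALSE)  # two-column edge list
summary(g)                                                 # number of vertices and edges
plot(g, vertex.size = 5, vertex.label.cex = 0.6)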
Also, try to interrogate some of the larger datasets we have using alternative parallel computing and big data analytics.