In this chapter, we will discuss some technical details about data formats, streaming, optimization of computation, and distributed deployment of optimized learning algorithms. Chapter 21 provides additional optimization details.
The Internet of Things (IoT) leads to a paradigm shift in scientific inference - from static data interrogated in a batch or distributed environment to on-demand, service-based Cloud computing. Here, we will demonstrate how to work with specialized data, data streams, and SQL databases, as well as develop and assess on-the-fly data modeling, classification, prediction and forecasting methods. Important examples to keep in mind throughout this chapter include high-frequency data delivered in real time in hospital ICUs (e.g., microsecond Electroencephalography signals, EEGs), dynamically changing stock market data (e.g., the Dow Jones Industrial Average Index, DJI), and weather patterns.
We will present (1) format conversion and working with XML, SQL, JSON, CSV, SAS and other data objects, (2) visualization of bioinformatics and network data, (3) protocols for managing, classifying and predicting outcomes from data streams, (4) strategies for optimization, improvement of computational performance, parallel (MPI) and graphics (GPU) computing, and (5) processing of very large datasets.
Unlike the case-studies we saw in the previous chapters, real world data may not always be nicely formatted, e.g., as CSV files. We must collect, arrange, wrangle, and harmonize scattered information to generate computable data objects that can be further processed by various techniques. Data wrangling and preprocessing may take over 80% of the time researchers spend interrogating complex multi-source data archives. The following procedures will enhance your skills in collecting and handling heterogeneous real world data. Multiple examples of handling long-and-wide data, messy and tidy data, and data cleaning strategies can be found in the JSS Tidy Data article by Hadley Wickham.
The R package rio imports and exports various types of file formats, e.g., tab-separated (.tsv), comma-separated (.csv), JSON (.json), Stata (.dta), SPSS (.sav and .por), Microsoft Excel (.xls and .xlsx), Weka (.arff), and SAS (.sas7bdat and .xpt).
rio provides three important functions: import(), export() and convert(). They are intuitive, easy to understand, and efficient to execute. Take Stata (.dta) files as an example. First, we can download 02_Nof1_Data.dta from our datasets folder.
# install.packages("rio")
library(rio)
# Download the Stata .dta file first locally
# Local data can be loaded by:
#nof1<-import("02_Nof1_Data.dta")
# the data can also be loaded from the server remotely as well:
nof1<-read.csv("https://umich.instructure.com/files/330385/download?download_frd=1")
str(nof1)
## 'data.frame': 900 obs. of 10 variables:
## $ ID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Tx : int 1 1 0 0 1 1 0 0 1 1 ...
## $ SelfEff : int 33 33 33 33 33 33 33 33 33 33 ...
## $ SelfEff25: int 8 8 8 8 8 8 8 8 8 8 ...
## $ WPSS : num 0.97 -0.17 0.81 -0.41 0.59 -1.16 0.3 -0.34 -0.74 -0.38 ...
## $ SocSuppt : num 5 3.87 4.84 3.62 4.62 2.87 4.33 3.69 3.29 3.66 ...
## $ PMss : num 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 ...
## $ PMss3 : num 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 ...
## $ PhyAct : int 53 73 23 36 21 0 21 0 73 114 ...
The data is automatically stored as a data frame. Note that rio sets stringsAsFactors=FALSE by default.
rio can also help us export files into any other format we choose. To do this, we use the export() function.
#Sys.getenv("R_ZIPCMD", "zip") # Get the C Zip application
Sys.setenv(R_ZIPCMD="E:/Ivo.dir/Ivo_Tools/ZIP/bin/zip.exe")
Sys.getenv("R_ZIPCMD", "zip")
## [1] "E:/Ivo.dir/Ivo_Tools/ZIP/bin/zip.exe"
export(nof1, "C:/Users/Dinov/Desktop/02_Nof1.xlsx")
This line of code exports the Nof1 data in xlsx format to the specified location (if only a file name is given, the file is saved in the R working directory). Mac users may have a problem exporting *.xlsx files using rio because of the lack of a zip tool, but can still output other formats such as .csv. An alternative strategy to save an xlsx file is to use the package xlsx with its default row.names=TRUE.
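For instance, a minimal sketch using the xlsx package (the output file name here is arbitrary and the path should be adjusted to your system):
# install.packages("xlsx")
library(xlsx)
# write the data frame to an Excel file, keeping row names (the default row.names=TRUE)
write.xlsx(nof1, "C:/Users/Dinov/Desktop/02_Nof1_alt.xlsx", row.names = TRUE)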
rio also provides a one-step process to convert-and-save data into alternative formats. The following simple code allows us to convert and save the 02_Nof1_Data.dta file we just downloaded into a CSV file.
# convert("02_Nof1_Data.dta", "02_Nof1_Data.csv")
convert("C:/Users/Dinov/Desktop/02_Nof1.xlsx", "C:/Users/Dinov/Desktop/02_Nof1_Data.csv")
You can see a new CSV file pop-up in the working directory. Similar transformations are available for other data formats and types.
The CDC Behavioral Risk Factor Surveillance System (BRFSS) data, 2013-2015: this file, for the combined landline and cell phone data set, was exported from SAS V9.3 in the XPT transport format and contains 330 variables. This format can be imported into SPSS or STATA. Please note that some of the variable labels get truncated in the process of converting to the XPT format.
Be careful - this compressed (ZIP) file is over 315 MB in size!
# install.packages("Hmisc")
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
memory.size(max=T)
## [1] 111.31
pathToZip <- tempfile()
download.file("http://www.socr.umich.edu/data/DSPA/BRFSS_2013_2014_2015.zip", pathToZip)
# let's just pull two of the 3 years of data (2013 and 2015)
brfss_2013 <- sasxport.get(unzip(pathToZip)[1])
## Processing SAS dataset LLCP2013 ..
brfss_2015 <- sasxport.get(unzip(pathToZip)[3])
## Processing SAS dataset LLCP2015 ..
dim(brfss_2013); object.size(brfss_2013)
## [1] 491773 336
## 685581232 bytes
# summary(brfss_2013[1:1000, 1:10]) # subsample the data
# report the summaries for selected variables
summary(brfss_2013$has_plan)
## Length Class Mode
## 0 NULL NULL
brfss_2013$x.race <- as.factor(brfss_2013$x.race)
summary(brfss_2013$x.race)
## 1 2 3 4 5 6 7 8 9 NA's
## 376451 39151 7683 9510 1546 2693 9130 37054 8530 25
# clean up
unlink(pathToZip)
Let’s try to use logistic regression to find out if self-reported race/ethnicity predicts the binary outcome of having a health care plan.
brfss_2013$has_plan <- brfss_2013$hlthpln1 == 1
system.time(
gml1 <- glm(has_plan ~ as.factor(x.race), data=brfss_2013,
family=binomial)
) # report execution time
## user system elapsed
## 2.08 0.26 2.35
summary(gml1)
##
## Call:
## glm(formula = has_plan ~ as.factor(x.race), family = binomial,
## data = brfss_2013)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1862 0.4385 0.4385 0.4385 0.8047
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.293549 0.005649 406.044 <2e-16 ***
## as.factor(x.race)2 -0.721676 0.014536 -49.647 <2e-16 ***
## as.factor(x.race)3 -0.511776 0.032974 -15.520 <2e-16 ***
## as.factor(x.race)4 -0.329489 0.031726 -10.386 <2e-16 ***
## as.factor(x.race)5 -1.119329 0.060153 -18.608 <2e-16 ***
## as.factor(x.race)6 -0.544458 0.054535 -9.984 <2e-16 ***
## as.factor(x.race)7 -0.510452 0.030346 -16.821 <2e-16 ***
## as.factor(x.race)8 -1.332005 0.012915 -103.138 <2e-16 ***
## as.factor(x.race)9 -0.582204 0.030604 -19.024 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 353371 on 491747 degrees of freedom
## Residual deviance: 342497 on 491739 degrees of freedom
## (25 observations deleted due to missingness)
## AIC: 342515
##
## Number of Fisher Scoring iterations: 5
Next, we’ll examine the odds (or rather the log odds ratios, LOR) of having a health care plan (HCP) by race (R). The LORs are calculated for two array dimensions, separately for each race level (the presence of a health care plan (HCP) is binary, whereas race (R) has 9 levels, \(R1, R2, ..., R9\)). For example, the odds ratio of having a HCP for \(R1:R2\) is:
\[ OR(R1:R2) = \frac{\frac{P \left( HCP \mid R1 \right)}{1 - P \left( HCP \mid R1 \right)}}{\frac{P \left( HCP \mid R2 \right)}{1 - P \left( HCP \mid R2 \right)}} .\]
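As a quick sanity check, this ratio can also be estimated directly from the raw counts; a minimal sketch (assuming brfss_2013, with has_plan and the factor x.race as defined above, is still loaded):
# contingency table: rows = race level, columns = has_plan (FALSE/TRUE)
tab <- table(brfss_2013$x.race, brfss_2013$has_plan)
odds <- tab[, "TRUE"] / tab[, "FALSE"]   # odds of having a HCP within each race level
odds["1"] / odds["2"]                    # empirical OR(R1:R2); its logarithm should agree (up to sign convention) with the 1:2 entry of the loddsratio() output below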
# install.packages("vcd")
# load the vcd package to compute the LOR
library("vcd")
## Loading required package: grid
lor_HCP_by_R <- loddsratio(has_plan ~ as.factor(x.race), data = brfss_2013)
lor_HCP_by_R
## log odds ratios for has_plan and as.factor(x.race)
##
## 1:2 2:3 3:4 4:5 5:6 6:7
## -0.72167619 0.20990061 0.18228646 -0.78984000 0.57487142 0.03400611
## 7:8 8:9
## -0.82155382 0.74980101
Now, let’s see an example of querying a database containing a structured relational collection of data records. A query is a machine instruction (typically represented as text) sent by a user to a remote database requesting a specific database operation (e.g., search or summary). One database communication protocol relies on SQL (Structured Query Language). MySQL is an instance of a database management system that supports SQL communication and that many web applications utilize, e.g., YouTube, Flickr, Wikipedia, and biological databases like GO and Ensembl. Below is an example of an SQL query using the package RMySQL. An alternative way to interface an SQL database is via the package RODBC.
# install.packages("DBI"); install.packages("RMySQL")
# install.packages("RODBC"); library(RODBC)
library(DBI)
library(RMySQL)
ucscGenomeConn <- dbConnect(MySQL(),
user='genome',
dbname='hg38',
host='genome-mysql.cse.ucsc.edu')
result <- dbGetQuery(ucscGenomeConn,"show databases;");
# List the DB tables
allTables <- dbListTables(ucscGenomeConn); length(allTables)
# Get dimensions of a table, read and report the head
dbListFields(ucscGenomeConn, "affyU133Plus2")
affyData <- dbReadTable(ucscGenomeConn, "affyU133Plus2"); head(affyData)
# Select a subset, fetch the data, and report the quantiles
subsetQuery <- dbSendQuery(ucscGenomeConn, "select * from affyU133Plus2 where misMatches between 1 and 3")
affySmall <- fetch(subsetQuery); quantile(affySmall$misMatches)
# Get the repeat mask; the pipeline below additionally requires the dplyr, stringr and readr packages
library(dplyr); library(stringr); library(readr)
bedFile <- 'repUCSC.bed'
df <- dbSendQuery(ucscGenomeConn,'select genoName,genoStart,genoEnd,repName,swScore, strand,
            repClass, repFamily from rmsk') %>%
  dbFetch(n=-1) %>%
  mutate(genoName = str_replace(genoName,'chr','')) %>%
  as_tibble() %>%
  write_tsv(bedFile, col_names=F)
message('written ', bedFile)
# Once done, close the connection
dbDisconnect(ucscGenomeConn)
Completing the above database SQL commands requires access to the remote UCSC SQL Genome server and user-specific credentials. The example below can be run by all users, as it relies only on local services.
# install.packages("RSQLite")
library("RSQLite")
# generate an empty DB and store it in RAM
myConnection <- dbConnect(RSQLite::SQLite(), ":memory:")
myConnection
## <SQLiteConnection>
## Path: :memory:
## Extensions: TRUE
dbListTables(myConnection)
## character(0)
# Add tables to the local SQL DB
data(USArrests); dbWriteTable(myConnection, "USArrests", USArrests)
dbWriteTable(myConnection, "brfss_2013", brfss_2013)
dbWriteTable(myConnection, "brfss_2015", brfss_2015)
# Check again the DB content
dbListFields(myConnection, "brfss_2013")
## [1] "x.state" "fmonth" "idate" "imonth" "iday"
## [6] "iyear" "dispcode" "seqno" "x.psu" "ctelenum"
## [11] "pvtresd1" "colghous" "stateres" "cellfon3" "ladult"
## [16] "numadult" "nummen" "numwomen" "genhlth" "physhlth"
## [21] "menthlth" "poorhlth" "hlthpln1" "persdoc2" "medcost"
## [26] "checkup1" "sleptim1" "bphigh4" "bpmeds" "bloodcho"
## [31] "cholchk" "toldhi2" "cvdinfr4" "cvdcrhd4" "cvdstrk3"
## [36] "asthma3" "asthnow" "chcscncr" "chcocncr" "chccopd1"
## [41] "havarth3" "addepev2" "chckidny" "diabete3" "veteran3"
## [46] "marital" "children" "educa" "employ1" "income2"
## [51] "weight2" "height3" "numhhol2" "numphon2" "cpdemo1"
## [56] "cpdemo4" "internet" "renthom1" "sex" "pregnant"
## [61] "qlactlm2" "useequip" "blind" "decide" "diffwalk"
## [66] "diffdres" "diffalon" "smoke100" "smokday2" "stopsmk2"
## [71] "lastsmk2" "usenow3" "alcday5" "avedrnk2" "drnk3ge5"
## [76] "maxdrnks" "fruitju1" "fruit1" "fvbeans" "fvgreen"
## [81] "fvorang" "vegetab1" "exerany2" "exract11" "exeroft1"
## [86] "exerhmm1" "exract21" "exeroft2" "exerhmm2" "strength"
## [91] "lmtjoin3" "arthdis2" "arthsocl" "joinpain" "seatbelt"
## [96] "flushot6" "flshtmy2" "tetanus" "pneuvac3" "hivtst6"
## [101] "hivtstd3" "whrtst10" "pdiabtst" "prediab1" "diabage2"
## [106] "insulin" "bldsugar" "feetchk2" "doctdiab" "chkhemo3"
## [111] "feetchk" "eyeexam" "diabeye" "diabedu" "painact2"
## [116] "qlmentl2" "qlstres2" "qlhlth2" "medicare" "hlthcvrg"
## [121] "delaymed" "dlyother" "nocov121" "lstcovrg" "drvisits"
## [126] "medscost" "carercvd" "medbills" "ssbsugar" "ssbfrut2"
## [131] "wtchsalt" "longwtch" "dradvise" "asthmage" "asattack"
## [136] "aservist" "asdrvist" "asrchkup" "asactlim" "asymptom"
## [141] "asnoslep" "asthmed3" "asinhalr" "harehab1" "strehab1"
## [146] "cvdasprn" "aspunsaf" "rlivpain" "rduchart" "rducstrk"
## [151] "arttoday" "arthwgt" "arthexer" "arthedu" "imfvplac"
## [156] "hpvadvc2" "hpvadsht" "hadmam" "howlong" "profexam"
## [161] "lengexam" "hadpap2" "lastpap2" "hadhyst2" "bldstool"
## [166] "lstblds3" "hadsigm3" "hadsgco1" "lastsig3" "pcpsaad2"
## [171] "pcpsadi1" "pcpsare1" "psatest1" "psatime" "pcpsars1"
## [176] "pcpsade1" "pcdmdecn" "rrclass2" "rrcognt2" "rratwrk2"
## [181] "rrhcare3" "rrphysm2" "rremtsm2" "misnervs" "mishopls"
## [186] "misrstls" "misdeprd" "miseffrt" "miswtles" "misnowrk"
## [191] "mistmnt" "mistrhlp" "misphlpf" "scntmony" "scntmeal"
## [196] "scntpaid" "scntwrk1" "scntlpad" "scntlwk1" "scntvot1"
## [201] "rcsgendr" "rcsrltn2" "casthdx2" "casthno2" "emtsuprt"
## [206] "lsatisfy" "ctelnum1" "cellfon2" "cadult" "pvtresd2"
## [211] "cclghous" "cstate" "landline" "pctcell" "qstver"
## [216] "qstlang" "mscode" "x.ststr" "x.strwt" "x.rawrake"
## [221] "x.wt2rake" "x.imprace" "x.impnph" "x.chispnc" "x.crace1"
## [226] "x.impcage" "x.impcrac" "x.impcsex" "x.cllcpwt" "x.dualuse"
## [231] "x.dualcor" "x.llcpwt2" "x.llcpwt" "x.rfhlth" "x.hcvu651"
## [236] "x.rfhype5" "x.cholchk" "x.rfchol" "x.ltasth1" "x.casthm1"
## [241] "x.asthms1" "x.drdxar1" "x.prace1" "x.mrace1" "x.hispanc"
## [246] "x.race" "x.raceg21" "x.racegr3" "x.race.g1" "x.ageg5yr"
## [251] "x.age65yr" "x.age.g" "htin4" "htm4" "wtkg3"
## [256] "x.bmi5" "x.bmi5cat" "x.rfbmi5" "x.chldcnt" "x.educag"
## [261] "x.incomg" "x.smoker3" "x.rfsmok3" "drnkany5" "drocdy3."
## [266] "x.rfbing5" "x.drnkdy4" "x.drnkmo4" "x.rfdrhv4" "x.rfdrmn4"
## [271] "x.rfdrwm4" "ftjuda1." "frutda1." "beanday." "grenday."
## [276] "orngday." "vegeda1." "x.misfrtn" "x.misvegn" "x.frtresp"
## [281] "x.vegresp" "x.frutsum" "x.vegesum" "x.frtlt1" "x.veglt1"
## [286] "x.frt16" "x.veg23" "x.fruitex" "x.vegetex" "x.totinda"
## [291] "metvl11." "metvl21." "maxvo2." "fc60." "actin11."
## [296] "actin21." "padur1." "padur2." "pafreq1." "pafreq2."
## [301] "x.minac11" "x.minac21" "strfreq." "pamiss1." "pamin11."
## [306] "pamin21." "pa1min." "pavig11." "pavig21." "pa1vigm."
## [311] "x.pacat1" "x.paindx1" "x.pa150r2" "x.pa300r2" "x.pa30021"
## [316] "x.pastrng" "x.parec1" "x.pastae1" "x.lmtact1" "x.lmtwrk1"
## [321] "x.lmtscl1" "x.rfseat2" "x.rfseat3" "x.flshot6" "x.pneumo2"
## [326] "x.aidtst3" "x.age80" "x.impeduc" "x.impmrtl" "x.imphome"
## [331] "rcsbrac1" "rcsrace1" "rchisla1" "rcsbirth" "typeinds"
## [336] "typework" "has_plan"
dbListTables(myConnection);
## [1] "USArrests" "brfss_2013" "brfss_2015"
# Retrieve the entire DB table (for the smaller USArrests table)
dbGetQuery(myConnection, "SELECT * FROM USArrests")
## Murder Assault UrbanPop Rape
## 1 13.2 236 58 21.2
## 2 10.0 263 48 44.5
## 3 8.1 294 80 31.0
## 4 8.8 190 50 19.5
## 5 9.0 276 91 40.6
## 6 7.9 204 78 38.7
## 7 3.3 110 77 11.1
## 8 5.9 238 72 15.8
## 9 15.4 335 80 31.9
## 10 17.4 211 60 25.8
## 11 5.3 46 83 20.2
## 12 2.6 120 54 14.2
## 13 10.4 249 83 24.0
## 14 7.2 113 65 21.0
## 15 2.2 56 57 11.3
## 16 6.0 115 66 18.0
## 17 9.7 109 52 16.3
## 18 15.4 249 66 22.2
## 19 2.1 83 51 7.8
## 20 11.3 300 67 27.8
## 21 4.4 149 85 16.3
## 22 12.1 255 74 35.1
## 23 2.7 72 66 14.9
## 24 16.1 259 44 17.1
## 25 9.0 178 70 28.2
## 26 6.0 109 53 16.4
## 27 4.3 102 62 16.5
## 28 12.2 252 81 46.0
## 29 2.1 57 56 9.5
## 30 7.4 159 89 18.8
## 31 11.4 285 70 32.1
## 32 11.1 254 86 26.1
## 33 13.0 337 45 16.1
## 34 0.8 45 44 7.3
## 35 7.3 120 75 21.4
## 36 6.6 151 68 20.0
## 37 4.9 159 67 29.3
## 38 6.3 106 72 14.9
## 39 3.4 174 87 8.3
## 40 14.4 279 48 22.5
## 41 3.8 86 45 12.8
## 42 13.2 188 59 26.9
## 43 12.7 201 80 25.5
## 44 3.2 120 80 22.9
## 45 2.2 48 32 11.2
## 46 8.5 156 63 20.7
## 47 4.0 145 73 26.2
## 48 5.7 81 39 9.3
## 49 2.6 53 66 10.8
## 50 6.8 161 60 15.6
# Retrieve just the average of one feature
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests"); myQuery
## avg(Assault)
## 1 170.76
myQuery <- dbGetQuery(myConnection, "SELECT avg(Assault) FROM USArrests GROUP BY UrbanPop"); myQuery
## avg(Assault)
## 1 48.00
## 2 81.00
## 3 152.00
## 4 211.50
## 5 271.00
## 6 190.00
## 7 83.00
## 8 109.00
## 9 109.00
## 10 120.00
## 11 57.00
## 12 56.00
## 13 236.00
## 14 188.00
## 15 186.00
## 16 102.00
## 17 156.00
## 18 113.00
## 19 122.25
## 20 229.50
## 21 151.00
## 22 231.50
## 23 172.00
## 24 145.00
## 25 255.00
## 26 120.00
## 27 110.00
## 28 204.00
## 29 237.50
## 30 252.00
## 31 147.50
## 32 149.00
## 33 254.00
## 34 174.00
## 35 159.00
## 36 276.00
# Or do it in batches (for the much larger brfss_2013 and brfss_2015 tables)
myQuery <- dbGetQuery(myConnection, "SELECT * FROM brfss_2013")
# extract data in chunks of 2 rows, note: dbGetQuery vs. dbSendQuery
# myQuery <- dbSendQuery(myConnection, "SELECT * FROM brfss_2013")
# fetch2 <- dbFetch(myQuery, n = 2); fetch2
# do we have other cases in the DB remaining?
# extract all remaining data
# fetchRemaining <- dbFetch(myQuery, n = -1);fetchRemaining
# We should have all data in DB now
# dbHasCompleted(myQuery)
# compute the average (poorhlth) grouping by Insurance (hlthpln1)
# Try some alternatives: numadult nummen numwomen genhlth physhlth menthlth poorhlth hlthpln1
myQuery1_13 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2013 GROUP BY hlthpln1"); myQuery1_13
## avg(poorhlth)
## 1 56.25466
## 2 53.99962
## 3 58.85072
## 4 66.26757
# Compare 2013 vs. 2015: Health grouping by Insurance
myQuery1_15 <- dbGetQuery(myConnection, "SELECT avg(poorhlth) FROM brfss_2015 GROUP BY hlthpln1"); myQuery1_15
## avg(poorhlth)
## 1 55.75539
## 2 55.49487
## 3 61.35445
## 4 67.62125
myQuery1_13 - myQuery1_15
## avg(poorhlth)
## 1 0.4992652
## 2 -1.4952515
## 3 -2.5037326
## 4 -1.3536797
# reset the DB query
# dbClearResult(myQuery)
# clean up
dbDisconnect(myConnection)
The SparQL Protocol and RDF Query Language (SparQL) is a semantic database query language for RDF (Resource Description Framework) data objects. SparQL queries consist of (1) triple patterns, (2) conjunctions, and (3) disjunctions.
The following example uses SparQL to query the prevalence of tuberculosis from the WikiData SparQL server and plot it on a World geographic map.
# install.packages("SPARQL")
# install.packages("rworldmap")
library(SPARQL)
## Loading required package: XML
## Loading required package: RCurl
## Loading required package: bitops
library(ggplot2)
library(rworldmap)
## Loading required package: sp
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
# SparQL Formal
# https://www.w3.org/2009/Talks/0615-qbe/
# W3C Turtle - Terse RDF Triple Language:
# https://www.w3.org/TeamSubmission/turtle/#sec-examples
# RDF (Resource Description Framework) is a graphical data model of (subject, predicate, object) triples representing:
# "subject-node to predicate arc to object arc"
# Resources are represented with URIs, which can be abbreviated as prefixed names
# Objects are literals: strings, integers, booleans, etc.
# Syntax
# URIs: <http://example.com/resource> or prefix:name
# Literals:
# "plain string",
# "13.4"^^xsd:float, or
# "string with language"@en
# Triple: pref:subject other:predicate "object".
wdqs <- "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
query = "PREFIX wd: <http://www.wikidata.org/entity/>
# prefix declarations
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX qualifier: <http://www.wikidata.org/prop/qualifier/>
PREFIX statement: <http://www.wikidata.org/prop/statement/>
# result clause
SELECT DISTINCT ?countryLabel ?ISO3Code ?latlon ?prevalence ?doid ?year
# query pattern against RDF data
# Q36956 Hansen's disease, Leprosy https://www.wikidata.org/wiki/Q36956
# Q15750965 - Alzheimer's disease: https://www.wikidata.org/wiki/Q15750965
# Influenza - Q2840: https://www.wikidata.org/wiki/Q2840
# Q12204 - tuberculosis https://www.wikidata.org/wiki/Q12204
# P699 Alzheimer's Disease ontology ID
# P1193 prevalence: https://www.wikidata.org/wiki/Property:P1193
# P17 country: https://www.wikidata.org/wiki/Property:P17
# Country ISO-3 code: https://www.wikidata.org/wiki/Property:P298
# Location: https://www.wikidata.org/wiki/Property:P625
# Wikidata docs: https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual
WHERE {
wd:Q12204 wdt:P699 ?doid ; # tuberculosis P699 Disease ontology ID
p:P1193 ?prevalencewithProvenance .
?prevalencewithProvenance qualifier:P17 ?country ;
qualifier:P585 ?year ;
statement:P1193 ?prevalence .
?country wdt:P625 ?latlon ;
rdfs:label ?countryLabel ;
wdt:P298 ?ISO3Code ;
wdt:P297 ?ISOCode .
FILTER (lang(?countryLabel) = \"en\")
# FILTER constraints use boolean conditions to filter out unwanted query results.
# Shortcut: a semicolon (;) can be used to separate two triple patterns that share the same disease (?country is the shared subject above.)
# rdfs:label is a common predicate for giving a human-friendly label to a resource.
}
# query modifiers
ORDER BY DESC(?population)
"
results <- SPARQL(wdqs, query)
resultMatrix <- as.matrix(results$results)
View(resultMatrix)
# join the data to the geo map
sPDF <- joinCountryData2Map(results$results, joinCode = "ISO3", nameJoinColumn = "ISO3Code")
## 7 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 236 codes from the map weren't represented in your data
#map the data with no legend
mapParams <- mapCountryData( sPDF
, nameColumnToPlot="prevalence"
# Alternatively , nameColumnToPlot="doid"
, addLegend='FALSE'
)
#add a modified legend using the same initial parameters as mapCountryData
do.call( addMapLegend, c( mapParams
, legendLabels="all"
, legendWidth=0.5
))
text(1, 1, "Partial view of Tuberculosis Prevelance in the World", cex=1)
#do.call( addMapLegendBoxes
# , c(mapParams
# , list(
# legendText=c('Chile', 'US','Brazil','Argentina'),
# x='bottom',title="AD Prevelance",horiz=TRUE)))
# Alternatively: mapCountryData(sPDF, nameColumnToPlot="prevalence", oceanCol="darkblue", missingCountryCol="white")
View(getMap())
# write.csv(file = "C:/Users/Map.csv", getMap())
We are already familiar with (pseudo) random number generation (e.g., rnorm(100, 10, 4) or runif(100, 10, 20)), which algorithmically generates computer values subject to specified distributions. There are also web services, e.g., random.org, that can provide true random numbers based on atmospheric noise, rather than using a pseudo random number generation protocol. Below is one example of generating a total of 300 random integers between 100 and 200 (in decimal format), arranged in 3 columns of 100 rows each.
siteURL <- "http://random.org/integers/" # base URL
shortQuery<-"num=300&min=100&max=200&col=3&base=10&format=plain&rnd=new"
completeQuery <- paste(siteURL, shortQuery, sep="?") # concat url and submit query string
rngNumbers <- read.table(file=completeQuery) # and read the data
rngNumbers
## V1 V2 V3
## 1 172 130 170
## 2 163 147 194
## 3 109 129 125
## 4 190 177 166
## 5 122 188 179
## 6 123 142 189
## 7 174 194 150
## 8 154 173 129
## 9 128 195 130
## 10 130 196 174
## 11 165 101 118
## 12 190 136 102
## 13 139 165 151
## 14 117 180 150
## 15 135 168 113
## 16 169 169 124
## 17 152 147 100
## 18 115 183 119
## 19 148 131 159
## 20 117 177 131
## 21 118 186 193
## 22 198 120 173
## 23 147 137 151
## 24 163 102 101
## 25 173 108 194
## 26 124 168 116
## 27 100 109 132
## 28 165 137 104
## 29 161 149 158
## 30 124 176 170
## 31 127 167 144
## 32 104 121 165
## 33 178 168 130
## 34 176 113 152
## 35 120 191 134
## 36 121 189 129
## 37 198 117 186
## 38 191 178 198
## 39 176 163 188
## 40 162 139 119
## 41 163 106 144
## 42 164 132 180
## 43 119 183 118
## 44 119 138 191
## 45 102 123 192
## 46 125 107 156
## 47 123 134 128
## 48 177 147 115
## 49 138 174 162
## 50 176 186 153
## 51 198 123 160
## 52 149 169 106
## 53 117 109 104
## 54 134 106 185
## 55 196 178 135
## 56 164 110 161
## 57 149 129 119
## 58 155 189 180
## 59 196 118 110
## 60 137 131 171
## 61 164 120 138
## 62 118 167 178
## 63 108 197 146
## 64 186 126 199
## 65 133 135 171
## 66 184 180 173
## 67 130 130 152
## 68 103 107 172
## 69 190 143 118
## 70 146 126 133
## 71 197 131 165
## 72 146 130 181
## 73 139 167 148
## 74 142 171 199
## 75 198 165 152
## 76 116 105 107
## 77 188 188 128
## 78 131 114 159
## 79 171 113 159
## 80 125 137 133
## 81 174 107 157
## 82 102 153 108
## 83 138 145 186
## 84 147 133 160
## 85 105 200 188
## 86 127 141 190
## 87 148 111 161
## 88 190 161 153
## 89 185 113 145
## 90 170 120 136
## 91 183 198 126
## 92 191 151 185
## 93 137 179 187
## 94 129 137 138
## 95 191 106 111
## 96 121 199 101
## 97 102 197 138
## 98 181 133 104
## 99 123 160 162
## 100 120 200 173
The RCurl package provides an amazing tool for extracting and scraping information from websites. Let’s install it and extract information from a SOCR website.
# install.packages("RCurl")
library(RCurl)
web<-getURL("http://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)
str(web, nchar.max = 200)
## chr "<!DOCTYPE html>\n<html lang=\"en\" dir=\"ltr\" class=\"client-nojs\">\n<head>\n<meta charset=\"UTF-8\" />\n<title>SOCR Data - SOCR</title>\n<meta http-equiv=\"X-UA-Compatible\" conten"| __truncated__
The web object looks incomprehensible. This is because most websites are wrapped in XML/HTML hypertext or include JSON formatted meta-data. RCurl deals with special HTML tags and website meta-data.
To deal with the web pages only, the httr package would be a better choice than RCurl. It returns a list that makes much more sense.
#install.packages("httr")
library(httr)
web<-GET("http://wiki.socr.umich.edu/index.php/SOCR_Data")
str(web[1:3])
## List of 3
## $ url : chr "http://wiki.socr.umich.edu/index.php/SOCR_Data"
## $ status_code: int 200
## $ headers :List of 12
## ..$ date : chr "Sun, 17 Sep 2017 00:10:42 GMT"
## ..$ server : chr "Apache/2.2.15 (Red Hat)"
## ..$ x-powered-by : chr "PHP/5.3.3"
## ..$ x-content-type-options: chr "nosniff"
## ..$ content-language : chr "en"
## ..$ vary : chr "Accept-Encoding,Cookie"
## ..$ expires : chr "Thu, 01 Jan 1970 00:00:00 GMT"
## ..$ cache-control : chr "private, must-revalidate, max-age=0"
## ..$ last-modified : chr "Sat, 22 Oct 2016 21:46:21 GMT"
## ..$ connection : chr "close"
## ..$ transfer-encoding : chr "chunked"
## ..$ content-type : chr "text/html; charset=UTF-8"
## ..- attr(*, "class")= chr [1:2] "insensitive" "list"
A combination of the RCurl and XML packages can help us extract only the plain text from our desired webpages. This is very helpful for getting information from text-heavy websites.
web<-getURL("http://wiki.socr.umich.edu/index.php/SOCR_Data", followlocation = TRUE)
#install.packages("XML")
library(XML)
web.parsed<-htmlParse(web, asText = T)
plain.text<-xpathSApply(web.parsed, "//p", xmlValue)
cat(paste(plain.text, collapse = "\n"))
## The links below contain a number of datasets that may be used for demonstration purposes in probability and statistics education. There are two types of data - simulated (computer-generated using random sampling) and observed (research, observationally or experimentally acquired).
##
## The SOCR resources provide a number of mechanisms to simulate data using computer random-number generators. Here are some of the most commonly used SOCR generators of simulated data:
##
## The following collections include a number of real observed datasets from different disciplines, acquired using different techniques and applicable in different situations.
##
## In addition to human interactions with the SOCR Data, we provide several machine interfaces to consume and process these data.
##
## Translate this page:
##
## (default)
##
## Deutsch
##
## Español
##
## Français
##
## Italiano
##
## Português
##
## <U+65E5><U+672C><U+8A9E>
##
## <U+0411><U+044A><U+043B><U+0433><U+0430><U+0440><U+0438><U+044F>
##
## <U+0627><U+0644><U+0627><U+0645><U+0627><U+0631><U+0627><U+062A> <U+0627><U+0644><U+0639><U+0631><U+0628><U+064A><U+0629> <U+0627><U+0644><U+0645><U+062A><U+062D><U+062F><U+0629>
##
## Suomi
##
## <U+0907><U+0938> <U+092D><U+093E><U+0937><U+093E> <U+092E><U+0947><U+0902>
##
## Norge
##
## <U+D55C><U+AD6D><U+C5B4>
##
## <U+4E2D><U+6587>
##
## <U+7E41><U+4F53><U+4E2D><U+6587>
##
## <U+0420><U+0443><U+0441><U+0441><U+043A><U+0438><U+0439>
##
## Nederlands
##
## <U+0395><U+03BB><U+03BB><U+03B7><U+03BD><U+03B9><U+03BA><U+03AC>
##
## Hrvatska
##
## Ceská republika
##
## Danmark
##
## Polska
##
## România
##
## Sverige
Here we extracted all plain text between the starting and ending paragraph HTML tags, <p> and </p>.
More information about extracting text from XML/HTML via XPath is available here.
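Other XPath selectors can be passed to xpathSApply() in the same way; for instance, a minimal sketch (reusing the web.parsed object from above) that pulls the section headings instead of the paragraphs:
# extract the text of all second-level headings on the page
headings <- xpathSApply(web.parsed, "//h2", xmlValue)
head(headings)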
The process of extracting data from complete web pages and storing it in a structured data format is called scraping. However, before starting to scrape data from a website, we need to understand its underlying HTML structure. Also, we have to check the terms of use of that website to make sure that scraping is allowed.
The R package rvest is a very good place to start “harvesting” data from websites.
To start with, we use read_html() to store the SOCR data website into an xmlnode object.
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
##
## xml
## The following object is masked from 'package:Hmisc':
##
## html
SOCR<-read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data")
SOCR
## {xml_document}
## <html lang="en" dir="ltr" class="client-nojs">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="mediawiki ltr sitedir-ltr ns-0 ns-subject page-SOCR_Dat ...
From the summary structure of SOCR, we can discover that there are two important hypertext section markups, <head> and <body>. Also, notice that the SOCR data website uses <title> and </title> tags to delimit the title within the <head> section. Let’s use html_node() to extract the title information based on this knowledge.
SOCR %>% html_node("head title") %>% html_text()
## [1] "SOCR Data - SOCR"
Here we used the %>% operator, or pipe, to connect two functions. The above line of code creates a chain of functions to operate on the SOCR object. The first function in the chain, html_node(), extracts the title from the head section. Then, html_text() translates the HTML formatted hypertext into plain text. More on R piping can be found in the documentation of the magrittr package.
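For comparison, the same extraction without the pipe has to be read inside-out, with the calls nested:
# equivalent to SOCR %>% html_node("head title") %>% html_text()
html_text(html_node(SOCR, "head title"))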
Another function, rvest::html_nodes(), can be very helpful in scraping. Similar to html_node(), html_nodes() can help us extract multiple nodes from an xmlnode object. Assume that we want to obtain the meta elements (usually the page description, keywords, author of the document, last modified date, and other metadata) from the SOCR data website. We apply html_nodes() to the SOCR object for lines starting with <meta in the <head> section. Optionally, we can use html_attrs() (which extracts attributes, text and tag names from HTML) to make the output prettier.
meta<-SOCR %>% html_nodes("head meta") %>% html_attrs()
meta
## [[1]]
## http-equiv content
## "Content-Type" "text/html; charset=UTF-8"
##
## [[2]]
## charset
## "UTF-8"
##
## [[3]]
## http-equiv content
## "X-UA-Compatible" "IE=EDGE"
##
## [[4]]
## name content
## "generator" "MediaWiki 1.23.1"
##
## [[5]]
## name content
## "ResourceLoaderDynamicStyles" ""
Application Programming Interfaces (APIs) allow web-accessible functions to communicate with each other. Today most APIs exchange data in JSON (JavaScript Object Notation) format.
JSON is a plain text format used for web applications, data structures and objects. Online JSON objects can be retrieved by packages like RCurl and httr. Let’s see a JSON formatted dataset first. We can use 02_Nof1_Data.json in the class files as an example.
library(httr)
nof1<-GET("https://umich.instructure.com/files/1760327/download?download_frd=1")
nof1
## Response [https://instructure-uploads.s3.amazonaws.com/account_17700000000000001/attachments/1760327/02_Nof1_Data.json?response-content-disposition=attachment%3B%20filename%3D%2202_Nof1_Data.json%22%3B%20filename%2A%3DUTF-8%27%2702%255FNof1%255FData.json&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJFNFXH2V2O7RPCAA%2F20170917%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20170917T001047Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=dc68b4f7c48b732db89cf185356e1228ba2c226b2dd63e04a0f25aa77693d5e0]
## Date: 2017-09-17 00:10
## Status: 200
## Content-Type: application/json
## Size: 109 kB
## [{"ID":1,"Day":1,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":0.97,"SocSuppt...
## {"ID":1,"Day":2,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":-0.17,"SocSuppt...
## {"ID":1,"Day":3,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS":0.81,"SocSuppt"...
## {"ID":1,"Day":4,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS":-0.41,"SocSuppt...
## {"ID":1,"Day":5,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":0.59,"SocSuppt"...
## {"ID":1,"Day":6,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":-1.16,"SocSuppt...
## {"ID":1,"Day":7,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS":0.30,"SocSuppt"...
## {"ID":1,"Day":8,"Tx":0,"SelfEff":33,"SelfEff25":8,"WPSS":-0.34,"SocSuppt...
## {"ID":1,"Day":9,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":-0.74,"SocSuppt...
## {"ID":1,"Day":10,"Tx":1,"SelfEff":33,"SelfEff25":8,"WPSS":-0.38,"SocSupp...
## ...
We can see that JSON objects are very simple. The data structure is organized using hierarchies marked by square brackets. Each piece of information is formatted as a {key:value} pair.
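To turn the raw httr response into an R object, we can parse its text body; a minimal sketch (reusing the nof1 response from above):
library(jsonlite)
# extract the body of the response as a JSON string and convert it to a data frame
nof1_df <- fromJSON(content(nof1, as = "text"))
str(nof1_df, max.level = 1)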
The package jsonlite can also import online JSON formatted datasets directly into a data frame from a URL. Its syntax is very straightforward.
#install.packages("jsonlite")
library(jsonlite)
nof1_lite<-fromJSON("https://umich.instructure.com/files/1760327/download?download_frd=1")
class(nof1_lite)
## [1] "data.frame"
We can convert an xlsx dataset into CSV and use read.csv() to load this kind of dataset. However, R provides an alternative, the read.xlsx() function in the package xlsx, to simplify this process. Take our 02_Nof1_Data.xls data in the class files as an example. We need to download the file first.
# install.packages("xlsx")
library(xlsx)
## Loading required package: rJava
##
## Attaching package: 'rJava'
## The following object is masked from 'package:RCurl':
##
## clone
## Loading required package: xlsxjars
nof1<-read.xlsx("C:/Users/Dinov/Desktop/02_Nof1.xlsx", 1)
str(nof1)
## 'data.frame': 900 obs. of 10 variables:
## $ ID : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Tx : num 1 1 0 0 1 1 0 0 1 1 ...
## $ SelfEff : num 33 33 33 33 33 33 33 33 33 33 ...
## $ SelfEff25: num 8 8 8 8 8 8 8 8 8 8 ...
## $ WPSS : num 0.97 -0.17 0.81 -0.41 0.59 -1.16 0.3 -0.34 -0.74 -0.38 ...
## $ SocSuppt : num 5 3.87 4.84 3.62 4.62 2.87 4.33 3.69 3.29 3.66 ...
## $ PMss : num 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 4.03 ...
## $ PMss3 : num 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 1.03 ...
## $ PhyAct : num 53 73 23 36 21 0 21 0 73 114 ...
The last argument, 1, stands for the first Excel sheet, as any Excel file may include a large number of sheets. Also, we can download the xls or xlsx file into our R working directory so that it is easier to specify the file path.
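The sheet can also be selected by name instead of by position; a minimal sketch (assuming the sheet is named "Sheet1", which may differ in your file):
# nof1 <- read.xlsx("C:/Users/Dinov/Desktop/02_Nof1.xlsx", sheetName = "Sheet1")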
Powerful machine learning methods have already been applied in many fields. Some types of data are very specialized and require unique approaches to address their characteristics.
Genetic data are stored in widely varying formats and usually have more feature variables than observations; they could have 1,000 columns and only 200 rows. One commonly used pre-processing step for such datasets is variable selection, which we will discuss in Chapter 16.
The Bioconductor project created powerful R functionality (packages and tools) for analyzing genomic data, see Bioconductor for more detailed information.
Social network data and graph datasets describe the relations between nodes (vertices) using connections (links or edges) joining the node objects. If we have N objects, we can have up to \(N(N-1)\) directed links establishing paired associations between the nodes. Let’s use an example with N=4 to demonstrate a simple graph potentially modeling the following linkage table.
objects | 1 | 2 | 3 | 4 |
---|---|---|---|---|
1 | ….. | \(1\rightarrow 2\) | \(1\rightarrow 3\) | \(1\rightarrow 4\) |
2 | \(2\rightarrow 1\) | ….. | \(2\rightarrow 3\) | \(2\rightarrow 4\) |
3 | \(3\rightarrow 1\) | \(3\rightarrow 2\) | ….. | \(3\rightarrow 4\) |
4 | \(4\rightarrow 1\) | \(4\rightarrow 2\) | \(4\rightarrow 3\) | ….. |
If we change the \(a\rightarrow b\) to an indicator variable (0 or 1) capturing whether we have an edge connecting a pair of nodes, then we get the graph adjacency matrix.
Edge lists provide an alternative way to represent network connections. Every line in the list contains a connection between two nodes (objects).
Vertex | Vertex |
---|---|
1 | 2 |
1 | 3 |
2 | 3 |
The above edge list is listing three network connections: object 1 is linked to object 2; object 1 is linked to object 3; and object 2 is linked to object 3. Note that edge lists can represent both directed as well as undirected networks or graphs.
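A minimal sketch illustrating the correspondence between the two representations, converting the three-row edge list above into its adjacency matrix (treating the edges as undirected):
edge_list <- rbind(c(1, 2), c(1, 3), c(2, 3))
A <- matrix(0, nrow = 3, ncol = 3)
A[edge_list] <- 1      # mark each listed connection
A <- A + t(A)          # symmetrize, since the connections are undirected
A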
We can imagine that if N is very large, e.g., in social networks, the data representation and analysis may be resource intense (memory or computation). In R, we have multiple packages that can deal with social network data. One user-friendly example is provided by the igraph package. First, let’s build a toy example and visualize it using this package.
#install.packages("igraph")
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
g<-graph(c(1, 2, 1, 3, 2, 3, 3, 4), n=10)
plot(g)
Here c(1, 2, 1, 3, 2, 3, 3, 4) is an edge list specifying 4 edges, and n=10 means we have 10 nodes (objects) in total. The small arrows in the graph show the directed network connections. Notice that nodes 5-10 are scattered around the graph; this is because they are not included in the edge list, so there are no network connections between them and the rest of the network.
Now let’s examine the co-appearance network of Facebook circles. The data contains anonymized circles (friends lists) from Facebook, collected from survey participants using a Facebook app. The dataset only includes edges (circles, 88,234) connecting pairs of nodes (users, 4,039) in the ego networks.
The values on the connections represent the number of links/edges within a circle. We have a huge edge list made of scrambled Facebook user IDs. Let’s load this dataset into R first. The data is stored in a text file. Unlike CSV files, text files in table format need to be imported using read.table(). We use the header=F option to let R know that there is no header in the text file, which contains only the space-separated node pairs (indicating the social connections, edges, between Facebook users).
soc.net.data<-read.table("https://umich.instructure.com/files/2854431/download?download_frd=1", sep=" ", header=F)
head(soc.net.data)
## V1 V2
## 1 0 1
## 2 0 2
## 3 0 3
## 4 0 4
## 5 0 5
## 6 0 6
Now the data is stored in a data frame. To make this dataset ready for igraph processing and visualization, we need to convert soc.net.data into a matrix object.
soc.net.data.mat <- as.matrix(soc.net.data, ncol=2)
The as.matrix() call converts the two-column data frame into a matrix (the ncol=2 argument is redundant here, since the data frame already has two columns). The data is now ready and we can apply graph.edgelist().
# remove the first 347 edges (to wipe out the degenerate "0" node)
graph_m<-graph.edgelist(soc.net.data.mat[-c(0:347), ], directed = F)
Before we display the social network graph we may want to examine our model first.
summary(graph_m)
## IGRAPH aacbcc1 U--- 4038 87887 --
This is an extremely brief yet informative summary. The first line, U--- 4038 87887, includes up to four letters and two numbers. The first letter is either U or D, indicating undirected or directed edges. A second letter N would mean that the vertex set has a “name” attribute. A third letter W indicates a weighted graph; since we didn’t add weights in our analysis, the third position is empty (“-”). A fourth letter B indicates a bipartite graph, whose vertices can be divided into two disjoint (independent) sets such that each edge connects a vertex from one set to a vertex in the other set. The two numbers following the four letters represent the number of nodes and the number of edges, respectively. Now let’s render the graph.
plot(graph_m)
This graph is very complicated. We can still see that some nodes are connected to many more neighbors than others. To obtain such information we can use the degree() function, which lists the number of edges for each node.
degree(graph_m)
Skimming the output, we can find that the 107-th user has as many as 1,044 connections, which makes this user a highly-connected hub. This node likely has higher social relevance.
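The hub can also be located programmatically:
which.max(degree(graph_m))   # index of the most connected node
max(degree(graph_m))         # its number of edges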
Some edges might be more important than others because they serve as bridges linking clouds of nodes. To compare their importance, we can use the betweenness centrality measure. Betweenness centrality quantifies the centrality of nodes in a network; high centrality for a specific node indicates influence. The betweenness() function can help us calculate this measure.
betweenness(graph_m)
Again, the 107-th node has the highest betweenness centrality (\(3.556221\times 10^{6}\)).
We can try another example using the SOCR hierarchical data, which is also available for dynamic exploration as a tree graph. Let’s read its JSON data source using the jsonlite package.
tree.json<-fromJSON("http://socr.ucla.edu/SOCR_HyperTree.json", simplifyDataFrame = FALSE)
This generates a list object representing the hierarchical structure of the network. Note that this is quite different from an edge list. There is one root node, its sub-nodes are called children nodes, and the terminal nodes are called leaf nodes. Instead of presenting the relationships between nodes in pairs, this hierarchical structure captures the level of each node. To draw the social network graph, we need to convert it into a Node object. We can utilize the as.Node() function in the data.tree package to do so.
# install.packages("data.tree")
library(data.tree)
tree.graph<-as.Node(tree.json, mode = "explicit")
Here we use the mode="explicit" option to allow “children” nodes to have their own “children” nodes. Now, the tree.json object has been separated into four different node structures - "About SOCR", "SOCR Resources", "Get Started", and "SOCR Wiki". Let’s plot the first one using the igraph package.
plot(as.igraph(tree.graph$`About SOCR`), edge.arrow.size=5, edge.label.font=0.05)
In this graph, "About SOCR"
that located at the center is the root.
The proliferation of Cloud services and the emergence of modern technology in all aspects of human experience lead to a tsunami of data, much of which is streamed in real time. The interrogation of such voluminous data is an increasingly important area of research. Data streams are ordered, often unbounded sequences of data points created continuously by a data generator. All of the data mining, interrogation and forecasting methods we discuss here are also applicable to data streams.
Mathematically, a data stream is an ordered sequence of data points \[Y = \{y_1, y_2, y_3, \cdots, y_t, \cdots \},\] where the (time) index, \(t\), reflects the order of the observation/record, and each point may be a single number, a simple vector in multidimensional space, or an object, e.g., the structured Ann Arbor Weather (JSON) and its corresponding structured form. Some data is streamed because it’s too large to be downloaded shotgun style and some because it’s continually generated and serviced. This presents the potential problem of dealing with data streams that may be unlimited.
Notes:
- Data streams can be simulated in R, e.g., using the package rstream. Real stream data may be piped from financial data providers, the WHO, World Bank, NCAR and other sources.
- Other R packages also provide stream-processing functionality, e.g., factas for PCA, and rEMM and birch for clustering. Clustering and classification methods capable of processing data streams have been developed, e.g., Very Fast Decision Trees (VFDT), time window-based Online Information Network (OLIN), On-demand Classification, and the APRIORI streaming algorithm.
The R stream package provides data stream mining algorithms using the fpc, clue, cluster, clusterGeneration, MASS, and proxy packages. In addition, the package streamMOA provides an rJava interface to the Java-based data stream clustering algorithms available in the Massive Online Analysis (MOA) framework for stream classification, regression and clustering.
If you need a deeper exposure to data streaming in R, we recommend you go over the stream vignettes.
This example shows the creation and loading of a mixture of 5 random 2D Gaussians, centered at the coordinates (x_coords, y_coords) and sampled with mixing weights p_weight, representing a simulated data stream.
# install.packages("stream")
library("stream")
## Loading required package: proxy
##
## Attaching package: 'proxy'
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
##
## Attaching package: 'stream'
## The following object is masked from 'package:httr':
##
## write_stream
x_coords <- c(0.2,0.3, 0.5, 0.8, 0.9)
y_coords <- c(0.8,0.3, 0.7, 0.1, 0.5)
p_weight <- c(0.1, 0.9, 0.5, 0.4, 0.3) # a vector of probabilities that determines the likelihood of generating a data point from each particular cluster
set.seed(12345)
stream_5G <- DSD_Gaussians(k = 5, d = 2, mu=cbind(x_coords, y_coords), p=p_weight)
We will now try k-means and the density-based data stream clustering algorithm D-Stream, where micro-clusters are formed by grid cells of size gridsize and the density of a grid cell (Cm) must be at least 1.2 times the average cell density. The model is updated with the next 500 data points from the stream.
dstream <- DSC_DStream(gridsize = .1, Cm = 1.2)
update(dstream, stream_5G, n = 500)
First, let’s run the k-means clustering with \(k=5\) clusters and plot the resulting micro- and macro-clusters.
kmc <- DSC_Kmeans(k = 5)
recluster(kmc, dstream)
plot(kmc, stream_5G, type = "both", xlab="X-axis", ylab="Y-axis")
In this clustering plot, micro-clusters are shown as circles and macro-clusters as crosses, with sizes representing the corresponding cluster weight estimates.
Next, try the density-based data stream clustering algorithm D-Stream. Prior to updating the model with the next 1,000 data points from the stream, we specify the grid cell size (gridsize=0.1), so that grid cells act as micro-clusters, and the threshold Cm=1.2, which specifies the minimum density of a grid cell as a multiple of the average cell density.
dstream <- DSC_DStream(gridsize = 0.1, Cm = 1.2)
update(dstream, stream_5G, n=1000)
We can re-cluster the data using k-means with 5 clusters and plot the resulting micro and macro clusters.
km_G5 <- DSC_Kmeans(k = 5)
recluster(km_G5, dstream)
plot(km_G5, stream_5G, type = "both")
Note the subtle changes in the clustering results between kmc and km_G5.
Data streams can also be constructed from other sources, e.g., static benchmark datasets available in the mlbench package, or larger-than-memory data stored on disk, e.g., as ff::ffdf objects.
objects, some basic stream functions include print()
, plot()
and write_stream()
. These can save part of a data stream to disk. DSD_Memory
and DSD_ReadCSV
objects also include member functions like reset_stream()
to reset the position in the stream to its beginning.
to request a new batch of data points from the stream we use get_points()
. This chooses a random cluster (based on the probability weights in p\_weight
) and a point is drawn from the multivariate Gaussian distribution (\(mean=mu, covariance\ matrix=\Sigma\)) of that cluster. Below, we pull \(n = 10\) new data points from the stream.
new_p <- get_points(stream_5G, n = 10)
new_p
## X1 X2
## 1 0.4017803 0.2999017
## 2 0.4606262 0.5797737
## 3 0.4611642 0.6617809
## 4 0.3369141 0.2840991
## 5 0.8928082 0.5687830
## 6 0.8706420 0.4282589
## 7 0.2539396 0.2783683
## 8 0.5594320 0.7019670
## 9 0.5030676 0.7560124
## 10 0.7930719 0.0937701
new_p <- get_points(stream_5G, n = 100, class = TRUE)
head(new_p, n = 20)
## X1 X2 class
## 1 0.7915730 0.09533001 4
## 2 0.4305147 0.36953997 2
## 3 0.4914093 0.82120395 3
## 4 0.7837102 0.06771246 4
## 5 0.9233074 0.48164544 5
## 6 0.8606862 0.49399269 5
## 7 0.3191884 0.27607324 2
## 8 0.2528981 0.27596700 2
## 9 0.6627604 0.68988585 3
## 10 0.7902887 0.09402659 4
## 11 0.7926677 0.09030248 4
## 12 0.9393515 0.50259344 5
## 13 0.9333770 0.62817482 5
## 14 0.7906710 0.10125432 4
## 15 0.1798662 0.24967850 2
## 16 0.7985790 0.08324688 4
## 17 0.5247573 0.57527380 3
## 18 0.2358468 0.23087585 2
## 19 0.8818853 0.49668824 5
## 20 0.4255094 0.81789418 3
plot(stream_5G, n = 700, method = "pc")
Note that if you add noise to your stream, e.g., stream_Noise <- DSD_Gaussians(k = 5, d = 4, noise = .1, p = c(0.1, 0.5, 0.3, 0.9, 0.1)), then the noise points won’t be part of any cluster and will have an NA class label.
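A minimal sketch verifying this behavior (using the noise parameters from the note above; the exact class counts will vary from run to run):
stream_Noise <- DSD_Gaussians(k = 5, d = 4, noise = .1, p = c(0.1, 0.5, 0.3, 0.9, 0.1))
noisy_points <- get_points(stream_Noise, n = 500, class = TRUE)
table(noisy_points$class, useNA = "ifany")   # points drawn from the noise component show up under NA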
Clusters can be animated over time by animate_data(). Use reset_stream() to start the animation at the beginning of the stream, and note that this method is not implemented for streams of class DSD_Gaussians, DSD_R, DSD_data.frame, and DSD. We’ll create a new DSD_Benchmark data stream.
set.seed(12345)
stream_Bench <- DSD_Benchmark(1)
stream_Bench
## Benchmark 1: Two clusters moving diagonally from left to right, meeting in
## the center (5% noise).
## Class: DSD_MG, DSD_R, DSD_data.frame, DSD
## With 2 clusters in 2 dimensions. Time is 1
library("animation")
reset_stream(stream_Bench)
animate_data(stream_Bench, n=10000, horizon=100, xlim = c(0, 1), ylim = c(0, 1))
This benchmark generator creates two clusters moving in 2D. One moves from top-left to bottom-right, the other from bottom-left to top-right. They meet at the center of the domain, where the two clusters overlap and then split again.
Concept drift in the stream can be depicted by requesting \(10\) times \(300\) data points from the stream and animating the plot. Fast-forwarding the stream can be accomplished by requesting, but ignoring, \(2000\) points in between the \(10\) plots.
for(i in 1:10) {
plot(stream_Bench, 300, xlim = c(0, 1), ylim = c(0, 1))
tmp <- get_points(stream_Bench, n = 2000)
}
reset_stream(stream_Bench)
animate_data(stream_Bench, n=8000, horizon = 120, xlim=c(0, 1), ylim=c(0, 1))
# Animations can be saved as HTML or GIF
#saveHTML(ani.replay(), htmlfile = "stream_Bench_Animation.html")
#saveGIF(ani.replay())
Streams can also be saved locally by write_stream(stream_Bench, "dataStreamSaved.csv", n = 100, sep=",") and loaded back into R by DSD_ReadCSV().
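A minimal sketch of this round trip (the file name is arbitrary):
write_stream(stream_Bench, "dataStreamSaved.csv", n = 100, sep=",")
stream_File <- DSD_ReadCSV("dataStreamSaved.csv", sep=",")
get_points(stream_File, n = 5)
close_stream(stream_File)   # release the underlying file connection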
These data represent the \(X\) and \(Y\) spatial knee-pain locations for over \(8,000\) patients, along with labels indicating the knee location: \(F\)ront, \(B\)ack, \(L\)eft and \(R\)ight. Let’s try to read the SOCR Knee Pain Dataset as a stream.
library("XML"); library("xml2"); library("rvest")
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_KneePainData_041409")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top ...
kneeRawData <- html_table(html_nodes(wiki_url, "table")[[2]])
normalize<-function(x){
return((x-min(x))/(max(x)-min(x)))
}
kneeRawData_df <- as.data.frame(cbind(normalize(kneeRawData$x), normalize(kneeRawData$Y), as.factor(kneeRawData$View)))
colnames(kneeRawData_df) <- c("X", "Y", "Label")
# randomize the rows of the DF as RF, RB, LF and LB labels of classes are sequential
set.seed(1234)
kneeRawData_df <- kneeRawData_df[sample(nrow(kneeRawData_df)), ]
summary(kneeRawData_df)
## X Y Label
## Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:0.1331 1st Qu.:0.4566 1st Qu.:2.000
## Median :0.2995 Median :0.5087 Median :3.000
## Mean :0.3382 Mean :0.5091 Mean :2.801
## 3rd Qu.:0.3645 3rd Qu.:0.5549 3rd Qu.:4.000
## Max. :1.0000 Max. :1.0000 Max. :4.000
# View(kneeRawData_df)
We can use the DSD_Memory class from the stream package to get a stream interface for matrix or data frame objects, like the knee pain location dataset. The number of true clusters in this dataset is \(k=4\).
# use data.frame to create a stream (3rd column contains the label assignment)
kneeDF <- data.frame(x=kneeRawData_df[,1], y=kneeRawData_df[,2],
class=as.factor(kneeRawData_df[,3]))
head(kneeDF)
## x y class
## 1 0.1188590 0.5057803 4
## 2 0.3248811 0.6040462 2
## 3 0.3153724 0.4971098 2
## 4 0.3248811 0.4161850 2
## 5 0.6941363 0.5289017 1
## 6 0.3217116 0.4595376 2
streamKnee <- DSD_Memory(kneeDF[,c("x", "y")], class=kneeDF[,"class"], loop=T)
streamKnee
## Memory Stream Interface
## Class: DSD_Memory, DSD_R, DSD_data.frame, DSD
## With NA clusters in 2 dimensions
## Contains 8666 data points - currently at position 1 - loop is TRUE
# Each time we get a point from *streamKnee*, the stream pointer moves to the next position (row) in the data.
get_points(streamKnee, n=10)
## x y
## 1 0.11885895 0.5057803
## 2 0.32488114 0.6040462
## 3 0.31537242 0.4971098
## 4 0.32488114 0.4161850
## 5 0.69413629 0.5289017
## 6 0.32171157 0.4595376
## 7 0.06497623 0.4913295
## 8 0.12519810 0.4682081
## 9 0.32329635 0.4942197
## 10 0.30744849 0.5086705
streamKnee
## Memory Stream Interface
## Class: DSD_Memory, DSD_R, DSD_data.frame, DSD
## With NA clusters in 2 dimensions
## Contains 8666 data points - currently at position 11 - loop is TRUE
# Stream pointer is in position 11 now
# We can redirect the current position of the stream pointer by:
reset_stream(streamKnee, pos = 200)
get_points(streamKnee, n=10)
## x y
## 200 0.9413629 0.5606936
## 201 0.3217116 0.5664740
## 202 0.3122029 0.6416185
## 203 0.1553090 0.6040462
## 204 0.3645008 0.5346821
## 205 0.3122029 0.5000000
## 206 0.3549921 0.5404624
## 207 0.1473851 0.5260116
## 208 0.1870048 0.6329480
## 209 0.1220285 0.4132948
streamKnee
## Memory Stream Interface
## Class: DSD_Memory, DSD_R, DSD_data.frame, DSD
## With NA clusters in 2 dimensions
## Contains 8666 data points - currently at position 210 - loop is TRUE
Let’s demonstrate clustering using DSC_DStream, which assigns points to cells in a grid. First, initialize an empty clustering, and then use the update() function to implicitly alter the mutable DSC object.
dsc_streamKnee <- DSC_DStream(gridsize = 0.1, Cm = 0.4, attraction=T)
dsc_streamKnee
## DStream
## Class: DSC_DStream, DSC_Micro, DSC_R, DSC
## Number of micro-clusters: 0
## Number of macro-clusters: 0
# stream::update
reset_stream(streamKnee, pos = 1)
update(dsc_streamKnee, streamKnee, n = 500)
dsc_streamKnee
## DStream
## Class: DSC_DStream, DSC_Micro, DSC_R, DSC
## Number of micro-clusters: 16
## Number of macro-clusters: 11
head(get_centers(dsc_streamKnee))
## [,1] [,2]
## [1,] 0.05 0.45
## [2,] 0.05 0.55
## [3,] 0.15 0.35
## [4,] 0.15 0.45
## [5,] 0.15 0.55
## [6,] 0.15 0.65
plot(dsc_streamKnee, streamKnee, xlim=c(0,1), ylim=c(0,1))
# plot(dsc_streamKnee, streamKnee, grid = TRUE)
# Micro-clusters are plotted in red on top of gray stream data points
# The size of the micro-clusters indicates their weight - it's proportional to the number of data points represented by each micro-cluster.
# Micro-clusters are shown as dense grid cells (density is coded with gray values).
The purity metric represents an external evaluation criterion of cluster quality. It is the proportion of the total number of points that were correctly classified: \[0 \leq Purity = \frac{1}{N} \sum_{i=1}^{k} \max_j |c_i \cap t_j| \leq 1,\] where \(N\) is the number of observed data points, \(k\) is the number of clusters, \(c_i\) is the \(i\)-th cluster, and \(t_j\) is the classification that has the maximum number of points with \(c_i\) class labels. High purity suggests that we correctly label points.
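A minimal sketch of this computation for hard cluster assignments (a hand-rolled helper for illustration, not part of the stream package, which reports purity through its built-in evaluation measures):
purity <- function(cluster, truth) {
  tab <- table(cluster, truth)               # k x (number of true classes) contingency table
  sum(apply(tab, 1, max)) / length(truth)    # majority-class counts per cluster, summed, divided by N
}
# example: purity(kmeans(kneeDF[, 1:2], centers = 4)$cluster, kneeDF$class)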
Next, we can use K-means clustering.
kMeans_Knee <- DSC_Kmeans(k = 5) # choose 4-5 clusters as we have 4 knee labels
recluster(kMeans_Knee, dsc_streamKnee)
plot(kMeans_Knee, streamKnee, type = "both")
animate_data(streamKnee, n=1000, horizon=100, xlim = c(0, 1), ylim = c(0, 1))
# purity <- animate_cluster(kMeans_Knee, streamKnee, n=2500, type="both", xlim=c(0,1), ylim=c(0,1), measure="purity", horizon=10)
animate_cluster(kMeans_Knee, streamKnee, horizon = 100, n = 5000, measure = "purity", plot.args = list(xlim = c(0, 1), ylim = c(0, 1)))
## Warning in `[<-.factor`(`*tmp*`, is.na(actual), value = 0L): invalid factor
## level, NA generated
## (the same warning repeats for each evaluation window)
## points purity
## 1 1 0.9600000
## 2 101 0.9043478
## 3 201 0.9500000
## 4 301 0.9500000
## 5 401 0.9333333
## 6 501 0.8982872
## 7 601 0.9916667
## 8 701 0.8570115
## 9 801 0.9037037
## 10 901 0.9047619
## 11 1001 1.0000000
## 12 1101 0.9210526
## 13 1201 0.8130435
## 14 1301 0.8815789
## 15 1401 0.9000000
## 16 1501 0.8596491
## 17 1601 0.9738095
## 18 1701 0.9400000
## 19 1801 0.9428571
## 20 1901 0.9259259
## 21 2001 0.9066667
## 22 2101 0.9230769
## 23 2201 0.9111111
## 24 2301 0.8093651
## 25 2401 0.9090909
## 26 2501 0.9685484
## 27 2601 0.9000000
## 28 2701 0.9111111
## 29 2801 0.9153846
## 30 2901 0.8411765
## 31 3001 0.9052632
## 32 3101 0.9200000
## 33 3201 0.9176471
## 34 3301 0.8945055
## 35 3401 0.9333333
## 36 3501 0.9166667
## 37 3601 0.9263158
## 38 3701 0.9100000
## 39 3801 1.0000000
## 40 3901 0.9100000
## 41 4001 0.9181818
## 42 4101 0.8700000
## 43 4201 0.8928571
## 44 4301 0.9217391
## 45 4401 0.9708333
## 46 4501 0.8977273
## 47 4601 0.9500000
## 48 4701 0.9500000
## 49 4801 0.9047619
## 50 4901 0.8850000
# Synthetic Gaussian example
# stream <- DSD_Gaussians(k = 3, d = 2, noise = .05)
# dstream <- DSC_DStream(gridsize = .1)
# update(dstream, stream, n = 2000)
# evaluate(dstream, stream, n = 100)
evaluate(dsc_streamKnee, streamKnee, measure = c("crand", "SSQ", "silhouette"), n = 100, type = c("auto", "micro", "macro"), assign = "micro", assignmentMethod = c("auto", "model", "nn"), noise = c("class", "exclude"))
## Evaluation results for micro-clusters.
## Points were assigned to micro-clusters.
##
## cRand SSQ silhouette
## 0.3473634 0.3382900 0.1373143
clusterEval <- evaluate_cluster(dsc_streamKnee, streamKnee, measure = c("numMicroClusters", "purity"), n = 5000, horizon = 100)
head(clusterEval)
## points numMicroClusters purity
## 1 0 16 0.9555556
## 2 100 17 0.9733333
## 3 200 18 0.9671053
## 4 300 21 0.9687500
## 5 400 21 0.9880952
## 6 500 22 0.9750000
plot(clusterEval[ , "points"], clusterEval[ , "purity"], type = "l", ylab = "Avg Purity", xlab = "Points")
animate_cluster(dsc_streamKnee, streamKnee, horizon = 100, n = 5000, measure = "purity", plot.args = list(xlim = c(0, 1), ylim = c(0, 1)))
## points purity
## 1 1 0.9714286
## 2 101 0.9833333
## 3 201 0.9722222
## 4 301 0.9699248
## 5 401 0.9722222
## 6 501 0.9642857
## 7 601 0.9652778
## 8 701 0.9736842
## 9 801 0.9761905
## 10 901 0.9727273
## 11 1001 0.9714286
## 12 1101 0.9736842
## 13 1201 0.9833333
## 14 1301 0.9736842
## 15 1401 0.9857143
## 16 1501 0.9684211
## 17 1601 0.9736842
## 18 1701 0.9629630
## 19 1801 0.9833333
## 20 1901 0.9632353
## 21 2001 0.9647059
## 22 2101 0.9888889
## 23 2201 0.9736842
## 24 2301 0.9809524
## 25 2401 0.9855072
## 26 2501 0.9785714
## 27 2601 0.9562500
## 28 2701 0.9809524
## 29 2801 0.9824561
## 30 2901 0.9750000
## 31 3001 0.9750000
## 32 3101 0.9750000
## 33 3201 0.9714286
## 34 3301 1.0000000
## 35 3401 0.9809524
## 36 3501 0.9702381
## 37 3601 1.0000000
## 38 3701 0.9740260
## 39 3801 0.9855072
## 40 3901 0.9761905
## 41 4001 0.9642857
## 42 4101 0.9761905
## 43 4201 0.9722222
## 44 4301 0.9727273
## 45 4401 0.9814815
## 46 4501 0.9609375
## 47 4601 0.9714286
## 48 4701 0.9722222
## 49 4801 0.9772727
## 50 4901 0.9777778
The dsc_streamKnee object is the evaluated clustering, and \(n\) specifies how many data points are taken from streamKnee. The evaluation measure can be specified as a vector of character strings. Points are assigned to clusters in dsc_streamKnee using get_assignment(), which can be used to assess the quality of the classification. By default, points are assigned to micro-clusters; they can instead be assigned to macro-cluster centers via assign = "macro". Also, new points can be assigned to clusters by the rule used in the clustering algorithm (assignmentMethod = "model") or by nearest-neighbor assignment (nn).
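For instance, here is a minimal sketch (using the streamKnee and dsc_streamKnee objects constructed above) that draws a few new points from the stream and assigns them to the existing clusters.
reset_stream(streamKnee, pos = 1000)
newPoints <- get_points(streamKnee, n = 5)     # a small batch of new observations
# assign each new point to its nearest macro-cluster
get_assignment(dsc_streamKnee, newPoints, type = "macro", method = "nn")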
Here and in previous chapters, e.g., Chapter 14, we notice that R may sometimes be slow and memory-inefficient. These problems may be severe, especially for datasets with millions of records or when using complex functions. There are packages for processing large datasets and memory optimization – bigmemory
, biganalytics
, bigtabulate
, etc.
dplyr
We have also seen long execution times when running processes that ingest, store or manipulate huge data.frame
objects. The dplyr
package, created by Hadley Wickham and Romain François, provides a faster way to manage such large datasets in R. It creates an object called tbl, similar to data.frame, which has an in-memory column-like structure. R processes these objects much faster than standard data frames.
To make a tbl
object we can either convert an existing data frame to tbl
or connect to an external database. Converting from data frame to tbl
is quite easy. All we need to do is call the function as.tbl()
.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:igraph':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:Hmisc':
##
## combine, src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
nof1_tbl<-as.tbl(nof1)
nof1_tbl
## # A tibble: 900 x 10
## ID Day Tx SelfEff SelfEff25 WPSS SocSuppt PMss PMss3 PhyAct
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 33 8 0.97 5.00 4.03 1.03 53
## 2 1 2 1 33 8 -0.17 3.87 4.03 1.03 73
## 3 1 3 0 33 8 0.81 4.84 4.03 1.03 23
## 4 1 4 0 33 8 -0.41 3.62 4.03 1.03 36
## 5 1 5 1 33 8 0.59 4.62 4.03 1.03 21
## 6 1 6 1 33 8 -1.16 2.87 4.03 1.03 0
## 7 1 7 0 33 8 0.30 4.33 4.03 1.03 21
## 8 1 8 0 33 8 -0.34 3.69 4.03 1.03 0
## 9 1 9 1 33 8 -0.74 3.29 4.03 1.03 73
## 10 1 10 1 33 8 -0.38 3.66 4.03 1.03 114
## # ... with 890 more rows
This looks like a normal data frame. If you are using RStudio, viewing nof1_tbl shows the same output as viewing nof1.
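Beyond faster handling and printing, dplyr provides a grammar of data-manipulation verbs (filter(), group_by(), summarize(), etc.) that work naturally on tbl objects. Here is a minimal sketch on nof1_tbl (the summary column name is chosen purely for illustration).
nof1_tbl %>%
  filter(Tx == 1) %>%                      # keep only treatment days
  group_by(ID) %>%                         # summarize within each subject
  summarize(meanPhyAct = mean(PhyAct))     # average physical activity per subject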
Similar to tbl
, the data.table
package provides another alternative to the data frame representation. data.table objects are processed much faster in R than standard data frames. Also, all functions that accept a data frame can be applied to data.table objects as well. The function fread() reads a local CSV file directly into a data.table.
#install.packages("data.table")
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
nof1<-fread("C:/Users/Dinov/Desktop/02_Nof1_Data.csv")
Another amazing property of data.table
is that we can use subscripts to access a specific location in the dataset just like dataset[row, column]
. It also allows the selection of rows using Boolean expressions and the direct application of functions to those selected rows. Note that column names can be used directly inside the brackets of a data.table, whereas with data frames we have to use the dataset$columnName syntax.
nof1[ID==1, mean(PhyAct)]
## [1] 52.66667
This useful functionality can also help us run complex operations with only a few lines of code. One of the drawbacks of using data.table
objects is that they are still limited by the available system memory.
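As an illustration of such one-line operations, here is a minimal sketch of grouped aggregation using data.table syntax on the same nof1 object (the result column names are illustrative).
nof1[, .(meanPhyAct = mean(PhyAct), nDays = .N), by = ID]  # per-subject mean activity and number of days
nof1[PhyAct > 50, .N, by = Tx]                             # count high-activity days by treatment group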
ff
The ff
(fast-files) package allows us to overcome the RAM limitations of finite system memory. For example, it can handle datasets with billions of rows. ff creates objects in the ffdf format, which acts like a map pointing to the location of the data on disk. However, this makes ffdf objects incompatible with most R functions. The only way to address this problem is to break the huge dataset into small chunks, process each chunk, and then combine the results to reconstruct the complete output (a chunked-computation sketch appears at the end of this subsection). This strategy is also central to parallel computing, which will be discussed in detail in the next section. First, let’s download one of the large datasets in our datasets archive, UQ_VitalSignsData_Case04.csv.
# install.packages("ff")
library(ff)
## Loading required package: bit
## Attaching package bit
## package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
## creators: bit bitwhich
## coercion: as.logical as.integer as.bit as.bitwhich which
## operator: ! & | xor != ==
## querying: print length any all min max range sum summary
## bit access: length<- [ [<- [[ [[<-
## for more help type ?bit
##
## Attaching package: 'bit'
## The following object is masked from 'package:data.table':
##
## setattr
## The following object is masked from 'package:rJava':
##
## clone
## The following object is masked from 'package:RCurl':
##
## clone
## The following object is masked from 'package:base':
##
## xor
## Attaching package ff
## - getOption("fftempdir")=="C:/Users/Dinov/AppData/Local/Temp/Rtmp08lk9C"
## - getOption("ffextension")=="ff"
## - getOption("ffdrop")==TRUE
## - getOption("fffinonexit")==TRUE
## - getOption("ffpagesize")==65536
## - getOption("ffcaching")=="mmnoflush" -- consider "ffeachflush" if your system stalls on large writes
## - getOption("ffbatchbytes")==168034304 -- consider a different value for tuning your system
## - getOption("ffmaxbytes")==8401715200 -- consider a different value for tuning your system
##
## Attaching package: 'ff'
## The following objects are masked from 'package:bit':
##
## clone, clone.default, clone.list
## The following object is masked from 'package:rJava':
##
## clone
## The following object is masked from 'package:RCurl':
##
## clone
## The following objects are masked from 'package:utils':
##
## write.csv, write.csv2
## The following objects are masked from 'package:base':
##
## is.factor, is.ordered
# vitalsigns<-read.csv.ffdf(file="UQ_VitalSignsData_Case04.csv", header=T)
vitalsigns<-read.csv.ffdf(file="https://umich.instructure.com/files/366335/download?download_frd=1", header=T)
As mentioned earlier, we cannot apply most standard R functions directly to this object.
mean(vitalsigns$Pulse)
## Warning in mean.default(vitalsigns$Pulse): argument is not numeric or
## logical: returning NA
## [1] NA
For basic calculations on such datasets we can install another package, ffbase. It enables operations on ffdf objects, such as mathematical operations, query functions, summary statistics, and larger regression models via packages like biglm, which will be mentioned later in this chapter.
# install.packages("ffbase")
library(ffbase)
##
## Attaching package: 'ffbase'
## The following objects are masked from 'package:ff':
##
## [.ff, [.ffdf, [<-.ff, [<-.ffdf
## The following objects are masked from 'package:base':
##
## %in%, table
mean(vitalsigns$Pulse)
## [1] 108.7185
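To make the chunking strategy concrete, here is a minimal sketch that recomputes the same mean by iterating over RAM-sized batches of the ff vector; it assumes the vitalsigns ffdf loaded above and uses chunk() to split the row index into manageable pieces.
total <- 0; count <- 0
for (idx in chunk(vitalsigns$Pulse)) {    # idx covers a batch of rows that fits in RAM
  batch <- vitalsigns$Pulse[idx]          # only this slice is materialized in memory
  total <- total + sum(batch)
  count <- count + length(batch)
}
total/count                               # should agree with mean(vitalsigns$Pulse) above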
bigmemory
The previously introduced packages include alternatives to data.frames
. For instance, the bigmemory package creates alternatives to 2D matrices (second-order tensors). It can store huge datasets, which can be divided into small chunks and converted to data frames. However, we cannot directly apply machine learning methods to these types of objects. More detailed information about the bigmemory package is available online.
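A minimal sketch of the bigmemory workflow is shown below; the backing-file names are hypothetical and the calls are commented out to avoid writing files unintentionally. read.big.matrix() creates a file-backed matrix that need not fit in RAM, and ordinary subscripting pulls a small chunk back as a regular R matrix.
# install.packages("bigmemory")
# library(bigmemory)
# bm <- read.big.matrix("UQ_VitalSignsData_Case04.csv", header = TRUE, type = "double",
#                       backingfile = "vitals.bin", descriptorfile = "vitals.desc")
# dim(bm)                # dimensions are available without loading the data into RAM
# chunk1 <- bm[1:1000, ] # a small slice, returned as an ordinary numeric matrix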
In previous chapters, we saw various machine learning techniques applied as serial computing tasks. The traditional protocol involves: first, applying function 1 to our raw data; then, using the output from function 1 as the input to function 2. This process is iterated over a series of functions. Finally, we obtain the terminal output generated by the last function. This serial or linear computing method is straightforward but time-consuming and perhaps sub-optimal.
Now we introduce a more efficient way of computing - parallel computing, which provides a mechanism for handling different tasks at the same time and combining the outputs of all processes to get the final answer faster. However, parallel algorithms may require special conditions and cannot be applied to all problems. If two tasks have to be run in a specific order, the problem cannot be parallelized.
To measure how much time can be saved by the different methods, we can use the function system.time().
system.time(mean(vitalsigns$Pulse))
## user system elapsed
## 0.01 0.00 0.01
This means calculating the mean of the Pulse column in the vitalsigns dataset takes about 0.01 seconds of elapsed time. These values will vary between computers, operating systems, and the current system load.
We will introduce two packages for parallel computing, multicore and snow (their core components are included in the package parallel). They take different approaches to multitasking. However, to use these packages, you need a relatively modern multicore computer. Let’s check how many cores your computer has; the function parallel::detectCores() provides this functionality. parallel is a base package, so there is no need to install it prior to using it.
library(parallel)
detectCores()
## [1] 8
So, there are eight (8) cores in my computer. I will be able to run up to 6-8 parallel jobs on this computer.
The multicore
package simply uses the multitasking capabilities of the kernel, the computer’s operating system, to “fork” additional R sessions that share the same memory. Imagine that we open several R sessions in parallel and let each of them do part of the work. Now, let’s examine how this can save time when running complex protocols or dealing with large datasets. To start with, we can use the mclapply() function, which is similar to lapply(), applying a function over a vector and returning a list of results. Instead of applying the function sequentially, mclapply() divides the complete computational task and delegates portions of it to each available core. We will use a simple, yet time-consuming, task - generating random numbers - to demonstrate this procedure. Also, we can use system.time() to track the time differences.
set.seed(123)
system.time(c1<-rnorm(10000000))
## user system elapsed
## 0.60 0.01 0.62
# Note: the forked multicore calls may not work on Windows, but will work on Linux/Mac.
# This shows 2-core and 4-core invocations
# system.time(c2<-unlist(mclapply(1:2, function(x){rnorm(5000000)}, mc.cores = 2)))
# system.time(c4<-unlist(mclapply(1:4, function(x){rnorm(2500000)}, mc.cores = 4)))
# And here is a Windows-compatible (single-core) invocation
system.time(c2<-unlist(mclapply(1:2, function(x){rnorm(5000000)}, mc.cores = 1)))
## user system elapsed
## 0.64 0.03 0.67
The unlist()
is used at the end to combine the results from the different cores into a single vector. Each call generates 10,000,000 random numbers in total. c1 is a regular (serial) R command, which takes the longest time. On systems that support forking (Linux/Mac), c2 uses two cores (each core handles 5,000,000 numbers) and takes less time than the serial version, and c4 uses all four cores, reducing the time further. In general, using more cores significantly reduces the elapsed time.
The snow
package allows parallel computing on multicore/multiprocessor machines or on a network of multiple machines. It may be more difficult to use, but it is also more flexible. First, we set how many cores we want to use via the makeCluster() function.
# install.packages("snow")
library(snow)
##
## Attaching package: 'snow'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, clusterSplit, makeCluster,
## parApply, parCapply, parLapply, parRapply, parSapply,
## splitIndices, stopCluster
cl<-makeCluster(2)
This call might cause your computer to display a firewall warning about network access. To do the same task, we can use the parLapply() function in the snow package. Note that we have to pass in the cluster object we created with the previous makeCluster() call.
system.time(c2<-unlist(parLapply(cl, c(5000000, 5000000), function(x) {rnorm(x)})))
## user system elapsed
## 0.07 0.13 0.61
While using parLapply()
, we have to specify the cluster object, the vector (or list) to iterate over, and the function applied to each element. Remember to stop the cluster after completing the task, to release the system resources.
stopCluster(cl)
foreach
and doParallel
The foreach
package provides another option for parallel computing. It relies on a loop-like construct that applies a specified function to each item in a set, which is somewhat similar to apply(), lapply() and other regular functions. The interesting thing is that these loops can be computed in parallel, saving substantial amounts of time. The foreach package alone cannot provide parallel computing; we have to combine it with other packages like doParallel. Let’s reexamine the task of creating a vector of 10,000,000 random numbers. First, register the 4 compute cores using registerDoParallel().
# install.packages("doParallel")
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
cl<-makeCluster(4)
registerDoParallel(cl)
Then we can examine the time savings from the foreach command.
#install.packages("foreach")
library(foreach)
system.time(c4 <- foreach(i=1:4, .combine = 'c') %dopar% rnorm(2500000))
## user system elapsed
## 0.22 0.11 0.63
Here we used four items (each item can run on a separate core); .combine = 'c' tells foreach to combine the partial results using the c() function, generating a single aggregate result vector.
Also, don’t forget to release the doParallel backend when you are done, by registering the sequential backend (and stopping the cluster with stopCluster(cl) if it is no longer needed).
unregister<-registerDoSEQ()
Modern computers have graphics cards with Graphics Processing Units (GPUs) that consist of thousands of cores; however, these cores are very specialized, unlike a standard CPU chip. If we can use this hardware for parallel computing, we may reach amazing performance improvements, at the cost of complicating the processing algorithms and increasing the constraints on the data format. Specific disadvantages of GPU computing include reliance on proprietary manufacturer frameworks (e.g., NVIDIA) and the Compute Unified Device Architecture (CUDA) programming model. CUDA exposes GPU instructions through a common computing language. This paper provides one example of using GPU computation to significantly improve the performance of advanced neuroimaging and brain mapping processing of multidimensional data.
The R package gputools
was created for parallel computing using NVIDIA CUDA. Detailed information about GPU computing in R is available online.
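As a rough sketch of what GPU acceleration looks like in R (it assumes an NVIDIA GPU, a working CUDA toolkit, and the gputools package, which may need to be installed from the CRAN archive), we could compare CPU and GPU matrix multiplication:
# library(gputools)
# A <- matrix(rnorm(2000^2), nrow = 2000)
# B <- matrix(rnorm(2000^2), nrow = 2000)
# system.time(C_cpu <- A %*% B)           # CPU (BLAS) matrix multiplication
# system.time(C_gpu <- gpuMatMult(A, B))  # GPU matrix multiplication via CUDA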
As we mentioned earlier, some tasks can be parallelized more easily than others. In real world situations, we can pick algorithms that lend themselves well to parallelization. Some of the R packages that allow parallel computing with ML algorithms are listed below.
biglm
biglm
allows training regression models with data from SQL databases or large data chunks obtained from the ff
package. The output is similar to the standard lm()
function that builds linear models. However, biglm
operates efficiently on massive datasets.
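Here is a minimal sketch fitting a biglm model to the nof1 data loaded earlier; the model formula is purely illustrative.
# install.packages("biglm")
library(biglm)
fit_big <- biglm(PhyAct ~ Tx + SocSuppt, data = nof1)  # same formula interface as lm()
summary(fit_big)
# For data arriving in chunks, update(fit_big, nextChunk) would fold in another batch of rows
# (nextChunk is a hypothetical data frame with the same columns).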
bigrf
The bigrf
package can be used to train random forests combining the foreach
and doParallel
packages. In Chapter 14, we presented random forests as ensembles of multiple tree learners. With parallel computing, we can split the task of growing thousands of trees into smaller tasks that are outsourced to each available compute core, and we only need to combine the results at the end. This yields an equivalent forest in a much shorter amount of time (see the sketch below).
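Since bigrf is not always available on CRAN, here is a minimal sketch of the same split-and-combine idea using foreach together with the randomForest package instead; it assumes the qol data frame from Chapter 14 is loaded, and the calls are commented out because they are time-consuming.
# library(randomForest); library(doParallel)
# cl <- makeCluster(4); registerDoParallel(cl)
# rf_par <- foreach(ntree = rep(250, 4), .combine = randomForest::combine,
#                   .packages = "randomForest") %dopar%
#     randomForest(CHARLSONSCORE ~ ., data = qol, ntree = ntree)
# stopCluster(cl); registerDoSEQ()
# rf_par   # a single 1000-tree forest assembled from four 250-tree pieces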
caret
Combining the caret
package with foreach
, we can obtain a powerful method for dealing with time-consuming tasks like building a random forest learner. Utilizing the same example we presented in [Chapter 14](http://www.socr.umich.edu/people/dinov/2017/Spring/DSPA_HS650/notes/14_ImprovingModelPerformance.html), we can see the time difference when utilizing the foreach package.
#library(caret)
# Note: qol, ctrl and grid_rf are assumed to be defined as in the Chapter 14 random forest example
system.time(m_rf <- train(CHARLSONSCORE ~ ., data = qol, method = "rf",
                metric = "Kappa", trControl = ctrl, tuneGrid = grid_rf))
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:Hmisc':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## user system elapsed
## 128.40 1.38 130.05
It took over two minutes (about 130 seconds elapsed) to finish this task in the standard serial execution model, relying purely on the regular caret function. Below, the same model training completes much faster using parallelization (less than half the elapsed time, about 49 seconds) compared to the standard call above.
set.seed(123)
cl<-makeCluster(4)
registerDoParallel(cl)
getDoParWorkers()
## [1] 4
system.time(m_rf <- train(CHARLSONSCORE ~ ., data = qol, method = "rf",
metric = "Kappa", trControl = ctrl, tuneGrid = grid_rf))
## Warning in .Internal(gc(verbose, reset)): closing unused connection 8 (<-
## nur-cnd4194cfw:11310)
## Warning in .Internal(gc(verbose, reset)): closing unused connection 7 (<-
## nur-cnd4194cfw:11310)
## Warning in .Internal(gc(verbose, reset)): closing unused connection 6 (<-
## nur-cnd4194cfw:11310)
## Warning in .Internal(gc(verbose, reset)): closing unused connection 5 (<-
## nur-cnd4194cfw:11310)
## user system elapsed
## 5.14 0.05 49.11
unregister<-registerDoSEQ()
Try to analyze the co-appearance network in the novel “Les Misérables”. The data contains the weighted network of co-appearances of characters in Victor Hugo’s novel “Les Misérables”. Nodes represent characters, as indicated by the labels, and edges connect any pair of characters that appear in the same chapter of the book. The values on the edges are the numbers of such co-appearances.
miserablese<-read.table("https://umich.instructure.com/files/330389/download?download_frd=1", sep="", header=F)
head(miserablese)
Also, try to interrogate some of the larger datasets we have using alternative parallel computing and big data analytics.