
In this DSPA Appendix, we will present some common strategies to interact with relational databases.

The Structured Query Language (SQL) is used to communicate with complex databases (DBs) that store large amounts of structured, relational information. SQL is the de facto standard language for managing, storing, and retrieving information from relational database management systems. SQL statements may be used to update data elements in a database, retrieve data from a database, or query the metadata in the system. Many relational database management systems, such as Oracle, Sybase, Microsoft SQL Server, Access, and Ingres, provide SQL application programming interfaces (APIs). There are also NoSQL database systems, e.g., MongoDB, that provide more flexibility by relaxing the relational requirements of SQL systems. In many situations, relational SQL DBs are impractical for large amounts of heterogeneous, incongruent, and rapidly changing data that blends structured, semi-structured, unstructured, and polymorphic data elements. We will predominantly focus on the standard SQL commands, such as SELECT, INSERT, UPDATE, DELETE, CREATE, and DROP, to interact with external databases. The MongoDB noSQL tutorial provides information on interacting with NoSQL DBs. In this module, we mostly focus on data import from SQL databases.

Let’s start by exploring R-based data import from SQL databases. First, we need to install and load the RODBC (R Open Database Connectivity) package.

# install.packages("RODBC", repos = "http://cran.us.r-project.org")
library(RODBC)

Then, we can open a connection to a SQL server database via its Data Source Name (DSN), e.g., for a Microsoft Access database. More details are provided in the Microsoft documentation and in the RODBC documentation.
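For instance, a minimal RODBC sketch (not executed here) might look as follows, where the DSN name, user ID, and password are hypothetical placeholders:

# ch <- odbcConnect("myAccessDSN", uid = "myUser", pwd = "myPassword")
# sqlTables(ch)                                   # list tables visible through the DSN
# df <- sqlQuery(ch, "SELECT * FROM someTable")   # run a query, returning a data.frame
# odbcClose(ch)                                   # close the channel when done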

1 Connections to SQL Database

1.1 Set a connection

As a demonstration, we will connect with username “genome” to the MySQL database hg19 hosted on the genome.ucsc.edu server.

# install.packages("RMySQL"); 
library(RMySQL)
# install.packages("RODBC"); 
library(RODBC)
# install.packages("DBI"); 
library(DBI)

ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

The RODBC library also allows connections to other relational databases, such as SQL Server, and to database-like platforms such as Dremio. To connect to each of these data sources, the specific driver for that data source must be downloaded and installed in R.
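For example, a DSN-less connection to SQL Server can be sketched with an RODBC driver connection string (not executed here; the driver, server, and database names are hypothetical and depend on your installation):

# sqlServerConn <- odbcDriverConnect(
#   "driver={ODBC Driver 17 for SQL Server};server=myServer;database=myDB;uid=myUser;pwd=myPassword")
# sqlQuery(sqlServerConn, "SELECT TOP 5 * FROM someTable")
# odbcClose(sqlServerConn)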

If you need to work with multiple SQL data sources simultaneously within one R project, establish multiple connections in a fashion similar to the one above, as sketched below.
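For instance, two connections can be held open side by side (a sketch; the second database is a hypothetical placeholder):

# conn1 <- dbConnect(MySQL(), user='genome', dbname='hg19',
#                    host='genome-mysql.soe.ucsc.edu')
# conn2 <- dbConnect(MySQL(), user='otherUser', dbname='otherDB', host='other.host.org')
# # ... query each connection independently ...
# dbDisconnect(conn1); dbDisconnect(conn2)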

1.2 Managing and closing existing connections

After all needed jobs on the database are done, it is important to disconnect from it; connections are not closed automatically, and there may be a limit on how many connections can be open at the same time in R.

Here we will demonstrate how to see all active connections and how to close connections.

# install.packages("DBI"); 
library(DBI)
# install.packages("RMySQL"); 
library(RMySQL)
# install.packages("RODBC"); 
library(RODBC)


#display all current connections to MySQL databases
dbListConnections(MySQL())
## list()
#close connection [1] to MySQL database
#dbDisconnect(dbListConnections(dbDriver(drv="MySQL"))[[1]])

#list and close all connections to MySQL databases
lapply(dbListConnections(MySQL()), dbDisconnect)
## list()
#setup connection
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

#disconnect current session from the database
dbDisconnect(ucscGenomeConn)
## [1] TRUE

2 Basic Functions in RODBC

Besides dbConnect, dbDisconnect, and dbListConnections, there are several other important functions that can be useful for database users. These functions are demonstrated below.

# install.packages("RMySQL"); 
library(RMySQL)
# install.packages("RODBC"); 
library(RODBC)
#setup connection
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')
#Store the names of all the tables in the database into 'allTables' and display the total number of tables in the database
allTables <- dbListTables(ucscGenomeConn); length(allTables)
## [1] 12551
#List the fields in the table "affyU133Plus2"
dbListFields(ucscGenomeConn, "affyU133Plus2")
##  [1] "bin"         "matches"     "misMatches"  "repMatches"  "nCount"     
##  [6] "qNumInsert"  "qBaseInsert" "tNumInsert"  "tBaseInsert" "strand"     
## [11] "qName"       "qSize"       "qStart"      "qEnd"        "tName"      
## [16] "tSize"       "tStart"      "tEnd"        "blockCount"  "blockSizes" 
## [21] "qStarts"     "tStarts"
#Read the table "affyU133Plus2" into "affyData" and display the first few lines
affyData <- dbReadTable(ucscGenomeConn, "affyU133Plus2");head(affyData)
##   bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert
## 1 585     530          4          0     23          3          41          3
## 2 585    3355         17          0    109          9          67          9
## 3 585    4156         14          0     83         16          18          2
## 4 585    4667          9          0     68         21          42          3
## 5 585    5180         14          0    167         10          38          1
## 6 585     468          5          0     14          0           0          0
##   tBaseInsert strand        qName qSize qStart qEnd tName     tSize tStart
## 1         898      -  225995_x_at   637      5  603  chr1 249250621  14361
## 2       11621      -  225035_x_at  3635      0 3548  chr1 249250621  14381
## 3          93      -  226340_x_at  4318      3 4274  chr1 249250621  14399
## 4        5743      - 1557034_s_at  4834     48 4834  chr1 249250621  14406
## 5          29      -    231811_at  5399      0 5399  chr1 249250621  19688
## 6           0      -    236841_at   487      0  487  chr1 249250621  27542
##    tEnd blockCount
## 1 15816          5
## 2 29483         17
## 3 18745         18
## 4 24893         23
## 5 25078         11
## 6 28029          1
##                                                                   blockSizes
## 1                                                          93,144,229,70,21,
## 2              73,375,71,165,303,360,198,661,201,1,260,250,74,73,98,155,163,
## 3                 690,10,32,33,376,4,5,15,5,11,7,41,277,859,141,51,443,1253,
## 4 99,352,286,24,49,14,6,5,8,149,14,44,98,12,10,355,837,59,8,1500,133,624,58,
## 5                                       131,26,1300,6,4,11,4,7,358,3359,155,
## 6                                                                       487,
##                                                                                                  qStarts
## 1                                                                                    34,132,278,541,611,
## 2                        87,165,540,647,818,1123,1484,1682,2343,2545,2546,2808,3058,3133,3206,3317,3472,
## 3                   44,735,746,779,813,1190,1195,1201,1217,1223,1235,1243,1285,1564,2423,2565,2617,3062,
## 4 0,99,452,739,764,814,829,836,842,851,1001,1016,1061,1160,1173,1184,1540,2381,2441,2450,3951,4103,4728,
## 5                                                     0,132,159,1460,1467,1472,1484,1489,1497,1856,5244,
## 6                                                                                                     0,
##                                                                                                                                      tStarts
## 1                                                                                                             14361,14454,14599,14968,15795,
## 2                                     14381,14454,14969,15075,15240,15543,15903,16104,16853,17054,17232,17492,17914,17988,18267,24736,29320,
## 3                               14399,15089,15099,15131,15164,15540,15544,15549,15564,15569,15580,15587,15628,15906,16857,16998,17049,17492,
## 4 14406,20227,20579,20865,20889,20938,20952,20958,20963,20971,21120,21134,21178,21276,21288,21298,21653,22492,22551,22559,24059,24211,24835,
## 5                                                                         19688,19819,19845,21145,21151,21155,21166,21170,21177,21535,24923,
## 6                                                                                                                                     27542,

The function dbReadTable() is equivalent to the SQL command SELECT * FROM affyU133Plus2.
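We can check this equivalence directly (a quick sketch, assuming ucscGenomeConn is still open; not executed here):

# affyData2 <- dbGetQuery(ucscGenomeConn, "SELECT * FROM affyU133Plus2")
# identical(dim(affyData), dim(affyData2))   # the two calls return the same table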

3 Querying with SQL

After establishing the connection to a SQL data source, you can pass SQL commands, formatted as strings, to the database. Here we demonstrate such interactions with several basic SQL commands.

# set up connection
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

# select the top 10 rows of the genoName column from table rmsk with dbSendQuery and dbClearResult.
rs <- dbSendQuery(ucscGenomeConn,'select genoName from rmsk limit 10')
dbClearResult(rs)
## [1] TRUE
# select the top 10 rows of genoName column from table rmsk with dbGetQuery
df <- dbGetQuery(ucscGenomeConn,'select genoName from rmsk limit 10')

#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE

The difference between dbSendQuery and dbGetQuery is the following:

  • dbSendQuery sends a query to the database but does not retrieve any data until the results are fetched, while dbGetQuery retrieves the data back from the server immediately and can write it directly into a data structure.
  • dbSendQuery needs to be paired with dbClearResult() to close the query activity before any other interaction with the server, while dbGetQuery closes the query activity on the database automatically.

As a result, dbGetQuery is more suitable for running quick queries on the server and retrieving data immediately, while dbSendQuery is more useful when running queries on larger datasets and when immediate results are not required.

Both of these functions are intended for data users, so only SELECT statements should be sent through them. If the user has higher-level access to the database and intends to manipulate the data, dbExecute is needed. This is demonstrated (but not executed) below.

# create a new empty table named newTable using dbExecute
#dbExecute(ucscGenomeConn,'CREATE TABLE newTable (id INT)')

4 Fetching Results

When using dbGetQuery, data is automatically retrieved, or “fetched”, from the server. However, this is not always intended. If the goal is to query first and fetch the results at a later time, a combination of dbSendQuery and fetch can be used. Here is an example.

# set up connection
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

# select the top 10 rows of the genoName column from table rmsk with dbSendQuery and fetch.
rs <- dbSendQuery(ucscGenomeConn,'select genoName from rmsk limit 10')
fetch(rs)
##    genoName
## 1      chr1
## 2      chr1
## 3      chr1
## 4      chr1
## 5      chr1
## 6      chr1
## 7      chr1
## 8      chr1
## 9      chr1
## 10     chr1
dbClearResult(rs)
## [1] TRUE
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE

Regardless of which function is used to query the database, the results must be fetched into R if analysis with R is intended.
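For large result sets, fetching can also be done incrementally. Here is a minimal sketch using DBI’s dbFetch() with the n argument, assuming an open connection ucscGenomeConn (not executed here):

# rs <- dbSendQuery(ucscGenomeConn, 'select genoName from rmsk limit 100')
# while (!dbHasCompleted(rs)) {
#   chunk <- dbFetch(rs, n = 25)   # retrieve 25 rows per batch
#   # ... process each chunk here ...
# }
# dbClearResult(rs)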

5 Important SQL clauses

Here are a few important SQL commands that can be used when working with databases.

5.1 Basic SELECT

“SELECT (…) FROM (…)” selects all rows (unless further specified) of specific columns of data from a table in the database. It can be used in combination with the following clauses:

  • “DISTINCT” removes replicated rows that have the same values in all specified columns.
  • “WHERE” selects rows whose values at the specified columns meet the conditions defined after this clause.
  • “AND/OR/NOT” are logical operators that allow setting complex requirements in the WHERE clause.
  • “GROUP BY” separates the table into groups by the values in a specified column; the number of groups created equals the number of distinct values in that column.
  • “AS” renames a column, or the result of an operation on it, to another specified name.
  • “HAVING” serves the same qualification purpose as the WHERE clause, but it allows conditioning on aggregate functions such as COUNT() and works best with GROUP BY.
  • “ORDER BY” rearranges the rows in the specified order on the values of the specified column(s).
  • “LIMIT” caps how many rows of data are selected, which is useful when working with large datasets.

These clauses are shown below.

ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')
#The tables we will be working with: pseudoYale60 and pseudoYale60Class from the hg19 database:
dbListFields(ucscGenomeConn, "pseudoYale60")
##  [1] "bin"        "name"       "chrom"      "strand"     "txStart"   
##  [6] "txEnd"      "cdsStart"   "cdsEnd"     "exonCount"  "exonStarts"
## [11] "exonEnds"
dbListFields(ucscGenomeConn, "pseudoYale60Class")
## [1] "name"  "class" "owner"
# select the top 10 rows of all columns from table pseudoYale60Class.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class limit 10')
##                 name      class owner
## 1  PGOHUM00000232565  Ambiguous  Yale
## 2  PGOHUM00000232567  Processed  Yale
## 3  PGOHUM00000232568  Ambiguous  Yale
## 4  PGOHUM00000232569 Duplicated  Yale
## 5  PGOHUM00000232570 Duplicated  Yale
## 6  PGOHUM00000232573  Processed  Yale
## 7  PGOHUM00000232574  Ambiguous  Yale
## 8  PGOHUM00000232575 Duplicated  Yale
## 9  PGOHUM00000232579  Processed  Yale
## 10 PGOHUM00000232580 Duplicated  Yale
# select the top 10 rows of the class column from table pseudoYale60Class.
dbGetQuery(ucscGenomeConn,'select class from pseudoYale60Class limit 10')
##         class
## 1   Ambiguous
## 2   Processed
## 3   Ambiguous
## 4  Duplicated
## 5  Duplicated
## 6   Processed
## 7   Ambiguous
## 8  Duplicated
## 9   Processed
## 10 Duplicated
# select the top 10 distinct rows of the class column from table pseudoYale60Class.
dbGetQuery(ucscGenomeConn,'select distinct class from pseudoYale60Class limit 10')
##        class
## 1  Ambiguous
## 2  Processed
## 3 Duplicated
# select the top 10 rows of name and class columns from table pseudoYale60Class where class is 'Ambiguous'.
dbGetQuery(ucscGenomeConn,'select name, class from pseudoYale60Class WHERE class = \'Ambiguous\' limit 10')
##                 name     class
## 1  PGOHUM00000232565 Ambiguous
## 2  PGOHUM00000232568 Ambiguous
## 3  PGOHUM00000232574 Ambiguous
## 4  PGOHUM00000232584 Ambiguous
## 5  PGOHUM00000232585 Ambiguous
## 6  PGOHUM00000232590 Ambiguous
## 7  PGOHUM00000232593 Ambiguous
## 8  PGOHUM00000232594 Ambiguous
## 9  PGOHUM00000232597 Ambiguous
## 10 PGOHUM00000232601 Ambiguous
# select the top 10 rows of all columns from table pseudoYale60Class where class is 'Ambiguous' and owner is 'Yale'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' AND owner = \'Yale\' limit 10')
##                 name     class owner
## 1  PGOHUM00000232565 Ambiguous  Yale
## 2  PGOHUM00000232568 Ambiguous  Yale
## 3  PGOHUM00000232574 Ambiguous  Yale
## 4  PGOHUM00000232584 Ambiguous  Yale
## 5  PGOHUM00000232585 Ambiguous  Yale
## 6  PGOHUM00000232590 Ambiguous  Yale
## 7  PGOHUM00000232593 Ambiguous  Yale
## 8  PGOHUM00000232594 Ambiguous  Yale
## 9  PGOHUM00000232597 Ambiguous  Yale
## 10 PGOHUM00000232601 Ambiguous  Yale
# select the top 10 rows of all columns from table pseudoYale60Class where class is 'Ambiguous' or owner is 'Yale'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' OR owner = \'Yale\' limit 10')
##                 name      class owner
## 1  PGOHUM00000232565  Ambiguous  Yale
## 2  PGOHUM00000232567  Processed  Yale
## 3  PGOHUM00000232568  Ambiguous  Yale
## 4  PGOHUM00000232569 Duplicated  Yale
## 5  PGOHUM00000232570 Duplicated  Yale
## 6  PGOHUM00000232573  Processed  Yale
## 7  PGOHUM00000232574  Ambiguous  Yale
## 8  PGOHUM00000232575 Duplicated  Yale
## 9  PGOHUM00000232579  Processed  Yale
## 10 PGOHUM00000232580 Duplicated  Yale
# select the top 10 rows of all columns from table pseudoYale60Class where class is 'Ambiguous' or 'Processed'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' OR class = \'Processed\' limit 10')
##                 name     class owner
## 1  PGOHUM00000232565 Ambiguous  Yale
## 2  PGOHUM00000232567 Processed  Yale
## 3  PGOHUM00000232568 Ambiguous  Yale
## 4  PGOHUM00000232573 Processed  Yale
## 5  PGOHUM00000232574 Ambiguous  Yale
## 6  PGOHUM00000232579 Processed  Yale
## 7  PGOHUM00000232581 Processed  Yale
## 8  PGOHUM00000232582 Processed  Yale
## 9  PGOHUM00000232584 Ambiguous  Yale
## 10 PGOHUM00000232585 Ambiguous  Yale
# select the top 10 rows of all columns from table pseudoYale60Class where class is not 'Ambiguous'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE NOT class = \'Ambiguous\' limit 10')
##                 name      class owner
## 1  PGOHUM00000232567  Processed  Yale
## 2  PGOHUM00000232569 Duplicated  Yale
## 3  PGOHUM00000232570 Duplicated  Yale
## 4  PGOHUM00000232573  Processed  Yale
## 5  PGOHUM00000232575 Duplicated  Yale
## 6  PGOHUM00000232579  Processed  Yale
## 7  PGOHUM00000232580 Duplicated  Yale
## 8  PGOHUM00000232581  Processed  Yale
## 9  PGOHUM00000232582  Processed  Yale
## 10 PGOHUM00000232583 Duplicated  Yale
# select class and how many names in the class from pseudoYale60Class
dbGetQuery(ucscGenomeConn,'select count(name) as number_of_names, class from pseudoYale60Class GROUP BY class')
##   number_of_names      class
## 1            6409  Ambiguous
## 2            2631 Duplicated
## 3            8836  Processed
# select class and how many names in the class from pseudoYale60Class if the number of names in the class is greater than 4000
dbGetQuery(ucscGenomeConn,'select count(name) as number_of_names, class from pseudoYale60Class group by class having count(name) > 4000')
##   number_of_names     class
## 1            6409 Ambiguous
## 2            8836 Processed
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE
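Among the clauses listed above, ORDER BY was not exercised against the server. Here is a sketch of its use together with AS (assuming an open connection; not executed here):

# order the per-class name counts from largest to smallest
# dbGetQuery(ucscGenomeConn,
#            'select count(name) as number_of_names, class from pseudoYale60Class
#             group by class order by number_of_names desc')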

5.2 SELECT from multiple tables

Let’s now show several clauses that are useful when working across multiple tables.

Working with multiple tables:

  • “JOIN/LEFT JOIN/RIGHT JOIN/INNER JOIN” combine two tables where they share one or more columns holding the same type of data.
  • “USING” is an alternative to the ON clause in JOINs. It joins on columns that have the same name in both tables, whereas ON lets you specify and join columns with different names.
  • “UNION” appends one table to another when the columns of the two tables have the same data types; replicated rows are removed.
  • “UNION ALL” appends one table to another when the columns of the two tables have the same data types, without removing replicates.

ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')
#The tables we will be working with: pseudoYale60 and pseudoYale60Class from the hg19 database

#joins the pseudoYale60 table with pseudoYale60Class table, where each row is matched on the name column
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60 join pseudoYale60Class on pseudoYale60.name = pseudoYale60Class.name limit 10')
##    bin              name chrom strand txStart  txEnd cdsStart cdsEnd exonCount
## 1  585 PGOHUM00000244550  chr1      +   12981  13670        0      0         3
## 2  585 PGOHUM00000251045  chr1      -   16607  18381        0      0         6
## 3  585 PGOHUM00000244551  chr1      +   52475  53354        0      0         1
## 4  585 PGOHUM00000244552  chr1      +   62951  63851        0      0         1
## 5  585 PGOHUM00000244553  chr1      +   65887  65974        0      0         1
## 6   73 PGOHUM00000243857  chr1      -  120819 133580        0      0         3
## 7   73 PGOHUM00000243858  chr1      -  131029 131544        0      0         1
## 8  586 PGOHUM00000244554  chr1      +  131124 135623        0      0         1
## 9  586 PGOHUM00000251046  chr1      -  135714 136507        0      0         1
## 10 586 PGOHUM00000244555  chr1      +  135956 136244        0      0         1
##                              exonStarts                             exonEnds
## 1                    12981,13224,13448,                   13056,13377,13670,
## 2  16607,16854,17229,17601,17914,18265, 16766,17052,17364,17745,18058,18381,
## 3                                52475,                               53354,
## 4                                62951,                               63851,
## 5                                65887,                               65974,
## 6                 120819,129028,133367,                120930,129232,133580,
## 7                               131029,                              131544,
## 8                               131124,                              135623,
## 9                               135714,                              136507,
## 10                              135956,                              136244,
##                 name      class owner
## 1  PGOHUM00000244550 Duplicated  Yale
## 2  PGOHUM00000251045 Duplicated  Yale
## 3  PGOHUM00000244551  Processed  Yale
## 4  PGOHUM00000244552  Processed  Yale
## 5  PGOHUM00000244553  Ambiguous  Yale
## 6  PGOHUM00000243857 Duplicated  Yale
## 7  PGOHUM00000243858  Processed  Yale
## 8  PGOHUM00000244554  Processed  Yale
## 9  PGOHUM00000251046  Processed  Yale
## 10 PGOHUM00000244555  Ambiguous  Yale
#alternatively using USING Clause
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60 join pseudoYale60Class USING (name) limit 10')
##                 name bin chrom strand txStart  txEnd cdsStart cdsEnd exonCount
## 1  PGOHUM00000244550 585  chr1      +   12981  13670        0      0         3
## 2  PGOHUM00000251045 585  chr1      -   16607  18381        0      0         6
## 3  PGOHUM00000244551 585  chr1      +   52475  53354        0      0         1
## 4  PGOHUM00000244552 585  chr1      +   62951  63851        0      0         1
## 5  PGOHUM00000244553 585  chr1      +   65887  65974        0      0         1
## 6  PGOHUM00000243857  73  chr1      -  120819 133580        0      0         3
## 7  PGOHUM00000243858  73  chr1      -  131029 131544        0      0         1
## 8  PGOHUM00000244554 586  chr1      +  131124 135623        0      0         1
## 9  PGOHUM00000251046 586  chr1      -  135714 136507        0      0         1
## 10 PGOHUM00000244555 586  chr1      +  135956 136244        0      0         1
##                              exonStarts                             exonEnds
## 1                    12981,13224,13448,                   13056,13377,13670,
## 2  16607,16854,17229,17601,17914,18265, 16766,17052,17364,17745,18058,18381,
## 3                                52475,                               53354,
## 4                                62951,                               63851,
## 5                                65887,                               65974,
## 6                 120819,129028,133367,                120930,129232,133580,
## 7                               131029,                              131544,
## 8                               131124,                              135623,
## 9                               135714,                              136507,
## 10                              135956,                              136244,
##         class owner
## 1  Duplicated  Yale
## 2  Duplicated  Yale
## 3   Processed  Yale
## 4   Processed  Yale
## 5   Ambiguous  Yale
## 6  Duplicated  Yale
## 7   Processed  Yale
## 8   Processed  Yale
## 9   Processed  Yale
## 10  Ambiguous  Yale
#Append the first 10 names from pseudoYale60Class to the first 10 names of pseudoYale60, and the replicated rows are removed
dbGetQuery(ucscGenomeConn,'(select name from pseudoYale60 limit 10) union (select name from pseudoYale60Class limit 10)')
##                 name
## 1  PGOHUM00000232565
## 2  PGOHUM00000232567
## 3  PGOHUM00000232568
## 4  PGOHUM00000232569
## 5  PGOHUM00000232570
## 6  PGOHUM00000232573
## 7  PGOHUM00000232574
## 8  PGOHUM00000232575
## 9  PGOHUM00000232579
## 10 PGOHUM00000232580
#Append the first 10 names from pseudoYale60Class to the first 10 names of pseudoYale60
dbGetQuery(ucscGenomeConn,'(select name from pseudoYale60 limit 10) union all (select name from pseudoYale60Class limit 10)')
##                 name
## 1  PGOHUM00000232565
## 2  PGOHUM00000232567
## 3  PGOHUM00000232568
## 4  PGOHUM00000232569
## 5  PGOHUM00000232570
## 6  PGOHUM00000232573
## 7  PGOHUM00000232574
## 8  PGOHUM00000232575
## 9  PGOHUM00000232579
## 10 PGOHUM00000232580
## 11 PGOHUM00000232565
## 12 PGOHUM00000232567
## 13 PGOHUM00000232568
## 14 PGOHUM00000232569
## 15 PGOHUM00000232570
## 16 PGOHUM00000232573
## 17 PGOHUM00000232574
## 18 PGOHUM00000232575
## 19 PGOHUM00000232579
## 20 PGOHUM00000232580
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE

As demonstrated, the USING clause eliminates the repeated name column, while the ON clause version of JOIN keeps both columns from both tables.

The differences among JOIN, LEFT JOIN, RIGHT JOIN, INNER JOIN, and FULL JOIN will be explained later.

Here are a few other multi-table operators that can be useful:

  • “EXCEPT” selects rows in one table that are not present in another table.
    • SELECT columnA FROM Table1 EXCEPT SELECT columnA FROM Table2
  • “INTERSECT” selects rows that are present in both tables.
    • SELECT columnA FROM Table1 INTERSECT SELECT columnA FROM Table2

We will demonstrate these two operators later, together with the JOINs.

6 Connecting to Google BigQuery

R can also connect to platform-as-a-service (PaaS) providers, such as Google Cloud Platform, and use them as data sources. This allows faster query execution and minimizes the use of local memory.

6.2 Billing in Google BigQuery

With smaller-scale querying operations, BigQuery is free for public use, but it charges a fee for heavy usage. That is why it is important to specify billing information when querying.

Fees for BigQuery are billed on a per-project basis, so supplying billing information amounts to naming the project that should be billed. If you are not a member of that project with permission to bill it, the billing request will be declined.

Many public libraries and projects make their data publicly available, but they do not allow the public to bill those projects. This means that users who want to query data from a public library need to additionally provide another project that Google can bill.
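In the BigQuery examples that follow, the variable billing is assumed to hold the ID of a project that you are permitted to bill, for example:

# billing <- "my-gcp-project-id"   # hypothetical project ID; replace with your own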

6.3 Case Study I: Open Library in BigQuery

Here we will connect to a sample dataset in Google Cloud Platform’s list of open libraries and query from it.

6.3.1 Set up BigQuery on GCP

To use BigQuery functionality, a Google Cloud Platform account is needed.

  1. Go to Google Cloud Platform and register a GCP account with your email. New GCP members receive a 12-month free trial and $300 in credits.
  2. HS 650 DSPA students will receive a $50 GCP credit (from the Instructor).
  3. In Google Cloud Platform, start a new project.
  4. After starting a new project, go to the IAM & Admin tab on GCP.
  5. On this tab, you will see a list of members of the project. You can add new members and assign each member specific role(s).
  6. Make sure all members (including yourself) who need to access BigQuery functionality from R are assigned the role of BigQuery Admin, or a similar role that has permission to create jobs remotely on BigQuery.
  7. After assigning roles and permissions, go to the GCP BigQuery User Interface.

This is the primary interface for setting up data sources. We will return to this tab only when we need to add new data sources and datasets; the rest of the time, we only need R to access and query data from GCP.

6.3.2 Connecting to Dataset

We will now demonstrate connecting to your project and to the bigquery-public-data.samples.shakespeare dataset.

#install.packages("bigrquery")
library(DBI)
library(RODBC)
library(bigrquery)
#set 'billing' to the project that will be billed

#Connect to the project
con <- dbConnect(bigquery(),
                 project = "bigquery-public-data",
                 billing = billing
)

#list first few lines of shakespeare table
sql <- "SELECT * FROM [bigquery-public-data.samples.shakespeare] LIMIT 5"
result <- query_exec(sql, project=billing)
result
##      word word_count  corpus corpus_date
## 1    LVII          1 sonnets           0
## 2  augurs          1 sonnets           0
## 3  dimm'd          1 sonnets           0
## 4 plagues          1 sonnets           0
## 5 treason          1 sonnets           0
#disconnect from Google BigQuery
dbDisconnect(con)

6.4 Case Study II: MIMIC-III Intro

Here, we will demonstrate how to connect to a MIMIC-III dataset through Google Cloud’s BigQuery platform and query some data.

6.4.1 Setting Up MIMIC-III on BigQuery

As in Case Study I, MIMIC-III is a project available on BigQuery, so we will need to pin it to the BigQuery interface. However, since this project is not in the public library, we first need to gain access to it and then pin it.

6.4.1.1 Getting Access to MIMIC-III

Access to MIMIC-III is granted on a per-email basis. Follow these steps to gain access for the email address registered to the project owner:

  1. MIMIC-III data is stored on PhysioNet, which is also the site for gaining access to MIMIC-III.
  2. Go to PhysioNet Registration, where you will register an account with the email address for which you want to gain access.
  3. Confirm the registration after receiving an email, then log back into PhysioNet.
  4. Go to the page for the MIMIC-III data, scroll to the bottom, and click on credentialed user.
  5. In the Credentialing tab of the page, click on “apply for access”.
  6. You will then be directed to the application page for accessing MIMIC-III.
  7. Complete the application, which involves completing an online course (the CITI Program in “Data or Specimens Only Research”) and listing your faculty supervisor as a reference.

6.4.1.2 Adding MIMIC-III to BigQuery

After gaining access to the MIMIC-III data (it may take up to 2 weeks to process the application), follow these steps to add the MIMIC-III project to BigQuery:

  1. Go to GCP BigQuery User Interface
  2. On the Resources Tab, click on ADD DATA then Pin a project.
  3. Type in “physionet-data”, which will pin the MIMIC-III dataset to your BigQuery Resources tab.

After these steps are done, you can query the MIMIC-III dataset directly from R. However, you first need to link the PhysioNet and BigQuery accounts by:

  1. Going to the PhysioNet MIMIC-III dataset page.
  2. Scrolling down to Files, Access Files, where you will see many CSV.gz files.
  3. Clicking Request access using Google BigQuery.

6.4.2 Simple Demonstration

#install.packages("bigrquery")
library(DBI)
library(RODBC)
library(bigrquery)
#Connect to the MIMIC-III database, eicu_crd_demo dataset
con <- dbConnect(bigquery(),
                 project = "physionet-data",
                 dataset = "eicu_crd_demo",
                 billing = billing
)

#list tables in eicu_crd_demo
dbListTables(con)
##  [1] "admissiondrug"             "admissiondx"              
##  [3] "allergy"                   "apacheapsvar"             
##  [5] "apachepatientresult"       "apachepredvar"            
##  [7] "careplancareprovider"      "careplaneol"              
##  [9] "careplangeneral"           "careplangoal"             
## [11] "careplaninfectiousdisease" "customlab"                
## [13] "diagnosis"                 "hospital"                 
## [15] "infusiondrug"              "intakeoutput"             
## [17] "lab"                       "medication"               
## [19] "microlab"                  "note"                     
## [21] "nurseassessment"           "nursecare"                
## [23] "nursecharting"             "pasthistory"              
## [25] "patient"                   "physicalexam"             
## [27] "respiratorycare"           "respiratorycharting"      
## [29] "treatment"                 "vitalaperiodic"           
## [31] "vitalperiodic"
#list fields in table infusiondrug
dbListFields(con,"infusiondrug")
## [1] "infusiondrugid"    "patientunitstayid" "infusionoffset"   
## [4] "drugname"          "drugrate"          "infusionrate"     
## [7] "drugamount"        "volumeoffluid"     "patientweight"
#list fields in table medication
dbListFields(con,"medication")
##  [1] "medicationid"       "patientunitstayid"  "drugorderoffset"   
##  [4] "drugstartoffset"    "drugivadmixture"    "drugordercancelled"
##  [7] "drugname"           "drughiclseqno"      "dosage"            
## [10] "routeadmin"         "frequency"          "loadingdose"       
## [13] "prn"                "drugstopoffset"     "gtc"
#query from table infusiondrug
sql <- "SELECT drugname FROM [physionet-data.eicu_crd_demo.infusiondrug] group by drugname limit 10"
result <- query_exec(sql, project=billing)
## Warning: 'query_exec' is deprecated.
## Use 'bq_perform_query' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'run_query_job' is deprecated.
## Use 'bq_perform_query' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'insert_query_job' is deprecated.
## Use 'bq_perform_query' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'wait_for' is deprecated.
## Use 'bq_job_wait' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'get_job' is deprecated.
## Use 'bq_job_meta' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## 0 bytes processed
## Warning: 'list_tabledata' is deprecated.
## Use 'bq_table_download' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'get_table' is deprecated.
## Use 'bq_table_meta' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
result
##    drugname
## 1        LR
## 2        NS
## 3       D5w
## 4       NSS
## 5       TPA
## 6   0.9% NS
## 7   Heparin
## 8   Insulin
## 9   (ml/hr)
## 10 Dopamine
#disconnect from Google BigQuery
dbDisconnect(con)
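The deprecation warnings above indicate that query_exec() has been superseded. A sketch of the same query using the newer bigrquery interface (assuming the same billing project; not executed here):

# tb <- bq_project_query(billing,
#        "SELECT drugname FROM `physionet-data.eicu_crd_demo.infusiondrug`
#         GROUP BY drugname LIMIT 10")
# result <- bq_table_download(tb)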

6.5 Case Study III: MIMIC-III & Acute HF

6.5.1 Introduction

The following R code was developed by the University of Michigan SOCR Data Analytics - MIMICIII team and posted on GitHub. The following are excerpts from the project:

Heart failure (HF) is a “multifactorial, systemic disease” in which a network of autoprotective mechanisms activated after cardiac injury cause significant problems with heart function (Tanai & Frantz, 2015). HF is one of the largest contributing factors to mortality in the United States, playing a role in 1 in 9 deaths and accounting for more than $30 billion in annual healthcare expenditures (CDC.gov; Mozaffarian et al.). In addition to the estimated 5.7 million Americans living with HF, an additional 915,000 are newly diagnosed each year (Mozaffarian et al.). Despite recent advancements in the treatment of underlying risk factors, long-term mortality remains high, with less than half of newly diagnosed patients surviving for five years (Mozaffarian et al.).

Risk stratification systems have become an important part of patient management and have the potential to “improve clinical outcome and resource allocation” by “avoiding the overtreatment of low-risk subjects or the early, inappropriate discharge of high-risk patients” (Passantino et al., 2015). In their comprehensive 2015 publication, Andrea Passantino and colleagues review a collection of the “most relevant” risk stratification systems targeted at acute heart failure (AHF), including EFFECT, OPTIMIZE-HF, APACHE-HF, and ESCAPE, among others. Passantino and her team describe the wide array of data sources and techniques used by the original authors to validate these scoring systems, including “public registries, clinical trials, and retrospective data” (Passantino et al., 2015). The heterogeneity of these data sources makes direct performance comparisons difficult.

In this document, we aim to extend the work contributed by Passantino and colleagues and provide a direct comparison of the performance of these AHF risk stratification metrics on a single, unified dataset. To this end, we leverage the Medical Information Mart for Intensive Care (MIMIC-III), a dataset developed by the Massachusetts Institute of Technology Lab for Computational Physiology (MIT-LCP) which contains de-identified health data for more than 40,000 intensive care unit (ICU) patients over the years 2001-2012 (Johnson et al., 2012). It includes much of the same data commonly found in electronic health records (EHR), including demographic information, diagnosis codes, procedures, laboratory values, vital signs, free-text clinical notes, and admission, discharge, and mortality information. Moreover, it is large and comprehensive enough to serve as a proving ground for each of the individual risk stratification systems, allowing a level playing field from which to make direct comparisons of the efficacy and performance of each proposed metric.

In the spirit of SOCR’s open science initiative, this document and its parent repository contain the complete end-to-end computational protocol, results, and validation procedures. It is subdivided into the following partitions:

  1. Cohort selection: criteria for inclusion into the study are codified and explained.
  2. Data extraction: specific data points are extracted from the larger dataset and transformed for further analysis.

6.5.2 Cohort Selection

#install.packages("bigrquery")
library(DBI)
library(RODBC)
library(bigrquery)
#install.packages("knitr")
#install.packages("kableExtra")
#install.packages("ggplot2")
#install.packages("ggpubr")
#install.packages("magrittr")
library("magrittr")
library('knitr')         # Make knitted tables ("kables")
library('kableExtra')    # Extra kable formatting options
library('ggplot2')       # Plotting library
library('ggpubr')

As the MIMIC-III dataset contains a wide variety of patients, those suitable for our particular analysis must first be isolated. Based on the manuscripts cited in Passantino et al.’s review, we developed a set of inclusion criteria to select such a cohort. These criteria include:

  1. Patients diagnosed with acute heart failure, excluding those with existing chronic heart failure.
  2. Patients who survived to ICU discharge, as we prefer to focus on out-of-hospital mortality.
  3. Patients who stayed in the ICU between 24 hours and 45 days, yielding enough time to facilitate a number of test results while excluding the few ultra-long term ICU stays present in the dataset.
  4. Patients who are between 18 and 89 years of age (inclusive).

These criteria were inspired by the inclusion criteria schema described in the following references:

  1. Auble et al., A Prediction Rule to Identify Low-risk Patients with Heart Failure. Academic Emergency Medicine, 2005.
  2. Abraham et al., Predictors of In-Hospital Mortality in Patients Hospitalized for Heart Failure: Insights From the Organized Program to Initiate Lifesaving Treatment in Hospitalized Patients With Heart Failure (OPTIMIZE-HF). Journal of the American College of Cardiology, 2008.
  3. Peterson et al., A Validated Risk Score for In-Hospital Mortality in Patients With Heart Failure From the American Heart Association Get With the Guidelines Program. Circulation, 2009.
  4. Lee et al., Prediction of Heart Failure Mortality in Emergent Care: A Cohort Study. Annals of Internal Medicine, 2012.
  5. Okazaki et al., New scoring system (APACHE-HF) for predicting adverse outcomes in patients with acute heart failure: Evaluation of the APACHE II and Modified APACHE II scoring systems. Journal of Cardiology, 2014.
  6. Salah et al., A novel discharge risk model for patients hospitalised for acute decompensated heart failure incorporating N-terminal pro-B-type natriuretic peptide levels: a European coLlaboration on Acute decompeNsated Heart Failure: ÉLAN-HF Score. Heart, 2013.
  7. Lee et al., Predicting Mortality Among Patients Hospitalized for Heart Failure. JAMA, 2003.
  8. O’Connor et al., Triage After Hospitalization With Advanced Heart Failure: The ESCAPE (Evaluation Study of Congestive Heart Failure and Pulmonary Artery Catheterization Effectiveness) Risk Model and Discharge Score. Journal of the American College of Cardiology, 2010.

6.5.2.1 Criteria 1: Patients diagnosed with acute heart failure

To select patients with acute heart failure, we first constructed a cohort of all heart failure patients, denoted by the ICD9 code 4280, then subtracted those that had an additional chronic heart failure diagnosis (codes 42822, 42823, 42832, 42833, 42842 & 42843).

Note: To successfully run the MIMIC-III data analytics example below, you need to have a credentialed PhysioNet account. If you don’t have CITI certification and/or PhysioNet access, you can just review the MIMIC-III BigQuery description/metadata and skip over this example.

#Connect to the MIMIC-III database, eicu_crd_demo dataset
con <- dbConnect(bigquery(),
                 project = "physionet-data",
                 dataset = "eicu_crd_demo",
                 billing = billing #set billing to Project id
)

sql <- "SELECT *
        FROM [physionet-data.mimiciii_clinical.diagnoses_icd]
        WHERE icd9_code IN ('4280')
        AND hadm_id NOT IN (
            SELECT hadm_id
            FROM [physionet-data.mimiciii_clinical.diagnoses_icd]
            WHERE icd9_code IN ('42822', '42823', '42832', '42833', '42842', '42843') 
            GROUP BY hadm_id
        )"
cohort <- query_exec(sql, project=billing)

6.5.2.2 Criteria 2: Subject survived to ICU discharge

As our target is predicting out-of-hospital mortality due to acute heart failure, we exclude patients who expired in the hospital.

adm.tbl <- query_exec(
  sprintf( "SELECT HADM_ID, DISCHTIME, DEATHTIME
        FROM [physionet-data.mimiciii_clinical.admissions] 
        WHERE hadm_id IN (%s)",
      paste(cohort$HADM_ID, collapse=", ")),
  project=billing, max_pages = Inf)
# Merge with cohort object
cohort <- merge(cohort, adm.tbl, by="HADM_ID")
# Remove subjects with an in-hospital date of death (i.e., they have a deathtime in the admissions table)
cohort <- cohort[is.na(cohort$DEATHTIME), ]

6.5.2.3 Criteria 3: Length of stay between 24-hours and 45 days

admissions.tbl <- query_exec(
  sprintf( "SELECT *
            FROM [physionet-data.mimiciii_clinical.admissions] 
            WHERE hadm_id IN (%s)", 
          paste(cohort$HADM_ID, collapse=", ")),
  project=billing, max_pages = Inf)
# Calculate length of stay
admissions.tbl$LOS <- difftime(admissions.tbl$DISCHTIME, admissions.tbl$ADMITTIME, units='days')
# Merge with cohort object
cohort <- merge(cohort, admissions.tbl[, c('SUBJECT_ID', 'HADM_ID', 'DISCHTIME', 'ADMITTIME', 'LOS')], by=c('SUBJECT_ID', 'HADM_ID'))
# Plot length of stay before removal
fig <- ggplot(cohort, aes(x=LOS)) +
  geom_histogram() + 
  ggtitle('Length of Stay') + xlab('Days')
print(fig)

# Remove encounters where LOS falls outside bounds
cohort <- cohort[cohort$LOS > 1 & cohort$LOS <= 45, ]
cohorttemp <- cohort

6.5.2.4 Criteria 4: Patients 18 years or older

To select patients 18 or older, we first join the patients table and the admissions table to get the admission ID and the date of birth (DOB) of each patient. All dates in the database have been shifted to protect patient confidentiality: dates are internally consistent for the same patient but randomly distributed in the future, so dates of birth that occur in the present time are not true dates of birth. Furthermore, dates of birth before the year 1900 indicate that the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300 to obscure their age and comply with HIPAA; that is, the patient’s age at their first admission was determined, and the date of birth was then set to exactly 300 years before that first admission. Therefore, we also extract the first admission time so we can calculate the age of the patient at the time of first admission. Finally, we exclude those younger than 18 at the time of first admission.

sql <- sprintf(
       "SELECT admissions.HADM_ID as admissions_HADM_ID, admissions.ADMITTIME as admissions_ADMITTIME, patients.DOD as DOD, patients.DOD_HOSP as DOD_HOSP, patients.DOD_SSN as DOD_SSN, patients.DOB as patients_DOB
        FROM [physionet-data.mimiciii_clinical.admissions] AS admissions
        JOIN [physionet-data.mimiciii_clinical.patients] AS patients
        ON admissions.subject_id = patients.subject_id
        WHERE admissions.hadm_id IN (%s)",
        paste(cohort$HADM_ID, collapse = ", "))
age.tbl <- query_exec(sql, project=billing)
age.tbl$age <- difftime(age.tbl$admissions_ADMITTIME, age.tbl$patients_DOB, units='days')
age.tbl <- data.frame(age.tbl[, c('admissions_HADM_ID', 'age', 'DOD', 'DOD_HOSP', 'DOD_SSN')])
colnames(age.tbl) <- c('HADM_ID', 'AGE', 'DOD', 'DOD_HOSP', 'DOD_SSN')
cohort <- merge(cohort, age.tbl, by="HADM_ID")
cohort$AGE <- as.numeric(cohort$AGE) / 365
cohort <- cohort[cohort$AGE < 90 & cohort$AGE >= 18, ]
# Plot the distribution of subject age
fig <- ggplot(cohort, aes(x=AGE)) +
  geom_histogram() + 
  ggtitle('Subject Age') + xlab('Years')
print(fig)

6.5.3 Data extraction
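The helper function below retrieves, for each admission in the cohort, the last recorded event (largest row_id) among a given set of item codes from the specified MIMIC-III table, and returns only the requested fields.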

pull.last.event <- function(hadm_id, itemcodes, return.fields, table, project, max_pages=Inf) {
  # For each admission (hadm_id), locate the last recorded event (largest row_id)
  # among the specified item codes, then join back to the full table to recover
  # the complete rows. In legacy BigQuery SQL, the unnamed aggregates
  # MAX(charttime) and MAX(row_id) are auto-named f0_ and f1_, hence the join
  # condition on table2.f1_.
  sql <- sprintf("
                 SELECT *
                 FROM [physionet-data.mimiciii_clinical.%s] AS table1
                 INNER JOIN (
                   SELECT hadm_id, MAX(charttime), MAX(row_id)
                   FROM [physionet-data.mimiciii_clinical.%s]
                   WHERE itemid IN (%s)
                   AND hadm_id IN (%s)
                   GROUP BY hadm_id
                 ) AS table2
                 ON table1.row_id = table2.f1_",
                 table, table, 
                 paste(itemcodes, collapse=", "),
                 paste(hadm_id, collapse=", ")
  )
  
  data <- query_exec(sql, project=project, max_pages=max_pages)
  # Strip the "table1_"/"table2_" prefixes the join adds to the column names
  colnames(data) <- gsub('table[0-9]_', '', colnames(data))
  return(data[ , return.fields])
}

6.5.3.1 Systolic Blood Pressure

sbp.itemcodes <- c( 6, 51, 442, 3313, 3315, 3317, 3321, 3323, 3325, 6701, 228152, 224167, 227243, 220050, 220179, 225309 )
return.fields <- c('SUBJECT_ID', 'HADM_ID', 'ITEMID', 'CHARTTIME', 'VALUENUM', 'VALUEUOM')
data.sbp <- pull.last.event(cohort$HADM_ID, sbp.itemcodes, return.fields, 'chartevents', billing, max_pages=1)

#disconnect from Google BigQuery
dbDisconnect(con)

7 Additional Applications

There are several additional applications of RODBC that may be useful.

7.1 Cross-Data-Source Projects

You can connect to multiple databases simultaneously and query them at the same time in one project, using R as an intermediary to transfer data across platforms. These databases include local databases, which can also be created through R.

Here we demonstrate how to create a local SQL database in memory and connect to it at the same time as the UCSC Genome database. We will copy some data from the Genome database into the local database.

#Create an ephemeral in-memory RSQLite database and connect to it through DBI
# install.packages("RSQLite")
#remove.packages("devtools")
#install.packages("devtools", repos="http://cran.rstudio.com/", dependencies=TRUE)
#remove.packages("RSQLite")
#install.packages("RSQLite", repos="http://cran.rstudio.com/", dependencies=TRUE)
#install.packages("rlang", repos="http://cran.rstudio.com/", dependencies=TRUE)
#install.packages("usethis", repos="http://cran.rstudio.com/", dependencies=TRUE)
library(devtools)
## Loading required package: usethis
library("RSQLite")
## 
## Attaching package: 'RSQLite'
## The following object is masked from 'package:RMySQL':
## 
##     isIdCurrent
library(RODBC)
LocalConnection <- dbConnect(RSQLite::SQLite(), ":memory:")

#Connect to UCSC Genome database
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')
#Copy data from database into dataframe
df_selected <- dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' AND owner = \'Yale\' limit 10')
df_selected
##                 name     class owner
## 1  PGOHUM00000232565 Ambiguous  Yale
## 2  PGOHUM00000232568 Ambiguous  Yale
## 3  PGOHUM00000232574 Ambiguous  Yale
## 4  PGOHUM00000232584 Ambiguous  Yale
## 5  PGOHUM00000232585 Ambiguous  Yale
## 6  PGOHUM00000232590 Ambiguous  Yale
## 7  PGOHUM00000232593 Ambiguous  Yale
## 8  PGOHUM00000232594 Ambiguous  Yale
## 9  PGOHUM00000232597 Ambiguous  Yale
## 10 PGOHUM00000232601 Ambiguous  Yale
#Write dataframe into local database
dbWriteTable(LocalConnection,"UCSC",df_selected)
dbListTables(LocalConnection)
## [1] "UCSC"
dbGetQuery(LocalConnection,'select * from UCSC')
##                 name     class owner
## 1  PGOHUM00000232565 Ambiguous  Yale
## 2  PGOHUM00000232568 Ambiguous  Yale
## 3  PGOHUM00000232574 Ambiguous  Yale
## 4  PGOHUM00000232584 Ambiguous  Yale
## 5  PGOHUM00000232585 Ambiguous  Yale
## 6  PGOHUM00000232590 Ambiguous  Yale
## 7  PGOHUM00000232593 Ambiguous  Yale
## 8  PGOHUM00000232594 Ambiguous  Yale
## 9  PGOHUM00000232597 Ambiguous  Yale
## 10 PGOHUM00000232601 Ambiguous  Yale
#Disconnect and cleanup
dbDisconnect(LocalConnection)
dbDisconnect(ucscGenomeConn)
## [1] TRUE

7.2 JOINs, EXCEPT, INTERSECT

Structured Query Language joins are SQL instructions combining data from two or more datasets or relational tables. Let’s examine the main differences among the various JOINs.

  • Inner Join/Join
    • Selects all rows from both tables as long as the join condition is met.
    • Similar to finding the intersection of two sets.
  • Left Join
    • Selects all rows from the first table and the matching rows from the second table.
    • The joined table has the same number of rows as the first table.
  • Right Join
    • Similar to a Left Join, but all rows of the second table are selected instead.
  • Full Join
    • Selects all rows from both tables.
    • Similar to finding the union of two sets.

We will also be demonstrating the EXCEPT and INTERSECT operators from the previous section:

  • “EXCEPT” selects rows in one table that are not present in another table.
    • SELECT columnA FROM Table1 EXCEPT SELECT columnA FROM Table2
  • “INTERSECT” selects rows that are present in both tables.
    • SELECT columnA FROM Table1 INTERSECT SELECT columnA FROM Table2

These will be demonstrated with the pseudoYale60 dataset here:

library("RSQLite")
library(RODBC)
LocalConnection <- dbConnect(RSQLite::SQLite(), ":memory:")

#Connect to UCSC Genome database
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')
#Copy data from the database into the local database. Since the two tables have been pre-matched, we import different numbers of rows so that we can artificially create differences between the two tables' keys
df_selected <- dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class limit 12345')
dbWriteTable(LocalConnection,"Class",df_selected)
df_selected <- dbGetQuery(ucscGenomeConn,'select * from pseudoYale60 limit 10000')
dbWriteTable(LocalConnection,"Yale",df_selected)

#Display columns of table 'Class'
dbListFields(LocalConnection,'Class')
## [1] "name"  "class" "owner"
#Display number of rows of table 'Class'
nrow(dbGetQuery(LocalConnection,'select * from Class'))
## [1] 12345
#Display columns of table 'Yale'
dbListFields(LocalConnection,'Yale')
##  [1] "bin"        "name"       "chrom"      "strand"     "txStart"   
##  [6] "txEnd"      "cdsStart"   "cdsEnd"     "exonCount"  "exonStarts"
## [11] "exonEnds"
#Display number of rows of table 'Yale'
nrow(dbGetQuery(LocalConnection,'select * from Yale'))
## [1] 10000
#Inner join the two tables on 'name' column
InnerJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Class INNER JOIN Yale on Class.name = Yale.name')
#Display number of rows of the resulting table. It's smaller than both original tables
nrow(InnerJoin)
## [1] 6567
#Left join the two tables, with 'Class' as the left table, on the 'name' column
LeftJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Class LEFT JOIN Yale on Class.name = Yale.name')
#Display number of rows of the resulting table. It's the same size as the left table 'Class'
nrow(LeftJoin)
## [1] 12345
#Left join the two tables, with 'Yale' as the left table, on the 'name' column
LeftJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Yale LEFT JOIN Class on Class.name = Yale.name')
#Display number of rows of the resulting table. It's the same size as the left table 'Yale'
nrow(LeftJoin)
## [1] 10000
#Use except to select names from Yale that are not in Class
EXCEPT1 <- dbGetQuery(LocalConnection, "select name from Yale except select name from Class")
nrow(EXCEPT1)
## [1] 3433
#Use except to select names from Class that are not in Yale
EXCEPT2 <- dbGetQuery(LocalConnection, "select name from Class except select name from Yale")
nrow(EXCEPT2)
## [1] 5778
#Use intersect to select names that are in both Class and Yale
INTERSECT <- dbGetQuery(LocalConnection, "select name from Yale intersect select name from Class")
nrow(INTERSECT)
## [1] 6567
#The number of rows in Yale that are not in Class, plus the number of rows in both Yale and Class, adds up to the total number of rows in Yale
nrow(EXCEPT1)+nrow(INTERSECT)
## [1] 10000
#The number of rows in Class that are not in Yale, plus the number of rows in both Yale and Class, adds up to the total number of rows in Class
nrow(EXCEPT2)+nrow(INTERSECT)
## [1] 12345
#As right join and Full Join are not supported on SQLite, we will display the syntax here and demonstrate later with BigQuery.
#RightJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Class RIGHT JOIN Yale on Class.name = Yale.name')
#nrow(RightJoin)
#FullJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Class FULL JOIN Yale on Class.name = Yale.name')
#nrow(FullJoin)

#Disconnect and cleanup
dbDisconnect(LocalConnection)
dbDisconnect(ucscGenomeConn)
## [1] TRUE

7.3 Database Management with SQL

The user’s access level on the database limits the types of queries that can be sent. However, with administrator access, there are a few more important SQL statements to remember:

  • “CREATE TABLE (…)” generates a new table and requires a definition of the table’s schema.
    • CREATE TABLE (…) AS SELECT * FROM (…) LIMIT 0 copies the schema of an existing table.
  • “INSERT INTO (…) SELECT (…) FROM (…)” takes data from one table and puts it into an existing table with a known schema and structure. The inserted data must have the same structure.
    • “INSERT INTO (table_name) (column1, column2, column3, …) VALUES (value1, value2, value3, …)” appends rows to a table directly, without selecting from another table.
  • “UPDATE (…) SET (…)” changes values in a table.
  • “ALTER TABLE (…) ADD/DROP/ALTER COLUMN (…)” changes the schema of a table.
  • “DROP TABLE (…)” removes a table, and the data within it, from a database.

These statements will be demonstrated below using an ephemeral, local SQLite database, where the initial data is copied from the UCSC Genome database (similar to what we did earlier). Keep in mind that dbExecute() is needed for these administrator-level commands.

#Create an ephemeral in-memory RSQLite database and connect to it through DBI
# install.packages("RSQLite")
library("RSQLite")
library(RODBC)
LocalConnection <- dbConnect(RSQLite::SQLite(), ":memory:")

#Connect to UCSC Genome database
ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

#Copy data from database into dataframe
df_selected <- dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' AND owner = \'Yale\' limit 10')

#Write dataframe into local database
dbWriteTable(LocalConnection,"UCSC",df_selected)
dbListTables(LocalConnection)
## [1] "UCSC"
#Alternatively, create a table UCSC_Insert that has the same schema as UCSC
fb <- dbExecute(LocalConnection,'CREATE TABLE UCSC_Insert AS SELECT * FROM UCSC LIMIT 0')
dbListTables(LocalConnection)
## [1] "UCSC"        "UCSC_Insert"
dbListFields(LocalConnection,'UCSC_Insert')
## [1] "name"  "class" "owner"
#Copy the first 4 rows from table 'UCSC' into the new table UCSC_Insert.
fb <- dbExecute(LocalConnection,'INSERT INTO UCSC_Insert SELECT * FROM UCSC LIMIT 4')
dbGetQuery(LocalConnection,'SELECT * FROM UCSC_Insert')
##                name     class owner
## 1 PGOHUM00000232565 Ambiguous  Yale
## 2 PGOHUM00000232568 Ambiguous  Yale
## 3 PGOHUM00000232574 Ambiguous  Yale
## 4 PGOHUM00000232584 Ambiguous  Yale
#Append a new row to table UCSC_Insert.
fb <- dbExecute(LocalConnection,"INSERT INTO UCSC_Insert (name,class,owner) VALUES ('123','456','789')")
dbGetQuery(LocalConnection,'SELECT * FROM UCSC_Insert')
##                name     class owner
## 1 PGOHUM00000232565 Ambiguous  Yale
## 2 PGOHUM00000232568 Ambiguous  Yale
## 3 PGOHUM00000232574 Ambiguous  Yale
## 4 PGOHUM00000232584 Ambiguous  Yale
## 5               123       456   789
#Update one row of table UCSC_Insert's 'owner' column to be 'ELAY'
fb <- dbExecute(LocalConnection,"UPDATE UCSC_Insert SET owner = 'ELAY' WHERE name = 'PGOHUM00000232568'")
dbGetQuery(LocalConnection,'SELECT * FROM UCSC_Insert')
##                name     class owner
## 1 PGOHUM00000232565 Ambiguous  Yale
## 2 PGOHUM00000232568 Ambiguous  ELAY
## 3 PGOHUM00000232574 Ambiguous  Yale
## 4 PGOHUM00000232584 Ambiguous  Yale
## 5               123       456   789
#change name of table UCSC_Insert to 'UCSC_Insert2'
fb <- dbExecute(LocalConnection,"ALTER TABLE UCSC_Insert RENAME TO UCSC_Insert2")
dbListTables(LocalConnection)
## [1] "UCSC"         "UCSC_Insert2"
#add new column to table UCSC_Insert2 named 'newOwner', with string as its datatype
fb <- dbExecute(LocalConnection,"ALTER TABLE UCSC_Insert2 ADD COLUMN newOwner string")
dbGetQuery(LocalConnection,'SELECT * FROM UCSC_Insert2')
##                name     class owner newOwner
## 1 PGOHUM00000232565 Ambiguous  Yale       NA
## 2 PGOHUM00000232568 Ambiguous  ELAY       NA
## 3 PGOHUM00000232574 Ambiguous  Yale       NA
## 4 PGOHUM00000232584 Ambiguous  Yale       NA
## 5               123       456   789       NA
#remove table UCSC_Insert2
fb <- dbExecute(LocalConnection,"DROP TABLE UCSC_Insert2")
dbListTables(LocalConnection)
## [1] "UCSC"
#Disconnect and cleanup
fb <- dbDisconnect(LocalConnection)
fb <- dbDisconnect(ucscGenomeConn)

7.4 Querying Automation

As SQL cannot automate its own execution, R can serve both as an automation tool and as a user interface. Users can write SQL either as strings in an R script or in separate .sql files. R can then take the strings or input files, parse them appropriately, and pass the SQL commands to the database one after another. This keeps all SQL statements in one place, where they are easy to change, and avoids the risk of being able to fetch data only after an entire batch of SQL commands has been executed.

Let’s demonstrate this with an example.

#Setup connections 
#install.packages("RSQLite")
library(devtools)
library(RODBC)
library("RSQLite")
LocalConnection <- dbConnect(RSQLite::SQLite(), ":memory:")

ucscGenomeConn <- dbConnect(MySQL(),
                user='genome',
                dbname='hg19',
                host='genome-mysql.soe.ucsc.edu')

#and move relevant tables into an ephemeral in-memory SQLite database
df_selected <- dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' AND owner = \'Yale\' limit 10')
dbWriteTable(LocalConnection,"UCSC",df_selected)
dbListTables(LocalConnection)
## [1] "UCSC"
#Keep all SQL statements in one place
#Here they are stored as strings in a vector, but you can write them outside of the R script in csv, txt, sql, or other formats, and read them in as input
#Note: the fourth statement below is intentionally malformed (missing SET), to demonstrate the error handling
SQLString <- c("CREATE TABLE UCSC_Insert AS SELECT * FROM UCSC LIMIT 0",
               "INSERT INTO UCSC_Insert SELECT * FROM UCSC LIMIT 4",
               "INSERT INTO UCSC_Insert (name,class,owner) VALUES ('123','456','789')",
               "UPDATE UCSC_Insert    owner = 'ELAY' WHERE name = 'PGOHUM00000232568'",
               "ALTER TABLE UCSC_Insert RENAME TO UCSC_Insert2",
               "ALTER TABLE UCSC_Insert2 ADD COLUMN newOwner string")

#Use a for-loop to automate the execution of these SQL statements
for (i in 1:length(SQLString)) {
  possibleError <- tryCatch({fb <- dbExecute(LocalConnection, SQLString[i])},
                            error = function(e) {print(c('error at SQL Line # ', i))}
                            )
  if(inherits(possibleError, "error")){
    next
  }
}
## [1] "error at SQL Line # " "4"
#Check the result of these statements
dbListTables(LocalConnection)
## [1] "UCSC"         "UCSC_Insert2"
dbListFields(LocalConnection, 'UCSC_Insert2')
## [1] "name"     "class"    "owner"    "newOwner"
#Store the data into a dataframe in R
#Data can also be exported out of R in formats that you define
result <- dbGetQuery(LocalConnection,'SELECT * FROM UCSC_Insert2')
result
##                name     class owner newOwner
## 1 PGOHUM00000232565 Ambiguous  Yale       NA
## 2 PGOHUM00000232568 Ambiguous  Yale       NA
## 3 PGOHUM00000232574 Ambiguous  Yale       NA
## 4 PGOHUM00000232584 Ambiguous  Yale       NA
## 5               123       456   789       NA
#Disconnect and clean up
fb <- dbDisconnect(LocalConnection)
fb <- dbDisconnect(ucscGenomeConn)

In the code above, we used a for-loop to execute the SQL commands, wrapping each execution in tryCatch so that when an error occurs, an error message is printed and the remaining SQL commands keep running. This is especially useful when the dataset is large and the SQL commands do not depend on each other: you can then run multiple lines of SQL over a long period of time (e.g., overnight) without worrying about the process stopping in the middle.

In the example above, we intentionally made the fourth SQL command syntactically wrong, and the tryCatch call successfully detected this error and continued executing the remaining SQL commands after printing the error message.
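To keep the SQL outside the R script, as mentioned above, the statements can also be stored in an external file and split on the statement terminator. Here is a minimal sketch, where the file name statements.sql is a hypothetical placeholder:

# read semicolon-terminated statements from an external .sql file
# sqlText <- paste(readLines("statements.sql"), collapse = "\n")
# SQLString <- trimws(strsplit(sqlText, ";")[[1]])
# SQLString <- SQLString[nzchar(SQLString)]   # drop empty trailing fragments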
