In this DSPA Appendix, we will present some common strategies to interact with relational databases.
The Structured Query Language (SQL) is used to communicate with complex databases (DBs) that include large, structured, and relational information. SQL is the de facto standard language for managing, storing, and retrieving information from relational database management systems. SQL statements may be used to update data elements in a database, retrieve data from a database, or query the meta-data in the system. Many relational database management systems, like Oracle, Sybase, Microsoft SQL Server, Access, and Ingres, provide SQL application programming interfaces (APIs). There are also no-SQL database systems, e.g., MongoDB, that provide more flexibility by relaxing the relational requirement of SQL systems. In many situations, relational SQL DBs are impractical for large amounts of heterogeneous, incongruent, and rapidly-changing data that blends structured, semi-structured, unstructured, and polymorphic data elements. We will predominantly focus on the standard SQL commands, such as SELECT, INSERT, UPDATE, DELETE, CREATE, and DROP, to interact with external databases. The MongoDB noSQL tutorial provides information on interacting with noSQL DBs. In this module, we mostly focus on data import from SQL databases.
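As a schematic preview, here is a minimal sketch of these six statement types executed from R against a throwaway in-memory SQLite database (the table name and values are hypothetical; the DBI and RSQLite packages used here are introduced later in this appendix):
library(DBI); library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE demo (id INTEGER, label TEXT)")  # Create a table
dbExecute(con, "INSERT INTO demo VALUES (1, 'a'), (2, 'b')")  # Insert two rows
dbExecute(con, "UPDATE demo SET label = 'z' WHERE id = 2")    # Update one row
dbGetQuery(con, "SELECT * FROM demo")                         # Select all rows
dbExecute(con, "DELETE FROM demo WHERE id = 1")               # Delete one row
dbExecute(con, "DROP TABLE demo")                             # Drop the table
dbDisconnect(con)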
Let’s start by exploring R-based data import from SQL databases. First, we need to install and load the RODBC (R Open Database Connectivity) package.
# install.packages("RODBC", repos = "http://cran.us.r-project.org")
library(RODBC)
Then, we will open a connection to the SQL server database with a Data Source Name (DSN), using Microsoft Access. More details are provided in the Microsoft documentation and in the RODBC documentation.
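A minimal RODBC sketch, assuming a DSN named “myDSN” (hypothetical) has already been configured in the system’s ODBC manager, with placeholder credentials:
library(RODBC)
ch <- odbcConnect("myDSN", uid = "myUser", pwd = "myPassword")  # open a channel to the DSN
sqlTables(ch)                                 # list the tables visible through this channel
res <- sqlQuery(ch, "SELECT * FROM myTable")  # send an SQL query; 'myTable' is hypothetical
odbcClose(ch)                                 # close the channel when done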
As a demonstration, we will be connecting with username “genome” to a MySQL database hg19 hosted by the genome.ucsc.edu server.
# install.packages("RMySQL");
library(RMySQL)
# install.packages("RODBC");
library(RODBC)
# install.packages("DBI");
library(DBI)
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
The RODBC library also allows connecting to other relational databases, such as SQL Server, or database-like platforms, such as Dremio. In order to connect to each of these data sources, the specific driver for that data source needs to be downloaded and installed in R.
If you want to connect to multiple SQL data sources simultaneously within one R project, establish multiple connections in a fashion similar to the above, as sketched below.
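For instance, a hedged sketch holding two connections at once (the second, in-memory SQLite database stands in for any additional data source):
library(RMySQL); library(RSQLite)
con1 <- dbConnect(MySQL(), user='genome', dbname='hg19',
                  host='genome-mysql.soe.ucsc.edu')
con2 <- dbConnect(RSQLite::SQLite(), ":memory:")
# ... query con1 and con2 independently ...
dbDisconnect(con1); dbDisconnect(con2)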
After all needed jobs are done on the database, it is important to disconnect from it; the connection isn’t closed automatically, and there can be a limit on how many simultaneous connections can be open in R.
Here we will demonstrate how to see all active connections and how to close connections.
# install.packages("DBI");
library(DBI)
# install.packages("RMySQL");
library(RMySQL)
# install.packages("RODBC");
library(RODBC)
#display all current connections to MySQL databases
dbListConnections(MySQL())
## list()
#close connection [1] to MySQL database
#dbDisconnect(dbListConnections(dbDriver(drv="MySQL"))[[1]])
#list and close all connections to MySQL databases
lapply(dbListConnections(MySQL()), dbDisconnect)
## list()
#setup connection
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
#disconnect current session from the database
dbDisconnect(ucscGenomeConn)
## [1] TRUE
There are several other important functions in these database packages, besides dbConnect, dbDisconnect, and dbListConnections, that can be useful for database users. These functions are demonstrated below.
# install.packages("RMySQL");
library(RMySQL)
# install.packages("RODBC");
library(RODBC)
#setup connection
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
#Store the names of all the tables in the database into 'allTables' and display the total number of tables in the database
allTables <- dbListTables(ucscGenomeConn); length(allTables)
## [1] 12551
#List the fields in the table "affyU133Plus2"
dbListFields(ucscGenomeConn, "affyU133Plus2")
## [1] "bin" "matches" "misMatches" "repMatches" "nCount"
## [6] "qNumInsert" "qBaseInsert" "tNumInsert" "tBaseInsert" "strand"
## [11] "qName" "qSize" "qStart" "qEnd" "tName"
## [16] "tSize" "tStart" "tEnd" "blockCount" "blockSizes"
## [21] "qStarts" "tStarts"
#Read the table "affyU133Plus2" into "affyData" and display the first few lines
affyData <- dbReadTable(ucscGenomeConn, "affyU133Plus2"); head(affyData)
## bin matches misMatches repMatches nCount qNumInsert qBaseInsert tNumInsert
## 1 585 530 4 0 23 3 41 3
## 2 585 3355 17 0 109 9 67 9
## 3 585 4156 14 0 83 16 18 2
## 4 585 4667 9 0 68 21 42 3
## 5 585 5180 14 0 167 10 38 1
## 6 585 468 5 0 14 0 0 0
## tBaseInsert strand qName qSize qStart qEnd tName tSize tStart
## 1 898 - 225995_x_at 637 5 603 chr1 249250621 14361
## 2 11621 - 225035_x_at 3635 0 3548 chr1 249250621 14381
## 3 93 - 226340_x_at 4318 3 4274 chr1 249250621 14399
## 4 5743 - 1557034_s_at 4834 48 4834 chr1 249250621 14406
## 5 29 - 231811_at 5399 0 5399 chr1 249250621 19688
## 6 0 - 236841_at 487 0 487 chr1 249250621 27542
## tEnd blockCount
## 1 15816 5
## 2 29483 17
## 3 18745 18
## 4 24893 23
## 5 25078 11
## 6 28029 1
## blockSizes
## 1 93,144,229,70,21,
## 2 73,375,71,165,303,360,198,661,201,1,260,250,74,73,98,155,163,
## 3 690,10,32,33,376,4,5,15,5,11,7,41,277,859,141,51,443,1253,
## 4 99,352,286,24,49,14,6,5,8,149,14,44,98,12,10,355,837,59,8,1500,133,624,58,
## 5 131,26,1300,6,4,11,4,7,358,3359,155,
## 6 487,
## qStarts
## 1 34,132,278,541,611,
## 2 87,165,540,647,818,1123,1484,1682,2343,2545,2546,2808,3058,3133,3206,3317,3472,
## 3 44,735,746,779,813,1190,1195,1201,1217,1223,1235,1243,1285,1564,2423,2565,2617,3062,
## 4 0,99,452,739,764,814,829,836,842,851,1001,1016,1061,1160,1173,1184,1540,2381,2441,2450,3951,4103,4728,
## 5 0,132,159,1460,1467,1472,1484,1489,1497,1856,5244,
## 6 0,
## tStarts
## 1 14361,14454,14599,14968,15795,
## 2 14381,14454,14969,15075,15240,15543,15903,16104,16853,17054,17232,17492,17914,17988,18267,24736,29320,
## 3 14399,15089,15099,15131,15164,15540,15544,15549,15564,15569,15580,15587,15628,15906,16857,16998,17049,17492,
## 4 14406,20227,20579,20865,20889,20938,20952,20958,20963,20971,21120,21134,21178,21276,21288,21298,21653,22492,22551,22559,24059,24211,24835,
## 5 19688,19819,19845,21145,21151,21155,21166,21170,21177,21535,24923,
## 6 27542,
The function dbReadTable() is equivalent to the SQL command SELECT * FROM affyU133Plus2.
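That is, the following two calls are expected to return the same data frame (a small illustration using the open connection):
affyData1 <- dbReadTable(ucscGenomeConn, "affyU133Plus2")
affyData2 <- dbGetQuery(ucscGenomeConn, "SELECT * FROM affyU133Plus2")
identical(dim(affyData1), dim(affyData2))   # expected to be TRUE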
After establishing the connection to a SQL data source, you can pass SQL commands as strings to the database. Here we will demonstrate such interactions with several basic SQL commands.
# set up connection
<- dbConnect(MySQL(),
ucscGenomeConn user='genome',
dbname='hg19',
host='genome-mysql.soe.ucsc.edu')
# select the top 10 rows of the genoName column from table rmsk with dbSendQuery and dbClearResult.
rs <- dbSendQuery(ucscGenomeConn, 'select genoName from rmsk limit 10')
dbClearResult(rs)
## [1] TRUE
# select the top 10 rows of genoName column from table rmsk with dbGetQuery
df <- dbGetQuery(ucscGenomeConn, 'select genoName from rmsk limit 10')
df
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE
The difference between dbSendQuery and dbGetQuery is the following: dbSendQuery only submits the query and requires a separate call to dbClearResult() to close the query activity before any other interaction with the server, while dbGetQuery closes the query activity on the database automatically. As a result, dbGetQuery is more suitable for running quick queries on the server and retrieving data immediately, while dbSendQuery is more useful when running queries on larger datasets and when immediate results are not required.
Both of these functions are intended for data consumers only, so only SELECT statements should be sent through them. If the user has higher-level access to the database and intends to manipulate data, dbExecute is needed. This is demonstrated (but not executed) below.
# create a new empty table named newTable using dbExecute
#dbExecute(ucscGenomeConn,'create table newTable')
When using dbGetQuery, data is automatically retrieved, or “fetched”, from the server. However, this is not always what is intended. If the goal is to query first and fetch at a later time, a combination of dbSendQuery and fetch can be used. Here is an example.
# set up connection
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
# select the top 10 rows of the genoName column from table rmsk with dbSendQuery and fetch.
rs <- dbSendQuery(ucscGenomeConn, 'select genoName from rmsk limit 10')
fetch(rs)
## genoName
## 1 chr1
## 2 chr1
## 3 chr1
## 4 chr1
## 5 chr1
## 6 chr1
## 7 chr1
## 8 chr1
## 9 chr1
## 10 chr1
dbClearResult(rs)
## [1] TRUE
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE
Regardless of which function is used to query the database, the results must be fetched into R if analysis with R is intended.
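With dbSendQuery, the pending results can also be fetched in chunks, which is handy for large result sets; a hedged sketch using the DBI generics dbFetch and dbHasCompleted:
# retrieve a result set in chunks of 5 rows (sketch; connection pattern as above)
ucscGenomeConn <- dbConnect(MySQL(), user='genome', dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
rs <- dbSendQuery(ucscGenomeConn, 'select genoName from rmsk limit 15')
while (!dbHasCompleted(rs)) {
  chunk <- dbFetch(rs, n = 5)   # pull the next 5 rows
  print(nrow(chunk))
}
dbClearResult(rs)
dbDisconnect(ucscGenomeConn)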
Here are a few important SQL commands that can be used when working with databases.
“SELECT (…) FROM (…)” selects all rows (unless further specified) of specific columns of data from a table in the database. It can be used in combination with the following clauses, which are demonstrated in the code below:
- “DISTINCT” selects data where replicated rows with the same values for all specified columns are removed.
- “WHERE” selects data where values at specified columns meet the conditions defined after this clause.
- “AND/OR/NOT” are logic operators that allow setting complex requirements for the WHERE clause.
- “GROUP BY” separates the table into groups by values in a specified column. The number of groups created is equal to the number of distinct values in the column specified.
- “AS” renames a column, or the results yielded from operations on it, to another specified name.
- “HAVING” works similarly to the WHERE clause in terms of qualifying values, but differs in that it allows conditioning on aggregate functions like COUNT() and works best with GROUP BY.
- “ORDER BY” rearranges the rows by specified orders on the values of specified column(s).
- “LIMIT” sets a limit on how many rows of data are selected, which is useful when working with large datasets.
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
#The tables we will be working with: pseudoYale60 and pseudoYale60Class from the hg19 database:
dbListFields(ucscGenomeConn, "pseudoYale60")
## [1] "bin" "name" "chrom" "strand" "txStart"
## [6] "txEnd" "cdsStart" "cdsEnd" "exonCount" "exonStarts"
## [11] "exonEnds"
dbListFields(ucscGenomeConn, "pseudoYale60Class")
## [1] "name" "class" "owner"
# select the top 10 rows of all columns from table pseudoYale60Class.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class limit 10')
## name class owner
## 1 PGOHUM00000232565 Ambiguous Yale
## 2 PGOHUM00000232567 Processed Yale
## 3 PGOHUM00000232568 Ambiguous Yale
## 4 PGOHUM00000232569 Duplicated Yale
## 5 PGOHUM00000232570 Duplicated Yale
## 6 PGOHUM00000232573 Processed Yale
## 7 PGOHUM00000232574 Ambiguous Yale
## 8 PGOHUM00000232575 Duplicated Yale
## 9 PGOHUM00000232579 Processed Yale
## 10 PGOHUM00000232580 Duplicated Yale
# select the top 10 rows of the class column from table pseudoYale60Class.
dbGetQuery(ucscGenomeConn,'select class from pseudoYale60Class limit 10')
## class
## 1 Ambiguous
## 2 Processed
## 3 Ambiguous
## 4 Duplicated
## 5 Duplicated
## 6 Processed
## 7 Ambiguous
## 8 Duplicated
## 9 Processed
## 10 Duplicated
# select the top 10 distinct rows of the class column from table pseudoYale60Class.
dbGetQuery(ucscGenomeConn,'select distinct class from pseudoYale60Class limit 10')
## class
## 1 Ambiguous
## 2 Processed
## 3 Duplicated
# select the top 10 rows of name and class columns from table pseudoYale60Class where class is 'Ambiguous'.
dbGetQuery(ucscGenomeConn,'select name, class from pseudoYale60Class WHERE class = \'Ambiguous\' limit 10')
## name class
## 1 PGOHUM00000232565 Ambiguous
## 2 PGOHUM00000232568 Ambiguous
## 3 PGOHUM00000232574 Ambiguous
## 4 PGOHUM00000232584 Ambiguous
## 5 PGOHUM00000232585 Ambiguous
## 6 PGOHUM00000232590 Ambiguous
## 7 PGOHUM00000232593 Ambiguous
## 8 PGOHUM00000232594 Ambiguous
## 9 PGOHUM00000232597 Ambiguous
## 10 PGOHUM00000232601 Ambiguous
# select the top 10 rows of all columns from table pseudoYale60Class where class is 'Ambiguous' and owner is 'Yale'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' AND owner = \'Yale\' limit 10')
## name class owner
## 1 PGOHUM00000232565 Ambiguous Yale
## 2 PGOHUM00000232568 Ambiguous Yale
## 3 PGOHUM00000232574 Ambiguous Yale
## 4 PGOHUM00000232584 Ambiguous Yale
## 5 PGOHUM00000232585 Ambiguous Yale
## 6 PGOHUM00000232590 Ambiguous Yale
## 7 PGOHUM00000232593 Ambiguous Yale
## 8 PGOHUM00000232594 Ambiguous Yale
## 9 PGOHUM00000232597 Ambiguous Yale
## 10 PGOHUM00000232601 Ambiguous Yale
# select the top 10 rows of all columns from table pseudoYale60Class where class is 'Ambiguous' or owner is 'Yale'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' OR owner = \'Yale\' limit 10')
## name class owner
## 1 PGOHUM00000232565 Ambiguous Yale
## 2 PGOHUM00000232567 Processed Yale
## 3 PGOHUM00000232568 Ambiguous Yale
## 4 PGOHUM00000232569 Duplicated Yale
## 5 PGOHUM00000232570 Duplicated Yale
## 6 PGOHUM00000232573 Processed Yale
## 7 PGOHUM00000232574 Ambiguous Yale
## 8 PGOHUM00000232575 Duplicated Yale
## 9 PGOHUM00000232579 Processed Yale
## 10 PGOHUM00000232580 Duplicated Yale
# select the top 10 rows of all columns from table pseudoYale60Class where class is 'Ambiguous' or 'Processed'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE class = \'Ambiguous\' OR class = \'Processed\' limit 10')
## name class owner
## 1 PGOHUM00000232565 Ambiguous Yale
## 2 PGOHUM00000232567 Processed Yale
## 3 PGOHUM00000232568 Ambiguous Yale
## 4 PGOHUM00000232573 Processed Yale
## 5 PGOHUM00000232574 Ambiguous Yale
## 6 PGOHUM00000232579 Processed Yale
## 7 PGOHUM00000232581 Processed Yale
## 8 PGOHUM00000232582 Processed Yale
## 9 PGOHUM00000232584 Ambiguous Yale
## 10 PGOHUM00000232585 Ambiguous Yale
# select the top 10 rows of all columns from table pseudoYale60Class where class is not 'Ambiguous'.
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60Class WHERE NOT class = \'Ambiguous\' limit 10')
## name class owner
## 1 PGOHUM00000232567 Processed Yale
## 2 PGOHUM00000232569 Duplicated Yale
## 3 PGOHUM00000232570 Duplicated Yale
## 4 PGOHUM00000232573 Processed Yale
## 5 PGOHUM00000232575 Duplicated Yale
## 6 PGOHUM00000232579 Processed Yale
## 7 PGOHUM00000232580 Duplicated Yale
## 8 PGOHUM00000232581 Processed Yale
## 9 PGOHUM00000232582 Processed Yale
## 10 PGOHUM00000232583 Duplicated Yale
# select class and how many names in the class from pseudoYale60Class
dbGetQuery(ucscGenomeConn,'select count(name) as number_of_names, class from pseudoYale60Class GROUP BY class')
## number_of_names class
## 1 6409 Ambiguous
## 2 2631 Duplicated
## 3 8836 Processed
# select class and how many names in the class from pseudoYale60Class if the number of names in the class is greater than 4000
dbGetQuery(ucscGenomeConn,'select count(name) as number_of_names, class from pseudoYale60Class group by class having count(name) > 4000')
## number_of_names class
## 1 6409 Ambiguous
## 2 8836 Processed
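The ORDER BY clause is not exercised above; a hedged example that sorts the grouped counts in decreasing order (using the same open connection):
# order the class counts from largest to smallest
dbGetQuery(ucscGenomeConn, 'select count(name) as number_of_names, class from pseudoYale60Class group by class order by number_of_names desc')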
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE
Let’s now show several clauses that are useful when working across multiple tables:
- “JOIN/LEFT JOIN/RIGHT JOIN/INNER JOIN” clauses combine two tables where they have one or more columns sharing the same type of data.
- “USING” can be an alternative to the ON clause in JOINs. It allows joining on columns of the same name, while ON lets you specify and join columns with different names.
- “UNION” appends one table to another where the columns of the two tables have the same data types; replicated rows are removed.
- “UNION ALL” appends one table to another where the columns of the two tables have the same data types, without removing replicates.
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
#The tables we will be working with: pseudoYale60 and pseudoYale60Class from the hg19 database
#joins the pseudoYale60 table with pseudoYale60Class table, where each row is matched on the name column
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60 join pseudoYale60Class on pseudoYale60.name = pseudoYale60Class.name limit 10')
## bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount
## 1 585 PGOHUM00000244550 chr1 + 12981 13670 0 0 3
## 2 585 PGOHUM00000251045 chr1 - 16607 18381 0 0 6
## 3 585 PGOHUM00000244551 chr1 + 52475 53354 0 0 1
## 4 585 PGOHUM00000244552 chr1 + 62951 63851 0 0 1
## 5 585 PGOHUM00000244553 chr1 + 65887 65974 0 0 1
## 6 73 PGOHUM00000243857 chr1 - 120819 133580 0 0 3
## 7 73 PGOHUM00000243858 chr1 - 131029 131544 0 0 1
## 8 586 PGOHUM00000244554 chr1 + 131124 135623 0 0 1
## 9 586 PGOHUM00000251046 chr1 - 135714 136507 0 0 1
## 10 586 PGOHUM00000244555 chr1 + 135956 136244 0 0 1
## exonStarts exonEnds
## 1 12981,13224,13448, 13056,13377,13670,
## 2 16607,16854,17229,17601,17914,18265, 16766,17052,17364,17745,18058,18381,
## 3 52475, 53354,
## 4 62951, 63851,
## 5 65887, 65974,
## 6 120819,129028,133367, 120930,129232,133580,
## 7 131029, 131544,
## 8 131124, 135623,
## 9 135714, 136507,
## 10 135956, 136244,
## name class owner
## 1 PGOHUM00000244550 Duplicated Yale
## 2 PGOHUM00000251045 Duplicated Yale
## 3 PGOHUM00000244551 Processed Yale
## 4 PGOHUM00000244552 Processed Yale
## 5 PGOHUM00000244553 Ambiguous Yale
## 6 PGOHUM00000243857 Duplicated Yale
## 7 PGOHUM00000243858 Processed Yale
## 8 PGOHUM00000244554 Processed Yale
## 9 PGOHUM00000251046 Processed Yale
## 10 PGOHUM00000244555 Ambiguous Yale
#alternatively, using the USING clause
dbGetQuery(ucscGenomeConn,'select * from pseudoYale60 join pseudoYale60Class USING (name) limit 10')
## name bin chrom strand txStart txEnd cdsStart cdsEnd exonCount
## 1 PGOHUM00000244550 585 chr1 + 12981 13670 0 0 3
## 2 PGOHUM00000251045 585 chr1 - 16607 18381 0 0 6
## 3 PGOHUM00000244551 585 chr1 + 52475 53354 0 0 1
## 4 PGOHUM00000244552 585 chr1 + 62951 63851 0 0 1
## 5 PGOHUM00000244553 585 chr1 + 65887 65974 0 0 1
## 6 PGOHUM00000243857 73 chr1 - 120819 133580 0 0 3
## 7 PGOHUM00000243858 73 chr1 - 131029 131544 0 0 1
## 8 PGOHUM00000244554 586 chr1 + 131124 135623 0 0 1
## 9 PGOHUM00000251046 586 chr1 - 135714 136507 0 0 1
## 10 PGOHUM00000244555 586 chr1 + 135956 136244 0 0 1
## exonStarts exonEnds
## 1 12981,13224,13448, 13056,13377,13670,
## 2 16607,16854,17229,17601,17914,18265, 16766,17052,17364,17745,18058,18381,
## 3 52475, 53354,
## 4 62951, 63851,
## 5 65887, 65974,
## 6 120819,129028,133367, 120930,129232,133580,
## 7 131029, 131544,
## 8 131124, 135623,
## 9 135714, 136507,
## 10 135956, 136244,
## class owner
## 1 Duplicated Yale
## 2 Duplicated Yale
## 3 Processed Yale
## 4 Processed Yale
## 5 Ambiguous Yale
## 6 Duplicated Yale
## 7 Processed Yale
## 8 Processed Yale
## 9 Processed Yale
## 10 Ambiguous Yale
#Append the first 10 names from pseudoYale60Class to the first 10 names of pseudoYale60; replicated rows are removed
dbGetQuery(ucscGenomeConn,'(select name from pseudoYale60 limit 10) union (select name from pseudoYale60Class limit 10)')
## name
## 1 PGOHUM00000232565
## 2 PGOHUM00000232567
## 3 PGOHUM00000232568
## 4 PGOHUM00000232569
## 5 PGOHUM00000232570
## 6 PGOHUM00000232573
## 7 PGOHUM00000232574
## 8 PGOHUM00000232575
## 9 PGOHUM00000232579
## 10 PGOHUM00000232580
#Append the first 10 names from pseudoYale60Class to the first 10 names of pseudoYale60
dbGetQuery(ucscGenomeConn,'(select name from pseudoYale60 limit 10) union all (select name from pseudoYale60Class limit 10)')
## name
## 1 PGOHUM00000232565
## 2 PGOHUM00000232567
## 3 PGOHUM00000232568
## 4 PGOHUM00000232569
## 5 PGOHUM00000232570
## 6 PGOHUM00000232573
## 7 PGOHUM00000232574
## 8 PGOHUM00000232575
## 9 PGOHUM00000232579
## 10 PGOHUM00000232580
## 11 PGOHUM00000232565
## 12 PGOHUM00000232567
## 13 PGOHUM00000232568
## 14 PGOHUM00000232569
## 15 PGOHUM00000232570
## 16 PGOHUM00000232573
## 17 PGOHUM00000232574
## 18 PGOHUM00000232575
## 19 PGOHUM00000232579
## 20 PGOHUM00000232580
#disconnect from database
dbDisconnect(ucscGenomeConn)
## [1] TRUE
As demonstrated, the USING clause eliminates the repeated name column, while the ON clause version of JOIN keeps both columns from both tables.
The difference between JOIN, LEFT JOIN, RIGHT JOIN, INNER JOIN, and FULL JOIN will be explained later.
Here are a few other multi-table operators that can be useful:
- “EXCEPT” returns the rows of one query’s result that do not appear in the result of a second query.
- “INTERSECT” returns only the rows that appear in the results of both queries.
We will be demonstrating these two operators later with the JOINs.
BigQuery
RODBC can also connect to PaaS providers, such as the Google Cloud Platform (GCP), and use them as data sources. Utilizing such platforms allows faster query execution and minimizes the usage of local memory.
For smaller-scale querying operations, BigQuery is free for public use, but it charges a fee for heavy usage. That’s why it is important to specify billing information when querying.
Fees for BigQuery are billed on a per-project basis; when supplying billing information, just provide the project that should be billed. If you are not an eligible member of the project with permission to bill it, billing that project will be declined.
Many public libraries and projects make their data publicly available, but they don’t allow the public to bill these projects. That means users who want to query data from a public library need to additionally provide another project that Google can bill.
Here we will connect to a sample dataset in Google Cloud Platform’s list of open libraries and query from it.
In order to use BigQuery functionalities on GCP, a Google Cloud Platform account is needed, with a role such as BigQuery Admin or a similar one that has the permission to create jobs remotely on BigQuery. The BigQuery console is the primary interface for setting up data sources; we will come to this interface only when we need to add new data sources and datasets, and the rest of the time we only need R to access and query data from GCP.
We will now demonstrate connecting to your project and querying the bigquery-public-data.samples.shakespeare table.
#install.packages("bigrquery")
library(DBI)
library(RODBC)
library(bigrquery)
#set 'billing' to the project that will be billed
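# Assumption: 'billing' holds the id of a GCP project you are permitted to bill;
# the value below is a hypothetical placeholder -- replace it with your own project id.
billing <- "my-gcp-project-id"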
#Connect to the project
con <- dbConnect(bigquery(),
                 project = "bigquery-public-data",
                 billing = billing)
#list first few lines of shakespeare table
<- "SELECT * FROM [bigquery-public-data.samples.shakespeare] LIMIT 5"
sql <- query_exec(sql, project=billing)
result result
## word word_count corpus corpus_date
## 1 LVII 1 sonnets 0
## 2 augurs 1 sonnets 0
## 3 dimm'd 1 sonnets 0
## 4 plagues 1 sonnets 0
## 5 treason 1 sonnets 0
#disconnect from Google BigQuery
dbDisconnect(con)
Here, we will demonstrate how to connect to a MIMIC-III dataset through Google Cloud’s BigQuery platform and query some data.
Similar to Case Study I, MIMIC-III is a project available on BigQuery, so we will need to pin this project to the BigQuery interface. However, since this project is not in the public library, we will need to gain access to it first and then pin it.
Access to MIMIC-III is granted on a per-email basis. Follow these steps to gain access to MIMIC-III for the email address registered for the project owner:
- Register on PhysioNet and become a credentialed user.
- In the Credentialing tab of the page, click on “apply for access”.
After gaining access to the MIMIC-III data (it may take up to 2 weeks to process the application), follow these steps to add the MIMIC-III project to BigQuery:
- In the Resources tab, click on ADD DATA, then Pin a project; the pinned project will then appear in the BigQuery Resources tab.
After these steps are done, you can query the MIMIC-III dataset directly from R. However, you first need to link the PhysioNet and BigQuery accounts by choosing Request access using Google BigQuery.
#install.packages("bigrquery")
library(DBI)
library(RODBC)
library(bigrquery)
#Connect to the MIMIC III database, eicu_crd_demo dataset
con <- dbConnect(bigquery(),
                 project = "physionet-data",
                 dataset = "eicu_crd_demo",
                 billing = billing)
#list tables in eciu_crd_demo
dbListTables(con)
## [1] "admissiondrug" "admissiondx"
## [3] "allergy" "apacheapsvar"
## [5] "apachepatientresult" "apachepredvar"
## [7] "careplancareprovider" "careplaneol"
## [9] "careplangeneral" "careplangoal"
## [11] "careplaninfectiousdisease" "customlab"
## [13] "diagnosis" "hospital"
## [15] "infusiondrug" "intakeoutput"
## [17] "lab" "medication"
## [19] "microlab" "note"
## [21] "nurseassessment" "nursecare"
## [23] "nursecharting" "pasthistory"
## [25] "patient" "physicalexam"
## [27] "respiratorycare" "respiratorycharting"
## [29] "treatment" "vitalaperiodic"
## [31] "vitalperiodic"
#list fields in table infusiondrug
dbListFields(con,"infusiondrug")
## [1] "infusiondrugid" "patientunitstayid" "infusionoffset"
## [4] "drugname" "drugrate" "infusionrate"
## [7] "drugamount" "volumeoffluid" "patientweight"
#list fields in table medication
dbListFields(con,"medication")
## [1] "medicationid" "patientunitstayid" "drugorderoffset"
## [4] "drugstartoffset" "drugivadmixture" "drugordercancelled"
## [7] "drugname" "drughiclseqno" "dosage"
## [10] "routeadmin" "frequency" "loadingdose"
## [13] "prn" "drugstopoffset" "gtc"
#query from table infusiondrug
<- "SELECT drugname FROM [physionet-data.eicu_crd_demo.infusiondrug] group by drugname limit 10"
sql <- query_exec(sql, project=billing) result
## Warning: 'query_exec' is deprecated.
## Use 'bq_perform_query' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'run_query_job' is deprecated.
## Use 'bq_perform_query' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'insert_query_job' is deprecated.
## Use 'bq_perform_query' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'wait_for' is deprecated.
## Use 'bq_job_wait' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'get_job' is deprecated.
## Use 'bq_job_meta' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## 0 bytes processed
## Warning: 'list_tabledata' is deprecated.
## Use 'bq_table_download' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
## Warning: 'get_table' is deprecated.
## Use 'bq_table_meta' instead.
## See help("Deprecated") and help("bigrquery-deprecated").
result
## drugname
## 1 LR
## 2 NS
## 3 D5w
## 4 NSS
## 5 TPA
## 6 0.9% NS
## 7 Heparin
## 8 Insulin
## 9 (ml/hr)
## 10 Dopamine
#disconnect from Google BigQuery
dbDisconnect(con)
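As the deprecation warnings above indicate, recent bigrquery releases replace query_exec() and its helpers with the bq_*() family. A hedged sketch of the modern equivalent of the query above (note that bq_project_query() defaults to standard SQL, so the table name uses backticks rather than the legacy bracket syntax):
# run the query billed to 'billing', then download the results into a data frame
tb <- bq_project_query(billing,
        "SELECT drugname FROM `physionet-data.eicu_crd_demo.infusiondrug`
         GROUP BY drugname LIMIT 10")
result <- bq_table_download(tb)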
The following section of R code was developed by the University of Michigan SOCR Data Analytics MIMIC-III team and posted on GitHub. The following are excerpts from the project:
Heart failure (HF) is a “multifactorial, systemic disease” in which a network of autoprotective mechanisms activated after cardiac injury cause significant problems with heart function (Tanai & Frantz, 2015). HF is one of the largest contributing factors to mortality in the United States, playing a role in 1 in 9 deaths and accounting for more than $30 billion in annual healthcare expenditures (CDC.gov, Mozaffarian et. al). In addition to the estimated 5.7 million Americans living with HF, an additional 915,000 are newly diagnosed each year (Mozaffarian et. al). Despite recent advancements in the treatment of underlying risk factors, long-term mortality remains high with less than half of those newly diagnosed patients surviving for five years (Mozaffarian et. al).
Risk stratification systems have become an important part of patient management, and have the potential to “improve clinical outcome and resource allocation” by “avoiding the overtreatment of low-risk subjects or the early, inappropriate discharge of high-risk patients” (Passantino et. al, 2015). In their comprehensive 2015 publication, Andrea Passantino and colleagues review a collection of the “most relevant” risk stratification systems targeted at acute heart failure (AHF) including EFFECT, OPTIMIZE-HF, APACHE-HF, and ESCAPE, among others. Passantino and her team describe the wide array of data sources and techniques used by the original authors to validate these scoring systems, including “public registries, clinical trials, and retrospective data” (Passantino et. al, 2015). The heterogeneity of these data sources makes direct performance comparisons difficult.
In this document, we aim to extend the work contributed by Passantino and colleagues and provide a direct comparison of the performance of these AHF risk stratification metrics on a single, unified dataset. To this end, we leverage the Medical Information Mart for Intensive Care (MIMIC-III), a dataset developed by the Massachusetts Institute of Technology Lab for Computational Physiology (MIT-LCP) which contains de-identified health data for more than 40,000 intensive care unit (ICU) patients over the years 2001-2012 (Johnson et. al, 2012). It includes much of the same data commonly found in electronic health records (EHR), including demographic information, diagnosis codes, procedures, laboratory values, vital signs, free-text clinical notes, and admission, discharge, and mortality information. Moreover, it is large and comprehensive enough to serve as a proving ground for each of the individual risk stratification systems, allowing a level playing field from which to make direct comparisons of the efficacy and performance of each proposed metric.
In the spirit of SOCR’s open science initiative, this document and its parent repository contain the complete end-to-end computational protocol, results, and validation procedures, subdivided into several partitions.
#install.packages("bigrquery")
library(DBI)
library(RODBC)
library(bigrquery)
#install.packages("knitr")
#install.packages("kableExtra")
#install.packages("ggplot2")
#install.packages("ggpubr")
#install.packages("magrittr")
library("magrittr")
library('knitr') # Make knitted tables ("kables")
library('kableExtra') # Extra kable formatting options
library('ggplot2') # Plotting library
library('ggpubr')
As the MIMIC-III dataset contains a wide variety of patients, those suitable for our particular analysis must first be isolated. Based on the manuscripts cited in Passantino et al.’s review, we developed a set of inclusion criteria to select such a cohort. As implemented in the code below, these criteria include:
- a diagnosis of acute heart failure (ICD-9 code 4280) without an accompanying chronic heart failure diagnosis,
- survival to hospital discharge (no in-hospital death),
- a length of stay between 1 and 45 days, and
- an age between 18 and 89 at the time of first admission.
These criteria were inspired by the inclusion criteria schemata described in the references cited above.
To select patients with acute heart failure, we first constructed a cohort of all heart failure patients, denoted by the ICD9 code 4280, then subtracted those that had an additional chronic heart failure diagnosis (codes 42822, 42823, 42832, 42833, 42842 & 42843).
Note: To successfully run the MIMIC-III data analytics example below, you need to have a credentialed PhysioNet account. If you don’t have CITI certification and/or PhysioNet access, you can just check the MIMIC-III BigQuery description/meta-data and skip over this example.
#Connect to the MIMIC III database, eicu_crd_demo dataset
con <- dbConnect(bigquery(),
                 project = "physionet-data",
                 dataset = "eicu_crd_demo",
                 billing = billing  # set billing to your project id
)
<- "SELECT *
sql FROM [physionet-data.mimiciii_clinical.diagnoses_icd]
WHERE icd9_code IN ('4280')
AND hadm_id NOT IN (
SELECT hadm_id
FROM [physionet-data.mimiciii_clinical.diagnoses_icd]
WHERE icd9_code IN ('42822', '42823', '42832', '42833', '42842', '42843')
GROUP BY hadm_id
)"
<- query_exec(sql, project=billing) cohort
As our target is predicting out-of-hospital mortality due to acute heart failure, we exclude patients who expired in the hospital.
adm.tbl <- query_exec(
  sprintf("SELECT HADM_ID, DISCHTIME, DEATHTIME
           FROM [physionet-data.mimiciii_clinical.admissions]
           WHERE hadm_id IN (%s)",
          paste(cohort$HADM_ID, collapse=", ")),
  project=billing, max_pages = Inf)
# Merge with cohort object
cohort <- merge(cohort, adm.tbl, by="HADM_ID")
# Remove subjects with an in-hospital date of death (i.e., they have a DEATHTIME in the admissions table)
cohort <- cohort[is.na(cohort$DEATHTIME), ]
admissions.tbl <- query_exec(
  sprintf("SELECT *
           FROM [physionet-data.mimiciii_clinical.admissions]
           WHERE hadm_id IN (%s)",
          paste(cohort$HADM_ID, collapse=", ")),
  project=billing, max_pages = Inf)
# Calculate length of stay
admissions.tbl$LOS <- difftime(admissions.tbl$DISCHTIME, admissions.tbl$ADMITTIME, unit='days')
# Merge with cohort object
cohort <- merge(cohort, admissions.tbl[, c('SUBJECT_ID', 'HADM_ID', 'DISCHTIME', 'ADMITTIME', 'LOS')], by=c('SUBJECT_ID', 'HADM_ID'))
# Plot length of stay before removal
fig <- ggplot(cohort, aes(x=LOS)) +
  geom_histogram() +
  ggtitle('Length of Stay') + xlab('Days')
print(fig)
# Remove encounters where LOS falls outside bounds
cohort <- cohort[cohort$LOS > 1 & cohort$LOS <= 45, ]
cohorttemp <- cohort
To select patients 18 or older, first we join the patients table and the admissions table to get the admissions ID and the date of birth. DOB is the date of birth of the given patient. All dates in the database have been shifted to protect patient confidentiality. Dates are internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300 to obscure their age and comply with HIPAA. The shift process was as follows: the patient’s age at their first admission was determined. The date of birth was then set to exactly 300 years before their first admission. Therefore, we also extract the first admission time so we can calculate the age of the patient at the time of first admission. Finally, we exclude those younger than 18 at the time of first admission.
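A small numeric illustration of this shift (the dates are hypothetical): an obfuscated DOB set exactly 300 years before the first admission produces an apparent age near 300, which the AGE < 90 filter below removes.
admit <- as.Date("2150-06-01")   # hypothetical first admission date
dob <- as.Date("1850-06-01")     # obfuscated DOB, set 300 years earlier
as.numeric(difftime(admit, dob, units = "days")) / 365   # ~300 "years"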
sql <- sprintf(
  "SELECT admissions.HADM_ID as admissions_HADM_ID, admissions.ADMITTIME as admissions_ADMITTIME, patients.DOD as DOD, patients.DOD_HOSP as DOD_HOSP, patients.DOD_SSN as DOD_SSN, patients.DOB as patients_DOB
   FROM [physionet-data.mimiciii_clinical.admissions] AS admissions
   JOIN [physionet-data.mimiciii_clinical.patients] AS patients
   ON admissions.subject_id = patients.subject_id
   WHERE admissions.hadm_id IN (%s)",
  paste(cohort$HADM_ID, collapse = ", "))
age.tbl <- query_exec(sql, project=billing)
age.tbl$age <- difftime(age.tbl$admissions_ADMITTIME, age.tbl$patients_DOB, units='days')
age.tbl <- data.frame(age.tbl[, c('admissions_HADM_ID', 'age', 'DOD', 'DOD_HOSP', 'DOD_SSN')])
colnames(age.tbl) <- c('HADM_ID', 'AGE', 'DOD', 'DOD_HOSP', 'DOD_SSN')
cohort <- merge(cohort, age.tbl, by="HADM_ID")
cohort$AGE <- as.numeric(cohort$AGE) / 365
cohort <- cohort[cohort$AGE < 90 & cohort$AGE >= 18, ]
# Plot subject age after removal
fig <- ggplot(cohort, aes(x=AGE)) +
  geom_histogram() +
  ggtitle('Subject Age') + xlab('Years')
print(fig)
pull.last.event <- function(hadm_id, itemcodes, return.fields, table, project, max_pages=Inf) {
  sql <- sprintf("
    SELECT *
    FROM [physionet-data.mimiciii_clinical.%s] AS table1
    INNER JOIN (
      SELECT hadm_id, MAX(charttime), MAX(row_id)
      FROM [physionet-data.mimiciii_clinical.%s]
      WHERE itemid IN (%s)
      AND hadm_id IN (%s)
      GROUP BY hadm_id
    ) AS table2
    ON table1.row_id = table2.f1_",
    table, table, paste(itemcodes, collapse=", "),
    paste(hadm_id, collapse=", "))
  data <- query_exec(sql, project=project, max_pages=max_pages)
  colnames(data) <- gsub('table[0-9]_', '', colnames(data))
  return(data[ , return.fields])
}
sbp.itemcodes <- c(6, 51, 442, 3313, 3315, 3317, 3321, 3323, 3325, 6701, 228152, 224167, 227243, 220050, 220179, 225309)
return.fields <- c('SUBJECT_ID', 'HADM_ID', 'ITEMID', 'CHARTTIME', 'VALUENUM', 'VALUEUOM')
data.sbp <- pull.last.event(cohort$HADM_ID, sbp.itemcodes, return.fields, 'chartevents', billing, max_pages=1)
#disconnect from Google BigQuery
dbDisconnect(con)
There are several additional applications of RODBC that may be useful.
You can connect to multiple databases simultaneously and query them at the same time in one project. You can thereby transfer data across platforms, utilizing R as an intermediary. These databases include local databases, which can also be created through R.
Here we demonstrate how to create a local SQL database in RAM and connect to it at the same time as the UCSC Genome database. We will copy some data from the Genome database into the local database.
#Create an ephemeral in-memory RSQLite database and connect to it through RODBC
# install.packages("RSQLite")
#remove.packages("devtools")
#install.packages("devtools", repos="http://cran.rstudio.com/", dependencies=TRUE)
#remove.packages("RSQLite")
#install.packages("RSQLite", repos="http://cran.rstudio.com/", dependencies=TRUE)
#install.packages("rlang", repos="http://cran.rstudio.com/", dependencies=TRUE)
#install.packages("usethis", repos="http://cran.rstudio.com/", dependencies=TRUE)
library(devtools)
## Loading required package: usethis
library("RSQLite")
##
## Attaching package: 'RSQLite'
## The following object is masked from 'package:RMySQL':
##
## isIdCurrent
library(RODBC)
LocalConnection <- dbConnect(RSQLite::SQLite(), ":memory:")
#Connect to UCSC Genome database
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
#Copy data from database into dataframe
df_selected <- dbGetQuery(ucscGenomeConn, 'select * from pseudoYale60Class WHERE class = \'Ambiguous\' AND owner = \'Yale\' limit 10')
df_selected
## name class owner
## 1 PGOHUM00000232565 Ambiguous Yale
## 2 PGOHUM00000232568 Ambiguous Yale
## 3 PGOHUM00000232574 Ambiguous Yale
## 4 PGOHUM00000232584 Ambiguous Yale
## 5 PGOHUM00000232585 Ambiguous Yale
## 6 PGOHUM00000232590 Ambiguous Yale
## 7 PGOHUM00000232593 Ambiguous Yale
## 8 PGOHUM00000232594 Ambiguous Yale
## 9 PGOHUM00000232597 Ambiguous Yale
## 10 PGOHUM00000232601 Ambiguous Yale
#Write dataframe into local database
dbWriteTable(LocalConnection,"UCSC",df_selected)
dbListTables(LocalConnection)
## [1] "UCSC"
dbGetQuery(LocalConnection,'select * from UCSC')
## name class owner
## 1 PGOHUM00000232565 Ambiguous Yale
## 2 PGOHUM00000232568 Ambiguous Yale
## 3 PGOHUM00000232574 Ambiguous Yale
## 4 PGOHUM00000232584 Ambiguous Yale
## 5 PGOHUM00000232585 Ambiguous Yale
## 6 PGOHUM00000232590 Ambiguous Yale
## 7 PGOHUM00000232593 Ambiguous Yale
## 8 PGOHUM00000232594 Ambiguous Yale
## 9 PGOHUM00000232597 Ambiguous Yale
## 10 PGOHUM00000232601 Ambiguous Yale
#Disconnect and cleanup
dbDisconnect(LocalConnection)
dbDisconnect(ucscGenomeConn)
## [1] TRUE
Structured Query Language JOINs are SQL instructions combining data from two or more datasets or relational tables. Let’s examine the main differences between the different JOINs:
- “INNER JOIN” (or simply JOIN) returns only the rows whose key values appear in both tables.
- “LEFT JOIN” keeps all rows of the left table and fills unmatched columns from the right table with NULLs.
- “RIGHT JOIN” keeps all rows of the right table and fills unmatched columns from the left table with NULLs.
- “FULL JOIN” keeps all rows from both tables, matching them where possible.
We will also demonstrate the EXCEPT and INTERSECT operators from the previous section. All of these will be demonstrated with the pseudoYale60 tables here:
library("RSQLite")
library(RODBC)
LocalConnection <- dbConnect(RSQLite::SQLite(), ":memory:")
#Connect to UCSC Genome database
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
#Copy data from the database into the local database. Since the two tables have been pre-matched, we load different numbers of rows so that we can artificially create differences between the two tables' keys
df_selected <- dbGetQuery(ucscGenomeConn, 'select * from pseudoYale60Class limit 12345')
dbWriteTable(LocalConnection, "Class", df_selected)
df_selected <- dbGetQuery(ucscGenomeConn, 'select * from pseudoYale60 limit 10000')
dbWriteTable(LocalConnection, "Yale", df_selected)
#Display columns of table 'Class'
dbListFields(LocalConnection,'Class')
## [1] "name" "class" "owner"
#Display number of rows of table 'Class'
nrow(dbGetQuery(LocalConnection,'select * from Class'))
## [1] 12345
#Display columns of table 'Yale'
dbListFields(LocalConnection,'Yale')
## [1] "bin" "name" "chrom" "strand" "txStart"
## [6] "txEnd" "cdsStart" "cdsEnd" "exonCount" "exonStarts"
## [11] "exonEnds"
#Display number of rows of table 'Yale'
nrow(dbGetQuery(LocalConnection,'select * from Yale'))
## [1] 10000
#Inner join the two tables on 'name' column
InnerJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Class INNER JOIN Yale on Class.name = Yale.name')
#Display number of rows of the resulting table. It's smaller than both original tables
nrow(InnerJoin)
## [1] 6567
#Left join the two tables, where the 'Class' table is the left table, on the 'name' column
LeftJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Class LEFT JOIN Yale on Class.name = Yale.name')
#Display number of rows of the resulting table. It's the same size as the left table 'Class'
nrow(LeftJoin)
## [1] 12345
#Left join the two tables, where the 'Yale' table is the left table, on the 'name' column
LeftJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Yale LEFT JOIN Class on Class.name = Yale.name')
#Display number of rows of the resulting table. It's the same size as the left table 'Yale'
nrow(LeftJoin)
## [1] 10000
#Use except to select names from Yale that are not in Class
EXCEPT1 <- dbGetQuery(LocalConnection, "select name from Yale except select name from Class")
nrow(EXCEPT1)
## [1] 3433
#Use except to select names from Class that are not in Yale
EXCEPT2 <- dbGetQuery(LocalConnection, "select name from Class except select name from Yale")
nrow(EXCEPT2)
## [1] 5778
#Use intersect to select names that are in both Class and Yale
INTERSECT <- dbGetQuery(LocalConnection, "select name from Yale intersect select name from Class")
nrow(INTERSECT)
## [1] 6567
#The number of rows in Yale that are not in Class plus the number of rows in both Yale and Class add up to the total number of rows in Yale
nrow(EXCEPT1)+nrow(INTERSECT)
## [1] 10000
#The number of rows in Class that are not in Yale plus the number of rows in both Yale and Class add up to the total number of rows in Class
nrow(EXCEPT2)+nrow(INTERSECT)
## [1] 12345
#As RIGHT JOIN and FULL JOIN are not supported in SQLite, we display the syntax here and will demonstrate them later with BigQuery.
#RightJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Class RIGHT JOIN Yale on Class.name = Yale.name')
#nrow(RightJoin)
#FullJoin <- dbGetQuery(LocalConnection, 'select Class.*, Yale.chrom, Yale.strand from Class FULL JOIN Yale on Class.name = Yale.name')
#nrow(FullJoin)
#Disconnect and cleanup
dbDisconnect(LocalConnection)
dbDisconnect(ucscGenomeConn)
## [1] TRUE
The user’s access level on the database limits the types of queries one can send. However, when one has administrator access, there are a few more important SQL statements to remember:
- “CREATE TABLE” creates a new table (below, one with the same schema as an existing table).
- “INSERT INTO” copies rows from another table, or appends explicitly supplied rows.
- “UPDATE … SET” modifies the values of existing rows that match a WHERE condition.
- “ALTER TABLE” renames a table or adds a column to it.
- “DROP TABLE” removes a table from the database.
These statements will be demonstrated below utilizing an ephemeral local SQLite database, where the initial data is copied from the UCSC Genome database (similar to what we did earlier). Keep in mind that dbExecute() is needed for these administrator-level commands.
#Create an ephemeral in-memory RSQLite database and connect to it through RODBC
# install.packages("RSQLite")
library("RSQLite")
library(RODBC)
LocalConnection <- dbConnect(RSQLite::SQLite(), ":memory:")
#Connect to UCSC Genome database
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
#Copy data from database into dataframe
df_selected <- dbGetQuery(ucscGenomeConn, 'select * from pseudoYale60Class WHERE class = \'Ambiguous\' AND owner = \'Yale\' limit 10')
#Write dataframe into local database
dbWriteTable(LocalConnection,"UCSC",df_selected)
dbListTables(LocalConnection)
## [1] "UCSC"
#Alternatively, create a table UCSC_Insert that has the same schema as UCSC but no rows
fb <- dbExecute(LocalConnection, 'CREATE TABLE UCSC_Insert AS SELECT * FROM UCSC LIMIT 0')
dbListTables(LocalConnection)
## [1] "UCSC" "UCSC_Insert"
dbListFields(LocalConnection,'UCSC_Insert')
## [1] "name" "class" "owner"
#Copy the first four rows of table 'UCSC' into the new table UCSC_Insert.
fb <- dbExecute(LocalConnection, 'INSERT INTO UCSC_Insert SELECT * FROM UCSC LIMIT 4')
dbGetQuery(LocalConnection, 'SELECT * FROM UCSC_Insert')
## name class owner
## 1 PGOHUM00000232565 Ambiguous Yale
## 2 PGOHUM00000232568 Ambiguous Yale
## 3 PGOHUM00000232574 Ambiguous Yale
## 4 PGOHUM00000232584 Ambiguous Yale
#Append a new row to table UCSC_Insert.
<- dbExecute(LocalConnection,"INSERT INTO UCSC_Insert (name,class,owner) VALUES ('123','456','789')")
fb dbGetQuery(LocalConnection,'SELECT * FROM UCSC_Insert')
## name class owner
## 1 PGOHUM00000232565 Ambiguous Yale
## 2 PGOHUM00000232568 Ambiguous Yale
## 3 PGOHUM00000232574 Ambiguous Yale
## 4 PGOHUM00000232584 Ambiguous Yale
## 5 123 456 789
#Update one row of table UCSC_Insert's 'owner' column to be 'ELAY'
fb <- dbExecute(LocalConnection, "UPDATE UCSC_Insert SET owner = 'ELAY' WHERE name = 'PGOHUM00000232568'")
dbGetQuery(LocalConnection, 'SELECT * FROM UCSC_Insert')
## name class owner
## 1 PGOHUM00000232565 Ambiguous Yale
## 2 PGOHUM00000232568 Ambiguous ELAY
## 3 PGOHUM00000232574 Ambiguous Yale
## 4 PGOHUM00000232584 Ambiguous Yale
## 5 123 456 789
#change name of table UCSC_Insert to 'UCSC_Insert2'
<- dbExecute(LocalConnection,"ALTER TABLE UCSC_Insert RENAME TO UCSC_Insert2")
fb dbListTables(LocalConnection)
## [1] "UCSC" "UCSC_Insert2"
#add new column to table UCSC_Insert2 named 'newOwner', with string as its datatype
<- dbExecute(LocalConnection,"ALTER TABLE UCSC_Insert2 ADD COLUMN newOwner string")
fb dbGetQuery(LocalConnection,'SELECT * FROM UCSC_Insert2')
## name class owner newOwner
## 1 PGOHUM00000232565 Ambiguous Yale NA
## 2 PGOHUM00000232568 Ambiguous ELAY NA
## 3 PGOHUM00000232574 Ambiguous Yale NA
## 4 PGOHUM00000232584 Ambiguous Yale NA
## 5 123 456 789 NA
#remove table UCSC_Insert2
<- dbExecute(LocalConnection,"DROP TABLE UCSC_Insert2")
fb dbListTables(LocalConnection)
## [1] "UCSC"
#Disconnect and cleanup
fb <- dbDisconnect(LocalConnection)
fb <- dbDisconnect(ucscGenomeConn)
As SQL has no built-in way to automate its own execution, R can serve both as an automation tool and as a user interface. Users can write SQL either as strings inside an R script or in separate .sql files; R can then take the strings or input files, parse them appropriately, and pass the SQL commands to the database one after another. This lets users maintain all SQL statements in one place, and removes the limitation of being able to fetch data only after an entire SQL script has finished executing.
Let’s demonstrate this with an example.
#Setup connections
#install.packages("RSQLite")
library(devtools)
library(RODBC)
library("RSQLite")
LocalConnection <- dbConnect(RSQLite::SQLite(), ":memory:")
ucscGenomeConn <- dbConnect(MySQL(),
                            user='genome',
                            dbname='hg19',
                            host='genome-mysql.soe.ucsc.edu')
#and move relevant tables into an ephemeral in-memory SQLite database
df_selected <- dbGetQuery(ucscGenomeConn, 'select * from pseudoYale60Class WHERE class = \'Ambiguous\' AND owner = \'Yale\' limit 10')
dbWriteTable(LocalConnection, "UCSC", df_selected)
dbListTables(LocalConnection)
## [1] "UCSC"
#Keep all SQL statements in one place
#Here they are stored as strings in a vector, but you can also write them outside of the R script in csv, txt, sql, or other formats, and read them in as input
<- c("CREATE TABLE UCSC_Insert AS SELECT * FROM UCSC LIMIT 0",
SQLString "INSERT INTO UCSC_Insert SELECT * FROM UCSC LIMIT 4",
"INSERT INTO UCSC_Insert (name,class,owner) VALUES ('123','456','789')",
"UPDATE UCSC_Insert owner = 'ELAY' WHERE name = 'PGOHUM00000232568'",
"ALTER TABLE UCSC_Insert RENAME TO UCSC_Insert2",
"ALTER TABLE UCSC_Insert2 ADD COLUMN newOwner string")
#Use a for-loop to automate the execution of these SQL statements
for (i in 1:length(SQLString)) {
  possibleError <- tryCatch({fb <- dbExecute(LocalConnection, SQLString[i])},
                            error = function(e) {print(c('error at SQL Line # ', i))})
  if (inherits(possibleError, "error")) {
    next
  }
}
## [1] "error at SQL Line # " "4"
#Check the result of these statements
dbListTables(LocalConnection)
## [1] "UCSC" "UCSC_Insert2"
dbListFields(LocalConnection, 'UCSC_Insert2')
## [1] "name" "class" "owner" "newOwner"
#Store the data into a dataframe in R
#Data can also be exported out of R in formats that you define
result <- dbGetQuery(LocalConnection, 'SELECT * FROM UCSC_Insert2')
result
## name class owner newOwner
## 1 PGOHUM00000232565 Ambiguous Yale NA
## 2 PGOHUM00000232568 Ambiguous Yale NA
## 3 PGOHUM00000232574 Ambiguous Yale NA
## 4 PGOHUM00000232584 Ambiguous Yale NA
## 5 123 456 789 NA
#Disconnect and clean up
fb <- dbDisconnect(LocalConnection)
fb <- dbDisconnect(ucscGenomeConn)
In the code above, we used a for-loop to execute the SQL commands. We used tryCatch so that when an error occurs during execution, an error message is printed and the remaining SQL statements keep running. This is especially useful when the dataset is large and the SQL commands don’t depend on each other: you can then run multiple lines of SQL on a large dataset over a long period of time (e.g., overnight) without worrying about the run stopping in the middle.
In the example above, we intentionally made the fourth SQL command syntactically wrong, and the tryCatch call successfully detected this error and continued executing the remaining SQL commands after printing the error message.
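The SQL statements could equally be kept in an external file, as mentioned above; here is a hedged sketch (the file name is hypothetical, and an open connection such as LocalConnection is assumed) that reads one statement per line from a .sql file and executes them in the same fashion:
# read SQL statements from an external file, one statement per line
SQLString <- readLines("my_statements.sql")
for (i in 1:length(SQLString)) {
  tryCatch(dbExecute(LocalConnection, SQLString[i]),
           error = function(e) print(c('error at SQL Line # ', i)))
}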