forked from datacarpentry/R-genomics
-
Notifications
You must be signed in to change notification settings - Fork 5
/
05-r-and-sql.Rmd
203 lines (160 loc) · 6.56 KB
/
05-r-and-sql.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
---
title: "SQL databases and R"
author: Data Carpentry contributors
---
```{r, echo=FALSE, purl=FALSE, message = FALSE, results='hide'}
source("setup.R")
if (file.exists("portalR.db")) file.remove("portalR.db")
```
------------
> ## Learning Objectives
>
> * Students will be able to connect to an existing database using R
> * Query an existing database through R using queries constructed with
> variables
> * Create a database and populate it from R
> * Understand that scripted database interactions enhance reproducibility of
> analysis
------------
# Introduction
Interacting with databases through scripted languages can offer advantages over
querying databases via a GUI interface. A GUI interface for your database is
easier to use and allows the quick viewing of adhoc queries. Querying a
database with a programatic interface (in this case R, but it could be any
language) are slightly more complicated. However the trade-off is that data
manipulations are preserved in the code. Aggregations, summaries and other
database operations are preserved. Therefore those pre-analysis data
manipulation steps are not lost and can be reproduced later by yourself or
others.
# Connecting R to sqlite databases
R can connect to databases through a number of packages. In our case we we will
use [RSQLite](http://cran.r-project.org/web/packages/RSQLite/index.html) to
connect to existing SQLite3 databases. However you should be able to connect on
almost any database in R via
JDBC[http://cran.r-project.org/web/packages/RJDBC/index.html](http://cran.r-project.org/web/packages/RJDBC/index.html)
or [ODBC](http://cran.r-project.org/web/packages/RODBC/index.html), or specific
database packages (such as we are doing, or
[MySQL](http://cran.r-project.org/web/packages/RMySQL/index.html) ).
To begin these exercises we'll connect to the mammal database you've just
created.
```{r connect, purl=FALSE}
library(RSQLite)
## Set dbname and driver out of convenience
myDB <- "data/portal_mammals.sqlite"
conn <- dbConnect(drv = SQLite(), dbname= myDB)
```
Now you're connected to your database, you can get information about the tables
in the database. While SQLite manager will provide this for you, you'll need to
get it yourself with some basic SQL commands
# Basic queries
You can grab information about all tables like this:
```{r Get Table information, eval=FALSE}
dbGetQuery(conn, "SELECT type, tbl_name FROM sqlite_master")
```
`RSQLite` has some functions that easily list basic information about your
database and tables. Here you can see the types and names of fields and get a
count of records.
```{r table details, purl=FALSE}
dbListTables(conn)
dbListFields(conn, "surveys")
dbGetQuery(conn, "SELECT count(*) FROM surveys")
```
Great! That's all there is to it. You are now connected to your database and
know something about the tables in it. From here you can manipulate it the
exact same way you can from SQLite Manager except in a scripted manner. Let's
try some basic queries from the previous lesson. Querying simply takes a
connection to a database and query as inputs and returns a dataframe with the
results.
```{r basic queries, purl=FALSE}
q <- 'SELECT DISTINCT year, species_id FROM surveys'
result <- dbGetQuery(conn, q)
head(result)
```
> ## Challenge
>
> Write a query that gets counts of genus by plot type. If this seems to
> advanced, then get counts of genus by plot_id (eliminates one of the >joins)
### Answer
```{r genus by type, purl=FALSE}
q <- "SELECT d.plot_type , c.genus, count(*)
FROM
(SELECT a.genus, b.plot
FROM species a
JOIN surveys b
ON a.species_id = b.species_id) c
JOIN plots d
ON c.plot = d.plot_id
GROUP BY d.plot_type,c.genus"
result <- dbGetQuery(conn,q)
head(result)
```
In the above answer you'll note that I use a common technique in SQL queries
called aliasing. Aliasing is simply a way to reference a table or sub query,
usually within a join. In the above answer aliases allow an easy way to give a
shorthand name to a table, e.g. `plots d`. Sub queries also require an alias if
you want to use them in a join.
All we've done so far is execute the same sorts of queries that can easily be
made with a GUI. Now let's try leveraging the power of scripted queries.
Imagine you want to know how many rodents were found every other year. To get
this we'll get the range of dates from the database, sequence them by two and
make new queries.
```{r Get years of data, purl=FALSE}
yearRange <- dbGetQuery(conn,"SELECT min(year),max(year) FROM surveys")
years <- seq(yearRange[,1],yearRange[,2],by=2)
```
Next we'll build our query string using the `paste()` function.
```{r creating query, purl=FALSE}
q <- paste("
SELECT a.year,b.taxa,count(*) as count
FROM surveys a
JOIN species b
ON a.species_id = b.species_id
AND b.taxa = 'Rodent'
AND a.year in (",
paste(years,collapse=",")
,")
GROUP BY a.year, b.taxa",
sep = "" )
rCount <- dbGetQuery(conn,q)
head(rCount)
```
With the nested paste commands we were able to construct a query
programatically, without having to type out all the years. This could also be
done with a for loop, especially if the query to be constructed is more
complicated.
# Building your workflow
Up until now we been working with an existing data. However we can use R as a
way to build up databases from existing flat files. We'll use the flat files
that make up the mammals database to recreate it. First let's read in the
files.
```{r Generate a dataframe, purl=FALSE}
species <- read.csv("data/species.csv")
surveys <- read.csv("data/surveys.csv")
plots <- read.csv("data/plots.csv")
```
We have some data now. Next we'll need to create a database and add our tables.
```{r create database, purl=FALSE}
myDB <- "portalR.db"
myConn <- dbConnect(drv = SQLite(), dbname= myDB)
dbListTables(myConn)
```
By opening up a connection, you've created a new database. If you list the
tables you'll see that the database is empty. Next we'll add the dataframes we
just read in to the database.
```{r add data, purl=FALSE}
dbWriteTable(myConn,"species",species)
dbListTables(myConn)
dbGetQuery(myConn,"SELECT * from species limit 10")
```
If you check the location of your database you'll see that data is automatically
being written to disk. Not only does R and RSQLite provide easy ways to query
existing databases, it also allows you to easily create your own databases from
flat files.
> ## Challenge
>
> Add the remaining tables to the existing database. Query your database from
> SQLite manager just to verify that you're adding data.
> ## Extra Challenge
>
> Run some of your queries from earlier in the lesson to verify that you have
> faithfully recreated the mammals database.