---
title: "Lab07"
author: "Amei Hao"
date: "9/30/2020"
output: html_document
---
```{r setup}
knitr::opts_chunk$set(include = TRUE)
```
# Learning goals
- Use a real-world API to make queries and process the data.
- Use regular expressions to parse the information.
- Practice your GitHub skills.
# Lab description
In this lab, we will be working with the [NCBI API](https://www.ncbi.nlm.nih.gov/home/develop/api/) to make queries and extract information using XML and regular expressions. For this lab, we will be using the `httr`, `xml2`, and `stringr` R packages.
This markdown document should be rendered using the `github_document` output format (i.e., `output: github_document` in the YAML header).
## Question 1: How many SARS-CoV-2 papers?
Build an automatic counter of SARS-CoV-2 papers using PubMed. You will need to apply XPath, as we did during the lecture, to extract the number of results returned by PubMed at the following web address:
```
https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2
```
Complete the lines of code:
```{r counter-pubmed, eval=TRUE}
# Downloading the website
website <- xml2::read_html("https://pubmed.ncbi.nlm.nih.gov/?term=sars-cov-2")
# Finding the counts
counts <- xml2::xml_find_first(website, "/html/body/main/div[9]/div[2]/div[2]/div[1]/span")
# Turning it into text
counts <- as.character(counts)
# Extracting the data using regex
stringr::str_extract(counts, "[0-9,]+")
```
Don't forget to commit your work!
## Question 2: Academic publications on COVID-19 and Hawaii
You will need to query the NCBI E-utilities API. The parameters passed to the query are documented [here](https://www.ncbi.nlm.nih.gov/books/NBK25499/).
Use the function `httr::GET()` to make the following query:
1. Baseline URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
2. Query parameters:
- db: pubmed
- term: covid19 hawaii
- retmax: 1000
```{r papers-covid-hawaii, eval=TRUE}
library(httr)
query_ids <- GET(
  url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
  query = list(
    db     = "pubmed",
    term   = "covid19 hawaii",
    retmax = 1000
  )
)
# Extracting the content of the response of GET
ids <- httr::content(query_ids)
```
The query returns an XML object; we can turn it into a character string with
`as.character()` and analyze the text directly. Another way of processing the
data would be to turn it into a list with the function `xml2::as_list()`. We will skip the
latter for now.
Take a look at the data, and continue with the next question (don't forget to
commit and push your results to your GitHub repo!).
## Question 3: Get details about the articles
The IDs are wrapped in tags in the following way: `<Id>... id number ...</Id>`.
We can use a regular expression to extract that information. Fill out the
following lines of code:
```{r get-ids, eval = TRUE}
# Turn the result into a character vector
ids <- as.character(ids)
# Find all the ids; [[1]] extracts the character vector from the returned list
ids <- stringr::str_extract_all(ids, "<Id>[0-9]+</Id>")[[1]]
# Remove the leading and trailing <Id> </Id>, making use of "|"
ids <- stringr::str_remove_all(ids, "<Id>|</Id>")
```
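An optional sanity check that the extraction worked; a quick sketch using the `ids` vector created above:
```{r check-ids, eval = FALSE}
# How many PubMed IDs did we capture, and what do the first few look like?
length(ids)
head(ids)
```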
With the ids in hand, we can now try to get the abstracts of the papers. As
before, use `httr::GET()` to make the query, this time with:
1. Baseline url: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
2. Query parameters:
- db: pubmed
- id: A character with all the ids separated by comma, e.g., "1232131,546464,13131"
- retmax: 1000
- rettype: abstract
**Pro-tip**: If you want `GET()` to pass a value through literally, wrap it in `I()` (as you would in an R formula). By default, query values are URL-encoded, so the text `"123,456"` becomes `"123%2C456"`. If you don't want that behavior, use `I("123,456")`.
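Here is a minimal sketch of that encoding behavior. It uses `httr::modify_url()` only to build (not send) a URL, and the `id` value is made up for illustration:
```{r encoding-sketch, eval = FALSE}
# Without I(): the comma is percent-encoded as %2C
httr::modify_url(
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
  query = list(id = "123,456")
)
# With I(): the value is passed through as-is
httr::modify_url(
  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
  query = list(id = I("123,456"))
)
```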
```{r get-abstracts, eval = TRUE}
publications <- GET(
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
query = list(
db = "pubmed",
id = paste(ids, collapse = ","),
retmax = 1000,
rettype = "abstract"
)
)
# Turning the output into a character vector
publications <- httr::content(publications)
publications_txt <- as.character(publications)
```
With this in hand, we can now analyze the data. This is also a good time for committing and pushing your work!
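Before moving on, an optional sketch to get a sense of what we just pulled:
```{r peek-efetch-text, eval = FALSE}
# publications_txt is one long string containing the full efetch XML
nchar(publications_txt)
substr(publications_txt, 1, 300)
```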
## Question 4: Distribution of universities, schools, and departments
Using the function `stringr::str_extract_all()` applied on `publications_txt`, capture all the terms of the form:
1. University of ...
2. ... Institute of ...
Write a regular expression that captures all such instances.
```{r univ-institute-regex, eval = TRUE}
library(stringr)
institution <- str_extract_all(
publications_txt,
"University\\s+of\\s+[[:alpha:]]+|[[:alpha:]]+\\s+Institute\\s+of\\s+[[:alpha:]]+"
)
institution <- unlist(institution)
table(institution)
```
Repeat the exercise, this time focusing on schools and departments of the form:
1. School of ...
2. Department of ...
Then tabulate the results:
```{r school-department, eval = TRUE}
schools_and_deps <- str_extract_all(
publications_txt,
"School of\\s[[:alpha:]]+|Department of\\s[[:alpha:]]+"
)
schools_and_deps <- unlist(schools_and_deps)
table(schools_and_deps)
```
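Since these tables can be long, it may help to look only at the most frequent matches; an optional sketch:
```{r top-schools-deps, eval = FALSE}
# The ten most frequently mentioned schools and departments
head(sort(table(schools_and_deps), decreasing = TRUE), 10)
```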
## Question 5: Form a database
We want to build a dataset which includes the title and the abstract of each
paper. The title of all records is enclosed by the XML tag `ArticleTitle`, and
the abstract by `Abstract`.
Before applying the functions to extract text directly, it will help to process
the XML a bit. We will use the `xml2::xml_children()` function to keep one element
per id. This way, if a paper is missing the abstract, or something else, we will be able to properly match PubMed IDs with their corresponding records.
```{r one-string-per-response, eval = TRUE}
pub_char_list <- xml2::xml_children(publications)
pub_char_list <- sapply(pub_char_list, as.character)
```
Now, extract the abstract and article title for each one of the elements of
`pub_char_list`. You can either use `sapply()` as we just did, or simply
take advantage of the vectorization of `stringr::str_extract()`.
```{r extracting-last-bit, eval = TRUE}
abstracts <- str_extract(pub_char_list, "<Abstract>(\\n|.)+</Abstract>")
abstracts <- str_remove_all(abstracts, "</?[^>]+>") # also drops tags with attributes, e.g. <AbstractText Label="...">
abstracts <- str_replace_all(abstracts, "\\s+"," ")
table(is.na(abstracts))
```
How many of these don't have an abstract? Now, the titles:
```{r process-titles, eval = TRUE}
titles <- str_extract(pub_char_list, "<ArticleTitle>(\\n|.)+</ArticleTitle>")
titles <- str_remove_all(titles, "</?[[:alnum:]]+>")
titles <- str_replace_all(titles, "\\s+"," ")
```
Finally, put everything together into a single `data.frame` and use
`knitr::kable()` to print the results.
```{r build-db, eval = TRUE}
database <- data.frame(
PubMedID = ids,
Title = titles,
  Abstracts = abstracts
)
knitr::kable(database)
```
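If you also want to keep the dataset outside of the rendered document, one option is to write it to a CSV file (a sketch; the file name is arbitrary):
```{r save-db, eval = FALSE}
# Save the table for later use; the file name here is just an example
write.csv(database, "covid19-hawaii-pubmed.csv", row.names = FALSE)
```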
Done! Knit the document, commit, and push.
## Final Pro Tip (optional)
You can still share the HTML document on GitHub. You can include a link in your `README.md` file like the following:
```md
View [here](https://ghcdn.rawgit.org/:user/:repo/:tag/:file)
```
For example, if we wanted to add a direct link to the HTML page of lecture 7, we could do something like the following:
```md
View [here](https://ghcdn.rawgit.org/USCbiostats/PM566/master/static/slides/07-apis-regex/slides.html)
```