forked from TheDataMine/the-examples-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path09-think-summer-2020.Rmd
243 lines (155 loc) · 10.2 KB
/
09-think-summer-2020.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
# Think Summer 2020 {#think-summer-2020}
<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/983291/sp/98329100/embedIframeJs/uiconf_id/29134031/partner_id/983291?iframeembed=true&playerId=kaltura_player&entry_id=1_uh7optd7&flashvars[streamerType]=auto&flashvars[localizationCode]=en&flashvars[leadWithHTML5]=true&flashvars[sideBarContainer.plugin]=true&flashvars[sideBarContainer.position]=left&flashvars[sideBarContainer.clickToClose]=true&flashvars[chapters.plugin]=true&flashvars[chapters.layout]=vertical&flashvars[chapters.thumbnailRotator]=false&flashvars[streamSelector.plugin]=true&flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&flashvars[dualScreen.plugin]=true&flashvars[Kaltura.addCrossoriginToIframe]=true&&wid=0_2wn5rlrs" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" frameborder="0" title="Kaltura Player"></iframe>
## Project
---
#### Submission
Students need to submit an [RMarkdown](#r-rmarkdown) file with all of the required code and output by **Wednesday, July 8th at 12:00 PM EST** through Gradescope inside Brightspace.
You can find an Rmarkdown template which you can modify and use a starting point for your project [here](https://raw.githubusercontent.com/TheDataMine/the-examples-book/master/files/think-summer-examples-2020.Rmd), and the resulting, compiled PDF [here](https://raw.githubusercontent.com/TheDataMine/the-examples-book/master/files/think-summer-examples-2020.pdf).
**Motivation:** SQL is an incredibly powerful tool that allows you to process and filter massive amounts of data -- amounts of data where tools like spreadsheets start to fail. You can perform SQL queries directly within the R environment, and doing so allows you to quickly perform ad-hoc analyses.
**Context:** This project is specially designed for Purdue University's Think Summer program, in conjunction with Purdue University's integrative data science initiative, [The Data Mine](https://datamine.purdue.edu/).
**Scope:** SQL, SQL in R
**Learning objectives:**
```{block, type="bbox"}
- Demonstrate the ability to interact with popular database management systems within R.
- Solve data-driven problems using a combination of SQL and R.
- Use basic SQL commands: select, order by, limit, desc, asc, count, where, from.
- Perform grouping and aggregate data using group by and the following functions: count, max, sum, avg, like, having.
```
You can find useful examples that walk you through relevant material in The Examples Book:
https://thedatamine.github.io/the-examples-book
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
**Important note:** It is highly recommended that you use https://rstudio.scholar.rcac.purdue.edu/. Simply click on the link and login using your Purdue account credentials. Use another system at your own risk. The version of RStudio on https://desktop.scholar.rcac.purdue.edu/ (which uses ThinLinc), is 99.9.9, and is known to have some strange issues when running code chunks.
Don't forget the very useful documentation shortcut `?`. To use, simply type `?` in the console, followed by the name of the function you are interested in.
You can also look for package documentation by using `help(package=PACKAGENAME)`, so for example, to see the documentation for the package `ggplot2`, we could run:
```{r, eval=F}
help(package=ggplot2)
```
Sometimes it can be helpful to see the source code of a defined function. A [function](https://www.tutorialspoint.com/r/r_functions.htm) is any chunk of organized code that is used to perform an operation. Source code is the underlying `R` or `c` or `c++` code that is used to create the function. To see the source code of a defined function, type the function's name without the `()`. For example, if we were curious about what the function `Reduce` does, we could run:
```{r, eval=F}
Reduce
```
Occasionally this will be less useful as the resulting code will be code that calls `c` code we can't see. Other times it will allow you to understand the function better.
#### Dataset
The following questions will use the `imdb` database found in Scholar. The credentials to the database are:
**Username:** imdb_user
**Password:** movie$Rkool
This database has 6 tables, namely:
`akas`, `crew`, `episodes`, `people`, `ratings`, and `titles`.
To connect to the database from a terminal in Scholar, execute the following:
`mysql -u imdb_user -h scholar-db.rcac.purdue.edu -p`
You will be asked for the password. Type the provided password and press enter. Note that it will look like nothing is being typed as you type, this is OK, you are indeed typing the password.
To connect to the database from Rstudio, open a browser and navigate to https://rstudio.scholar.rcac.purdue.edu/, and login using your Purdue Career Account credentials.
To establish a connection with the MySQL database within Rstudio, run the following:
```{r, eval=F}
install.packages("RMariaDB")
library(RMariaDB)
host <- "scholar-db.rcac.purdue.edu"
user <- "imdb_user"
password <- "movie$Rkool"
database <- "imdb"
db <- dbConnect(RMariaDB::MariaDB(), host=host, db=database, user=user, password=password)
```
After running the code above, you should be successfully connected to the database. From here, you can either use the package `RMariaDB` to query our database:
```{r, eval=F}
result <- dbGetQuery(db, "SELECT * FROM titles LIMIT 5;")
```
Or you can execute SQL directly in an Rmarkdown file. For example, copy and paste the following code chunks in an RMarkdown file:
This code chunk initiates a connection to the database.
````markdown
`r ''````{r}
install.packages("RMariaDB")
library(RMariaDB)
host <- "scholar-db.rcac.purdue.edu"
user <- "imdb_user"
password <- "movie$Rkool"
database <- "imdb"
db <- dbConnect(RMariaDB::MariaDB(), host=host, db=database, user=user, password=password)
```
````
This code chunk demonstrates how to run SQL queries from within R.
````markdown
`r ''````{r}
result <- dbGetQuery(db, "SELECT * FROM titles LIMIT 5;")
```
````
This code chunk demonstrates how to use the SQL connection to run SQL queries directly within a code chunk.
````markdown
`r ''````{sql, connection=db}
SELECT * FROM titles LIMIT 5;
```
````
##### 1. Explore the 6 tables. State an interesting fact (of your choice) that you find about at least one of the tables.
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- A sentence describing at least 1 interesting fact about at least one of the tables.
```
##### 2. Find the title_id, rating, and number of votes for all movies that received at least 2 million votes.
**Hint:** *Use the ratings table.*
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- SQL query used to solve this problem.
- Output from running the SQL query.
```
##### 3. Now use the information you found, about the movies that received at least 2 million votes, to identify the titles of these movies, using the titles table.
**Hint:** *You will probably recognize the names of these movies.*
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- SQL query used to solve this problem.
- Output from running the SQL query.
```
##### 4. Find the names, birth years, and death years, for all actors and actresses who lived more than 115 years.
**Hint:** *You can use this clause in your SQL query:*
```{sql, eval=F}
WHERE died - born > 115
```
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- SQL query used to solve this problem.
- Output from running the SQL query.
```
##### 5. In the titles table, the genres column specifies the genre of each movie. Use the `COUNT` function to find how many movies of each genre occur in the database.
**Hint:** *You can use the same strategy from the `SUM` of transactions examples in the election database. Just use `COUNT` instead of `SUM`.*
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- SQL query used to solve this problem.
```
##### 6. In the titles table, the premiered column specifies the year that a movie was premiered. Use the `COUNT` function to find how many movies premiered in each year in the database.
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- SQL query used to solve this problem.
```
##### 7. One movie has a strange premiere year. Which movie is this?
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- SQL query used to solve this problem.
- Output from running the SQL query.
```
##### 8. Make a dotchart that shows how many movies premiered in each year since the year 2000.
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- SQL query used to gather the data used in the dotchart.
- A dotchart that shows how many movies premiered in each year since the year 2000, in png or jpg/jpeg format.
```
##### 9. The title 'The Awakening' has been used very often! How many times has this been used as a title?
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- SQL query used to solve this problem.
- Output from running the SQL query.
```
##### 10. Investigate all of the occurrences of these titles called 'The Awakening'. Find an interesting fact about the entries with these titles.
**Relevant topics:** *[sql](#sql-examples), [sql in R](#sql-in-r-examples)*
```{block, type="bbox"}
**Item(s) to submit:**
- SQL query used to solve this problem.
- Output from running the SQL query.
- 1-2 sentences describing the interesting fact you found about the entries with these titles.
```