Assignment 5: Data import (readr) and tidy data (tidyr)

Instructions: Please read through this before you begin

This homework is due by 10pm on Monday 10/26/20.
Please reproduce this markdown template. Pay attention to all the formating in this file, including bullet points, bolded characters, inserted code chunks, headings, text colors, blank lines, etc.
For the first two exercises in this assignment, import data files into R and parse them properly as instructed. Tidy and/or transform the data frames when appropriate.
Please note that exercise 1 is optional. We have not covered data import in lecture, but the relevant material can be found here if you’d like to give it a shot.
- We have not extensively covered parsing data in lecture, but you can find more information about it here.
- Briefly, you can choose to use the col_types argument in a read_*() function to parse the data during import.
- In many cases, you can also choose to parse the data using one of the parse_*() functions nested within a mutate() function after importing the data.
For the next exercise, you will explore a dataset with details about passengers on the Titanic. First, answer the questions below and use figures or tables to support your answer. Then, explore the dataset on your own using the data transformation and visualization skills that you have learned in this class so far.
- For this exercise, please make sure to put some thought into editing the aesthetics of your figures and tables to make them easier to understand and nicer to look at (e.g. choose the most appropriate geometric object, aesthetic mapping, facetting, position adjustment; add meaningful axis labels, figure titles, legend titles; change the background; be creative; and etc.).
Please note that exercise 4 is optional. It is a good exercise for practicing your data tidying and exploration skills though, so we encourage you to give it a try if time allows.
Please reproduce the tables and plots exactly as shown in this markdown file.
When a verbal response is needed, answer by editing the part in the R markdown template where it says “Write your response here” .
Have all your code embedded within the R markdown file, and show both of your code and plots in the knitted markdown file.
Use R Markdown functionalities to hide messages and warnings when needed. (Suggestion: messages and warnings can often be informative and important, so please examine them carefully and only turn them off when you finish the exercise).
Please name your R markdown file assignment_5.Rmd and the knitted markdown file assignment_5.md. Please upload both files using your personal GitHub repository for this class.
To start, first load all the required packages with the following code. Install them if they are not installed yet.

library(tidyverse)
library(knitr)

Exercise 1. Tibble and Data Import (OPTIONAL)

Import data frames listed below into R and parse the columns appropriately when needed. Watch out for the formating oddities of each dataset. Print the results with kable().

1.1 Create the following tibble manually, first using `tribble()` and then using `tibble()`. Print both results.

tribble():

a	b	c
1	2.1	apple
2	3.2	orrange

tibble():

a	b	c
1	2.1	apple
2	3.2	orrange

1.2 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset2.txt` into R. Change the column names into “Name”, “Weight”, “Price”.

Name	Weight	Price
apple	1	2.9
orange	2	4.9
durian	10	19.9

1.3 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset3.txt` into R. Watch out for the first few lines, missing values, separators, quotation marks, and deliminaters.

Name	Weight	Price
apple	1	2.9
orange	2	NA
durian	NA	19.9

1.4 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset4.txt` into R. Watch out for comments, units, and decimal marks (which are `,` in this case).

Name	Weight	Price
apple	1	2.9
orange	2	4.9
durian	10	19.9

1.5 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset5.txt` into R. Parse the columns properly. Write this imported and parsed data frame into a new csv file named `dataset5_new.csv` in your `problem_sets` folder.

Name	Expiration Date	Time
apple	2018-09-26	01:00:00
orange	2018-10-02	13:00:00
durian	2018-10-21	11:00:00

Exercise 2. Weather station

This dataset contains the weather and air quality data collected by a weather station in Taiwan. It was obtained from the Environmental Protection Administration, Executive Yuan, R.O.C. (Taiwan).

2.1 The text file `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/2015y_Weather_Station_notes.txt` contains desciptions of different variables collected by the station. Import it into R and print it in a table as shown below.

Item	Unit	Description
AMB_TEMP	Celsius	Ambient air temperature
CO	ppm	Carbon monoxide
NO	ppb	Nitric oxide
NO2	ppb	Nitrogen dioxide
NOx	ppb	Nitrogen oxides
O3	ppb	Ozone
PM10	μg/m3	Particulate matter with a diameter between 2.5 and 10 μm
PM2.5	μg/m3	Particulate matter with a diameter of 2.5 μm or less
RAINFALL	mm	Rainfall
RH	%	Relative humidity
SO2	ppb	Sulfur dioxide
WD_HR	degress	Wind direction (The average of hour)
WIND_DIREC	degress	Wind direction (The average of last ten minutes per hour)
WIND_SPEED	m/sec	Wind speed (The average of last ten minutes per hour)
WS_HR	m/sec	Wind speed (The average of hour)

# indicates invalid value by equipment inspection
* indicates invalid value by program inspection
x indicates invalid value by human inspection
NR indicates no rainfall
blank indicates no data

2.2 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/2015y_Weather_Station.csv` into R. As you can see, this dataset is a classic example of untidy data: values of a variable (i.e. hour of the day) are stored as column names; variable names are stored in the `item` column. Clean this dataset up by turning it into a tidy format. Also, parse the `date` variable into date format and parse `hour` into time. Turn all invalid values into `NA` and turn `NR` in rainfall into `0`. Parse all values into numbers. Show the first 6 rows and 15 columns of this cleaned dataset, as shown below. (Hint: you don’t have to do these tasks in the given order.)

Before cleaning:

date	station	item	00	01	02	03	04	05	06	07	08	09	10	11
2015-01-01	Cailiao	AMB_TEMP	16	16	15	15	15	14	14	14	14	15	14	15
2015-01-01	Cailiao	CO	0.74	0.7	0.66	0.61	0.51	0.51	0.51	0.6	0.62	0.58	0.53	0.49
2015-01-01	Cailiao	NO	1	0.8	1.1	1.7	2	1.7	1.9	2.4	3.4	3.7	3.5	3.4
2015-01-01	Cailiao	NO2	15	13	13	12	11	13	13	16	16	14	12	11
2015-01-01	Cailiao	NOx	16	14	14	13	13	15	15	18	19	18	15	15
2015-01-01	Cailiao	O3	35	36	35	34	34	32	30	26	26	29	33	38

After cleaning:

date	station	hour	AMB_TEMP	CO	NO	NO2	NOx	O3	PM10	PM2.5	RH	SO2	WD_HR
2015-01-01	Cailiao	00:00:00	16	0.74	1.0	15	16	35	171	76	57	9.2	74
2015-01-01	Cailiao	01:00:00	16	0.70	0.8	13	14	36	174	78	57	7.7	72
2015-01-01	Cailiao	02:00:00	15	0.66	1.1	13	14	35	160	69	58	6.6	74
2015-01-01	Cailiao	03:00:00	15	0.61	1.7	12	13	34	142	60	59	5.4	71
2015-01-01	Cailiao	04:00:00	15	0.51	2.0	11	13	34	123	52	59	4.8	67
2015-01-01	Cailiao	05:00:00	14	0.51	1.7	13	15	32	110	44	57	5.0	63

2.3 Using this cleaned dataset, plot the daily variation in ambient temperature on September 25, 2015, as shown below.

2.4 Plot the daily average ambient temperature throughout the year with a continuous line, as shown below.

2.5 Plot the total rainfall per month in a bar chart, as shown below. (Hint: seperating date into three columns might be helpful.)

2.6 Plot the per hour variation in PM2.5 in the first week of September with a continuous line, as shown below. (Hint: uniting the date and hour and parsing the new variable might be helpful.)

Exercise 3: Titanic passengers

This dataset contains information about a subset of the passengers who were aboard the Titanic when it sank in 1912.

titanic<-read_csv("https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/Titanic.csv")
kable(head(titanic))

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.2500	NA	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.9250	NA	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.0500	NA	S
6	0	3	Moran, Mr. James	male	NA	0	330877	8.4583	NA	Q

And here are descriptions of the variables in the dataset

notes<-read_csv("https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/Notes.csv")
kable(notes)

Variable	Definition	Key
PassengerId	Passenger ID	NA
Survival	Survival	0 = No, 1 = Yes
Pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
Name	Pasenger name	NA
Sex	Sex	NA
Age	Age in years	NA
Sibsp	# of siblings / spouses aboard the Titanic	NA
Parch	# of parents / children aboard the Titanic	NA
Ticket	Ticket number	NA
Fare	Passenger fare	NA
Cabin	Cabin number	NA
Embarked	Port of Embarkation	C = Cherbourg, Q = Queenstown, S = Southampton

Note: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

Part 1.

Answer the questions below and use figures or tables to support your answer. Choose the most appropriate type of figure or table for each question.

3.1 According to Wikipedia, there was an estimated 2,224 passengers and crew onboard the Titanic when it sank. How many of them do we have information on in this dataset? How many of them survived and how many did not? What is the overall survival rate?

## Write your code here

Write your response here

3.2 How many passengers on the Titanic were males and how many were females? What do you find when you break it down by ticket class?

## Write your code here

Write your response here

3.3 How many passengers of each sex survived and how many of them did not? What is the survival rate for passengers of each sex?

## Write your code here

Write your response here

3.4 How many passengers do we have age information for (including estimated age)? For how many is the age information missing? What is the age distribution for passengers whose age information is available?

## Write your code here

Write your response here

3.5 Show the age distribution per ticket class, per sex. What do you find?

## Write your code here

Write your response here

3.6 What is the age distribution of passengers who survived vs. those who did not? What hypothesis can you come up with when comparing these two distributions?

## Write your code here

Write your response here

3.7 In this dataset, the Fare variable does not represent the fare per person. Instead, each ticket number has a corresponding fare, and some passengers share one single ticket number. Therefore, the Fare variable is the total fare for a group of passengers sharing the same ticket number. Knowing this, calculate the average fare per person. (You don’t need to show a table or a figure for this question, just show the code for the calculation)

## Write your code here

Write your response here

3.8 Show the distribution of the number of family members (including siblings, spouses, parents, and children) that each passenger was accompanied by. Were most passengers travelling solo or with family?

## Write your code here

Write your response here

3.9 Which ticket class did most of the largest families get? And which ticket class has the lowest proportion of female passengers who travelled solo out of all the female passengers in that class?

## Write your code here

Write your response here

3.10 Do the port of embarkation matter for the survival rates of passengers? Why might this be the case?

## Write your code here

Write your response here

Exercise 4. Camera data (OPTIONAL)

This dataset contains information on 1038 camera models. It was obtained from the following website: https://perso.telecom-paristech.fr/eagan/class/igr204/

4.1 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/camera.csv` to R. You will see that the `Model` columns contains both the brand names and model names of cameras. Split this column into two, one with brand name, and the other with model name, as shown below. Print the first 6 rows of the new data frame. (Hint: check the merge argument in `separate()`)

Brand	Model	Release date	Max resolution	Low resolution	Effective pixels	Zoom wide (W)	Zoom tele (T)	Normal focus range	Macro focus range	Storage included	Weight (inc. batteries)	Dimensions	Price
Agfa	ePhoto 1280	1997	1024	640	0	38	114	70	40	4	420	95	179
Agfa	ePhoto 1680	1998	1280	640	1	38	114	50	0	4	420	158	179
Agfa	ePhoto CL18	2000	640	0	0	45	45	0	0	2	0	0	179
Agfa	ePhoto CL30	1999	1152	640	0	35	35	0	0	4	0	0	269
Agfa	ePhoto CL30 Clik!	1999	1152	640	0	43	43	50	0	40	300	128	1299
Agfa	ePhoto CL45	2001	1600	640	1	51	51	50	20	8	270	119	179

4.2 Many model names start with a name for the product line, which is then followed by a name for the particular model. Select all Canon cameras, and further split the model names into product line names (in this case, they are either “Powershot” or “EOS”) and model names. Show the first 6 lines of this new data frame. (Hint: notice that there are more than one possible separators)

Brand	Line	Model	Release date	Max resolution	Low resolution	Effective pixels	Zoom wide (W)	Zoom tele (T)	Normal focus range	Macro focus range	Storage included	Weight (inc. batteries)	Dimensions	Price
Canon	PowerShot	350	1997	640	0	0	42	42	70	3	2	320	93	149
Canon	PowerShot	600	1996	832	640	0	50	50	40	10	1	460	160	139
Canon	PowerShot	A10	2001	1280	1024	1	35	105	76	16	8	375	110	139
Canon	PowerShot	A100	2002	1280	1024	1	39	39	20	5	8	225	110	139
Canon	PowerShot	A20	2001	1600	1024	1	35	105	76	16	8	375	110	139
Canon	PowerShot	A200	2002	1600	1024	1	39	39	20	5	8	225	110	139

4.3 Explore the full dataset (and some subsetted ones if you are interested) on your own using a variety of the skills that you have learned in this class so far (i.e. data visualization, transformation, and exploration). Come up with at least 3 interesting findings or potential hypotheses.

4.3.1

## Write your code here

Write your response here

4.3.2

## Write your code here

Write your response here

4.3.3

## Write your code here

Write your response here

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assignment_5.md

assignment_5.md

Assignment 5: Data import (readr) and tidy data (tidyr)

Instructions: Please read through this before you begin

Exercise 1. Tibble and Data Import (OPTIONAL)

1.1 Create the following tibble manually, first using `tribble()` and then using `tibble()`. Print both results.

1.2 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset2.txt` into R. Change the column names into “Name”, “Weight”, “Price”.

1.3 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset3.txt` into R. Watch out for the first few lines, missing values, separators, quotation marks, and deliminaters.

1.4 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset4.txt` into R. Watch out for comments, units, and decimal marks (which are `,` in this case).

1.5 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset5.txt` into R. Parse the columns properly. Write this imported and parsed data frame into a new csv file named `dataset5_new.csv` in your `problem_sets` folder.

Exercise 2. Weather station

2.1 The text file `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/2015y_Weather_Station_notes.txt` contains desciptions of different variables collected by the station. Import it into R and print it in a table as shown below.

2.3 Using this cleaned dataset, plot the daily variation in ambient temperature on September 25, 2015, as shown below.

2.4 Plot the daily average ambient temperature throughout the year with a continuous line, as shown below.

2.5 Plot the total rainfall per month in a bar chart, as shown below. (Hint: seperating date into three columns might be helpful.)

2.6 Plot the per hour variation in PM2.5 in the first week of September with a continuous line, as shown below. (Hint: uniting the date and hour and parsing the new variable might be helpful.)

Exercise 3: Titanic passengers

Part 1.

3.1 According to Wikipedia, there was an estimated 2,224 passengers and crew onboard the Titanic when it sank. How many of them do we have information on in this dataset? How many of them survived and how many did not? What is the overall survival rate?

3.2 How many passengers on the Titanic were males and how many were females? What do you find when you break it down by ticket class?

3.3 How many passengers of each sex survived and how many of them did not? What is the survival rate for passengers of each sex?

3.4 How many passengers do we have age information for (including estimated age)? For how many is the age information missing? What is the age distribution for passengers whose age information is available?

3.5 Show the age distribution per ticket class, per sex. What do you find?

3.6 What is the age distribution of passengers who survived vs. those who did not? What hypothesis can you come up with when comparing these two distributions?

3.8 Show the distribution of the number of family members (including siblings, spouses, parents, and children) that each passenger was accompanied by. Were most passengers travelling solo or with family?

3.9 Which ticket class did most of the largest families get? And which ticket class has the lowest proportion of female passengers who travelled solo out of all the female passengers in that class?

3.10 Do the port of embarkation matter for the survival rates of passengers? Why might this be the case?

Exercise 4. Camera data (OPTIONAL)

4.3 Explore the full dataset (and some subsetted ones if you are interested) on your own using a variety of the skills that you have learned in this class so far (i.e. data visualization, transformation, and exploration). Come up with at least 3 interesting findings or potential hypotheses.

4.3.1

4.3.2

4.3.3

Don’t hesitate to continue writing if you have more than 3 interesting findings.

Files

assignment_5.md

Latest commit

History

assignment_5.md

File metadata and controls

Assignment 5: Data import (readr) and tidy data (tidyr)

Instructions: Please read through this before you begin

Exercise 1. Tibble and Data Import (OPTIONAL)

1.1 Create the following tibble manually, first using tribble() and then using tibble(). Print both results.

1.2 Import https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset2.txt into R. Change the column names into “Name”, “Weight”, “Price”.

1.3 Import https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset3.txt into R. Watch out for the first few lines, missing values, separators, quotation marks, and deliminaters.

1.4 Import https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset4.txt into R. Watch out for comments, units, and decimal marks (which are , in this case).

1.5 Import https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset5.txt into R. Parse the columns properly. Write this imported and parsed data frame into a new csv file named dataset5_new.csv in your problem_sets folder.

Exercise 2. Weather station

2.1 The text file https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/2015y_Weather_Station_notes.txt contains desciptions of different variables collected by the station. Import it into R and print it in a table as shown below.

2.3 Using this cleaned dataset, plot the daily variation in ambient temperature on September 25, 2015, as shown below.

2.4 Plot the daily average ambient temperature throughout the year with a continuous line, as shown below.

2.5 Plot the total rainfall per month in a bar chart, as shown below. (Hint: seperating date into three columns might be helpful.)

2.6 Plot the per hour variation in PM2.5 in the first week of September with a continuous line, as shown below. (Hint: uniting the date and hour and parsing the new variable might be helpful.)

Exercise 3: Titanic passengers

Part 1.

3.1 According to Wikipedia, there was an estimated 2,224 passengers and crew onboard the Titanic when it sank. How many of them do we have information on in this dataset? How many of them survived and how many did not? What is the overall survival rate?

3.2 How many passengers on the Titanic were males and how many were females? What do you find when you break it down by ticket class?

3.3 How many passengers of each sex survived and how many of them did not? What is the survival rate for passengers of each sex?

3.4 How many passengers do we have age information for (including estimated age)? For how many is the age information missing? What is the age distribution for passengers whose age information is available?

3.5 Show the age distribution per ticket class, per sex. What do you find?

3.6 What is the age distribution of passengers who survived vs. those who did not? What hypothesis can you come up with when comparing these two distributions?

3.8 Show the distribution of the number of family members (including siblings, spouses, parents, and children) that each passenger was accompanied by. Were most passengers travelling solo or with family?

3.9 Which ticket class did most of the largest families get? And which ticket class has the lowest proportion of female passengers who travelled solo out of all the female passengers in that class?

3.10 Do the port of embarkation matter for the survival rates of passengers? Why might this be the case?

Exercise 4. Camera data (OPTIONAL)

4.3 Explore the full dataset (and some subsetted ones if you are interested) on your own using a variety of the skills that you have learned in this class so far (i.e. data visualization, transformation, and exploration). Come up with at least 3 interesting findings or potential hypotheses.

4.3.1

4.3.2

4.3.3

Don’t hesitate to continue writing if you have more than 3 interesting findings.

1.1 Create the following tibble manually, first using `tribble()` and then using `tibble()`. Print both results.

1.2 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset2.txt` into R. Change the column names into “Name”, “Weight”, “Price”.

1.3 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset3.txt` into R. Watch out for the first few lines, missing values, separators, quotation marks, and deliminaters.

1.4 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset4.txt` into R. Watch out for comments, units, and decimal marks (which are `,` in this case).

1.5 Import `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/dataset5.txt` into R. Parse the columns properly. Write this imported and parsed data frame into a new csv file named `dataset5_new.csv` in your `problem_sets` folder.

2.1 The text file `https://raw.githubusercontent.com/nt246/NTRES6940-data-science/master/datasets/2015y_Weather_Station_notes.txt` contains desciptions of different variables collected by the station. Import it into R and print it in a table as shown below.