The task involves developing a data pipeline and its underlying infrastructure. Sample data sources are provided in order to complete the following user story:
we would like to explore and process the flights data in order to answer questions such as:
- how many days does the flights table cover?
- how many departure cities does the flights database cover?
- what is the relationship between the flights and planes tables?
- which airplane manufacturer incurred the most delays in the analysis period?
- what are the two most connected cities?
The sample data consists of the following tables (a short exploration sketch follows the list):
- airlines
- airports
- flights
- planes
- weather
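
A minimal exploration sketch over these tables could look like the following. PySpark, the local `data/` paths, and the nycflights13 column names (`year`, `month`, `day`, `origin`, `tailnum`, `dep_delay`, `manufacturer`) are assumptions here, not requirements of the task.

```python
# Rough exploration sketch; the tool choice (PySpark), file paths and column
# names are assumptions based on the nycflights13 schema.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("flights-exploration").getOrCreate()

flights = spark.read.csv("data/flights.csv", header=True, inferSchema=True)
planes = spark.read.csv("data/planes.csv", header=True, inferSchema=True)

# How many days does the flights table cover?
print(flights.select("year", "month", "day").distinct().count())

# How many departure airports (a proxy for departure cities) does it cover?
print(flights.select("origin").distinct().count())

# Flights and planes relate through the tail number; joining on it lets us
# attribute departure delays to an airplane manufacturer.
(flights.join(planes.select("tailnum", "manufacturer"), on="tailnum")
        .groupBy("manufacturer")
        .agg(F.sum("dep_delay").alias("total_dep_delay"))
        .orderBy(F.desc("total_dep_delay"))
        .show(5))
```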
The deliverables are to:
- answer the above user story questions
- provide instructions on how to create the pipeline and the infrastructure
- describe what considerations apply when dealing with real data (the data is kept small for this exercise)
- describe what considerations apply when promoting this work to production
Another user story, related to the launch of a loyalty program, has reached us: customer data will be coming in JSON format.
How can we use this customer data alongside the above data to enable this program?
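
One possible direction, sketched below, is to land the JSON customer data next to the existing tables and join it to flights so that points can be accrued per mile flown. The `customers.json` file, its fields, the join keys, and the points rule are hypothetical illustrations only.

```python
# Hypothetical loyalty sketch: customers.json and its fields (customer_id,
# carrier, flight, year, month, day) are illustrative assumptions.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("loyalty-program").getOrCreate()

flights = spark.read.csv("data/flights.csv", header=True, inferSchema=True)
customers = spark.read.json("data/customers.json")  # schema inferred from the JSON

# Tie each customer's booked trips to the flights table and accrue one point
# per mile of distance flown.
(customers.join(flights, on=["carrier", "flight", "year", "month", "day"], how="left")
          .groupBy("customer_id")
          .agg(F.sum("distance").alias("loyalty_points"))
          .show(5))
```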
The flight delay and cancellation data was collected and published by the DOT's Bureau of Transportation Statistics.
the files in the data tarball are sourced from the following repositories:
https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airlines.csv
https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv
https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/planes.csv
https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/weather.csv
https://github.com/rich-iannone/so-many-pyspark-examples/raw/master/data-files/nycflights13.csv (renamed to flights.csv)
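
For reference, the raw files can also be fetched directly from the sources above; a small standard-library helper (the `data/` target directory is an assumption) might look like:

```python
# Download the raw CSVs listed above into a local data/ directory.
import urllib.request
from pathlib import Path

SOURCES = {
    "airlines.csv": "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airlines.csv",
    "airports.csv": "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/airports.csv",
    "planes.csv": "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/planes.csv",
    "weather.csv": "https://raw.githubusercontent.com/hadley/nycflights13/master/data-raw/weather.csv",
    # renamed from nycflights13.csv in the source repository
    "flights.csv": "https://github.com/rich-iannone/so-many-pyspark-examples/raw/master/data-files/nycflights13.csv",
}

def download_all(target_dir: str = "data") -> None:
    """Download each source file into target_dir, creating it if needed."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, url in SOURCES.items():
        urllib.request.urlretrieve(url, str(out / name))

if __name__ == "__main__":
    download_all()
```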