Skill Assessment: Data Engineer

Background

For this assessment, please build a dbt project that addresses the problem listed below. Ensure that you clone it to a git-based sharable repository and include a README.md file in the repository's root. The README.md should discuss your analysis of the problem and the steps you took to solve it. You may use whichever SQL database flavor you prefer, but please state which one you chose in your README.md. (Note: We use Snowflake, but open-source database alternatives like DuckDB are recommended as they are easy to set up and get started.)

Problem: Where in the World is Carmen Sandiego?

Some may know this popular game franchise in which players answer questions to solve the titular riddle. Your task today is an homage to this game! You've recently come aboard Interpol's team as a data engineer. Their dedicated data team has collected, parsed, and assembled several agent field reports over the years, but in Excel only. Their final collection is provided in the next subsection - your task is to engineer this data to provide analytical answers to the Interpol team!

Data Sources & Common Model Development

The data is contained in the attached Excel workbook carmen_sightings_20220629061307.xlsx. Note the sheets are organized by eight (nearly continential) regions - there is an Interpol agency HQ in a city of each region to which the agents report. Each agency HQ uses their own language or dialect to compile their regional reports, but those reports are in first normal form (1NF).

The first step of your task is to extract data from Excel workbook, treating as initial sources. HINT: CSV exports into seeds - whether by Excel or pandas - is a great way to start... 👀

As seen from the data, agencies are free to name report columns according to their custom - but let's call each "yablaka" an apple! 🍎 Each source ought follow a common data dictionary of

Column	Description	Type
date_witness	Date of witness sighting	date
witness	Name of witness sighting the perpetrator	string
agent	Name of field agent filing the report	string
date_agent	Date of field agent filing the report	date
city_agent	HQ city where field agent files the report	string
country	Country of sighting	string
city	City of sighting	string
latitude	Latitude of sighting	float
longitude	Longitude of sighting	float
has_weapon	Was the perpetrator observed to be armed?	boolean
has_hat	Was the perpetrator wearing a hat?	boolean
has_jacket	Was the perpetrator wearing a jacket?	boolean
behavior	Short description of perpetrator behavior	string

The second step of your task is to create view models that columnarly maps these eight different sources, each into this common data dictionary. HINT: You can do this as you wish - via CTE stages, macros, go wild! The end result however must be a view model for each source.

Now that you have eight models, all with the same columns - join them together, but with a caveat:

The third step of your task is to join the eight different views into ONE schema that goes beyond 1NF ([2-6]NF, BCNF). You have a great deal of design freedom here, so get creative! Just persist final resulting schema as tables into a new schema - please present your design's entity-relation-diagram in your README.md.

Analytics

This new schema includes the >1NF model you've just developed, as tables. From this model, it ought be fairly straightforward for you to create analytical view(s) to answer the following questions:

a. For each month, which agency region is Carmen Sandiego most likely to be found?

b. Also for each month, what is the probability that Ms. Sandiego is armed AND wearing a jacket, but NOT a hat? What general observations about Ms. Sandiego can you make from this?

c. What are the three most occuring behaviors of Ms. Sandiego?

d. For each month, what is the probability Ms. Sandiego exhibits one of her three most occurring behaviors?

Create analytical views in your new schema to answer the four above questions. Document your steps and logic in your README.md. HINT: dbt docs (and its screenshots) are a great resource!

Submission

NOTE: Throughout, I've referenced README.md. If you are unfamiliar with GitHub Markdown, feel free to use your most convenient method: Word document, Google Doc, textfile (with image attachments), html - however you can best communicate your thoughts and ideas!

Ensure that your project runs fully BEFORE submission!
Verify that you have the four analytical questions answered in your README.md, and you are confident with your presentation.
Push and merge into your git repository's main branch.
Then notify the team that you're ready for a review.

Enjoy the challenge and good luck!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Skill Assessment: Data Engineer

Background

Problem: Where in the World is Carmen Sandiego?

Data Sources & Common Model Development

Analytics

Submission

About

Releases

Packages

Contributors 2

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
carmen_sightings		carmen_sightings
LICENSE		LICENSE
README.md		README.md

License

achilala/wheres-carmen

Folders and files

Latest commit

History

Repository files navigation

Skill Assessment: Data Engineer

Background

Problem: Where in the World is Carmen Sandiego?

Data Sources & Common Model Development

Analytics

Submission

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Packages