The aim of the first exercise is to test your SQL knowledge; the second asks you to build a machine learning model.
Please find below a simple schema of two SQL tables extracted from a basic application containing TV show data. The SQL dump is available in the exercise_1 repository.
- tv_shows: Contains TV show data.
- episode_sample: Contains information regarding TV show episodes.
We would like to retrieve some information from these tables. Data volumes in this application will grow rapidly, so we need to pay attention to query execution performance.
Please write queries to retrieve the following:
- The number of TV shows available in the tv_shows table.
- The oldest TV show (name) available in the tv_shows table.
- The TV show (name) with the highest number of episodes in the episode_sample table.
- The TV show (name) with the longest episode title in the episode_sample table.
- The most recent episode for each TV show.
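
To make the expected output concrete, here is a sketch of one possible set of queries. The column names used below are assumptions for illustration (tv_shows(id, name, release_date), episode_sample(id, tv_show_id, title, air_date)); adjust them to match the actual dump, and note that functions such as LENGTH vary by SQL dialect.

```sql
-- Assumed (hypothetical) columns: tv_shows(id, name, release_date),
-- episode_sample(id, tv_show_id, title, air_date).

-- Number of TV shows.
SELECT COUNT(*) AS show_count FROM tv_shows;

-- Oldest TV show.
SELECT name FROM tv_shows ORDER BY release_date ASC LIMIT 1;

-- TV show with the most episodes.
SELECT t.name
FROM tv_shows t
JOIN episode_sample e ON e.tv_show_id = t.id
GROUP BY t.id, t.name
ORDER BY COUNT(*) DESC
LIMIT 1;

-- TV show with the longest episode title.
SELECT t.name
FROM tv_shows t
JOIN episode_sample e ON e.tv_show_id = t.id
ORDER BY LENGTH(e.title) DESC
LIMIT 1;

-- Most recent episode for each TV show.
SELECT t.name, e.title, e.air_date
FROM episode_sample e
JOIN tv_shows t ON t.id = e.tv_show_id
WHERE e.air_date = (
    SELECT MAX(e2.air_date)
    FROM episode_sample e2
    WHERE e2.tv_show_id = e.tv_show_id
);

-- Since volumes will grow, index the join/aggregation columns.
CREATE INDEX idx_episode_show_date ON episode_sample (tv_show_id, air_date);
```

With the composite index, the correlated MAX lookup becomes an index seek rather than a full scan of episode_sample, which matters as the table grows.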
In this second exercise you'll need to use the attached dataset. This dataset contains CO2 emissions by country and year, from 1750 to 2022.
We would like you to create a Python notebook that will:
- Load data from the dataset.
- Clean the data and make it usable for the next steps.
- Create a machine learning model to predict CO2 emissions for 2021 and 2022 based on the features contained in the dataset.
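
As a minimal sketch of the expected notebook structure (not a reference solution): the file name (co2_emissions.csv) and the column names (country, year, co2) below are assumptions for illustration; adapt them to the attached dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Load the data (assumed file name).
df = pd.read_csv("co2_emissions.csv")

# Clean: keep rows with a known country, year, and emission value.
df = df.dropna(subset=["country", "year", "co2"])
df["year"] = df["year"].astype(int)

# Hold out the years we are asked to predict; train on earlier years.
train = df[df["year"] <= 2020]
test = df[df["year"].isin([2021, 2022])]

# One-hot encode the country; pass the numeric year through so the
# linear model learns per-country intercepts plus a shared year trend.
pre = ColumnTransformer(
    [("country", OneHotEncoder(handle_unknown="ignore"), ["country"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("reg", LinearRegression())])
model.fit(train[["country", "year"]], train["co2"])

# Evaluate on the held-out years.
pred = model.predict(test[["country", "year"]])
print("MAE on 2021-2022:", mean_absolute_error(test["co2"], pred))
```

A linear baseline is used here only because, unlike tree ensembles, it can extrapolate beyond the last training year; comparing it against richer models and better features is exactly the kind of reasoning worth presenting.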
In this part the main focus is not code quality; we are interested in the logic and methods you use to create, train, and test your model. All of this will be discussed during the debriefing session.
There is no expected format for the deliverable: you can either send the files and a report by email or store them in a GitHub repository. If you use a GitHub repository, please grant Alexis (GitHub ID: Alexis) access to it.