Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Homework
In this homework, we'll deploy the ride duration model in batch mode. Like in homework 1, we'll use the Yellow Taxi Trip Records dataset.
You'll find the starter code in the homework directory.
Q1. Notebook
We'll start with the same notebook we ended up with in homework 1.
We cleaned it a little bit and kept only the scoring part. You can find the initial notebook here.
Run this notebook for the March 2023 data.
What's the standard deviation of the predicted duration for this dataset?
Q2. Preparing the output
Like in the course videos, we want to prepare the dataframe with the output.
First, let's create an artificial
ride_id
column:Next, write the ride id and the predictions to a dataframe with results.
Save it as parquet:
What's the size of the output file?
Note: Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use
pyarrow
, notfastparquet
.Q3. Creating the scoring script
Now let's turn the notebook into a script.
Which command you need to execute for that?
Q4. Virtual environment
Now let's put everything into a virtual environment. We'll use pipenv for that.
Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter
notebook.
After installing the libraries, pipenv creates two files:
Pipfile
and
Pipfile.lock
. ThePipfile.lock
file keeps the hashes of thedependencies we use for the virtual env.
What's the first hash for the Scikit-Learn dependency?
R:
Q5. Parametrize the script
Let's now make the script configurable via CLI. We'll create two
parameters: year and month.
Run the script for April 2023.
What's the mean predicted duration?
Hint: just add a print statement to your script.
Q6. Docker container
Finally, we'll package the script in the docker container.
For that, you'll need to use a base image that we prepared.
This is what the content of this image is:
Note: you don't need to run it. We have already done it.
It is pushed it to
agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim
,which you need to use as your base image.
That is, your Dockerfile should start with:
This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.
Important: don't copy the model to the docker image. You will need
to use the pickle file already in the image.
Now run the script with docker. What's the mean predicted duration
for May 2023?
response
Bonus: upload the result to the cloud (Not graded)
Just printing the mean duration inside the docker image
doesn't seem very practical. Typically, after creating the output
file, we upload it to the cloud storage.
Modify your code to upload the parquet file to S3/GCS/etc.
Bonus: Use Mage for batch inference
Here we didn't use any orchestration. In practice we usually do.
.scripts/start.sh
http://localhost:6789/
for mage UIPublishing the image to dockerhub
This is how we published the image to Docker hub:
docker build -t mlops-zoomcamp-model:2024-3.10.13-slim . docker tag mlops-zoomcamp-model:2024-3.10.13-slim agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim docker login --username USERNAME docker push agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim
This is just for your reference, you don't need to do it.
Submit the results