feat: homework 4 local deployment #6

Open
wants to merge 1 commit into
base: master
208 changes: 207 additions & 1 deletion homework/04-deployment/homework.md
Original file line number Diff line number Diff line change
@@ -1 +1,207 @@
TBA. Come back later
## Homework

In this homework, we'll deploy the ride duration model in batch mode. Like in homework 1, we'll use the Yellow Taxi Trip Records dataset.

You'll find the starter code in the [homework](homework) directory.


## Q1. Notebook

We'll start with the same notebook we ended up with in homework 1.
We cleaned it a little bit and kept only the scoring part. You can find the initial notebook [here](homework/starter.ipynb).

Run this notebook for the March 2023 data.

What's the standard deviation of the predicted duration for this dataset?

* 1.24
* 6.24 [x]
* 12.28
* 18.28
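
A minimal sketch of how the spread could be computed once the predictions exist. The values below are made up for illustration, and `y_pred` is an assumed name for the array of predictions. In the notebook the predictions would typically be a NumPy array, whose `.std()` divides by N by default, while a pandas `Series.std()` divides by N - 1. For a dataset with millions of rows the two agree to the precision of the answer options.

```python
import statistics

# Hypothetical predictions in minutes; in the notebook these would come
# from the model's predict() call instead.
y_pred = [12.0, 15.5, 9.0, 20.5, 13.0]

# Population standard deviation (divides by N), matching NumPy's default.
population_std = statistics.pstdev(y_pred)

# Sample standard deviation (divides by N - 1), matching pandas' default.
sample_std = statistics.stdev(y_pred)

print(f"population std: {population_std:.2f}")
print(f"sample std:     {sample_std:.2f}")
```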


## Q2. Preparing the output

Like in the course videos, we want to prepare the dataframe with the output.

First, let's create an artificial `ride_id` column:

```python
df['ride_id'] = f'{year:04d}/{month:02d}_' + df.index.astype('str')
```

Next, write the ride id and the predictions to a dataframe with results.
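
A small sketch of this step, assuming the predictions are held in a variable called `y_pred` and that the prediction column is named `predicted_duration` (both names are assumptions, not given in the homework). The toy dataframe stands in for the real trip data.

```python
import pandas as pd

year, month = 2023, 3

# Toy stand-ins for the real trip dataframe and the model predictions.
df = pd.DataFrame({"duration": [10.0, 12.5, 7.0]})
y_pred = [11.2, 13.1, 6.4]

# Same ride_id construction as in the snippet above.
df["ride_id"] = f"{year:04d}/{month:02d}_" + df.index.astype("str")

# Keep only the two columns the output file should contain.
df_result = pd.DataFrame({
    "ride_id": df["ride_id"],
    "predicted_duration": y_pred,
})

print(df_result)
```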

Save it as parquet:

```python
df_result.to_parquet(
    output_file,
    engine='pyarrow',
    compression=None,
    index=False
)
```

What's the size of the output file?

* 36M
* 46M
* 56M
* 66M [x]

__Note:__ Make sure you use the snippet above for saving the file. It should contain only these two columns. For this question, don't change the
dtypes of the columns and use `pyarrow`, not `fastparquet`.
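
One way to check the size from Python rather than from the shell, as a sketch. The file name `output.parquet` is a placeholder; use whatever `output_file` points to in your script.

```python
import os


def file_size_mb(path: str) -> float:
    """Return the size of a file in mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


if __name__ == "__main__":
    output_file = "output.parquet"  # hypothetical path
    if os.path.exists(output_file):
        print(f"{output_file}: {file_size_mb(output_file):.0f}M")
```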


## Q3. Creating the scoring script

Now let's turn the notebook into a script.

Which command do you need to execute for that?

```
jupyter nbconvert --to script starter.ipynb
```


## Q4. Virtual environment

Now let's put everything into a virtual environment. We'll use pipenv for that.

Install all the required libraries. Pay attention to the Scikit-Learn version: it should be the same as in the starter
notebook.

After installing the libraries, pipenv creates two files: `Pipfile`
and `Pipfile.lock`. The `Pipfile.lock` file keeps the hashes of the
dependencies we use for the virtual env.

What's the first hash for the Scikit-Learn dependency?

Answer:
```
sha256:057b991ac64b3e75c9c04b5f9395eaf19a6179244c089afdebaad98264bff37c
```
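
The hash can also be read programmatically, since `Pipfile.lock` is a JSON file in which runtime dependencies sit under the `default` key. The sketch below uses a tiny stand-in document with placeholder hashes rather than the real lock file.

```python
import json


def first_hash(lock: dict, package: str) -> str:
    """Return the first hash listed for a package in the 'default' section."""
    return lock["default"][package]["hashes"][0]


# Stand-in for a real Pipfile.lock; the hash values are placeholders.
example_lock = json.loads("""
{
  "default": {
    "scikit-learn": {
      "hashes": ["sha256:aaaa", "sha256:bbbb"],
      "version": "==1.5.0"
    }
  }
}
""")

print(first_hash(example_lock, "scikit-learn"))
```

With the real file, the dictionary would come from `json.load(open("Pipfile.lock"))` instead of the inline string.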

## Q5. Parametrize the script

Let's now make the script configurable via CLI. We'll create two
parameters: year and month.

Run the script for April 2023.

What's the mean predicted duration?

* 7.29
* 14.29 [x]
* 21.29
* 28.29

Hint: just add a print statement to your script.
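
A possible shape for the CLI, sketched with `argparse` from the standard library. The function names are hypothetical; the file name pattern follows the `yellow_tripdata_2023-05.parquet` file used later in the Dockerfile.

```python
import argparse


def parse_args(argv=None):
    """Parse year and month; argv=None makes argparse read sys.argv."""
    parser = argparse.ArgumentParser(description="Score taxi trips for one month.")
    parser.add_argument("year", type=int, help="four-digit year, e.g. 2023")
    parser.add_argument("month", type=int, choices=range(1, 13), help="month number")
    return parser.parse_args(argv)


def input_filename(year: int, month: int) -> str:
    """Build the monthly file name from the two parameters."""
    return f"yellow_tripdata_{year:04d}-{month:02d}.parquet"


if __name__ == "__main__":
    # Hard-coded arguments for illustration; call parse_args() with no
    # arguments in the real script so it reads the command line.
    args = parse_args(["2023", "4"])
    print(input_filename(args.year, args.month))
```

The script would then be invoked as `python starter.py 2023 4`, with the print statement for the mean added after the prediction step.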


## Q6. Docker container

Finally, we'll package the script in the docker container.
For that, you'll need to use a base image that we prepared.

This is the content of this image:
```
FROM python:3.10.13-slim

WORKDIR /app
COPY [ "model2.bin", "model.bin" ]
```

Note: you don't need to run it. We have already done it.

It is pushed to [`agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim`](https://hub.docker.com/layers/agrigorev/zoomcamp-model/mlops-2024-3.10.13-slim/images/sha256-f54535b73a8c3ef91967d5588de57d4e251b22addcbbfb6e71304a91c1c7027f?context=repo),
which you need to use as your base image.

That is, your Dockerfile should start with:

```docker
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

RUN pip install -U pip
RUN pip install pipenv

WORKDIR /app

COPY ["/data/yellow_tripdata_2023-05.parquet", "/app/data/"]

COPY ["Pipfile", "Pipfile.lock", "starter.py", "./"]

RUN pipenv install --system --deploy

ENTRYPOINT ["python", "starter.py"]
```

This image already has a pickle file with a dictionary vectorizer
and a model. You will need to use them.

Important: don't copy the model to the docker image. You will need
to use the pickle file already in the image.
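
Loading the bundled file could look like the sketch below. It assumes the vectorizer and the model were pickled together as a `(dv, model)` tuple, which the homework text does not spell out. Because the image sets `WORKDIR /app` and the file is copied there as `model.bin`, a relative path resolves inside the container.

```python
import pickle


def load_model(path: str = "model.bin"):
    """Load the vectorizer and model from one pickle file.

    Assumption: the file holds a two-element tuple (dv, model).
    """
    with open(path, "rb") as f_in:
        dv, model = pickle.load(f_in)
    return dv, model
```

Unpickling a scikit-learn model requires scikit-learn to be installed in the same version that produced the file, which is why the version pin in the `Pipfile` matters.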

Now run the script with docker. What's the mean predicted duration
for May 2023?

* 0.19 [x]
* 7.24
* 14.24
* 21.19

### response

- Using only year and month as arguments seems too little, especially in this kind of "quick demo deployment" with local data, where the data is shipped inside the Docker image.
- So the input directory and the output directory were also added as arguments.
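
A sketch of what the extra arguments might look like. The flag names, the defaults, and the `predictions_` output prefix are hypothetical and may differ from the actual `starter.py` in this PR.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the period plus the directories to read from and write to."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--year", type=int, required=True)
    parser.add_argument("--month", type=int, required=True)
    parser.add_argument("--input-dir", type=Path, default=Path("data"))
    parser.add_argument("--output-dir", type=Path, default=Path("output"))
    return parser.parse_args(argv)


def resolve_paths(args):
    """Return (input_file, output_file) for the requested period."""
    period = f"{args.year:04d}-{args.month:02d}"
    input_file = args.input_dir / f"yellow_tripdata_{period}.parquet"
    output_file = args.output_dir / f"predictions_{period}.parquet"
    return input_file, output_file
```

With an `ENTRYPOINT` of `["python", "starter.py"]`, anything placed after the image name in `docker run` is passed to the script, so the flags can be supplied at run time.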


## Bonus: upload the result to the cloud (Not graded)

Just printing the mean duration inside the docker image
doesn't seem very practical. Typically, after creating the output
file, we upload it to the cloud storage.

Modify your code to upload the parquet file to S3/GCS/etc.


- This is quite simple to do, for example with S3.
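
A minimal sketch of an S3 upload, assuming `boto3` is installed and AWS credentials are configured. The bucket name, key layout, and helper names are hypothetical. The client is passed in as a parameter so the helper can be exercised without network access.

```python
def s3_key(year: int, month: int, prefix: str = "predictions") -> str:
    """Build an object key such as 'predictions/2023-05.parquet'."""
    return f"{prefix}/{year:04d}-{month:02d}.parquet"


def upload_result(s3_client, local_path: str, bucket: str, key: str) -> str:
    """Upload a local file and return its s3:// URI.

    s3_client is expected to be a boto3 S3 client, whose
    upload_file(Filename, Bucket, Key) method does the transfer.
    """
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage with real credentials (not executed here):
#   import boto3
#   upload_result(boto3.client("s3"), "output.parquet",
#                 "my-bucket", s3_key(2023, 5))
```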




## Bonus: Use Mage for batch inference

Here we didn't use any orchestration. In practice we usually do.

* Split the code into logical code blocks
* Use Mage to orchestrate the execution


1. Reuse the Mage setup from the previous homework: `cd` into the `mlops` folder again.
2. Run `./scripts/start.sh`.
3. Open `http://localhost:6789/` to access the Mage UI.
4. Add the arguments (year and month of the data) to the pipeline as global variables.

![global vars to setup year, month of data](image.png)
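
Inside a block, the global variables set above can then be read from the keyword arguments, assuming Mage passes them to block functions as `kwargs`. The sketch below is a plain function so it runs standalone; in a Mage project it would carry the `@data_loader` decorator, and the function name and defaults here are hypothetical.

```python
def load_input_path(*args, **kwargs):
    """Build the input file name from pipeline-level variables.

    Assumption: 'year' and 'month' arrive as keyword arguments,
    with fallbacks used when a variable is not set.
    """
    year = int(kwargs.get("year", 2023))
    month = int(kwargs.get("month", 3))
    return f"yellow_tripdata_{year:04d}-{month:02d}.parquet"


print(load_input_path(year=2023, month=5))
```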

## Publishing the image to Docker Hub

This is how we published the image to Docker Hub:

```bash
docker build -t mlops-zoomcamp-model:2024-3.10.13-slim .
docker tag mlops-zoomcamp-model:2024-3.10.13-slim agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim

docker login --username USERNAME
docker push agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim
```

This is just for your reference; you don't need to do it.


## Submit the results

* Submit your results here: https://courses.datatalks.club/mlops-zoomcamp-2024/homework/hw4
* It's possible that your answers won't match exactly. If it's the case, select the closest one.
13 changes: 13 additions & 0 deletions homework/04-deployment/homework/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
FROM agrigorev/zoomcamp-model:mlops-2024-3.10.13-slim
RUN pip install -U pip
RUN pip install pipenv

WORKDIR /app

COPY ["/data/yellow_tripdata_2023-05.parquet", "/app/data/"]

COPY ["Pipfile", "Pipfile.lock", "starter.py", "./"]

RUN pipenv install --system --deploy

ENTRYPOINT ["python", "starter.py"]
15 changes: 15 additions & 0 deletions homework/04-deployment/homework/Pipfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
scikit-learn = "==1.5.0"
pandas = "*"
pyarrow = "*"
fastparquet = "*"

[dev-packages]

[requires]
python_version = "3.11"