use iac config for scaledown cron #347

Merged (21 commits) on Nov 21, 2023
2 changes: 1 addition & 1 deletion chart-infra/templates/scale-to-zero.yaml
@@ -1,4 +1,4 @@
-{{- if (eq .Values.kubernetes.env "staging") }}
+{{- if or (eq .Values.kubernetes.env "staging") (eq .Values.myAccount.accountId "160782110667") }}
apiVersion: batch/v1
Contributor Author: enabling cron for production in HMS

kind: CronJob
metadata:
5 changes: 2 additions & 3 deletions python/Dockerfile
@@ -1,13 +1,12 @@
# pull official base image
-FROM python:3.8-buster AS builder
+FROM python:3.8-slim-buster AS builder
Contributor Author: smaller base image


# set working directory
WORKDIR /src

# install app dependencies
COPY requirements.txt ./
-RUN pip3 install -r requirements.txt
+RUN pip3 install --no-cache-dir -r requirements.txt
Contributor Author: smaller layer

Overall, the compressed image for the python container in ECR goes from ~630 MB to ~100 MB.


# ---------------------------------------------------
# PRODUCTION BUILD
1 change: 0 additions & 1 deletion python/requirements.txt
@@ -16,7 +16,6 @@ PyYAML==5.4
requests==2.24.0
responses==0.12.0
sinfo==0.3.1
-umap-learn==0.4.6
Contributor Author: not needed on the python image

Member: I just saw this dependency in an unrelated context and was confused by it. I suppose it's a vestigial remnant.

Contributor Author: yeah, there are probably others that may not be needed either, but it doesn't seem so important since the image size will be ~100 MB now

watchdog==0.10.3
backoff==1.10.0
socket_io_emitter==0.1.5.1
2 changes: 2 additions & 0 deletions python/src/worker/__main__.py
@@ -23,6 +23,8 @@
def main():
    if config.IGNORE_TIMEOUT:
        info("Worker configured to ignore timeout, will run forever...")
+    else:
+        info(f"Worker timeout is {config.TIMEOUT} (s)...")

    while not config.EXPERIMENT_ID:
        info("Experiment not yet assigned, waiting...")
10 changes: 3 additions & 7 deletions python/src/worker/config/__init__.py
@@ -2,20 +2,16 @@
import re
import types
from functools import cached_property
+from .domain_specific import get_domain_specific

import boto3
import redis
from aws_xray_sdk import core, global_sdk_config


kube_env = os.getenv("K8S_ENV")
cluster_env = os.getenv("CLUSTER_ENV")

-timeout = int(
-    os.getenv(
-        "WORK_TIMEOUT",
-        default=str(60 * 10),
-    )
-)
+timeout = get_domain_specific().get(kube_env, {}).get('timeout', 10 * 60)

ignore_timeout = os.getenv("IGNORE_TIMEOUT") == "true"
aws_account_id = os.getenv("AWS_ACCOUNT_ID")
23 changes: 23 additions & 0 deletions python/src/worker/config/domain_specific.py
@@ -0,0 +1,23 @@
import os

ACCOUNT_ID = {
    'BIOMAGE': '242905224710',
    'HMS': '160782110667',
}

domain_specific = {
    'HMS': {'production': {'timeout': 40 * 60}},
    'BIOMAGE': {},
    'BIOMAGE_PRIVATE': {}
}


def get_domain_specific():
    aws_account_id = os.getenv('AWS_ACCOUNT_ID')

    if aws_account_id == ACCOUNT_ID['HMS']:
        return domain_specific['HMS']
    elif aws_account_id == ACCOUNT_ID['BIOMAGE']:
        return domain_specific['BIOMAGE']
    else:
        return domain_specific['BIOMAGE_PRIVATE']
Contributor Author (on lines +1 to +23): modeled after getDomainSpecificContent.js in api
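
For reference, here is a minimal sketch (not part of this diff) of how the new module combines with the lookup in config/__init__.py to pick the worker timeout; resolve_timeout is a hypothetical helper used only for illustration:

# Hypothetical helper mirroring the one-line lookup in config/__init__.py:
# per-environment lookup on the domain config, with a 10-minute default.
def resolve_timeout(domain_config, kube_env):
    return domain_config.get(kube_env, {}).get('timeout', 10 * 60)

# HMS production (AWS_ACCOUNT_ID == '160782110667', K8S_ENV == 'production') gets 40 minutes
print(resolve_timeout({'production': {'timeout': 40 * 60}}, 'production'))  # 2400
# HMS staging has no override, so it falls back to the default
print(resolve_timeout({'production': {'timeout': 40 * 60}}, 'staging'))     # 600
# BIOMAGE and BIOMAGE_PRIVATE domain configs are empty, so they always get the default
print(resolve_timeout({}, 'production'))                                    # 600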

19 changes: 6 additions & 13 deletions r/Dockerfile
@@ -34,14 +34,6 @@ RUN Rscript restore_renv.R
# deactivate so that above .libPaths used
RUN R -q -e "renv::deactivate()"

-# install miniconda and python umap-learn for RunUMAP
-# clean conda
-ENV RETICULATE_MINICONDA_PATH=/src/r-miniconda
-RUN R -q -e "reticulate::install_miniconda()" && \
-    R -q -e "reticulate::conda_install(packages = 'umap-learn=0.5.3', python_version='3.10')" && \
-    CONDA_PATH=$(R -q -s -e "cat(reticulate::conda_binary())") && \
-    $CONDA_PATH clean --force-pkgs-dirs -y

# use renv::snapshot() while R dependency updates are quick to build
COPY ./renv.lock .
RUN Rscript restore_renv.R
@@ -63,7 +55,6 @@ RUN Rscript check_package_licenses.R
# ---------------------------------------------------
FROM rocker/r-ver:4.2.2 AS common
WORKDIR /src/worker
-ENV RETICULATE_MINICONDA_PATH=/src/r-miniconda

# get source code and R packages
COPY --from=builder /src /src
@@ -73,14 +64,16 @@ ENV RENV_LIB=/src/lib
RUN echo ".libPaths(c('$RENV_LIB', .libPaths()))" >> $(R RHOME)/etc/Rprofile.site

# install runtime system deps
+# python3-dev prevents 'reticulate can only bind to copies of Python built with --enable-shared'
# cleanup setup files
RUN echo "python3-pip" >> sysdeps_run.txt && \
RUN echo "python3-pip python3-venv python3-dev" >> sysdeps_run.txt && \
Member: why not declare these in sysdeps_build_debian.txt instead of the dynamically generated sysdeps_run.txt?

alexvpickering (Contributor Author, Oct 20, 2023): sysdeps_build_debian.txt lists the dependencies for building the R packages. Once they have been built, the run-time system dependencies for the R packages are determined by get_sysdeps_run.R. This strategy reduces the overall image size. There are, however, some additional dependencies that are necessary for non-R-package things:

  • python3-pip for the development build (not needed for prod, but installed here for code simplicity)
  • python3-venv to set up the venv for umap-learn
  • python3-dev to have a copy of python that reticulate can use

gerbeldo (Member, Oct 20, 2023): I tried it in the pipeline PR and it does not work (a pip install fails due to not finding pip), for some reason I don't understand.

Contributor Author: I'll take a look in the pipeline and see if I can get it set up as well.

Contributor Author: Not tested, but it builds for me: hms-dbmi-cellenics/pipeline#340

gerbeldo (Member, Oct 24, 2023): Thanks, I'll test the pipeline PR. Regarding build-time dependencies, are they removed downstream, after building the R packages? That relates to my initial question: why do we need python3-pip installed twice?

alexvpickering (Contributor Author, Oct 24, 2023): Yes, the common build doesn't start from the end of the builder, but rather from a fresh rocker/r-ver. It transfers over the already-built packages etc. (COPY --from=builder /src /src) and then installs only the system dependencies that are needed to run packages (less hefty than the packages needed to build all the R/python packages). The approach is a common one to reduce the size of Docker images.

cat sysdeps_run.txt | xargs ./install_debian_packages.sh && \
rm -rf *

-# link system libstdc++ to conda installed version
-RUN rm /usr/lib/x86_64-linux-gnu/libstdc++.so.6 && \
-    ln -s /src/r-miniconda/envs/r-reticulate/lib/libstdc++.so.6 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
alexvpickering (Contributor Author, Oct 19, 2023, on lines -81 to -83): I don't think any equivalent of this is necessary anymore, as libstdc++.so.6 isn't in the environment folder. I can't remember exactly what error, under what conditions, this fixed, but RunUMAP works fine.

+# install umap-learn for RunUMAP
+ENV WORKON_HOME=/src/.virtualenvs
+RUN R -q -e "reticulate::virtualenv_create('r-reticulate')" && \
+    R -q -e "reticulate::virtualenv_install('r-reticulate', 'umap-learn==0.5.3', pip_options='--no-cache-dir')"
alexvpickering (Contributor Author, Oct 18, 2023, on lines +73 to +76): The motivation is reducing image size (before: ~1010 MB, after: ~884 MB), which speeds up the worker pod becoming ready. It's also a bit simpler. Together with the python image size reduction (~530 MB), worker ready time is down from ~2:30 to ~1:50.

Member: Finally removing conda! I tried something similar months ago but got stuck. We can port this to the pipeline as well.


# ---------------------------------------------------
# PRODUCTION BUILD