Docker-compose improvements #92

sandertan · 2022-10-28T13:21:34Z

⚠️ This currently breaks docker-compose files other than docker-compose-dev.yml

Hi @tomolopolis, thanks a lot for all the recent updates to MedCAT Trainer. Especially the addition of Solr for the concept lookup looks nice!

We're starting a new annotation project for HPO concepts in Dutch text, and I had to make some changes to make the deployment fit our use case. It would be nice to have our changes, which add configuration options, in the master branch; this'll make it easier for us to update MedCAT trainer in the future.

Included in this PR

There are a number of docker-compose-files in this repository, some of which don't work anymore with current master I think. The only compose file that builds a new Dockerfile is docker-compose-dev.yml, so I made my changes only there. What is your plan with the other compose files? Having multiple compose files is difficult to maintain. I can add these changes to other compose files if you want.
I added an .env-example-file and documentation on how this should be used with Docker Compose. This makes it relatively easy to pass on environment variables to docker-compose, similar to what I added for CogStack-Nifi. For maintainers it would be nice to have the least amount of places where configuration can be changed, but I think this file is needed to configure variables at compose-level.
I made the spaCy model configurable, similar to what I did for MedCAT-Service. This is especially useful for non-English languages. I've set the default to the current en_core_web_md. Alternatively, we can set the default to all 3 (sm, md & lg) English language models.
By first copying requirements.txt and then copying the other backend files, Docker can use the cache during building. When nothing is changed to the requirements file, the cached step is used, which greatly reduces the time it tames to build the image.
I made the exposed Solr port configurable, similar to MCTRAINER_PORT.
I also cleaned up some unnecessary components:
- Remove expose: , as this does not do anything anymore (see last edit at https://stackoverflow.com/a/40801773/4141535)
- Removed the container_name for Solr. Within the docker compose network, the containers can find each other using the service name.

Not included in this PR

I did not update the other compose files to work with these changes. I didn't want to start updating those before discussing these changes with you.

Nice to have in future

It would be nice to have an option to disable loading the example data (

MedCATtrainer/webapp/run.sh

Line 19 in 296d4d7

python /home/load_examples.py &

) . When using a different concept universe (for example SNOMED or HPO, or a different UMLS language), I think it's preferred to not have the example data in the instance.
It would also be nice to clean up the how the environment variables for layers lower than the docker-compose are used. Currently they are passed on both using envs/env-file and environment: in the compose. There's also some discrepancy in how different compose files do this. We could move the environment variables from envs/env to the proposed .env-example-file and load them in the container with something like:

environment:
  - MEDCAT_CONFIG_FILE=${MEDCAT_CONFIG_FILE}

It would also be nice to make it more clear how the config for MedCAT is used (configs/base.txt). With the recent versions of MedCAT containing the config within the CDB file, do you agree it would be easier to use this as the default? This is important for using non-English models, because it can set the spaCy model.
The location of the Solr instance is currently configurable at docker-compose level. Is this really necessary? If it's always running along with the other 2 components of this Docker compose, it's always running at solr:8983 within the Docker network.
I think the example project is currently not working. I commented it out in my local fork to get the Dutch instance up, but when I tested this repo with current master branch, the example load process reported some errors.
Perhaps building the Docker image can be faster by using Docker's multi-stage build functionality (https://docs.docker.com/build/building/multi-stage/) with a node.js image.

Let me know what you like/dislike. I can make separate GitHub issues of the nice to have-items if you prefer.

tomolopolis

@sandertan - changes look great!

Good stuff consolidating all the config into a single place and removing unnecessary settings. The default docker-compose.yml is supposed to use the latest stable release image with default settings. If you have time you can update that compose file also, otherwise I'll hopefully get to it sometime this week.

tomolopolis · 2022-11-22T13:24:02Z

for the nice to haves - lets get these current changes in an cleanup the compose files, and address each one of these nice to haves via gh issues - we can then split the work thanks!

sandertan · 2022-11-23T16:18:50Z

@tomolopolis I added the changes to docker-compose.yml. I also bumped the minor release to future-version 2.3.8 because it's incompatible with current 2.3.7. Before merging, perhaps you can check out this branch, build the MCT & NGINX images, push to Docker Hub as 2.3.8, and see if docker-compose.yml works?

tomolopolis · 2022-11-25T11:12:04Z

I've rebuilt locally on your branch, looks fine - thanks!

sandertan · 2022-11-25T12:08:25Z

@tomolopolis I moved some nice to haves to separate issues. I'm not sure using multi-stage docker builds would be a significant improvement (https://docs.docker.com/build/building/multi-stage/, https://www.snopoke.com/2020/11/10/multi-stage-docker-build/ and https://stackoverflow.com/q/54246600/4141535), so I did not add that.

tomolopolis · 2022-11-25T13:09:59Z

@sandertan - thanks!

a multi-stage build should reduce the size a bit, I might experiment and see if it's worthwhile.

vvcb · 2023-03-26T20:30:23Z

@tomolopolis @sandertan , I have been experimenting with optimising docker-compose.yml as part of the work I am doing to port this to k8s using Helm. This may also support #124 if MedCATtrainer itself supports multiple users annotating simultaneously.

One of the first things I have been looking at is https://github.com/CogStack/MedCATtrainer/blob/master/webapp/Dockerfile. So far I have managed to drop the uncompressed size of the image from 7.71GB to 6.13GB and there are more optimisations to be had.

I am yet to test if everything works in spite of these modifications - and will do that in the coming days/weeks.

The following are some of the findings that may help with your experiments as well.

Multi-stage build

The frontend build lends itself nicely to a separate stage. This saves ~450MB by avoiding everything we don't need in frontend/ except the newly built dist folder but more importantly avoids bringing the massive node_modules folder into the final build. This is how I did it.

FROM node:16-slim as node_image
WORKDIR /home
COPY ./frontend /home/frontend
RUN cd /home/frontend \
    && npm install -g npm@latest \
    && npm install \
    && npm run build

and then later on just copy the dist folder.

FROM python:3.10

# do other stuff here

COPY --from=node_image  /home/frontend/dist /home/frontend/dist

# continue more layers

Omit pip cache

The following lines retain ~900MB of downloaded pip packages in cache.

# Install requirements for backend
WORKDIR /home/
RUN pip install pip --upgrade 
RUN pip install -r requirements.txt
ARG SPACY_MODELS="en_core_web_md"
RUN for SPACY_MODEL in ${SPACY_MODELS}; do python -m spacy download ${SPACY_MODEL}; done

This can be improved as follows:

ARG SPACY_MODELS="en_core_web_md"
RUN pip install pip --upgrade --no-cache-dir \
    && pip install -r requirements.txt --no-cache-dir \
    && for SPACY_MODEL in ${SPACY_MODELS}; do python -m spacy download ${SPACY_MODEL}; done

This improves the size of this layer by 900MB from 6.1GB to 5.2GB.

Other improvements

I guess each of these could be raised as a separate issue - but in keeping with the title of this PR, I will list the proposals here and we can spin off into separate issues if necessary.

Run as non-root

It looks like everything in the container gets installed as root. This is generally discouraged. Is there any reason why we can't do something like below? I am still experimenting with this - including doing the pip install step as appuser and am running into wierd permission issues from django and manage.py. I know that this should work in principle as it is a standard step in FastAPI containers that I have deployed to k8s. But, I have never worked with django before and am biting off more than I can chew.

RUN groupadd -g 999 appuser \
    && useradd -r -u 999 -g appuser appuser

USER appuser
# Copy project
WORKDIR /home
# See below for why this line needs modifying
COPY --chown=appuser:appuser . .
COPY --from=node_image --chown=appuser:appuser  /home/frontend/dist /home/frontend/dist

Django Admin user creation

Currently the Django admin user gets created with credentials in plain text in run.sh. It will be good to have the option of reading this from an environment variable and defaulting to a generic password otherwise.

Copy only required folders/files

Currently all contents under webapp/ are copied into the container. If we are using multi-stage build, then copying everything under webapp/frontend/ is unnecessary.

Environment variables

Currently environment variables in docker-compose require the presence of a envs/env file. Would it be better to specify these in the container itself? For the Helm chart, the user will be able to set this in the values.yaml file when doing a helm install.

sandertan · 2023-04-03T09:41:18Z

@vvcb Thanks for the extensive list of suggestions! I'm currently less involved in the this project at our institute, so I'll let Tom respond to your PR.

sandertan added 5 commits October 28, 2022 11:56

Improve docker-compose; Make spacy models configurable

36d62ab

Add documentation; Small fix to Solr container name

04ea2d0

Restructure Dockerfile to use cache more often

b604884

Add IntelliJ's project folder to gitignore

9428b3f

Solve issue with incorrectly copying src files

c320d3f

tomolopolis approved these changes Nov 22, 2022

View reviewed changes

Sync primary docker-compose with docker-compose-dev

924726f

tomolopolis merged commit 9c483f7 into CogStack:master Nov 25, 2022

vvcb mentioned this pull request Mar 26, 2023

Optimised multi-stage docker build for cogstacksystems/medcat-trainer #127

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Docker-compose improvements #92

Docker-compose improvements #92

sandertan commented Oct 28, 2022 •

edited

Loading

tomolopolis left a comment

tomolopolis commented Nov 22, 2022

sandertan commented Nov 23, 2022

tomolopolis commented Nov 25, 2022

sandertan commented Nov 25, 2022

tomolopolis commented Nov 25, 2022

vvcb commented Mar 26, 2023

sandertan commented Apr 3, 2023

Docker-compose improvements #92

Docker-compose improvements #92

Conversation

sandertan commented Oct 28, 2022 • edited Loading

Included in this PR

Not included in this PR

Nice to have in future

tomolopolis left a comment

Choose a reason for hiding this comment

tomolopolis commented Nov 22, 2022

sandertan commented Nov 23, 2022

tomolopolis commented Nov 25, 2022

sandertan commented Nov 25, 2022

tomolopolis commented Nov 25, 2022

vvcb commented Mar 26, 2023

Multi-stage build

Omit pip cache

Other improvements

Run as non-root

Django Admin user creation

Copy only required folders/files

Environment variables

sandertan commented Apr 3, 2023

sandertan commented Oct 28, 2022 •

edited

Loading