Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker-compose improvements #92

Merged
merged 6 commits into from
Nov 25, 2022
Merged

Docker-compose improvements #92

merged 6 commits into from
Nov 25, 2022

Conversation

sandertan
Copy link
Contributor

@sandertan sandertan commented Oct 28, 2022

⚠️ This currently breaks docker-compose files other than docker-compose-dev.yml

Hi @tomolopolis, thanks a lot for all the recent updates to MedCAT Trainer. Especially the addition of Solr for the concept lookup looks nice!

We're starting a new annotation project for HPO concepts in Dutch text, and I had to make some changes to make the deployment fit our use case. It would be nice to have our changes, which add configuration options, in the master branch; this'll make it easier for us to update MedCAT trainer in the future.

Included in this PR

  • There are a number of docker-compose-files in this repository, some of which don't work anymore with current master I think. The only compose file that builds a new Dockerfile is docker-compose-dev.yml, so I made my changes only there. What is your plan with the other compose files? Having multiple compose files is difficult to maintain. I can add these changes to other compose files if you want.
  • I added an .env-example-file and documentation on how this should be used with Docker Compose. This makes it relatively easy to pass on environment variables to docker-compose, similar to what I added for CogStack-Nifi. For maintainers it would be nice to have the least amount of places where configuration can be changed, but I think this file is needed to configure variables at compose-level.
  • I made the spaCy model configurable, similar to what I did for MedCAT-Service. This is especially useful for non-English languages. I've set the default to the current en_core_web_md. Alternatively, we can set the default to all 3 (sm, md & lg) English language models.
  • By first copying requirements.txt and then copying the other backend files, Docker can use the cache during building. When nothing is changed to the requirements file, the cached step is used, which greatly reduces the time it tames to build the image.
  • I made the exposed Solr port configurable, similar to MCTRAINER_PORT.
  • I also cleaned up some unnecessary components:
    • Remove expose: , as this does not do anything anymore (see last edit at https://stackoverflow.com/a/40801773/4141535)
    • Removed the container_name for Solr. Within the docker compose network, the containers can find each other using the service name.

Not included in this PR

  • I did not update the other compose files to work with these changes. I didn't want to start updating those before discussing these changes with you.

Nice to have in future

  • It would be nice to have an option to disable loading the example data (
    python /home/load_examples.py &
    ) . When using a different concept universe (for example SNOMED or HPO, or a different UMLS language), I think it's preferred to not have the example data in the instance.
  • It would also be nice to clean up the how the environment variables for layers lower than the docker-compose are used. Currently they are passed on both using envs/env-file and environment: in the compose. There's also some discrepancy in how different compose files do this. We could move the environment variables from envs/env to the proposed .env-example-file and load them in the container with something like:
environment:
  - MEDCAT_CONFIG_FILE=${MEDCAT_CONFIG_FILE}
  • It would also be nice to make it more clear how the config for MedCAT is used (configs/base.txt). With the recent versions of MedCAT containing the config within the CDB file, do you agree it would be easier to use this as the default? This is important for using non-English models, because it can set the spaCy model.
  • The location of the Solr instance is currently configurable at docker-compose level. Is this really necessary? If it's always running along with the other 2 components of this Docker compose, it's always running at solr:8983 within the Docker network.
  • I think the example project is currently not working. I commented it out in my local fork to get the Dutch instance up, but when I tested this repo with current master branch, the example load process reported some errors.
  • Perhaps building the Docker image can be faster by using Docker's multi-stage build functionality (https://docs.docker.com/build/building/multi-stage/) with a node.js image.

Let me know what you like/dislike. I can make separate GitHub issues of the nice to have-items if you prefer.

Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sandertan - changes look great!

Good stuff consolidating all the config into a single place and removing unnecessary settings. The default docker-compose.yml is supposed to use the latest stable release image with default settings. If you have time you can update that compose file also, otherwise I'll hopefully get to it sometime this week.

@tomolopolis
Copy link
Member

for the nice to haves - lets get these current changes in an cleanup the compose files, and address each one of these nice to haves via gh issues - we can then split the work thanks!

@sandertan
Copy link
Contributor Author

@tomolopolis I added the changes to docker-compose.yml. I also bumped the minor release to future-version 2.3.8 because it's incompatible with current 2.3.7. Before merging, perhaps you can check out this branch, build the MCT & NGINX images, push to Docker Hub as 2.3.8, and see if docker-compose.yml works?

@tomolopolis
Copy link
Member

I've rebuilt locally on your branch, looks fine - thanks!

@tomolopolis tomolopolis merged commit 9c483f7 into CogStack:master Nov 25, 2022
@sandertan
Copy link
Contributor Author

@tomolopolis I moved some nice to haves to separate issues. I'm not sure using multi-stage docker builds would be a significant improvement (https://docs.docker.com/build/building/multi-stage/, https://www.snopoke.com/2020/11/10/multi-stage-docker-build/ and https://stackoverflow.com/q/54246600/4141535), so I did not add that.

@tomolopolis
Copy link
Member

@sandertan - thanks!

a multi-stage build should reduce the size a bit, I might experiment and see if it's worthwhile.

@vvcb
Copy link

vvcb commented Mar 26, 2023

@tomolopolis @sandertan , I have been experimenting with optimising docker-compose.yml as part of the work I am doing to port this to k8s using Helm. This may also support #124 if MedCATtrainer itself supports multiple users annotating simultaneously.

One of the first things I have been looking at is https://github.com/CogStack/MedCATtrainer/blob/master/webapp/Dockerfile. So far I have managed to drop the uncompressed size of the image from 7.71GB to 6.13GB and there are more optimisations to be had.

I am yet to test if everything works in spite of these modifications - and will do that in the coming days/weeks.

The following are some of the findings that may help with your experiments as well.

Multi-stage build

The frontend build lends itself nicely to a separate stage. This saves ~450MB by avoiding everything we don't need in frontend/ except the newly built dist folder but more importantly avoids bringing the massive node_modules folder into the final build. This is how I did it.

FROM node:16-slim as node_image
WORKDIR /home
COPY ./frontend /home/frontend
RUN cd /home/frontend \
    && npm install -g npm@latest \
    && npm install \
    && npm run build

and then later on just copy the dist folder.

FROM python:3.10

# do other stuff here

COPY --from=node_image  /home/frontend/dist /home/frontend/dist

# continue more layers

Omit pip cache

The following lines retain ~900MB of downloaded pip packages in cache.

# Install requirements for backend
WORKDIR /home/
RUN pip install pip --upgrade 
RUN pip install -r requirements.txt
ARG SPACY_MODELS="en_core_web_md"
RUN for SPACY_MODEL in ${SPACY_MODELS}; do python -m spacy download ${SPACY_MODEL}; done

This can be improved as follows:

ARG SPACY_MODELS="en_core_web_md"
RUN pip install pip --upgrade --no-cache-dir \
    && pip install -r requirements.txt --no-cache-dir \
    && for SPACY_MODEL in ${SPACY_MODELS}; do python -m spacy download ${SPACY_MODEL}; done

This improves the size of this layer by 900MB from 6.1GB to 5.2GB.

Other improvements

I guess each of these could be raised as a separate issue - but in keeping with the title of this PR, I will list the proposals here and we can spin off into separate issues if necessary.

Run as non-root

It looks like everything in the container gets installed as root. This is generally discouraged. Is there any reason why we can't do something like below? I am still experimenting with this - including doing the pip install step as appuser and am running into wierd permission issues from django and manage.py. I know that this should work in principle as it is a standard step in FastAPI containers that I have deployed to k8s. But, I have never worked with django before and am biting off more than I can chew.

RUN groupadd -g 999 appuser \
    && useradd -r -u 999 -g appuser appuser

USER appuser
# Copy project
WORKDIR /home
# See below for why this line needs modifying
COPY --chown=appuser:appuser . .
COPY --from=node_image --chown=appuser:appuser  /home/frontend/dist /home/frontend/dist

Django Admin user creation

Currently the Django admin user gets created with credentials in plain text in run.sh. It will be good to have the option of reading this from an environment variable and defaulting to a generic password otherwise.

Copy only required folders/files

Currently all contents under webapp/ are copied into the container. If we are using multi-stage build, then copying everything under webapp/frontend/ is unnecessary.

Environment variables

Currently environment variables in docker-compose require the presence of a envs/env file. Would it be better to specify these in the container itself? For the Helm chart, the user will be able to set this in the values.yaml file when doing a helm install.

@sandertan
Copy link
Contributor Author

@vvcb Thanks for the extensive list of suggestions! I'm currently less involved in the this project at our institute, so I'll let Tom respond to your PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants