This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

[BBS-292] Release v0.1.0 #265

Merged
merged 14 commits into from
Mar 4, 2021

Conversation

FrancescoCasalegno
Contributor

@FrancescoCasalegno FrancescoCasalegno commented Mar 2, 2021

Fixes BBS-292.

Description

  1. Remove GitHub Actions workflow to automatically upload to PyPI. This was done after discussing with @jankrepl about pros and cons of this choice. See Remove GitHub Actions workflow for publishing on PyPI atlas-alignment#19 for the rationale.
  2. DVC Dockerfiles use the versions specified in requirements.txt. This change improves the reproducibility of results with DVC. Before this, we were just using the dependencies specified in setup.py, which means that there was no guarantee on the version of each dependency used to run the DVC pipelines, and results could have changed slightly.
  3. Upgrade prodigy version to fix this error.
  4. Run dvc repro (and dvc push to update the remote storage).
  5. DVC-untrack BSV model. This reduces the .dvc/cache to only 4.4 GiB.
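
As a rough illustration of the reproducibility point in item 2, the sketch below lists requirement lines that lack an exact `==` pin. The helper name `unpinned_requirements` is hypothetical, and the sketch assumes a plain requirements.txt; option lines such as `-r other.txt` would also be reported.

```shell
# Hypothetical helper: print every requirement that is not pinned with "==".
unpinned_requirements() {
    requirements_file="${1:-requirements.txt}"
    # Drop blank lines and comments, then keep lines lacking an exact "==" pin.
    # "|| true" keeps the exit status at 0 when every line is pinned.
    grep -Ev '^[[:space:]]*(#|$)' "$requirements_file" | grep -v '==' || true
}
```

Running `unpinned_requirements requirements.txt` prints nothing when every dependency is pinned to an exact version.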

How to test?

  1. Get Prodigy wheel.
    cp /raid/prodigy_downloads/1.10.7/prodigy-1.10.7-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl path/to/Search
  2. Get images for running the ner and sentemb pipelines. You have two options.
    1. Option 1: You can use the images I already built on DGX1: bbs_dvc_ner:v0.1.0 and bbs_dvc_sentemb:v0.1.0. The following steps assume this option; otherwise, replace the image names in the commands below.
    2. Option 2: You can build the images yourself:
      • docker build -f data_and_models/pipelines/ner/Dockerfile -t <ner_image_name> .
      • docker build -f data_and_models/pipelines/sentence_embedding/Dockerfile -t <sentemb_image_name> .
  3. Reproduce the ner pipeline.
    1. Get all dvc files: dvc pull
    2. Run container: docker run --rm -it <your_options> --name <ner_container_name> bbs_dvc_ner:v0.1.0
    3. Move to directory: cd data_and_models/pipelines/ner/
    4. Reproduce pipeline: dvc repro (or dvc repro -f if you have some time and want to be 100% sure)
    5. Check that nothing changed: git diff should be empty
  4. Reproduce the sentemb pipeline.
    1. Get all dvc files: dvc pull (if you already ran the ner pipeline, this step can be skipped)
    2. Run container: docker run --rm -it <your_options> --name <sentemb_container_name> bbs_dvc_sentemb:v0.1.0
    3. Move to directory: cd data_and_models/pipelines/sentence_embedding/
    4. Reproduce pipeline: dvc repro (or dvc repro -f if you have some time and want to be 100% sure)
    5. Check that nothing changed: git diff should be empty
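
The in-container part of steps 3 and 4 (move to the pipeline directory, reproduce, check the diff) could be scripted roughly as below. This is only a sketch: the function name `reproduce_pipeline` and its `--force` argument are hypothetical, and it assumes `dvc pull` was already run and that `dvc` and `git` are on the PATH inside the container.

```shell
# Hypothetical helper: reproduce one DVC pipeline and verify nothing changed.
# Usage: reproduce_pipeline <pipeline_dir> [--force]
reproduce_pipeline() {
    (
        # The subshell keeps the caller's working directory untouched.
        cd "$1" || exit 1
        if [ "${2:-}" = "--force" ]; then
            dvc repro -f || exit 1
        else
            dvc repro || exit 1
        fi
        # git diff --quiet exits non-zero when tracked files differ.
        if git diff --quiet; then
            echo "OK: no changes after dvc repro"
        else
            echo "FAIL: git diff is not empty" >&2
            exit 1
        fi
    )
}
```

For example, `reproduce_pipeline data_and_models/pipelines/ner/ --force` would cover the ner pipeline, and the same call with `data_and_models/pipelines/sentence_embedding/` the sentemb one.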

Checklist

  • This PR refers to an issue from the issue tracker.
    (if it is not the case, please create an issue first).
  • Documentation and whatsnew.rst updated.
    (if needed)
  • All CI tests pass.

@FrancescoCasalegno FrancescoCasalegno marked this pull request as ready for review March 3, 2021 10:35
@pafonta
Contributor

pafonta commented Mar 3, 2021

It seems that

docker build -f data_and_models/pipelines/sentemb/Dockerfile -t <sentemb_image_name> .

should instead be

docker build -f data_and_models/pipelines/sentence_embedding/Dockerfile -t <sentemb_image_name> .

in the test instructions.

@jankrepl
Contributor

jankrepl commented Mar 3, 2021

When building the NER image I get the following error on the line RUN pip install -r requirements.txt

ERROR: Cannot uninstall 'ruamel-yaml'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

The internet suggests using --ignore-installed; however, I am not sure how reproducible this error is.

@FrancescoCasalegno
Contributor Author

@jankrepl I have rebuilt the image right now from scratch (using a fresh clone of this repo) and I cannot reproduce the error with RUN pip install -r requirements.txt that you mention. Could you please give me some more details about it?

Maybe a fresh git clone could help?

@pafonta
Contributor

pafonta commented Mar 3, 2021

Hello @FrancescoCasalegno !

I cannot run dvc repro inside the container. I have tried with

/Search/data_and_models/pipelines/ner# dvc repro --pull

but I get the error

Verifying data sources in stage: '../../annotations/ner/annotations10_EmmanuelleLogette_2020-08-28_raw1_raw5_10EntityTypes.jsonl.dvc'
ERROR: failed to reproduce '../../annotations/ner/annotations10_EmmanuelleLogette_2020-08-28_raw1_raw5_10EntityTypes.jsonl.dvc': missing data 'source': ../../annotations/ner/annotations10_EmmanuelleLogette_2020-08-28_raw1_raw5_10EntityTypes.jsonl

@jankrepl
Contributor

jankrepl commented Mar 3, 2021

Hello @FrancescoCasalegno !

I cannot run dvc repro inside the container. I have tried with

/Search/data_and_models/pipelines/ner# dvc repro --pull

but I get the error

Verifying data sources in stage: '../../annotations/ner/annotations10_EmmanuelleLogette_2020-08-28_raw1_raw5_10EntityTypes.jsonl.dvc'
ERROR: failed to reproduce '../../annotations/ner/annotations10_EmmanuelleLogette_2020-08-28_raw1_raw5_10EntityTypes.jsonl.dvc': missing data 'source': ../../annotations/ner/annotations10_EmmanuelleLogette_2020-08-28_raw1_raw5_10EntityTypes.jsonl

I got exactly the same error.

@jankrepl
Contributor

jankrepl commented Mar 3, 2021

@jankrepl I have rebuilt the image right now from scratch (using a fresh clone of this repo) and I cannot reproduce the error with RUN pip install -r requirements.txt that you mention. Could you please give me some more details about it?

Maybe a fresh git clone could help?

As discussed, I got this error when using the most recent continuumio/miniconda3 with digest sha 7838d0ce65783b0d944c19d193e2e6232196bada9e5f3762dc7a9f07dc271179

@FrancescoCasalegno
Contributor Author

As discussed, I got this error when using the most recent continuumio/miniconda3 with digest sha 7838d0ce65783b0d944c19d193e2e6232196bada9e5f3762dc7a9f07dc271179

Thanks, I now also see the error! This seems to be a known issue, and reading this SO post as well as this GH issue I think we only have 3 options:

  1. Manually downgrade pip to <10
  2. Manually rm -rf the ruamel install
  3. pip install with --ignore-installed

I think the third option seems the least ugly, I am now testing if it works fine.
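
For reference, option 3 could look roughly like the sketch below. The wrapper name `install_requirements` is hypothetical; the only real piece is pip's `--ignore-installed` flag, which installs over existing packages without trying to uninstall them first.

```shell
# Hypothetical wrapper around option 3.
install_requirements() {
    requirements_file="${1:-requirements.txt}"
    # Skipping the uninstall step avoids the "distutils installed project"
    # error raised for ruamel-yaml.
    pip install --ignore-installed -r "$requirements_file"
}
```

In a Dockerfile this would correspond to changing the existing line to `RUN pip install --ignore-installed -r requirements.txt`.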

@jankrepl
Contributor

jankrepl commented Mar 3, 2021

As discussed, I got this error when using the most recent continuumio/miniconda3 with digest sha 7838d0ce65783b0d944c19d193e2e6232196bada9e5f3762dc7a9f07dc271179

Thanks, I now also see the error! This seems to be a known issue, and reading this SO post as well as this GH issue I think we only have 3 options:

  1. Manually downgrade pip to <10
  2. Manually rm -rf the ruamel install
  3. pip install with --ignore-installed

I think the third option seems the least ugly, I am now testing if it works fine.

Yes, I went for option 3 locally and it worked. Anyway, I think we should pin continuumio/miniconda3 to a specific tag in our Dockerfiles.
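
One possible way to pin the base image is by digest, which is stricter than the tag suggested above. The sketch below builds such a reference. The function name `pinned_image_ref` is hypothetical, and the digest from this thread appears in the usage note only to show the syntax (it is the image that triggered the error, not a recommended pin).

```shell
# Hypothetical helper: build an "image@sha256:<digest>" reference.
pinned_image_ref() {
    image="$1"
    digest="$2"
    # A sha256 digest is 64 lowercase hexadecimal characters.
    if ! printf '%s' "$digest" | grep -Eq '^[0-9a-f]{64}$'; then
        echo "invalid sha256 digest: $digest" >&2
        return 1
    fi
    printf '%s@sha256:%s\n' "$image" "$digest"
}
```

For instance, `docker pull "$(pinned_image_ref continuumio/miniconda3 7838d0ce65783b0d944c19d193e2e6232196bada9e5f3762dc7a9f07dc271179)"` pulls exactly that image, and the same reference can be used in a Dockerfile `FROM` line. Pinning by tag instead would use the form `continuumio/miniconda3:<tag>`.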

@FrancescoCasalegno
Contributor Author

@jankrepl and @pafonta: I always did dvc pull + dvc repro -f rather than the dvc repro --pull that you mentioned in your comments, so I never noticed that the latter doesn't work. But you are definitely right, thanks!

It seems that this is a known issue that should be fixed in the future. I updated the instructions to suggest a testing workflow that should run smoothly using dvc pull + dvc repro -f.

Contributor

@pafonta pafonta left a comment

Thank you for the work!

I have followed the new testing instructions. I was able to reproduce the expected results (i.e. no output for git diff after dvc repro -f).

I am not approving yet because I have the two questions and the suggestion below.

data_and_models/pipelines/ner/Dockerfile (outdated, resolved)
data_and_models/pipelines/sentence_embedding/Dockerfile (outdated, resolved)
docs/source/whatsnew.rst (outdated, resolved)
Contributor

@jankrepl jankrepl left a comment

LGTM! Thank you for the work!

Contributor

@pafonta pafonta left a comment

LGTM! Thanks!

3 participants