Skip to content
This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

[BBS-266] Adapt Getting Started for DVC remote from Zenodo #267

Merged
merged 4 commits into from
Mar 10, 2021
Merged

Conversation

pafonta
Copy link
Contributor

@pafonta pafonta commented Mar 5, 2021

Fixes BBS-266.

TODO

Currently blocked by BBS-306.

Otherwise, after the tests are successful:

  • upload the TAR to Zenodo,
  • put the link in the instructions,
  • PR checklist below.

Description

Adapt the Getting Started from the README to use the DVC remote from Zenodo.

How to test?

1 - Getting Started

Follow the Getting Stared from the README after

git clone --depth 1 --branch bbs_266 https://github.com/BlueBrain/Search.git

The notebook at the end should be usable for search and mining.

2 - DVC

a) Setup

wget <fixme>
tar xf bbs_dvc_remote.tar.gz

git clone --depth 1 --branch v0.1.0 https://github.com/BlueBrain/Search.git

cd Search/

cp /path/to/prodigy/wheel .

dvc remote add --default local ../bbs_dvc_remote

b) Pulling

dvc pull

should retrieve all DVC tracked data and models successfully.

c) pipelines/sentence_embedding

After

docker build -f data_and_models/pipelines/sentence_embedding/Dockerfile -t <image_name> .
docker run -v /raid:/raid -it --rm --name <container_name> <image_name>
cd data_and_models/pipelines/sentence_embedding
dvc repro -f

the command

git diff

should show no change.

d) pipelines/ner

After

docker build -f data_and_models/pipelines/ner/Dockerfile -t <image_name> .
docker run -v /raid:/raid -it --rm --name <container_name> <image_name>
cd data_and_models/pipelines/ner
dvc repro -f

the command

git diff

should show no change but it will show these changes

--- a/data_and_models/pipelines/ner/dvc.lock
+++ b/data_and_models/pipelines/ner/dvc.lock
@@ -130,7 +130,7 @@ eval_disease:
   - path: ../../models/ner_er/model1
-    md5: 5757d1ea583e2ac6aa7e642e11b56325.dir
+    md5: a51a1dc7148255d3fb0e1f5b05079ec0.dir
     size: 100183520
     nfiles: 18
@@ -398,7 +398,7 @@ add_er_1:
   - path: ../../models/ner_er/model1
-    md5: 5757d1ea583e2ac6aa7e642e11b56325.dir
+    md5: a51a1dc7148255d3fb0e1f5b05079ec0.dir
     size: 100183520
     nfiles: 18

This issue isn't introduced by the current PR. This is already the situation on master.

Indeed, the issue is semi-reproducible (see #267 (comment)) outside this PR like this

git clone --depth 1 --branch v0.1.0 https://github.com/BlueBrain/Search.git
cd Search/
cp /path/to/prodigy/wheel .
docker build -f data_and_models/pipelines/ner/Dockerfile -t <image_name> .
docker run -v /raid:/raid -it --rm --name <container_name> <image_name>
dvc remote modify gpfs_ssh ask_password true
dvc remote modify gpfs_ssh user <user_name>
dvc pull
cd data_and_models/pipelines/ner
dvc repro -f
git diff dvc.lock

Checklist

  • This PR refers to an issue from the issue tracker.
  • Documentation and whatsnew.rst updated.
  • All CI tests pass.

@pafonta pafonta added documentation Improvements or additions to documentation new feature New feature or request 🦉 dvc Add DVC data or models labels Mar 5, 2021
@pafonta pafonta marked this pull request as ready for review March 8, 2021 17:00
@FrancescoCasalegno
Copy link
Contributor

This issue isn't introduced by the current PR. This is already the situation on master.
Indeed, the issue is reproducible outside this PR like this

For the record, the issue mentioned here is semi-reproducible, in the sense that it is due to the non-deterministic behavior of a component of SpaCy which uses set. @jankrepl will fix this in #268 .

Copy link
Contributor

@FrancescoCasalegno FrancescoCasalegno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @pafonta , this was really a great work!

Copy link
Contributor

@EmilieDel EmilieDel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice job! 🎉 Thanks a lot for doing this Zenodo thing @pafonta! I just did a small comment but can be ignored!

Comment on lines +225 to +226
Third, keep track of the path to the working directory, the repository
directory, and the data and models directory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I could be better to say keep track of the paths of the working directory, the repository directory, and the data and models directory. However, it is really a minor comment and I am not confident about what I am suggesting!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence is coming from #229 (see here lines 220-221).

@Stannislav was the author of this PR. Maybe he knows about what is valid English?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you have a look at this @Stannislav?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me either way sounds fine :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets keep it that way :)

Comment on lines +21 to +24
# Commented to allow the use of a DVC remote from a different type than SSH.
# ssh_check
# Not usable in README as it works only when inside the `bbs_` containers.
# dvc_pull_models
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please remind me how the server is going to find the NER models now?

Copy link
Contributor Author

@pafonta pafonta Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decision taken by the team yesterday to comment these lines puts the mining server in the same situation of the search server: dvc pull needs to have been run before. In the README, there is already a dvc pull happening before launching the mining server. Commenting the two lines doesn't therefore break something. For further discussions on the subject, there is the ticket BBS-306.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was trying to make sure I understand the setting of the path to the models correctly. The mining server finds the models by reading the BBS_DATA_AND_MODELS_DIR environment variable. Here's my current understanding of setting this variable, could you maybe please check if it's correct?

With dvc_pull_models the models would be pulled somewhere locally on the container, so that the value of this variable needs to be configured to point to a local path, e.g.:

BBS_DATA_AND_MODELS_DIR=/src/data_and_models

After commenting out dvc_pull_models the models will be provided externally. So we need to docker run with a mount of the volume where these models are, e.g. with the -v /raid:/raid flag, and then configure the environment variable to point to a location on that volume:

BBS_DATA_AND_MODELS_DIR=/raid/<...>/projects/search/data_and_models

Is this understanding correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Contributor

@jankrepl jankrepl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!!

Comment on lines +225 to +226
Third, keep track of the path to the working directory, the repository
directory, and the data and models directory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me either way sounds fine :)

@pafonta pafonta merged commit b6b0448 into master Mar 10, 2021
@pafonta pafonta deleted the bbs_266 branch March 10, 2021 09:49
@pafonta pafonta self-assigned this Mar 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
documentation Improvements or additions to documentation 🦉 dvc Add DVC data or models new feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants