Thank you for the work!
What about removing the `data_and_models` segment in the model ID pattern? I.e. going from
`data_and_models/models/ner_er/model<number>`
to
`models/ner_er/model<number>`
Indeed, with the new `DATA_DIR` environment variable, everything will be relative to `DATA_DIR`.
Besides, this would remove the need for the extra `model_path` column. This way, for a given model, `model_path = DATA_DIR + model_id`.
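A minimal sketch of the rule proposed above, assuming `DATA_DIR` points to the `data_and_models` directory and that model IDs use the shortened `models/ner_er/model<number>` pattern. The helper name `resolve_model_path` and the example ID `model1` are hypothetical and not taken from the codebase.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_model_path(model_id: str, data_dir: Optional[str] = None) -> Path:
    """Join the data directory and a model ID (hypothetical helper)."""
    # When no directory is passed, fall back to the DATA_DIR environment
    # variable. os.environ[...] raises KeyError if the variable is unset.
    if data_dir is None:
        data_dir = os.environ["DATA_DIR"]
    return Path(data_dir) / model_id


# Example using the value that the Dockerfile in this PR assigns to DATA_DIR.
print(resolve_model_path("models/ner_er/model1", "/src/data_and_models"))
```

With this rule the path is always derived from the ID, so a separate `model_path` column would carry no extra information.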
LGTM, really nice!
@@ -22,6 +22,7 @@ USER root
# Install the app
ADD . /src
WORKDIR /src
ENV DATA_DIR="/src/data_and_models"
What about defining this variable in the `.env` file instead?
This could also be useful if in the future we decide to move `data_and_models` to another repo.
Also, as mentioned in the checklist of this PR, it would be good to mention this variable somewhere in the docs or readme.
Hey, yes, that's an option. The logic here was that the previous `ADD . /src` command already hard-codes a part of the path, so I wasn't sure how to reconcile this with the `.env` file. Once the data is independent this would of course make a lot more sense.
Should we maybe wait until the point where the data is separate and then implement this?
> The logic here was that the previous `ADD . /src` command already hard-codes a part of the path

Good observation. I guess it's fine as it is now for the moment!
Hey. I think there are two points here. Removing the … The reason why …
I see. As of now …
Hey all, based on your feedback and after thinking about the changes again, I made some modifications:

Don't look for the environment variable. Instead, introduce a new parameter `data_and_models_dir` that replaces the environment variable. Hopefully this makes it more transparent. The caller of `load_ee_models_library()` now has to decide where to get the value of this parameter from.

In our codebase we have two places where `load_ee_models_library()` is called:
1. The mining server. It gets the value from the environment variable `BBS_DATA_AND_MODELS_DIR`, which can also be defined in the `.env` file.
2. The create mining cache entrypoint. It has a new parameter `--data-and-models-dir` which should be used to pass the value. If left blank, the environment variable `BBS_DATA_AND_MODELS_DIR` is checked, and if it is not defined either, an error is thrown.

What do you think?
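The resolution order described for the create mining cache entrypoint could look roughly like the sketch below. Only the flag `--data-and-models-dir` and the variable `BBS_DATA_AND_MODELS_DIR` come from the comment above. The function name `get_data_and_models_dir` and the use of `RuntimeError` are assumptions made for illustration; the actual entrypoint may be structured differently and raise a different exception.

```python
import argparse
import os
from typing import Optional, Sequence

ENV_VAR = "BBS_DATA_AND_MODELS_DIR"


def get_data_and_models_dir(argv: Optional[Sequence[str]] = None) -> str:
    """Resolve the directory from the CLI flag, then from the environment.

    Hypothetical helper, not the actual entrypoint code.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-and-models-dir", default=None)
    args = parser.parse_args(argv)

    # 1. An explicit command-line value takes precedence.
    if args.data_and_models_dir:
        return args.data_and_models_dir

    # 2. Otherwise fall back to the environment variable.
    value = os.environ.get(ENV_VAR)
    if value:
        return value

    # 3. Neither source is available: fail loudly instead of guessing a path.
    raise RuntimeError(
        f"Pass --data-and-models-dir or set the {ENV_VAR} environment variable."
    )


# The resolved value would then be handed to the loader as its
# data_and_models_dir parameter.
print(get_data_and_models_dir(["--data-and-models-dir", "/src/data_and_models"]))
```

Keeping the lookup in the entrypoint means the loader itself only depends on its argument, which is the transparency the comment is aiming for.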
Main idea
The main change that led to all the other changes is this:
- `utils.get_root_path()`
- `utils.DVC` (which contained `DVC.load_ee_models_library()`)
- `utils.load_ee_models_library()`

This new `utils.load_ee_models_library()` will look for the environment variable named `DATA_DIR`, which should be pointing to the `data_and_models` folder. Using `DATA_DIR` it will find and load the `ee_models_library.csv` file and pre-process it in the following way:
Before:
After:
then returns it as a pandas data frame.
Still to do
- `utils.load_ee_models_library()`
- `BBS_DATA_AND_MODELS_DIR` environment variable (CLI help message, docs)

To do after merging
- `bbsearch.utils.MissingEnvironmentVariable` in `bbsearch.entrypoint._helper.get_var()`
- `model_id`