Skip to content
This repository has been archived by the owner on Jan 29, 2024. It is now read-only.

[BBS-272] Eliminate get_root_path() #229

Merged
merged 10 commits into from
Feb 9, 2021
Merged

[BBS-272] Eliminate get_root_path() #229

merged 10 commits into from
Feb 9, 2021

Conversation

Stannislav
Copy link
Contributor

@Stannislav Stannislav commented Feb 1, 2021

Main idea

The main change that led to all the other changes is this:

  • Remove utils.get_root_path()
  • Remove utils.DVC (which contained DVC.load_ee_models_library())
  • Introduce utils.load_ee_models_library()

This new utils.load_ee_models_library() will look for the environment variable named DATA_DIR, which should be pointing to the data_and_models folder. Using DATA_DIR it will find and load the ee_models_library.csv file and pre-process it in the following way:

Before:

entity_type model entity_type_name
CHEMICAL model1 CHEBI
ORGANISM model1 TAXON

After:

entity_type entity_type_name model_path model_id
CHEMICAL CHEBI DATA_DIR/models/ner_er/model1 data_and_models/models/ner_er/model1
ORGANISM TAXON DATA_DIR/models/ner_er/model1 data_and_models/models/ner_er/model1

then returns it as a pandas data frame.

Still to do

  • test utils.load_ee_models_library()
  • document setting of the BBS_DATA_AND_MODELS_DIR environment variable (CLI help message, docs)

To do after merging

  • raise bbsearch.utils.MissingEnvironmentVariable in bbsearch.entrypoint._helper.get_var()
  • think about improving model_id
  • simplify the mining server constructor: BBS-280

@Stannislav Stannislav requested review from FrancescoCasalegno and pafonta and removed request for FrancescoCasalegno February 2, 2021 08:30
Copy link
Contributor

@pafonta pafonta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the work!

What about removing the data_and_models segment in the model ID pattern?

i.e. going from

data_and_models/models/ner_er/model<number>

to

models/ner_er/model<number>

Indeed, with the new DATA_DIR environment variable, everything will be relative to DATA_DIR.

Besides, this would remove the need for the extra model_path column. This way, for a given model, model_path = DATA_DIR + model_id.

Copy link
Contributor

@FrancescoCasalegno FrancescoCasalegno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, really nice!

@@ -22,6 +22,7 @@ USER root
# Install the app
ADD . /src
WORKDIR /src
ENV DATA_DIR="/src/data_and_models"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about defining this variable in the .env file instead?
This could also be useful if in the future we decide to move data_and_models to another repo.

Also, as mentioned in the checklist of this PR, it would be good to mention this variable somewhere in the docs or readme.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, yes, that's an option. The logic here was that the previous ADD . /src command arleady hard-codes a part of the path, so I wasn't sure how to reconcile this with the .env file. Once the data is indepent this would of course make a lot more sense.

Should we maybe wait until the point where the data is separate and then implement this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here was that the previous ADD . /src command arleady hard-codes a part of the path

Good observation. I guess it's fine as it is now for the moment!

@Stannislav
Copy link
Contributor Author

Thank you for the work!

What about removing the data_and_models segment in the model ID pattern?

i.e. going from

data_and_models/models/ner_er/model<number>

to

models/ner_er/model<number>

Indeed, with the new DATA_DIR environment variable, everything will be relative to DATA_DIR.

Besides, this would remove the need for the extra model_path column. This way, for a given model, model_path = DATA_DIR + model_id.

Hey. I think there are two points here.

Removing the model_path column. I think this would be totally the opposite to what this PR is trying to achieve. The idea is to say that model_path shouldn't have anything to do with model_id. This is how we got into the whole trouble - the path changes and suddenly nothing works because we set model_path = model_id.

The reason why model_id still looks like a path should be considered as for legacy reasons and for the sake of incremetal change away from using a path as ID. This leads to the second point whether or not to include data_and_models in model_id. In my view both versions are ugly anyway since model ID shouldn't be a hard-coded path anyway. So removing data_and_models wouldn't give any real benefit, but more work like adjusting all entries in the MySQL mining cache table and re-testing everything again.

@pafonta
Copy link
Contributor

pafonta commented Feb 8, 2021

I see.

As of now model_path = <repo dir> + model_id, I thought there was a reason to introduce in this PR the model_path as a column. I guess it would have made more sense to me to introduce such column in the PR changing the way our models are identified. Indeed, it felt like the current PR tries also to define what model_id should be.

Instead introduce a new parameter `data_and_models_dir` that
replaces the environment variable. Hopefully this makes it
more transparent. The callee of `load_ee_models_library()` now
has to decide where to get the value of this parameter from.

In our codebase we have two places where `load_ee_models_library()`
is called:

1. The mining server. It gets the value from the environment variable
`BBS_DATA_AND_MODELS_DIR`, which can also be defined in the .env file.

2. The create mining cache entrypoint. It has a new parameter
`--data-and-models-dir` which should be used to pass value. If left
blank then the environment variable `BBS_DATA_AND_MODELS_DIR is
checked, and if it's not defined then an error will be thrown.
@Stannislav
Copy link
Contributor Author

Hey all, based on your feedback and after thinking about the changes again, I made some modifications:

Dont look for the environment variable DATA_DIR in the
load_ee_models_library() function. Instead introduce a
new parameter data_and_models_dir that
replaces the environment variable. Hopefully this makes it
more transparent. The callee of load_ee_models_library() now
has to decide where to get the value of this parameter from.

In our codebase we have two places where load_ee_models_library()
is called:

  1. The mining server. It gets the value from the environment variable
    BBS_DATA_AND_MODELS_DIR, which can also be defined in the .env file.

  2. The create mining cache entrypoint. It has a new parameter
    --data-and-models-dir which should be used to define this value. If left
    blank then the environment variable `BBS_DATA_AND_MODELS_DIR is
    checked, and if it's not defined then an error will be thrown.

What do you think?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants