Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

usage in multi-user environments -> automated weights download #53

Open
alisterburt opened this issue Feb 8, 2024 · 20 comments
Open

usage in multi-user environments -> automated weights download #53

alisterburt opened this issue Feb 8, 2024 · 20 comments

Comments

@alisterburt
Copy link
Contributor

I'm trying to install membrain-seg in a multi-user environment where not having the weights download be automated means having to tell each user to download the checkpoint, send it to the cluster before they can use the program - not ideal

I'll outline my suggested implementation below - feedback welcome!

Zenodo recently added support for multiple curators so hosting the data on zenodo will work well.

  1. I would upload an entry for the weights with your own account then invite my account (alisterburt) to also manage the entry.
  2. Add code for automatically downloading the weights using pooch

https://github.com/teamtomo/fidder/blob/3ac64b13db256b598bf8951eb9a66137c04b86b6/src/fidder/model/_pretrained_weights.py#L6-L16

  1. make sure pooch is added as a dependency to pyproject.toml

@LorenzLamm what do you think?

@LorenzLamm
Copy link
Collaborator

Absolutely agreed! I'll look into it today.
Thanks a lot for your implementation suggestion. @rdrighetto and I also looked into some options other than Zenodo, but couldn't find a more comfortable solution.
So let's go for it :)

Sorry for this feature taking so long.

@rdrighetto
Copy link
Contributor

A couple comments on this:

  1. The problem with a multi-user environment like this seems more that there is no shared location where one person could upload the weights and everybody can read it from there? This is how our lab users membrain-seg on our cluster. This is not a membrain-seg issue IMO.

Having said that:

  1. Given the weights are heavy (~700 MB) I would never have them downloaded automatically without explicit consent from the user.

Having said that [2]:

  1. I totally agree it's great and scientifically a good practice to share the weights on Zenodo and include a feature in membrain-seg to fetch it from there :-)

@LorenzLamm
Copy link
Collaborator

Yes, I agree. It should not be downloaded automatically.
I guess it's best to add the downloading functionality to the CLI?
Maybe we could also ask the user whether an automatic download should be performed in case no path to a checkpoint is provided for segmentation.

@rdrighetto
Copy link
Contributor

Yes, I think a CLI option to download the weights (with an optional path argument) is the way to go.

And the check in case a checkpoint has not been provided is nice to have, but not essential IMO.

@LorenzLamm
Copy link
Collaborator

MemBrain-seg models now available on Zenodo: https://zenodo.org/records/10633840

However, I was not able to add you as data curator, @alisterburt . As far as I understand, this feature works only for Zenodo communities, where multiple people can take various roles.
Submissions can then be added to this community and community data curators can edit the submissions associated to the community.

So maybe it makes sense if you create a teamtomo community on Zenodo, and we add the MemBrain-seg weights submission to this community?

@alisterburt
Copy link
Contributor Author

Is it possible to have the weights available by default rather than needing their path to be specified every time if they are in a shared location with the proposed setup? This would be ideal - otherwise agree with what you've all said!

We could also have a prompt at the CLI which asks if you want to download/where you want to download to and this could run by default on the first run of the program, rather than a separate download program?

@alisterburt
Copy link
Contributor Author

Basically, can we persistently set the default location for the checkpoint from the download code

@LorenzLamm
Copy link
Collaborator

Yes, would be cool if the checkpoint argument doesn't need to be passed every time membrain-seg is running.

Would it make sense to have a config file with the location to the checkpoint file?
If the config file is empty, we can ask whether the automatic model should be run and then add the download path to the config file.

Or is there a more elegant way to do this?

@alisterburt
Copy link
Contributor Author

I don't love config files for such a simple application, they introduce extra state and an extra step to running the program

Thanks for looking at zenodo I'll look into a community today :-)

@rdrighetto
Copy link
Contributor

rdrighetto commented Feb 8, 2024

Agree with @alisterburt that config files should be avoided, especially for simple things like this. My suggestion is to specify the checkpoint path through an environment variable like MEMBRAIN_SEG_MODEL (the approach I took for the Scipion plugin).
In a multi-user setup this environment variable can be set when loading the MemBrain-seg module, for example. I don't know if it's possible to set arbitrary environment variables when activating a conda environment, but that would be an alternative as well. In any case, I wouldn't worry so much about how this environment variable is set because it's something highly dependent on the computational environment.

If the environment variable is empty/unset then the model must be specified via the CLI argument (as is currently).
If the model is neither provided by the environment variable nor by the CLI, then we ask if it should be downloaded.

Alternatively, we don't ask anything but just add an extra option to membrain segment for downloading the model.

@alisterburt
Copy link
Contributor Author

@rdrighetto MEMBRAIN_SEG_WEIGHTS works nicely!

You can set arbitrary environment variables when activating a conda environment in a unix system by adding shell scripts to the activate.d and deactivate.d directories inside <env_path>/etc/conda

(warp_build) [burta2@ec-hub1-sc1 ~]$ ls /home/burta2/mambaforge/envs/warp_build/etc/conda/
activate.d/   deactivate.d/

I would still add a code path for being prompted to download the weights to a user-space cache if they are not provided at the CLI or in the environment variable.

@alisterburt
Copy link
Contributor Author

okay: to summarise

Ideal version of the future

  • model weights can be points at from an environment variable called MEMBRAIN_SEG_WEIGHTS
  • if a checkpoint file is provided at the CLI it will override the env variable
  • if no checkpoint file is provided at the CLI and no env variable is set then the cache is checked with pooch, if the file is not present the user is prompted to download the weights to the user space cache

sound good?!

@alisterburt
Copy link
Contributor Author

I created a "community" on zenodo - could I get your zenodo usernames so we can all curate the record? I need to figure out how to add existing records to a community too :-)

@LorenzLamm
Copy link
Collaborator

Great, thanks for creating the community! My username is Lorenz_Lamm.

As far as I understand anyone can apply for adding their existing Zenodo entries to a community, so existing records shouldn't be a problem :)

@LorenzLamm
Copy link
Collaborator

okay: to summarise

Ideal version of the future

  • model weights can be points at from an environment variable called MEMBRAIN_SEG_WEIGHTS
  • if a checkpoint file is provided at the CLI it will override the env variable
  • if no checkpoint file is provided at the CLI and no env variable is set then the cache is checked with pooch, if the file is not present the user is prompted to download the weights to the user space cache

sound good?!

I like the setting of environment variables, but is there a clever way to edit the activation scripts? (I guess also depends on conda vs virtualenv vs ...). So probably the user would need to do that manually, right? I feel this makes the threshold to use it a bit higher.

Would something like python-dotenv (https://pypi.org/project/python-dotenv/) be an option? We could create an .env file (I know, it's basically a config file :D), but fill it automatically when running membrain-seg and downloading a model to a specified location. This could reduce user input and make it maybe easier to use.

@rdrighetto
Copy link
Contributor

@alisterburt:

okay: to summarise

Ideal version of the future

* model weights can be points at from an environment variable called `MEMBRAIN_SEG_WEIGHTS`

* if a checkpoint file is provided at the CLI it will override the env variable

* if no checkpoint file is provided at the CLI and no env variable is set then the cache is checked with pooch, if the file is not present the user is prompted to download the weights to the user space cache

sound good?!

Sounds great to me!

@LorenzLamm:

I like the setting of environment variables, but is there a clever way to edit the activation scripts? (I guess also depends on conda vs virtualenv vs ...). So probably the user would need to do that manually, right? I feel this makes the threshold to use it a bit higher.

I agree that python-dotenv could be a good solution for going this way, but you have to think how to ensure that in a multi-user environment like @alisterburt describes the weights are downloaded only once and the same environment remains available to everyone. This would be achieved by ensuring the path where the model is downloaded is visible to all users, but how will you know what this path is? For a single user it sounds perfectly fine, if that's what you are thinking.
Adding/editing conda activation scripts does add more complexity and it should not be required, but we could still provide instructions for users that want to go this way. I'm afraid we are trying to guess too much about everyone's computational setup, and there's no solution that is gonna work for everyone. That's why I ultimately don't care how this environment variable will be set. For a starter I would just implement the ability to read the weights from MEMBRAIN_SEG_WEIGHTS.

Most importantly, I think the current way of using membrain-seg (explicit CLI option with the checkpoint) should remain unchanged to minimize disruption to user's workflows.

@alisterburt
Copy link
Contributor Author

@rdrighetto do you have a zenodo account?
@LorenzLamm you should have an invite to the zenodo community

I'm with @rdrighetto here, the people setting up the installation in the multi-user environment will be comfortable with setting an environment variable in the activation script. For single-user situations, the weights downloaded/cached with pooch will work transparently. python-dotenv looks cool but the introduction of extra state to manage feels a little unnecessary in this case and the dotenv looks to be user specific by default so not quite right for a multi-user install solution :-)

@LorenzLamm
Copy link
Collaborator

Okay, sounds good. Then let's add that as some advanced installation instruction somewhere, but keep the basic functionalities as they are. :)

@LorenzLamm
Copy link
Collaborator

Thanks for adding me to the Zenodo community. I just accepted the invitation, and submitted the previous upload to the community.
I cannot accept at the moment, though. Maybe it takes a while until my membership status is updated?

@alisterburt
Copy link
Contributor Author

Yeah probably takes some time to sync, I accepted the entry in any case :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants