Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create data_fetch and set_metadata celery tasks #13655

Merged
merged 62 commits into from
Jun 7, 2022

Conversation

mvdbeek
Copy link
Member

@mvdbeek mvdbeek commented Apr 1, 2022

TODO:

  • Make upload job cancelable
  • Separate queue for upload and metadata tasks
  • Gravity update to start workers for new queues

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these contributions under Galaxy's current license.
  • I agree to allow the Galaxy committers to license these and all my past contributions to the core galaxy codebase under the MIT license. If this condition is an issue, uncheck and just let us know why with an e-mail to [email protected].

@mvdbeek
Copy link
Member Author

mvdbeek commented Apr 26, 2022

  • Make upload job cancelable

That didn't seem very hard at first glance, but it looks like this is going to be a bit more complicated than I had anticipated. Once the task is actually running it is pretty hard to cancel reliably (that's true for celery and dask). We could:

  • pass a state callback that we poll at given intervals or chunks
    • that is also not exactly ideal (what if the chunks come in very slowly ?)
  • Use a ProcessPool
    • This could be canceled cleanly, but probably involves significant overhead
  • Use asyncio libraries for long-running stuff (httpx for a requests functionality ?)
    • Canceling should work fine that way

I don't really love that the state callback approach means we'd have to identify potentially long running things and pass the callback around. The asyncio approach sounds really cool but might also be a bigger project. I think I'm going to try using a ProcessPool for the worker and see how well that would work.

@mvdbeek
Copy link
Member Author

mvdbeek commented Apr 26, 2022

I think I'm going to try using a ProcessPool for the worker and see how well that would work.

Failed attempt in mvdbeek@cd8da77 ... this fails because we can't (easily ....) pickle the complex arguments that our tasks receive. I tried cloudpickle which did a little better with the magic_partial functions, but we still hold on to a lot of unpicklable things, so I don't think this is going to be a good solution.

I'll see if we can use asgiref's sync_to_async and async_to_sync to simulate canceling a work thread.

@mvdbeek
Copy link
Member Author

mvdbeek commented Apr 27, 2022

Hmm, converting sync functions to coroutines and then canceling the coroutine also fails as soon as we're blocking on a synchronous function. Failed attempt in 2cd208e :(.

This could be made to work if we convert the download and file handling to use async libraries, but that will fail at the latest when we get to the file sources, which don't support async operations at all and there's no plan to do this upstream, currently.

@mvdbeek mvdbeek force-pushed the metadata_embed branch 2 times, most recently from 6a7a385 to 67c5438 Compare April 27, 2022 17:08
@mvdbeek
Copy link
Member Author

mvdbeek commented Apr 27, 2022

a723b35 and 67c5438 are the best I can come up with to cancel running tasks. It's not exactly what I had hoped for, but I suppose this could be worth a try. It'll still need a bit of cleanup, and we should cancel the downstream tasks cleanly.

@mvdbeek
Copy link
Member Author

mvdbeek commented Apr 28, 2022

Another idea worth trying to bring down the pickling / process startup overhead is to spawn processes in forkserver mode. https://gist.github.com/mvdbeek/caa94edff5776ad078826376e9400c99 is how that could look like. Celery workers could have a pool of the size that is equivalent to their concurrency setting ready, with a datatype registry and an initialized model "ready to go".

@mvdbeek mvdbeek force-pushed the metadata_embed branch 4 times, most recently from c97e37f to a389fe6 Compare April 30, 2022 10:35
lib/galaxy/model/metadata.py Outdated Show resolved Hide resolved
@dannon dannon merged commit 4644922 into galaxyproject:dev Jun 7, 2022
@github-actions
Copy link

github-actions bot commented Jun 7, 2022

This PR was merged without a "kind/" label, please correct.

@nsoranzo nsoranzo deleted the metadata_embed branch June 7, 2022 14:28
@mvdbeek mvdbeek added the highlight/dev Included in admin/dev release notes label Aug 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants