Begin to generalize integration tests to other queue systems #160
base: develop
Conversation
Force-pushed from 78f6054 to 35d0e40.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff            @@
##           develop     #160   +/-   ##
========================================
  Coverage    47.92%   47.92%
========================================
  Files           43       43
  Lines         5156     5156
  Branches      1118     1118
========================================
  Hits          2471     2471
  Misses        2424     2424
  Partials       261      261
Hi @ml-evs, thanks for setting this up. Since the dependence on the queue system is only in the submission and checking of the jobs, I wonder whether it is worth running all the tests for all the systems.
Hi @gpetretto, sorry I missed this. Happy to have this migrated wherever you see fit, though I think having e2e tests for jobflow-remote is still appropriate here. A status update on this PR in particular: SGE is proving pretty nasty to set up; I am attempting to follow the few guides I can find online, and sadly the main …
Indeed, it might be good to have a few different queue managers to test, especially if it is confirmed that, for example, SGE does not support querying by a list of ids.
Force-pushed from 0e54d6b to ae798fd.
Force-pushed from 7577c6c to 4062fdf.
Force-pushed from e311fb0 to b1ee827.
Force-pushed from 4c7c7ea to e3cd1a3.
Hi @gpetretto, I think this is now 99% of the way there... at least locally, I can now SSH into the SGE container and run jobs. Any remaining issues might be related to @QuantumChemist's qtoolkit PR, but it is hard to say at this point.

This was really painful to implement, but I think the pain was specific to SGE itself. The main issue was needing to reverse engineer how to set up user permissions for the SGE queue (see cc613ba) without being able to use the bundled …

Locally, I see the jobs being submitted and the usual JFR processes "working", but for some reason no output is being written or copied back to the runner at the moment.

As a general point, we might consider entirely splitting the integration tests and unit tests into separate jobs. I'm not sure how much the integration tests really contribute to the codecov numbers, which is the only real reason to run them together, and since the integration tests are slow (both building the Docker containers and actually running them), it's probably beneficial to split them and figure out a way to combine the coverage stats down the line (e.g., each job could create a GH Actions artifact containing its coverage data, then a separate job could combine them and upload to codecov).
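As a rough illustration of the combining step, here is a minimal sketch assuming coverage.py is what produces the data, and assuming the two jobs upload artifacts named ".coverage.unit" and ".coverage.integration" (those file names are made up for this example):

```python
# Hedged sketch: merge two coverage.py data files produced by separate CI jobs
# into a single report for upload. The artifact file names are assumptions.
from coverage import Coverage

cov = Coverage(data_file=".coverage.combined")
cov.combine([".coverage.unit", ".coverage.integration"])  # read and merge the data files
cov.save()
cov.xml_report(outfile="coverage.xml")  # single XML report that codecov can ingest
```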
Thanks @ml-evs for all the work! The PR looks good to me.
As for the coverage, indeed most of it will come from the standard tests, but there are definitely some functionalities that are only exercised in the integration tests (e.g. the batch submission). If the merging of the coverage files is done as in #180, it could be fine to run the tests separately.
Following the discussion in #201, should we also add Python 3.12 here as part of the GitHub CI testing workflow?
Hi @gpetretto, the remaining issue I'm facing here seems to be that jobs are not retried with SGE in these tests. The test for whether the job has outputs triggers before the output has been written, though if I stick in a breakpoint and inspect the container, I can definitely see the output. For the simple add job, the job document in the database ends up as:

{'_id': ObjectId('6738c69ccd47701991707182'),
 'job': {'@module': 'jobflow.core.job', '@class': 'Job', '@version': '0.1.18',
         'function': {'@module': 'jobflow_remote.testing', '@callable': 'add', '@bound': None},
         'function_args': [1, 5], 'function_kwargs': {}, 'output_schema': None,
         'uuid': 'b4be80b1-e272-4694-a2fc-ba7f697e06bc', 'index': 1, 'name': 'add', 'metadata': {},
         'config': {'@module': 'jobflow.core.job', '@class': 'JobConfig', '@version': '0.1.18',
                    'resolve_references': True, 'on_missing_references': 'error', 'manager_config': {},
                    'expose_store': False, 'pass_manager_config': True, 'response_manager_config': {}},
         'hosts': ['5aa4a8ef-79d1-42b6-822e-647ccd8e5f0f'],
         'metadata_updates': [], 'config_updates': [], 'name_updates': []},
 'uuid': 'b4be80b1-e272-4694-a2fc-ba7f697e06bc', 'index': 1, 'db_id': '1',
 'worker': 'test_remote_sge_worker', 'state': 'REMOTE_ERROR',
 'remote': {'step_attempts': 0, 'queue_state': None, 'process_id': '1', 'retry_time_limit': None,
            'error': 'Remote error: file /home/jobflow/jfr/b4/be/80/b4be80b1-e272-4694-a2fc-ba7f697e06bc_1/jfremote_out.json for job b4be80b1-e272-4694-a2fc-ba7f697e06bc does not exist'},
 'parents': [], 'previous_state': 'TERMINATED', 'error': None, 'lock_id': None, 'lock_time': None,
 'run_dir': '/home/jobflow/jfr/b4/be/80/b4be80b1-e272-4694-a2fc-ba7f697e06bc_1',
 'start_time': None, 'end_time': None,
 'created_on': datetime.datetime(2024, 11, 16, 16, 21, 48, 423000),
 'updated_on': datetime.datetime(2024, 11, 16, 16, 21, 50, 502000),
 'priority': 0, 'exec_config': None, 'resources': None, 'stored_data': None}

I've tried resetting all of the …
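For reference, this is roughly how I have been pulling out documents in this state; the connection URI and the database/collection names here are placeholders for illustration, not the ones configured by the test suite:

```python
# Hedged sketch: list jobs stuck in REMOTE_ERROR and show their retry counters.
# Host, database, and collection names are hypothetical placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
jobs = client["jobflow_remote_tests"]["jobs"]

for doc in jobs.find({"state": "REMOTE_ERROR"}):
    remote = doc.get("remote", {})
    print(doc["db_id"], doc["job"]["name"], remote.get("step_attempts"), remote.get("error"))
```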
Hi @ml-evs, thanks for all the effort spent on this. I also wanted to test it locally to investigate this further, but in my case the job fails before even running: it cannot submit the SGE job to the queue. I also tried to connect to the container and submit a job manually, but it fails with the same error: …
Did this ever happen to you?
Hi @ml-evs,

Not sure if this could/should be added to the Dockerfile for everybody. For some reason the Slurm container was already coming up as amd64, maybe because of some settings in the nathanhess/slurm image?

Then I did some tests and indeed the problem lies in the qtk implementation, or at least in combination with this version of SGE. The problem seems to be that the … I then managed to run the tests by setting …

I have a few more notes/questions related to the containers. Just starting the containers takes ~10 minutes every time on my laptop, and it does not seem to benefit much from caching. Again, I don't know whether this is because of the required emulation, but it might be a problem when developing, as running the integration tests actually helps to intercept certain kinds of failures, and waiting ~15-20 minutes for them to complete would be quite annoying. Also, every execution leaves behind a number of dangling images and volumes that I have to remember to clean up; at some point I reached the maximum available space and started getting different errors. Does this also happen to you, or is there some configuration I can set to improve speed, caching and cleanup?
Thanks a lot for investigating! I guess the Slurm image is only available on amd64, so it was just defaulting to that, but the base Ubuntu image can be used with Apple silicon. Happy to add that platform line (will do it now).
It's more likely just how this SGE instance has been configured, rather than the version. You really have to set everything up manually, and I have no idea where I would control the default qstat output. If the user workaround is robust, then we should just use that, I think.
There might be some tweaking here regarding the internal polling rate of SGE -- I've tried to do this in 73d2ce6, let's see how it affects the test times.
Running the tests "from scratch" for me takes around 10 minutes without any cache, with the container builds taking a couple of minutes (similar to the "build test bench job" in the CI). The caching should be fairly reasonable -- I'm using a separate build stage for jobflow and the queues, so once the queues are built they should be essentially frozen. You can at least select either sge or slurm for the testing part (…).
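To give an idea of the kind of selection I mean, a conftest-style fixture restricting a run to a single queue system could look something like the sketch below; the environment variable and fixture names are invented for this example and are not the actual test-suite interface:

```python
# Hedged sketch: parametrize integration tests over the available queue systems,
# optionally restricting to a single one via an (assumed) environment variable.
import os

import pytest

SUPPORTED_QUEUES = ("slurm", "sge")


@pytest.fixture(scope="session", params=SUPPORTED_QUEUES)
def queue_system(request):
    selected = os.environ.get("JFR_TEST_QUEUE")  # hypothetical variable name
    if selected and request.param != selected:
        pytest.skip(f"only running {selected} integration tests")
    return request.param
```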
Could you try building the images natively outside of Python? I know people have performance issues with Docker on Mac (which is why things like https://orbstack.dev/ exist), but the docker-python connection is generally also not ideal (it cannot use BuildKit features, for example), so we might be able to improve that by just manually running docker as subcommands (recommended in docker/docker-py#2230). Here are my timings for the relevant commands:

$ time docker buildx bake --no-cache -f tests/integration/dockerfiles/docker-bake.hcl
docker buildx bake --no-cache -f tests/integration/dockerfiles/docker-bake.hc  0.97s user 0.62s system 1% cpu 2:24.77 total

# then, edit a jobflow src file (add a new line) in a minor way and run with cache to mimic development -- should show only jobflow being reinstalled
$ time docker buildx bake -f tests/integration/dockerfiles/docker-bake.hcl
docker buildx bake -f tests/integration/dockerfiles/docker-bake.hcl  0.15s user 0.09s system 1% cpu 21.058 total
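If we did move away from docker-py, the simplest option from Python would be to shell out to the CLI so that BuildKit and the bake file keep working; a rough sketch:

```python
# Hedged sketch: drive "docker buildx bake" through subprocess instead of docker-py,
# mirroring the command timed above.
import subprocess

subprocess.run(
    ["docker", "buildx", "bake", "-f", "tests/integration/dockerfiles/docker-bake.hcl"],
    check=True,  # raise if the build fails
)
```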
The teardown methods of the tests should remove the containers, but your cache may continue to grow if you are making lots of edits to the queues themselves. I do tend to let my Docker cache get quite large on my local machine (a few hundred GB), but that's across all my projects.
It's not exactly fast for me, but I haven't seen these issues with space. One way we could consider getting around this is pushing the final built containers to ghcr.io (and labelling them as such), so that most of the time is just spent downloading them. In this way, qtoolkit could also just pull the containers and write custom tests against them.
I think we could also consider only running the integration tests for the minimum Python version, and perhaps running the tests for each queue in parallel to speed things up.
Hey @gpetretto and @ml-evs. Please let me know if there is anything I can also help you with, like making fixes to the SGE qtk implementation or doing more testing with my colleague!
Thanks @ml-evs for the reply. I tried building the containers without Python; while it is slower than for you, it is indeed way faster than usual, especially the cached version:

$ time docker buildx bake --no-cache -f tests/integration/dockerfiles/docker-bake.hcl
docker buildx bake --no-cache -f tests/integration/dockerfiles/docker-bake.hc  1,79s user 1,83s system 1% cpu 3:50,59 total

$ time docker buildx bake -f tests/integration/dockerfiles/docker-bake.hcl
docker buildx bake -f tests/integration/dockerfiles/docker-bake.hcl  0,26s user 0,35s system 2% cpu 24,922 total

After this I tried running the tests, and it ended up building everything again, also taking almost 10 minutes. I have seen in the issue that you linked that there is a library wrapping the docker CLI, https://github.com/gabrieldemarmiesse/python-on-whales. Maybe I should try whether switching to that avoids the issues that I encountered?
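For what it's worth, I imagine the python-on-whales call would look roughly like the following; I haven't verified the exact signature, so treat it as a sketch rather than a working example:

```python
# Hedged sketch: build the test images via python-on-whales, which wraps the
# docker CLI (so BuildKit/bake should be available). Exact arguments unverified.
from python_on_whales import docker

docker.buildx.bake(files=["tests/integration/dockerfiles/docker-bake.hcl"])
```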
Hmm okay, I think with mine it at least uses the cache in Python too. Can we get this PR passing and merged first, and then try to split it up and optimise it?
Force-pushed from 578da9a to 242148b.
Pending e.g. #159, I have started to generalise the integration tests so that we can make a mega combined Dockerfile that runs, e.g., Slurm, SGE and potentially other queueing systems in the same container for testing purposes. This PR is the first step in that direction. We could probably also test remote shell execution first. Depending on how awkward it is to set up multiple queues together, it might be that each one ends up running in a different container.