memory issue with pangolin version 4 and pangoLEARN #395

Open
rrdavis77 opened this issue Apr 1, 2022 · 17 comments

Comments

@rrdavis77

I updated to version 4 today and ran a test query on ~50 samples. It worked as expected in UShER mode. However, in pangoLEARN mode (--analysis-mode fast), I am running out of memory. I only have 12GB of RAM on this test system, but I can run pangolin version 3.* with no problems on the same system.
Have the memory requirements changed for version 4?

Thanks!

@kapsakcj

kapsakcj commented Apr 2, 2022

Just noticed this myself too. The error message doesn't say so explicitly, but memory usage swelled when I tested pangolearn mode with the supplied test sequences: pangolin /pangolin/test/test_seqs.fasta --analysis-mode pangolearn -o test_seqs-output-plearn

Logs for failure here:
https://github.com/StaPH-B/docker-builds/runs/5796292215?check_suite_focus=true

@fanninpm

fanninpm commented Apr 2, 2022

Note to self: this is a perfect candidate for profiling with Scalene.

@aineniamh
Member

The changes to pangoLEARN in pangolin 4 include a shift to a random forest model by @emilyscher. This model is more robust to missing data and homoplasies and is less overfit than the decision tree model, so it is a definite welcome shift.

I'll add a warning about memory usage for the user, as 12GB is a lot. I had been under the impression that it uses 5GB of RAM. That said, I was struggling with GitHub Actions on Ubuntu for most of Thursday, and I believe their max is 7GB, so this might check out. Unfortunately my fix wasn't an actual solution: I just removed the GitHub Actions test for Ubuntu, since the macOS test (which is allocated a max of 14GB) ran fine.

I'll add that warning about RAM, but is this something that will need resolving, or is a warning enough?

@rrdavis77
Author

I am not sure how many users process their data on systems with <16GB RAM, but that was possible with pangolin v3. A warning would be great. An optimization would be nice for users with fewer resources at hand, but perhaps that is a very small subset of users?

@tseemann

tseemann commented Apr 3, 2022

Thanks for posting, this solved the mystery of the crashes I saw when I increased parallel -j XX for pangolearn.
I finally checked the system logs and saw it was being killed by the kernel OOM killer:

[Sun Apr  3 11:18:43 2022] Out of memory: Killed process 3307066 (python) total-vm:31142324kB, anon-rss:10222624kB, file-rss:0kB, shmem-rss:0kB, UID:1424802263 pgtables:21788kB oom_score_adj:0
[Sun Apr  3 11:18:44 2022] oom_reaper: reaped process 3307066 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I was trying to run 64 pangolins at once on 200,000 FASTA split into 64 chunks, so each one was trying to use 30GB RAM at least.

This only happens in pangolearn mode, not usher.
It didn't happen on v3, where I could run 256 in parallel.
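
For readers comparing their own logs, the following is a small sketch (not part of pangolin) that converts the size fields of the OOM-killer line quoted above into GiB. It relies on two facts about the kernel log: the "kB" values are units of 1024 bytes, and total-vm is the virtual address space while anon-rss is the resident anonymous memory. The function name `oom_fields_gib` is made up for this illustration.

```python
import re

# Kernel log line copied verbatim from the comment above.
OOM_LINE = (
    "[Sun Apr  3 11:18:43 2022] Out of memory: Killed process 3307066 (python) "
    "total-vm:31142324kB, anon-rss:10222624kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:1424802263 pgtables:21788kB oom_score_adj:0"
)


def oom_fields_gib(line: str) -> dict[str, float]:
    """Return every '<name>:<number>kB' field of an OOM-killer line in GiB.

    Hypothetical helper for illustration; fields without a kB suffix
    (UID, oom_score_adj) are skipped.
    """
    return {
        name: int(kib) / 1024**2  # kernel "kB" means KiB, so KiB -> GiB
        for name, kib in re.findall(r"([\w-]+):(\d+)kB", line)
    }


fields = oom_fields_gib(OOM_LINE)
for name, gib in fields.items():
    print(f"{name:>10}: {gib:6.2f} GiB")
```

On this line the virtual size works out to about 29.7 GiB and the resident anonymous memory to about 9.75 GiB, so the roughly 30GB figure corresponds to total-vm rather than to resident RAM at the moment the process was killed.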

@aineniamh
Member

Thanks for flagging this. 30GB of RAM sounds like a lot more than we had expected. The model itself shouldn't be that large; it could be an issue of holding lots of sequences in memory at once.
I guess one solution could be to offer these options:

  • usher
  • pangolearn-rf (the new random forest option)
  • pangolearn-dt (the original decision tree option available in v3)

The RF has been shown to be a more robust model, but because it's a bunch of decision trees together it's going to be more memory intensive. I was told it would need 5GB of RAM, but it seems that isn't the case! @emilyscher might have the profiling results from before, but if not I can run some tests and see which step is taking so much memory.

I know @rmcolq was working on parallelising the model (on this branch) so that chunking up a massive fasta file wouldn't be necessary (and then only one decompressed model would be held in memory at a time), but I can't remember whether it was finished. I'll revisit that branch this week too, as that might be a good compromise if the model file is going to be very large.
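
The "one model in memory, many sequences" idea can be sketched roughly as follows. This is not the code on that branch: `blocks`, `assign_in_blocks` and `toy_predict` are hypothetical names, and the toy predictor stands in for a loaded classifier. The point is only that a single loaded model is reused across fixed-size blocks, so peak memory is one model plus one block rather than one model per chunked input file.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def blocks(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def assign_in_blocks(
    predict: Callable[[List[str]], List[str]],
    sequences: Iterable[str],
    block_size: int,
) -> Iterator[str]:
    """Run one already-loaded model over a stream of sequences, block by block."""
    for block in blocks(sequences, block_size):
        yield from predict(block)


def toy_predict(block: List[str]) -> List[str]:
    """Hypothetical stand-in for a loaded model: labels each sequence by length."""
    return [f"len{len(seq)}" for seq in block]


labels = list(
    assign_in_blocks(toy_predict, ["ACGT", "AC", "ACG", "A", "ACGTA"], block_size=2)
)
print(labels)
```

Because `sequences` can be a generator, the input file never needs to be held in memory as a whole; only the current block is.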

@rgerhards

As a reference point, the German RKI DESH [1] sequences currently require "a bit" over 8GiB of main memory (I could go up to 10 on one VM, and then it finished).

[1] https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland

@kapsakcj

kapsakcj commented Apr 7, 2022

I've found that a single sample with pangolin v4.0.2 requires roughly 11.5 GB RAM. Anything less will run OOM.

Though it would be more thorough if the memory usage were profiled as @fanninpm suggested!
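
As a lighter-weight alternative to a full profiler, peak memory of a command can be read back from the operating system after it exits. The sketch below uses only the Python standard library (`resource` and `subprocess`, Unix only) and is not Scalene; `peak_child_rss_mib` is a hypothetical helper name. Two caveats: `RUSAGE_CHILDREN` reports the largest single child process that has been waited for, not the sum over a process tree, and the value is a high-water mark over all children run so far by the calling process.

```python
import resource
import subprocess
import sys


def peak_child_rss_mib(cmd: list[str]) -> float:
    """Run `cmd` to completion and return the peak RSS of child processes in MiB.

    Hypothetical helper for illustration. The figure is the largest resident
    set size among children this process has waited for so far.
    """
    subprocess.run(cmd, check=False)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # Linux reports ru_maxrss in KiB, macOS reports it in bytes.
    divisor = 1024**2 if sys.platform == "darwin" else 1024
    return peak / divisor


# Demo with a trivial child process so the example runs anywhere on Unix.
demo_peak = peak_child_rss_mib([sys.executable, "-c", "pass"])
print(f"peak child RSS: {demo_peak:.1f} MiB")
```

To measure pangolin itself, the demo command could be swapped for an invocation like the ones in this thread, for example `["pangolin", "test_seqs.fasta", "--analysis-mode", "pangolearn"]`, assuming pangolin is on the PATH.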

@rrdavis77
Author

rrdavis77 commented Apr 7, 2022 via email

@rrdavis77
Author

Wondering if anyone is seeing an increase in memory usage with the latest pangolin-data version 1.8? My jobs are failing when requesting 16GB. I did not have that issue with pangolin-data version 1.6. Thanks!

@kapsakcj

kapsakcj commented May 2, 2022

@rrdavis77 Same here. I need roughly 16-17GB of RAM with --analysis-mode pangolearn and pangolin-data 1.8.

@kapsakcj

kapsakcj commented May 2, 2022

16GB wasn't enough, but 16.5GB was. I'm guessing the minimum is between 16 and 16.5GB?

# limiting docker container to 16GB memory
$ docker run --rm -ti -m 16000000000 -v $PWD:/data staphb/pangolin:4.0.6-pdata-1.8 pangolin EPI_ISL_6825395-B.1.1.529-omicron.fasta --analysis-mode pangolearn -o test-16GB-memory.csv
****
Pangolin running in pangolearn mode.
****
Warning: pangoLEARN mode may use a significant amount of RAM, be aware that it will not suit every system.
Maximum ambiguity allowed is 0.3.
****
Query file:     /data/EPI_ISL_6825395-B.1.1.529-omicron.fasta
****
Data files found:
plearn_model:   /opt/conda/envs/pangolin/lib/python3.8/site-packages/pangolin_data/data/randomForest_v1.joblib
plearn_header:  /opt/conda/envs/pangolin/lib/python3.8/site-packages/pangolin_data/data/randomForestHeaders_v1.joblib
****
Job stats:
job                      count    min threads    max threads
---------------------  -------  -------------  -------------
align_to_reference           1              1              1
all                          1              1              1
cache_sequence_assign        1              1              1
create_seq_hash              1              1              1
get_constellations           1              1              1
merged_info                  1              1              1
scorpio                      1              1              1
sequence_qc                  1              1              1
total                        8              1              1

****
Query sequences collapsed from 1 to 1 unique sequences.
****
1 sequences assigned via designations.
****
Running sequence QC
Total passing QC: 1
Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
all                      1              1              1
pangolearn               1              1              1
pangolearn_output        1              1              1
total                    3              1              1

Running pangoLEARN assignment
Loading model 05/02/2022, 21:40:25
Killed
Exiting because a job execution failed. Look above for error message

# now with 16.5 GB RAM
$ docker run --rm -ti -m 16500000000 -v $PWD:/data staphb/pangolin:4.0.6-pdata-1.8 pangolin EPI_ISL_6825395-B.1.1.529-omicron.fasta --analysis-mode pangolearn -o test-16.5GB-memory.csv
****
Pangolin running in pangolearn mode.
****
Warning: pangoLEARN mode may use a significant amount of RAM, be aware that it will not suit every system.
Maximum ambiguity allowed is 0.3.
****
Query file:     /data/EPI_ISL_6825395-B.1.1.529-omicron.fasta
****
Data files found:
plearn_model:   /opt/conda/envs/pangolin/lib/python3.8/site-packages/pangolin_data/data/randomForest_v1.joblib
plearn_header:  /opt/conda/envs/pangolin/lib/python3.8/site-packages/pangolin_data/data/randomForestHeaders_v1.joblib
****
Job stats:
job                      count    min threads    max threads
---------------------  -------  -------------  -------------
align_to_reference           1              1              1
all                          1              1              1
cache_sequence_assign        1              1              1
create_seq_hash              1              1              1
get_constellations           1              1              1
merged_info                  1              1              1
scorpio                      1              1              1
sequence_qc                  1              1              1
total                        8              1              1

****
Query sequences collapsed from 1 to 1 unique sequences.
****
1 sequences assigned via designations.
****
Running sequence QC
Total passing QC: 1
Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
all                      1              1              1
pangolearn               1              1              1
pangolearn_output        1              1              1
total                    3              1              1

Running pangoLEARN assignment
Loading model 05/02/2022, 21:41:33
Finished loading model 05/02/2022, 21:41:59
Processing block of 1 sequences 05/02/2022, 21:41:59
Complete 05/02/2022, 21:42:00
****
Output file written to: /data/test-16.5GB-memory.csv/lineage_report.csv
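
A note on units when comparing these numbers with other reports in the thread: a plain integer passed to docker's -m flag is interpreted as bytes, so the two limits above are decimal gigabytes. The short calculation below (the helper name `bytes_to_units` is made up for this illustration) shows what they correspond to in GiB, the unit used by tools that count in powers of two.

```python
def bytes_to_units(n_bytes: int) -> dict[str, float]:
    """Express a byte count in decimal GB (10**9) and binary GiB (2**30)."""
    return {"GB": n_bytes / 10**9, "GiB": n_bytes / 2**30}


# The two limits used in the docker commands above.
limits = {n: bytes_to_units(n) for n in (16_000_000_000, 16_500_000_000)}
for n_bytes, units in limits.items():
    print(f"-m {n_bytes}: {units['GB']:.1f} GB = {units['GiB']:.2f} GiB")
```

So the failing run had about 14.90 GiB available and the passing run about 15.37 GiB.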

@rrdavis77
Author

With the latest pangolin-data version 1.9, my jobs are failing when requesting 20GB. The RAM requirements seem to grow after each pangolin-data update :(

@kapsakcj

kapsakcj commented Jun 3, 2022

^ Can confirm: in my tests with pangolin-data v1.9, pangolearn mode required approximately 19GB of RAM to run to completion. It failed when 18GB of RAM was allocated.

# killed/OOM/failed
docker run --rm -ti -m 18000000000 -v $PWD:/data staphb/pangolin:4.0.6-pdata-1.9 pangolin EPI_ISL_6825395-B.1.1.529-omicron.fasta --analysis-mode pangolearn -o test-18GB-memory.csv

# passed
docker run --rm -ti -m 19000000000 -v $PWD:/data staphb/pangolin:4.0.6-pdata-1.9 pangolin EPI_ISL_6825395-B.1.1.529-omicron.fasta --analysis-mode pangolearn -o test-19GB-memory.csv
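
The fail/pass pairs in this thread amount to a manual bisection over the memory limit, which could be automated along these lines. This is a sketch under assumptions: `min_passing_limit` is a hypothetical helper, memory use is assumed to be deterministic and monotonic (any limit above a passing one also passes), and the `fake_run` predicate with its 18.4 GB threshold is invented purely so the example runs. In practice the predicate would wrap a docker run with the given -m value and check the exit status.

```python
from typing import Callable, Tuple


def min_passing_limit(
    passes: Callable[[int], bool], fail: int, ok: int, tol: int
) -> Tuple[int, int]:
    """Narrow a (failing, passing) pair of byte limits until they differ by <= tol.

    `fail` must be a limit known to fail and `ok` a limit known to pass;
    neither endpoint is re-run. Assumes passes() is monotonic in the limit.
    """
    if not fail < ok:
        raise ValueError("fail must be smaller than ok")
    while ok - fail > tol:
        mid = (fail + ok) // 2
        if passes(mid):
            ok = mid      # mid is enough memory, tighten the upper bound
        else:
            fail = mid    # mid is too little, raise the lower bound
    return fail, ok


def fake_run(limit_bytes: int) -> bool:
    """Invented stand-in for a real run: pretends 18.4 GB is the true minimum."""
    return limit_bytes >= 18_400_000_000


# Start from the pair reported above: 18 GB failed, 19 GB passed.
lo, hi = min_passing_limit(fake_run, 18_000_000_000, 19_000_000_000, tol=100_000_000)
print(f"minimum is between {lo / 1e9:.2f} and {hi / 1e9:.2f} GB")
```

Each halving costs one full pangolin run, so a coarse tolerance (here 0.1 GB) keeps the number of runs small.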

@aineniamh
Member

Thanks for the updated information. With so many categories and the random forest as it is, I'm not sure what I can do in the short term to resolve this.

We've discussed potentially splitting the model into hierarchical, variant-specific random forests, which should reduce the memory requirements, but will also slow it down. With UShER as the default inference now (and the developer of pangoLEARN in a new job) we may need to just have the warning in place for the time being.
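
To make the trade-off in that idea concrete, here is a toy sketch of a two-level classifier that keeps only one variant-specific sub-model in memory at a time. Everything in it is hypothetical (class name, router, loaders, labels); it is not pangoLEARN code and uses plain functions in place of real random forests. It illustrates why memory drops (one sub-model resident instead of one large model) and why speed can suffer (each switch between variants triggers a reload).

```python
from typing import Callable, Dict, Optional

Model = Callable[[str], str]  # maps a sequence to a label


class HierarchicalClassifier:
    """Route each sequence to a variant, then classify with that variant's model."""

    def __init__(self, router: Model, loaders: Dict[str, Callable[[], Model]]):
        self.router = router
        self.loaders = loaders
        self._loaded_name: Optional[str] = None
        self._loaded_model: Optional[Model] = None
        self.loads = 0  # counts how often a sub-model had to be (re)loaded

    def _get(self, name: str) -> Model:
        if name != self._loaded_name or self._loaded_model is None:
            # Replacing the reference lets the previous sub-model be freed.
            self._loaded_model = self.loaders[name]()
            self._loaded_name = name
            self.loads += 1
        return self._loaded_model

    def predict(self, seq: str) -> str:
        return self._get(self.router(seq))(seq)


# Toy router and sub-models, invented for the example.
def toy_router(seq: str) -> str:
    return "variant_a" if seq.startswith("A") else "variant_b"


toy_loaders = {
    "variant_a": lambda: (lambda seq: f"A.{len(seq)}"),
    "variant_b": lambda: (lambda seq: f"B.{len(seq)}"),
}

clf = HierarchicalClassifier(toy_router, toy_loaders)
results = [clf.predict(s) for s in ["ACG", "AC", "TTGA", "AAAA"]]
print(results, "sub-model loads:", clf.loads)
```

Grouping the input by routed variant before classification would reduce the number of reloads, at the cost of an extra pass over the data.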

If there are any machine-learning aficionados who see something that could help, I'm very happy to take PRs too!

@kapsakcj

kapsakcj commented Nov 9, 2022

Quick update: I've found it requires roughly 32 GB of RAM to use pangolearn mode with pangolin-data v1.16

I'm simply documenting as the models grow larger - it's not an issue for me since I use the usher analysis mode
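
The data points reported so far in this thread can be collected in one place, as in the sketch below. The figures are the approximate values quoted in the comments above, not new measurements, and they are not strictly comparable: two come from docker -m limits in decimal GB, the others are described only as "roughly". The derived rates are therefore only indicative.

```python
from datetime import date

# (date of comment, pangolin-data version or None if not stated, approx. GB needed)
observations = [
    (date(2022, 4, 7), None, 11.5),    # pangolin v4.0.2, data version not stated
    (date(2022, 5, 2), "1.8", 16.5),
    (date(2022, 6, 3), "1.9", 19.0),
    (date(2022, 11, 9), "1.16", 32.0),
]

deltas_gb = []
gaps_days = []
for (d0, _, gb0), (d1, version, gb1) in zip(observations, observations[1:]):
    deltas_gb.append(gb1 - gb0)
    gaps_days.append((d1 - d0).days)
    rate = (gb1 - gb0) / (d1 - d0).days * 30  # GB per 30 days
    print(f"up to pangolin-data {version}: +{gb1 - gb0:.1f} GB "
          f"over {(d1 - d0).days} days (~{rate:.1f} GB per 30 days)")
```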

@aineniamh
Member

Thanks! It's good to keep track of this. We may need to make a decision about this approach soon, because at the training end we're close to maxing out the server's RAM.

Thanks for keeping us updated though!
