Speed up inStrain compare by bypassing the extensive pandas.DataFrame subset extraction processes #181
Greetings!
I've made some optimizations and fixes that I believe will significantly improve performance and clarify error messaging. Here's a summary of the changes:
TLDR
- Added a new function `inStrain.compare_utils.hash_SNP_table` that pre-processes SNP tables by scaffold, significantly reducing redundant calculations.
- Reduced the number of calls to `inStrain.compare_utils.subset_SNP_table` in the comparison process from `<number of scaffolds> * <number of samples>` times to just once per sample, drastically reducing overall processing time.

These changes have been tested thoroughly and have shown significant improvements in both efficiency and usability. I believe they will enhance the overall user experience and streamline the analysis process.
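As a back-of-the-envelope illustration of where the savings come from (a hypothetical loop skeleton, not the actual compare code), the expensive subsetting moves out of the inner per-scaffold loop:

```python
# Hypothetical skeleton of the compare loop; counts mirror the one-genome
# case described below (69 scaffolds, 49 samples).
n_samples, n_scaffolds = 49, 69

calls_before = 0
for sample in range(n_samples):
    for scaffold in range(n_scaffolds):
        calls_before += 1  # subset_SNP_table was called here: 49 * 69 = 3381 calls

calls_after = 0
for sample in range(n_samples):
    calls_after += 1       # hash_SNP_table is called once per sample: 49 calls
    for scaffold in range(n_scaffolds):
        pass               # per-scaffold table is now a cheap dictionary lookup

print(calls_before, calls_after)  # 3381 49
```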
In detail
I've recently been comparing 49 deep-sequencing metagenomes (~100 GB per sample), and the compare process was stuck seemingly forever at `Step 2 running group 1 of 220`. Your suggestion in #148 of comparing one genome at a time is a good start and really did the trick (~3 hours per genome).
But after I added more debug logging to the source code, the finding was very surprising: in the one-genome case it took just 30 minutes to load each profile, but another 40 minutes passed before the parallelized comparison workers actually started running (note the time gap between the last `cumulative_snv_table` message and the first `WorkerLog`).
After further digging, it turns out that the function `inStrain.compare_utils.subset_SNP_table` is called for every scaffold * sample pair. In my case that is 69 * 49 = 3381 calls, and the number scales up very quickly with more scaffolds and samples. Although each call takes less than 1 second, the total running time can easily be orders of magnitude higher than all the other steps.

To overcome this, I added a new function `inStrain.compare_utils.hash_SNP_table` that splits each SNP table by scaffold and stores the pieces in a dictionary. It only needs to be called once per sample; for the rest of the loop, the per-scaffold SNP tables are simply fetched from the dictionary, which costs almost no time. After this small modification, comparing 155,750 scaffolds among 49 samples took only ~2 minutes between loading all the covTs and SNP tables and the compare workers actually starting.
I also made another small fix to the error messages of plotting functions 1 and 3. When running with `--database_mode`, they say:
But the actual reason is:
I welcome your feedback and suggestions for further improvements.
Thank you!
Best regards,
Jing