Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up inStrain compare by bypassing the extensive pandas.DataFrame subset extraction processes #181

Merged
merged 2 commits into from
May 8, 2024

Conversation

zhenjiaofenjie
Copy link

Greatings!

I've made some optimizations and fixes that I believe will significantly improve performance and clarify error messaging. Here's a summary of the changes:

TLDR

  • Efficiency improvements:
    • Introduced a new function inStrain.compare_utils.hash_SNP_table to pre-process SNP tables based on scaffolds, significantly reducing redundant calculations.
    • This change avoids calling inStrain.compare_utils.subset_SNP_table in the comparison process from <number of scaffolds> * <number of samples> times to just once per sample, resulting in a drastic reduction in overall processing time.
    • After implementing this modification, the preparation of workers for comparing 155,750 scaffolds among 49 samples now takes only 2 minutes after loading all necessary data.
  • Error message clarification:
    • Fixed error messages in plotting functions 1 and 3 to provide more accurate information when running with --database_mode or --skip_mm_profiling.
    • Previously, the error message incorrectly suggested running inStrain genome_wide when the actual reason was related to the mode in which the script was executed.

These changes have been tested thoroughly and have shown significant improvements in both efficiency and usability. I believe they will enhance the overall user experience and streamline the analysis process.

In detail

I'm recently comparing 49 deep-sequencing metagenomes (~100 GB per sample), the compare process takes almost forever at Step 2 running group 1 of 220. I think your suggestions of comparing one genome each time in #148 is a good start and really did the trick (~3 hours per genome).

But after I added more debug loggings in the source code, the finding is very surprising: In the one-genome case it took just 30 minutes to load each profile but took another 40 minutes before the parallelized comparing workers are actually running (note the time lap between the last cumulative_snv_table and the first WorkerLog).

24-05-03 15:47:26 DEBUG    Checkpoint Compare multiprocessing start 448057344
24-05-03 15:47:26 INFO     Running group 1 of 1
24-05-03 15:47:26 DEBUG    Loading KT_MT_tpm..FDZ127__total_bwa.bam
24-05-03 15:47:36 DEBUG    covT 10.300886
24-05-03 15:47:51 DEBUG    cumulative_snv_table 15.016690
24-05-03 15:47:51 DEBUG    Loading KT_MT_tpm..FDZ128__total_bwa.bam
24-05-03 15:47:58 DEBUG    covT 7.444016
24-05-03 15:48:19 DEBUG    cumulative_snv_table 20.384073
24-05-03 15:48:19 DEBUG    Loading KT_MT_tpm..FDZ129__total_bwa.bam
24-05-03 15:48:26 DEBUG    covT 7.212746
24-05-03 15:48:35 DEBUG    cumulative_snv_table 8.734889
...
24-05-03 16:14:07 DEBUG    Loading KT_MT_tpm..TY1-18__total_bwa.bam
24-05-03 16:14:17 DEBUG    covT 9.986339
24-05-03 16:14:32 DEBUG    cumulative_snv_table 15.182714
24-05-03 16:51:40 DEBUG
WorkerLog Compare FDZ134|k141_1831661 start 201064448 1714726293.2355466 1334316

After further digging, it turns out that the function inStrain.compare_utils.subset_SNP_table is called for every scaffold * sample pair. In my case, it's 69 * 49 = 3381 calls, and scales up very quickly if you have more scaffolds and samples. Although each call on this function takes less than 1 sec, the total running time can easily be magnitudes higher than all the other steps.

To overcome this, I added a new function inStrain.compare_utils.hash_SNP_table to split the SNP tables based on scaffolds and store them using dictionary. This function needs only to be called once per sample, and for the rest of the for loop you only need to get the subset SNP tables from the dictionary, which costs almost no time.

After this small modification, comparing 155750 scaffolds among 49 samples took only 2 min. after loading all covTs and SNP tables and before the compare worker is actually running.

24-05-04 21:37:58 DEBUG    Loading KT_MT_tpm..FDZ150__total_bwa.bam
24-05-04 21:38:10 DEBUG    covT 12.349708
24-05-04 21:38:39 DEBUG    cumulative_snv_table 28.679268
24-05-04 21:39:19 DEBUG    Load cache done.
24-05-04 21:41:28 DEBUG
WorkerLog Compare FDZ147|k141_383270 start 341340160 1714830028.4265416 1282417

I also did another small fix on the error message of plotting functions 1 and 3. When running with --dadabase_mode, they says:

"you don't have all required information. You need to run inStrain genome_wide first"

But the actual reason is:

"Plot cannot be created when run with --database_mode or --skip_mm_profiling".

I welcome your feedback and suggestions for further improvements.

Thank you!

Best regards,
Jing

@MrOlm
Copy link
Owner

MrOlm commented May 6, 2024

Hi @zhenjiaofenjie - Thanks a million for this! What a delightful surprise to find this weekend, and thank you very much for submitting this change as a pull request.

I ran a couple tests, though, and I'm hitting an issue when one of the profiles being compared doesn't have any called SNVs. Running the following command on the attached files should reveal the error:

inStrain compare -i /Users/mattolm/Programs/inStrain/test/test_data/Ecoli_ani.100.0.subset.sorted.bam.IS /Users/mattolm/Programs/inStrain/test/test_data/Ecoli_ani.99.9.subset.sorted.bam.IS -o /Users/mattolm/Programs/inStrain/test/test_backend/testdir/testR --store_mismatch_locations

I tried doing a simple modification (below), but then the resulting dataframe didn't find the SNV locations. When I run that command on the master branch, the resulting testR_pairwise_SNP_locations.tsv file should have 11 SNVs called.

I'm sorry I couldn't fix this error myself, but I'd love to merge this into the master if we can get this working.

Thanks,
Matt

    """
    Split SNP table according to scaffold
    """
    if len(db) == 0:
        print('empty')
        return {}

    db = db.sort_values(['scaffold', 'mm'])
    ukeys, index = np.unique(db['scaffold'], True)
    dbhash = {}
    for key, idx1, idx2 in zip(ukeys, index[:-1], index[1:]):
        dbhash[key] = db.iloc[idx1:idx2]

    return dbhash

share.zip

@zhenjiaofenjie
Copy link
Author

zhenjiaofenjie commented May 7, 2024

Hi Matt,

I'm terribly sorry about the stupid mistake in taking care of the dataframe index. It by accident skips the last scaffold in every SNP table, which is hard to notice when you have more than one scaffold to compare. Lucky you catch it in time because I've used the hacked version to run several jobs 😅
The correct version should be:

def hash_SNP_table(db):
    """
    Split SNP table according to scaffold
    """
    if len(db) == 0:
        print('empty')
        return {}

    db = db.sort_values(['scaffold', 'mm'])
    ukeys, index = np.unique(db['scaffold'], True)
    index = np.append(index, len(db))  # add the index of the last row in db
    dbhash = {}
    for key, idx1, idx2 in zip(ukeys, index[:-1], index[1:]):
        dbhash[key] = db.iloc[idx1:idx2]

    return dbhash

Let me know if this works.

Best,
Jing

@MrOlm MrOlm merged commit f0e23ee into MrOlm:master May 8, 2024
@MrOlm
Copy link
Owner

MrOlm commented May 8, 2024

Wonderful- this change addressed the issue. I've merged your changes in the master branch of inStrain, and assigned it version 1.9.0. Thanks again for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants