Speed up inStrain compare by bypassing the extensive pandas.DataFrame subset extraction processes #181
Greetings!
I've made some optimizations and fixes that I believe will significantly improve performance and clarify error messaging. Here's a summary of the changes:
TLDR
- Added a new function `inStrain.compare_utils.hash_SNP_table` that pre-processes SNP tables by scaffold, significantly reducing redundant calculations.
- Reduced the number of calls to `inStrain.compare_utils.subset_SNP_table` in the comparison process from `<number of scaffolds> * <number of samples>` times to just once per sample, drastically reducing overall processing time.

These changes have been tested thoroughly and have shown significant improvements in both efficiency and usability. I believe they will enhance the overall user experience and streamline the analysis process.
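As a back-of-the-envelope illustration of where the savings come from (a hypothetical loop skeleton, not the actual compare code), the expensive subsetting moves out of the inner per-scaffold loop:

```python
# Hypothetical skeleton of the compare loop; counts mirror the one-genome
# case described below (69 scaffolds, 49 samples).
n_samples, n_scaffolds = 49, 69

calls_before = 0
for sample in range(n_samples):
    for scaffold in range(n_scaffolds):
        calls_before += 1  # subset_SNP_table was called here: 49 * 69 = 3381 calls

calls_after = 0
for sample in range(n_samples):
    calls_after += 1       # hash_SNP_table is called once per sample: 49 calls
    for scaffold in range(n_scaffolds):
        pass               # per-scaffold table is now a cheap dictionary lookup

print(calls_before, calls_after)  # 3381 49
```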
In detail
I've recently been comparing 49 deep-sequencing metagenomes (~100 GB per sample), and the compare process was stuck seemingly forever at `Step 2 running group 1 of 220`. Your suggestion in #148 of comparing one genome at a time is a good start and really did the trick (~3 hours per genome).
But after I added more debug logging to the source code, the finding was very surprising: in the one-genome case it took just 30 minutes to load each profile, but another 40 minutes passed before the parallelized comparison workers actually started running (note the time gap between the last `cumulative_snv_table` message and the first `WorkerLog`).
After further digging, it turns out that the function `inStrain.compare_utils.subset_SNP_table` is called for every scaffold * sample pair. In my case that is 69 * 49 = 3381 calls, and the number scales up very quickly with more scaffolds and samples. Although each call takes less than 1 second, the total running time can easily be orders of magnitude higher than all the other steps.

To overcome this, I added a new function `inStrain.compare_utils.hash_SNP_table` that splits each SNP table by scaffold and stores the pieces in a dictionary. It only needs to be called once per sample; for the rest of the loop, the per-scaffold SNP tables are simply fetched from the dictionary, which costs almost no time. After this small modification, comparing 155,750 scaffolds among 49 samples took only ~2 minutes between loading all the covTs and SNP tables and the compare workers actually starting.
I also made another small fix to the error messages of plotting functions 1 and 3. When running with `--database_mode`, they say:
But the actual reason is:
I welcome your feedback and suggestions for further improvements.
Thank you!
Best regards,
Jing