Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diaPASEF first tests #123

Open
KlemensFroehlich opened this issue Feb 25, 2024 · 27 comments
Open

diaPASEF first tests #123

KlemensFroehlich opened this issue Feb 25, 2024 · 27 comments

Comments

@KlemensFroehlich
Copy link

hi Michael

Please forgive me for bringing this up again... In my defense: You said nothing is preventing users from converting .d folders to mzml and analyze diaPASEF data :)

I have tried a few things to see how different softwares currently handly diaPASEF. For this I generated a ~10ng, 5min human HEK active gradient diaPASEF run on a timsTOF Ultra.
file size as .d folder:
1.4GB

I have now tried 3 different ways to generate an mzml file:

  1. MSConvert, standard settings, so I think it came down to 4.5 million scan events which led to a final size of 25GB.
  2. MSConvert, Combine ion mobility scans 6.3GB
  3. timsconvert, standard settings 5.7GB

I searched the data with sage and for comparison I also included a 30min ddaPASEF run, which was directly analyzed from .d

image
image
image

the ddaPASEF seems to work nicely looking at the q value distribution of spectrum, peptide or protein level, but for all other searches the q value distribution looks odd to me.

I would be happy to share the data with you, or do more testing benchmarking. As I already mentioned there is currently no non-proprietary option to analyze diaPASEF data with large search spaces, so I am highly motivated to get this off the ground.

You also said earlier that it is not trivial to collapse / handle the ion mobility in tims data.
Would you be open to have a look at the msconvert ion mobility combine option and at the resulting data structure?
Maybe this can already solve the problem and sage "only" needs to be adapted to be able to handle this specific mzml ?

Oh and btw working with sage made me smile more than once! It is a lot of fun to work with a tool so fast! I still remember the days I had to run semitryptic searches in MaxQuant for a week xD

Best, Klemens

@lazear
Copy link
Owner

lazear commented Feb 26, 2024

I did say theoretically possible 😉 - there are clearly some practical issues.

I'm definitely interested in supporting diaPASEF. To be honest though, it will probably be ~2-3 months or so before I have time to really dig in and experiment (I have a lot going on until late spring!). It seems like there are a couple other people interested in perhaps trying it out too, and I am always happy to provide guidance on Sage internals - if there's a way to collapse into a single spectrum, plugging it into Sage should be straightforward.

If you're willing to share the files, I can download them and hack on them when I get a chance!

@KlemensFroehlich
Copy link
Author

KlemensFroehlich commented Feb 26, 2024

Thank you so much for even considering this!

please find all data (also a DIA-NN analysis directly from .d) here:
SwitchDrive

If supporting diaPASEF is a longterm possibility, may I ask what you are thinking about very roughly (I will not hold you to it I promise expect if you say that I can do something theoretically :P)?

More specifically, I was thinking about whether you would consider reaching out to Vadim Demichev to see whether it is possible to create the library with SAGE and then do the main DIA analysis with DIA-NN. It is not open-source though, so it might not be ideal. But looking at FragPipe, this could be a powerful combination of tools.

Again, if I can help in any way, please let me know! Especially, the files that I provided: If there is any other metric I should check or summarize that might be helpful, I would be happy to do so!

Best, Klemens

@prvst
Copy link

prvst commented Mar 4, 2024

Thank you so much for even considering this!

please find all data (also a DIA-NN analysis directly from .d) here: SwitchDrive

If supporting diaPASEF is a longterm possibility, may I ask what you are thinking about very roughly (I will not hold you to it I promise expect if you say that I can do something theoretically :P)?

More specifically, I was thinking about whether you would consider reaching out to Vadim Demichev to see whether it is possible to create the library with SAGE and then do the main DIA analysis with DIA-NN. It is not open-source though, so it might not be ideal. But looking at FragPipe, this could be a powerful combination of tools.

Again, if I can help in any way, please let me know! Especially, the files that I provided: If there is any other metric I should check or summarize that might be helpful, I would be happy to do so!

Best, Klemens

Hello @KlemensFroehlich and @lazear. I would like to point out that the FragPipe DIA workflow uses a program called easypqp (OpenMS) to build the library. The library is then passed down to DIANN to perform the quantification.

If you go to the easypqp GitHub page, you'll see that someone already made this request to them.

Glad to see that people share the same interests and strategies.

Best

@lazear
Copy link
Owner

lazear commented Mar 5, 2024

Yes, sage will write all matched fragments when using the --annotate-matches CLI flag, so it should be possible to plug into easypqp at some point, or other library building tools!

@RobbinBouwmeester
Copy link

RobbinBouwmeester commented Mar 27, 2024

@KlemensFroehlich @lazear , could this repo be useful? https://github.com/mafreitas/tdf2mzml, in the description they claim to have support for DIA bruker data (and based on a closed discussion/issue on that repo this concerns diaPASEF).

@jsnedeco
Copy link

To add to this: I tried out diaPASEF file to just see if I could use something besides DIA-NN for quantification (both for closed source reasons and also because it seems to take a very long time to generate a spectral library from a diaPASEF run).

I used tdf2mzml to generate an mzML file from a test bruker TimsTOF HT file and tested out the released version of Sage (0.14.7) with chimera, wide_window and report_psms to 5 (I can send the full config file if it helps). The top level identification stats look like this

[2024-07-18T19:37:40Z INFO sage] discovered 109719 target peptide-spectrum matches at 1% FDR
[2024-07-18T19:37:40Z INFO sage] discovered 21658 target peptides at 1% FDR
[2024-07-18T19:37:40Z INFO sage] discovered 4533 target proteins at 1% FDR

Not bad! Running on DIA-NN gives the following top line stats:

[34:35] Number of IDs at 0.01 FDR: 117535
[34:38] Calculating protein q-values
[34:38] Number of genes identified at 1% FDR: 8749 (precursor-level), 7925 (protein-level) (inference performed using proteotypic peptides only)

So it would be nice to boost the protein level identifications a touch, but it doesn't look bad as a first pass. It's also possible that DIA-NN is using different metrics for protein counts or something, I haven't investigated that yet.

Then I saw this PR (#140) and got extremely excited to try out the raw .d inputs to see if I could skip the conversion step altogether since it's pretty slow. I built a new version off of the master branch used the same parameters and...

[2024-07-18T20:34:23Z INFO sage] discovered 2776 target peptide-spectrum matches at 1% FDR
[2024-07-18T20:34:23Z INFO sage] discovered 593 target peptides at 1% FDR
[2024-07-18T20:34:23Z INFO sage] discovered 263 target proteins at 1% FDR

That seems quite a bit worse. I know that this is still very much an experimental use case, but I'm wondering if I'm doing something wrong?

@sander-willems-bruker
Copy link
Contributor

Hi @jsnedeco . This indeed is very experimental, @jspaezp is working on a PR that should boost these numbers quite a bit (MannLabs/timsrust#21). Without looking at the dia.d folder and your config, I can't tell if you are doing anything wrong of course, but I am happy to share my findings with the implementaiton you also have available (MannLabs/timsrust#15 (comment)). I used this config: MannLabs/timsrust#15 (comment).

@jsnedeco
Copy link

Hi @sander-willems-bruker! I used these parameters:
results_test.json

It looks pretty similar, except your report_psms is a lot higher. I probably can't share the specific .d file I was testing, but I'll see if I can get a representative sample that I can share.

@jspaezp
Copy link
Contributor

jspaezp commented Jul 19, 2024

As a reference ... #140 this PR added support for dia-pasef direct reading.

@jsnedeco (wont discuss diann, since it does a completely different thing) but from what I see in message there is something wrong ...
PROVIDED THE SAME PARAMETERS FOR SAGE ARE USED, tdf2mzml->sage and timsrust->sage should give the same result or at least VERY similar (which has been the case for my data) ... since both pieces of software IN ESSENCE do the same thing.

I have seen on occasion that mass calibration looks different between them, so try running it with +- 50 ppm on ms2 (its overkill but... who knows :P and if that works let us know, since it would point to an error in how the mz is calculated in timsrust).

LMK if you can send over a file with representative data that suffers from the same issue.

@jsnedeco
Copy link

jsnedeco commented Jul 19, 2024

DIANN was mostly intended as a rough benchmark, not as something to achieve complete parity with :)

PROVIDED THE SAME PARAMETERS FOR SAGE ARE USED, tdf2mzml->sage and timsrust->sage should give the same result or at least VERY similar (which has been the case for my data) ... since both pieces of software IN ESSENCE do the same thing.

I used the same parameters for both, but I did use the released version of sage for the mzml file and the unreleased version of sage for the .d analysis (because it needed to be.) I just tested again using the unreleased version for both with the same parameters I used originally and the results look the same:

tdf2mzml->sage
[2024-07-19T19:24:55Z INFO sage] discovered 109719 target peptide-spectrum matches at 1% FDR
[2024-07-19T19:24:55Z INFO sage] discovered 21658 target peptides at 1% FDR
[2024-07-19T19:24:55Z INFO sage] discovered 4533 target proteins at 1% FDR

timsrust->sage
[2024-07-18T20:34:23Z INFO sage] discovered 2776 target peptide-spectrum matches at 1% FDR
[2024-07-18T20:34:23Z INFO sage] discovered 593 target peptides at 1% FDR
[2024-07-18T20:34:23Z INFO sage] discovered 263 target proteins at 1% FDR

timsrust->sage +/- 50 ppm ms2
[2024-07-19T19:35:27Z INFO sage] discovered 1764 target peptide-spectrum matches at 1% FDR
[2024-07-19T19:35:27Z INFO sage] discovered 506 target peptides at 1% FDR
[2024-07-19T19:35:27Z INFO sage] discovered 247 target proteins at 1% FDR

In both these cases I'm using report_psms: 5, just so I'm not tweaking too many parameters.
I'll see if I can get an example that I can actually share.

@jspaezp
Copy link
Contributor

jspaezp commented Jul 20, 2024

All of my samples are 22 min either ddaPASEF/diaPASEF

timsrust dda -> discovered 10519 target peptides at 1% FDR
timsrust dia -> discovered 3951 target peptides at 1% FDR
tdf2mzml dda -> discovered 10498 target peptides at 1% FDR
tdf2mzml dia -> discovered 4530 target peptides at 1% FDR

    "./231121_RH30_NMIAA_E1_DDA.d"
[2024-07-20T16:09:54Z INFO  sage] - file IO:     1210 ms
[2024-07-20T16:10:09Z INFO  sage] - search:     14355 ms (3409 spectra/s)
[2024-07-20T16:10:09Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2024-07-20T16:10:09Z INFO  sage_core::ml::retention_alignment] aligned retention times across 1 files
[2024-07-20T16:10:09Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = 0.8335711996562211
[2024-07-20T16:10:09Z INFO  sage_core::ml::mobility_model] - fit mobility model, rsq = 0.9504371376944695, mse = 0.00047716065311882407
[2024-07-20T16:10:09Z INFO  sage] discovered 18122 target peptide-spectrum matches at 1% FDR
[2024-07-20T16:10:09Z INFO  sage] discovered 10519 target peptides at 1% FDR
[2024-07-20T16:10:09Z INFO  sage] discovered 2225 target proteins at 1% FDR
[2024-07-20T16:10:09Z INFO  sage] finished in 27s
[2024-07-20T16:10:09Z INFO  sage] cite: "Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale" https://doi.org/10.1021/acs.jproteome.3c00486

    "./231121_RH30_NMIAA_E1_DIA.d"
[2024-07-20T16:11:06Z INFO  sage] - file IO:    38320 ms
[2024-07-20T16:11:54Z INFO  sage] - search:     47667 ms (863 spectra/s)
[2024-07-20T16:11:54Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2024-07-20T16:11:54Z INFO  sage_core::ml::retention_alignment] aligned retention times across 1 files
[2024-07-20T16:11:54Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = 0.8302852190047076
[2024-07-20T16:11:54Z INFO  sage_core::ml::mobility_model] - fit mobility model, rsq = 0.9679125124575643, mse = 0.0007812790787113548
[2024-07-20T16:11:54Z INFO  sage] discovered 16037 target peptide-spectrum matches at 1% FDR
[2024-07-20T16:11:54Z INFO  sage] discovered 3951 target peptides at 1% FDR
[2024-07-20T16:11:54Z INFO  sage] discovered 1101 target proteins at 1% FDR
[2024-07-20T16:11:55Z INFO  sage] finished in 96s
[2024-07-20T16:11:55Z INFO  sage] cite: "Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale" https://doi.org/10.1021/acs.jproteome.3c00486

    "./231121_RH30_NMIAA_E1_DDA.mzml"
[2024-07-20T16:12:08Z INFO  sage] generated 319009266 fragments, 12240058 peptides in 7125ms
[2024-07-20T16:12:08Z INFO  sage] processing files 0 .. 1 
[2024-07-20T16:12:09Z INFO  sage] - file IO:     1570 ms
[2024-07-20T16:12:18Z INFO  sage] - search:      8046 ms (6081 spectra/s)
[2024-07-20T16:12:18Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2024-07-20T16:12:18Z INFO  sage_core::ml::retention_alignment] aligned retention times across 1 files
[2024-07-20T16:12:18Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = 0.8349065613194171
[2024-07-20T16:12:18Z INFO  sage_core::ml::mobility_model] - fit mobility model, rsq = 0.950576663864528, mse = 0.0005678266815400485
[2024-07-20T16:12:18Z INFO  sage] discovered 18242 target peptide-spectrum matches at 1% FDR
[2024-07-20T16:12:18Z INFO  sage] discovered 10498 target peptides at 1% FDR
[2024-07-20T16:12:18Z INFO  sage] discovered 2247 target proteins at 1% FDR
[2024-07-20T16:12:18Z WARN  sage] parquet output format is currently unstable! There may be failures or schema changes!
[2024-07-20T16:12:18Z INFO  sage] finished in 17s
[2024-07-20T16:12:18Z INFO  sage] cite: "Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale" https://doi.org/10.1021/acs.jproteome.3c00486

    "./231121_RH30_NMIAA_E1_DIA.mzml"
[2024-07-20T16:12:31Z INFO  sage] generated 319009266 fragments, 12240058 peptides in 7424ms
[2024-07-20T16:12:31Z INFO  sage] processing files 0 .. 1 
[2024-07-20T16:12:50Z INFO  sage] - file IO:    18955 ms
[2024-07-20T16:13:38Z INFO  sage] - search:     48018 ms (857 spectra/s)
[2024-07-20T16:13:39Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2024-07-20T16:13:39Z INFO  sage_core::ml::retention_alignment] aligned retention times across 1 files
[2024-07-20T16:13:39Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = 0.835649318485739
[2024-07-20T16:13:39Z INFO  sage_core::ml::mobility_model] - fit mobility model, rsq = NaN, mse = 0
[2024-07-20T16:13:39Z INFO  sage] discovered 18558 target peptide-spectrum matches at 1% FDR
[2024-07-20T16:13:39Z INFO  sage] discovered 4530 target peptides at 1% FDR
[2024-07-20T16:13:39Z INFO  sage] discovered 1236 target proteins at 1% FDR
[2024-07-20T16:13:39Z WARN  sage] parquet output format is currently unstable! There may be failures or schema changes!
[2024-07-20T16:13:39Z INFO  sage] finished in 75s
[2024-07-20T16:13:39Z INFO  sage] cite: "Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale" https://doi.org/10.1021/acs.jproteome.3c00486

my config:

{
    "database": {
        "bucket_size": 8192,
        "enzyme": {
            "missed_cleavages": 2,
            "min_len": 7,
            "max_len": 50,
            "cleave_at": "KR",
            "restrict": "P"
        },
        "fragment_min_mz": 150.0,
        "fragment_max_mz": 2000.0,
        "peptide_min_mass": 500.0,
        "peptide_max_mass": 5000.0,
        "ion_kinds": [
            "b",
            "y"
        ],
        "min_ion_index": 2,
        "max_variable_mods": 3,
        "static_mods": {
            "C": 57.0215
        },
        "variable_mods": {
            "M": [15.994],
            "C": [71.03711]
        },
        "decoy_tag": "rev_",
        "generate_decoys": true
    },
    "precursor_tol": {
        "ppm": [
            -20.0,
            20.0
        ]
    },
    "fragment_tol": {
        "ppm": [
            -20.0,
            20.0
        ]
    },
    "deisotope": true,
    "min_peaks": 15,
    "max_peaks": 250,
    "max_fragment_charge": 1,
    "min_matched_peaks": 4,
    "predict_rt": true,
    "wide_window": true,
    "chimera": false,
    "report_psms": 1
}

I wonder what other things are different from the bruker sdk and timsrust ... LMK if you can upload data anywhere that we could work on together. RN I am also failing to find a file in PRIDE that is not zipped in a stupid way (zipping all the .d of the repo in a single file ...)

Edit: Downloading these two guys for testing ...

wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2022/02/PXD028735/LFQ_timsTOFPro_PASEF_Ecoli_03.d.zip
wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2022/02/PXD028735/LFQ_timsTOFPro_diaPASEF_Ecoli_01.d.zip

@sander-willems-bruker
Copy link
Contributor

Hi @sander-willems-bruker! I used these parameters: results_test.json

It looks pretty similar, except your report_psms is a lot higher. I probably can't share the specific .d file I was testing, but I'll see if I can get a representative sample that I can share.

I'd definately increase frag_ppm (try 15ppm first, as @jspaezp mentions there are samples where you need to go to 50 due to calibration issues with timsrust), max_peaks (200+ for sure) and report_psms (20+ wont hurt in this case).

What might be going wrong on your particular search is that your mzml might be generated per scan, making them not too complex (meaning the max_peaks and report_psms indeed are not too influential for mzml), whereas timsrust projects the whole window (potentially 500+ scans aggregated, thuis far more peaks and far more complex).

@sander-willems-bruker
Copy link
Contributor

I wonder what other things are different from the bruker sdk and timsrust ... LMK if you can upload data anywhere that we could work on together. RN I am also failing to find a file in PRIDE that is not zipped in a stupid way (zipping all the .d of the repo in a single file ...)

The raw data is identical, as well as all info in the tdf. The primary things that are different are mass and im calibration and centroiding. Depending on sample, there might also be a difference in raw intensities, as timsrust does not yet take ICC into account by default (with the api you can correct for this though, this is just a simple correction factor applied per frame based on TIMS ramp times)

@jsnedeco
Copy link

jsnedeco commented Jul 22, 2024

Hi @sander-willems-bruker! I used these parameters: results_test.json
It looks pretty similar, except your report_psms is a lot higher. I probably can't share the specific .d file I was testing, but I'll see if I can get a representative sample that I can share.

I'd definately increase frag_ppm (try 15ppm first, as @jspaezp mentions there are samples where you need to go to 50 due to calibration issues with timsrust), config_test.json(200+ for sure) and report_psms (20+ wont hurt in this case).

What might be going wrong on your particular search is that your mzml might be generated per scan, making them not too complex (meaning the max_peaks and report_psms indeed are not too influential for mzml), whereas timsrust projects the whole window (potentially 500+ scans aggregated, thuis far more peaks and far more complex).

I tried again with the attached config file and the results looked like this

[2024-07-22T22:05:09Z INFO sage] discovered 1980 target peptide-spectrum matches at 1% FDR
[2024-07-22T22:05:09Z INFO sage] discovered 340 target peptides at 1% FDR
[2024-07-22T22:05:09Z INFO sage] discovered 167 target proteins at 1% FDR

So worse than all the previous tests.

Annoyingly, I tried it on another file that I can share and I'm not seeing the same results!
Test file 2

timsrust->sage
[2024-07-22T20:55:37Z INFO sage] discovered 101933 target peptide-spectrum matches at 1% FDR
[2024-07-22T20:55:37Z INFO sage] discovered 18568 target peptides at 1% FDR
[2024-07-22T20:55:37Z INFO sage] discovered 4091 target proteins at 1% FDR

tdf2mzml->sage
[2024-07-22T22:12:25Z INFO sage] discovered 107831 target peptide-spectrum matches at 1% FDR
[2024-07-22T22:12:25Z INFO sage] discovered 19217 target peptides at 1% FDR
[2024-07-22T22:12:25Z INFO sage] discovered 4223 target proteins at 1% FDR

I am going to dig into the fragment annotations on the first unshareable file and see if there are any patterns. If there's any interest, I can share the second test file but it's not showing the behavior so it probably won't be that helpful.

[edit]

For what it's worth, here are the .d and mzml files for Test file 2.

https://filebin.net/frqq1bks9pbtzm8s

And just for fun, I took a look at the fragment_ppm for the psm ids that were called both in the mzml file and the .d file:
Test File 1 (not shared)
image

Test File 2 (shared in filebin)
image

@jspaezp
Copy link
Contributor

jspaezp commented Jul 23, 2024

It's also worth mentioning ... Can you share the .tdf without the tdf_bin? It would give information on the instrument, methods and calibration but not the actual scans (technically it also includes the TIC ). (Also I'm assuming it is done on the same instrument .. I've only tested it on SCP/ultra data ... The one publicly available is HT I think ... Which I haven't run just yet..)

@jsnedeco
Copy link

It's also worth mentioning ... Can you share the .tdf without the tdf_bin? It would give information on the instrument, methods and calibration but not the actual scans (technically it also includes the TIC ). (Also I'm assuming it is done on the same instrument .. I've only tested it on SCP/ultra data ... The one publicly available is HT I think ... Which I haven't run just yet..)

That's a good idea, I added it to the filebin with the other two .d and .mzml files from test 2. It's called test1_analysis.tdf

https://filebin.net/frqq1bks9pbtzm8s

And it should be the same instrument, though I'll double check.

@sander-willems-bruker
Copy link
Contributor

As far as I can tell there is nothing wrong with the tdf you shared. For file 2, I notice the (uncompressed) mzml is not even twice as big as .d. Just a gut feeling, but could you confirm this is also the case for sample 1, or is the mzml much larger in that case?

@jsnedeco
Copy link

For test 1 it's around 4.1 GB for the .d directory and ~6GB for the mzml. So not a big difference, but different.

@sander-willems-bruker
Copy link
Contributor

In that case I remain unsure as to why test 2 runs fine and test 1 does not... I checked the mz calibration and that indeed looks fine as well. Biggest difference in you tdfs seems to be that test 1 has less signal (peaks per scan and summed intensity per frame), but I am skeptical that is the root cause of the issue...

@jsnedeco
Copy link

I got permission to share the raw data for test 1, it is in the same filebin as before labeled test1

https://filebin.net/frqq1bks9pbtzm8s

@sander-willems-bruker
Copy link
Contributor

sander-willems-bruker commented Jul 25, 2024

I regret to say that I actually run the sample without issues:

09:26:27 sander@censored: ~/software/sage (master)$ time ./target/release/sage "/mnt/d/data/data/sage/config_dia.json.txt" /mnt/d/data/sandbox/sage_issue3/test1_240117_Cambridge_P2_RKO_A01_Slot1-01_1_3224.mzml
[2024-07-25T07:27:19Z WARN  sage_core::modification] Variable modifications must be specified as a list of modifications: [15.9949]. This will become a HARD ERROR by v0.15
[2024-07-25T07:27:20Z INFO  sage] generated 14125063 fragments, 683501 peptides in 778ms
[2024-07-25T07:27:20Z INFO  sage] processing files 0 .. 1

[2024-07-25T07:29:23Z INFO  sage] - file IO:   123451 ms
[2024-07-25T07:30:15Z INFO  sage] - search:     51967 ms (1919 spectra/s)
[2024-07-25T07:30:15Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2024-07-25T07:30:15Z INFO  sage_core::ml::retention_alignment] aligned retention times across 1 files
[2024-07-25T07:30:16Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = 0.8882220966956443
[2024-07-25T07:30:20Z INFO  sage_core::ml::mobility_model] - fit mobility model, rsq = NaN, mse = 0
[2024-07-25T07:30:30Z INFO  sage] discovered 123780 target peptide-spectrum matches at 1% FDR
[2024-07-25T07:30:30Z INFO  sage] discovered 22162 target peptides at 1% FDR
[2024-07-25T07:30:30Z INFO  sage] discovered 4796 target proteins at 1% FDR
09:30:35 sander@censored: ~/software/sage (master)$ time ./target/release/sage "/mnt/d/data/data/sage/config_dia.json.txt" /mnt/d/data/sandbox/sage_issue3/240117_P2_RKO_A01_Slot1-01_1_3224.d/
[2024-07-25T07:31:36Z WARN  sage_core::modification] Variable modifications must be specified as a list of modifications: [15.9949]. This will become a HARD ERROR by v0.15
[2024-07-25T07:31:37Z INFO  sage] generated 14125063 fragments, 683501 peptides in 769ms
[2024-07-25T07:31:37Z INFO  sage] processing files 0 .. 1
[2024-07-25T07:33:40Z INFO  sage] - file IO:   123340 ms
[2024-07-25T07:34:48Z INFO  sage] - search:     68161 ms (1463 spectra/s)
[2024-07-25T07:34:49Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0000x + 0.0000
[2024-07-25T07:34:49Z INFO  sage_core::ml::retention_alignment] aligned retention times across 1 files
[2024-07-25T07:34:50Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = 0.8877301282224193
[2024-07-25T07:34:52Z INFO  sage_core::ml::mobility_model] - fit mobility model, rsq = 0.9288593680447328, mse = 0.0009716378078931589
[2024-07-25T07:34:58Z INFO  sage] discovered 111770 target peptide-spectrum matches at 1% FDR
[2024-07-25T07:34:58Z INFO  sage] discovered 19840 target peptides at 1% FDR
[2024-07-25T07:34:58Z INFO  sage] discovered 4337 target proteins at 1% FDR

In attachment the latest config I used
config_dia.json.txt

For completeness, my sage is build from master, with latest commit: commit 93a9a8a7c9f717238fc6c582c0dd501a56159be7 (which should be up to date according to master branch here on github). Only difference is that I use a locally build timsrust, which is ahead of the public timsrust 0.3.0, but on top of my head this is only API changes and nothing algorithmic. If it's not, Timsrust 0.4.0 with @jspaezp excellent implementation to partition diaPASEF windows in smaller scan-ranges will hopefully be released soon giving you the exact same version I use. If that does not fix it, you'll have to post on issue to your PC manufacturer instead of Sage...;)

@jsnedeco
Copy link

I can't believe this. I was using the wrong file for the .d input in the first test. I am so mad.

Sorry to waste everyone's time, hopefully at least the test files will be useful. Thank you everyone for your help!

@jspaezp
Copy link
Contributor

jspaezp commented Jul 25, 2024

@jsnedeco sorry to hear that. But looking at the glass half full it means it is working! I think we can close this issue and follow up on the progress of the alternative methods later. (Ill make sure to tag you when we make a PR related to this.)

@sander-willems-bruker
Copy link
Contributor

I can't believe this. I was using the wrong file for the .d input in the first test. I am so mad.

Sorry to waste everyone's time, hopefully at least the test files will be useful. Thank you everyone for your help!

Happens to all of us;). As @jspaezp meantions, the good news is we've got it fixed:)

@animesh
Copy link

animesh commented Aug 1, 2024

just adding the timsconvert to the tdf2mzml mix as suggested by @RobbinBouwmeester , tried converting https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/cts8a50sb36put8/26june24_hel200_100spd_OT_1ulirt_S2-H2_1_6370.d.zip using both where https://figshare.com/ndownloader/files/48073981 and https://figshare.com/ndownloader/files/48073936 are the timsconvert and tdf2mzml converted files respectively and running sage with parameter sage.json and human_crap.fasta

docker run --rm -it -v /mnt/z/HeLa:/data animesh1977/sage [/data/sage.json](https://github.com/user-attachments/files/16455675/sage.json) -f [/data/human_crap.fasta](https://github.com/user-attachments/files/16455646/fasta.zip) -o /data [/data/26june24_hel200_100spd_OT_1ulirt_S2-H2_1_6370.tdf.mzML](https://figshare.com/ndownloader/files/48073936) [/data/26june24_hel200_100spd_OT_1ulirt_S2-H2_1_6370.timsconvert.mzML](https://figshare.com/ndownloader/files/48073981)
[2024-08-01T11:06:25Z INFO  sage] generated 187444336 fragments, 5604466 peptides in 58835ms
[2024-08-01T11:06:25Z INFO  sage] processing files 0 .. 2 
[2024-08-01T11:11:40Z INFO  sage] - file IO:   314254 ms
[2024-08-01T11:12:51Z INFO  sage] - search:     71557 ms (630 spectra/s)
[2024-08-01T11:12:51Z INFO  sage_core::ml::retention_alignment] aligning file #1: y = 0.9989x + -0.0001
[2024-08-01T11:12:51Z INFO  sage_core::ml::retention_alignment] aligning file #0: y = 1.0001x + -0.0000
[2024-08-01T11:12:51Z INFO  sage_core::ml::retention_alignment] aligned retention times across 2 files
[2024-08-01T11:12:51Z INFO  sage_core::ml::retention_model] - fit retention time model, rsq = 0.896083130236588
[2024-08-01T11:12:51Z INFO  sage_core::ml::mobility_model] - fit mobility model, rsq = NaN, mse = 0
[2024-08-01T11:13:49Z INFO  sage_core::lfq] tracing MS1 features
[2024-08-01T11:13:52Z INFO  sage_core::lfq] integrating MS1 features
[2024-08-01T11:13:52Z INFO  sage] discovered 1221 target MS1 peaks at 5% FDR
[2024-08-01T11:13:52Z INFO  sage] discovered 7031 target peptide-spectrum matches at 1% FDR
[2024-08-01T11:13:52Z INFO  sage] discovered 1519 target peptides at 1% FDR
[2024-08-01T11:13:52Z INFO  sage] discovered 1038 target proteins at 1% FDR
{
  "version": "0.15.0-alpha",
  "database": {
    "bucket_size": 8192,
    "enzyme": {
      "missed_cleavages": 2,
      "min_len": 7,
      "max_len": 50,
      "cleave_at": "KR",
      "restrict": "P",
      "c_terminal": null,
      "semi_enzymatic": null
    },
    "peptide_min_mass": 500.0,
    "peptide_max_mass": 5000.0,
    "ion_kinds": [
      "b",
      "y"
    ],
    "min_ion_index": 2,
    "static_mods": {},
    "variable_mods": {},
    "max_variable_mods": 3,
    "decoy_tag": "rev_",
    "generate_decoys": true,
    "fasta": "/data/human_crap.fasta"
  },
  "quant": {
    "tmt": null,
    "tmt_settings": {
      "level": 3,
      "sn": false
    },
    "lfq": true,
    "lfq_settings": {
      "peak_scoring": "Hybrid",
      "integration": "Sum",
      "spectral_angle": 0.6,
      "ppm_tolerance": 5.0,
      "combine_charge_states": true
    }
  },
  "precursor_tol": {
    "ppm": [
      -20.0,
      20.0
    ]
  },
  "fragment_tol": {
    "ppm": [
      -20.0,
      20.0
    ]
  },
  "precursor_charge": [
    2,
    4
  ],
  "override_precursor_charge": false,
  "isotope_errors": [
    0,
    2
  ],
  "deisotope": true,
  "chimera": true,
  "wide_window": true,
  "min_peaks": 15,
  "max_peaks": 150,
  "max_fragment_charge": 1,
  "min_matched_peaks": 4,
  "report_psms": 5,
  "predict_rt": true,
  "mzml_paths": [
    "/data/26june24_hel200_100spd_OT_1ulirt_S2-H2_1_6370.tdf.mzML",
    "/data/26june24_hel200_100spd_OT_1ulirt_S2-H2_1_6370.timsconvert.mzML"
  ],
  "output_paths": [
    "/data/results.sage.tsv",
    "/data/lfq.tsv",
    "/data/results.json"
  ]
}
[2024-08-01T11:13:54Z INFO  sage] finished in 507s

giving about 1500 peptides lfq.txt and R^2 of 0.764 in log2 space

image

now i am not sure which converter to trust 🤪

Just to compare with a direct search of .d folder using MaxQuant with https://github.com/user-attachments/files/16439221/fasta.zip, full parameter https://github.com/user-attachments/files/16439180/parameters.txt without LFQ, yield is almost 10 times
peptides.txt, probably it can be made better playing with parameter files but i am hoping sage can directly search as well to circumvent all this confusion caused just by the conversion process 🫡

@jspaezp
Copy link
Contributor

jspaezp commented Aug 1, 2024

Just to compare with a direct search of .d folder using MaxQuant with https://github.com/user-attachments/files/16439221/fasta.zip, full parameter https://github.com/user-attachments/files/16439180/parameters.txt without LFQ, yield is almost 10 times

I am not sure what maxquant does nowadays ... LMK if it is a true spectrum centric search. Otherwise it would be the equivalent of comparing de-novo search with a database search.

just adding the timsconvert to the tdf2mzml mix

  1. I have not read that much of the repos nor have that much experience using the SDK ... but I see a couple of places where there are parameters that might not be the same between the converters, I would expect that small changes in those parameters change what the data looks like (at least the mass error might be different... not 100% sure).
    https://github.com/gtluu/pyTDFSDK/blob/6c58bb171b6968564466bd5da0a08c0831e77770/pyTDFSDK/tims.py#L52
  "ppm_tolerance": 5.0,

You are using 20ppm for the search but 5 for the lfq... is there any reason why that is the case? I think having a narrow mass error for the integration might make more noticeable differences in calibration.

probably it can be made better playing with parameter files but i am hoping sage can directly search as well to circumvent all this confusion caused just by the conversion process

I am not sure I agree with this statement ... since there IS a conversion that happens, it is just a lot faster, furthermore, I added within timsrust 2 more ways to convert frames to scans (and timsrust right now does not even expose the parameters to smooth+centroid spectra, which are additional variables). I don't believe there is a single "right" way to do things. To be clear this not an issue unique to .d files, I encourage you to compare proprietary vs OSS centroiding on orbi data ;)

I would encourage you to make this comparison a GH repo with scripts for the reproduction and open an issue tagging the maintainers of the converter projects (I am also happy to pitch in), but I sincerely feel like this is not a sage issue.

Kindest wishes,
Sebastian

@animesh
Copy link

animesh commented Aug 2, 2024

Thanks for looking into this @jspaezp , yes it does look like MaxQuant is doing spectrum match even for DIA but it is hard to confirm since it is not open source 🤪 for what it's worth, Juergen goes through a little bit of detail in his latest talk https://www.youtube.com/watch?v=jT4eLkRU1eQ&t=75s 🙏

  "ppm_tolerance": 5.0,

haven't started playing with parameters yet, just copied the one from the documentation of sage https://sage-docs.vercel.app/docs/configuration/example_PXD003881 , will check it out for sure, thanks for pointing 🤞

Yes, that is correct that some kind of conversion is bound to happen as timsTOF is not writing mzML which sage can understand but i was hoping not to go through this process myself specially when these tools for conversion are giving quite a lot of differences 🤪 BTW is there a command line tool to pinpoint precise difference in these mzML files? Once i have that, i will follow your advice on creating that repo 👍

Call me lazy but i hope in near future i don't have to check it as expected, "Sage is somewhat of a rejection of the UNIX/traditional bioinformatics philosophy of "write programs that do one thing and do it well". (Or, perhaps it is an expansion of this concept... where "one thing" means "the whole analysis")" 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants