diaPASEF first tests #123
I did say theoretically possible 😉 - there are clearly some practical issues. I'm definitely interested in supporting diaPASEF. To be honest, though, it will probably be ~2-3 months before I have time to really dig in and experiment (I have a lot going on until late spring!). It seems like there are a couple of other people interested in trying it out too, and I am always happy to provide guidance on Sage internals - if there's a way to collapse the data into a single spectrum, plugging it into Sage should be straightforward. If you're willing to share the files, I can download them and hack on them when I get a chance!
Thank you so much for even considering this! Please find all the data (also a DIA-NN analysis run directly from .d) here: If supporting diaPASEF is a long-term possibility, may I ask what you are thinking about, very roughly (I will not hold you to it, I promise, except if you say that I can do something theoretically :P)? More specifically, I was wondering whether you would consider reaching out to Vadim Demichev to see whether it is possible to create the library with Sage and then do the main DIA analysis with DIA-NN. It is not open source though, so it might not be ideal. But looking at FragPipe, this could be a powerful combination of tools. Again, if I can help in any way, please let me know! Especially regarding the files that I provided: if there is any other metric I should check or summarize that might be helpful, I would be happy to do so! Best, Klemens
Hello @KlemensFroehlich and @lazear. I would like to point out that the FragPipe DIA workflow uses a program called easypqp (OpenMS) to build the library. The library is then passed down to DIA-NN to perform the quantification. If you go to the easypqp GitHub page, you'll see that someone already made this request to them. Glad to see that people share the same interests and strategies. Best
Yes, sage will write all matched fragments when using the corresponding option in the config.
@KlemensFroehlich @lazear, could this repo be useful? https://github.com/mafreitas/tdf2mzml - in the description they claim to have support for DIA Bruker data (and based on a closed discussion/issue on that repo, this concerns diaPASEF).
To add to this: I tried out a diaPASEF file to see if I could use something besides DIA-NN for quantification (both for closed-source reasons and also because it seems to take a very long time to generate a spectral library from a diaPASEF run). I used tdf2mzml to generate an mzML file from a test Bruker timsTOF HT file and tested the released version of Sage (0.14.7) with chimera and wide_window enabled and report_psms set to 5 (I can send the full config file if it helps; a rough sketch of the relevant settings is below). The top-level identification stats look like this:

[2024-07-18T19:37:40Z INFO sage] discovered 109719 target peptide-spectrum matches at 1% FDR

Not bad! Running DIA-NN gives the following top-line stats:

[34:35] Number of IDs at 0.01 FDR: 117535

So it would be nice to boost the protein-level identifications a touch, but it doesn't look bad as a first pass. It's also possible that DIA-NN is using different metrics for protein counts or something; I haven't investigated that yet. Then I saw this PR (#140) and got extremely excited to try out the raw .d inputs to see if I could skip the conversion step altogether, since it's pretty slow. I built a new version off of the master branch, used the same parameters, and...

[2024-07-18T20:34:23Z INFO sage] discovered 2776 target peptide-spectrum matches at 1% FDR

That seems quite a bit worse. I know that this is still very much an experimental use case, but I'm wondering if I'm doing something wrong?
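For reference, the settings mentioned above would sit in a Sage JSON config roughly like the following. This is only a sketch, not the actual config used here: the chimera, wide_window, and report_psms values come from the comment, while the database block, tolerances, and file paths are placeholders.

```json
{
  "database": {
    "enzyme": { "missed_cleavages": 1, "cleave_at": "KR", "restrict": "P" },
    "static_mods": { "C": 57.0215 },
    "variable_mods": { "M": [15.9949] },
    "fasta": "human.fasta"
  },
  "precursor_tol": { "ppm": [-15, 15] },
  "fragment_tol": { "ppm": [-10, 10] },
  "chimera": true,
  "wide_window": true,
  "report_psms": 5,
  "mzml_paths": ["converted_diaPASEF.mzML"]
}
```

As far as I understand, with a build that includes the direct-reading PR the same mzml_paths field should also accept a .d folder instead of an mzML, but that part is untested here.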
Hi @jsnedeco. This is indeed very experimental; @jspaezp is working on a PR that should boost these numbers quite a bit (MannLabs/timsrust#21). Without looking at the dia.d folder and your config I can't tell if you are doing anything wrong, of course, but I am happy to share my findings with the implementation you also have available (MannLabs/timsrust#15 (comment)). I used this config: MannLabs/timsrust#15 (comment).
Hi @sander-willems-bruker! I used these parameters: It looks pretty similar, except your report_psms is a lot higher. I probably can't share the specific .d file I was testing, but I'll see if I can get a representative sample that I can share.
As a reference: #140 is the PR that added support for diaPASEF direct reading. @jsnedeco (I won't discuss DIA-NN, since it does a completely different thing), from what I see in your message there is something wrong... I have seen on occasion that mass calibration looks different between them, so try running it with ±50 ppm on MS2 (it's overkill, but... who knows :P - see the config sketch below - and if that works, let us know, since it would point to an error in how the m/z is calculated in timsrust). LMK if you can send over a file with representative data that suffers from the same issue.
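For reference, widening the MS2 tolerance to ±50 ppm is a one-line change in the Sage config. A minimal fragment (to be merged into your full sage.json; the asymmetric ppm form is the one used elsewhere in this thread):

```json
{
  "fragment_tol": { "ppm": [-50, 50] }
}
```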
DIA-NN was mostly intended as a rough benchmark, not as something to achieve complete parity with :)
I used the same parameters for both, but I did use the released version of sage for the mzML file and the unreleased version of sage for the .d analysis (because it had to be). I just tested again using the unreleased version for both, with the same parameters I used originally, and the results look the same:

- tdf2mzml -> sage
- timsrust -> sage
- timsrust -> sage, ±50 ppm MS2

In these cases I'm using report_psms: 5, just so I'm not tweaking too many parameters.
All of my samples are 22 min, either ddaPASEF or diaPASEF.

timsrust dda -> discovered 10519 target peptides at 1% FDR
my config:
I wonder what other things are different between the Bruker SDK and timsrust... LMK if you can upload the data anywhere so we could work on it together. RN I am also failing to find a file in PRIDE that is not zipped in a stupid way (zipping all the .d of the repo into a single file...).

Edit: Downloading these two guys for testing...
I'd definitely increase frag_ppm (try 15 ppm first; as @jspaezp mentions, there are samples where you need to go to 50 due to calibration issues with timsrust), max_peaks (200+ for sure) and report_psms (20+ won't hurt in this case) - see the sketch below. What might be going wrong in your particular search is that your mzML might be generated per scan, making the spectra not too complex (meaning max_peaks and report_psms indeed are not too influential for the mzML), whereas timsrust projects the whole window (potentially 500+ scans aggregated, thus far more peaks and far more complex spectra).
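Put together, those suggestions would look roughly like this in the Sage config (a sketch to merge into the full config; the values are just the starting points suggested above, and frag_ppm corresponds to the fragment_tol block):

```json
{
  "fragment_tol": { "ppm": [-15, 15] },
  "max_peaks": 250,
  "report_psms": 20
}
```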
The raw data is identical, as well as all the info in the tdf. The primary things that differ are the mass and ion mobility calibration and the centroiding. Depending on the sample, there might also be a difference in raw intensities, as timsrust does not yet take ICC into account by default (with the API you can correct for this, though; it is just a simple correction factor applied per frame based on TIMS ramp times).
I tried again with the attached config file and the results looked like this:

[2024-07-22T22:05:09Z INFO sage] discovered 1980 target peptide-spectrum matches at 1% FDR

So worse than all the previous tests. Annoyingly, I tried it on another file that I can share and I'm not seeing the same results!

- timsrust -> sage
- tdf2mzml -> sage

I am going to dig into the fragment annotations on the first, unshareable file and see if there are any patterns. If there's any interest, I can share the second test file, but it's not showing the behavior, so it probably won't be that helpful.

[edit] For what it's worth, here are the .d and mzML files for test file 2: https://filebin.net/frqq1bks9pbtzm8s

And just for fun, I took a look at the fragment_ppm for the PSM IDs that were called in both the mzML file and the .d file:
It's also worth mentioning... can you share the .tdf without the tdf_bin? It would give information on the instrument, methods and calibration, but not the actual scans (technically it also includes the TIC). (Also, I'm assuming it was acquired on the same instrument... I've only tested it on SCP/Ultra data... The one publicly available is HT, I think... which I haven't run just yet.)
That's a good idea; I added it to the filebin with the other two .d and .mzML files from test 2. It's called test1_analysis.tdf https://filebin.net/frqq1bks9pbtzm8s And it should be the same instrument, though I'll double-check.
As far as I can tell there is nothing wrong with the tdf you shared. For file 2, I notice the (uncompressed) mzML is not even twice as big as the .d. Just a gut feeling, but could you confirm this is also the case for sample 1, or is the mzML much larger in that case?
For test 1 it's around 4.1 GB for the .d directory and ~6 GB for the mzML. So not a big difference, but different.
In that case I remain unsure as to why test 2 runs fine and test 1 does not... I checked the m/z calibration and that indeed looks fine as well. The biggest difference in your tdfs seems to be that test 1 has less signal (peaks per scan and summed intensity per frame), but I am skeptical that that is the root cause of the issue...
I got permission to share the raw data for test 1; it is in the same filebin as before, labeled test1.
I regret to say that I actually ran the sample without issues:
Attached is the latest config I used. For completeness, my sage is built from master, with the latest commit:
I can't believe this. I was using the wrong file for the .d input in the first test. I am so mad. Sorry to have wasted everyone's time; hopefully at least the test files will be useful. Thank you everyone for your help!
@jsnedeco sorry to hear that. But looking at the glass half full, it means it is working! I think we can close this issue and follow up on the progress of the alternative methods later. (I'll make sure to tag you when we make a PR related to this.)
Happens to all of us ;). As @jspaezp mentions, the good news is we've got it fixed :)
Just adding timsconvert to the tdf2mzml mix, as suggested by @RobbinBouwmeester. I tried converting https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/cts8a50sb36put8/26june24_hel200_100spd_OT_1ulirt_S2-H2_1_6370.d.zip using both (https://figshare.com/ndownloader/files/48073981 and https://figshare.com/ndownloader/files/48073936 are the timsconvert- and tdf2mzml-converted files, respectively) and running sage with the parameter file sage.json and human_crap.fasta, giving about 1500 peptides (lfq.txt) and an R^2 of 0.764 in log2 space. Now I am not sure which converter to trust 🤪 Just to compare: a direct search of the .d folder using MaxQuant with https://github.com/user-attachments/files/16439221/fasta.zip (full parameters: https://github.com/user-attachments/files/16439180/parameters.txt), without LFQ, yields almost 10 times as many.
I am not sure what MaxQuant does nowadays... LMK if it is a true spectrum-centric search. Otherwise it would be the equivalent of comparing a de novo search with a database search.
You are using 20 ppm for the search but 5 for the LFQ... is there any reason why that is the case? I think having a narrow mass tolerance for the integration might make differences in calibration more noticeable.
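(For reference, those are two separate settings in the Sage config. A minimal sketch with the numbers mentioned above, assuming the 20 ppm refers to the fragment tolerance; the lfq_settings field names are as I recall them from the Sage docs, so please double-check against your sage.json:)

```json
{
  "fragment_tol": { "ppm": [-20, 20] },
  "quant": {
    "lfq": true,
    "lfq_settings": { "ppm_tolerance": 5.0 }
  }
}
```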
I am not sure I agree with this statement... since there IS a conversion that happens, it is just a lot faster. Furthermore, I added two more ways to convert frames to scans within timsrust (and timsrust right now does not even expose the parameters to smooth + centroid spectra, which are additional variables). I don't believe there is a single "right" way to do things. To be clear, this is not an issue unique to .d files; I encourage you to compare proprietary vs. OSS centroiding on Orbitrap data ;) I would encourage you to make this comparison a GitHub repo with scripts for reproduction and open an issue tagging the maintainers of the converter projects (I am also happy to pitch in), but I sincerely feel like this is not a sage issue. Kindest wishes,
Thanks for looking into this @jspaezp. Yes, it does look like MaxQuant is doing spectrum-centric matching even for DIA, but it is hard to confirm since it is not open source 🤪 For what it's worth, Juergen goes through a little bit of detail in his latest talk: https://www.youtube.com/watch?v=jT4eLkRU1eQ&t=75s 🙏
I haven't started playing with parameters yet, I just copied the one from the sage documentation (https://sage-docs.vercel.app/docs/configuration/example_PXD003881); I will check it out for sure, thanks for pointing it out 🤞 Yes, it is correct that some kind of conversion is bound to happen, as the timsTOF is not writing mzML which sage can understand, but I was hoping not to go through this process myself, especially when these conversion tools are giving quite a lot of differences 🤪 BTW, is there a command-line tool to pinpoint the precise differences between these mzML files? Once I have that, I will follow your advice on creating that repo 👍 Call me lazy, but I hope that in the near future I don't have to check it; as expected, "Sage is somewhat of a rejection of the UNIX/traditional bioinformatics philosophy of 'write programs that do one thing and do it well'. (Or, perhaps it is an expansion of this concept... where 'one thing' means 'the whole analysis'.)" 🙏
Hi Michael,
Please forgive me for bringing this up again... In my defense: you said nothing is preventing users from converting .d folders to mzML and analyzing diaPASEF data :)
I have tried a few things to see how different software tools currently handle diaPASEF. For this, I generated a ~10 ng, 5 min human HEK active-gradient diaPASEF run on a timsTOF Ultra.
File size as .d folder: 1.4 GB
I have now tried 3 different ways to generate an mzML file:
I searched the data with sage, and for comparison I also included a 30 min ddaPASEF run, which was directly analyzed from .d.
The ddaPASEF run seems to work nicely, looking at the q-value distributions at the spectrum, peptide, and protein level, but for all the other searches the q-value distribution looks odd to me.
I would be happy to share the data with you, or to do more testing and benchmarking. As I already mentioned, there is currently no non-proprietary option to analyze diaPASEF data with large search spaces, so I am highly motivated to get this off the ground.
You also said earlier that it is not trivial to collapse/handle the ion mobility dimension in tims data.
Would you be open to having a look at the msconvert ion mobility combine option and at the resulting data structure?
Maybe this can already solve the problem, and sage "only" needs to be adapted to handle this specific mzML?
Oh, and by the way, working with sage made me smile more than once! It is a lot of fun to work with a tool this fast! I still remember the days when I had to run semi-tryptic searches in MaxQuant for a week xD
Best, Klemens