Integration with Sage #334
Hi Michael, Yes! In fact, we have already talked about the option of adding Sage! I will try to get the time to dig a bit deeper next week and get back to you when I have gotten an overview of how easy it would be to include Sage in SearchGUI (and PeptideShaker). But as far as I remember there did not seem to be any real issues. Best regards, |
Hi again, I had a closer look at the possibility of adding Sage to SearchGUI and as far as I can see there should not be any major showstoppers. I did however come across a couple of things that ought to be addressed before we can continue.
The first thing I noticed when trying to run Sage was that it seemed to lack any indication of progress? My search failed quite quickly (more on that below), but for longer searches it would be very nice to have some sort of indication of progress? Nothing fancy, just some text showing what the tool is currently doing and ideally a percent indicator. (It could of course be that such information is already displayed after the step where my test came to an end, but would still be great to have some indication as soon as the tool starts as well?)
When going through the json parameters I could not find any way of specifying the enzyme? Seems like trypsin is assumed? Would it be possible to add support for additional enzymes?
Similarly, it would be great to be able to define terminal modifications at specific amino acids? Maybe this is already possible? I did not find any examples in the json comments though?
For the decoy_prefix option, would it be possible to also support suffix annotation? As our headers generally look like this:
You write that "MS2 search results will be stored as a Percolator-compatible (.sage.pin) file". Could you share an example of a results file from Sage such that we can verify whether it contains the information we need to process it in PeptideShaker?
Finally, below is the error I got when trying to run the following command line:
I used a standard FASTA file from UniProt and the mzML file is based on a conversion from our standard example raw file. You can find all of the input files here: https://www.dropbox.com/s/9qg5dhsxdf0pqrz/Sage%20test%20files.zip?dl=0 Hopefully all of these points should not be too hard to fix/implement? Best regards, |
Hi Harald, Thanks for the comments.
I am definitely interested in getting integrated with PeptideShaker - more so than SearchGUI. Seems like there are fewer blocking issues too, while I figure out how/if to tackle some of the points raised above. Best, |
Hi Mike,
If a progress indicator takes up too many resources, maybe just add short messages indicating which step is currently being executed?
This is not really a showstopper from our end as we have to recalculate the FDR in PeptideShaker anyway after merging the results from all of the individual search engines. So the most important thing for us here is actually being able to tell Sage to treat all of the provided sequences in the same way, i.e. disregarding the decoy tags. But I guess that is what happens if simply leaving out the decoy_prefix option?
Thanks! Seems to have most of what we need. The only thing I could spot missing after a quick check is a way to link back to the original spectrum. For this we need the name of the spectrum file and ideally the spectrum title or index. Perhaps the scan number can be used for the latter, but I would need to test. We also prefer having more than one hit per spectrum. I see that you do not recommend this in the comments for your json example file, but if we do increase the report_psms option to a number higher than one, how would the ranks be shown in the output file? Finally, it would be great to have a way to indicate that this file came from Sage and ideally from which version as well? Best regards, |
Sounds good - just to clarify, these are the things that need to be resolved for PeptideShaker integration:
Anything else I missed? |
Yes, I think that should cover it. Not sure about the scan number. I guess we'll find out when trying to implement it. But maybe you can also keep the index from the mzML headers (instead of the titles/ids)? To be honest I'm not really sure what the scan number represents, as it does not seem to be a separate tag and rather part of the spectrum id, i.e. title, tag? Example: Regarding how to indicate the Sage version I'm not sure what the best option would be. Is there a way of adding comments at the start of a pin file, i.e. before the actual data starts? Or would this result in the pin file not being readable by other tools? Adding a new column may be the easiest option as it can simply be ignored by other tools. In any case, it is more important to be able to tell that the pin file came from Sage than the exact version number. At the moment there is nothing in the current file that can be used for this I think? Would it make sense to add Sage to any of the current column headers? If this information cannot easily be added, we should still be able to parse the file, we just would not be able to tell which algorithm was used to generate it and will simply have to call it "generic pin file format" or something like that in PeptideShaker. |
I pulled out the scan number because it was required for percolator/mokapot input ("ScanNr"), and is also reported by Proteome Discoverer, MSFragger, Comet, etc - this enables easy comparison of results from multiple search engines (using Pandas, etc). I do extract it from the I can change the |
Here are the definitions from the mzML specification document:
To me it seems like the index ought to be included in the pin file as it is the only guaranteed way of referring back to a given spectrum in the mzML file. Actually, it would also be great to have the name of the mzML file in a separate column so that we do not have to guess this from the name of the pin file?
That would work. :) |
That is a fair point - clearly, though, scan numbers can be used to uniquely refer to spectra, given that they are the preferred indexType in the USI format (https://raw.githubusercontent.com/HUPO-PSI/usi/master/Specification/USI_SpecDoc_1.0.draft10_2021-05-20.pdf). They do also note this complication with some instruments (appears to be just AB SCIEX?) not reporting scan numbers. Currently, Sage will default to using the spectrum index + 1 as the internal "scan number" if the scan field isn't available. This doesn't seem like correct behavior, but acts as a work around if someone uses a SCIEX instrument, I suppose (which I don't think has happened yet...). I will look into the performance cost of keeping track of the entire spectrum ID string, if it's acceptable, then I will implement that! |
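The fallback described above (use the scan number when the spectrum ID carries one, otherwise the spectrum index + 1) can be sketched roughly as follows. This is only an illustration in Python, not Sage's actual implementation; the function name `scan_number` and the assumption that the ID contains a `scan=<n>` token (as in Thermo-style mzML native IDs) are hypothetical choices made for the example.

```python
import re

# Assumption: the scan number appears as a "scan=<digits>" token inside the
# mzML spectrum id, e.g. "controllerType=0 controllerNumber=1 scan=4521".
_SCAN_PATTERN = re.compile(r"\bscan=(\d+)\b")


def scan_number(spectrum_id: str, index: int) -> int:
    """Return the scan number found in spectrum_id, else fall back to index + 1.

    `index` is the zero-based position of the spectrum in the mzML file.
    """
    match = _SCAN_PATTERN.search(spectrum_id)
    if match is not None:
        return int(match.group(1))
    # No scan token available (e.g. some vendor formats): use index + 1.
    return index + 1
```

With this fallback the reported value is only a true scan number when the ID actually contains one, which is why keeping the mzML index (or the full ID string) in a separate column remains the unambiguous way to refer back to a spectrum.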
Here are a few lines from an example file - let me know if this works! |
Also, regarding the above error:
This is occurring because there is no |
Thanks! I will look at it as soon as possible. I'll keep you posted.
Yes, I think that makes sense and is in line with what other search engines seem to do. |
Yes, it does! I've now implemented a parser for Percolator pin files in PeptideShaker and it is able to load your short example file! I've deployed a beta version that you can test here: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip I have only tested it on the small example file yet though, so there are probably still undetected issues. Note also that the pin file parser currently only supports your specific pin file format and requires that the following column headers are used: peptide, charge, posterior_error, rank, scannr and filename (in any order). (A column named sage_discriminant_score is also needed to detect that the file was generated by Sage, but this is not mandatory.) Please try the beta version on a complete Sage results file and let me know how it goes? |
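For anyone producing pin files by other means, the header requirements listed above can be checked before loading with a few lines of Python. This is a rough sketch and not the PeptideShaker parser itself: the function name `check_pin_header` is made up for the example, the file is assumed to be tab-separated with the header on the first line, and comparing column names case-insensitively is an assumption rather than documented behaviour.

```python
import csv
import io

# Column names taken from the comment above.
REQUIRED_COLUMNS = {"peptide", "charge", "posterior_error", "rank", "scannr", "filename"}
SAGE_MARKER_COLUMN = "sage_discriminant_score"


def check_pin_header(handle):
    """Return (missing_columns, looks_like_sage) for a tab-separated pin file.

    `handle` is any text file object positioned at the start of the file.
    """
    header = next(csv.reader(handle, delimiter="\t"))
    # Assumption: header names are compared case-insensitively.
    columns = {name.strip().lower() for name in header}
    missing = sorted(REQUIRED_COLUMNS - columns)
    return missing, SAGE_MARKER_COLUMN in columns


# Usage with an in-memory header line (illustrative data only).
example = "filename\tscannr\tpeptide\tcharge\trank\tposterior_error\tsage_discriminant_score\n"
missing, looks_like_sage = check_pin_header(io.StringIO(example))
```

An empty `missing` list means all required headers are present; `looks_like_sage` only reports whether the optional marker column was found.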
I got it to load! A couple of issues/comments:
N-term mods: Looks like there is a failure with N-terminally modified peptides (out of bounds) here: https://github.com/compomics/compomics-utilities/blob/0a75023c9ba7b4bb6676460b70005ec0ccee88a8/src/main/java/com/compomics/util/experiment/io/identification/idfilereaders/PercolatorInputfileReader.java#L259 I'm using ProForma suggested notation for modifications now, so will have something like:
|
I've added support for PeptideShaker/SearchGUI style database, and an option to disable internal decoy generation so we are good on that front - looks absolutely great, numbers are matching what I expect, etc! One additional bug/quirk: I noticed that when reporting multiple PSMs for the same spectrum (e.g. |
By default fixed modifications are not shown. You can change this in the View > Fixed Modifications menu option.
You would have to set this in the search settings before loading the data. You'll find it in the same dialog as where you set the PTMs. See the Isotopes option in the lower right-hand corner. The default is 0-1.
The only place that you will see the search engine scores will be in the Spectrum IDs tab and only if also checking the View > Scores menu option. However, at the moment I'm using the posterior_error value and not the sage_discriminant_score. Do you want me to change this?
Indeed, as I did not have any such PTMs in the short example file. Should not be too hard to add the required support though. I will look into it and deploy a new version.
I will need to look deeper into this later. But yes, rank 1 is best, rank 2 is second best etc. Note that we only retain the best one though (except for in the Spectrum IDs tab). The additional hits are mainly there for better comparison between search engines, or so that we can overrule the search engine if we don't agree with the best hit or if more than one hit has the same score. |
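The rank handling described above (rank 1 is best, and only the best hit per spectrum is retained) could look roughly like this. It is a simplified Python sketch, not the PeptideShaker code: the function name `best_hits` and the dictionary-based PSM records are assumptions, with keys borrowed from the pin column names (`filename`, `scannr`, `rank`).

```python
def best_hits(psms):
    """Keep only the lowest-rank PSM for each (filename, scannr) pair.

    `psms` is an iterable of dicts with at least the keys "filename",
    "scannr" and "rank". On equal ranks the first PSM seen is kept.
    """
    best = {}
    for psm in psms:
        key = (psm["filename"], psm["scannr"])
        if key not in best or psm["rank"] < best[key]["rank"]:
            best[key] = psm
    return best
```

Keying on the file name together with the scan number matters once results from several spectrum files end up in one table, since scan numbers repeat across files.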
I've just redeployed the beta version which should now also support terminal PTMs: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip |
For cysteine mod, the peptides aren't making it through the import filter:
I suppose PS is looking for peptides of the form "PEPCA" not "PEPC[+57.0214]A" with fixed modifications?
No, this should be better/more stable across runs, thanks! I'll test out with N-term mods in a bit. |
We're just parsing all of the modifications provided in the peptide sequences. These should later be mapped to the correct modifications, i.e. comparing to the modifications set in the search settings. But perhaps something is going wrong there. Can you share the pin file so that I can test?
So stick with posterior_error? |
Seems like we currently treat all modifications in pin files as variable. Hence, I need to detect the fixed ones and treat them differently. I'll see what I can do and deploy a new version. |
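To make the fixed versus variable distinction concrete, here is a small Python sketch that strips bracketed mass shifts such as `PEPC[+57.0214]A` from a peptide string and flags those that match a fixed modification from the search settings. It is not the compomics-utilities implementation: the function name `parse_peptide`, the `fixed` mapping, the 0.001 Da matching tolerance and the `[+mass]-` prefix for N-terminal modifications (following the ProForma convention) are all assumptions made for illustration.

```python
import re

_MASS = r"[+-]\d+(?:\.\d+)?"
_NTERM = re.compile(rf"^\[({_MASS})\]-")
_TOKEN = re.compile(rf"([A-Z])(?:\[({_MASS})\])?")
_WHOLE = re.compile(rf"(?:[A-Z](?:\[{_MASS}\])?)+")


def parse_peptide(annotated, fixed=None, tol=1e-3):
    """Split an annotated peptide into its bare sequence and modifications.

    Returns (sequence, mods), where each mod is a tuple
    (position, residue, mass_shift, is_fixed). Position 0 denotes the
    N-terminus; residues are numbered from 1.
    `fixed` maps a residue letter to its fixed mass shift (hypothetical input).
    """
    fixed = fixed or {}
    mods = []
    nterm = _NTERM.match(annotated)
    if nterm:
        mods.append((0, "N-term", float(nterm.group(1)), False))
        annotated = annotated[nterm.end():]
    if not _WHOLE.fullmatch(annotated):
        raise ValueError(f"unsupported peptide annotation: {annotated!r}")
    residues = []
    for position, match in enumerate(_TOKEN.finditer(annotated), start=1):
        residue, mass = match.group(1), match.group(2)
        residues.append(residue)
        if mass is not None:
            shift = float(mass)
            is_fixed = residue in fixed and abs(fixed[residue] - shift) <= tol
            mods.append((position, residue, shift, is_fixed))
    return "".join(residues), mods
```

For example, calling `parse_peptide("PEPC[+57.0214]A", fixed={"C": 57.021464})` yields the bare sequence `PEPCA` with the cysteine shift marked as fixed, which is the form an import filter expecting unannotated fixed modifications would need.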
Here is a link to the PeptideShaker project zip file - can use this mzml file for additional testing too: https://www.dropbox.com/s/kp057sc4n9t92mj/b.psdb?dl=0 And yes, I should've been more clear - stick with posterior_error! |
Seems to only contain the psdb file? Better if you could share the pin file and the search settings used (I already have the mzml file)? That way I can try to load the data myself and hopefully fix the fixed modification issue. |
I've just redeployed the beta version which should now also support fixed PTMs: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip |
Whoops... copied the wrong file. Here is a link for the zipped results files/parameters: https://www.dropbox.com/s/9lyf2qoobwhe7t3/Archive.zip?dl=0 |
Thanks! As far as I can tell it all seems to be working fine now? At least I'm able to load this particular file without any issues. :) |
Great! Then all that remains is to also add Sage to SearchGUI? :)
Probably, as our current parser is rather Sage-specific anyway and not really a generic pin file parser.
Not sure, could be. If you look at the validation plots in the Validation tab there seems to be a very clear separation between the good scoring target hits and the bad scoring decoy hits. At least at the protein and peptide level. The PSM level is however quite strange with only hits scoring 100 for both target and decoy. Which is very different from the PeptideShaker example dataset. Not really sure why? However, the Score/Confidence column is often, but not always, displaying 100 for the PeptideShaker example dataset as well. I think I will have to forward this question to my co-developer Marc Vaudel as this is more his field of expertise. @mvaudel Any comments? |
I think so - I will get started on the remaining issues. I think additional enzyme support is probably the biggest blocker?
I will do this then. I noticed this line of code: https://github.com/compomics/compomics-utilities/blob/d5e88a13d298d9b159b37a93ba1e04124693c730/src/main/java/com/compomics/util/experiment/io/identification/idfilereaders/PercolatorInputfileReader.java#L280 This is, I believe, incorrect behavior. There should be a check to see if This hasn't bitten me yet, and I don't have any test files with such cases yet, but I imagine it will happen at some point.
Yeah, that's exactly what I noticed as well |
Wow this is all very impressive, both in quality and speed.
If my memory is correct (this is 10+ year old code) the score column on the GUI requires the score to be distributed like a PEP and displays 100*(1-PEP), i.e. a good score should be close to zero and then the column displays 100, and a bad score should be close to 1 and then the column displays 0. |
Yes, I think so. Would also be great to get the precursors without charge state information issue addressed? (As this would allow us to also search our (old) standard example dataset.) I'm now fully booked with other things until the end of this week. I'll move on to the SearchGUI support over the weekend.
Yes, you are correct. This was indeed a bug. I made a slightly different fix that seems to work. Redeployed at the same link: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip. |
Great - no rush! I will let you know when initial support for enzymes & charge-less precursors is done, will probably be next week for me as well. Re: Scoring, Sage is reporting the posterior_error (PEP) already log10-transformed, so it should be an easy transform to get it back to what's expected. |
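Putting the two comments above together: if the posterior_error column holds log10(PEP) and the GUI column displays 100*(1-PEP), the conversion is a one-liner. A minimal Python sketch, assuming base-10 logs as stated above; the function name `confidence_from_log10_pep` and the clamping of PEP to [0, 1] are choices made for this example only.

```python
def confidence_from_log10_pep(log10_pep: float) -> float:
    """Convert a log10-transformed PEP into the displayed score 100 * (1 - PEP)."""
    pep = 10.0 ** log10_pep
    # Guard against values slightly outside [0, 1] caused by rounding.
    pep = min(max(pep, 0.0), 1.0)
    return 100.0 * (1.0 - pep)
```

A log10 PEP of 0 (PEP = 1) maps to a displayed score of 0, while strongly negative values approach 100. Feeding the log10 values in without undoing the transform would not give a meaningful score, which may be related to the unusual PSM-level plots mentioned earlier.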
Hi @hbarsnes, I'm ready for round 2 when you are! Enzyme support & charge-less searching have been merged and released. |
Great! Thanks for the update! I'm still stuck with other duties, but I have managed to make some progress on adding Sage to SearchGUI. Hopefully I'll get the time to complete the work later this week. I will keep you posted. |
I finally got some time to work on adding Sage to SearchGUI and already have the command line set up and running from inside SearchGUI and the results opening in PeptideShaker. Still need to add the custom parameters and set up the enzyme and modification mapping, but no major problems uncovered so far. I did however come across a couple of minor issues that you can have a look at while I complete the integration.
|
|
Ah, ok, that makes sense. Not sure how to make this fit with the other search engines though, as they all provide one output file per spectrum file (or at least that is how we use them). Basically, we run one command line per search engine per spectrum file (as the results are later merged in PeptideShaker anyway). But if we do that for Sage, I guess we would end up overwriting the result files and only be left with the results from the final spectrum file (given that the output folder would be the same). I will see what I can do, but I will first focus on getting it all to work for a single spectrum file. |
A first beta version of SearchGUI with Sage support is now available here: https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.2.0-beta/SearchGUI-4.2.0-beta-windows.zip and https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.2.0-beta/SearchGUI-4.2.0-beta-mac_and_linux.tar.gz To open the data in PeptideShaker you will also need to update your PeptideShaker beta version: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip You will also have to make a new search parameters file in SearchGUI (will look into auto updating old parameter files later). Note that I've not yet had the time to do much real testing, but thought I'd share these anyway so that at least you can check them out and perhaps discover a bug or two. However, from the tests I've done, it all seems to be working fine. But there is most likely still lots of finetuning required. Note also that I have not yet found a way to easily include multiple linux and osx builds of Sage. Partly as I was not sure which versions to include? Do we really need all of the four linux builds in SearchGUI? |
Woo! I will give it a try - also working on some of the additional features as well. I think for Linux, just "sage-v0.7.1-x86_64-unknown-linux-gnu.tar.gz" - I will probably remove the pre-compiled musl builds since I don't think anyone is using them? aarch64 builds are mostly for when I am running Sage on cheaper AWS Graviton (ARM) instances - I doubt many people are running aarch64 linux with a GUI. For Mac, I would perhaps include both versions? aarch64 is required for M1 support, x86_64 for anyone not on M1 hardware (I think? I only have an M1). |
I think that was the one I included, but in any case, I'll make sure that it is the one included in the next beta release.
Yes, I think that would be the best option. Just need to find an easy way to figure out which one to run on a given system. Again, I'm not sure which version I included in the current beta release, but I guess you will find that out quickly enough when you try to test it. BTW, if you want to quickly replace the Sage version included in SearchGUI you can simply replace the given sage executable found in the subfolders under SearchGUI-4.2.0-beta\resources\Sage. Oh, and if you are not too familiar with SearchGUI, if you want to change the Sage-specific parameters, you will find these by clicking the cogwheel to the right of where you turn on/off each individual search engine. (All of the options will also later be added here: https://github.com/compomics/compomics-utilities/wiki/IdentificationParametersCLI.) |
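One way to decide which bundled build to run on a given system is to map the operating system and CPU architecture to a build name. The sketch below is in Python purely for illustration (SearchGUI itself is Java) and is not how SearchGUI does it. Only `x86_64-unknown-linux-gnu` is taken from the release file name quoted above; the macOS names follow the usual Rust target-triple convention and are assumptions, as is the function name `sage_target`.

```python
import platform

# Map the names reported by platform.machine() onto the two architectures
# discussed in this thread.
_ARCHITECTURES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "aarch64": "aarch64",
    "arm64": "aarch64",
}


def sage_target(system=None, machine=None):
    """Return a build name such as "x86_64-unknown-linux-gnu".

    Both arguments default to the values detected for the current system.
    """
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    arch = _ARCHITECTURES.get(machine)
    if arch is None:
        raise ValueError(f"unsupported architecture: {machine}")
    if system == "linux":
        return f"{arch}-unknown-linux-gnu"
    if system == "darwin":
        # Assumption: macOS builds are named after the Rust target triples.
        return f"{arch}-apple-darwin"
    raise ValueError(f"unsupported operating system: {system}")
```

Passing the system and machine explicitly makes the mapping easy to test without needing each platform at hand.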
Couple quick notes on parameters:
|
This seems to be an error in Sage and not SearchGUI? As I've just double checked that the json file created by SearchGUI does include the decoy tag set in the SearchGUI options. Example: sage.txt. (Renamed to .txt so that it could be uploaded to the issue tracker.) I've also tested with generate_decoys set to both true and false, but this does not seem to make any difference either.
Ok, good to know. I think we'll just keep it anyway as it's already implemented. And then we'll have it available if we change SearchGUI so that it searches multiple files at the same time at a later stage. |
The file you sent has "decoy_prefix" set, not "decoy_tag"! I changed the option name a while back to make it more clear that it's not limited to prefix. I just tried running SearchGUI on OSX (was on Windows yesterday), and I can no longer open the search settings (cog) for Sage, which causes an NPE, although I can for other tools
|
Aha! Nice catch! Fixed and new beta versions deployed: https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.2.0-beta/SearchGUI-4.2.0-beta-windows.zip and https://genesis.ugent.be/maven2/eu/isas/searchgui/SearchGUI/4.2.0-beta/SearchGUI-4.2.0-beta-mac_and_linux.tar.gz
Does this happen when using previously created settings or also new ones created in the latest beta release? As for the moment you cannot use the old settings as I did not get the time to make this backwards compatible yet. |
@lazear I noticed that you had released a new version of Sage with support for more modification types. I therefore updated our code to take advantage of this and released new beta versions of SearchGUI as well. Same links as before. |
Awesome, thanks. I'm busy with some other stuff this week, but I will get around to additional enzymes soon! |
@hbarsnes do you have any favorite test data sets including enzymatic cleavage with N-specific enzymes or any non-specific digestion? I think I'm ready for the next phase of testing! |
No, I'm afraid not. We generally just test with tryptic digestion ourselves. You can probably find some in PRIDE though?
Great! Looking forward to testing the new Sage versions! |
Just ran across sage today. Wow it is fast. I thought it would be nice to have this in SearchGUI and I ran across this thread! I downloaded the beta and am trying it out. What can I do to help? |
@mknierman thanks! I always welcome any contributions/bug reports/etc. Right now, I just need to get some free time to do some more testing on the next round of updates to Sage - hopefully in the next couple of weeks. |
@mknierman Thanks for the encouragement! Feel free to test the current beta version and let us know if you come across any issues. Then I will let you know when there is a new beta version to test. |
@lazear With your recently added support for additional modification types in Sage, which is now included in SearchGUI v4.2.13, I guess we can close this "Integration with Sage" issue and say that Sage is now fully integrated in SearchGUI (and PeptideShaker)? At least I cannot remember any still remaining features? It would of course also be possible to add support for showing the Sage quantification data in PeptideShaker, but that is definitely a separate task. :) |
I think we hit all of the important features! Thank you for all of the silent background work you've been doing to keep Sage updated in SearchGUI/PeptideShaker. We can definitely add some quantification support - right now the MS1 LFQ data is being written to a separate file ( |
Great! I will then close this issue. I will get back to you when I've had the chance to think more about how to potentially integrate the quantification data from Sage into PeptideShaker. |
Hi,
Is there any interest in integrating additional search engines? Sage is an open source (MIT licensed) search engine that utilizes the ion-indexing approach to run fast narrow & open searches (>5x faster than MSFragger). I would be happy to assist with getting it integrated.