Parsing mzIdentML is whitespace sensitive #493

Open
lazear opened this issue Oct 30, 2022 · 3 comments

lazear commented Oct 30, 2022

Hi,

Since we were discussing integration of Sage (compomics/searchgui#334), I wrote an MzIdentML module to write results, since I wanted to play around with PeptideShaker a bit more.

Unfortunately, parsing of Modification elements (and possibly other elements) appears to be whitespace dependent. The XML library I am using to write the mzIdentML files (serializing from Rust structs) does not support whitespace/indentation at this time...
I have included links to two minimal examples of the same mzid file (which passes the PSI Validator tool): one is formatted by an external tool and loads in PeptideShaker fine; the other is the unformatted version, which throws the error below:

Formatted mzid: https://gist.github.com/lazear/c7bc428bd7e5227d85a7b5745085c346
Unformatted mzid: https://gist.github.com/lazear/7dd0403d2df1c3f7dd2f0d08c91302f8

Notably, changing any Modification entry in the working file to a single line is sufficient to reproduce the issue.

<Modification monoisotopicMassDelta="15.9949" location="2"><cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" /></Modification>
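For comparison, here is a minimal sketch of how a general-purpose XML parser treats the two layouts. It uses Python's standard `xml.etree.ElementTree`, not PeptideShaker's reader, and the `summarize` helper is a hypothetical name introduced only for this illustration. The fragment carries no namespace declaration; real mzIdentML files declare a default namespace, which a reader would have to account for.

```python
import xml.etree.ElementTree as ET

# The single-line form from the report above.
SINGLE_LINE = (
    '<Modification monoisotopicMassDelta="15.9949" location="2">'
    '<cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" />'
    '</Modification>'
)

# The same element with line breaks and indentation added.
INDENTED = """<Modification monoisotopicMassDelta="15.9949" location="2">
    <cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" />
</Modification>"""


def summarize(xml_text):
    """Return the attribute data carried by one Modification element (hypothetical helper)."""
    mod = ET.fromstring(xml_text)
    cv = mod.find("cvParam")
    return {
        "mass_delta": float(mod.get("monoisotopicMassDelta")),
        "location": int(mod.get("location")),
        "accession": cv.get("accession"),
        "name": cv.get("name"),
    }


single = summarize(SINGLE_LINE)
indented = summarize(INDENTED)
print(single == indented)
```

Both layouts yield the same attributes and the same `cvParam` child, so the difference PeptideShaker sees is in the layout only, not in the data.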

Spectrum file is "b1906_293T_proteinID_01A_QE3_122212.raw" from http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD001468

Error message:


Sun Oct 30 11:51:01 PDT 2022: PeptideShaker version 2.2.17.
Memory given to the Java virtual machine: 4294967296.
Total amount of memory in the Java virtual machine: 138412032.
Free memory: 95266672.
Java version: 19.
1714 script command tokens
(C) 2009 Jmol Development
Jmol Version: 12.0.43  2011-05-03 14:21
java.vendor: Homebrew
java.version: 19
os.name: Mac OS X
memory: 54.2/134.2
processors available: 8
useCommandThread: false
WARNING: row index is bigger than sorter's row count. Most likely this is a wrong sorter usage.
java.lang.IllegalArgumentException: Could not parse PTM!
	at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.parsePeptide(MzIdentMLIdfileReader.java:392)
	at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.parseFile(MzIdentMLIdfileReader.java:293)
	at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.getAllSpectrumMatches(MzIdentMLIdfileReader.java:202)
	at eu.isas.peptideshaker.fileimport.FileImporter.importPsms(FileImporter.java:466)
	at eu.isas.peptideshaker.fileimport.FileImporter.importFiles(FileImporter.java:277)
	at eu.isas.peptideshaker.PeptideShaker.importFiles(PeptideShaker.java:219)
	at eu.isas.peptideshaker.gui.NewDialog$20.run(NewDialog.java:736)
	at java.base/java.lang.Thread.run(Thread.java:1589)
java.lang.IllegalArgumentException: Could not parse PTM!
	at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.parsePeptide(MzIdentMLIdfileReader.java:392)
	at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.parseFile(MzIdentMLIdfileReader.java:293)
	at com.compomics.util.experiment.io.identification.idfilereaders.MzIdentMLIdfileReader.getAllSpectrumMatches(MzIdentMLIdfileReader.java:202)
	at eu.isas.peptideshaker.fileimport.FileImporter.importPsms(FileImporter.java:466)
	at eu.isas.peptideshaker.fileimport.FileImporter.importFiles(FileImporter.java:277)
	at eu.isas.peptideshaker.PeptideShaker.importFiles(PeptideShaker.java:219)
	at eu.isas.peptideshaker.gui.NewDialog$20.run(NewDialog.java:736)
	at java.base/java.lang.Thread.run(Thread.java:1589)

Also, while I'm here... is there a way to completely turn off all of PeptideShaker's filters & validation features? I would love to be able to use it as just a GUI/PSM visualizer that blindly trusts what is in the mzIdentML file - I understand if this doesn't align with the goals of the project though

@hbarsnes hbarsnes self-assigned this Oct 30, 2022
@hbarsnes hbarsnes changed the title Parsing Modification is whitespace sensitive Parsing mzIdentML is whitespace sensitive Oct 30, 2022
@hbarsnes
Member

Yes, you are indeed correct in that it seems like our mzid parsing is formatting-specific. I guess we never considered that anyone would want to write an mzid file without any formatting, as it makes it near impossible to read for humans. I could look into trying to adapt it, but probably better that I prioritize the pin file import instead?

> Also, while I'm here... is there a way to completely turn off all of PeptideShaker's filters & validation features? I would love to be able to use it as just a GUI/PSM visualizer that blindly trusts what is in the mzIdentML file - I understand if this doesn't align with the goals of the project though

No, I'm afraid this is not currently supported. It has been talked about, but we concluded that it would require too many changes to the underlying code to be worth the effort. At least with the current limited resources.

@lazear
Author

lazear commented Oct 30, 2022

Technically XML is supposed to be whitespace agnostic (except where it isn't), and I would assume that mzIdentML files follow that (given that the PSI Validator accepts unformatted mzid files). I can't imagine too many people prefer to read mzIdentML files over tsv/csv/etc!
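To make the "except where it isn't" part concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. `Peptide` and `PeptideSequence` are mzIdentML element names, but the sequence value is made up for illustration. Whitespace between elements only shows up as text around the child elements, while whitespace inside character data is part of the value.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring(
    "<Peptide><PeptideSequence>PEPTIDE</PeptideSequence></Peptide>"
)
pretty = ET.fromstring(
    "<Peptide>\n  <PeptideSequence>PEPTIDE</PeptideSequence>\n</Peptide>"
)
padded = ET.fromstring(
    "<Peptide><PeptideSequence> PEPTIDE </PeptideSequence></Peptide>"
)

# Indentation between elements appears only as text before/after the child.
# It does not change the element tree or the child's content.
print(repr(compact.text), repr(pretty.text))
print(compact[0].text == pretty[0].text)

# Whitespace inside character data is significant: the value itself differs.
print(repr(padded[0].text))
```

A reader that works on elements and attributes can ignore the first kind of whitespace entirely; only the second kind changes what the file says.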

Obviously not a pressing issue for me, but figured I would document this in the case of future bugs.

> but probably better that I prioritize the pin file import instead

Absolutely - I should be ready very soon.

@hbarsnes
Member

> Technically XML is supposed to be whitespace agnostic

"Should" is the keyword there. ;) But yes, this is clearly something that ought to be fixed in our homemade mzid parser. The reason for making our own parser was that the available ones, at least at the time, were all too slow and used too much memory. Our parser only reads through the file once, extracts only the stuff we need, and ignores everything else. I will try to get the time to look into improving it later.
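As a rough illustration of the single-pass, extract-only-what-you-need approach in a layout-independent form, here is a sketch built on Python's standard `xml.etree.ElementTree.iterparse`. It is not the compomics reader; `local_name`, `iter_modifications`, and the sample document (including its sequence) are hypothetical and exist only for this example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so the sketch works with or without one."""
    return tag.rsplit("}", 1)[-1]


def iter_modifications(source):
    """Yield (location, mass delta, accession) per Modification in one pass."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "Modification":
            continue  # ignore everything that is not needed
        accession = None
        for child in elem:
            if local_name(child.tag) == "cvParam":
                accession = child.get("accession")
                break
        yield (
            int(elem.get("location")),
            float(elem.get("monoisotopicMassDelta")),
            accession,
        )
        elem.clear()  # drop this element's children and attributes after use


# Unformatted sample input; the sequence is made up for illustration.
DOC = (
    '<Peptide id="p1"><PeptideSequence>MKLV</PeptideSequence>'
    '<Modification monoisotopicMassDelta="15.9949" location="2">'
    '<cvParam cvRef="PSI-MS" accession="MS:1001460" name="unknown modification" />'
    '</Modification></Peptide>'
)

mods = list(iter_modifications(io.BytesIO(DOC.encode("utf-8"))))
print(mods)
```

Because the loop reacts to element end events rather than to lines, the same code handles indented and unformatted files. The sketch clears only `Modification` elements; a reader tuned for memory would also need to release the other elements it has finished with, taking care not to clear a child before its parent has been processed.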
