Writing to multiple mz[X]ML files #257

Closed
lgatto opened this issue Sep 18, 2017 · 6 comments

Comments

@lgatto
Owner

lgatto commented Sep 18, 2017

Looking into .writeMSData, I see the following

    if (length(files) != length(fileNames(x)))
        stop("length of 'files' has to match the number of samples")

which, as far as I understand, expects as many files to be written as were initially read in to create the MSnExp (or OnDiskMSnExp).

I am wondering whether it would be possible and/or preferable to write an MSnExp to a single file, even if it originally stems from multiple files.

@jotsetung @sgibb - what do you think?

@sgibb
Collaborator

sgibb commented Sep 18, 2017

A single file would be more useful (and/or allowing the user to split the files by a specific value, e.g. precursor mz, original filename or whatever).
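
As a rough illustration of the "split by a specific value" idea, here is a minimal base-R sketch. It does not use MSnbase; the `fd` table, its column names and the `split_by` helper are hypothetical stand-ins for per-spectrum metadata, and each resulting group would correspond to one output file.

```r
# Hypothetical per-spectrum metadata; column names are illustrative assumptions.
fd <- data.frame(
    spectrum    = sprintf("S%02d", 1:6),
    fromFile    = c(1L, 1L, 2L, 2L, 3L, 3L),
    precursorMz = c(400.2, 512.7, 400.2, 633.1, 512.7, 633.1),
    stringsAsFactors = FALSE
)

# Group spectrum indices by an arbitrary per-spectrum column.
split_by <- function(fd, by) split(seq_len(nrow(fd)), fd[[by]])

by_file <- split_by(fd, "fromFile")     # one group per original file
by_mz   <- split_by(fd, "precursorMz")  # one group per precursor m/z
```

Each element of `by_file` or `by_mz` holds the row indices of the spectra that would go into the same output file.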

@jorainer
Collaborator

Actually, I would prefer multiple files:

  • Performance of OnDiskMSnExp would suffer from a single huge mzML file. As it is now, parallel processing is performed per file.
  • The mzML can become very large (imagine an experiment with several hundred samples, which is quite common for LCMS metabolomics).
  • I like to have an experiment modular, e.g. being able to exclude single files from an experiment if e.g. quality was not OK. That's why I like the one-sample-one-file approach.
  • RNA-seq and microarray experiments are also always based on one-sample-one (or multiple) files, but never multiple samples within a file.

Finally - I have not yet found an example where data from multiple samples is saved into one mzML file using proteowizard (documentation is not very helpful there).

@lgatto
Owner Author

lgatto commented Sep 19, 2017

Ok with your points, but I think the user should be able to choose. Here's one example where I think one file would be useful.

I have a TMT experiment with 10 files, where, at the end, I only need MS2 spectra. I do all my quantitation, identification, and end up with an MSnSet containing 1e5 MS2 spectra that are reliably quantified and identified. That is the data that will be used downstream for all the analysis and interpretation. I think it would be really handy to store these raw MS2 spectra into a single file. That would make it easy to look at the raw data for any biological discovery down the line.

So my suggestion would be not to enforce saving to n files when n files were used as input to create the MSnExp/OnDiskMSnExp object. Maybe it could be either a single file, in which case everything is dumped into one file, or n files.

Question: if the one file is not possible, how would you deal with the following

    rw <- readMSData(c("file1", "file2", ..., "file10"))
    rwtest <- rw[c(23, 110, 234)]
    writeMSData(rwtest, ...)

@jorainer
Collaborator

Yes, an option to allow the user to choose whether to save the data into one or multiple files is OK for me (suggestion: argument merge = FALSE?). I would however make it an either-or: saving either to as many output files as there were input files, or to a single output file regardless of the number of input files.
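
A minimal sketch of how such an either-or check could look, assuming a `merge` argument as suggested above. `check_output_files` is a hypothetical helper written for illustration, not an existing MSnbase function; `n_input` stands in for `length(fileNames(x))`.

```r
# Hypothetical helper, not part of MSnbase: validate 'files' against the
# number of input files under an either-or 'merge' switch.
check_output_files <- function(files, n_input, merge = FALSE) {
    # One output file when merging, otherwise one per input file.
    expected <- if (merge) 1L else n_input
    if (length(files) != expected) {
        reason <- if (merge) "'merge = TRUE'" else "the number of samples"
        stop("length of 'files' has to be ", expected, " (", reason, ")")
    }
    invisible(files)
}

# Ten input files written to a single output file.
check_output_files("all.mzML", n_input = 10, merge = TRUE)
```

With `merge = FALSE` the same call would stop with an error, because ten output file names would be expected.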

Re your question:
As of now, writeMSData would save the OnDiskMSnExp or MSnExp data as it is, i.e. if the spectra selected by rw[c(23, 110, 234)] are from, say, file1 and file2, it would save that data to two files. The number of files to which the data is written depends on length(fileNames(rwtest)).
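
To make the dependence on the originating files concrete, here is a toy base-R sketch. The `from_file` vector is a made-up stand-in for the per-spectrum file index (10 files with 50 spectra each is an assumption, not something stated in this thread).

```r
# Toy stand-in for the per-spectrum file index of an experiment read from
# 10 files with 50 spectra each (values are made up for illustration).
from_file <- rep(1:10, each = 50)

# Spectra 23, 110 and 234 come from different files in this toy layout.
idx <- c(23, 110, 234)
files_in_subset <- unique(from_file[idx])

# Number of output files a one-file-per-sample writer would need.
length(files_in_subset)
```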

I have to check then how to save multiple samples into the same mzML file, i.e. how to add multiple samples and runs into a proteowizard::msdata.

@jorainer
Collaborator

jorainer commented Oct 6, 2017

After looking at mzR and proteowizard's MSData class I think saving multiple files into a single one might be complicated. It seems to me that each MSData can have only a single Run (a single, consecutive and coherent set of scans on an instrument); there is only a single Run defined. The clean solution would have been to add each mzML as its own Run into the final mzML file. Apart from the fact that this does not seem to be possible, it would also require major changes in mzR itself, since the concept of a Run, or even of having potentially multiple samples in one file, is not covered at all.
Note: proteowizard has a MSDataMerger class that might possibly do the expected merge, but there is not much info on it. I'm also afraid that we would not be able to read such a merged file with mzR(?).

Now, a possible solution would be to merge all spectra from the different files into one list of spectra and save this to a single file. This could possibly be done by:

  • set the fromFile for all spectra to 1.
  • re-set the acquisitionNum/spectrumIdx to ensure they are unique and consecutive.
  • optionally re-order the spectra by retention time.
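
The steps listed above can be sketched in base R on a plain data.frame. This is only an illustration under stated assumptions: `fd` and `merge_to_single_file` are hypothetical, the column names merely mirror the fields mentioned in the list, and the re-ordering is done first so that the new numbering is consecutive in the final order.

```r
# Hypothetical per-spectrum metadata from two files (made-up values).
fd <- data.frame(
    fromFile       = c(1L, 1L, 2L, 2L),
    acquisitionNum = c(1L, 2L, 1L, 2L),
    spectrumIdx    = c(1L, 2L, 1L, 2L),
    retentionTime  = c(10.5, 20.1, 9.8, 15.3)
)

merge_to_single_file <- function(fd) {
    # Re-order spectra by retention time.
    fd <- fd[order(fd$retentionTime), , drop = FALSE]
    # Pretend that all spectra come from a single file.
    fd$fromFile <- 1L
    # Re-number so that identifiers are unique and consecutive.
    fd$acquisitionNum <- seq_len(nrow(fd))
    fd$spectrumIdx <- seq_len(nrow(fd))
    rownames(fd) <- NULL
    fd
}

merged <- merge_to_single_file(fd)
```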

The problem is that any reference to spectrum IDs, acquisitionNum and files in the original multi-file MSnExp/OnDiskMSnExp would then differ in the new single-file MSnExp/OnDiskMSnExp. I don't know whether that would then be a problem for the MSnSet.

Also, we would have to be careful with InstrumentInfo and general run info in the new file, since it will look like all of the data was generated in a single run.

@lgatto, would this fit your expectations? Would be nice to define a concrete use case based on a set of files on which to implement it - for me it's hard to guess what is required and what not since I'll most likely never need this option (yet; never say no :) ).

@lgatto
Owner Author

lgatto commented Oct 6, 2017

I think I could live with re-writing fromFile and acquisitionNum/spectrumIdx (and re-ordering) - these wouldn't make much sense in this new file any more anyway.

Regarding instrumentInfo, one could assume that all files were acquired on the same setup (even same instrument), at least that would be the default case, which would be straightforward to merge. If that's not the case, then again, that concept doesn’t make sense any more, unless we can write an instrumentInfoList in the mzML.

I can see many cases where writing to a single file is confusing. Somehow, it is more of a convenience thing than anything else. At the end of the day, I need to provide all raw files and a script as a reproducible pipeline, and writing intermediate mzML file/files is just one part of the processing.

Also, in terms of MSnbase, I could imagine a pre-processing-before-writing-to-a-single-file function that overwrites some data as described above and stores the old values, including the original files, in the feature data slot. This would be lost upon writing, of course, but it would possibly make things more transparent. All these side effects will have to be documented, of course.
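
A minimal sketch of what such a bookkeeping step might look like, using a plain data.frame in place of the feature data. `keep_originals` and the `originalFile`/`originalAcquisitionNum` columns are hypothetical names chosen for illustration, not existing MSnbase API.

```r
# Hypothetical pre-processing step: overwrite file and acquisition indices,
# but keep the old values as extra per-spectrum columns.
keep_originals <- function(fd, file_names) {
    fd$originalFile <- file_names[fd$fromFile]      # remember source file
    fd$originalAcquisitionNum <- fd$acquisitionNum  # remember old numbering
    fd$fromFile <- 1L
    fd$acquisitionNum <- seq_len(nrow(fd))
    fd
}

fd <- data.frame(fromFile = c(1L, 2L, 2L), acquisitionNum = c(7L, 3L, 9L))
out <- keep_originals(fd, c("file1.mzML", "file2.mzML"))
```

The extra columns only live in the in-memory object; as noted above, they would not survive writing to mzML.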

3 participants