Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add write,MSnExp and write,OnDiskMSnExp #253

Merged
merged 15 commits into from
Oct 17, 2017
Merged

Add write,MSnExp and write,OnDiskMSnExp #253

merged 15 commits into from
Oct 17, 2017

Conversation

jorainer
Copy link
Collaborator

Naming of the methods is open for discussion; I've called them write because in mzR there is already a writeMSData function. Reasoning for calling the function just write: we do have MS data stored in the object, calling write on them should be enough for the user to know what happens - writeMSData basically means *write the MS data in the provided object that contains MS data, write: *write the provided object that contains MS data.

Note also that the unit tests fail without the latest fix (version 2.11.10) in mzR.

- Add write,MSnExp and write,OnDiskMSnExp to allow saving these objects to mzML
  or mzXML file(s) (issue #250).
- Add related documentation and unit tests.
Copy link
Collaborator

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great new feature for MSnbase 👍

I have just a few minor comments.

(BTW: Personally I don't like the snake_case of temporary variables and arguments in functions. In my vim-r-plugin installation _ is mapped to <- (of course I could change that) and the bioc style guide suggests camelCase anyway. @lgatto I am not sure how our current style guide is. (as we don't follow the bioc style completely, e.g. we use somefunc(a = 1, b = 2) instead of somefunc(a=1, b=2) bioc style guide)

## For some odd reasons we get also NULL back - parallel processing?
if (!length(fast_load))
fast_load <- FALSE
fast_load
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be isTRUE(MSnbaseOptions()$fastload) sufficient here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also thought so. But I got very strange errors that took me one day to hunt down: seems that sometimes MSnbaseOptions()$fastload is simply not set and NULL is returned. This was somehow related to parallel processing.

Re: camelCase, I know, I'm inconsistent here. I used to use always only camelCase, but now I got used to snake case because I find it reads better - I can change back if you don't like.

R/writeMSData.R Outdated
x_split <- splitByFile(x, f = factor(fileNames(x)))
## Using mapply below - in principle we could then even switch to
## bpmapply to perform parallel saving of files.
res <- mapply(x_split, files, FUN = function(z, f, outformat,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For easier testing I would suggest to put this anonymous function in an extra internal function. .writeMSData would be just mapply(splitByFile(x, fileNames(x)), .writeSingleMsData()). IMHO testing import/export is already hard and testing it for multiple files at once is even harder. So I prefer to use an internal function instead of an anonymous here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. OK, I'll do.

R/writeMSData.R Outdated
}
## o Re-calculate stuff we don't have or which might have been
## changed:
new_vals <- do.call(rbind, lapply(pks, function(sp) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are creating exact these values in .spectrum_header already. How could they change from line 64 to line 71 without being manipulated by us? If only the fData values could have been modified this do.call and the following hdr$... should go into the if(is(x, "OnDiskMSnExp")) block above (line 61).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You see everything ;) For MSnExp that's correct, for OnDiskMSnExp not, because the fData for these objects is in fact the header at the time when the file is read. So it's better/saver to (re)calculate for OnDiskMSnExp.
Yes, I'll put that into the if(is.... 👍

R/writeMSData.R Outdated
matches <- vapply(names(.PATTERN.TO.CV), FUN = function(z) {
length(grep(pattern = z, x = pattern, ignore.case = TRUE)) > 0
}, FUN.VALUE = logical(1))
if (any(matches))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "preallocating" res in the first line. We could simply use

if (any(matches))
    .PATTERN.TO.CV[matches]
else
   ifnotfound

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do.

R/writeMSData.R Outdated
## other way round. So we can provide a string describing the processing
## step and return the best matching CV terms.
matches <- vapply(names(.PATTERN.TO.CV), FUN = function(z) {
length(grep(pattern = z, x = pattern, ignore.case = TRUE)) > 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't any(grepl(...)) easier to read?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@lgatto
Copy link
Owner

lgatto commented Sep 18, 2017

Regarding write vs. writeMSData: I surely like write, although the latter matches nicely readMSData. My only suggestion with write would be to ask on the bioc-devel list if they don't want to add the write generic to BiocGenerics.

@lgatto
Copy link
Owner

lgatto commented Sep 18, 2017

Some bugs/issues:

- Ensure that precursor data is preserved in write,MSnExp and write,OnDiskMSnExp
  even if the precursor scan is not available (issue #256).
- Add related unit tests.
R/writeMSData.R Outdated
BPPARAM = SerialParam())
if (is(x, "OnDiskMSnExp")) {
hdr <- fData(z)
res <- mapply(FUN = .writeSingleMSData, x_split, files,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assignment to res is not needed here. Just mapply(...) }, so the return value of mapply will be the return value of .writeMSData.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to return anything - so I'll use invisible instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I thought .writeSingleMSData (or mzR::writeMSData) would return a value for success/error. But if not invisible is fine.

R/writeMSData.R Outdated
cbind(peaksCount = nrow(sp),
totIonCurrent = sum(sp[, 2], na.rm = TRUE),
basePeakMZ = sp[base::which.max(sp[, 2]), 1][1],
basePeakIntensity = max(sp[, 2], na.rm = TRUE))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you calculating the maximum twice? We could simply just call base::which.max once and get nearly 20% speed improvement (I guess this is a noncritical function but anyway):

library("msdata")
library("MSnbase")
library("rbenchmark")

m <- readMSData(msdata::proteomics("Top", full.names=TRUE), mode="onDisk")

pks <- spectrapply(m, function(sp)cbind(sp@mz, sp@intensity))

f1 <- function(pks) {
    do.call(rbind, lapply(pks, function(sp) {
        cbind(peaksCount = nrow(sp),
              totIonCurrent = sum(sp[, 2], na.rm = TRUE),
              basePeakMZ = sp[base::which.max(sp[, 2]), 1][1],
              basePeakIntensity = max(sp[, 2], na.rm = TRUE))
}))}

f2 <- function(pks) {
    do.call(rbind, lapply(pks, function(sp) {
        i <- base::which.max(sp[,2])[1]
        cbind(peaksCount = nrow(sp),
              totIonCurrent = sum(sp[, 2], na.rm = TRUE),
              basePeakMZ = sp[i, 1],
              basePeakIntensity = sp[i, 2])
}))}

benchmark(f1(pks), f2(pks), order="relative")
#      test replications elapsed relative user.self sys.self user.child sys.child
# 2 f2(pks)          100   7.791    1.000     7.764    0.004          0         0
# 1 f1(pks)          100   9.447    1.213     9.360    0.008          0         0

all.equal(f1(pks), f2(pks))
# [1] TRUE

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.

- Require ProtGenerics 1.9.1 (defining the writeMSData generic).
- Require most recent mzR (>= 2.11.11).
- Change write,MSnExp to writeMSData,MSnExp (issue #259).
- Add argument merge = FALSE to writeMSData,MSnExp to enable future merging of
  MS data into a single file.
@jorainer
Copy link
Collaborator Author

For the time being I would skip the implementation of the merge = TRUE (i.e. merging all data into a single mzML file), since it looks to be a little more complicated and I don't have that much time presently. That OK for you @lgatto and @sgibb ?

Btw, since I changed write to writeMSData this will require my pull request in mzR (sneumann/mzR#133) being merged.

@lgatto
Copy link
Owner

lgatto commented Sep 22, 2017

  • mzR PR merged and pushed to bioc.
  • Ok, let's start with multiple file write support, but let's keep it high on the priority list.

I still need to do some more reviewing, but I haven't had the time so far.

- Fix writeMSData for CDF input files (issue #250): calculate lowMZ and highMZ
  prior to exporting the data (lowMZ is missing in CDF header).
- Add unit tests to check correct export of CDF files.
- Set default for argument copy = FALSE because copy = TRUE is not supported for
  input files other than mzML or mzXML.
R/writeMSData.R Outdated
basePeakIntensity = sp[max_pos, 2])
basePeakIntensity = sp[max_pos, 2],
lowMZ = min(sp[, 1]),
highMZ = max(sp[, 1]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't lowMZ = sp[1, 1] and highMZ = sp[nrow(sp), 1]? Or can't we assume correct order at this stage? (I am not sure anymore but if I remember correctly you ensure ordering of mz values in the ctor for spectra.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right - that simplifies it (assuming correct ordering of m/z, but if I'm not mistaken that is also checked in the validity method of Spectrum objects).

@lgatto
Copy link
Owner

lgatto commented Oct 16, 2017

A few more comments/suggestions/observations;

The test

library("MSnbase")
f <- msdata::proteomics(full.names = TRUE)[3]
## in memory, MS2 only
x0 <- readMSData(f)
writeMSData(x0, file = "write.mzML")
x1 <- readMSData("write.mzML")
## on disk, MS1 and MS2
x2 <- readMSData(f, mode = "onDisk")
writeMSData(x2, file = "write2.mzML")
x3 <- readMSData("write2.mzML", mode = "onDisk")

File exists

When writing to a file that exists, the error

> x0 <- readMSData(f)
> writeMSData(x0, file = "write.mzML")
Error in (function (msData, file, outformat, software_processing = NULL,  : 
  File write.mzML does already exist! Please use a different file name.

should happen immediately. With my test (see below), it takes 10 seconds. There should be something like

if (file.exists(file))
  stop("File ", file, " does already exist! Please use a different file name.")

right at the beginning of the writeMSData method.

File size

This is not an issue as such, and the difference is relatively small, but it would be useful to know where the difference in file size stems from, and be explicit in the man page in what isn't retained when reading and writing.

> file.info(f)[[1]]
[1] 314974121
> file.info("write2.mzML")[[1]]
[1] 437995651
> file.info(f)[[1]] / file.info("write2.mzML")[[1]]
[1] 0.7191261

Differences

> all.equal(x2, x3)
Note: method with signatureOnDiskMSnExp#MSnExp’ chosen for function ‘all.equal’,
 target signatureOnDiskMSnExp#OnDiskMSnExp’.
 "MSnExp#OnDiskMSnExp" would also be valid
[1] "Spectra at position 1 are different."
[2] "Spectra at position 1 are different."
> all.equal(x2[[1]], x3[[1]])
[1] "Attributes: < Component “tic”: Mean relative difference: 2.343212 >"
> tic(x2[[1]])
[1] 9187505
> tic(x3[[1]])
[1] 30715775

For the in-memory MSnExp objects that contain MS2 data only:

> options(max.print = 10)
> all.equal(x0, x1)
 [1] "Attributes: < Component “assayData”: Component “F1.S0001”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
 [2] "Attributes: < Component “assayData”: Component “F1.S0001”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9955947 > >"
 [3] "Attributes: < Component “assayData”: Component “F1.S0002”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
 [4] "Attributes: < Component “assayData”: Component “F1.S0002”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9912281 > >"
 [5] "Attributes: < Component “assayData”: Component “F1.S0003”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
 [6] "Attributes: < Component “assayData”: Component “F1.S0003”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9868996 > >"
 [7] "Attributes: < Component “assayData”: Component “F1.S0004”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
 [8] "Attributes: < Component “assayData”: Component “F1.S0004”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9826087 > >"
 [9] "Attributes: < Component “assayData”: Component “F1.S0005”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
[10] "Attributes: < Component “assayData”: Component “F1.S0005”: Attributes: < Component “scanIndex”: Mean relative difference: 0.978355 > >" 
 [ reached getOption("max.print") -- omitted 12206 entries ]

and

> head(precScanNum(x0))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
     226      226      226      226      226      226 
> head(precScanNum(x1))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
       0        0        0        0        0        0 
> head(scanIndex(x0))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
     227      228      229      230      231      232 
> head(scanIndex(x1))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
       1        2        3        4        5        6 

I believe the scanIndex difference relates to the discussion we had in #257: it doesn't start with 1 in x0 because only MS2 spectra are loaded, and when writing on disc, the index always starts at 1. But I think the precusor scan numbers should ideally be preserved.

- Address first comment from @lgatto related to the error for existing file.
- Add more information about what is saved/copied from/to mzML files.
@jorainer
Copy link
Collaborator Author

Regarding file size: one reason for the difference in file size is that the original file f is a gzipped mzML file, while the exported mzML file is uncompressed mzML with only the binary data zlib compressed.

Apart from that, what is not copied/saved is all chromatogram information in the original file. In addition there will be differences in the processing information (i.e. depending on copy run-specific metadata is copied over from the original file, such as instrument info, processing steps etc).
I added some more information to the details section of the writeMSData,MSnExp method.

@jorainer
Copy link
Collaborator Author

Regarding the differences: (as also stated in the note section of writeMSData,MSnExp) total ion current, peak count, base peak m/z and base peak intensity are calculated from the actual spectrum data prior to saving them. Thus you see the differences there in the TIC. Seems the original file contained the ones from the machine.

Regarding the precScanNum: that's a tricky situation. I think keeping the precursorScanNum is problematic, because it might lead to linking to a wrong spectrum. For the example above it would link to the spectrum number 226, which in the new mzML file is however a MS2 spectrum. That's why I decided to set the precursorScanNum to 0. Note that precursorMz, precursorIntensity and precursorCharge are retained. Also, proteowizard throws an error if the precursor with the specified ID (scanNum) is not available in the MSData.

I did update the writeMSData,MSnExp help page and added a sentence explaining this in the note section.

Copy link
Owner

@lgatto lgatto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All points were discussed in PR comments. Excellent contribution, thanks Jo!

@lgatto lgatto merged commit 690c1af into master Oct 17, 2017
@lgatto
Copy link
Owner

lgatto commented Oct 17, 2017

@jotsetung - have you added a mention of writeMSData in the demo and io vignettes? I'll push to Bioc as soon as there's a short mention, otherwise users might never learn about this exciting feature.

@jorainer
Copy link
Collaborator Author

No I didn't, but will.

@jorainer
Copy link
Collaborator Author

I added a small section to both vignettes and pushed them to the writeMSData branch (didn't want to push directly to master). If you're happy with it you can merge it into master.

@lgatto
Copy link
Owner

lgatto commented Oct 17, 2017

If you are familiar with tikz, you could also update MSnbase-io-out.tex to add an arrow from MSnExp to raw data. If now, I'll do it tonight.

@jorainer
Copy link
Collaborator Author

I'd prefer if you could do that - I'm currently busy fixing other stuff too...

@lgatto
Copy link
Owner

lgatto commented Oct 17, 2017

On Bioc now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants