Add write,MSnExp and write,OnDiskMSnExp #253

jorainer · 2017-09-18T03:58:45Z

Add write,MSnExp and write,OnDiskMSnExp to allow saving these objects to mzML
or mzXML file(s) (issue Implement a writeMSData,OnDiskMSnExp method #250).
Add related documentation and unit tests.

Naming of the methods is open for discussion; I've called them write because in mzR there is already a writeMSData function. Reasoning for calling the function just write: we do have MS data stored in the object, calling write on them should be enough for the user to know what happens - writeMSData basically means *write the MS data in the provided object that contains MS data, write: *write the provided object that contains MS data.

Note also that the unit tests fail without the latest fix (version 2.11.10) in mzR.

- Add write,MSnExp and write,OnDiskMSnExp to allow saving these objects to mzML or mzXML file(s) (issue #250). - Add related documentation and unit tests.

sgibb

Great new feature for MSnbase 👍

I have just a few minor comments.

(BTW: Personally I don't like the snake_case of temporary variables and arguments in functions. In my vim-r-plugin installation _ is mapped to <- (of course I could change that) and the bioc style guide suggests camelCase anyway. @lgatto I am not sure how our current style guide is. (as we don't follow the bioc style completely, e.g. we use somefunc(a = 1, b = 2) instead of somefunc(a=1, b=2) bioc style guide)

sgibb · 2017-09-18T10:51:39Z

R/options.R

+    ## For some odd reasons we get also NULL back - parallel processing?
+    if (!length(fast_load))
+        fast_load <- FALSE
+    fast_load
 }


Wouldn't be isTRUE(MSnbaseOptions()$fastload) sufficient here?

I also thought so. But I got very strange errors that took me one day to hunt down: seems that sometimes MSnbaseOptions()$fastload is simply not set and NULL is returned. This was somehow related to parallel processing.

Re: camelCase, I know, I'm inconsistent here. I used to use always only camelCase, but now I got used to snake case because I find it reads better - I can change back if you don't like.

sgibb · 2017-09-18T10:58:12Z

R/writeMSData.R

+    x_split <- splitByFile(x, f = factor(fileNames(x)))
+    ## Using mapply below - in principle we could then even switch to
+    ## bpmapply to perform parallel saving of files.
+    res <- mapply(x_split, files, FUN = function(z, f, outformat,


For easier testing I would suggest to put this anonymous function in an extra internal function. .writeMSData would be just mapply(splitByFile(x, fileNames(x)), .writeSingleMsData()). IMHO testing import/export is already hard and testing it for multiple files at once is even harder. So I prefer to use an internal function instead of an anonymous here.

Makes sense. OK, I'll do.

sgibb · 2017-09-18T11:02:02Z

R/writeMSData.R

+        }
+        ## o Re-calculate stuff we don't have or which might have been
+        ##   changed:
+        new_vals <- do.call(rbind, lapply(pks, function(sp) {


You are creating exact these values in .spectrum_header already. How could they change from line 64 to line 71 without being manipulated by us? If only the fData values could have been modified this do.call and the following hdr$... should go into the if(is(x, "OnDiskMSnExp")) block above (line 61).

You see everything ;) For MSnExp that's correct, for OnDiskMSnExp not, because the fData for these objects is in fact the header at the time when the file is read. So it's better/saver to (re)calculate for OnDiskMSnExp.
Yes, I'll put that into the if(is.... 👍

sgibb · 2017-09-18T11:10:28Z

R/writeMSData.R

+    matches <- vapply(names(.PATTERN.TO.CV), FUN = function(z) {
+        length(grep(pattern = z, x = pattern, ignore.case = TRUE)) > 0
+    }, FUN.VALUE = logical(1))
+    if (any(matches))


Instead of "preallocating" res in the first line. We could simply use

if (any(matches)) .PATTERN.TO.CV[matches] else ifnotfound

sgibb · 2017-09-18T11:12:29Z

R/writeMSData.R

+    ## other way round. So we can provide a string describing the processing
+    ## step and return the best matching CV terms.
+    matches <- vapply(names(.PATTERN.TO.CV), FUN = function(z) {
+        length(grep(pattern = z, x = pattern, ignore.case = TRUE)) > 0


Isn't any(grepl(...)) easier to read?

lgatto · 2017-09-18T15:19:29Z

Regarding write vs. writeMSData: I surely like write, although the latter matches nicely readMSData. My only suggestion with write would be to ask on the bioc-devel list if they don't want to add the write generic to BiocGenerics.

lgatto · 2017-09-18T15:28:46Z

Some bugs/issues:

An apparent bug in Apparent problem with tilde exansion in write,OnDiskMSnExp #255 with tilde expansion
Data loss when writing and reading Data loss when writing MS data #256

- Ensure that precursor data is preserved in write,MSnExp and write,OnDiskMSnExp even if the precursor scan is not available (issue #256). - Add related unit tests.

sgibb · 2017-09-19T20:35:54Z

R/writeMSData.R

-                           BPPARAM = SerialParam())
-        if (is(x, "OnDiskMSnExp")) {
-            hdr <- fData(z)
+    res <- mapply(FUN = .writeSingleMSData, x_split, files,


The assignment to res is not needed here. Just mapply(...) }, so the return value of mapply will be the return value of .writeMSData.

I don't want to return anything - so I'll use invisible instead.

Ok, I thought .writeSingleMSData (or mzR::writeMSData) would return a value for success/error. But if not invisible is fine.

sgibb · 2017-09-19T21:08:19Z

R/writeMSData.R

+                cbind(peaksCount = nrow(sp),
+                      totIonCurrent = sum(sp[, 2], na.rm = TRUE),
+                      basePeakMZ = sp[base::which.max(sp[, 2]), 1][1],
+                      basePeakIntensity = max(sp[, 2], na.rm = TRUE))


Why you calculating the maximum twice? We could simply just call base::which.max once and get nearly 20% speed improvement (I guess this is a noncritical function but anyway):

library("msdata") library("MSnbase") library("rbenchmark") m <- readMSData(msdata::proteomics("Top", full.names=TRUE), mode="onDisk") pks <- spectrapply(m, function(sp)cbind(sp@mz, sp@intensity)) f1 <- function(pks) { do.call(rbind, lapply(pks, function(sp) { cbind(peaksCount = nrow(sp), totIonCurrent = sum(sp[, 2], na.rm = TRUE), basePeakMZ = sp[base::which.max(sp[, 2]), 1][1], basePeakIntensity = max(sp[, 2], na.rm = TRUE)) }))} f2 <- function(pks) { do.call(rbind, lapply(pks, function(sp) { i <- base::which.max(sp[,2])[1] cbind(peaksCount = nrow(sp), totIonCurrent = sum(sp[, 2], na.rm = TRUE), basePeakMZ = sp[i, 1], basePeakIntensity = sp[i, 2]) }))} benchmark(f1(pks), f2(pks), order="relative") # test replications elapsed relative user.self sys.self user.child sys.child # 2 f2(pks) 100 7.791 1.000 7.764 0.004 0 0 # 1 f1(pks) 100 9.447 1.213 9.360 0.008 0 0 all.equal(f1(pks), f2(pks)) # [1] TRUE

good point.

- Require ProtGenerics 1.9.1 (defining the writeMSData generic). - Require most recent mzR (>= 2.11.11). - Change write,MSnExp to writeMSData,MSnExp (issue #259). - Add argument merge = FALSE to writeMSData,MSnExp to enable future merging of MS data into a single file.

jorainer · 2017-09-22T08:37:30Z

For the time being I would skip the implementation of the merge = TRUE (i.e. merging all data into a single mzML file), since it looks to be a little more complicated and I don't have that much time presently. That OK for you @lgatto and @sgibb ?

Btw, since I changed write to writeMSData this will require my pull request in mzR (sneumann/mzR#133) being merged.

lgatto · 2017-09-22T22:18:16Z

mzR PR merged and pushed to bioc.
Ok, let's start with multiple file write support, but let's keep it high on the priority list.

I still need to do some more reviewing, but I haven't had the time so far.

- Fix writeMSData for CDF input files (issue #250): calculate lowMZ and highMZ prior to exporting the data (lowMZ is missing in CDF header). - Add unit tests to check correct export of CDF files. - Set default for argument copy = FALSE because copy = TRUE is not supported for input files other than mzML or mzXML.

sgibb · 2017-09-26T08:52:25Z

R/writeMSData.R

-                      basePeakIntensity = sp[max_pos, 2])
+                      basePeakIntensity = sp[max_pos, 2],
+                      lowMZ = min(sp[, 1]),
+                      highMZ = max(sp[, 1]))


Isn't lowMZ = sp[1, 1] and highMZ = sp[nrow(sp), 1]? Or can't we assume correct order at this stage? (I am not sure anymore but if I remember correctly you ensure ordering of mz values in the ctor for spectra.)

You are right - that simplifies it (assuming correct ordering of m/z, but if I'm not mistaken that is also checked in the validity method of Spectrum objects).

lgatto · 2017-10-16T19:45:13Z

A few more comments/suggestions/observations;

The test

library("MSnbase")
f <- msdata::proteomics(full.names = TRUE)[3]
## in memory, MS2 only
x0 <- readMSData(f)
writeMSData(x0, file = "write.mzML")
x1 <- readMSData("write.mzML")
## on disk, MS1 and MS2
x2 <- readMSData(f, mode = "onDisk")
writeMSData(x2, file = "write2.mzML")
x3 <- readMSData("write2.mzML", mode = "onDisk")

File exists

When writing to a file that exists, the error

> x0 <- readMSData(f)
> writeMSData(x0, file = "write.mzML")
Error in (function (msData, file, outformat, software_processing = NULL,  : 
  File write.mzML does already exist! Please use a different file name.

should happen immediately. With my test (see below), it takes 10 seconds. There should be something like

if (file.exists(file))
  stop("File ", file, " does already exist! Please use a different file name.")

right at the beginning of the writeMSData method.

File size

This is not an issue as such, and the difference is relatively small, but it would be useful to know where the difference in file size stems from, and be explicit in the man page in what isn't retained when reading and writing.

> file.info(f)[[1]]
[1] 314974121
> file.info("write2.mzML")[[1]]
[1] 437995651
> file.info(f)[[1]] / file.info("write2.mzML")[[1]]
[1] 0.7191261

Differences

> all.equal(x2, x3)
Note: method with signature ‘OnDiskMSnExp#MSnExp’ chosen for function ‘all.equal’,
 target signature ‘OnDiskMSnExp#OnDiskMSnExp’.
 "MSnExp#OnDiskMSnExp" would also be valid
[1] "Spectra at position 1 are different."
[2] "Spectra at position 1 are different."
> all.equal(x2[[1]], x3[[1]])
[1] "Attributes: < Component “tic”: Mean relative difference: 2.343212 >"
> tic(x2[[1]])
[1] 9187505
> tic(x3[[1]])
[1] 30715775

For the in-memory MSnExp objects that contain MS2 data only:

> options(max.print = 10)
> all.equal(x0, x1)
 [1] "Attributes: < Component “assayData”: Component “F1.S0001”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
 [2] "Attributes: < Component “assayData”: Component “F1.S0001”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9955947 > >"
 [3] "Attributes: < Component “assayData”: Component “F1.S0002”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
 [4] "Attributes: < Component “assayData”: Component “F1.S0002”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9912281 > >"
 [5] "Attributes: < Component “assayData”: Component “F1.S0003”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
 [6] "Attributes: < Component “assayData”: Component “F1.S0003”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9868996 > >"
 [7] "Attributes: < Component “assayData”: Component “F1.S0004”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
 [8] "Attributes: < Component “assayData”: Component “F1.S0004”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9826087 > >"
 [9] "Attributes: < Component “assayData”: Component “F1.S0005”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"      
[10] "Attributes: < Component “assayData”: Component “F1.S0005”: Attributes: < Component “scanIndex”: Mean relative difference: 0.978355 > >" 
 [ reached getOption("max.print") -- omitted 12206 entries ]

and

> head(precScanNum(x0))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
     226      226      226      226      226      226 
> head(precScanNum(x1))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
       0        0        0        0        0        0 
> head(scanIndex(x0))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
     227      228      229      230      231      232 
> head(scanIndex(x1))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006 
       1        2        3        4        5        6

I believe the scanIndex difference relates to the discussion we had in #257: it doesn't start with 1 in x0 because only MS2 spectra are loaded, and when writing on disc, the index always starts at 1. But I think the precusor scan numbers should ideally be preserved.

@lgatto

- Address first comment from @lgatto related to the error for existing file.

- Add more information about what is saved/copied from/to mzML files.

jorainer · 2017-10-17T06:43:57Z

Regarding file size: one reason for the difference in file size is that the original file f is a gzipped mzML file, while the exported mzML file is uncompressed mzML with only the binary data zlib compressed.

Apart from that, what is not copied/saved is all chromatogram information in the original file. In addition there will be differences in the processing information (i.e. depending on copy run-specific metadata is copied over from the original file, such as instrument info, processing steps etc).
I added some more information to the details section of the writeMSData,MSnExp method.

jorainer · 2017-10-17T09:08:34Z

Regarding the differences: (as also stated in the note section of writeMSData,MSnExp) total ion current, peak count, base peak m/z and base peak intensity are calculated from the actual spectrum data prior to saving them. Thus you see the differences there in the TIC. Seems the original file contained the ones from the machine.

Regarding the precScanNum: that's a tricky situation. I think keeping the precursorScanNum is problematic, because it might lead to linking to a wrong spectrum. For the example above it would link to the spectrum number 226, which in the new mzML file is however a MS2 spectrum. That's why I decided to set the precursorScanNum to 0. Note that precursorMz, precursorIntensity and precursorCharge are retained. Also, proteowizard throws an error if the precursor with the specified ID (scanNum) is not available in the MSData.

I did update the writeMSData,MSnExp help page and added a sentence explaining this in the note section.

lgatto

All points were discussed in PR comments. Excellent contribution, thanks Jo!

lgatto · 2017-10-17T09:33:06Z

@jotsetung - have you added a mention of writeMSData in the demo and io vignettes? I'll push to Bioc as soon as there's a short mention, otherwise users might never learn about this exciting feature.

jorainer · 2017-10-17T10:13:53Z

No I didn't, but will.

jorainer · 2017-10-17T10:50:12Z

I added a small section to both vignettes and pushed them to the writeMSData branch (didn't want to push directly to master). If you're happy with it you can merge it into master.

lgatto · 2017-10-17T11:14:26Z

If you are familiar with tikz, you could also update MSnbase-io-out.tex to add an arrow from MSnExp to raw data. If now, I'll do it tonight.

jorainer · 2017-10-17T11:21:06Z

I'd prefer if you could do that - I'm currently busy fixing other stuff too...

lgatto · 2017-10-17T11:50:05Z

On Bioc now

Add write,MSnExp and write,OnDiskMSnExp

a5e3351

- Add write,MSnExp and write,OnDiskMSnExp to allow saving these objects to mzML or mzXML file(s) (issue #250). - Add related documentation and unit tests.

jorainer requested review from lgatto and sgibb September 18, 2017 03:59

sgibb reviewed Sep 18, 2017

View reviewed changes

jorainer added 3 commits September 18, 2017 19:22

Address @sgibb's comments.

f2c745e

Expand file path for write,MSnExp (issue #255)

5163c3d

Preserve precursor data in export (issue #256)

3f2db2b

- Ensure that precursor data is preserved in write,MSnExp and write,OnDiskMSnExp even if the precursor scan is not available (issue #256). - Add related unit tests.

lgatto mentioned this pull request Sep 19, 2017

Data loss when writing MS data #256

Closed

sgibb reviewed Sep 19, 2017

View reviewed changes

jorainer added 2 commits September 20, 2017 12:01

Address next round of comments from @sgibb

800eea4

jorainer added 2 commits September 26, 2017 06:25

Merge branch 'master' into writeMSData

d771f58

sgibb reviewed Sep 26, 2017

View reviewed changes

jorainer added 3 commits September 26, 2017 12:07

Address comment from @sgibb

d228a76

Merge branch 'master' into writeMSData

7214c12

Update NEWS and NEWS.md

12c056b

jorainer mentioned this pull request Oct 12, 2017

Write support in Bioconductor 3.6 release #267

Closed

fix missing quote in smooth man page

68be4f6

jorainer added 2 commits October 17, 2017 08:32

Check existing file at beginning of .writeMSData

367bd03

- Address first comment from @lgatto related to the error for existing file.

Add more information to details section of writeMSData,MSnExp

434ac16

- Add more information about what is saved/copied from/to mzML files.

Add note explaining precursorScanNum being set to 0

c043544

lgatto approved these changes Oct 17, 2017

View reviewed changes

lgatto merged commit 690c1af into master Oct 17, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add write,MSnExp and write,OnDiskMSnExp #253

Add write,MSnExp and write,OnDiskMSnExp #253

jorainer commented Sep 18, 2017

sgibb left a comment •

edited

Loading

sgibb Sep 18, 2017

jorainer Sep 18, 2017

sgibb Sep 18, 2017

jorainer Sep 18, 2017

sgibb Sep 18, 2017

jorainer Sep 18, 2017

sgibb Sep 18, 2017

jorainer Sep 18, 2017

sgibb Sep 18, 2017

jorainer Sep 18, 2017

lgatto commented Sep 18, 2017

lgatto commented Sep 18, 2017 •

edited

Loading

sgibb Sep 19, 2017

jorainer Sep 20, 2017

sgibb Sep 20, 2017

sgibb Sep 19, 2017

jorainer Sep 20, 2017

jorainer commented Sep 22, 2017

lgatto commented Sep 22, 2017

sgibb Sep 26, 2017

jorainer Sep 26, 2017

lgatto commented Oct 16, 2017

jorainer commented Oct 17, 2017

jorainer commented Oct 17, 2017

lgatto left a comment

lgatto commented Oct 17, 2017 •

edited

Loading

jorainer commented Oct 17, 2017

jorainer commented Oct 17, 2017

lgatto commented Oct 17, 2017

jorainer commented Oct 17, 2017

lgatto commented Oct 17, 2017

Add write,MSnExp and write,OnDiskMSnExp #253

Add write,MSnExp and write,OnDiskMSnExp #253

Conversation

jorainer commented Sep 18, 2017

sgibb left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lgatto commented Sep 18, 2017

lgatto commented Sep 18, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorainer commented Sep 22, 2017

lgatto commented Sep 22, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lgatto commented Oct 16, 2017

The test

File exists

File size

Differences

jorainer commented Oct 17, 2017

jorainer commented Oct 17, 2017

lgatto left a comment

Choose a reason for hiding this comment

lgatto commented Oct 17, 2017 • edited Loading

jorainer commented Oct 17, 2017

jorainer commented Oct 17, 2017

lgatto commented Oct 17, 2017

jorainer commented Oct 17, 2017

lgatto commented Oct 17, 2017

sgibb left a comment •

edited

Loading

lgatto commented Sep 18, 2017 •

edited

Loading

lgatto commented Oct 17, 2017 •

edited

Loading