-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add write,MSnExp and write,OnDiskMSnExp #253
Conversation
- Add write,MSnExp and write,OnDiskMSnExp to allow saving these objects to mzML or mzXML file(s) (issue #250). - Add related documentation and unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great new feature for MSnbase
👍
I have just a few minor comments.
(BTW: Personally I don't like the snake_case
of temporary variables and arguments in functions. In my vim-r-plugin installation _
is mapped to <-
(of course I could change that) and the bioc style guide suggests camelCase
anyway. @lgatto I am not sure how our current style guide is. (as we don't follow the bioc style completely, e.g. we use somefunc(a = 1, b = 2)
instead of somefunc(a=1, b=2)
bioc style guide)
## For some odd reasons we get also NULL back - parallel processing? | ||
if (!length(fast_load)) | ||
fast_load <- FALSE | ||
fast_load | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't be isTRUE(MSnbaseOptions()$fastload)
sufficient here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also thought so. But I got very strange errors that took me one day to hunt down: seems that sometimes MSnbaseOptions()$fastload
is simply not set and NULL
is returned. This was somehow related to parallel processing.
Re: camelCase
, I know, I'm inconsistent here. I used to use always only camelCase, but now I got used to snake case because I find it reads better - I can change back if you don't like.
R/writeMSData.R
Outdated
x_split <- splitByFile(x, f = factor(fileNames(x))) | ||
## Using mapply below - in principle we could then even switch to | ||
## bpmapply to perform parallel saving of files. | ||
res <- mapply(x_split, files, FUN = function(z, f, outformat, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For easier testing I would suggest to put this anonymous function in an extra internal function. .writeMSData
would be just mapply(splitByFile(x, fileNames(x)), .writeSingleMsData())
. IMHO testing import/export is already hard and testing it for multiple files at once is even harder. So I prefer to use an internal function instead of an anonymous here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. OK, I'll do.
R/writeMSData.R
Outdated
} | ||
## o Re-calculate stuff we don't have or which might have been | ||
## changed: | ||
new_vals <- do.call(rbind, lapply(pks, function(sp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are creating exact these values in .spectrum_header
already. How could they change from line 64 to line 71 without being manipulated by us? If only the fData
values could have been modified this do.call
and the following hdr$...
should go into the if(is(x, "OnDiskMSnExp"))
block above (line 61).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You see everything ;) For MSnExp
that's correct, for OnDiskMSnExp
not, because the fData
for these objects is in fact the header
at the time when the file is read. So it's better/saver to (re)calculate for OnDiskMSnExp
.
Yes, I'll put that into the if(is...
. 👍
R/writeMSData.R
Outdated
matches <- vapply(names(.PATTERN.TO.CV), FUN = function(z) { | ||
length(grep(pattern = z, x = pattern, ignore.case = TRUE)) > 0 | ||
}, FUN.VALUE = logical(1)) | ||
if (any(matches)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of "preallocating" res
in the first line. We could simply use
if (any(matches))
.PATTERN.TO.CV[matches]
else
ifnotfound
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do.
R/writeMSData.R
Outdated
## other way round. So we can provide a string describing the processing | ||
## step and return the best matching CV terms. | ||
matches <- vapply(names(.PATTERN.TO.CV), FUN = function(z) { | ||
length(grep(pattern = z, x = pattern, ignore.case = TRUE)) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't any(grepl(...))
easier to read?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Regarding |
Some bugs/issues:
|
- Ensure that precursor data is preserved in write,MSnExp and write,OnDiskMSnExp even if the precursor scan is not available (issue #256). - Add related unit tests.
R/writeMSData.R
Outdated
BPPARAM = SerialParam()) | ||
if (is(x, "OnDiskMSnExp")) { | ||
hdr <- fData(z) | ||
res <- mapply(FUN = .writeSingleMSData, x_split, files, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assignment to res
is not needed here. Just mapply(...) }
, so the return value of mapply
will be the return value of .writeMSData
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to return anything - so I'll use invisible
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I thought .writeSingleMSData
(or mzR::writeMSData
) would return a value for success/error. But if not invisible
is fine.
R/writeMSData.R
Outdated
cbind(peaksCount = nrow(sp), | ||
totIonCurrent = sum(sp[, 2], na.rm = TRUE), | ||
basePeakMZ = sp[base::which.max(sp[, 2]), 1][1], | ||
basePeakIntensity = max(sp[, 2], na.rm = TRUE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you calculating the maximum twice? We could simply just call base::which.max
once and get nearly 20% speed improvement (I guess this is a noncritical function but anyway):
library("msdata")
library("MSnbase")
library("rbenchmark")
m <- readMSData(msdata::proteomics("Top", full.names=TRUE), mode="onDisk")
pks <- spectrapply(m, function(sp)cbind(sp@mz, sp@intensity))
f1 <- function(pks) {
do.call(rbind, lapply(pks, function(sp) {
cbind(peaksCount = nrow(sp),
totIonCurrent = sum(sp[, 2], na.rm = TRUE),
basePeakMZ = sp[base::which.max(sp[, 2]), 1][1],
basePeakIntensity = max(sp[, 2], na.rm = TRUE))
}))}
f2 <- function(pks) {
do.call(rbind, lapply(pks, function(sp) {
i <- base::which.max(sp[,2])[1]
cbind(peaksCount = nrow(sp),
totIonCurrent = sum(sp[, 2], na.rm = TRUE),
basePeakMZ = sp[i, 1],
basePeakIntensity = sp[i, 2])
}))}
benchmark(f1(pks), f2(pks), order="relative")
# test replications elapsed relative user.self sys.self user.child sys.child
# 2 f2(pks) 100 7.791 1.000 7.764 0.004 0 0
# 1 f1(pks) 100 9.447 1.213 9.360 0.008 0 0
all.equal(f1(pks), f2(pks))
# [1] TRUE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point.
- Require ProtGenerics 1.9.1 (defining the writeMSData generic). - Require most recent mzR (>= 2.11.11). - Change write,MSnExp to writeMSData,MSnExp (issue #259). - Add argument merge = FALSE to writeMSData,MSnExp to enable future merging of MS data into a single file.
For the time being I would skip the implementation of the Btw, since I changed |
I still need to do some more reviewing, but I haven't had the time so far. |
- Fix writeMSData for CDF input files (issue #250): calculate lowMZ and highMZ prior to exporting the data (lowMZ is missing in CDF header). - Add unit tests to check correct export of CDF files. - Set default for argument copy = FALSE because copy = TRUE is not supported for input files other than mzML or mzXML.
R/writeMSData.R
Outdated
basePeakIntensity = sp[max_pos, 2]) | ||
basePeakIntensity = sp[max_pos, 2], | ||
lowMZ = min(sp[, 1]), | ||
highMZ = max(sp[, 1])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't lowMZ = sp[1, 1]
and highMZ = sp[nrow(sp), 1]
? Or can't we assume correct order at this stage? (I am not sure anymore but if I remember correctly you ensure ordering of mz values in the ctor for spectra.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right - that simplifies it (assuming correct ordering of m/z, but if I'm not mistaken that is also checked in the validity method of Spectrum
objects).
A few more comments/suggestions/observations; The testlibrary("MSnbase")
f <- msdata::proteomics(full.names = TRUE)[3]
## in memory, MS2 only
x0 <- readMSData(f)
writeMSData(x0, file = "write.mzML")
x1 <- readMSData("write.mzML")
## on disk, MS1 and MS2
x2 <- readMSData(f, mode = "onDisk")
writeMSData(x2, file = "write2.mzML")
x3 <- readMSData("write2.mzML", mode = "onDisk") File existsWhen writing to a file that exists, the error > x0 <- readMSData(f)
> writeMSData(x0, file = "write.mzML")
Error in (function (msData, file, outformat, software_processing = NULL, :
File write.mzML does already exist! Please use a different file name. should happen immediately. With my test (see below), it takes 10 seconds. There should be something like if (file.exists(file))
stop("File ", file, " does already exist! Please use a different file name.") right at the beginning of the File sizeThis is not an issue as such, and the difference is relatively small, but it would be useful to know where the difference in file size stems from, and be explicit in the man page in what isn't retained when reading and writing. > file.info(f)[[1]]
[1] 314974121
> file.info("write2.mzML")[[1]]
[1] 437995651
> file.info(f)[[1]] / file.info("write2.mzML")[[1]]
[1] 0.7191261 Differences> all.equal(x2, x3)
Note: method with signature ‘OnDiskMSnExp#MSnExp’ chosen for function ‘all.equal’,
target signature ‘OnDiskMSnExp#OnDiskMSnExp’.
"MSnExp#OnDiskMSnExp" would also be valid
[1] "Spectra at position 1 are different."
[2] "Spectra at position 1 are different."
> all.equal(x2[[1]], x3[[1]])
[1] "Attributes: < Component “tic”: Mean relative difference: 2.343212 >"
> tic(x2[[1]])
[1] 9187505
> tic(x3[[1]])
[1] 30715775 For the in-memory > options(max.print = 10)
> all.equal(x0, x1)
[1] "Attributes: < Component “assayData”: Component “F1.S0001”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"
[2] "Attributes: < Component “assayData”: Component “F1.S0001”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9955947 > >"
[3] "Attributes: < Component “assayData”: Component “F1.S0002”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"
[4] "Attributes: < Component “assayData”: Component “F1.S0002”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9912281 > >"
[5] "Attributes: < Component “assayData”: Component “F1.S0003”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"
[6] "Attributes: < Component “assayData”: Component “F1.S0003”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9868996 > >"
[7] "Attributes: < Component “assayData”: Component “F1.S0004”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"
[8] "Attributes: < Component “assayData”: Component “F1.S0004”: Attributes: < Component “scanIndex”: Mean relative difference: 0.9826087 > >"
[9] "Attributes: < Component “assayData”: Component “F1.S0005”: Attributes: < Component “precScanNum”: Mean relative difference: 1 > >"
[10] "Attributes: < Component “assayData”: Component “F1.S0005”: Attributes: < Component “scanIndex”: Mean relative difference: 0.978355 > >"
[ reached getOption("max.print") -- omitted 12206 entries ] and > head(precScanNum(x0))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006
226 226 226 226 226 226
> head(precScanNum(x1))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006
0 0 0 0 0 0
> head(scanIndex(x0))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006
227 228 229 230 231 232
> head(scanIndex(x1))
F1.S0001 F1.S0002 F1.S0003 F1.S0004 F1.S0005 F1.S0006
1 2 3 4 5 6 I believe the |
- Address first comment from @lgatto related to the error for existing file.
- Add more information about what is saved/copied from/to mzML files.
Regarding file size: one reason for the difference in file size is that the original file Apart from that, what is not copied/saved is all chromatogram information in the original file. In addition there will be differences in the processing information (i.e. depending on |
Regarding the differences: (as also stated in the note section of Regarding the I did update the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All points were discussed in PR comments. Excellent contribution, thanks Jo!
@jotsetung - have you added a mention of |
No I didn't, but will. |
I added a small section to both vignettes and pushed them to the |
If you are familiar with |
I'd prefer if you could do that - I'm currently busy fixing other stuff too... |
On Bioc now |
write,MSnExp
andwrite,OnDiskMSnExp
to allow saving these objects to mzMLor mzXML file(s) (issue Implement a writeMSData,OnDiskMSnExp method #250).
Naming of the methods is open for discussion; I've called them
write
because inmzR
there is already awriteMSData
function. Reasoning for calling the function justwrite
: we do have MS data stored in the object, callingwrite
on them should be enough for the user to know what happens -writeMSData
basically means *write the MS data in the provided object that contains MS data,
write: *write the provided object that contains MS data
.Note also that the unit tests fail without the latest fix (version 2.11.10) in
mzR
.