-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CDF format with XCMS #332
Comments
thanks for the detailed @RomainDallet comparison. I will investigate later today. Some of the differences might be explained by the fact that the |
OK, for the differences between what library(MSnbase)
od <- readMSData(system.file("cdf/ko15.CDF", package = "msdata"), mode = "onDisk")
## Get the tic
tic_od <- tic(od)
## Calculate the TIC
tic_calc <- unlist(spectrapply(od, function(z) sum(z@intensity)))
## Finally extract a TIC
tic_chr <- chromatogram(od, aggregationFun = "sum")
## Comparing the values:
par(mfrow = c(1, 2))
plot(tic_od, tic_calc, pch = 16, col = "#00000080")
plot(intensity(tic_chr[1, 1]), tic_calc, pch = 16, col = "#00000080") As you see, the reported total ion current is completely different to the sum of intensities per spectrum. It might well be that this is specific for this file (no idea how this was generated or manipulated) - I haven't checked any other CDF files since I have none at hand. Does this explain the differences that you see too? Regarding the differences you see between CDF and mzXML, could it be explained by the same fact? I do not have data in both CDF and mzXML format to check, unfortunately. To test I exported the CDF above as mzXML file and compared the In general, I try to avoid the |
@jotsetung - isn't this something we should add to the documentation file with a link to this issue? |
Good point. I think there is already a note in the documentation - if not I will add it - and add a link to this issue. |
- Fix tic,OnDiskMSnExp with initial = FALSE to calculate the total ion count from the full data (related to issue #332).
In the documentation of |
OK, so the And between the mzXML and the CDF, is there still a difference? There should not be a difference for the calculated tic - if there is we have a bigger problem that is probably related to the different backends to read the data (proteowizard for mzXML and netcdf for CDF). |
OK, I think the problem is here. As you can see on the next plot representing the calculated tic, the shape is similar but the difference is in the scale that is different between the 2 plots. library(MSnbase)
mzxml <- readMSData("20180322_002.mzXML", mode = "onDisk")
cdf <- readMSData("20180322_002.cdf", mode = "onDisk")
## Calculate the TIC
tic_calc_mzxml <- unlist(spectrapply(mzxml, function(z) sum(z@intensity)))
tic_calc_cdf <- unlist(spectrapply(cdf, function(z) sum(z@intensity
par(mfrow = c(1, 2)))))
plot(tic_calc_mzxml, pch=16, col="#00000080")
plot(tic_calc_cdf, pch=16, col="#00000080") Edit : |
OK, so, if I understand it correctly, the difference is in the calculated TIC from both files. Could this be something related to the precision in which the intensity values are stored in the file? something like: unlist(intensity(data_cdf))[1:40]
unlist(intensity(data_mzxml))[1:40] I just checked the |
Sorry for the delay. So, as you said, intensities in CDF files are well stored as integer and in mzXML as double. unlist(intensity(mzxml))[1:40]
F1.S00011 F1.S00012 F1.S00013 F1.S00014 F1.S00015 F1.S00016
76913.039 26206.951 12683.207 15941.148 295482.719 10936.617
F1.S00017 F1.S00018 F1.S00019 F1.S000110 F1.S000111 F1.S000112
66302.383 137487.109 14040.188 10798.498 27801.715 9913.201
F1.S000113 F1.S000114 F1.S000115 F1.S000116 F1.S000117 F1.S000118
14710.414 520396.812 25490.676 106768.203 10073.900 53364.871
F1.S000119 F1.S000120 F1.S000121 F1.S000122 F1.S000123 F1.S000124
6840915.000 16774.482 54500.172 12954.951 26629.545 11592.397
F1.S000125 F1.S000126 F1.S000127 F1.S000128 F1.S000129 F1.S000130
10285.073 68888.305 116811.531 21737.303 150419.891 10987.931
F1.S000131 F1.S000132 F1.S000133 F1.S000134 F1.S000135 F1.S000136
15961.298 17425.691 22313.035 2653486.750 119605.781 14484.614
F1.S000137 F1.S000138 F1.S000139 F1.S000140
51200.082 116223.227 15091.529 262209.719
unlist(intensity(cdf))[1:40]
F1.S00011 F1.S00012 F1.S00013 F1.S00014 F1.S00015 F1.S00016 F1.S00017
0 0 0 0 0 0 0
F1.S00018 F1.S00019 F1.S000110 F1.S000111 F1.S000112 F1.S000113 F1.S000114
0 3732 17367 41130 67273 75002 51556
F1.S000115 F1.S000116 F1.S000117 F1.S000118 F1.S000119 F1.S000120 F1.S000121
26791 6916 0 0 0 0 0
F1.S000122 F1.S000123 F1.S000124 F1.S000125 F1.S000126 F1.S000127 F1.S000128
0 0 0 2667 8286 17441 24756
F1.S000129 F1.S000130 F1.S000131 F1.S000132 F1.S000133 F1.S000134 F1.S000135
24876 17727 8500 2474 0 0 0
F1.S000136 F1.S000137 F1.S000138 F1.S000139 F1.S000140
0 0 0 0 0 I don't have enough knowledge in MSnBase so maybe it's a dumb question but why values are so different ? To solve this problem more easily, I propose you this link if you want to get the dataset : https://filesender.renater.fr/?s=download&token=2d862d79-d366-6d23-8e6b-2fdc16d0fedf |
Thanks for the files. I'll have a look at it. |
I looked at your files and the first thing that I realized was that the spectra in the mzXML contain much less data points than in the CDF (number of spectra is the same and their library(xcms)
mzxml <- readMSData(file = "20180322_002.mzXML", mode = "onDisk")
cdf <- readMSData(file = "20180322_002.cdf", mode = "onDisk")
mean(unlist(spectrapply(mzxml, function(z) length(intensity(z)))))
[1] 1141.21
mean(unlist(spectrapply(cdf, function(z) length(intensity(z)))))
[1] 17012.48 In the mzXML is stated:
so, the mzXML file for sure is centroided and I guess the CDF file is in profile mode. That would explain the large difference in data points and also the difference in the calculated TIC (because each mass peak is reduced to a single m/z-intensity pair, while the CDF file contains all values). The total ion current listed in the header of the spectra is the total ion current measured by the MS instrument for the specific retention tima and AFAIK not changed/updated by proteowizard, even if you do centroiding (hence Summarizing, the two files you provided contain a different amount of data (CDF file contains more data points than the mzXML). |
Okay, it's much clearer now. So I will try to deprofile my CDF and make it centroided. Thank you very much for your time. |
Hi @jotsetung ,
@lecorguille, @yguitton, @0driveshaft0 and myself noticed an inconsistency in the construction of chromatograms when raw data come from CDF files.
Context :
We try to draw chromatograms from CDF or mzXML files.
We test the performance of :
=> We assume that they are supposed to give the same result
Problem :
File extensions :
tic() vs chromatogram(aggregationFun = 'sum') => TICs different
bpi() vs chromatogram(aggregationFun = 'max') => Same BPCs
tic() vs chromatogram(aggregationFun = 'sum') => Same TICs
bpi() vs chromatogram(aggregationFun = 'max') => Same BPCs
Functions :
CDF vs mzXML => Same chromatograms
CDF vs mzXML => Chromatograms different (+ Multiplying factor ?)
Conclusion :
Is there something wrong with the CDF format with XCMS (v1.46) ?
The text was updated successfully, but these errors were encountered: