Fix bug in `robustSummary()` #349

adder · 2018-06-26T07:13:57Z

When samplenames of input are not alphabetically ordered, the order of the robustSummary() output is incorrect.

Model.matrix orders the ellements in a categorical variable alphabetically.
So the coefficients of rlm are also ordered alphabetically.
This order should be again reversed in the output into the same order as the samplesnames of input.

The proposed code change fixes this.

When samplenames of input are not alphabetically ordered, the order of the `robustSummary()` output is incorrect. Model.matrix orders the ellements in a categorical variable alphabetically. So the coefficients of rlm are also ordered alphabetically. This order should be again reversed in the output into the same order as the samplesnames of input. The proposed code change fixes this.

jorainer · 2018-06-26T13:48:12Z

Just a note: you could avoid this alphabetical re-ordering from the beginning by defining the factors providing also the levels:

## Instead of
sample <- rep(colnames(e), each = nrow(e))[p]

## Use
sample <- factor(rep(colnames(e), each = nrow(e))[p], levels = colnames(e))

there should be no need to sort them afterwords.
you could do the same for feature.

adder · 2018-06-26T21:44:19Z

Yes, you are offcourse right. That would be a more elegant way to handle the sample ordering.
One remark: I think you should also remove the samples that have missing intensities from levels
Else model.matrix makes to many collumns.

## Instead of
sample <- factor(rep(colnames(e), each = nrow(e))[p], levels = colnames(e))

## Use
sample <- factor(rep(colnames(e), each = nrow(e))[p], levels = colnames(e)[p])

As a side, note that my suggested patch also handels missing values in the output (samples with no intensity = NA) while reducing the involved lines of code from 4 to 2 :)

sgibb · 2018-06-27T20:58:44Z

R/combineFeatures.R

+    coef = fit$coefficients[sampleid]
+    ## Sort the sample coefficients in the same way as the samplenames of expression matrix
+    ## Puts NA for the samples without any expression value  
+    coef[paste0('sample',colnames(e))]
 }


Why you don't replace line 318 fit <- MASS::rlm(X, expression, ...) with

MASS::rlm(X, expression)$coefficients[paste0("sample", colnames(e))]

(maybe covered by unname) and remove all the following lines?

adder · 2018-06-28T09:08:00Z

@sgibb Yes this should indeed provide the wanted results in the least amount of lines of codes.
I'm in favor of this.

@lgatto Not sure how to proceed from here.
This code is already in the latest stable bioconductor release,but this is in my opinion a serious bug.
I only noticed it now because I only used the MSnBase version of robust summary in combination with maxquant and maxquant apparently orders samples alphabetically by default in its output.
But in data from other sources this is of course often not the case.

Can this still fixed somehow in stable bioconductor?
Or only is this only possible in the next release?

greets

lgatto · 2018-06-28T10:21:11Z

This code is already in the latest stable bioconductor release,but this is in my opinion a serious bug. Can this still fixed somehow in stable bioconductor?

Yes, that is possible and recommended. I will fix the bug in the devel and release. I hope to be able to review the PR tonight.

lgatto · 2018-07-28T13:03:05Z

Sorry for the delay - looking into this now.

@adder - in general, in such cases I would suggest to also provide a unit test that identifies the bug and verifies that is is fixed.

lgatto · 2018-07-28T13:23:34Z

Here's an example unit test:

test_that("Robust summary and sample names order (bug PR# 349)", {
    ## This identified the bug
    data(msnset)
    msnset2 <- msnset <- log(filterNA(msnset), 2)
    ## Expected results
    res0 <- combineFeatures(msnset,
                            fcol = "ProteinAccession",
                            fun = "robust")
    ## Identify the bug
    sampleNames(msnset2)[1] <- "zzz"
    res2 <- combineFeatures(msnset2,
                            fcol = "ProteinAccession",
                            fun = "robust")
    ## Re-set sample name
    sampleNames(res2) <- sampleNames(res0)
    expect_equal(exprs(res0), exprs(res2))
})

This identifies the bug

> 
>     ## This identified the bug
>     data(msnset)
>     msnset2 <- msnset <- log(filterNA(msnset), 2)
>     res0 <- combineFeatures(msnset,
+                             fcol = "ProteinAccession",
+                             fun = "robust")
>     sampleNames(msnset2)[1] <- "zzz"
>     res2 <- combineFeatures(msnset2,
+                             fcol = "ProteinAccession",
+                             fun = "robust")
>     sampleNames(res2) <- sampleNames(res0)
>     expect_equal(exprs(res0), exprs(res2))

Error: exprs(res0) not equal to exprs(res2).
32/160 mismatches (average diff: 0.397)
[1]   9.52 - 10.16 == -0.6410
[3]  11.09 - 11.30 == -0.2125
[5]  10.06 - 10.01 ==  0.0468
[26] 10.89 - 11.10 == -0.2134
[29] 11.50 - 11.59 == -0.0900
[34] 10.04 -  9.87 ==  0.1789
[39] 12.49 - 12.55 == -0.0655
[40] 11.46 - 10.82 ==  0.6406
[41] 10.16 - 11.08 == -0.9193
...
> tail(exprs(res0))
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4030  14.754384   15.02730  14.970478  14.700696
ECA4037  10.613557   10.49613  10.507692  10.604738
ECA4512   9.496451   10.04831   9.733812   9.883604
ECA4513  13.309896   13.35631  13.427600  13.462701
ECA4514  12.489348   12.55487  12.576657  12.621516
ENO      11.455669   10.81511  10.140333   9.225928
> tail(exprs(res2))
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4030  14.754384   15.02730  14.970478  14.700696
ECA4037  10.613557   10.49613  10.507692  10.604738
ECA4512   9.496451   10.04831   9.733812   9.883604
ECA4513  13.309896   13.35631  13.427600  13.462701
ECA4514  12.554874   12.57666  12.621516  12.489348
ENO      10.815109   10.14033   9.225928  11.455669

Now with the code in this PR:

>     res2 <- combineFeatures(msnset2,
+                             fcol = "ProteinAccession",
+                             fun = "robust")
>     sampleNames(res2) <- sampleNames(res0)
>     expect_equal(exprs(res0), exprs(res2))

sgibb reviewed Jun 27, 2018

View reviewed changes

lgatto merged commit 005491f into lgatto:master Jul 28, 2018

lgatto pushed a commit that referenced this pull request Jul 28, 2018

add unit test for PR #349

dfbd2e7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug in `robustSummary()` #349

Fix bug in `robustSummary()` #349

adder commented Jun 26, 2018

jorainer commented Jun 26, 2018

adder commented Jun 26, 2018

sgibb Jun 27, 2018

adder commented Jun 28, 2018

lgatto commented Jun 28, 2018

lgatto commented Jul 28, 2018

lgatto commented Jul 28, 2018

Fix bug in robustSummary() #349

Fix bug in robustSummary() #349

Conversation

adder commented Jun 26, 2018

jorainer commented Jun 26, 2018

adder commented Jun 26, 2018

sgibb Jun 27, 2018

Choose a reason for hiding this comment

adder commented Jun 28, 2018

lgatto commented Jun 28, 2018

lgatto commented Jul 28, 2018

lgatto commented Jul 28, 2018

Fix bug in `robustSummary()` #349

Fix bug in `robustSummary()` #349