Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in robustSummary() #349

Merged
merged 1 commit into from
Jul 28, 2018
Merged

Fix bug in robustSummary() #349

merged 1 commit into from
Jul 28, 2018

Conversation

adder
Copy link
Contributor

@adder adder commented Jun 26, 2018

When samplenames of input are not alphabetically ordered, the order of the robustSummary() output is incorrect.

Model.matrix orders the ellements in a categorical variable alphabetically.
So the coefficients of rlm are also ordered alphabetically.
This order should be again reversed in the output into the same order as the samplesnames of input.

The proposed code change fixes this.

When samplenames of input are not alphabetically ordered, the order of the `robustSummary()` output is incorrect.

Model.matrix orders the ellements in a categorical variable alphabetically.
So the coefficients of rlm are also ordered alphabetically.
This order should be again reversed in the output into the same order as the samplesnames of input.

The proposed code change fixes this.
@jorainer
Copy link
Collaborator

Just a note: you could avoid this alphabetical re-ordering from the beginning by defining the factors providing also the levels:

## Instead of
sample <- rep(colnames(e), each = nrow(e))[p]

## Use
sample <- factor(rep(colnames(e), each = nrow(e))[p], levels = colnames(e))

there should be no need to sort them afterwords.
you could do the same for feature.

@adder
Copy link
Contributor Author

adder commented Jun 26, 2018

Yes, you are offcourse right. That would be a more elegant way to handle the sample ordering.
One remark: I think you should also remove the samples that have missing intensities from levels
Else model.matrix makes to many collumns.

## Instead of
sample <- factor(rep(colnames(e), each = nrow(e))[p], levels = colnames(e))

## Use
sample <- factor(rep(colnames(e), each = nrow(e))[p], levels = colnames(e)[p])

As a side, note that my suggested patch also handels missing values in the output (samples with no intensity = NA) while reducing the involved lines of code from 4 to 2 :)

coef = fit$coefficients[sampleid]
## Sort the sample coefficients in the same way as the samplenames of expression matrix
## Puts NA for the samples without any expression value
coef[paste0('sample',colnames(e))]
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you don't replace line 318 fit <- MASS::rlm(X, expression, ...) with

MASS::rlm(X, expression)$coefficients[paste0("sample", colnames(e))]

(maybe covered by unname) and remove all the following lines?

@adder
Copy link
Contributor Author

adder commented Jun 28, 2018

@sgibb Yes this should indeed provide the wanted results in the least amount of lines of codes.
I'm in favor of this.

@lgatto Not sure how to proceed from here.
This code is already in the latest stable bioconductor release,but this is in my opinion a serious bug.
I only noticed it now because I only used the MSnBase version of robust summary in combination with maxquant and maxquant apparently orders samples alphabetically by default in its output.
But in data from other sources this is of course often not the case.

Can this still fixed somehow in stable bioconductor?
Or only is this only possible in the next release?

greets

@lgatto
Copy link
Owner

lgatto commented Jun 28, 2018

This code is already in the latest stable bioconductor release,but this is in my opinion a serious bug. Can this still fixed somehow in stable bioconductor?

Yes, that is possible and recommended. I will fix the bug in the devel and release. I hope to be able to review the PR tonight.

@lgatto
Copy link
Owner

lgatto commented Jul 28, 2018

Sorry for the delay - looking into this now.

@adder - in general, in such cases I would suggest to also provide a unit test that identifies the bug and verifies that is is fixed.

@lgatto
Copy link
Owner

lgatto commented Jul 28, 2018

Here's an example unit test:

test_that("Robust summary and sample names order (bug PR# 349)", {
    ## This identified the bug
    data(msnset)
    msnset2 <- msnset <- log(filterNA(msnset), 2)
    ## Expected results
    res0 <- combineFeatures(msnset,
                            fcol = "ProteinAccession",
                            fun = "robust")
    ## Identify the bug
    sampleNames(msnset2)[1] <- "zzz"
    res2 <- combineFeatures(msnset2,
                            fcol = "ProteinAccession",
                            fun = "robust")
    ## Re-set sample name
    sampleNames(res2) <- sampleNames(res0)
    expect_equal(exprs(res0), exprs(res2))
})

This identifies the bug

> 
>     ## This identified the bug
>     data(msnset)
>     msnset2 <- msnset <- log(filterNA(msnset), 2)
>     res0 <- combineFeatures(msnset,
+                             fcol = "ProteinAccession",
+                             fun = "robust")
>     sampleNames(msnset2)[1] <- "zzz"
>     res2 <- combineFeatures(msnset2,
+                             fcol = "ProteinAccession",
+                             fun = "robust")
>     sampleNames(res2) <- sampleNames(res0)
>     expect_equal(exprs(res0), exprs(res2))

Error: exprs(res0) not equal to exprs(res2).
32/160 mismatches (average diff: 0.397)
[1]   9.52 - 10.16 == -0.6410
[3]  11.09 - 11.30 == -0.2125
[5]  10.06 - 10.01 ==  0.0468
[26] 10.89 - 11.10 == -0.2134
[29] 11.50 - 11.59 == -0.0900
[34] 10.04 -  9.87 ==  0.1789
[39] 12.49 - 12.55 == -0.0655
[40] 11.46 - 10.82 ==  0.6406
[41] 10.16 - 11.08 == -0.9193
...
> tail(exprs(res0))
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4030  14.754384   15.02730  14.970478  14.700696
ECA4037  10.613557   10.49613  10.507692  10.604738
ECA4512   9.496451   10.04831   9.733812   9.883604
ECA4513  13.309896   13.35631  13.427600  13.462701
ECA4514  12.489348   12.55487  12.576657  12.621516
ENO      11.455669   10.81511  10.140333   9.225928
> tail(exprs(res2))
        iTRAQ4.114 iTRAQ4.115 iTRAQ4.116 iTRAQ4.117
ECA4030  14.754384   15.02730  14.970478  14.700696
ECA4037  10.613557   10.49613  10.507692  10.604738
ECA4512   9.496451   10.04831   9.733812   9.883604
ECA4513  13.309896   13.35631  13.427600  13.462701
ECA4514  12.554874   12.57666  12.621516  12.489348
ENO      10.815109   10.14033   9.225928  11.455669

Now with the code in this PR:

>     res2 <- combineFeatures(msnset2,
+                             fcol = "ProteinAccession",
+                             fun = "robust")
>     sampleNames(res2) <- sampleNames(res0)
>     expect_equal(exprs(res0), exprs(res2))

@lgatto lgatto merged commit 005491f into lgatto:master Jul 28, 2018
lgatto pushed a commit that referenced this pull request Jul 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants