-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
getPrevalence, NA values #486
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good if we assume that users will always want to consider NA as "not detected". I think this is OK.
Another option would be to let users pass na.rm=TRUE for rowSums on line 216 (or perhaps this could be the default then; albeit now na.rm=FALSE by default in other parts of this R file).
It might be advisable to throw a warning when NAs are interpreted as "not detected". User should get to know this.
I would also add some unit tests to spot this problem in the future if anything is changed?
One option is to catch --> Use na.rm in rowSums in line 216 |
Yes the |
TBH I am not sure how necessary it is to allow users decide whether they want to have to have I would pay some attention for documentation, examples & unit tests to keep it clear for users. |
Is it now good to go? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions in the comments.
tests/testthat/test-5prevalence.R
Outdated
assay(tse, "counts")[remove, ] <- NA | ||
# There should be 3 NA values if na.rm = FALSE. Otherwise there should be 0 | ||
expect_true( sum(is.na( | ||
getPrevalence(tse, assay.type = "counts", na.rm = FALSE) )) == 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this also work when rank
argument is specified? Because earlier there were differences how this works when rank is additionally provided, compared to this present example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These modifications do not affect this thing, I guess. (I am not sure what differences there were)
When rank is specified
-
TreeSE is agglomerated to level specified by
rank
.agg.na.rm
(I renameddrop.empty.ranks
) is used to specify whether to drop those rows that do not have taxonomy in level specified byrank.
--> Instead of "phylum_something" & "class_something2" there are only taxa that have taxonomy info in class level, for instance. -
If there are NA values, agglomerated tables have also NA values in higher level rank --> This might be misleading? We could add parameter to agglomerateByRank that removes NAs? (agglomerateByRanks --> mergeRows --> .merge_rows --> scuttle::sumCountsAcrossFeatures)
-
getPrevalence calculates prevalence of features. With
na.rm,
user can specify whether to remove NA values.
So currently, when there are NA values and rank is specified, there are some taxa that have prevalence NA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also have in unit tests test of the rank parameter, something like
getPrevalence(tse, assay.type = "counts", na.rm = FALSE, rank="Genus")
R/getPrevalence.R
Outdated
if( any( is.na(x) ) ){ | ||
msg <- paste0( | ||
"The abundance table contains NA values and they are ", | ||
ifelse(na.rm, "not", ""), "excluded (see 'na.rm').") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this in sync with the agg.na.rm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is na.rm
for getPrevalence
function (agg.na.rm
is for agglomerateByRanks
) --> so yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Just ensure it is sufficiently clear from documentation what na.rm is for what purposes and there is no real risk for confusion between those.
If there are NA values, they are interpreted as FALSE i.e., certain value do not exceed the detection threshold i.e. "this taxon cannot be found from this sample"