Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getPrevalence, NA values #486

Merged
merged 14 commits into from
Mar 5, 2024
Merged

getPrevalence, NA values #486

merged 14 commits into from
Mar 5, 2024

Conversation

TuomasBorman
Copy link
Contributor

If there are NA values, they are interpreted as FALSE i.e., certain value do not exceed the detection threshold i.e. "this taxon cannot be found from this sample"

@TuomasBorman TuomasBorman linked an issue Feb 14, 2024 that may be closed by this pull request
Copy link
Member

@antagomir antagomir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good if we assume that users will always want to consider NA as "not detected". I think this is OK.

Another option would be to let users pass na.rm=TRUE for rowSums on line 216 (or perhaps this could be the default then; albeit now na.rm=FALSE by default in other parts of this R file).

It might be advisable to throw a warning when NAs are interpreted as "not detected". User should get to know this.

I would also add some unit tests to spot this problem in the future if anything is changed?

@TuomasBorman
Copy link
Contributor Author

One option is to catch na.rm in .agg_for_prevalence function and create new parameter called drop.empty.rank --> don't feed it to agglomreateByRank function instead use drop.empty.rank

--> Use na.rm in rowSums in line 216

@antagomir
Copy link
Member

Yes the drop.empty.rank might be more clear.

@antagomir
Copy link
Member

TBH I am not sure how necessary it is to allow users decide whether they want to have to have na.rm TRUE of FALSE in the merging. But perhaps it is good to let user decide ultimately.

I would pay some attention for documentation, examples & unit tests to keep it clear for users.

@TuomasBorman
Copy link
Contributor Author

Is it now good to go?

Copy link
Member

@antagomir antagomir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions in the comments.

R/getPrevalence.R Outdated Show resolved Hide resolved
R/getPrevalence.R Show resolved Hide resolved
assay(tse, "counts")[remove, ] <- NA
# There should be 3 NA values if na.rm = FALSE. Otherwise there should be 0
expect_true( sum(is.na(
getPrevalence(tse, assay.type = "counts", na.rm = FALSE) )) == 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this also work when rank argument is specified? Because earlier there were differences how this works when rank is additionally provided, compared to this present example.

Copy link
Contributor Author

@TuomasBorman TuomasBorman Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These modifications do not affect this thing, I guess. (I am not sure what differences there were)

When rank is specified

  1. TreeSE is agglomerated to level specified by rank. agg.na.rm (I renamed drop.empty.ranks) is used to specify whether to drop those rows that do not have taxonomy in level specified by rank. --> Instead of "phylum_something" & "class_something2" there are only taxa that have taxonomy info in class level, for instance.

  2. If there are NA values, agglomerated tables have also NA values in higher level rank --> This might be misleading? We could add parameter to agglomerateByRank that removes NAs? (agglomerateByRanks --> mergeRows --> .merge_rows --> scuttle::sumCountsAcrossFeatures)

  3. getPrevalence calculates prevalence of features. With na.rm, user can specify whether to remove NA values.

So currently, when there are NA values and rank is specified, there are some taxa that have prevalence NA.

Copy link
Member

@antagomir antagomir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also have in unit tests test of the rank parameter, something like

getPrevalence(tse, assay.type = "counts", na.rm = FALSE, rank="Genus")

if( any( is.na(x) ) ){
msg <- paste0(
"The abundance table contains NA values and they are ",
ifelse(na.rm, "not", ""), "excluded (see 'na.rm').")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this in sync with the agg.na.rm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is na.rm for getPrevalence function (agg.na.rm is for agglomerateByRanks) --> so yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Just ensure it is sufficiently clear from documentation what na.rm is for what purposes and there is no real risk for confusion between those.

TuomasBorman and others added 5 commits February 16, 2024 15:12
Merge branch 'getprevalence' of github.com:microbiome/mia into getprevalence

# Conflicts:
#	DESCRIPTION
@TuomasBorman TuomasBorman merged commit aa1d488 into master Mar 5, 2024
1 check passed
@TuomasBorman TuomasBorman deleted the getprevalence branch March 5, 2024 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

getPrevalence, named numeric(0) if NA values in the assay
2 participants