-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix nonmatching dimnames #78
Bugfix nonmatching dimnames #78
Conversation
Thanks! I cannot easily identify which functions you did change. Could you please point them to me? (About the reformatting, I am also trying to replace the |
Of course, sorry about that. The main code changes are in this commit: 8659e53 (there are also some changes in indentation, but if you append
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR! It is great.
As I understand, now if a SummarizedExperiment
has inconsistent dim names, and we load library(tidySummarizedExperiment
), as soon as we print the object on screen, the as_tibble
function (called by print in the evaluation) will through an informative error.
This is good as it was throwing an error anyway.
However, while a user was able to do stuff with an inconsistent object, now is stopped from doing so, let's say printing the object on screen. I would say a more informative message, that maybe also says
Please do detach("tidySummarizedExperiment, ...") to keep operating on your object as ...
Or, is there a scope to make as_tibble
work with inconsistent objects (probably not, but I would like as least to think about it).
Having said all that, I think inconsistent objects as a pretty edge case.
What do you think @csoneson and @mikelove?
(also one check is failing)
R/utilities.R
Outdated
namefcn <- rownames | ||
dimfcn <- nrow | ||
} else { | ||
namefcn <- colnames | ||
dimfcn <- ncol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to keep the codebase acronyms-and-abbreviations-free. Would you mind replacing ...fcn
(and other acronyms I might have missed) with a verbose, self-explanatory alternative?
This helps all developers involved immediately understand.
Thanks @stemangiola - I have made the function names (hopefully) more informative.
Regarding the error in the Windows GHA (or did this refer to something else?) I think it's unrelated to this PR, as it was already there e.g. in https://github.com/stemangiola/tidySummarizedExperiment/actions/runs/5634008230/job/15263497689 (I did not yet look into what the reason might be). Regarding the inconsistent objects, I would agree that they're probably rare, but since they are valid SummarizedExperiment objects I also agree that it should be possible to at least do tidy-independent operations on it. |
Do we want to try to make as_tibble robust against this case? And if it works we can convert the stop into warning. |
I think there are two different aspects at the moment that would need to be addressed:
|
I agree. We could make this a new challenge as it is somehow independent of this one.
This seems modular and makes sense.
If the SE allows it, that is the responsibility of the framework, and we can be tranparent to that. And I am not even sure why |
…ty object when dimnames are not shared
I made some changes in the three commits above (still needs more testing, but I wanted to check what you think). It now allows non-overlapping dimnames, and will expand the tibble accordingly, and throw warnings whenever any assumptions are made. |
@csoneson please add your authorship details here https://docs.google.com/spreadsheets/d/19XqhN3xAMekCJ-esAolzoWT6fttruSEermjIsrOFcoo/edit?usp=sharing |
@csoneson I was thinking, what happens now if we try to apply the methods on this SE.
|
I think that trying to operate in this way directly on an inconsistent SE will cause problems. I still have to understand exactly where this is happening, but I get e.g.:
where |
OK, I have the feeling that adapting methods for this case will be incredibly time-consuming and not so rewarding, as this is an edge case. I am OK if methods fail, as long as we have issued the warnings. Should we merge this PR and accept after-warning failures? |
Sure, if the changes look sensible to you (I'm still getting used to the codebase and hope I haven't missed an implication somewhere - at least all the tests pass, so nothing that is covered there seems broken 🙂). It may be good to merge it to have a common starting points for further edits and more tests. |
Congrats and welcome to the team! 🎉 |
@csoneson, @stemangiola, thanks! It works. ## BiocManager::install('stemangiola/tidySummarizedExperiment', force = TRUE)
suppressMessages({
library(curatedMetagenomicData)
library(tidySummarizedExperiment)
})
dataset_name <- "HallAB_2017.relative_abundance"
tse <- curatedMetagenomicData(
pattern = dataset_name,
dryrun = FALSE, rownames = 'NCBI',
counts = TRUE
)[[1]]
#>
#> $`2021-10-14.HallAB_2017.relative_abundance`
#> dropping rows without rowTree matches:
#> k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Olsenella|s__Olsenella_profusa
#> k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
#> k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
#> k__Bacteria|p__Firmicutes|c__Erysipelotrichia|o__Erysipelotrichales|f__Erysipelotrichaceae|g__Bulleidia|s__Bulleidia_extructa
#> k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra
#> k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Cloacibacillus|s__Cloacibacillus_evryensis
tse
#> class: TreeSummarizedExperiment
#> dim: 503 259
#> metadata(1): agglomerated_by_rank
#> assays(1): relative_abundance
#> rownames(503): 853 820 ... 172901 1262744
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(259): p8582_mo1 p8582_mo10 ... SKST041_2_G103027
#> SKST041_3_G103028
#> colData names(24): study_name subject_id ... HBI SCCAI
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (503 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL
class(tse)
#> [1] "TreeSummarizedExperiment"
#> attr(,"package")
#> [1] "TreeSummarizedExperiment"
tidy_tse <- tidySummarizedExperiment::as_tibble(tse)
#> Warning in check_se_dimnames(se): tidySummarizedExperiment says: the assays in
#> your SummarizedExperiment have row names, but they don't agree with the row
#> names of the SummarizedExperiment object itself. It is strongly recommended to
#> make the assays consistent, to avoid erroneous matching of features.
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.1 (2023-06-16)
#> os Ubuntu 22.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2023-08-22
#> pandoc 3.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-5 2016-07-21 [1] CRAN (R 4.1.2)
#> AnnotationDbi 1.62.2 2023-07-02 [1] Bioconductor
#> AnnotationHub 3.8.0 2023-04-25 [1] Bioconductor
#> ape 5.7-1 2023-03-13 [1] CRAN (R 4.2.2)
#> beachmat 2.16.0 2023-04-25 [1] Bioconductor
#> beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.1.2)
#> Biobase * 2.60.0 2023-04-25 [1] Bioconductor
#> BiocFileCache 2.8.0 2023-04-25 [1] Bioconductor
#> BiocGenerics * 0.46.0 2023-04-25 [1] Bioconductor
#> BiocManager 1.30.22 2023-08-08 [1] CRAN (R 4.3.1)
#> BiocNeighbors 1.18.0 2023-04-25 [1] Bioconductor
#> BiocParallel 1.34.2 2023-05-22 [1] Bioconductor
#> BiocSingular 1.16.0 2023-04-25 [1] Bioconductor
#> BiocVersion 3.17.1 2022-11-04 [1] Bioconductor
#> Biostrings * 2.68.1 2023-05-16 [1] Bioconductor
#> bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.2)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.2)
#> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.2)
#> blob 1.2.4 2023-03-17 [1] CRAN (R 4.2.2)
#> cachem 1.0.8 2023-05-01 [1] CRAN (R 4.2.2)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.2)
#> cluster 2.1.4 2022-08-22 [1] CRAN (R 4.3.0)
#> codetools 0.2-19 2023-02-01 [3] CRAN (R 4.2.2)
#> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.2)
#> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.1)
#> curatedMetagenomicData * 3.8.0 2023-04-27 [1] Bioconductor
#> curl 5.0.2 2023-08-14 [1] CRAN (R 4.3.1)
#> data.table 1.14.8 2023-02-17 [1] CRAN (R 4.2.2)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.2)
#> dbplyr 2.3.3 2023-07-07 [1] CRAN (R 4.3.1)
#> DECIPHER 2.28.0 2023-04-25 [1] Bioconductor
#> decontam 1.20.0 2023-04-25 [1] Bioconductor
#> DelayedArray 0.26.7 2023-07-28 [1] Bioconductor
#> DelayedMatrixStats 1.22.5 2023-08-10 [1] Bioconductor
#> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1)
#> DirichletMultinomial 1.42.0 2023-04-25 [1] Bioconductor
#> dplyr 1.1.2 2023-04-20 [1] CRAN (R 4.2.2)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.2)
#> evaluate 0.21 2023-05-05 [1] CRAN (R 4.2.2)
#> ExperimentHub 2.8.1 2023-07-12 [1] Bioconductor
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.2)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.2)
#> filelock 1.0.2 2018-10-05 [1] CRAN (R 4.1.2)
#> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.1)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
#> GenomeInfoDb * 1.36.1 2023-06-21 [1] Bioconductor
#> GenomeInfoDbData 1.2.10 2023-05-09 [1] Bioconductor
#> GenomicRanges * 1.52.0 2023-04-25 [1] Bioconductor
#> ggbeeswarm 0.7.2 2023-04-29 [1] CRAN (R 4.2.2)
#> ggplot2 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)
#> ggrepel 0.9.3 2023-02-03 [1] CRAN (R 4.2.2)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
#> gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.2)
#> gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.1)
#> htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.3.1)
#> htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.2.2)
#> httpuv 1.6.11 2023-05-11 [1] CRAN (R 4.3.0)
#> httr 1.4.7 2023-08-15 [1] CRAN (R 4.3.1)
#> interactiveDisplayBase 1.38.0 2023-04-25 [1] Bioconductor
#> IRanges * 2.34.1 2023-06-22 [1] Bioconductor
#> irlba 2.3.5.1 2022-10-03 [1] CRAN (R 4.2.1)
#> jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.1)
#> KEGGREST 1.40.0 2023-04-25 [1] Bioconductor
#> knitr 1.43 2023-05-25 [1] CRAN (R 4.3.0)
#> later 1.3.1 2023-05-02 [1] CRAN (R 4.2.2)
#> lattice 0.21-8 2023-04-05 [3] CRAN (R 4.3.0)
#> lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.1.2)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.3)
#> MASS 7.3-60 2023-05-04 [3] CRAN (R 4.3.1)
#> Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.1)
#> MatrixGenerics * 1.12.3 2023-07-30 [1] Bioconductor
#> matrixStats * 1.0.0 2023-06-02 [1] CRAN (R 4.3.0)
#> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.1.2)
#> mgcv 1.9-0 2023-07-11 [3] CRAN (R 4.3.1)
#> mia 1.8.0 2023-04-25 [1] Bioconductor
#> mime 0.12 2021-09-28 [1] CRAN (R 4.1.2)
#> MultiAssayExperiment 1.27.0 2023-08-11 [1] Bioconductor
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.2)
#> nlme 3.1-163 2023-08-09 [3] CRAN (R 4.3.1)
#> permute 0.9-7 2022-01-27 [1] CRAN (R 4.1.2)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.2)
#> plotly 4.10.2 2023-06-03 [1] CRAN (R 4.3.0)
#> plyr 1.8.8 2022-11-11 [1] CRAN (R 4.2.2)
#> png 0.1-8 2022-11-29 [1] CRAN (R 4.2.2)
#> promises 1.2.1 2023-08-10 [1] CRAN (R 4.3.1)
#> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.1)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.2)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.2)
#> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.1.2)
#> Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1)
#> RCurl 1.98-1.12 2023-03-27 [1] CRAN (R 4.2.2)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.1)
#> reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.1.2)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.2.2)
#> rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.1)
#> RSQLite 2.3.1 2023-04-03 [1] CRAN (R 4.2.2)
#> rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.1)
#> rsvd 1.0.5 2021-04-16 [1] CRAN (R 4.1.2)
#> S4Arrays 1.0.5 2023-07-24 [1] Bioconductor
#> S4Vectors * 0.38.1 2023-05-02 [1] Bioconductor
#> ScaledMatrix 1.8.1 2023-05-03 [1] Bioconductor
#> scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.1)
#> scater 1.28.0 2023-04-25 [1] Bioconductor
#> scuttle 1.10.2 2023-08-03 [1] Bioconductor
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
#> shiny 1.7.5 2023-08-12 [1] CRAN (R 4.3.1)
#> SingleCellExperiment * 1.22.0 2023-04-25 [1] Bioconductor
#> sparseMatrixStats 1.12.2 2023-07-02 [1] Bioconductor
#> stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.2)
#> stringr 1.5.0 2022-12-02 [1] CRAN (R 4.2.2)
#> styler 1.10.1 2023-06-05 [1] CRAN (R 4.3.1)
#> SummarizedExperiment * 1.30.2 2023-06-06 [1] Bioconductor
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.2.2)
#> tidyr 1.3.0 2023-01-24 [1] CRAN (R 4.2.2)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.1)
#> tidySummarizedExperiment * 1.11.1 2023-08-22 [1] Github (stemangiola/tidySummarizedExperiment@c798f5b)
#> tidytree 0.4.5 2023-08-10 [1] CRAN (R 4.3.1)
#> treeio 1.24.3 2023-07-24 [1] Bioconductor
#> TreeSummarizedExperiment * 2.8.0 2023-04-25 [1] Bioconductor
#> ttservice * 0.3.6 2023-06-20 [1] RSPM (R 4.3.0)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.2)
#> vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
#> vegan 2.6-4 2022-10-11 [1] CRAN (R 4.2.1)
#> vipor 0.4.5 2017-03-22 [1] CRAN (R 4.1.2)
#> viridis 0.6.4 2023-07-22 [1] CRAN (R 4.3.1)
#> viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.2.2)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.2)
#> xfun 0.40 2023-08-09 [1] CRAN (R 4.3.1)
#> xtable 1.8-4 2019-04-21 [1] CRAN (R 4.1.2)
#> XVector * 0.40.0 2023-04-25 [1] Bioconductor
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.2)
#> yulab.utils 0.0.7 2023-08-09 [1] CRAN (R 4.3.1)
#> zlibbioc 1.46.0 2023-04-25 [1] Bioconductor
#>
#> [1] /mnt/STORE1/lib/R/bioc-release
#> [2] /usr/lib/R/site-library
#> [3] /usr/lib/R/library
#>
#> ────────────────────────────────────────────────────────────────────────────── Created on 2023-08-22 with reprex v2.0.2 |
@csoneson I noticed that the error triggers if we have duplicated colnames, because of the intersect, length approach. Also, a warning of duplicated colnames would be great. Can you please remind me if we did not decide at the end to convert to warning? Actually, being an error saved an analysis of mine. I am conflicted about what behaviours between error and warning would be best. |
Hm, good point. I'm thinking that this is a bit different from the other situations and may warrant an error, for a couple of reasons:
|
Yes unless we do clever masking for all assays this would be a problem. So I think we can
assay_1: A, A, B, C Now there is no mismatch technically but there is duplication. So just the duplication test should trigger, but not the mismatch check. |
I gave it a go here: #80 |
Here's a first attempt to address #70, as well as increasing the consistency of
utilities.R
in terms of indentation, spaces and assignment operators.Currently the situation where assays are unnamed is not handled well by
get_count_datasets()
. Do we want to require named assays? At the moment, adding a check for this makesnest()
fail (even if the assays are in fact named).