Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix nonmatching dimnames #78

Merged
merged 9 commits into from
Aug 21, 2023

Conversation

csoneson
Copy link
Contributor

Here's a first attempt to address #70, as well as increasing the consistency of utilities.R in terms of indentation, spaces and assignment operators.
Currently the situation where assays are unnamed is not handled well by get_count_datasets(). Do we want to require named assays? At the moment, adding a check for this makes nest() fail (even if the assays are in fact named).

@stemangiola
Copy link
Owner

Here's a first attempt to address #70, as well as increasing the consistency of utilities.R in terms of indentation, spaces and assignment operators. Currently the situation where assays are unnamed is not handled well by get_count_datasets(). Do we want to require named assays? At the moment, adding a check for this makes nest() fail (even if the assays are in fact named).

Thanks!

I cannot easily identify which functions you did change. Could you please point them to me?

(About the reformatting, I am also trying to replace the %>% pipe with the |> one. But maybe it is another project. If you happen to make new code please use |>).

@csoneson
Copy link
Contributor Author

I cannot easily identify which functions you did change. Could you please point them to me?

Of course, sorry about that. The main code changes are in this commit: 8659e53 (there are also some changes in indentation, but if you append ?w=1 to the URL you can hide those whitespace-only changes)

  • I pulled out the checks of dimnames into a new function (check_se_dimnames()) as they are now used in multiple places

  • This function is now used in get_count_datasets() and order_assays_internally_to_be_consistent()

  • I gave check_if_assays_are_NOT_overlapped() a new argument to allow checking either row or column names

  • I also added unit tests for get_count_datasets() in bc47ac6

  • The other changes are cosmetic

@stemangiola stemangiola self-requested a review August 17, 2023 06:03
@stemangiola stemangiola added the bug Something isn't working label Aug 17, 2023
Copy link
Owner

@stemangiola stemangiola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! It is great.

As I understand, now if a SummarizedExperiment has inconsistent dim names, and we load library(tidySummarizedExperiment), as soon as we print the object on screen, the as_tibble function (called by print in the evaluation) will through an informative error.

This is good as it was throwing an error anyway.

However, while a user was able to do stuff with an inconsistent object, now is stopped from doing so, let's say printing the object on screen. I would say a more informative message, that maybe also says

Please do detach("tidySummarizedExperiment, ...") to keep operating on your object as ...

Or, is there a scope to make as_tibble work with inconsistent objects (probably not, but I would like as least to think about it).

Having said all that, I think inconsistent objects as a pretty edge case.

What do you think @csoneson and @mikelove?

(also one check is failing)

R/utilities.R Outdated
Comment on lines 1262 to 1266
namefcn <- rownames
dimfcn <- nrow
} else {
namefcn <- colnames
dimfcn <- ncol
Copy link
Owner

@stemangiola stemangiola Aug 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to keep the codebase acronyms-and-abbreviations-free. Would you mind replacing ...fcn (and other acronyms I might have missed) with a verbose, self-explanatory alternative?

This helps all developers involved immediately understand.

@csoneson
Copy link
Contributor Author

Thanks @stemangiola - I have made the function names (hopefully) more informative.

(also one check is failing)

Regarding the error in the Windows GHA (or did this refer to something else?) I think it's unrelated to this PR, as it was already there e.g. in https://github.com/stemangiola/tidySummarizedExperiment/actions/runs/5634008230/job/15263497689 (I did not yet look into what the reason might be).

Regarding the inconsistent objects, I would agree that they're probably rare, but since they are valid SummarizedExperiment objects I also agree that it should be possible to at least do tidy-independent operations on it.

@stemangiola
Copy link
Owner

Regarding the inconsistent objects, I would agree that they're probably rare, but since they are valid SummarizedExperiment objects I also agree that it should be possible to at least do tidy-independent operations on it.

Do we want to try to make as_tibble robust against this case? And if it works we can convert the stop into warning.

@csoneson
Copy link
Contributor Author

I think there are two different aspects at the moment that would need to be addressed:

  • Unnamed assays. This could perhaps be fixed by giving them names (assay1, etc), similar to what is currently done for row and column names if they are completely missing.
  • Inconsistent naming between assays or with the SE (say, one assay with feature IDs g1, ...,gN, and another with f1, ..., fN. Here I'm not so sure what to do - I guess we could consider them different features, with missing values in the assay(s) where they are not observed (which would make the tibble longer). However, this would somehow go against the assumption of a SE that the assays should be aligned (it would align better with the altExps). The other option would be to select the names from the first assay and propagate them to any assay that have non-overlapping names, but that feels dangerous. Moreover, what would we do if all names but one agree (e.g., if there was only a single fJ in the middle of the second assay above, but all the other names agreed).

@stemangiola
Copy link
Owner

stemangiola commented Aug 18, 2023

  • Unnamed assays. This could perhaps be fixed by giving them names (assay1, etc), similar to what is currently done for row and column names if they are completely missing.

I agree. We could make this a new challenge as it is somehow independent of this one.

I guess we could consider them different features, with missing values in the assay(s) where they are not observed (which would make the tibble longer).

This seems modular and makes sense.

However, this would somehow go against the assumption of a SE that the assays should be aligned (it would align better with the altExps).

If the SE allows it, that is the responsibility of the framework, and we can be tranparent to that.

And I am not even sure why as_tibble fails. Maybe I am doing a left join for the assays, but this should not fail.

@csoneson
Copy link
Contributor Author

I made some changes in the three commits above (still needs more testing, but I wanted to check what you think). It now allows non-overlapping dimnames, and will expand the tibble accordingly, and throw warnings whenever any assumptions are made.

@stemangiola
Copy link
Owner

Thanks @csoneson ! @sdgamboa please test this branch with your dataset and see if this looks as you would expect.

@stemangiola
Copy link
Owner

@csoneson please add your authorship details here https://docs.google.com/spreadsheets/d/19XqhN3xAMekCJ-esAolzoWT6fttruSEermjIsrOFcoo/edit?usp=sharing

@stemangiola
Copy link
Owner

Unnamed assays. This could perhaps be fixed by giving them names (assay1, etc), similar to what is currently done for row and column names if they are completely missing.

@csoneson new challenge created

#79

@stemangiola
Copy link
Owner

@csoneson I was thinking, what happens now if we try to apply the methods on this SE.

  • mutate
  • slice
  • select
  • nest, unnest
  • left_join

@csoneson
Copy link
Contributor Author

I think that trying to operate in this way directly on an inconsistent SE will cause problems. I still have to understand exactly where this is happening, but I get e.g.:

> se1 |> mutate(v = 3)
Error in data.frame(., row.names = pull(., !!s_(se)$symbol), check.names = FALSE) : 
  row names contain missing values

where se1 is a SE where dimnames don't agree between assays and the SE itself. After converting to a tibble of course things "work" in the sense that they don't raise errors (whether it's sensible for downstream use is a different question, but there are lots of warnings provided to the user).

@stemangiola
Copy link
Owner

I think that trying to operate in this way directly on an inconsistent SE will cause problems. I still have to understand exactly where this is happening, but I get e.g.:

> se1 |> mutate(v = 3)
Error in data.frame(., row.names = pull(., !!s_(se)$symbol), check.names = FALSE) : 
  row names contain missing values

where se1 is a SE where dimnames don't agree between assays and the SE itself. After converting to a tibble of course things "work" in the sense that they don't raise errors (whether it's sensible for downstream use is a different question, but there are lots of warnings provided to the user).

OK, I have the feeling that adapting methods for this case will be incredibly time-consuming and not so rewarding, as this is an edge case.

I am OK if methods fail, as long as we have issued the warnings.

Should we merge this PR and accept after-warning failures?

@csoneson
Copy link
Contributor Author

Should we merge this PR and accept after-warning failures?

Sure, if the changes look sensible to you (I'm still getting used to the codebase and hope I haven't missed an implication somewhere - at least all the tests pass, so nothing that is covered there seems broken 🙂). It may be good to merge it to have a common starting points for further edits and more tests.

@stemangiola stemangiola merged commit c798f5b into stemangiola:master Aug 21, 2023
@stemangiola
Copy link
Owner

Congrats and welcome to the team! 🎉

@csoneson csoneson deleted the bugfix-nonmatching-dimnames branch August 21, 2023 07:13
@sdgamboa
Copy link

@csoneson, @stemangiola, thanks! It works.

## BiocManager::install('stemangiola/tidySummarizedExperiment', force = TRUE)
suppressMessages({
    library(curatedMetagenomicData)
    library(tidySummarizedExperiment)
})
dataset_name <- "HallAB_2017.relative_abundance"
tse <- curatedMetagenomicData(
    pattern = dataset_name, 
    dryrun = FALSE, rownames = 'NCBI',
    counts = TRUE
)[[1]]
#> 
#> $`2021-10-14.HallAB_2017.relative_abundance`
#> dropping rows without rowTree matches:
#>   k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Olsenella|s__Olsenella_profusa
#>   k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
#>   k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
#>   k__Bacteria|p__Firmicutes|c__Erysipelotrichia|o__Erysipelotrichales|f__Erysipelotrichaceae|g__Bulleidia|s__Bulleidia_extructa
#>   k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra
#>   k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Cloacibacillus|s__Cloacibacillus_evryensis
tse 
#> class: TreeSummarizedExperiment 
#> dim: 503 259 
#> metadata(1): agglomerated_by_rank
#> assays(1): relative_abundance
#> rownames(503): 853 820 ... 172901 1262744
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(259): p8582_mo1 p8582_mo10 ... SKST041_2_G103027
#>   SKST041_3_G103028
#> colData names(24): study_name subject_id ... HBI SCCAI
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (503 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL
class(tse)
#> [1] "TreeSummarizedExperiment"
#> attr(,"package")
#> [1] "TreeSummarizedExperiment"
tidy_tse <- tidySummarizedExperiment::as_tibble(tse)
#> Warning in check_se_dimnames(se): tidySummarizedExperiment says: the assays in
#> your SummarizedExperiment have row names, but they don't agree with the row
#> names of the SummarizedExperiment object itself. It is strongly recommended to
#> make the assays consistent, to avoid erroneous matching of features.
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       Ubuntu 22.04.2 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2023-08-22
#>  pandoc   3.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package                  * version   date (UTC) lib source
#>  abind                      1.4-5     2016-07-21 [1] CRAN (R 4.1.2)
#>  AnnotationDbi              1.62.2    2023-07-02 [1] Bioconductor
#>  AnnotationHub              3.8.0     2023-04-25 [1] Bioconductor
#>  ape                        5.7-1     2023-03-13 [1] CRAN (R 4.2.2)
#>  beachmat                   2.16.0    2023-04-25 [1] Bioconductor
#>  beeswarm                   0.4.0     2021-06-01 [1] CRAN (R 4.1.2)
#>  Biobase                  * 2.60.0    2023-04-25 [1] Bioconductor
#>  BiocFileCache              2.8.0     2023-04-25 [1] Bioconductor
#>  BiocGenerics             * 0.46.0    2023-04-25 [1] Bioconductor
#>  BiocManager                1.30.22   2023-08-08 [1] CRAN (R 4.3.1)
#>  BiocNeighbors              1.18.0    2023-04-25 [1] Bioconductor
#>  BiocParallel               1.34.2    2023-05-22 [1] Bioconductor
#>  BiocSingular               1.16.0    2023-04-25 [1] Bioconductor
#>  BiocVersion                3.17.1    2022-11-04 [1] Bioconductor
#>  Biostrings               * 2.68.1    2023-05-16 [1] Bioconductor
#>  bit                        4.0.5     2022-11-15 [1] CRAN (R 4.2.2)
#>  bit64                      4.0.5     2020-08-30 [1] CRAN (R 4.1.2)
#>  bitops                     1.0-7     2021-04-24 [1] CRAN (R 4.1.2)
#>  blob                       1.2.4     2023-03-17 [1] CRAN (R 4.2.2)
#>  cachem                     1.0.8     2023-05-01 [1] CRAN (R 4.2.2)
#>  cli                        3.6.1     2023-03-23 [1] CRAN (R 4.2.2)
#>  cluster                    2.1.4     2022-08-22 [1] CRAN (R 4.3.0)
#>  codetools                  0.2-19    2023-02-01 [3] CRAN (R 4.2.2)
#>  colorspace                 2.1-0     2023-01-23 [1] CRAN (R 4.2.2)
#>  crayon                     1.5.2     2022-09-29 [1] CRAN (R 4.2.1)
#>  curatedMetagenomicData   * 3.8.0     2023-04-27 [1] Bioconductor
#>  curl                       5.0.2     2023-08-14 [1] CRAN (R 4.3.1)
#>  data.table                 1.14.8    2023-02-17 [1] CRAN (R 4.2.2)
#>  DBI                        1.1.3     2022-06-18 [1] CRAN (R 4.2.2)
#>  dbplyr                     2.3.3     2023-07-07 [1] CRAN (R 4.3.1)
#>  DECIPHER                   2.28.0    2023-04-25 [1] Bioconductor
#>  decontam                   1.20.0    2023-04-25 [1] Bioconductor
#>  DelayedArray               0.26.7    2023-07-28 [1] Bioconductor
#>  DelayedMatrixStats         1.22.5    2023-08-10 [1] Bioconductor
#>  digest                     0.6.33    2023-07-07 [1] CRAN (R 4.3.1)
#>  DirichletMultinomial       1.42.0    2023-04-25 [1] Bioconductor
#>  dplyr                      1.1.2     2023-04-20 [1] CRAN (R 4.2.2)
#>  ellipsis                   0.3.2     2021-04-29 [1] CRAN (R 4.1.2)
#>  evaluate                   0.21      2023-05-05 [1] CRAN (R 4.2.2)
#>  ExperimentHub              2.8.1     2023-07-12 [1] Bioconductor
#>  fansi                      1.0.4     2023-01-22 [1] CRAN (R 4.2.2)
#>  fastmap                    1.1.1     2023-02-24 [1] CRAN (R 4.2.2)
#>  filelock                   1.0.2     2018-10-05 [1] CRAN (R 4.1.2)
#>  fs                         1.6.3     2023-07-20 [1] CRAN (R 4.3.1)
#>  generics                   0.1.3     2022-07-05 [1] CRAN (R 4.2.1)
#>  GenomeInfoDb             * 1.36.1    2023-06-21 [1] Bioconductor
#>  GenomeInfoDbData           1.2.10    2023-05-09 [1] Bioconductor
#>  GenomicRanges            * 1.52.0    2023-04-25 [1] Bioconductor
#>  ggbeeswarm                 0.7.2     2023-04-29 [1] CRAN (R 4.2.2)
#>  ggplot2                    3.4.3     2023-08-14 [1] CRAN (R 4.3.1)
#>  ggrepel                    0.9.3     2023-02-03 [1] CRAN (R 4.2.2)
#>  glue                       1.6.2     2022-02-24 [1] CRAN (R 4.3.0)
#>  gridExtra                  2.3       2017-09-09 [1] CRAN (R 4.1.2)
#>  gtable                     0.3.4     2023-08-21 [1] CRAN (R 4.3.1)
#>  htmltools                  0.5.6     2023-08-10 [1] CRAN (R 4.3.1)
#>  htmlwidgets                1.6.2     2023-03-17 [1] CRAN (R 4.2.2)
#>  httpuv                     1.6.11    2023-05-11 [1] CRAN (R 4.3.0)
#>  httr                       1.4.7     2023-08-15 [1] CRAN (R 4.3.1)
#>  interactiveDisplayBase     1.38.0    2023-04-25 [1] Bioconductor
#>  IRanges                  * 2.34.1    2023-06-22 [1] Bioconductor
#>  irlba                      2.3.5.1   2022-10-03 [1] CRAN (R 4.2.1)
#>  jsonlite                   1.8.7     2023-06-29 [1] CRAN (R 4.3.1)
#>  KEGGREST                   1.40.0    2023-04-25 [1] Bioconductor
#>  knitr                      1.43      2023-05-25 [1] CRAN (R 4.3.0)
#>  later                      1.3.1     2023-05-02 [1] CRAN (R 4.2.2)
#>  lattice                    0.21-8    2023-04-05 [3] CRAN (R 4.3.0)
#>  lazyeval                   0.2.2     2019-03-15 [1] CRAN (R 4.1.2)
#>  lifecycle                  1.0.3     2022-10-07 [1] CRAN (R 4.2.1)
#>  magrittr                   2.0.3     2022-03-30 [1] CRAN (R 4.1.3)
#>  MASS                       7.3-60    2023-05-04 [3] CRAN (R 4.3.1)
#>  Matrix                     1.6-1     2023-08-14 [1] CRAN (R 4.3.1)
#>  MatrixGenerics           * 1.12.3    2023-07-30 [1] Bioconductor
#>  matrixStats              * 1.0.0     2023-06-02 [1] CRAN (R 4.3.0)
#>  memoise                    2.0.1     2021-11-26 [1] CRAN (R 4.1.2)
#>  mgcv                       1.9-0     2023-07-11 [3] CRAN (R 4.3.1)
#>  mia                        1.8.0     2023-04-25 [1] Bioconductor
#>  mime                       0.12      2021-09-28 [1] CRAN (R 4.1.2)
#>  MultiAssayExperiment       1.27.0    2023-08-11 [1] Bioconductor
#>  munsell                    0.5.0     2018-06-12 [1] CRAN (R 4.1.2)
#>  nlme                       3.1-163   2023-08-09 [3] CRAN (R 4.3.1)
#>  permute                    0.9-7     2022-01-27 [1] CRAN (R 4.1.2)
#>  pillar                     1.9.0     2023-03-22 [1] CRAN (R 4.2.2)
#>  pkgconfig                  2.0.3     2019-09-22 [1] CRAN (R 4.1.2)
#>  plotly                     4.10.2    2023-06-03 [1] CRAN (R 4.3.0)
#>  plyr                       1.8.8     2022-11-11 [1] CRAN (R 4.2.2)
#>  png                        0.1-8     2022-11-29 [1] CRAN (R 4.2.2)
#>  promises                   1.2.1     2023-08-10 [1] CRAN (R 4.3.1)
#>  purrr                      1.0.2     2023-08-10 [1] CRAN (R 4.3.1)
#>  R.cache                    0.16.0    2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3                1.8.2     2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo                       1.25.0    2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils                    2.12.2    2022-11-11 [1] CRAN (R 4.2.2)
#>  R6                         2.5.1     2021-08-19 [1] CRAN (R 4.1.2)
#>  rappdirs                   0.3.3     2021-01-31 [1] CRAN (R 4.1.2)
#>  Rcpp                       1.0.11    2023-07-06 [1] CRAN (R 4.3.1)
#>  RCurl                      1.98-1.12 2023-03-27 [1] CRAN (R 4.2.2)
#>  reprex                     2.0.2     2022-08-17 [1] CRAN (R 4.2.1)
#>  reshape2                   1.4.4     2020-04-09 [1] CRAN (R 4.1.2)
#>  rlang                      1.1.1     2023-04-28 [1] CRAN (R 4.2.2)
#>  rmarkdown                  2.24      2023-08-14 [1] CRAN (R 4.3.1)
#>  RSQLite                    2.3.1     2023-04-03 [1] CRAN (R 4.2.2)
#>  rstudioapi                 0.15.0    2023-07-07 [1] CRAN (R 4.3.1)
#>  rsvd                       1.0.5     2021-04-16 [1] CRAN (R 4.1.2)
#>  S4Arrays                   1.0.5     2023-07-24 [1] Bioconductor
#>  S4Vectors                * 0.38.1    2023-05-02 [1] Bioconductor
#>  ScaledMatrix               1.8.1     2023-05-03 [1] Bioconductor
#>  scales                     1.2.1     2022-08-20 [1] CRAN (R 4.2.1)
#>  scater                     1.28.0    2023-04-25 [1] Bioconductor
#>  scuttle                    1.10.2    2023-08-03 [1] Bioconductor
#>  sessioninfo                1.2.2     2021-12-06 [1] CRAN (R 4.1.2)
#>  shiny                      1.7.5     2023-08-12 [1] CRAN (R 4.3.1)
#>  SingleCellExperiment     * 1.22.0    2023-04-25 [1] Bioconductor
#>  sparseMatrixStats          1.12.2    2023-07-02 [1] Bioconductor
#>  stringi                    1.7.12    2023-01-11 [1] CRAN (R 4.2.2)
#>  stringr                    1.5.0     2022-12-02 [1] CRAN (R 4.2.2)
#>  styler                     1.10.1    2023-06-05 [1] CRAN (R 4.3.1)
#>  SummarizedExperiment     * 1.30.2    2023-06-06 [1] Bioconductor
#>  tibble                     3.2.1     2023-03-20 [1] CRAN (R 4.2.2)
#>  tidyr                      1.3.0     2023-01-24 [1] CRAN (R 4.2.2)
#>  tidyselect                 1.2.0     2022-10-10 [1] CRAN (R 4.2.1)
#>  tidySummarizedExperiment * 1.11.1    2023-08-22 [1] Github (stemangiola/tidySummarizedExperiment@c798f5b)
#>  tidytree                   0.4.5     2023-08-10 [1] CRAN (R 4.3.1)
#>  treeio                     1.24.3    2023-07-24 [1] Bioconductor
#>  TreeSummarizedExperiment * 2.8.0     2023-04-25 [1] Bioconductor
#>  ttservice                * 0.3.6     2023-06-20 [1] RSPM (R 4.3.0)
#>  utf8                       1.2.3     2023-01-31 [1] CRAN (R 4.2.2)
#>  vctrs                      0.6.3     2023-06-14 [1] CRAN (R 4.3.0)
#>  vegan                      2.6-4     2022-10-11 [1] CRAN (R 4.2.1)
#>  vipor                      0.4.5     2017-03-22 [1] CRAN (R 4.1.2)
#>  viridis                    0.6.4     2023-07-22 [1] CRAN (R 4.3.1)
#>  viridisLite                0.4.2     2023-05-02 [1] CRAN (R 4.2.2)
#>  withr                      2.5.0     2022-03-03 [1] CRAN (R 4.1.2)
#>  xfun                       0.40      2023-08-09 [1] CRAN (R 4.3.1)
#>  xtable                     1.8-4     2019-04-21 [1] CRAN (R 4.1.2)
#>  XVector                  * 0.40.0    2023-04-25 [1] Bioconductor
#>  yaml                       2.3.7     2023-01-23 [1] CRAN (R 4.2.2)
#>  yulab.utils                0.0.7     2023-08-09 [1] CRAN (R 4.3.1)
#>  zlibbioc                   1.46.0    2023-04-25 [1] Bioconductor
#> 
#>  [1] /mnt/STORE1/lib/R/bioc-release
#>  [2] /usr/lib/R/site-library
#>  [3] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Created on 2023-08-22 with reprex v2.0.2

@stemangiola
Copy link
Owner

@csoneson I noticed that the error triggers if we have duplicated colnames, because of the intersect, length approach. Also, a warning of duplicated colnames would be great.

Can you please remind me if we did not decide at the end to convert to warning? Actually, being an error saved an analysis of mine. I am conflicted about what behaviours between error and warning would be best.

@csoneson
Copy link
Contributor Author

Hm, good point. I'm thinking that this is a bit different from the other situations and may warrant an error, for a couple of reasons:

  • What do we print if one assay has column names S1, S2, S3, and another has S1, S1, S3 - which of the S1 samples in the second assay would be matched to the S1 sample in the first assay? I don't see an obvious solution here other than making the column names of the first assay unique, and then none of them would overlap with the second assay.
  • This kind of situation seems to raise an error also e.g. in as_tibble(), via gather():
> as_tibble(se) 
Error in `map2()`:
ℹ In index: 1.
ℹ With name: mat.
Caused by error in `gather()`:
! Names must be unique.
✖ These names are duplicated:
  * "S1" at locations 1 and 2.

@stemangiola
Copy link
Owner

Yes unless we do clever masking for all assays this would be a problem.

So I think we can

  1. Create a duplicated names error check
  2. correct the mismatch check to allow for duplication, for keeping these two checks independent. For example the mismatch check triggers even if I have

assay_1: A, A, B, C
assay_2: A, A, B, C

Now there is no mismatch technically but there is duplication. So just the duplication test should trigger, but not the mismatch check.

@csoneson
Copy link
Contributor Author

I gave it a go here: #80
I had to change the logic a bit to deal with situations of the type:
assay_1: A, A, B, C
assay_2: A, B, B, C
in which case the intersection of the names is the same as the number of unique entries for each, but they are not in fact the same...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tidySE manage to_tibble if rownames not match, or are absent
3 participants