Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Packages breaking if useNames = FALSE becomes the new default #226

Closed
3 of 4 tasks
HenrikBengtsson opened this issue Nov 14, 2022 · 10 comments
Closed
3 of 4 tasks

Packages breaking if useNames = FALSE becomes the new default #226

HenrikBengtsson opened this issue Nov 14, 2022 · 10 comments
Milestone

Comments

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Nov 14, 2022

I'm going to make a decision on what the default for useNames will be. Based on the newly added Design Goals:

The objectives of the matrixStats package is to perform operations on matrices (i) as fast as possible, while (ii) not using unnecessary amounts of memory. These objectives drive the design, including the choice of the different defaults.

I'll make the new default useNames = FALSE.

Actions

  • Make useNames = FALSE the default everywhere
  • Deprecate useNames = NA
  • Run reverse-dependency check with useNames = FALSE on all 420+ CRAN and Bioconductor packages to identify which packages breaks and if so, which functions are involved
  • Work with revdep packages to update to explicitly specify useNames = TRUE, where needed
@HenrikBengtsson
Copy link
Owner Author

HenrikBengtsson commented Nov 15, 2022

Revdep results

Four packages break when switching to useNames = FALSE; one package is on CRAN and three on Bioconductor, cf. https://github.com/HenrikBengtsson/matrixStats/blob/dbd47cd722b42c538946173945740c8a8b9b130c/revdep/README.md.

CRAN:

Bioconductor:

  • MatrixGenerics
    • sparseMatrixStats
      • DelayedMatrixStats

Importantly, I have not run recursive revdep checks on the above Bioconductor package, so more things might break there, when those packages are update too.

EDIT 2022-11-18: Dropped packages Pigengene and SCFA, which both were false positives here.

@PeteHaitch
Copy link
Contributor

These should be okay to change in the BioC MatrixGenerics/sparseMatrixStats/DelayedMatrixStats stack because we try to mimic the matrixStats API as closely as possible.
Do you have a rough ETA on this change?
Only, I'm very busy for the rest of this month and unlikely to have time for any BioC package development/maintainence.

@HenrikBengtsson
Copy link
Owner Author

Thanks.

Regarding ETA: matrixStats 0.63.0 first needs to be on CRAN, then I'll make this move. FWIW, matrixStats 0.63.0 is currently in the CRAN incoming queue (unusually long this week), but I hope it's rolling out with 3-5 days.

Then, I need to get GenEst to update their package on CRAN too, before I can submit the useNames = FALSE version. If they don't respond within 2-3 weeks, I'll go ahead a push to CRAN anyway.

So, I think it'll be 3-4 weeks before you see the useNames = FALSE version, but I hope to get it out no later than mid-December.

@HenrikBengtsson
Copy link
Owner Author

HenrikBengtsson commented Nov 18, 2022

@PeteHaitch, I'm just updated the develop branch to use useNames = FALSE by default. Hopefully, this helps you test towards this upcoming version. To install it, use:

remotes::install_github("HenrikBengtsson/matrixStats", ref = "develop")

BTW, I also deprecated useNames = NA, so it gives a deprecatedWarning. Don't think that'll affect anything on your end. The goal is to defunct useNames = NA soon after the useNames = FALSE release is out.

PS. matrixStats 0.63.0 is now on CRAN, but that was just a minor release (NEWS), which shouldn't affect you at all.

@zeehio
Copy link

zeehio commented Nov 19, 2022

Hi Henrik,

Just following our twitter thread: https://twitter.com/henrikbengtsson/status/1593674832448614400?s=20&t=i3VT8eg5IYydevMC8gFfww

In spectrometry one typically can have a matrix with one spectra per row. The "mean spectra" is computed with colMeans(). Even if I have used R for more than a decade, I still doubt if rowMeans() returns a vector with "the mean of each row" or a vector with "the average row". I find embarrassing to admit that I often have to check it.

I've seen hidden bugs due to confusing rowMeans() with colMeans() or other similar row/col functions. Some are mine, some are in other people's code, possibly due to that same confusion I have. Those bugs would have been avoided if proper names had been set (and preserved) in the calculations. That's why I usually tell people to name things and check the names. Especially when teaching/learning, having good names by default helps students a lot.

I understand the performance impact of keeping track of names everywhere and I understand the goals of matrixStats. I'd say getting good performance comes after getting the right calculations, and it's easier to get things right first and then ignore names to make it faster, than get things fast but wrong and not having names to figure out why.

Knowing your HPC work, I fully understand that performance is a priority to you and to this package. I wish R was faster in keeping track of names so both goals would not conflict. So I will keep telling everyone to useNames=TRUE and explain its performance impact, as I once told them to use stringsAsFactors=FALSE, and I will wait for the day the performance impact is so low that useNames=FALSE doesn't pay off. (Submitting a patch to R fixing the cost would be great but I guess it's not that easy or it would have already been done).

Thanks for all the discussion and explanation,

Sergio

HenrikBengtsson added a commit that referenced this issue Nov 24, 2022
@PeteHaitch
Copy link
Contributor

@HenrikBengtsson I'm a bit confused between #226 and #227.
Is the plan for the next matrixStats release to make useNames = FALSE or useNames = TRUE for all functions?

Do you have an ETA for when you'd like to make that release?
FWIW I'm on summer break until January 10 and will be away from a computer most of that time.

@HenrikBengtsson
Copy link
Owner Author

@HenrikBengtsson I'm a bit confused between #226 and #227. Is the plan for the next matrixStats release to make useNames = FALSE or useNames = TRUE for all functions?

Yeah, you're in the rights to be confused. I created Issue #227 (useNames = TRUE) to see what would happen if that became the default, i.e. what packages would break with that default. Regardless what the default would be, it's worth reaching out to their maintainers to set useNames = FALSE (sic!) where they make that assumption. This would make their code more robust and remove part of the equation for us. I made a call for help to reach out to them there. So, I'm still procrastinating on the final decision, but want to clear the road ahead when the final decision is made.

What would be really useful is to have some serious benchmarking arguments from both sides, e.g.

library(matrixStats)
Xs <- lapply(1:10000, FUN = function(ii) {
  X <- matrix(rnorm(n = 16L), nrow = 4L, ncol = 4L)
  rownames(X) <- sample(letters, size = 4L)
  colnames(X) <- sample(LETTERS, size = 4L)
  X
})

stats <- bench::mark(lapply(Xs, FUN = rowSums2, useNames = FALSE), lapply(Xs, FUN = rowSums2, useNames = TRUE), check = FALSE, min_iterations = 100L)
print(stats)
## # A tibble: 2 × 13
##   expression                                        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time ## result memory               time             gc      
##   <bch:expr>                                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>               <list>           <list>  
## 1 lapply(Xs, FUN = rowSums2, useNames = FALSE)   32.9ms   34.2ms      28.8    78.2KB     25.5    53    47      1.84s <NULL> <Rprofmem [26 × 3]>  <bench_tm [100]> <tibble>
## 2 lapply(Xs, FUN = rowSums2, useNames = TRUE)    35.9ms   39.2ms      23.6    78.2KB     25.6    48    52      2.03s <NULL> <Rprofmem [102 × 3]> <bench_tm [100]> <tibble>

The more such use cases, the better. This would provide things to discuss.

Do you have an ETA for when you'd like to make that release? FWIW I'm on summer break until January 10 and will be away from a computer most of that time.

I won't make any move on this before the end of January at the earliest. It would be good to settle on this in good time before the next Bioconductor release. But importantly, make sure to enjoy your break and don't worry about this.

@PeteHaitch
Copy link
Contributor

I won't make any move on this before the end of January at the earliest. It would be good to settle on this in good time before the next Bioconductor release. But importantly, make sure to enjoy your break and don't worry about this.

Thanks, Henrik, I hope you have an enjoyable break over the Christmas and New Year.

const-ae added a commit to const-ae/MatrixGenerics that referenced this issue Dec 29, 2022
This addresses HenrikBengtsson/matrixStats#226
The change is a simple search and replace.
const-ae added a commit to const-ae/sparseMatrixStats that referenced this issue Dec 29, 2022
This addresses HenrikBengtsson/matrixStats#226
The change is a simple search and replace.
@const-ae
Copy link
Contributor

Hey, I created a new branch in sparseMatrixStats and MatrixGenerics called 'change_useNames_default', which implement useNames = FALSE. It is trivial to change the default using a simple search and replace.

So if you decide to go ahead with the change, we can just merge the branch to avoid disruption to the end users :)

@HenrikBengtsson HenrikBengtsson changed the title Make useNames = FALSE the new default Packages breaking if useNames = FALSE becomes the new default May 23, 2023
@HenrikBengtsson
Copy link
Owner Author

Closing. See Issue #232 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants