Replace `dplyr` use with `data.table` (or vice versa) #246

athowes · 2024-08-14T10:42:22Z

I am not a data.table native and so when under pressure default to using dplyr. I think this is fine to get things done for the first release of the package, but after that I think I should spend some time more properly learning data.table and going back over the places I've used dplyr in this package and replacing them. Reasons to do this? It's faster I think. And perhaps is one less dependency. Arguably it's not a good use of my time to do this, but learning data.table seems useful anyway.

The text was updated successfully, but these errors were encountered:

pearsonca · 2024-08-14T11:22:05Z

I'm DT native, so happy to tagged on discussions of "how do I ..." / review PRs aimed in this direction / etc.

seabbs · 2024-08-15T10:28:26Z

I just checked out dependencies and whilst brms doesn't depend on dplyr bayesplot which brms depends on does so adding/removing dplyr doesn't impact the dependency graph.

That being said I strongly think we want to only have one design language when it comes to data munging and as everywhere else that is data.table it feels like it should remain so.

We could discuss porting the whole package to dplyr potentially. I am not a massive func as I have struggled with maintaining dplyr based packages in the past but. Ithink things are now more stable

athowes · 2024-08-16T08:26:29Z

As well as in #242 (comment), i.e.

prior <- dplyr::full_join(...)

I've also been using dplyr in the vignettes.

athowes · 2024-09-04T09:39:26Z

I am happy to either try to learn data.table and move things to data.table or to try moving everything to dplyr which I already know well.

I don't have a great sense of the other aspects of this choice. I've not maintained packages using dplyr or data.table before.

If there aren't objections I am personally happy enough with dplyras it's what I already know.

I just checked out dependencies and whilst brms doesn't depend on dplyr bayesplot which brms depends on does so adding/removing dplyr doesn't impact the dependency graph.

This pushes more towards dplyr OK as as far as I understood one main benefit of data.table was the low dependencies.

seabbs · 2024-09-04T09:42:42Z

Yes, that and stability. Having said that I think dplyr is much more stable now. As this package is very much moving towards a cog in a wider tidy ecosystem I think its a good call to switch over to dplyr as much as possible and drop data.table.

@pearsonca thoughts?

FYI I see this as needing to be solved ahead of 0.1.0 so as to reduce ongoing technical debt

pearsonca · 2024-09-04T14:43:32Z

I'm generally anti-tidyverse. Too tight knit - pretend modular, not actually modular - though I'll agree they are doing better at coupling (and that as a whole system, does fine - I just prefer less vendor lock in).

For a moment considering false dichotomy problem: how fancy is the munging anyway? Would it just be fine in base R? Cursory glance suggests that the only in code use is join, filter, and select - that could be done approximately with merge, subset, and some sort of manual select. Those are all manageable in way that works portably for data.frame, data.table, and tibble.

Leaving it in the vignette is fine - probably desirable for the audience, and just becomes a "suggests" which is better posture wise (even if there is a transient dependency due to other packages - maybe they'll get better at some future time).

seabbs · 2024-09-04T15:14:41Z

I think those are reasonable points @pearsonca but I thing the missing piece is what system is going to be easiest to maintain for the active contributors. Historically that has been data.table but as @athowes is now the maintainer (and seems more comfortable with dplyr since it keeps slipping in) and other new contributors this is also likely to be the case.

The counter point that is strongest to me is that most of the code is data.tablee, its performant easy to maintain and fairly readable. To me the base arg doesn't really have that

pearsonca · 2024-09-04T15:22:38Z

I generally use the merge syntax in base over the full join data.table syntax - too fiddly. subset(dt, ...filter...) is definitely as easy as dt[...filter...]. Both of those operations are provided S3 generics by data.table and tibble, so they are performant and type preserving. Making the select portable is about the only thing that can be complicated last I checked, but dt might have revised this to less data.frame-paradigm breaking.

I think there's a bit of balance to be struck between easiest-for-active maintainers vs other engineering objectives (e.g. easiest for new maintainers [which might also cut towards tidyverse], reducing direct vs encapsulated dependencies).

* Remove data.table from simulate.R * Remove data.table from postprocess.R * Remove data.table from observe.R aside from filter_obs_by_ptime * Remove data.table from roxygen in latent_individual.R * Rewrite as_latent_individual to use dplyr * Rebase * Use dplyr to subsample * Using dplyr in epidist vignette * Need to round here! * Use dplyr in epidist_diagnostics * Use dplyr in filter_obs_by_ptime * Altering creation of latent_individual to correctly work with dplyr (and uncovering bugs here that existed before) * Working through getting tests to pass * Two fixes to epidist vignette * Update FAQ vignette away from dt * Update approximate inference vignette * Remove call to arrange which creates bug (add issue for this) * Remove data.table from Imports * Rebase * Rebase * Lint * Fix to logo and improve a little * data.frame not data.table * Hexsticker fixes * Don't library data.table * Removing final uses of data.table using find in files... * Import runif * Remove another mention of data.table * Use dplyr in cmdstan check * Need to use , with data.frame * Fix to index being a factor issue * Perhaps this is needed? * Typo * Regenerate globals and namespace * Remove excess imports in line with R packages (2e) recommendations * Need higher version of R for data importing * Missed some stats:: qualifiers * Remove old test code * Somethings break when I don't import all of brms (because I'm not importing Stan). I think the saved data * Run document * Skip tests with fit on CRAN * Revert import all of brms * Resize logo

athowes added the low For a future release label Aug 14, 2024

athowes mentioned this issue Aug 14, 2024

Issue 234: Allow prior passing and overwrite functionality #242

Merged

9 tasks

seabbs changed the title ~~Replace dplyr use with data.table~~ Replace dplyr use with data.table (or vice versa) Aug 21, 2024

athowes added medium Nice to have for next release and removed low For a future release labels Sep 4, 2024

athowes mentioned this issue Sep 4, 2024

Issue 48: Ebola vignette #52

Merged

9 tasks

athowes mentioned this issue Sep 13, 2024

Issue #246: dplyr over data.table #329

Merged

9 tasks

seabbs closed this as completed in #329 Sep 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace `dplyr` use with `data.table` (or vice versa) #246

Replace `dplyr` use with `data.table` (or vice versa) #246

athowes commented Aug 14, 2024

pearsonca commented Aug 14, 2024

seabbs commented Aug 15, 2024

athowes commented Aug 16, 2024 •

edited

Loading

athowes commented Sep 4, 2024

seabbs commented Sep 4, 2024 •

edited

Loading

pearsonca commented Sep 4, 2024

seabbs commented Sep 4, 2024

pearsonca commented Sep 4, 2024

Replace dplyr use with data.table (or vice versa) #246

Replace dplyr use with data.table (or vice versa) #246

Comments

athowes commented Aug 14, 2024

pearsonca commented Aug 14, 2024

seabbs commented Aug 15, 2024

athowes commented Aug 16, 2024 • edited Loading

athowes commented Sep 4, 2024

seabbs commented Sep 4, 2024 • edited Loading

pearsonca commented Sep 4, 2024

seabbs commented Sep 4, 2024

pearsonca commented Sep 4, 2024

Replace `dplyr` use with `data.table` (or vice versa) #246

Replace `dplyr` use with `data.table` (or vice versa) #246

athowes commented Aug 16, 2024 •

edited

Loading

seabbs commented Sep 4, 2024 •

edited

Loading