Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace dplyr use with data.table (or vice versa) #246

Closed
athowes opened this issue Aug 14, 2024 · 8 comments · Fixed by #329
Closed

Replace dplyr use with data.table (or vice versa) #246

athowes opened this issue Aug 14, 2024 · 8 comments · Fixed by #329
Labels
medium Nice to have for next release

Comments

@athowes
Copy link
Collaborator

athowes commented Aug 14, 2024

I am not a data.table native and so when under pressure default to using dplyr. I think this is fine to get things done for the first release of the package, but after that I think I should spend some time more properly learning data.table and going back over the places I've used dplyr in this package and replacing them. Reasons to do this? It's faster I think. And perhaps is one less dependency. Arguably it's not a good use of my time to do this, but learning data.table seems useful anyway.

@athowes athowes added the low For a future release label Aug 14, 2024
@pearsonca
Copy link

I'm DT native, so happy to tagged on discussions of "how do I ..." / review PRs aimed in this direction / etc.

@seabbs
Copy link
Contributor

seabbs commented Aug 15, 2024

I just checked out dependencies and whilst brms doesn't depend on dplyr bayesplot which brms depends on does so adding/removing dplyr doesn't impact the dependency graph.

That being said I strongly think we want to only have one design language when it comes to data munging and as everywhere else that is data.table it feels like it should remain so.

We could discuss porting the whole package to dplyr potentially. I am not a massive func as I have struggled with maintaining dplyr based packages in the past but. Ithink things are now more stable

@athowes
Copy link
Collaborator Author

athowes commented Aug 16, 2024

As well as in #242 (comment), i.e.

prior <- dplyr::full_join(...)

I've also been using dplyr in the vignettes.

@seabbs seabbs changed the title Replace dplyr use with data.table Replace dplyr use with data.table (or vice versa) Aug 21, 2024
@athowes
Copy link
Collaborator Author

athowes commented Sep 4, 2024

I am happy to either try to learn data.table and move things to data.table or to try moving everything to dplyr which I already know well.

I don't have a great sense of the other aspects of this choice. I've not maintained packages using dplyr or data.table before.

If there aren't objections I am personally happy enough with dplyras it's what I already know.

I just checked out dependencies and whilst brms doesn't depend on dplyr bayesplot which brms depends on does so adding/removing dplyr doesn't impact the dependency graph.

This pushes more towards dplyr OK as as far as I understood one main benefit of data.table was the low dependencies.

@seabbs
Copy link
Contributor

seabbs commented Sep 4, 2024

Yes, that and stability. Having said that I think dplyr is much more stable now. As this package is very much moving towards a cog in a wider tidy ecosystem I think its a good call to switch over to dplyr as much as possible and drop data.table.

@pearsonca thoughts?

FYI I see this as needing to be solved ahead of 0.1.0 so as to reduce ongoing technical debt

@athowes athowes added medium Nice to have for next release and removed low For a future release labels Sep 4, 2024
@athowes athowes mentioned this issue Sep 4, 2024
9 tasks
@pearsonca
Copy link

I'm generally anti-tidyverse. Too tight knit - pretend modular, not actually modular - though I'll agree they are doing better at coupling (and that as a whole system, does fine - I just prefer less vendor lock in).

For a moment considering false dichotomy problem: how fancy is the munging anyway? Would it just be fine in base R? Cursory glance suggests that the only in code use is join, filter, and select - that could be done approximately with merge, subset, and some sort of manual select. Those are all manageable in way that works portably for data.frame, data.table, and tibble.

Leaving it in the vignette is fine - probably desirable for the audience, and just becomes a "suggests" which is better posture wise (even if there is a transient dependency due to other packages - maybe they'll get better at some future time).

@seabbs
Copy link
Contributor

seabbs commented Sep 4, 2024

I think those are reasonable points @pearsonca but I thing the missing piece is what system is going to be easiest to maintain for the active contributors. Historically that has been data.table but as @athowes is now the maintainer (and seems more comfortable with dplyr since it keeps slipping in) and other new contributors this is also likely to be the case.

The counter point that is strongest to me is that most of the code is data.tablee, its performant easy to maintain and fairly readable. To me the base arg doesn't really have that

@pearsonca
Copy link

I generally use the merge syntax in base over the full join data.table syntax - too fiddly. subset(dt, ...filter...) is definitely as easy as dt[...filter...]. Both of those operations are provided S3 generics by data.table and tibble, so they are performant and type preserving. Making the select portable is about the only thing that can be complicated last I checked, but dt might have revised this to less data.frame-paradigm breaking.

I think there's a bit of balance to be struck between easiest-for-active maintainers vs other engineering objectives (e.g. easiest for new maintainers [which might also cut towards tidyverse], reducing direct vs encapsulated dependencies).

seabbs pushed a commit that referenced this issue Sep 18, 2024
* Remove data.table from simulate.R

* Remove data.table from postprocess.R

* Remove data.table from observe.R aside from filter_obs_by_ptime

* Remove data.table from roxygen in latent_individual.R

* Rewrite as_latent_individual to use dplyr

* Rebase

* Use dplyr to subsample

* Using dplyr in epidist vignette

* Need to round here!

* Use dplyr in epidist_diagnostics

* Use dplyr in filter_obs_by_ptime

* Altering creation of latent_individual to correctly work with dplyr (and uncovering bugs here that existed before)

* Working through getting tests to pass

* Two fixes to epidist vignette

* Update FAQ vignette away from dt

* Update approximate inference vignette

* Remove call to arrange which creates bug (add issue for this)

* Remove data.table from Imports

* Rebase

* Rebase

* Lint

* Fix to logo and improve a little

* data.frame not data.table

* Hexsticker fixes

* Don't library data.table

* Removing final uses of data.table using find in files...

* Import runif

* Remove another mention of data.table

* Use dplyr in cmdstan check

* Need to use , with data.frame

* Fix to index being a factor issue

* Perhaps this is needed?

* Typo

* Regenerate globals and namespace

* Remove excess imports in line with R packages (2e) recommendations

* Need higher version of R for data importing

* Missed some stats:: qualifiers

* Remove old test code

* Somethings break when I don't import all of brms (because I'm not importing Stan). I think the saved data

* Run document

* Skip tests with fit on CRAN

* Revert import all of brms

* Resize logo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
medium Nice to have for next release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants