Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DT[, across(.SD, fun1, fun2, fun3), by=group] #4970

Open
mattdowle opened this issue Apr 29, 2021 · 4 comments
Open

Implement DT[, across(.SD, fun1, fun2, fun3), by=group] #4970

mattdowle opened this issue Apr 29, 2021 · 4 comments
Labels
feature request top request One of our most-requested issues

Comments

@mattdowle
Copy link
Member

Inspired by dplyr::across and triggered by JuliaData/DataFrames.jl#2725 (comment)

Instead of :

DT[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), by=group]

it could be

DT[, across(.SD, min, max), by=group]

I didn't find any related issues or PRs in a quick search. If there are any, and S.O. questions, please link them here.

@avimallu
Copy link
Contributor

My 2¢:
While having across implemented similarly to dplyr::across will have utility if implemented only with .SD, I think its utility can be increased further if we could get across to work with a character, patterns or .SD to allow it to work the way dplyr::across allows. Example from here:

df %>%
  group_by(g1, g2) %>% 
  summarise(
    across(where(is.numeric), mean), 
    across(where(is.factor), nlevels),
    n = n(), 
  )

Where is this useful?
Creating different summary statistics (mean, median, unique counts, count of rows, % of any x over y) instead of doing a join after creating those columns, or chaining. Something that gives good flexibility (like a lapply and .SD currently do):

as.data.table(Lahman::Batting)[, .(
  across(patterns("$R^|(X.B)|HR"), .(sum, mean)),
  across(c("stint", "teamID"), .(last, uniqueN)),
  across(.SD, .(uniqueN, \(x) sum(x)/uniqueN(yearID))),
  playerID,
  .SDcols = c("R", "IBB", "SO")]

My understanding of baseball is a little rusty. What I was aiming for was to create a single table by payer that gives me

  1. the sum and mean of runs, doubles, triples and home-runs, followed by
  2. the last and unique count of stints and teams player for, followed by
  3. the number of runs, international walks and strikeouts per year

all in one shot. I often have to revert to summarizing my data in Excel or, if the data is too big, calculate all these independently in R and do a join at the end. In this specific case, we could even employ the proposed mergelist #4370 to create these efficiently, perhaps in parallel? It'll be amazing to have the ability to create this in R.

P.S. - I realize that .SD provides a list, while the others provide character vectors. I was aiming for more flexibility as opposed to using only .SD. It might also work if there was a way to split .SD into .SD_1, .SD_2 etc, but that would probably be a bit much.

@myoung3
Copy link
Contributor

myoung3 commented Jul 1, 2021

Hi @mattdowle, I think what you're proposing here is closely related to what's being discussed in #1063, specifically the discussion around "colwise". #1063 groups together row operations and column operations into one issue, but they seem pretty separate to me (and I think rowwise operations would be better solved by implementing more functions like pmin/pmax as you suggested for psum in #3467).

I also like your suggestion of across (rather than colwise), and the proposed syntax since it will be familiar to dplyr users. It seems like you're suggesting the second argument be "..." to take an arbitrary number of functions, but I think it would be better if the second argument took a single function or a list like dplyr across (https://dplyr.tidyverse.org/reference/across.html).

Before we can implement across, I think we should solve #2311 by merging my PR #4883, since this addresses how columns are named in this situation.

Once #4883 is merged, we could just implement across so that it expands into several lapply(.SD,) calls concatenated together with c(). This will ensure GForce optimization is used without any additional work. E.g:

x <- data.table(a=1:3,b=1:3)
x[, across(.SD, list(min=min, max=max)]  

would just internally be expanded to

x[, c(min=lapply(.SD, min), max=lapply(.SD, max))]

and the resulting column names would be c("min.a", "min.b", "max.a", "max.b") which is consistent with base R and how naming is implemented in #4883.

An outstanding question is how we might name columns when functions are not explicitly named:

x[, across(.SD, list(min, max)] 

Interactively it would be convenient for them to be c("min.a", "min.b", "max.a", "max.b") without explicitly tagging each function, but this might break down when the list of functions is specified programmatically so perhapsc("F1.a", "F1.b", "F2.a", "F2.b") is more predictable.

@myoung3
Copy link
Contributor

myoung3 commented Jul 1, 2021

Also see this discussion with Hadley (tidyverse/dtplyr#173) on translating dplyr::across to data.table syntax for the dtplyr package.

Note that the dplyr across allows arbitrary specification of how the function name and input column names are combined to determine how the output columns are named (specifying both order and the separator) but I'm not sure that's a road we want to go down (see the names argument here: https://dplyr.tidyverse.org/reference/across.html) . Sticking with base R's naming behavior (e.g. c(A=list(a=3,2), B=list(a=1,b=2)) ) will be much easier to maintain since naming in across will just rely directly on how naming works for x[, c(A=lapply(.SD), B=lapply(.SD))] (once #4883 is merged) without any additional magic code unique to across.

@r2evans
Copy link
Contributor

r2evans commented Nov 17, 2024

Any thought to reinvigorating this? #4883 was merged a couple of months ago. The code there is a little more detailed than I want to dive in on to try to implement this myself (at least, not at this moment).

BTW, I further suggest the first argument should default to the current .SD, such as

across = function(x = .SD, funs) {
}

As for the notion of funs = list(min, max), I suggest one of two paths:

  • require non-empty names, fail if is.null(names(funs)) || any(!nzchar(names(funs))); or
  • for empty names, use V1, V2, and/or other counting names.

Part of me wants to go with the first option to remove any/all ambiguity, but I'm conscious of interactive usability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request top request One of our most-requested issues
Projects
None yet
Development

No branches or pull requests

6 participants