Implement DT[, across(.SD, fun1, fun2, fun3), by=group] #4970

mattdowle · 2021-04-29T01:18:25Z

Inspired by dplyr::across and triggered by JuliaData/DataFrames.jl#2725 (comment)

Instead of :

DT[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), by=group]

it could be

DT[, across(.SD, min, max), by=group]

I didn't find any related issues or PRs in a quick search. If there are any, and S.O. questions, please link them here.


avimallu · 2021-04-29T08:58:21Z

My 2￠:
An across that works only with .SD, as proposed, would already be useful, but I think it would be more useful still if across also accepted a character vector or patterns() in addition to .SD, giving it the flexibility that dplyr::across allows. A dplyr example:

df %>%
  group_by(g1, g2) %>% 
  summarise(
    across(where(is.numeric), mean), 
    across(where(is.factor), nlevels),
    n = n(), 
  )

Where is this useful?
Creating different summary statistics (mean, median, unique counts, row counts, % of any x over y) in one call, instead of creating those columns separately and then joining or chaining. Something that gives good flexibility (like lapply and .SD currently do):

as.data.table(Lahman::Batting)[, .(
  across(patterns("^R$|(X.B)|HR"), .(sum, mean)),
  across(c("stint", "teamID"), .(last, uniqueN)),
  across(.SD, .(uniqueN, \(x) sum(x)/uniqueN(yearID)))),
  by = playerID,
  .SDcols = c("R", "IBB", "SO")]

My understanding of baseball is a little rusty. What I was aiming for was to create a single table by player that gives me

the sum and mean of runs, doubles, triples and home runs, followed by
the last and unique count of stints and teams played for, followed by
the number of runs, intentional walks and strikeouts per year

all in one shot. I often have to revert to summarizing my data in Excel or, if the data is too big, calculate all these independently in R and do a join at the end. In this specific case, we could even employ the proposed mergelist #4370 to create these efficiently, perhaps in parallel? It'll be amazing to have the ability to create this in R.
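For comparison, here is a minimal sketch of the workaround described above: each summary is computed separately and the results are joined at the end. The table, its column names and its values are made up for illustration (it is not Lahman::Batting), and the block only runs if data.table is installed.

```r
# Sketch of the current "summarise separately, then join" workaround.
# The toy data below is an assumption made for illustration only.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  DT <- data.table(
    id   = c("a", "a", "b"),
    R    = c(1, 2, 3),
    SO   = c(4, 5, 6),
    team = c("x", "y", "x")
  )

  # First summary: sums of the numeric columns per id.
  sums <- DT[, lapply(.SD, sum), by = id, .SDcols = c("R", "SO")]
  setnames(sums, c("R", "SO"), c("R_sum", "SO_sum"))

  # Second summary: number of distinct teams per id.
  teams <- DT[, .(n_teams = uniqueN(team)), by = id]

  # Join the two summaries back together on the grouping column.
  res <- merge(sums, teams, by = "id")
  print(res)
}
```

An across that accepts several column selections in one j expression would, in principle, replace the two separate queries and the merge with a single call.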

P.S. - I realize that .SD provides a list, while the others provide character vectors. I was aiming for more flexibility as opposed to using only .SD. It might also work if there was a way to split .SD into .SD_1, .SD_2 etc, but that would probably be a bit much.

myoung3 · 2021-07-01T18:35:27Z

Hi @mattdowle, I think what you're proposing here is closely related to what's being discussed in #1063, specifically the discussion around "colwise". #1063 groups together row operations and column operations into one issue, but they seem pretty separate to me (and I think rowwise operations would be better solved by implementing more functions like pmin/pmax as you suggested for psum in #3467).

I also like your suggestion of across (rather than colwise), and the proposed syntax since it will be familiar to dplyr users. It seems like you're suggesting the second argument be "..." to take an arbitrary number of functions, but I think it would be better if the second argument took a single function or a list like dplyr across (https://dplyr.tidyverse.org/reference/across.html).

Before we can implement across, I think we should solve #2311 by merging my PR #4883, since this addresses how columns are named in this situation.

Once #4883 is merged, we could just implement across so that it expands into several lapply(.SD,) calls concatenated together with c(). This will ensure GForce optimization is used without any additional work. E.g:

x <- data.table(a=1:3,b=1:3)
x[, across(.SD, list(min=min, max=max))]

would just internally be expanded to

x[, c(min=lapply(.SD, min), max=lapply(.SD, max))]

and the resulting column names would be c("min.a", "min.b", "max.a", "max.b") which is consistent with base R and how naming is implemented in #4883.
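The naming rule referred to here can be checked in base R alone, and the expansion can be approximated at user level. The sketch below is only an illustration: across_sketch is a hypothetical helper, not a data.table function, and because it is an ordinary function call in j it would not benefit from the GForce optimization that the proposed internal expansion is meant to preserve. The data.table part only runs if the package is installed.

```r
# Base R: c() on named lists prefixes each element name with the outer tag.
combined <- c(min = list(a = 1L, b = 1L), max = list(a = 3L, b = 3L))
names(combined)

# Hypothetical user-level stand-in for the proposed across():
# apply every function in `funs` to every column of `x` and flatten one level,
# which yields the same "<fun>.<col>" names as the c() call above.
across_sketch <- function(x, funs) {
  unlist(lapply(funs, function(f) lapply(x, f)), recursive = FALSE)
}

# Works on any list of columns, e.g. a plain data.frame.
flat <- across_sketch(data.frame(a = 1:3, b = 1:3), list(min = min, max = max))
names(flat)

# Usage inside j (not GForce-optimized, since across_sketch is opaque to data.table).
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  x <- data.table(g = c("u", "u", "v"), a = 1:3, b = 1:3)
  print(x[, across_sketch(.SD, list(min = min, max = max)), by = g])
}
```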

An outstanding question is how we might name columns when functions are not explicitly named:

x[, across(.SD, list(min, max))]

Interactively it would be convenient for them to be c("min.a", "min.b", "max.a", "max.b") without explicitly tagging each function, but this might break down when the list of functions is specified programmatically, so perhaps c("F1.a", "F1.b", "F2.a", "F2.b") is more predictable.

myoung3 · 2021-07-01T18:45:15Z

Also see this discussion with Hadley (tidyverse/dtplyr#173) on translating dplyr::across to data.table syntax for the dtplyr package.

Note that dplyr's across allows arbitrary specification of how the function name and input column names are combined to name the output columns (specifying both order and separator), but I'm not sure that's a road we want to go down (see the .names argument here: https://dplyr.tidyverse.org/reference/across.html). Sticking with base R's naming behavior (e.g. c(A=list(a=3,2), B=list(a=1,b=2)) ) will be much easier to maintain, since naming in across will just rely directly on how naming works for x[, c(A=lapply(.SD), B=lapply(.SD))] (once #4883 is merged) without any additional magic code unique to across.

r2evans · 2024-11-17T01:13:53Z

Any thought to reinvigorating this? #4883 was merged a couple of months ago. The code there is a little more detailed than I want to dive in on to try to implement this myself (at least, not at this moment).

BTW, I further suggest the first argument should default to the current .SD, such as

across = function(x = .SD, funs) {
}

As for the notion of funs = list(min, max), I suggest one of two paths:

require non-empty names, fail if is.null(names(funs)) || any(!nzchar(names(funs))); or
for empty names, use V1, V2, and/or other counting names.

Part of me wants to go with the first option to remove any/all ambiguity, but I'm conscious of interactive usability.
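The two paths can be sketched as a single helper with a switch. This is only an illustration under stated assumptions: resolve_fun_names, its strict argument and the "V" prefix are hypothetical names chosen here, not anything in data.table.

```r
# Hypothetical helper illustrating the two naming policies for `funs`.
# strict = TRUE : fail if any function is unnamed (first option).
# strict = FALSE: fill empty names with positional counters V1, V2, ... (second option).
resolve_fun_names <- function(funs, strict = TRUE, prefix = "V") {
  nms <- names(funs)
  if (is.null(nms)) nms <- rep("", length(funs))
  empty <- is.na(nms) | !nzchar(nms)
  if (strict && any(empty)) {
    stop("all functions passed to across() must be named")
  }
  # Counters follow the position in `funs`, so names stay stable
  # when only some of the functions are tagged.
  nms[empty] <- paste0(prefix, which(empty))
  names(funs) <- nms
  funs
}

names(resolve_fun_names(list(min, max), strict = FALSE))
names(resolve_fun_names(list(lo = min, max), strict = FALSE))
```

With strict = TRUE the same calls stop with an error, which removes the ambiguity at the cost of more typing in interactive use.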

jangorecki added the feature request label Apr 29, 2021

avimallu mentioned this issue May 22, 2021

Passing named lists to .SDcols / .SD #5020

Open

grantmcdermott mentioned this issue Sep 1, 2021

Fix naming in j=c() under by= queries with lapply() optimization #4883

Merged

avimallu mentioned this issue Feb 11, 2022

by causes column names to be repeated #5329

Open

MichaelChirico mentioned this issue Mar 16, 2024

Master list of most-requested issues #3189

Open

75 tasks

MichaelChirico added the top request label Apr 14, 2024
