Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a 'having' parameter to [.data.table #788

Open
nigmastar opened this issue Aug 29, 2014 · 28 comments
Open

add a 'having' parameter to [.data.table #788

nigmastar opened this issue Aug 29, 2014 · 28 comments
Labels
feature request top request One of our most-requested issues

Comments

@nigmastar
Copy link

Currently, to have the equivalent (or something similar) of the SQL having clause you need to write a [.data.table first using by and then feed the result into the i parameter of a second [.data.table, like in:

dt <- data.table(id   = rep(1:2, each = 2),
                 var  = c(0.2, 0.5, 1.5, 1.3))

dt[dt[, mean(var) > 1, by = id]$id]
   id var
1:  2 1.5
2:  2 1.3

Another option is to use conditional statement inside j, very powerful, I do all the time, and so far there is nothing that the current syntax did not allow me to do. However having a having parameter I believe will allow writing much more clear and readable codes. For example the above can be written as:

dt[, if(mean(var) > 1) .SD, by = id]

What I propose is something like:

dt[, .SD, by = id, having = mean(var) > 1]

The idea is to have an expression that always evaluates to a logical of length 1 which would tell whether or not j has to be evaluated for the current group.

Thanks,
Michele

@arunsrinivasan
Copy link
Member

Great FR. I've been pondering about this use case for quite a while as well. We can do this without the additional argument like so:

dt[, .SD[mean(var)>1], by=id]

(But for speed, this'll need optimisation of .SD[.] internally - #735.)

It's most likely this case that we resort to .I instead:

dt[dt[, .I[mean(var) > 1], by=id]$V1]

And it'd be great to get this directly (even better if we can achieve it without having) - maybe if j expression evaluates to a 1-column logical vector? Just thinking out loud.

@nigmastar
Copy link
Author

Hi Arun. Thanks for the answer. Once optimisation of .SD will be available then this would be just a matter of 'taste' in terms of what is clearer to read between:

dt[, .SD[mean(var)>1], by=id]

and

dt[, .SD, by = id, having = mean(var) > 1]

Even though the second may also have a better appeal for people coming from other languages (SQL in particular). But again, this might be just my opinion. Maybe I just using SQL too much in the last period (lol).

@eantonya
Copy link
Contributor

eantonya commented Sep 2, 2014

As far as the taste part goes - I really dislike adding an extra param, when it can be accomplished with simple and standard syntax (i.e. the first option above).

@nigmastar
Copy link
Author

Curious. I was sure that you were that one most likely to appreciate this :-) (considering how much you wanted to eliminate by-without-by, mainly to improve readability, especially for people coming from other languages, if I remember correctly). Anyway, I know the two are quite different scenarios. I just wanted to share my point of view,:

  • I'm pretty sure it will be more readable for anyone not very familiar with R (or just data.table)
  • 15 vs 14 (current) parameters of [.data.table doesn't really harm
  • You won't be forced to use it and it won't break any code. It will be an expression that, if provided, will possibly skip the execution of j for a particular group

@eantonya
Copy link
Contributor

eantonya commented Sep 2, 2014

The reasons I didn't like the silent by-without-by and "having" are actually the same - I don't like to remember extra stuff, whether it be extra params or extra strange behavior.

I would argue that the first expression you wrote is much easier to read, because you don't have to keep reading the line, then discover that some new param is specified, and have to go back to the beginning of the sentence and reevaluate your mental model of what's going on.

@jangorecki
Copy link
Member

What do you think of not adding having arg to [, but turning it into having() function and making it works in i argument, the same as order() works, e.g.:

dt[ having(var > 1), .(var = mean(var)), by = id ]
# would perform below without additional copy:
dt[, .(var = mean(var)), by = id ][ var > 1 ]

having would be a function to evaluate its argument in frame of dt and provide filtering to i.

@franknarf1
Copy link
Contributor

I think this FR is closely tied with #1269 "Returning only groups." I often want to get groups with some attribute and store them in a vector, like my_teams in this SO post. Here's the relevant line:

my_teams <- FantasyTeams[, max(table(Team)) <= 3, by=team_no][(V1)]$team_no
# or 
my_teams <- FantasyTeams[, if ( max(table(Team)) <= 3 ) 1, by=team_no]$team_no

With the having and "Returning only groups" FRs, this could be something like

my_teams <- FantasyTeams[, .(), by = team_no, having = { max(table(Team)) <= 3 }]$team_no

The code is just as long, but I prefer it, so I don't have to read j carefully to understand the objective.

@franknarf1
Copy link
Contributor

franknarf1 commented Mar 29, 2016

Another example from SO. The goal is to overwrite the Value column with 3L if some by-group level condition holds:

DT = setDT(structure(list(Ind = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), ID = c("A", 
"A", "A", "A", "B", "B", "B", "B"), RegionStart = c(1L, 101L, 
1L, 101L, 1L, 101L, 1L, 101L), RegionEnd = c(100L, 200L, 100L, 
200L, 100L, 200L, 100L, 200L), Value = c(3L, 2L, 3L, 2L, 3L, 
2L, 5L, 5L), TN = c("N", "N", "T", "T", "N", "N", "T", "T")), .Names = c("Ind", 
"ID", "RegionStart", "RegionEnd", "Value", "TN"), row.names = c(NA, 
-8L), class = "data.frame"))

# current syntax 
DT[, Value := { 
  fixit = ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L )
  if (fixit) 3L else Value
}, by=.(ID, RegionStart)]

# with "having"
DT[,
  Value := 3L
, by=.(ID, RegionStart)
, having={ ( Value[TN=="N"] != 3L ) & ( n_distinct(Value) == 1L ) }]

Besides arguably nicer syntax, I guess the having= way might also be more efficient, since only a subset of by-groups need to be modified. The most efficient way without having= probably looks like...

myeyes = DT[, .I[ ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L )], by=.(ID, RegionStart)]$V1
DT[ myeyes, Value := 3L]

# or 

mygs = DT[, ( Value[TN=="N"] != 3L ) & ( uniqueN(Value) == 1L ), by=.(ID, RegionStart)][(V1)][, V1 := NULL]
DT[ mygs, Value := 3L, on=names(mygs)]

which are rather convoluted.

Edit: And another example to update if/when this feature is available: http://stackoverflow.com/q/36292702
(2016/4/26:) http://stackoverflow.com/q/36869784
(2016/06/16:) http://stackoverflow.com/q/37855013/

@franknarf1
Copy link
Contributor

franknarf1 commented May 27, 2016

Another example from SO. It could be used to select strictly unique rows (related to #1163 ):

DT = setDT(structure(list(id = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 
2, 3, 4), dsp = c(5, 6, 7, 8, 6, 6, 7, 8, 5, 6, 9, 8, 5, 6, 7, 
NA), status = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, 
TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)), .Names = c("id", 
"dsp", "status"), row.names = c(NA, -16L), class = "data.frame"))

# my current way to select "strictly unique" rows
Bigdt[, .N, by=names(Bigdt)][N == 1][, N := NULL][]

# could be...
Bigdt[, .SD, by=names(Bigdt), having ={.N == 1L}]

Note that Bigdt[, if (.N == 1L) .SD, by=names(Bigdt)] does not work here, since .SD is empty. Maybe that could be helped by #1269 , though.


And another from SO: http://stackoverflow.com/q/38272608/ They want to select groups based on stuff in the last row, so having = heath condition[.N] == "not healthy" should do it.


And another simple case (filtering by size): http://stackoverflow.com/q/39085450/


And another, with an anti join:

ID <- c("A","A","A","B","B","C","D")
Value <- c(0,1,2,0,2,0,0)
df <- data.frame(ID,Value)

library(data.table)
setDT(df)

# use j = max() to get GForce speedup
df[ !df[, max(Value), by=ID][V1 > 0, .(ID, Value = 0)], on=.(ID, Value)]

# do the more standard thing, if j = if (...) x
df[ !df[, if (max(Value) > 0) .(Value = 0), by=ID], on=.(Value, ID) ]

# desired syntax
df[ !df[, .(Value = 0), by=ID, having = max(Value) > 0], on=.(Value, ID) ]

Not such a great example, though.


And another with an answer like dt[, if(uniqueN(time)==1L) .SD, by=name, .SDcols="time"]


And another: http://stackoverflow.com/q/43354165/

And another: http://stackoverflow.com/q/43613087/

Another (though it might get deleted): http://stackoverflow.com/q/43635968/

Another http://stackoverflow.com/a/43765352/

Another http://chat.stackoverflow.com/transcript/message/37148860#37148860

Another https://stackoverflow.com/questions/45464333/assign-a-binary-vector-based-on-blocks-of-data-within-another-vector/

Another https://stackoverflow.com/questions/32259620/how-to-remove-unique-entry-and-keep-duplicates-in-r/32259758#32259758

Un autre https://stackoverflow.com/q/45557011/

Haiyou https://stackoverflow.com/questions/45598397/filter-data-frame-matching-all-values-of-a-vector

Um mais https://stackoverflow.com/a/45721286/

lingwai yige https://stackoverflow.com/a/45820567/

and https://stackoverflow.com/q/46251221/

uno mas https://stackoverflow.com/questions/46307315/show-sequences-that-include-a-variable-in-r

tambem https://stackoverflow.com/q/46638058/


And another. I want to subset my data.table (myDT) to entries that aren't found in a reference table (idDT):

library(data.table)
idDT = data.table(id = 1:3, v = c("A","B","C"))
myDT = data.table(id = 3:4, z = c("gah","egad"))

# my attempt
idDT[myDT, on=.(id), .SD[.N == 0L], by=.EACHI]
# Empty data.table (0 rows) of 2 cols: id,v

# workaround
myDT[, .SD[idDT[.SD, on=.(id), .N == 0, by=.EACHI]$V1]]

# desired notation (with having=)
myDT[, .SD, by = id, having = idDT[.BY, on=.(id), .N]==0L]

This would be inefficient, though, since my desired notation entails each by= value making a separate join to idDT. In that sense, maybe it's not the best example.


mais um https://stackoverflow.com/questions/47765283/r-data-table-group-by-where/47765308?noredirect=1#comment82524998_47765308 could do DT[, if (any(status == "A") && !any(status == "B")) .SD, by=id] or with a having parameter DT[, .SD, by=id, having = any(status == "A") && !any(status == "B")]

and then https://stackoverflow.com/a/48669032/ m[, if(isTRUE(any(passed))) .SD, by=id] should be m[by = id, having = isTRUE(any(passed))]

mais um exemplo https://stackoverflow.com/q/49072250/

ein anderer https://stackoverflow.com/a/49211292/ stock_profile[, sum(Value), by=Pcode, having=any(Location=="A" & NoSales == "Y")][, sum(V1)]

mais um https://stackoverflow.com/a/49366998/

autre https://stackoverflow.com/a/49919015/

y https://stackoverflow.com/questions/50257643/deleting-rows-in-r-with-value-less-than-x

moar https://stackoverflow.com/q/54582048

e https://stackoverflow.com/q/56283005

keep groups if .N==k (also many at the dupe target) https://stackoverflow.com/questions/56794306/only-get-data-table-groups-with-a-given-number-of-rows

keep groups if any(diff(sorted_col)) <= threshold https://stackoverflow.com/q/57512417

keep if max(x) < threshold https://stackoverflow.com/a/57698641

@ywhcuhk
Copy link

ywhcuhk commented May 28, 2016

@eantonya IMHO, adding the having parameter will actually make it easier to remember. Excessive conciseness can be hard to memorize. Moreover, making data.table more like SQL isn't a bad idea.

In data.table FAQ:

2.16 I’ve heard that data.table syntax is analogous to SQL.
Yes : ...

@jangorecki
Copy link
Member

jangorecki commented Jul 9, 2016

@ywhuofu data.table already accepts order function to i argument, which is what base R user would expect. The same way we translate sql ORDER to i = order(...) we can do with HAVING. It fits well, as i in data.frame is used for subsetting (having is just delayed subsetting after aggregation) or reordering.

@ColeMiller1
Copy link
Contributor

Could this be the API?

dt <- data.table(id   = rep(1:2, each = 2),
                 var  = c(0.2, 0.5, 1.5, 1.3))

dt[having.i(mean(var) > 1, by = id)]
  id var
1  2 1.5
2  2 1.3

I have implemented this version although it sets a restriction of only using gforce optimized functions as well as some functions that do not depend on groupings (e.g., +, |, &, etc.). In SQL, summary functions are what are used although I would understand if Cdogroups would want to be supported.

One additional note. It seems like it would be difficult to fit in dt[having(var > 3), .(var = mean(x)), by = .(grp)] within the current '[.data.table' code. There would need to be some checks to make sure the syntax were correct.

n = 1e6
grps = 1e5
head_n = 2L
dt = data.table::data.table(x = sample(grps, n, TRUE), y = runif(n))
 
# A tibble: 2 x 13
  expression                                                            min   median
  <bch:expr>                                                       <bch:tm> <bch:tm>
1 lw[having.i(.N < 2L | sum(y) > 11 | median(y) < 0.7, by = x)]     114.13ms 124.98ms
2 dt[dt[, .I[.N < 2L | sum(y) > 11 | median(y) < 0.7], by = x]$V1] 4000ms   4000ms


# A tibble: 2 x 13
  expression                           min  median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>                       <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int>
1 lw[having.i(.N < 2L, by = x)]     30.2ms  35.3ms     27.9     8.02MB     3.99    14
2 dt[dt[, .I[.N < 2L], by = x]$V1] 106.1ms 110.4ms      8.81    6.13MB    10.6      5

@MichaelChirico
Copy link
Member

I would prefer this as an added parameter, named either as having= or group_filter= (or something else that doesn't rely on SQL awareness to know what it does on sight).

e.g. I think it would be confusing to combine row filters in i with group-level filters also in i

@ColeMiller1
Copy link
Contributor

Would having = work on a subset of the data or would you only be able to use the i arg or the having arg? I also assume that having would occur prior to the j being evaluated. How would .BY and .GRP and soon .NGRP work with ```having = ````?

@renkun-ken
Copy link
Member

renkun-ken commented Feb 15, 2020

There are not many syntactic choices:

  • Adding a new argument such as having
  • Leveraging existing arguments: i, j, by.

If row filter and group filter are both needed, dt[row_selector & group_selector, ...] does not look right therefore it seems that for such use case, row filter and group filter should not appear in the same argument. Then i is ruled out.

Then there won't be many syntactic choices.

Leveraging by could make it look confusing. For example,

dt[, .SD, by = having(.(id), mean(var > 1))]
dt[, .SD, by = id ~ mean(var) > 1]

Adding special function to j does not look good.

dt[, having(mean(var) > 1, .SD), by = id]

Now, the code I find look best to me is the most original version

dt[, if (mean(var) > 1) .SD, by = id]
dt[, if (mean(var) > 1) .(x = sum(x), y = sum(y)), by = id]

What I really want is keep the optimization done after the group filtering. Can we detect theif expression in j and optimize into it such as keeping GForce to work inside if statement?

@franknarf1
Copy link
Contributor

@renkun-ken Or overload another infix operator?

dt[, mean(var) > 1 ? .SD, by=id]

One advantage of a special symbol over if is that there's no chance the user put a matching else later.

@renkun-ken
Copy link
Member

renkun-ken commented Feb 15, 2020

@franknarf1 It seems that while we are trying to detect if in j, we could also check if has else if and else. We could optimize if-only case and leave if-else un-optimized. Maybe later we could handle if-else case too. Personally I still prefer optimizing code to overriding or leveraging too much on existing operators.

@jangorecki
Copy link
Member

jangorecki commented Feb 15, 2020

@franknarf1 this is cool C syntax, although not sure if it wouldn't complicate to much here.
var > 1 ? d : e could work as well, isn't it?

@renkun-ken
Copy link
Member

renkun-ken commented Feb 15, 2020

var > 1 ? d : e looks concise but only works for inline simple case since d and e could be anything like {...} and the operator precedence may be confusing. Are we only trying to allow .SD to perform pure group filtering, or any expression in j here?

Adding syntax has a problem that user needs to be aware that the syntax is specially handled and should not work inside j. For example, user may expect

dt[, mean(var) > 1 ? 0 : (sd(var) < 1 ? 1 : 0), by = id]

to work, and even

dt[, mean(var) > 1 ? 0 : 1]
dt[, mean(var) > 1 ? 0 : (sd(var) < 1 ? 1 : 0)]

to work in general.

@renkun-ken
Copy link
Member

I'm a bit confused here.

Does

dt[, .SD, by = id, having = mean(var) > 1]

have any advantage over

dt[, if(mean(var) > 1) .SD, by = id]

since mean(var) > 1 will always be evaluated for each group. Does it only serve as a syntactic sugar or we are trying to optimize over this somehow to have higher performance?

@franknarf1
Copy link
Contributor

@jangorecki

@franknarf1 this is cool C syntax, although not sure if it wouldn't complicate to much here.
var > 1 ? d : e could work as well, isn't it?

Yeah, that would be cool. Operator precedence might get in the way without {}s as @renkun-ken pointed out (ex = quote(x & y ? a+b : v+w); str(rapply(as.list(ex), as.list, how="replace")))

I'm a bit confused here.

Does

dt[, .SD, by = id, having = mean(var) > 1]

have any advantage over

dt[, if(mean(var) > 1) .SD, by = id]

since mean(var) > 1 will always be evaluated for each group. Does it only serve as a syntactic sugar or we are trying to optimize over this somehow to have higher performance?

I guess until now I had preferred having= because I find it a little clearer to read and imagine it's easier to maintain as compared to adding further syntactical magic to j. On the other hand, I think I might instead prefer the j syntax magic, since

  • I am used to if () ... already; and like the ? way too if it's feasible.
  • If it is integrated in j, then no additional questions need to be answered about its behavior (eg, DT[, x := if (cond) y, by=id] creates NAs if the condition is met in some groups but not others and this behavior shouldn't need to be re-explained for having=).

Regarding optimization, it seems like there are a lot of examples where the having condition itself could benefit from some version of GForce, since it usually is an expression like max(x) > 0, max(x) == 0.

For my own use, besides the optimization, I guess it would be mostly useful for the return-only-groups case mentioned above #1269

> dt[, if (mean(var) > 1) .(), by=id] 
> # instead of ...
> dt[, mean(var) > 1, by=id][V1 == TRUE, !"V1"]
   id
1:  2

@MichaelChirico
Copy link
Member

MichaelChirico commented Feb 15, 2020 via email

@jangorecki
Copy link
Member

j code to be moved to C is the code that is responsible for column selection only, so guessing with argument. Won't interfere here.

@ColeMiller1
Copy link
Contributor

Since the FR is for add a 'having' parameter..., the word having should be somewhere in the solution. Optimization for ternary operators seems like a separate issue.

My preference for having.i() is because of the mantra of data.table: subset / order in i, select in j, group in by. having is just a special case of subsetting.

Regardless, if there is a new argument having, would the API support an i arg? Most use cases do not seem to need that requirement.

@ColeMiller1
Copy link
Contributor

What should be the behavior of ordering? That is, most of the current approaches automatically re-order:

library(data.table)

dt = data.table(grp = c(1L, 2L, 1L, 2L), x = letters[sample(4L)])
dt
#>      grp      x
#>    <int> <char>
#> 1:     1      a
#> 2:     2      b
#> 3:     1      c
#> 4:     2      d
dt[dt[, .I[.N > 0L], by = grp]$V1]
#>      grp      x
#>    <int> <char>
#> 1:     1      a
#> 2:     1      c
#> 3:     2      b
#> 4:     2      d

Should the having arg return a result that is re-ordered according to by?

@ColeMiller1 ColeMiller1 mentioned this issue May 2, 2020
7 tasks
@franknarf1
Copy link
Contributor

What should be the behavior of ordering?
Should the having arg return a result that is re-ordered according to by?

@ColeMiller1 Fwiw, I would expect having= to only appear when by= also appears, so the results would be grouped as in your example with ...$V1.

@MichaelChirico
Copy link
Member

Yes I would expect the ordering to be consistent:

DT[i, j, by, having]
# < == >
DT[i, if (having) j, by]

@jangorecki
Copy link
Member

I think there was no agreement on API, particularly on having new having argument in [. @mattdowle wdyt?
Current approach of using DT[, if (.N > 1L) .SD, col1] is good one, not really complicated, easy to extend, but little bit more difficult to optimise.
My idea was to use having as function call in i: DT[having(N > 1L), .N, col1], but then it is not possible to provide normal subsetting to i.
Alternatively new arg could be a sub-argument of by, haven't much thought about it but something like DT[, .N, by=.(col1, .having = N > 1L)], so the extra grouping-related argument is encapsulated into by argument. This is the proper way to scale up number of arguments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request top request One of our most-requested issues
Projects
None yet
Development

No branches or pull requests

9 participants