add a 'having' parameter to [.data.table
#788
Comments
Great FR. I've been pondering about this use case for quite a while as well. We can do this without the additional argument like so:
dt[, .SD[mean(var)>1], by=id]
(But for speed, this'll need optimisation of
It's most likely this case that we resort to
dt[dt[, .I[mean(var) > 1], by=id]$V1]
And it'd be great to get this directly (even better if we can achieve it without |
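For readers comparing the two idioms mentioned in this comment, here is a minimal sketch on made-up data (the table `dt` and its values are assumptions for illustration, not taken from the thread). It suggests that subsetting `.SD` and collecting `.I` select the same rows:

```r
library(data.table)

# Hypothetical toy data: group 1 has mean(var) = 0.5, group 2 has mean(var) = 3
dt <- data.table(id = c(1L, 1L, 2L, 2L), var = c(0, 1, 2, 4))

# Idiom 1: subset .SD with a length-1 logical; FALSE drops the whole group
res_sd <- dt[, .SD[mean(var) > 1], by = id]

# Idiom 2: collect the row numbers (.I) of qualifying groups, then subset once
keep <- dt[, .I[mean(var) > 1], by = id]$V1
res_i <- dt[keep]

print(res_sd)
print(res_i)
```

The second form avoids building a subset of `.SD` for every group, which is presumably why it is the one people fall back on when speed matters.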
Hi Arun. Thanks for the answer. Once optimisation of
and
Even though the second may also have more appeal for people coming from other languages (SQL in particular). But again, this might be just my opinion. Maybe I've just been using SQL too much lately (lol). |
As far as the taste part goes - I really dislike adding an extra param, when it can be accomplished with simple and standard syntax (i.e. the first option above). |
Curious. I was sure that you were the one most likely to appreciate this :-) (considering how much you wanted to eliminate by-without-by, mainly to improve readability, especially for people coming from other languages, if I remember correctly). Anyway, I know the two are quite different scenarios. I just wanted to share my point of view:
|
The reasons I didn't like the silent by-without-by and "having" are actually the same - I don't like to remember extra stuff, whether it be extra params or extra strange behavior. I would argue that the first expression you wrote is much easier to read, because you don't have to keep reading the line, then discover that some new param is specified, and have to go back to the beginning of the sentence and reevaluate your mental model of what's going on. |
What do you think of not adding dt[ having(var > 1), .(var = mean(var)), by = id ]
# would perform below without additional copy:
dt[, .(var = mean(var)), by = id ][ var > 1 ]
|
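As a point of comparison for the chained form shown above, here is a small sketch of what it computes today, using assumed toy data. Aggregating first and filtering afterwards is the closest existing analogue of SQL's GROUP BY ... HAVING on the aggregated result:

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(id = c(1L, 1L, 2L, 2L), var = c(0, 1, 2, 4))

# Step 1: one row per group with the aggregate
agg <- dt[, .(var = mean(var)), by = id]

# Step 2: filter the aggregated table, like a HAVING clause
having_res <- agg[var > 1]

print(having_res)
```

Note that this returns one row per surviving group, not the original rows of those groups.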
I think this FR is closely tied with #1269 "Returning only groups." I often want to get groups with some attribute and store them in a vector, like
With the
The code is just as long, but I prefer it, so I don't have to read |
Another example from SO. The goal is to overwrite the
Besides arguably nicer syntax, I guess the
which are rather convoluted. Edit: And another example to update if/when this feature is available: http://stackoverflow.com/q/36292702 |
Another example from SO. It could be used to select strictly unique rows (related to #1163 ):
Note that
And another from SO: http://stackoverflow.com/q/38272608/ They want to select groups based on stuff in the last row, so
And another simple case (filtering by size): http://stackoverflow.com/q/39085450/
And another, with an anti join:
Not such a great example, though.
And another with an answer like
And another: http://stackoverflow.com/q/43354165/
And another: http://stackoverflow.com/q/43613087/
Another (though it might get deleted): http://stackoverflow.com/q/43635968/
Another: http://stackoverflow.com/a/43765352/
Another: http://chat.stackoverflow.com/transcript/message/37148860#37148860
Another: https://stackoverflow.com/q/45557011/
One more: https://stackoverflow.com/questions/45598397/filter-data-frame-matching-all-values-of-a-vector
One more: https://stackoverflow.com/a/45721286/
Another one: https://stackoverflow.com/a/45820567/ and https://stackoverflow.com/q/46251221/
One more: https://stackoverflow.com/questions/46307315/show-sequences-that-include-a-variable-in-r
Also: https://stackoverflow.com/q/46638058/
And another. I want to subset my data.table (myDT) to entries that aren't found in a reference table (idDT):
This would be inefficient, though, since my desired notation entails each by= value making a separate join to idDT. In that sense, maybe it's not the best example.
One more: https://stackoverflow.com/questions/47765283/r-data-table-group-by-where/47765308?noredirect=1#comment82524998_47765308 could do
And then: https://stackoverflow.com/a/48669032/
One more example: https://stackoverflow.com/q/49072250/
Another: https://stackoverflow.com/a/49211292/
One more: https://stackoverflow.com/a/49366998/
Another: https://stackoverflow.com/a/49919015/
And: https://stackoverflow.com/questions/50257643/deleting-rows-in-r-with-value-less-than-x
More: https://stackoverflow.com/q/54582048
And: https://stackoverflow.com/q/56283005
Keep groups if .N==k (also many at the dupe target): https://stackoverflow.com/questions/56794306/only-get-data-table-groups-with-a-given-number-of-rows
Keep groups if any(diff(sorted_col)) <= threshold: https://stackoverflow.com/q/57512417
Keep if max(x) < threshold: https://stackoverflow.com/a/57698641 |
@eantonya IMHO, adding the In
|
@ywhuofu data.table already accepts |
Could this be the API?
I have implemented this version although it sets a restriction of only using One additional note. It seems like it would be difficult to fit in
|
I would prefer this as an added parameter, named either as e.g. I think it would be confusing to combine row filters in |
Would |
There are not many syntactic choices:
If row filter and group filter are both needed, then there won't be many syntactic choices. Leveraging
dt[, .SD, by = having(.(id), mean(var) > 1)]
dt[, .SD, by = id ~ mean(var) > 1]
Adding special function to
dt[, having(mean(var) > 1, .SD), by = id]
Now, the code that looks best to me is the most original version:
dt[, if (mean(var) > 1) .SD, by = id]
dt[, if (mean(var) > 1) .(x = sum(x), y = sum(y)), by = id]
What I really want is to keep the optimization done after the group filtering. Can we detect the |
@renkun-ken Or overload another infix operator?
One advantage of a special symbol over |
@franknarf1 It seems that while we are trying to detect |
@franknarf1 this is cool C syntax, although I'm not sure it wouldn't complicate things too much here. |
Adding syntax has a problem that user needs to be aware that the syntax is specially handled and should not work inside dt[, mean(var) > 1 ? 0 : (sd(var) < 1 ? 1 : 0), by = id] to work, and even dt[, mean(var) > 1 ? 0 : 1]
dt[, mean(var) > 1 ? 0 : (sd(var) < 1 ? 1 : 0)] to work in general. |
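I'm a bit confused here. Does
dt[, .SD, by = id, having = mean(var) > 1]
have any advantage over
dt[, if(mean(var) > 1) .SD, by = id]
since mean(var) > 1 will always be evaluated for each group? Does it only serve as syntactic sugar, or are we trying to optimize over this somehow to get higher performance? |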
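Yeah, that would be cool. Operator precedence might get in the way without {}s as @renkun-ken pointed out (ex = quote(x & y ? a+b : v+w); str(rapply(as.list(ex), as.list, how="replace")))
I guess until now I had preferred having= because I find it a little clearer to read and imagine it's easier to maintain as compared to adding further syntactical magic to j. On the other hand, I think I might instead prefer the j syntax magic, since
- I am used to if () ... already; and like the ? way too if it's feasible.
- If it is integrated in j, then no additional questions need to be answered about its behavior (eg, DT[, x := if (cond) y, by=id] creates NAs if the condition is met in some groups but not others and this behavior shouldn't need to be re-explained for having=).
Regarding optimization, it seems like there are a lot of examples where the having condition itself could benefit from some version of GForce, since it usually is an expression like max(x) > 0, max(x) == 0.
For my own use, besides the optimization, I guess it would be mostly useful for the return-only-groups case mentioned above #1269
> dt[, if (mean(var) > 1) .(), by=id]
> # instead of ...
> dt[, mean(var) > 1, by=id][V1 == TRUE, !"V1"]
   id
1:  2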
|
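For the return-only-groups case referenced in #1269, one way to get the qualifying group values as a plain vector with current syntax might look like the sketch below. The data and the column name `keep` are assumptions made up for this illustration:

```r
library(data.table)

# Hypothetical toy data: groups 2 and 3 have mean(var) > 1
dt <- data.table(id = c(1L, 1L, 2L, 2L, 3L), var = c(0, 1, 2, 4, 5))

# One logical flag per group
flags <- dt[, .(keep = mean(var) > 1), by = id]

# Keep only the flagged groups and return the grouping column as a vector
ids <- flags[keep == TRUE, id]

print(ids)
```

A `having`-style feature would presumably collapse these two steps into one call, which is the convenience being asked for here.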
Nice points Frank, in addition to the huge compendium of use cases you've built (thanks again btw!).
It may in fact be easier to do GForce in the having= version, since we can just apply the GForce logic to having, similar to j, rather than trying to do NSE to accomplish the same.
Though that may interact with Jan's WIP to move a lot of j code to C. Any thoughts there, Jan?
|
j code to be moved to C is the code that is responsible for column selection only, so guessing |
Since the FR is for add a 'having' parameter..., the word My preference for Regardless, if there is a new argument |
What should be the behavior of ordering? That is, most of the current approaches automatically re-order:
library(data.table)
dt = data.table(grp = c(1L, 2L, 1L, 2L), x = letters[sample(4L)])
dt
#>      grp      x
#>    <int> <char>
#> 1:     1      a
#> 2:     2      b
#> 3:     1      c
#> 4:     2      d
dt[dt[, .I[.N > 0L], by = grp]$V1]
#>      grp      x
#>    <int> <char>
#> 1:     1      a
#> 2:     1      c
#> 3:     2      b
#> 4:     2      d
Should the |
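To make the ordering question concrete, the sketch below uses a fixed version of the table above (no `sample()`, so the result is reproducible; that change is an assumption for illustration). It shows why the `.I` idiom clusters rows by group, and that sorting the collected row numbers is one way to get the input order back:

```r
library(data.table)

# Same shape as the example above, but with fixed values
dt <- data.table(grp = c(1L, 2L, 1L, 2L), x = c("a", "b", "c", "d"))

# Row numbers are collected group by group: rows 1, 3 (grp 1), then 2, 4 (grp 2)
idx <- dt[, .I[.N > 0L], by = grp]$V1

regrouped <- dt[idx]        # rows come back clustered by grp
original  <- dt[sort(idx)]  # sorting the indices restores the input order

print(regrouped)
print(original)
```

Whichever behavior a `having` feature adopts, users can emulate the other one with an explicit sort or reorder like this.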
@ColeMiller1 Fwiw, I would expect |
Yes I would expect the ordering to be consistent:
|
I think there was no agreement on API, particularly on having new |
Currently, to have the equivalent (or something similar) of the SQL having clause you need to write a [.data.table first using by and then feed the result into the i parameter of a second [.data.table, like in:
Another option is to use a conditional statement inside j, which is very powerful; I do it all the time, and so far there is nothing the current syntax has not allowed me to do. However, I believe a having parameter would allow writing much clearer and more readable code. For example, the above can be written as:
dt[, if(mean(var) > 1) .SD, by = id]
What I propose is something like:
dt[, .SD, by = id, having = mean(var) > 1]
The idea is to have an expression that always evaluates to a logical of length 1 which would tell whether or not j has to be evaluated for the current group.
Thanks,
Michele
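As an illustration of the two existing approaches described in this post, here is a rough sketch on assumed data. The two-step form shown is only one possible way to write it (the post's own snippet is not reproduced here), and the names `big_ids`, `two_step` and `one_step` are hypothetical:

```r
library(data.table)

# Hypothetical toy data: only group "b" has mean(var) > 1
dt <- data.table(id = c("a", "a", "b", "b", "c"), var = c(0, 1, 2, 4, 0.5))

# Two-step approach: aggregate by group first, then feed the result into i
big_ids  <- dt[, mean(var), by = id][V1 > 1, id]
two_step <- dt[id %in% big_ids]

# Conditional in j: groups where the condition is FALSE return NULL and are dropped
one_step <- dt[, if (mean(var) > 1) .SD, by = id]

print(two_step)
print(one_step)
```

The proposed `having = mean(var) > 1` argument does not exist in data.table; under the proposal it would express the same filter as the `if` form above.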