
RFC: .SDcols=patterns() #3186

Merged (4 commits, Dec 14, 2018)
Conversation

@MichaelChirico (Member) commented Dec 5, 2018

Closes #1878 (.SDcols=patterns(...))
Closes #3185 (ancillary edge case)

Opening as an RFC: despite how simple the PR is, this request is popular enough that I want to be sure we agree this is the right API before shipping the feature.

Tagging everyone who participated in #1878: @eantonya, @ksavin, @mbacou, @hannes101, @franknarf1, @HughParsonage, @arunsrinivasan, @skanskan, @therimalaya, @bchen102, @sbsdc as well as @mattdowle and @jangorecki.

As Implemented

Essentially we re-use patterns as we do in melt; multiple patterns yield the intersection of matches:

library(data.table)
DT = data.table(i = 1:10, c = letters[1:10])
set.seed(3459)
DT[ , paste0('V', 1:10) := lapply(integer(10L), function(...) rnorm(10L))][]
#      i c         V1         V2          V3         V4          V5         V6         V7
#  1:  1 a  0.3714012 -0.1494429  1.13300407  1.2661908 -0.09249212  0.7954114 -0.1211000
#  2:  2 b -0.1177177 -2.4585715 -1.63960330 -0.7998233 -0.54588880 -1.3338185  0.8138912
#  3:  3 c -1.0918951 -1.0230666  0.68272545  2.2874009  1.54356767 -0.7094124  0.7326431
#  4:  4 d -2.5809883 -0.1065164  1.60203979 -0.6913598 -0.47854994 -0.2953956 -0.2386240
#  5:  5 e -0.1362863 -0.4616205 -1.41872650  0.5156108  1.90234433  1.3667683 -2.0415515
#  6:  6 f -1.2504643 -0.7284620  0.98501940  0.5406783  0.21214610  0.6952400  0.4622092
#  7:  7 g  0.2597725 -1.0290593 -0.55132090  0.1504328 -0.14798292  0.4138119  0.4046622
#  8:  8 h -2.1484501 -2.0823730  1.18265952  0.6882005 -0.70851129  0.3380656 -0.2353361
#  9:  9 i -0.5772339  2.6833881 -0.78724790 -1.4424652 -1.72913363 -1.6478304 -1.2496821
# 10: 10 j  0.9225026 -1.1975755  0.09083374  0.7527641 -0.94123194 -1.2617830 -0.6598987
#             V8          V9        V10
#  1:  0.6730572  0.17482020  0.8189296
#  2: -1.0052455 -0.05906419  0.6700489
#  3:  1.2782899  1.18298399 -1.2358743
#  4:  0.4508402 -0.53088184 -0.8933916
#  5:  0.1969130  1.41444639 -0.6043715
#  6:  0.8346712  1.00041944  0.3736168
#  7:  0.6027966  0.16537790 -2.3239332
#  8: -1.3998695  0.73790807  2.2070307
#  9: -1.9737772  0.36031874  0.5035344
# 10: -0.1259538  1.55622494 -1.3582985

DT[ , lapply(.SD, sum), .SDcols = patterns('^V')]
#           V1      V2       V3      V4         V5        V6        V7         V8       V9
# 1: -6.349359 -6.5533 1.279383 3.26763 -0.9857325 -1.638943 -2.132787 -0.4682779 6.002554
#          V10
# 1: -1.842709

# non-empty intersection
DT[ , lapply(.SD, sum), .SDcols = patterns('^V[02468]', '^V[48]')]
#         V4         V8
# 1: 3.26763 -0.4682779

Questions to address before merging:

I think for me the main design decision was this line:

.SDcols = Reduce(intersect, do_patterns(colsub, names(x)))
  • Should we take patterns(regex1, regex2, ...) to mean the user wants only columns matching both regex1 and regex2, or is it columns matching regex1 or regex2?
  • Does the answer depend on whether patterns is negated? I considered special-casing: if .SDcols is !patterns(regex1, regex2, ...), take the union; otherwise, the intersection.
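The Reduce(intersect, ...) line above can be sketched in base R (a minimal illustration of the intersection semantics, not the actual data.table internals; do_patterns is the internal helper):

```r
# Minimal base-R sketch: grep each pattern against the column names,
# then intersect the matches across patterns.
nm <- c("i", "c", paste0("V", 1:10))
matches <- lapply(c("^V[02468]", "^V[48]"), grep, x = nm, value = TRUE)
Reduce(intersect, matches)
# [1] "V4" "V8"
```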

Other points to clarify:

  • Should we bother implementing the cols argument to patterns? Right now it's undefined behavior: it will either fail or not work as expected. It would take a bit more digging around in the code to make it work; is that worth it? Or should we fail if cols is supplied?
  • Is patterns the right name? Is it confusing that the behavior of patterns is not exactly the same as it is for melt.data.table? Should we replace patterns with a flexible helper a la like_any/like_all which could handle the intersect/union question directly?

🙇 in advance and happy data.tableing!

@codecov (bot) commented Dec 5, 2018

Codecov Report

Merging #3186 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #3186      +/-   ##
=========================================
+ Coverage    94.6%   94.6%   +<.01%     
=========================================
  Files          61      61              
  Lines       11742   11747       +5     
=========================================
+ Hits        11108   11113       +5     
  Misses        634     634
Impacted Files Coverage Δ
R/utils.R 91.11% <100%> (+3.23%) ⬆️
R/data.table.R 95.18% <100%> (ø) ⬆️
R/fmelt.R 100% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c8faf21...c03c740

Review thread on R/data.table.R (outdated, resolved):
data.table(V4 = 3.4, V8 = -0.4))

# also with !/- inversion
test(1963.4, DT[ , lapply(.SD, sum), .SDcols = !patterns('^c|i')],
Member:
you mentioned !/- but tested only !

Member Author:
Since the two cases are handled literally side-by-side, there's currently no way for the behavior to differ between the two approaches, which is why I only tested the one:

if (is.call(colsub) && deparse(colsub[[1L]], 500L, backtick=FALSE) %chin% c("!", "-")) {

Member:
"as of now", that is how regression happens :) in this case quite unlikely, agree

Member Author:
I'm kicking the can to downstream... if someone wants to treat them differently at some point, add the tests then 😛

The only worry is some upstream change that somehow changes how - and ! work in R in general? But I think as long as lazy eval remains as is and they both remain functions we'll be fine.

@franknarf1 (Contributor) commented Dec 5, 2018

to be sure we can basically agree this is the right API for implementing this feature.

multiple patterns --> intersection of matches:

I don't think I'd need more than one pattern at a time, fwiw.

Since the behavior is different from melt's patterns, maybe use a different function? There's also ?patterns to be kept updated for both uses. Maybe .SDcols = like_any(...) and .SDcols = like_all(...).

For negation with multiple patterns, maybe individual patterns could be negated as well, .SDcols=like_all("boo", !"yah", "foo", !"bar"), sort of along the lines of Hugh's function in #1878 (comment).

Does the answer depend on whether patterns is negated? I considered special-casing: if .SDcols is !patterns(regex1, regex2, ...), take the union; otherwise, the intersection.

"In the intersection" = all conditions hold, but "Not in the intersection" != any conditions hold (right?), so that's not what I'd expect ! to do (based on j when with=FALSE).

@MichaelChirico (Member Author):

Thanks Frank. Will definitely update patterns before merging.

The point about patterns working differently is good to keep in mind... but I think the difference is fairly subtle. What I'm going for is avoiding the cognitive overload of many functions to keep track of, and the usage is close enough that I just went with patterns. The main difference in functionality comes, I think, from what happens with the 2nd, 3rd, etc. arguments, so maybe we just disallow those here? I've added this to the main post as something to sort out.

Agree about ! changing to union not actually making sense.

I'm not sure what the expected output of like_all("boo", !"yah", "foo", !"bar") would be? It is also technically possible to use lazy eval to handle something like "boo" & !"yah" & "foo" & !"bar" but I think that's probably overkill...

@HughParsonage (Member):

My 2c:

I'm not really a fan of functions that are implicit within []; I'd prefer an .SDgrep argument, possibly as a list, to pass arguments like perl and invert. It would also allow us to keep .SDcols as an "and" argument. I fear bugs like someone defining patterns outside the data.table call and expecting something slightly different.

DT[, .SD, .SDgrep = list(pattern = c("^V", "9"), invert = c(FALSE, TRUE))]
DT[, .SD, .SDgrep = "^V"] # if atomic just pass to first list argument.

Similarly, I think this would be easier to reason about and maintain than trying to interpret ! and - prefixes.
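A rough base-R sketch of the proposed semantics (the .SDgrep argument and the sd_grep helper below are hypothetical, not data.table API):

```r
# Hypothetical sketch of the proposed .SDgrep semantics: each pattern narrows
# the selection, and invert[i] flips the i-th pattern's match.
sd_grep <- function(nm, pattern, invert = rep(FALSE, length(pattern))) {
  keep <- rep(TRUE, length(nm))
  for (i in seq_along(pattern)) {
    hit <- grepl(pattern[[i]], nm)
    keep <- keep & if (invert[[i]]) !hit else hit
  }
  nm[keep]
}
sd_grep(paste0("V", 1:10), c("^V", "9"), invert = c(FALSE, TRUE))
# [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V10"
```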

@MichaelChirico (Member Author) commented Dec 7, 2018

@HughParsonage FWIW we've had that NSE trick in melt.data.table for quite some time and I don't recall any complaints about it (doesn't mean there aren't any or the point is irrelevant)...

Interpretation of - and ! hasn't changed (same functionality as has been used in interpreting .SDcols basically as far back as I can remember...)

I'm not sure how useful it is to make this functionality extremely flexible, the more I think about it... probably 80-90% of the target use cases are as simple as your .SDgrep = '^V' example.

I think the main reason for having this in the first place is the same as the reason it's built on an implicitly-interpreted function -- convenience/dynamism -- being able to apply the pattern filtering without having to refer to the table itself, since the table is in most cases the natural source of the names being filtered. This is especially useful when chaining, where names(x) isn't available.

I don't really have a rigorous way of evaluating the cost/benefit of adding a new argument to [... Certainly pandas' UI design, for example, does not hesitate to add new arguments. But it feels strange to me (only me, maybe?) to add a new argument (.SDgrep) that has a 90% conceptual/cognitive overlap with an existing argument (.SDcols), when I feel like most of the novelty/differentiation is in the back-end (arguments being the user-facing front end)...

Another thing that came to mind at some point was .SDcols = .GREP(...) where .GREP is simply a grep wrapper where the x argument is automatically set to names(x)? In this case we could even get away with just overwriting grep internal to each [ call (no new function or argument name required):

`[.data.table` = function(...) {
  # potentially search `substitute(.SDcols)` for `grep` calls before defining this
  grep = function(pattern, ignore.case = FALSE, perl = FALSE, value = FALSE,
                  fixed = FALSE, useBytes = FALSE, invert = FALSE) {
    # must dispatch to base::grep explicitly, or this wrapper would recurse
    base::grep(pattern, names(x), ignore.case = ignore.case, perl = perl,
               value = value, fixed = fixed, useBytes = useBytes, invert = invert)
  }
  ...
}

@franknarf1 (Contributor) commented Dec 7, 2018

I think the main reason for having this in the first place is the same as the reason it's built on an implicitly-interpreted function -- convenience/dynamism -- being able to apply the pattern filtering without having to refer to the table itself, since the table is in most cases the natural source of the names being filtered. This is especially useful when chaining, where names(x) isn't available.

I feel like this was brought up before, but could there just be a symbol for "self"? So...

.SDcols = like(.NM, "foo")
.SDcols = startsWith(.NM, "foo")
.SDcols = seq_along(.NM) %% 3 == 1

In that case, existing functions (grep or other) can still be relied on without inventing new ones.
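The idea above could be sketched in base R along these lines (a hypothetical mechanism: .NM and the sd_cols helper are illustrative names, not existing data.table API):

```r
# Hypothetical sketch of a self-referential name symbol: evaluate the .SDcols
# expression with .NM bound to the table's column names, and interpret a
# logical result as a column mask.
sd_cols <- function(x, expr) {
  sel <- eval(expr, envir = list(.NM = names(x)), enclos = parent.frame())
  if (is.logical(sel)) names(x)[sel] else sel
}
df <- data.frame(V1 = 1, V2 = 2, c1 = 3)
sd_cols(df, quote(startsWith(.NM, "V")))      # "V1" "V2"
sd_cols(df, quote(seq_along(.NM) %% 2 == 1))  # "V1" "c1"
```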

@MichaelChirico (Member Author):

I'm not seeing anything like .NM, but #2130 is close...

@jangorecki (Member):

IMO if we could refer to names(.SD) inside the .SDcols argument, we would have enough flexibility to handle whatever we want using data.table-independent API. The only extra thing that makes sense is patterns, as we already have that helper in melt.

@HughParsonage (Member):

OK I'm less keen on .SDgrep than I was; I think .SDcols is the right place to implement this.

But I think your .GREP idea is better: it makes it clearer that the function should not be called directly. I also think the .and and .but.not arguments from my select_grep are important: I've used them a lot more than I thought I would.

@MichaelChirico (Member Author) commented Dec 8, 2018 via email

@HughParsonage (Member):

Something like

       x income2014 tax2014 income2015 tax2015 income2016 tax2016 income2017 tax2017
  1:   1      41149 16043.7      45429 10834.0      30092  4998.6      35139  6550.6
  2:   2      66379  9605.3      38110  7119.0      75009 12893.1      41810  6558.4
  3:   3      61011  8835.2      45726  6010.2      50157  4438.5      58424 12087.6
  4:   4      35457 19485.7      28610  3500.9      61623  8216.8      57302  6799.9
  5:   5      31649  7736.1      36109  4614.4      48183  4306.5      53117 17794.3
 ---                                                                                
496: 496      37231  6314.5      62428 10677.4      58065  6880.2      48274  3092.2
497: 497      60474 12755.9      61201  5214.9      57968  6566.2      45954 12666.0
498: 498      51132  9113.2      45104 27374.2      77910 31478.2      44016 16641.3
499: 499      44300  7711.9      42577  7940.4      55625  4154.7      46742 12233.1
500: 500      36028 12853.6      62328  9945.5      44063 11789.2      58207 10863.0

And then select_grep("tax", .and = "x").
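A simplified base-R sketch of what that call would select on the table above (select_like is an illustrative stand-in; see Hugh's select_grep for the real semantics):

```r
# Simplified sketch of the select_grep("tax", .and = "x") idea:
# keep columns matching the pattern, plus those named in .and,
# minus any matching .but.not.
select_like <- function(nm, pattern, .and = NULL, .but.not = NULL) {
  keep <- grepl(pattern, nm) | nm %in% .and
  if (!is.null(.but.not)) keep <- keep & !grepl(.but.not, nm)
  nm[keep]
}
nm <- c("x", paste0(c("income", "tax"), rep(2014:2017, each = 2)))
select_like(nm, "tax", .and = "x")
# [1] "x"       "tax2014" "tax2015" "tax2016" "tax2017"
```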

@Atrebas commented Dec 8, 2018

Thanks for the ping. As a user, what I like about data.table is that it offers conciseness and flexibility at the same time. I definitely agree with @MichaelChirico's comment about avoiding "cognitive overload". When upvoting this feature, DT[ , lapply(.SD, sum), .SDcols = patterns('^V')] was indeed clearly what I had in mind and I think it would cover most, if not all, of my use cases.
Kudos for your work!

@mattdowle mattdowle added this to the 1.12.0 milestone Dec 14, 2018
@mattdowle mattdowle changed the title RFC: #3185 and #1878 -- patterns in .SDcols RFC: .SDcols=patterns() Dec 14, 2018
@mattdowle mattdowle merged commit 02642ab into master Dec 14, 2018
@mattdowle mattdowle deleted the sd_patterns branch December 14, 2018 22:31