
RFC: .SDcols=patterns() #3186

Merged (4 commits, Dec 14, 2018)
Conversation

@MichaelChirico (Member) commented Dec 5, 2018

Closes #1878 (.SDcols=patterns(...))
Closes #3185 (ancillary edge case)

Opening as an RFC: despite how simple the PR is, this request is popular enough that I want to be sure we agree this is the right API before shipping the feature.

Tagging everyone who participated in #1878: @eantonya, @ksavin, @mbacou, @hannes101, @franknarf1, @HughParsonage, @arunsrinivasan, @skanskan, @therimalaya, @bchen102, @sbsdc as well as @mattdowle and @jangorecki.

As Implemented

Essentially we re-use patterns as we do in melt; multiple patterns yield the intersection of matches:

library(data.table)
DT = data.table(i = 1:10, c = letters[1:10])
set.seed(3459)
DT[ , paste0('V', 1:10) := lapply(integer(10L), function(...) rnorm(10L))][]
#      i c         V1         V2          V3         V4          V5         V6         V7
#  1:  1 a  0.3714012 -0.1494429  1.13300407  1.2661908 -0.09249212  0.7954114 -0.1211000
#  2:  2 b -0.1177177 -2.4585715 -1.63960330 -0.7998233 -0.54588880 -1.3338185  0.8138912
#  3:  3 c -1.0918951 -1.0230666  0.68272545  2.2874009  1.54356767 -0.7094124  0.7326431
#  4:  4 d -2.5809883 -0.1065164  1.60203979 -0.6913598 -0.47854994 -0.2953956 -0.2386240
#  5:  5 e -0.1362863 -0.4616205 -1.41872650  0.5156108  1.90234433  1.3667683 -2.0415515
#  6:  6 f -1.2504643 -0.7284620  0.98501940  0.5406783  0.21214610  0.6952400  0.4622092
#  7:  7 g  0.2597725 -1.0290593 -0.55132090  0.1504328 -0.14798292  0.4138119  0.4046622
#  8:  8 h -2.1484501 -2.0823730  1.18265952  0.6882005 -0.70851129  0.3380656 -0.2353361
#  9:  9 i -0.5772339  2.6833881 -0.78724790 -1.4424652 -1.72913363 -1.6478304 -1.2496821
# 10: 10 j  0.9225026 -1.1975755  0.09083374  0.7527641 -0.94123194 -1.2617830 -0.6598987
#             V8          V9        V10
#  1:  0.6730572  0.17482020  0.8189296
#  2: -1.0052455 -0.05906419  0.6700489
#  3:  1.2782899  1.18298399 -1.2358743
#  4:  0.4508402 -0.53088184 -0.8933916
#  5:  0.1969130  1.41444639 -0.6043715
#  6:  0.8346712  1.00041944  0.3736168
#  7:  0.6027966  0.16537790 -2.3239332
#  8: -1.3998695  0.73790807  2.2070307
#  9: -1.9737772  0.36031874  0.5035344
# 10: -0.1259538  1.55622494 -1.3582985

DT[ , lapply(.SD, sum), .SDcols = patterns('^V')]
#           V1      V2       V3      V4         V5        V6        V7         V8       V9
# 1: -6.349359 -6.5533 1.279383 3.26763 -0.9857325 -1.638943 -2.132787 -0.4682779 6.002554
#          V10
# 1: -1.842709

# non-empty intersection
DT[ , lapply(.SD, sum), .SDcols = patterns('^V[02468]', '^V[48]')]
#         V4         V8
# 1: 3.26763 -0.4682779

Questions to address before merging:

I think for me the main design decision was this line:

.SDcols = Reduce(intersect, do_patterns(colsub, names(x)))
  • Should we take patterns(regex1, regex2, ...) to mean the user wants only columns matching both regex1 and regex2, or is it columns matching regex1 or regex2?
  • Does the answer depend on whether patterns is negated? I considered special-casing: if .SDcols is !patterns(regex1, regex2, ...), take the union; otherwise, the intersection.
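The Reduce(intersect, ...) line above can be sketched in base R (a minimal illustration of the intersection semantics, not the actual data.table internals; do_patterns is the internal helper):

```r
# Minimal base-R sketch: grep each pattern against the column names,
# then intersect the matches across patterns.
nm <- c("i", "c", paste0("V", 1:10))
matches <- lapply(c("^V[02468]", "^V[48]"), grep, x = nm, value = TRUE)
Reduce(intersect, matches)
# [1] "V4" "V8"
```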

Other points to clarify:

  • Should we bother implementing the cols argument to patterns? Right now it's undefined behavior: it will either fail or not work as expected. It would take a bit more digging around in the code to make it work; is that worth it? Or should we fail if cols is supplied?
  • Is patterns the right name? Is it confusing that the behavior of patterns is not exactly the same as it is for melt.data.table? Should we replace patterns with a flexible helper a la like_any/like_all which could handle the intersect/union question directly?

🙇 in advance and happy data.tableing!

@codecov (bot) commented Dec 5, 2018

Codecov Report

Merging #3186 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #3186      +/-   ##
=========================================
+ Coverage    94.6%   94.6%   +<.01%     
=========================================
  Files          61      61              
  Lines       11742   11747       +5     
=========================================
+ Hits        11108   11113       +5     
  Misses        634     634
Impacted Files Coverage Δ
R/utils.R 91.11% <100%> (+3.23%) ⬆️
R/data.table.R 95.18% <100%> (ø) ⬆️
R/fmelt.R 100% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c8faf21...c03c740

Review thread on R/data.table.R (outdated, resolved):
data.table(V4 = 3.4, V8 = -0.4))

# also with !/- inversion
test(1963.4, DT[ , lapply(.SD, sum), .SDcols = !patterns('^c|i')],
Member:
you mentioned !/- but tested only !

Member Author:
Since the two cases are handled literally side-by-side, there's currently no way for the behavior to differ between the two approaches, which is why I only tested the one:

if (is.call(colsub) && deparse(colsub[[1L]], 500L, backtick=FALSE) %chin% c("!", "-")) {

Member:
"as of now", that is how regression happens :) in this case quite unlikely, agree

Member Author:
I'm kicking the can to downstream... if someone wants to treat them differently at some point, add the tests then 😛

The only worry is some upstream change that somehow changes how - and ! work in R in general? But I think as long as lazy eval remains as is and they both remain functions we'll be fine.

@franknarf1 (Contributor) commented Dec 5, 2018

to be sure we can basically agree this is the right API for implementing this feature.

multiple patterns --> intersection of matches:

I don't think I'd need more than one pattern at a time, fwiw.

Since the behavior is different from melt's patterns, maybe use a different function? There's also ?patterns to be kept updated for both uses. Maybe .SDcols = like_any(...) and .SDcols = like_all(...).

For negation with multiple patterns, maybe individual patterns could be negated as well, .SDcols=like_all("boo", !"yah", "foo", !"bar"), sort of along the lines of Hugh's function in #1878 (comment).

Does the answer depend on whether patterns is negated? I considered special-casing: if .SDcols is !patterns(regex1, regex2, ...), take the union; otherwise, the intersection.

"In the intersection" = all conditions hold, but "Not in the intersection" != any conditions hold (right?), so that's not what I'd expect ! to do (based on j when with=FALSE).

@MichaelChirico (Member Author):

Thanks Frank. Will definitely update patterns before merging.

The point about patterns working differently is good to keep in mind... but I think the difference is fairly subtle. What I'm going for is avoiding the cognitive overload of many functions to keep track of, and the usage is close enough that I just went with patterns. The main difference in functionality comes, I think, from what happens with the 2nd, 3rd, etc. arguments, so maybe we just disallow those here? I've added this to the main post as something to sort out.

Agree about ! changing to union not actually making sense.

I'm not sure what the expected output of like_all("boo", !"yah", "foo", !"bar") would be? It is also technically possible to use lazy eval to handle something like "boo" & !"yah" & "foo" & !"bar" but I think that's probably overkill...

@HughParsonage (Member):

My 2c:

I'm not really a fan of functions that are implicit within []; I'd prefer an .SDgrep argument, possibly as a list, to pass arguments like perl and invert. It would also allow us to keep .SDcols as an "and" argument. I fear bugs like someone defining patterns outside the data.table call and expecting something slightly different.

DT[, .SD, .SDgrep = list(pattern = c("^V", "9"), invert = c(FALSE, TRUE))]
DT[, .SD, .SDgrep = "^V"] # if atomic just pass to first list argument.

Similarly, I think this would be easier to reason about and maintain than trying to interpret ! and - prefixes.
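A rough base-R sketch of the proposed semantics (the .SDgrep argument and the sd_grep helper below are hypothetical, not data.table API):

```r
# Hypothetical sketch of the proposed .SDgrep semantics: each pattern narrows
# the selection, and invert[i] flips the i-th pattern's match.
sd_grep <- function(nm, pattern, invert = rep(FALSE, length(pattern))) {
  keep <- rep(TRUE, length(nm))
  for (i in seq_along(pattern)) {
    hit <- grepl(pattern[[i]], nm)
    keep <- keep & if (invert[[i]]) !hit else hit
  }
  nm[keep]
}
sd_grep(paste0("V", 1:10), c("^V", "9"), invert = c(FALSE, TRUE))
# [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V10"
```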

@MichaelChirico (Member Author) commented Dec 7, 2018

@HughParsonage FWIW we've had that NSE trick in melt.data.table for quite some time and I don't recall any complaints about it (doesn't mean there aren't any or the point is irrelevant)...

Interpretation of - and ! hasn't changed (same functionality as has been used in interpreting .SDcols basically as far back as I can remember...)

I'm not sure how useful it is to make this functionality extremely flexible, the more I think about it... probably 80-90% of the target use cases are as simple as your .SDgrep = '^V' example.

I think the main reason for having this in the first place is the same as the reason it's built on an implicitly-interpreted function -- convenience/dynamism -- being able to apply the pattern filtering without having to refer to the table itself, since the table is in most cases the natural source of the names being filtered. This is especially useful when chaining, where names(x) isn't available.

I don't really have a rigorous way of evaluating the cost/benefit of adding a new argument to [... Certainly pandas' UI design, for example, does not hesitate to add new arguments. But it feels strange to me (only me, maybe?) to add a new argument (.SDgrep) that has a 90% conceptual/cognitive overlap with an existing argument (.SDcols), when I feel like most of the novelty/differentiation is in the back-end (arguments being the user-facing front end)...

Another thing that came to mind at some point was .SDcols = .GREP(...) where .GREP is simply a grep wrapper where the x argument is automatically set to names(x)? In this case we could even get away with just overwriting grep internal to each [ call (no new function or argument name required):

`[.data.table` = function(...) {
  # potentially search `substitute(.SDcols)` for `grep` calls before defining this
  grep = function(pattern, ignore.case = FALSE, perl = FALSE, value = FALSE,
                  fixed = FALSE, useBytes = FALSE, invert = FALSE) {
    # must dispatch to base::grep explicitly, or this wrapper would recurse
    base::grep(pattern, names(x), ignore.case = ignore.case, perl = perl,
               value = value, fixed = fixed, useBytes = useBytes, invert = invert)
  }
  ...
}

@franknarf1 (Contributor) commented Dec 7, 2018

I think the main reason for having this in the first place is the same as the reason it's built on an implicitly-interpreted function -- convenience/dynamism -- being able to apply the pattern filtering without having to refer to the table itself, since the table is in most cases the natural source of the names being filtered. This is especially useful when chaining, where names(x) isn't available.

I feel like this was brought up before, but could there just be a symbol for "self"? So...

.SDcols = like(.NM, "foo")
.SDcols = startsWith(.NM, "foo")
.SDcols = seq_along(.NM) %% 3 == 1

In that case, existing functions (grep or other) can still be relied on without inventing new ones.
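The idea above could be sketched in base R along these lines (a hypothetical mechanism: .NM and the sd_cols helper are illustrative names, not existing data.table API):

```r
# Hypothetical sketch of a self-referential name symbol: evaluate the .SDcols
# expression with .NM bound to the table's column names, and interpret a
# logical result as a column mask.
sd_cols <- function(x, expr) {
  sel <- eval(expr, envir = list(.NM = names(x)), enclos = parent.frame())
  if (is.logical(sel)) names(x)[sel] else sel
}
df <- data.frame(V1 = 1, V2 = 2, c1 = 3)
sd_cols(df, quote(startsWith(.NM, "V")))      # "V1" "V2"
sd_cols(df, quote(seq_along(.NM) %% 2 == 1))  # "V1" "c1"
```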

@MichaelChirico (Member Author):

I'm not seeing anything like .NM, but #2130 is close...

@jangorecki (Member):

IMO if we could refer to names(.SD) inside the .SDcols argument, we would have enough flexibility to handle whatever we want using data.table-independent API. The only extra thing that makes sense is patterns, as we already have that helper in melt.

@HughParsonage (Member):

OK I'm less keen on .SDgrep than I was; I think .SDcols is the right place to implement this.

But I think your .GREP idea is better: it makes it clearer that the function should not be called directly. I also think the .and and .but.not arguments from my select_grep are important: I've used them a lot more than I thought I would.

@MichaelChirico (Member Author) commented Dec 8, 2018 via email

@HughParsonage (Member):

Something like

       x income2014 tax2014 income2015 tax2015 income2016 tax2016 income2017 tax2017
  1:   1      41149 16043.7      45429 10834.0      30092  4998.6      35139  6550.6
  2:   2      66379  9605.3      38110  7119.0      75009 12893.1      41810  6558.4
  3:   3      61011  8835.2      45726  6010.2      50157  4438.5      58424 12087.6
  4:   4      35457 19485.7      28610  3500.9      61623  8216.8      57302  6799.9
  5:   5      31649  7736.1      36109  4614.4      48183  4306.5      53117 17794.3
 ---                                                                                
496: 496      37231  6314.5      62428 10677.4      58065  6880.2      48274  3092.2
497: 497      60474 12755.9      61201  5214.9      57968  6566.2      45954 12666.0
498: 498      51132  9113.2      45104 27374.2      77910 31478.2      44016 16641.3
499: 499      44300  7711.9      42577  7940.4      55625  4154.7      46742 12233.1
500: 500      36028 12853.6      62328  9945.5      44063 11789.2      58207 10863.0

And then select_grep("tax", .and = "x").
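A simplified base-R sketch of what that call would select on the table above (select_like is an illustrative stand-in; see Hugh's select_grep for the real semantics):

```r
# Simplified sketch of the select_grep("tax", .and = "x") idea:
# keep columns matching the pattern, plus those named in .and,
# minus any matching .but.not.
select_like <- function(nm, pattern, .and = NULL, .but.not = NULL) {
  keep <- grepl(pattern, nm) | nm %in% .and
  if (!is.null(.but.not)) keep <- keep & !grepl(.but.not, nm)
  nm[keep]
}
nm <- c("x", paste0(c("income", "tax"), rep(2014:2017, each = 2)))
select_like(nm, "tax", .and = "x")
# [1] "x"       "tax2014" "tax2015" "tax2016" "tax2017"
```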

@Atrebas commented Dec 8, 2018

Thanks for the ping. As a user, what I like about data.table is that it offers conciseness and flexibility at the same time. I definitely agree with @MichaelChirico's comment about avoiding "cognitive overload". When upvoting this feature, DT[ , lapply(.SD, sum), .SDcols = patterns('^V')] was indeed clearly what I had in mind and I think it would cover most, if not all, of my use cases.
Kudos for your work!

@mattdowle mattdowle added this to the 1.12.0 milestone Dec 14, 2018
@mattdowle mattdowle changed the title RFC: #3185 and #1878 -- patterns in .SDcols RFC: .SDcols=patterns() Dec 14, 2018
@mattdowle mattdowle merged commit 02642ab into master Dec 14, 2018
@mattdowle mattdowle deleted the sd_patterns branch December 14, 2018 22:31