introduce fexplode function #4156

MichaelChirico · 2020-01-02T16:57:09Z

Closes #2146, part of #3189

Still plenty to do, but the initial basic idea is here. I haven't benchmarked against ad-hoc approaches yet either.

TODO:

TysonStanley · 2020-01-02T19:39:19Z

Are you thinking of something like tidyfast::dt_unnest() or tidyfast::dt_hoist() (see here)? Or something that works inside of j?

MichaelChirico · 2020-01-03T01:37:36Z

Thanks for the reminder! I thought I was missing something.

Like dt_unnest when there's only one column, but the signature is the full table for now:

library(data.table)
x = setDT(list(V1 = 1:2, V2 = 3:4, V3 = list(1:3, 1:2), V4 = list(1L, 1:3)))
unnest(x)
#       V1    V2    V3    V4
#    <int> <int> <int> <int>
# 1:     1     3     1     1
# 2:     1     3     2     1
# 3:     1     3     3     1
# 4:     2     4     1     1
# 5:     2     4     1     1
# 6:     2     4     1     2
# 7:     2     4     2     2
# 8:     2     4     2     3
# 9:     2     4     2     3

i.e., it's doing "cartesian unnesting": if, within one row, one list column has 4 elements and another has 6, that row expands to 24 rows.
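As a rough illustration of the cartesian expansion described above, here is a minimal C sketch. It is not the code in src/unnest.c, and the names `cartesian_count` and `cartesian_index` are hypothetical. It assumes the per-row lengths of the list columns are already known and non-zero, and it decodes an output row number into one index per list column, with the last column varying fastest.

```c
#include <stddef.h>

/* Number of output rows for one input row: the product of the lengths
   of its list-column entries. */
size_t cartesian_count(const size_t *lens, size_t ncol) {
  size_t n = 1;
  for (size_t j = 0; j < ncol; j++) n *= lens[j];
  return n;
}

/* Decode output row r (0-based) into one 0-based index per list column.
   Works like a mixed-radix counter; every lens[j] must be > 0. */
void cartesian_index(size_t r, const size_t *lens, size_t ncol, size_t *idx) {
  for (size_t j = ncol; j-- > 0; ) {
    idx[j] = r % lens[j];
    r /= lens[j];
  }
}
```

For lengths 4 and 6, `cartesian_count` gives 24, and calling `cartesian_index` for r = 0..23 enumerates every pairing exactly once.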

I couldn't get dt_unnest to work on this example btw -- both of these failed:

dt_unnest(x, 3)
dt_unnest(x, 'V3')

jangorecki

A few initial comments, not really a review.

jangorecki · 2020-01-03T03:03:25Z

src/unnest.c

+  int n = LENGTH(VECTOR_ELT(x, 0));
+  int p = LENGTH(x);
+  int row_counts[n];
+  SEXPTYPE col_types[p];


Could segfault for many rows or columns

Would also be obviated by switch to cols...

The issue here is if we try to instantiate SEXPTYPE array with >INT_MAX elements?
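On the segfault concern: a likely cause (an assumption here, not something confirmed in the thread) is that variable-length arrays such as `int row_counts[n]` are placed on the stack, which is usually limited to a few megabytes, so a large `n` could overflow it long before reaching INT_MAX elements. A minimal sketch of the heap-based alternative follows. `alloc_row_counts` is a hypothetical helper, and plain `calloc` is used only to keep the example self-contained; inside an R package, `R_alloc` would be the more usual choice.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n zero-initialized ints on the heap instead of declaring
   `int row_counts[n]` on the stack. Returns NULL if n is 0, if the byte
   count would overflow size_t, or if the allocation fails. */
int *alloc_row_counts(size_t n) {
  if (n == 0 || n > SIZE_MAX / sizeof(int)) return NULL;
  return calloc(n, sizeof(int));
}
```

The caller checks for NULL and calls `free` on the buffer when done.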

MichaelChirico · 2020-01-03T03:33:29Z

Looking again, I'm thinking of just using this instead:

rbindlist(lapply(1:nrow(x), function(ii) do.call(CJ, c(unlist(x[ii], recursive = FALSE), sorted=FALSE))))

because we immediately leverage code built into CJ and rbindlist that would be reused for unnest anyway.

The unnest part makes me a bit uneasy, but the approach I was trying just now would have been basically to do this above at the C level.

To allow a cols argument, we'd have to mess with unlist to only unlist the columns specified in cols (maybe that part could be a C helper).

Any thoughts?

MichaelChirico · 2020-01-03T04:03:04Z

Hmm... as straightforward as it looks, it's much slower than the current implementation of unnest.c:

x = setDT(list(
  1:1e4,
  1e4 + 1:1e4,
  lapply(integer(1e4), function(...) 1:sample(5L, 1L)),
  lapply(integer(1e4), function(...) 1:sample(5L, 1L))
))

system.time(unnest(x))
#    user  system elapsed 
#   0.010   0.001   0.011 
system.time(rbindlist(lapply(1:nrow(x), function(ii) do.call(CJ, c(unlist(x[ii], recursive = FALSE), sorted=FALSE)))))
#    user  system elapsed 
#  10.773   0.134   2.771

A quick look at the profile suggests it's the loop itself slowing things down...

At a glance, the current implementation is working pretty well (though it's not 100% clear this will hold once the TODO items are all fixed):

# same for 1e6 rows
system.time(unnest(x))
#    user  system elapsed 
#   0.792   0.070   0.869

jangorecki · 2020-01-03T04:56:23Z

What if you call CJ and rbindlist from C? Ideally would be to isolate CJ from R C api then it can be called from C without unnecessary overhead.

TysonStanley · 2020-01-03T05:04:21Z

Ah, none of the functions in tidyfast will do cartesian unnesting (rather they require the lengths to be the same across the various columns being unnested), as it basically copies the functionality from tidyr. So, ignore my comment from earlier :)

MichaelChirico · 2020-01-03T06:14:59Z

> What if you call CJ and rbindlist from C? Ideally would be to isolate CJ from R C api then it can be called from C without unnecessary overhead.

Yes I think we're on the same page there. Maybe I'll add unnest to cj.c, I think they're pretty similar in implementation.

jangorecki · 2020-01-03T06:21:18Z

If skipping CJ part can give extra speed up then probably provide an extra argument to control that.

MichaelChirico · 2020-01-03T06:29:57Z

> If skipping CJ part can give extra speed up

CJ is only necessary for >1 column to unnest... could do a branch for the two cases, will think about how different the logic would be.

MichaelChirico · 2020-01-04T01:58:57Z

Problem with re-using CJ is the parallelism -- I think most use cases for unnesting have very small inputs to CJ, shouldn't really need to parallelize...

However, I just found this, which looks quite interesting:

https://stackoverflow.com/a/7413460/3576984

#pragma omp parallel for if (size >= OMP_MIN_VALUE)

jangorecki · 2020-01-04T07:13:32Z

AFAIR we already use pragma omp if in froll

MichaelChirico · 2020-01-04T08:44:53Z

I'm seeing only very limited usage:

src/nafill.c:163:  #pragma omp parallel for if (nx>1) num_threads(getDTthreads())
src/types.c:72:  #pragma omp parallel for if (nx*nk>1) schedule(auto) collapse(2) num_threads(getDTthreads())
src/frollR.c:202:  #pragma omp parallel for if (ialgo==0 && nx*nk>1) schedule(auto) collapse(2) num_threads(getDTthreads())

Doesn't this provide a nice way to shut off parallelism for small-cardinality cases, which have been the source of some speed issues elsewhere?

MichaelChirico · 2020-01-04T10:16:14Z

R/setkey.R

@@ -356,8 +356,6 @@ CJ = function(..., sorted = TRUE, unique = FALSE)
      if (unique) l[[i]] = unique(y)
    }
  }
-  nrow = prod( vapply_1i(l, length) )  # lengths(l) will work from R 3.2.0


Migrated this to C. It's only epsilon more efficient, but it's cleaner to have more of the logic there, and it will also make it easier to switch to long vector support.

MichaelChirico · 2020-01-04T10:43:36Z

should be ready for testing. a few tweaks needed & then dotting i's. just ran out of time for now

ColeMiller1 · 2020-01-04T11:59:47Z

On 10,000 rows, it looks like the performance is slower than your initial benchmarks. Is it just me?

##remotes::install_github("Rdatatable/data.table", ref = "unnest")
library(data.table)

n = 1e4
x = setDT(list(1:n,
               n + 1:n,
               lapply(integer(n), function(...) 1:sample(5L, 1L)),
               lapply(integer(n), function(...) 1:sample(5L, 1L))))

system.time(unnest(x))
##   user  system elapsed 
##   1.22    1.28    2.32

Also, there are typically fewer row names than your alternative rbindlist(..., do.call(CJ, ...)) method. I discovered this when identical() resulted in FALSE.

x = setDT(list(V1 = 1:2, V2 = 3:4, V3 = list(1:3, 1:2), V4 = list(1L, 1:3)))
attr(unnest(x), 'row.names')
## [1] 1 2
attr(rbindlist(lapply(1:nrow(x), function(ii) do.call(CJ, c(unlist(x[ii], recursive = FALSE), sorted=FALSE)))), 'row.names')
## [1] 1 2 3 4 5 6 7 8 9

MichaelChirico · 2020-01-04T13:16:00Z

Thanks @colem1 , and good catch w.r.t. row.names. That's copyMostAttrib's fault I guess.

For the slowdown, do you have a lot of cores on your machine? Either way could you try again?

I just turned on conditional parallelism for CJ so it won't try to parallelize loops over < 1024 iterations. For me, the elapsed time had gone from the .01 I initially reported to .04; with the latest push it's .02. n=1e6 slowed down for me a lot... will investigate, but I'm a bit loath to compare to the original version because the answer's wrong 😅
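The conditional parallelism mentioned above relies on the `if` clause of `#pragma omp parallel for`: when the condition is false, the loop runs serially and avoids thread start-up overhead. Below is a minimal sketch, not the actual CJ code. The 1024 threshold is taken from the comment above, while the macro name `PAR_MIN_ITER` and the function `rep_each` are hypothetical. Compiled without OpenMP support, the pragma is simply ignored and the loop stays serial.

```c
#include <stddef.h>

#define PAR_MIN_ITER 1024  /* hypothetical name; value from the comment above */

/* Repeat each of the nsrc source values `each` times into out, which must
   have room for nsrc * each elements. Only goes parallel for larger loops. */
void rep_each(const int *src, size_t nsrc, size_t each, int *out) {
  const long long n = (long long)(nsrc * each);
  #pragma omp parallel for if (n >= PAR_MIN_ITER)
  for (long long i = 0; i < n; i++) {
    out[i] = src[(size_t)i / each];
  }
}
```

With `nsrc = 3` and `each = 2` the loop has 6 iterations and runs on a single thread; only inputs reaching the threshold pay for spinning up a thread team.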

ColeMiller1 · 2020-01-04T15:25:58Z

I think the functions would parse easier as f.unnest or f.sort but may be too late for that convention. unwrap would be pretty good on its own without the f.

MichaelChirico · 2020-01-04T15:45:26Z

@TysonStanley or anyone else more familiar with tidyr, am I using unnest right? It is really quite slow (about 300x slower than the current implementation... and it basically just won't run for nn=1e5)

> nn = 1e4
> x = setDT(list(
+   1:nn,
+   nn + 1:nn,
+   lapply(integer(nn), function(...) 1:sample(20L, 1L))
+ ))
> 
> system.time(funnest(x))
   user  system elapsed 
  0.032   0.001   0.034 
> 
> system.time(unnest(x, V3))
   user  system elapsed 
  8.070   1.129   9.227 
> 
> system.time(x[ , c(.SD[rep(1:.N, lengths(V3))], list(V3 = unlist(V3))), .SDcols=!'V3'])
   user  system elapsed 
  0.023   0.001   0.013

output seems correct for nn=1e4 so it seems I'm using it right...

1e5 finally ran... we are 9000x faster 🤔

> nn = 1e5
> x = setDT(list(
+   1:nn,
+   nn + 1:nn,
+   lapply(integer(nn), function(...) 1:sample(20L, 1L))
+ ))
> 
> system.time(funnest(x))
   user  system elapsed 
  0.105   0.006   0.110 
> 
> system.time(unnest(x, V3))
    user   system  elapsed 
 929.743   83.603 1024.019 
> 
> system.time(x[ , c(.SD[rep(1:.N, lengths(V3))], list(V3 = unlist(V3))), .SDcols=!'V3'])
   user  system elapsed 
  0.184   0.014   0.066
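
The `.SD[rep(1:.N, lengths(V3))]` idiom timed above repeats each row index once per element of that row's list entry. A minimal C sketch of that index expansion follows, assuming the lengths are precomputed. `rep_by_lengths` is a hypothetical name, and indices are 0-based here rather than R's 1-based.

```c
#include <stddef.h>

/* For each input row i, write i into out exactly lens[i] times.
   out must have room for the sum of lens. Returns the number of
   output rows; rows with length 0 contribute nothing. */
size_t rep_by_lengths(const size_t *lens, size_t nrow, size_t *out) {
  size_t k = 0;
  for (size_t i = 0; i < nrow; i++) {
    for (size_t r = 0; r < lens[i]; r++) {
      out[k++] = i;
    }
  }
  return k;
}
```

The resulting index vector can then be used to gather the non-list columns, while the list column itself is simply concatenated, which is what `unlist(V3)` does in the R version.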

TysonStanley · 2020-01-04T20:01:27Z

@MichaelChirico this is looking awesome! Your usage of tidyr::unnest() looks correct to me. It is pretty slow in most situations. I replicated your results on my machine for 1e4 (didn't even try the 1e5...).

Some things to consider (as they are useful and tidyr::unnest() provides them):

1. Is funnest() (or whatever name we go with!) only for list-columns of vectors? tidyr::unnest() also handles list-columns of data frames, which can be very useful in a number of situations. If it is for vectors then we'll want to be explicit about that as those familiar with tidyr will expect functionality for data frames.
2. The cols arg currently only accepts integers (this may just be because it is the initial version). In many cases, I don't want all the list-columns to be unnested, just want one or two. It would be really useful to have the ability to provide names of columns (either quoted or unquoted) to unnest.
3. If there are other list-columns that we don't unnest, the unnested data can get really big with the repeated list-column values. Would we want to drop those (with a message) or at least warn users of any issues of not unnesting the other list-columns? Not really sure what is best here...

Thanks for putting this together, @MichaelChirico ! I will definitely be able to use this function.

MichaelChirico · 2020-01-05T02:47:03Z

Thanks, very helpful. I tried funnest on the example from tidyfast::dt_unnest and it led to some very broken things 😓 glad to have seen this now.

For 2, yes, should be easy enough to support.

For 3, I think for the list columns, ~~the same amount of data is present~~ scratch that, I chose a bad example. I see what you mean, especially when the nested structure is, e.g., a decently large data.table. We'll have to trust the user 😄

TysonStanley · 2020-01-05T04:28:05Z

Also, re: naming, my vote (if it counts for much): if we are essentially replicating the behavior of tidyr::unnest() (looks like we are, for the most part), then we go with something like funnest(), but make it clear in the documentation that it is fast-unnest, to tie it to the verb unnest that many R users have already learned. If the behavior is fairly different, I like the unwrap name, since that would show it differs from unnest, and unwrap feels like it describes it well 🤷‍♂

MichaelChirico · 2020-01-05T04:31:41Z

For dt_unnest, actually i think maybe the intended behavior of tidyr::unnest and what I implemented here are different.

funnest here will have output with the same number of columns -- it's really about reshaping long. Whereas unnest is kind of wide-and-long.

So I'm leaning towards using explode (probably fexplode since explode has collision with sparklyr.nested and SparkR).

MichaelChirico · 2020-01-05T04:33:56Z

My two cents re: unwrap is that I don't see much value add to introducing a new vocabulary to the "data scientist toolkit ether" (the shared vocab we have for working with pandas, dplyr, data.table, spark, SQL, etc). The behavior I'm implementing here is supposed to imitate explode in Spark SQL, so I think best to align to that, vocab wise.

TysonStanley · 2020-01-05T04:35:05Z

Ok. I see. Yeah, I would definitely go with fexplode, especially since other systems use it.

Are you still wanting to implement functionality for list-columns of class data frame/table?

TysonStanley · 2020-01-05T04:36:09Z

> My two cents re: unwrap is that I don't see much value add to introducing a new vocabulary to the "data scientist toolkit ether" (the shared vocab we have for working with pandas, dplyr, data.table, spark, SQL, etc). The behavior I'm implementing here is supposed to imitate explode in Spark SQL, so I think best to align to that, vocab wise.

I totally agree. We don't need more terms if a term for it already exists.

MichaelChirico · 2020-01-05T10:38:09Z

@TysonStanley there's some other bug obscuring things a bit here (#4159) but I think the long-and-wide unnesting of data.table-in-list should be as simple as:

dt <- data.table(
  x = rnorm(1e5),
  y = runif(1e5),
  grp = sample(1L:3L, 1e5, replace = TRUE)
  )

nested <- tidyfast::dt_nest(dt, grp)
nested[ , data[[1L]], by = grp]

So I'm not sure we need to add a new function for that? Or maybe some more complicated things are involved that I'm not seeing...

MichaelChirico · 2020-01-05T12:51:49Z

OK, with the fix in #4161, indeed we can just use [[ by group to do what unnest is doing; hopefully eventually it could be GForced as well:

dt <- data.table(
  x = rnorm(1e5),
  y = runif(1e5),
  grp = sample(1L:3L, 1e5, replace = TRUE)
  )

nested <- tidyfast::dt_nest(dt, grp)
nested[ , data[[1L]], by = grp]
#           grp            x         y
#         <int>        <num>     <num>
#      1:     1  0.297592377 0.4391232
#      2:     1  1.779575646 0.8837532
#      3:     1 -0.134158583 0.7179423
#      4:     1 -0.004123587 0.7993645
#      5:     1 -0.726821276 0.9287760
#     ---                             
#  99996:     3 -1.585162305 0.0606167
#  99997:     3 -1.128405293 0.6230116
#  99998:     3 -1.256436885 0.4428009
#  99999:     3 -1.904886406 0.1553449
# 100000:     3  0.232626958 0.8745510

ColeMiller1 · 2020-01-05T17:52:49Z

It almost seems like there are two functions in one. For a field of data.tables, we should expect one data.table per row - grouping seems unnecessary. Using rbindlist seems pretty relevant:

data.table(rep(nested[['grp']], vapply(nested[['data']], nrow, integer(1))),
           rbindlist(nested[['data']]))

Then for actual lists, your excellent work on fexplode() does the trick. The only question I have on that seems to be whether the option of allowing cartesian or not should be present.

library(data.table)
x = setDT(list(V1 = 1:2, V2 = 3:4, V3 = list(1:2, 1:2), V4 = list(2:3, 2:3)))
tidyfast::dt_hoist(x, V3, V4) ##non-cartesian
funnest(x) ##cartesian

MichaelChirico · 2020-01-06T00:55:02Z

In fact I've added to the TODO & begun work on a type argument.

type='cartesian' will be as now

type='matched' will do like tidyr::unnest does. (I'm open to a different name than 'matched')

Seems a lot of the use cases cited wanted 'matched' so I'll make that the default.
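
To make the difference between the two proposed `type` values concrete, here is a minimal C sketch of the per-row output size under each mode. The enum and function names are hypothetical, and treating mismatched lengths as an error under 'matched' is an assumption (based on the earlier remark that tidyfast requires equal lengths), not necessarily what this PR or tidyr does.

```c
#include <stddef.h>

typedef enum { EXPLODE_CARTESIAN, EXPLODE_MATCHED } explode_type;

/* Number of output rows produced by one input row whose list-column
   entries have the given lengths.
   - EXPLODE_CARTESIAN: product of the lengths.
   - EXPLODE_MATCHED: the common length, or -1 if the lengths differ. */
long long explode_rows(const size_t *lens, size_t ncol, explode_type type) {
  if (ncol == 0) return 1;  /* nothing to explode: the row is kept as-is */
  if (type == EXPLODE_CARTESIAN) {
    long long n = 1;
    for (size_t j = 0; j < ncol; j++) n *= (long long)lens[j];
    return n;
  }
  for (size_t j = 1; j < ncol; j++) {
    if (lens[j] != lens[0]) return -1;
  }
  return (long long)lens[0];
}
```

For a row where both list entries have length 2 (as in the V3/V4 example above), this gives 4 rows for the cartesian mode and 2 for the matched mode.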

jangorecki · 2020-01-06T03:23:26Z

I would stick with the funnest name, assuming type='matched' will be implemented, and maybe made the default?

MichaelChirico · 2020-01-06T03:27:32Z

Yes & yes. However, there is still a major difference vs. tidyr::unnest: fexplode will always return the same number of columns as the input. As Tyson pointed out, users accustomed to the long-and-wide expansion method might be surprised if we use funnest.

hope-data-science · 2020-03-10T02:58:01Z

See some of my tries (https://hope-data-science.github.io/tidyfst/reference/nest.html), the only problem seems to be the data class of integer and double, nothing else.

D3SL · 2020-04-21T18:34:21Z

> See some of my tries (https://hope-data-science.github.io/tidyfst/reference/nest.html), the only problem seems to be the data class of integer and double, nothing else.

That's exactly the problem I've been running into with pretty much every attempt at "unnesting" in data.table. When attempting to unlist a mixture of numeric types it just dies on the spot. It's a shame because being able to unnest makes producing "long" tables when starting with start/end sequences in two separate columns trivial.

Michael Chirico added 3 commits January 2, 2020 22:06

prod_int function (probably not safe enough to use

f88e69b

initial working version of unnest

ade2a94

make note of incorrect cartesian unnesting

6777a4a

MichaelChirico added the WIP label Jan 3, 2020

jangorecki reviewed Jan 3, 2020

initial progress migrating to simpler signature

2b34cac

Michael Chirico added 2 commits January 4, 2020 11:27

extract CJ workhorse to power unnesting

fa93bee

move unnest into cj script (similar logic)

e1af1ef

Michael Chirico added 2 commits January 4, 2020 17:21

fix some bugs, but giving up on this approach for now

71f72c1

use rbindlist() and cj() internally

eaf56c6

MichaelChirico commented Jan 4, 2020

Michael Chirico added 3 commits January 4, 2020 18:23

no need for splitting out cj logic anymore

8933477

reduce diff slightly

488f679

validate input, add comments

b2214fe

conditional parallelism

c263976

Merge branch 'master' into unnest

565051b

Michael Chirico added 3 commits January 4, 2020 23:51

tests numbers

0df5db8

missed rename in man

fbbccd2

coverage; fix test

d161dfd

MichaelChirico mentioned this pull request Jan 5, 2020

returning .SD by group doesn't unlock .SD; and GForce [[ non-atomic type causes trouble #4159

Closed

MichaelChirico changed the title ~~WIP: introduce unnest function~~ WIP: introduce fexplode function Jan 5, 2020

MichaelChirico closed this Jan 6, 2020

MichaelChirico reopened this Jan 6, 2020

MichaelChirico mentioned this pull request Mar 9, 2020

List column support in data.table #4290

Closed

ColeMiller1 mentioned this pull request Jul 29, 2020

[[ by group takes forever (24 hours +) with v1.13.0 vs 4 seconds with v1.12.8 #4646

Closed

MichaelChirico removed the WIP label Dec 14, 2023

MichaelChirico marked this pull request as draft December 14, 2023 11:21

MichaelChirico changed the title ~~WIP: introduce fexplode function~~ introduce fexplode function Jan 12, 2024
