Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce fexplode function #4156

Draft
wants to merge 20 commits into
base: master
Choose a base branch
from
Draft

introduce fexplode function #4156

wants to merge 20 commits into from

Conversation

MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Jan 2, 2020

Closes #2146, part of #3189

Still plenty to do, initial basic idea is here. Haven't benchmarked vs. ad-hoc approaches yet either.

TODO:

  • Consider/handle factor columns [via rbindlist]
  • Either stop or handle type bumping for mismatched nested types (e.g. list(1, 1L, 'a')) [via rbindlist]
  • Better input validation generally
  • Fix erroneous logic for "Cartesian unnesting", i.e., the case when we unnest more than one list column at once to get the cartesian product of the elements [via cj]
  • Any possibility to drop any loops?
  • Restore OMP shutoff valve for CJ loops
  • Fix handling of row.names
  • Compare option of separate handling for cartesian unnesting vs simple unnesting
  • type argument -- 'cartesian' to do cross-product of multiple inputs; 'matched' (?name) works more like tidyr::unnest
  • Robustness -- what to do with empty lists? Needs an option? (Courtesy Frank's comment on the issue)
  • Flexibility for cols -- accept int-as-numeric, column names
  • Stress test memory (getting some session crashes)
  • [ ] Make it work recursively? [only if requested... pending a follow-up issue...]
  • Benchmark (vs alternatives from base/dplyr, kdb::ungroup, and ad hoc versions using existing data.table code)
  • Add man page
  • tests+coverage
  • settle naming issue
  • Add NEWS

@TysonStanley
Copy link
Member

Are you thinking of something like tidyfast::dt_unnest() or tidyfast::dt_hoist() (see here)? Or something that works inside of j?

@MichaelChirico
Copy link
Member Author

Thanks for the reminder! I thought I was missing something.

Like dt_unnest when there's only one column, but the signature is the full table for now:

library(data.table)
x = setDT(list(V1 = 1:2, V2 = 3:4, V3 = list(1:3, 1:2), V4 = list(1L, 1:3)))
unnest(x)
#       V1    V2    V3    V4
#    <int> <int> <int> <int>
# 1:     1     3     1     1
# 2:     1     3     2     1
# 3:     1     3     3     1
# 4:     2     4     1     1
# 5:     2     4     1     1
# 6:     2     4     1     2
# 7:     2     4     2     2
# 8:     2     4     2     3
# 9:     2     4     2     3

i.e., it's doing "cartesian unnesting" (if in one row there's a list with 4 elements, and the other there's 6 elements, this will make 24 rows).

I couldn't get dt_unnest to work on this example btw -- both of these failed:

dt_unnest(x, 3)
dt_unnest(x, 'V3')

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some few initial comments, not really a review

src/unnest.c Outdated
int n = LENGTH(VECTOR_ELT(x, 0));
int p = LENGTH(x);
int row_counts[n];
SEXPTYPE col_types[p];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could segfault for many rows or columns

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would also be obviated by switch to cols...

The issue here is if we try to instantiate SEXPTYPE array with >INT_MAX elements?

src/unnest.c Outdated Show resolved Hide resolved
src/unnest.c Outdated Show resolved Hide resolved
src/unnest.c Outdated Show resolved Hide resolved
@MichaelChirico
Copy link
Member Author

Looking again, I'm thinking of just using this instead:

rbindlist(lapply(1:nrow(x), function(ii) do.call(CJ, c(unlist(x[ii], recursive = FALSE), sorted=FALSE))))

because we immediately leverage code built into CJ and rbindlist that would be reused for unnest anyway.

The unnest part makes me a bit uneasy, but the approach I was trying just now would have been basically to do this above at the C level.

To allow a cols argument, we'd have to mess with unlist to only unlist the columns specified in cols (maybe that part could be a C helper).

Any thoughts?

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jan 3, 2020

Hmm... as straightforward as it looks, it's much slower than the current implementation of unnest.c:

x = setDT(list(
  1:1e4,
  1e4 + 1:1e4,
  lapply(integer(1e4), function(...) 1:sample(5L, 1L)),
  lapply(integer(1e4), function(...) 1:sample(5L, 1L))
))

system.time(unnest(x))
  #  user  system elapsed 
  # 0.010   0.001   0.011 
system.time(rbindlist(lapply(1:nrow(x), function(ii) do.call(CJ, c(unlist(x[ii], recursive = FALSE), sorted=FALSE)))))
 #   user  system elapsed 
 # 10.773   0.134   2.771 

A quick look at the profile suggests it's the loop itself slowing things down...

At a glance, current implementation working pretty well (not 100% clear this will stay the same once the TODO are all fixed):

# same for 1e6 rows
system.time(unnest(x))
   user  system elapsed 
  0.792   0.070   0.869 

@jangorecki
Copy link
Member

jangorecki commented Jan 3, 2020

What if you call CJ and rbindlist from C? Ideally would be to isolate CJ from R C api then it can be called from C without unnecessary overhead.

@TysonStanley
Copy link
Member

Ah, none of the functions in tidyfast will do cartesian unnesting (rather they require the lengths to be the same across the various columns being unnested), as it basically copies the functionality from tidyr. So, ignore my comment from earlier :)

@MichaelChirico
Copy link
Member Author

What if you call CJ and rbindlist from C? Ideally would be to isolate CJ from R C api then it can be called from C without unnecessary overhead.

Yes I think we're on the same page there. Maybe I'll add unnest to cj.c, I think they're pretty similar in implementation.

@jangorecki
Copy link
Member

If skipping CJ part can give extra speed up then probably provide an extra argument to control that.

@MichaelChirico
Copy link
Member Author

If skipping CJ part can give extra speed up

CJ is only necessary for >1 column to unnest... could do a branch for the two cases, will think about how different the logic would be.

@MichaelChirico
Copy link
Member Author

Problem with re-using CJ is the parallelism -- I think most use cases for unnesting have very small inputs to CJ, shouldn't really need to parallelize...

however, i just found this, which looks quite interesting:

https://stackoverflow.com/a/7413460/3576984

#pragma omp parallel for if (size >= OMP_MIN_VALUE)

@jangorecki
Copy link
Member

jangorecki commented Jan 4, 2020

AFAIR we already use pragma omp if in froll

@MichaelChirico
Copy link
Member Author

I'm seeing only v limited usage:

src/nafill.c:163:  #pragma omp parallel for if (nx>1) num_threads(getDTthreads())
src/types.c:72:  #pragma omp parallel for if (nx*nk>1) schedule(auto) collapse(2) num_threads(getDTthreads())
src/frollR.c:202:  #pragma omp parallel for if (ialgo==0 && nx*nk>1) schedule(auto) collapse(2) num_threads(getDTthreads())

Doesn't this provide a nice window for shutting off parallelism for small-cardinality cases that's been the source of some speed issues in other cases?

@@ -356,8 +356,6 @@ CJ = function(..., sorted = TRUE, unique = FALSE)
if (unique) l[[i]] = unique(y)
}
}
nrow = prod( vapply_1i(l, length) ) # lengths(l) will work from R 3.2.0
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

migrated this to C, epsilon more efficient but anyway cleaner to have more logic there; will also make it easier to switch to long vector support

@MichaelChirico
Copy link
Member Author

should be ready for testing. a few tweaks needed & then dotting i's. just ran out of time for now

@ColeMiller1
Copy link
Contributor

On 10,000 rows, it looks like the performance is slower than your initial benchmarks. Is it just me?

##remotes::install_github("Rdatatable/data.table", ref = "unnest")
library(data.table)

n = 1e4
x = setDT(list(1:n,
               n + 1:n,
               lapply(integer(n), function(...) 1:sample(5L, 1L))
               ,lapply(integer(n), function(...) 1:sample(5L, 1L))))

system.time(unnest(x))
##   user  system elapsed 
##   1.22    1.28    2.32 

Also, there are typically fewer row names than your alternative rbindlist(..., do.call(CJ, ...)) method. I discovered this when identical() resulted in FALSE.

x = setDT(list(V1 = 1:2, V2 = 3:4, V3 = list(1:3, 1:2), V4 = list(1L, 1:3)))
attr(unnest(x), 'row.names')
## [1] 1 2
attr(rbindlist(lapply(1:nrow(x), function(ii) do.call(CJ, c(unlist(x[ii], recursive = FALSE), sorted=FALSE)))), 'row.names')
## [1] 1 2 3 4 5 6 7 8 9

@MichaelChirico
Copy link
Member Author

Thanks @colem1 , and good catch w.r.t. row.names. That's copyMostAttrib's fault I guess.

For the slowdown, do you have a lot of cores on your machine? Either way could you try again?

I just turned on conditional parallelism for CJ so it won't try and parallel loops over < 1024 iterations. For me, the speed had gone from the .01 I initially reported to .04; with the latest push it's .02. n=1e6 slowed down for me a lot... will investigate, but I'm a bit loath to compare to the original version because the answer's wrong 😅

@ColeMiller1
Copy link
Contributor

I think the functions would parse easier as f.unnest or f.sort but may be too late for that convention. unwrap would be pretty good on its own without the f.

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jan 4, 2020

@TysonStanley or anyone else more familiar with tidyr, am I using unnest right? it is really quite slow (about 300x slower for current implementation... and it basically just won't run for nn=1e5)

> nn = 1e4
> x = setDT(list(
+   1:nn,
+   nn + 1:nn,
+   lapply(integer(nn), function(...) 1:sample(20L, 1L))
+ ))
> 
> system.time(funnest(x))
   user  system elapsed 
  0.032   0.001   0.034 
> 
> system.time(unnest(x, V3))
   user  system elapsed 
  8.070   1.129   9.227 
> 
> system.time(x[ , c(.SD[rep(1:.N, lengths(V3))], list(V3 = unlist(V3))), .SDcols=!'V3'])
   user  system elapsed 
  0.023   0.001   0.013

output seems correct for nn=1e4 so it seems I'm using it right...

1e5 finally ran... we are 9000x faster 🤔

> nn = 1e5
> x = setDT(list(
+   1:nn,
+   nn + 1:nn,
+   lapply(integer(nn), function(...) 1:sample(20L, 1L))
+ ))
> 
> system.time(funnest(x))
   user  system elapsed 
  0.105   0.006   0.110 
> 
> system.time(unnest(x, V3))

system.time(x[ , c(.SD[rep(1:.N, lengths(V3))], list(V3 = unlist(V3))), .SDcols=!'V3'])
    user   system  elapsed 
 929.743   83.603 1024.019 
> 
> system.time(x[ , c(.SD[rep(1:.N, lengths(V3))], list(V3 = unlist(V3))), .SDcols=!'V3'])
   user  system elapsed 
  0.184   0.014   0.066 

@TysonStanley
Copy link
Member

@MichaelChirico this is looking awesome! Your usage of tidyr::unnest() looks correct to me. It is pretty slow in most situations. I replicated your results on my machine for 1e4 (didn't even try the 1e5...).

Some things to consider (as they are useful and tidyr::unnest() provides them):

  • Is funnest() (or whatever name we go with!) only for list-columns of vectors? tidyr::unnest() also handles list-columns of data frames, which can be very useful in a number of situations. If it is for vectors then we'll want to be explicit about that as those familiar with tidyr will expect functionality for data frames.
  • The cols arg currently only accepts integers (this may just be because it is the initial version). In many cases, I don't want all the list-columns to be unnested, just want one or two. It would be really useful to have the ability to provide names of columns (either quoted or unquoted) to unnest.
  • If there are other list-columns that we don't unnest, the unnested data can get really big with the repeated list-column values. Would we want to drop those (with a message) or at least warn users of any issues of not unnesting the other list-columns? Not really sure what is best here...

Thanks for putting this together, @MichaelChirico ! I will definitely be able to use this function.

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jan 5, 2020

Thanks, very helpful. I tried funnest on the example from tidyfast::dt_unnest and it led to some very broken things 😓 glad to have seen this now.

For 2, yes, should be easy enough to support.

For 3, I think for the list columns, the same amount of data is present scratch that I chose a bad example. I see what you mean esp. in the case when the nested structure is a decently large data.table, e.g. will have to trust the user 😄

@TysonStanley
Copy link
Member

Also, re:naming, my vote (if it counts much) is, if we are essentially replicating the behavior of tidyr::unnest() (looks like we are for the most part) then we go with something like funnest() but make it clear in the documentation that it is fast-unnest to tie it to the already learned (for many R users) verb of unnest. If the behavior is fairly different, I like the unwrap name since that would show it is different than unnest and unwrap feels like it is describing it well 🤷‍♂

@MichaelChirico
Copy link
Member Author

For dt_unnest, actually i think maybe the intended behavior of tidyr::unnest and what I implemented here are different.

funnest here will have output with the same number of columns -- it's really about reshaping long. Whereas unnest is kind of wide-and-long.

So I'm leaning towards using explode (probably fexplode since explode has collision with sparklyr.nested and SparkR).

@MichaelChirico
Copy link
Member Author

My two cents re: unwrap is that I don't see much value add to introducing a new vocabulary to the "data scientist toolkit ether" (the shared vocab we have for working with pandas, dplyr, data.table, spark, SQL, etc). The behavior I'm implementing here is supposed to imitate explode in SparkQL, so I think best to align to that, vocab wise.

@TysonStanley
Copy link
Member

Ok. I see. Yeah, I would definitely go with fexplode, especially since other systems use it.

Are you still wanting to implement functionality for list-columns of class data frame/table?

@TysonStanley
Copy link
Member

My two cents re: unwrap is that I don't see much value add to introducing a new vocabulary to the "data scientist toolkit ether" (the shared vocab we have for working with pandas, dplyr, data.table, spark, SQL, etc). The behavior I'm implementing here is supposed to imitate explode in SparkQL, so I think best to align to that, vocab wise.

I totally agree. We don't need more terms if a term for it already exists.

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jan 5, 2020

@TysonStanley there's some other bug obscuring things a bit here (#4159) but I think the long-and-wide unnesting of data.table-in-list should be as simple as:

dt <- data.table(
  x = rnorm(1e5),
  y = runif(1e5),
  grp = sample(1L:3L, 1e5, replace = TRUE)
  )

nested <- tidyfast::dt_nest(dt, grp)
nested[ , data[[1L]], by = grp]

So I'm not sure we need to add a new function for that? Or maybe some more complicated things are involved that I'm not seeing...

@MichaelChirico
Copy link
Member Author

OK, with the fix in #4161, indeed we can just use [[ by group to do what unnest is doing; hopefully eventually it could be GForced as well:

dt <- data.table(
  x = rnorm(1e5),
  y = runif(1e5),
  grp = sample(1L:3L, 1e5, replace = TRUE)
  )

nested <- tidyfast::dt_nest(dt, grp)
nested[ , data[[1L]], by = grp]
#           grp            x         y
#         <int>        <num>     <num>
#      1:     1  0.297592377 0.4391232
#      2:     1  1.779575646 0.8837532
#      3:     1 -0.134158583 0.7179423
#      4:     1 -0.004123587 0.7993645
#      5:     1 -0.726821276 0.9287760
#     ---                             
#  99996:     3 -1.585162305 0.0606167
#  99997:     3 -1.128405293 0.6230116
#  99998:     3 -1.256436885 0.4428009
#  99999:     3 -1.904886406 0.1553449
# 100000:     3  0.232626958 0.8745510

@MichaelChirico MichaelChirico changed the title WIP: introduce unnest function WIP: introduce fexplode function Jan 5, 2020
@ColeMiller1
Copy link
Contributor

It almost seems like there are two functions in one. For a field of data.tables, we should expect one data.table per row - grouping seems unnecessary. Using rbindlist seems pretty relevant:

data.table(rep(nested[['grp']], vapply(nested[['data']], nrow, integer(1))),
           rbindlist(nested[['data']]))

Then for actual lists, your excellent work on fexplode() does the trick. The only question I have on that seems to be whether the option of allowing cartesian or not should be present.

library(data.table)
x = setDT(list(V1 = 1:2, V2 = 3:4, V3 = list(1:2, 1:2), V4 = list(2:3, 2:3)))
tidyfast::dt_hoist(x, V3, V4) ##non-cartesian
funnest(x) ##cartesian

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Jan 6, 2020

In fact I've added to the TODO & begun work on type argument.

type='cartesian' will be as now

type='matched' will do like tidyr::unnest does. (I'm open to a different name than 'matched')

Seems a lot of the use cases cited wanted 'matched' so I'll make that the default.

@jangorecki
Copy link
Member

I would stick to funnest name, assuming type='matched' will be implemented, and maybe a default?

@MichaelChirico
Copy link
Member Author

Yes & yes, however, there is still a major difference vs. tidyr::unnest that fexplode will always return the same number of columns as the input. As Tyson pointed out users accustomed to the long-and-wide expansion method might be surprised if we use funnest

@hope-data-science
Copy link

See some of my tries (https://hope-data-science.github.io/tidyfst/reference/nest.html), the only problem seems to be the data class of integer and double, nothing else.

@D3SL
Copy link

D3SL commented Apr 21, 2020

See some of my tries (https://hope-data-science.github.io/tidyfst/reference/nest.html), the only problem seems to be the data class of integer and double, nothing else.

That's exactly the problem I've been running into with pretty much every attempt at "unnesting" in data.table. When attempting to unlist a mixture of numeric types it just dies on the spot. It's a shame because being able to unnest makes producing "long" tables when starting with start/end sequences in two separate columns trivial.

@MichaelChirico MichaelChirico marked this pull request as draft December 14, 2023 11:21
@MichaelChirico MichaelChirico changed the title WIP: introduce fexplode function introduce fexplode function Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create an efficient unnest function
6 participants