-
Notifications
You must be signed in to change notification settings - Fork 991
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
introduce fexplode function #4156
base: master
Are you sure you want to change the base?
Conversation
Are you thinking of something like |
Thanks for the reminder! I thought I was missing something. Like
i.e., it's doing "cartesian unnesting" (if in one row there's a list with 4 elements, and the other there's 6 elements, this will make 24 rows). I couldn't get
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some few initial comments, not really a review
src/unnest.c
Outdated
int n = LENGTH(VECTOR_ELT(x, 0)); | ||
int p = LENGTH(x); | ||
int row_counts[n]; | ||
SEXPTYPE col_types[p]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could segfault for many rows or columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would also be obviated by switch to cols
...
The issue here is if we try to instantiate SEXPTYPE
array with >INT_MAX
elements?
Looking again, I'm thinking of just using this instead:
because we immediately leverage code built into The To allow a Any thoughts? |
Hmm... as straightforward as it looks, it's much slower than the current implementation of
A quick look at the profile suggests it's the loop itself slowing things down... At a glance, current implementation working pretty well (not 100% clear this will stay the same once the TODO are all fixed):
|
What if you call CJ and rbindlist from C? Ideally would be to isolate CJ from R C api then it can be called from C without unnecessary overhead. |
Ah, none of the functions in |
Yes I think we're on the same page there. Maybe I'll add |
If skipping CJ part can give extra speed up then probably provide an extra argument to control that. |
|
Problem with re-using however, i just found this, which looks quite interesting: https://stackoverflow.com/a/7413460/3576984
|
AFAIR we already use |
I'm seeing only v limited usage:
Doesn't this provide a nice window for shutting off parallelism for small-cardinality cases that's been the source of some speed issues in other cases? |
@@ -356,8 +356,6 @@ CJ = function(..., sorted = TRUE, unique = FALSE) | |||
if (unique) l[[i]] = unique(y) | |||
} | |||
} | |||
nrow = prod( vapply_1i(l, length) ) # lengths(l) will work from R 3.2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
migrated this to C, epsilon more efficient but anyway cleaner to have more logic there; will also make it easier to switch to long vector support
should be ready for testing. a few tweaks needed & then dotting i's. just ran out of time for now |
On 10,000 rows, it looks like the performance is slower than your initial benchmarks. Is it just me?
Also, there are typically fewer row names than your alternative
|
Thanks @colem1 , and good catch w.r.t. For the slowdown, do you have a lot of cores on your machine? Either way could you try again? I just turned on conditional parallelism for |
I think the functions would parse easier as |
@TysonStanley or anyone else more familiar with
output seems correct for
|
@MichaelChirico this is looking awesome! Your usage of Some things to consider (as they are useful and
Thanks for putting this together, @MichaelChirico ! I will definitely be able to use this function. |
Thanks, very helpful. I tried For 2, yes, should be easy enough to support. For 3, I think for the list columns, |
Also, re:naming, my vote (if it counts much) is, if we are essentially replicating the behavior of |
For
So I'm leaning towards using |
My two cents re: |
Ok. I see. Yeah, I would definitely go with Are you still wanting to implement functionality for list-columns of class data frame/table? |
I totally agree. We don't need more terms if a term for it already exists. |
@TysonStanley there's some other bug obscuring things a bit here (#4159) but I think the long-and-wide unnesting of
So I'm not sure we need to add a new function for that? Or maybe some more complicated things are involved that I'm not seeing... |
OK, with the fix in #4161, indeed we can just use
|
It almost seems like there are two functions in one. For a field of
Then for actual lists, your excellent work on
|
In fact I've added to the TODO & begun work on
Seems a lot of the use cases cited wanted 'matched' so I'll make that the default. |
I would stick to |
Yes & yes, however, there is still a major difference vs. |
See some of my tries (https://hope-data-science.github.io/tidyfst/reference/nest.html), the only problem seems to be the data class of integer and double, nothing else. |
That's exactly the problem I've been running into with pretty much every attempt at "unnesting" in data.table. When attempting to unlist a mixture of numeric types it just dies on the spot. It's a shame because being able to unnest makes producing "long" tables when starting with start/end sequences in two separate columns trivial. |
Closes #2146, part of #3189
Still plenty to do, initial basic idea is here. Haven't benchmarked vs. ad-hoc approaches yet either.
TODO:
rbindlist
]list(1, 1L, 'a')
) [viarbindlist
]cj
]row.names
type
argument --'cartesian'
to do cross-product of multiple inputs;'matched'
(?name) works more liketidyr::unnest
cols
-- accept int-as-numeric
, column names[ ] Make it work recursively?[only if requested... pending a follow-up issue...]base
/dplyr
,kdb::ungroup
, and ad hoc versions using existingdata.table
code)