Replies: 3 comments 4 replies
-
Can't you do
Did you check this for flatMapGroups? Finally, this pretty much explains why we didn't invest in the RDD APIs: #101 (comment)
-
@imback82 You asked me to share some sample code. Below is an example of using mapPartitions in Scala to add a new column of dimension keys (surrogates) to a DataFrame. It receives a DataFrame and attaches a new column named DIM_ResultSurrogate. The goal is to prepare the dimension data so that, when the fact data is eventually added to the database, we can associate every fact record with its related dimension. DIM_ResultSurrogate is a surrogate record identifier that represents a combination of a given "InvoiceTypeCode", "InvoiceJournalizationCode", and "PostedFlag".

Notice that I don't need to care how the records are partitioned, and I don't need to do any grouping; I just operate on whatever partitions are in place. I don't show the full code (e.g. how I create transformedIterable), but it should give you the general idea. Within the mapPartitions delegate I can work with a large batch of records (p_Iterator), and I can also do preparation and cleanup work at the top and bottom of the delegate (represented here by opening and closing a database connection).
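The attached sample code isn't reproduced in this export, so here is a minimal sketch of the pattern described above, simulated in plain Python with no Spark dependency: per-partition setup, a streaming transformation that attaches a DIM_ResultSurrogate key to each record, and per-partition teardown. The `FakeConnection` class and the surrogate lookup are illustrative assumptions, not the author's actual code.

```python
# Sketch of the mapPartitions pattern: per-partition setup/teardown
# around a record-by-record transformation. Plain-Python stand-in, not Spark.

def map_partitions(partitions, func):
    """Apply `func` to each partition's iterator, like RDD.mapPartitions."""
    return [list(func(iter(part))) for part in partitions]

class FakeConnection:
    """Illustrative stand-in for a database connection (assumption)."""
    def __init__(self):
        self.open = True
    def lookup_surrogate(self, invoice_type, journal_code, posted):
        # Pretend each distinct key combination maps to a surrogate id.
        return hash((invoice_type, journal_code, posted)) % 1000
    def close(self):
        self.open = False

def attach_surrogates(p_iterator):
    conn = FakeConnection()              # preparation at the top of the delegate
    try:
        for row in p_iterator:
            key = conn.lookup_surrogate(
                row["InvoiceTypeCode"],
                row["InvoiceJournalizationCode"],
                row["PostedFlag"],
            )
            yield {**row, "DIM_ResultSurrogate": key}
    finally:
        conn.close()                     # cleanup at the bottom

partitions = [
    [{"InvoiceTypeCode": "A", "InvoiceJournalizationCode": "J1", "PostedFlag": True}],
    [{"InvoiceTypeCode": "B", "InvoiceJournalizationCode": "J2", "PostedFlag": False}],
]
result = map_partitions(partitions, attach_surrogates)
```

Note how no grouping or repartitioning is needed: each partition is processed wherever it already sits, and the connection is opened once per partition rather than once per record.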
-
Hi! Are there any plans to support
-
I see that one of the topmost goals on the roadmap is improving .NET support for DataFrame operations.
But I don't know if I can wait. I'm pretty stuck without the ability to run mapPartitions. This is critical functionality that we use regularly in Scala/Python; we rely heavily on methods like mapPartitions and flatMapGroups. These don't appear to be available in .NET for Spark yet, unless I'm missing something. I only found a few references to these method names in the original RDD code, which appears to be deprecated as of version 1.
As a side question, how do people use that RDD code if they really, really need it? Are they forking the entire repo and changing all the internals to public? Or maybe they are using reflection to execute that stuff?
I did find the DataFrame methods for GroupBy/Apply, and these may allow me to accomplish some portion of the work, but they aren't a perfect fit. Ideally there would be a way to send mapping operations out to whatever partitions already exist, where they sit.
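To illustrate why GroupBy/Apply is only a partial substitute, here is a rough plain-Python sketch (no Spark, hypothetical data and functions): a group-then-apply operation has to collect all records sharing a key before the function runs, whereas a mapPartitions-style operation processes each partition exactly where it already lies.

```python
# Contrast: group-then-apply vs. operate-on-partitions-in-place.
from collections import defaultdict

def group_by_apply(rows, key_func, func):
    """GroupBy/Apply style: shuffle rows into groups first, then apply."""
    groups = defaultdict(list)
    for row in rows:                 # implies moving every row to its group
        groups[key_func(row)].append(row)
    return {k: func(v) for k, v in groups.items()}

def map_partitions(partitions, func):
    """mapPartitions style: each partition is processed where it sits."""
    return [func(part) for part in partitions]

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]

# GroupBy/Apply must regroup by key before summing:
totals = group_by_apply(rows, lambda r: r["k"], lambda g: sum(r["v"] for r in g))
# totals == {"a": 4, "b": 2}

# mapPartitions just transforms whatever rows happen to share a partition:
tagged = map_partitions([rows[:2], rows[2:]],
                        lambda p: [dict(r, seen=True) for r in p])
```

The grouping step is what forces a shuffle in a real cluster; when the work doesn't depend on key boundaries, the mapPartitions form avoids that cost entirely.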