Replies: 3 comments 4 replies
-
Can't you do
Did you check this for flatMapGroups? Finally, this pretty much explains why we didn't invest in the RDD APIs: #101 (comment)
-
@imback82 You asked me to share some sample code. Below is an example of using mapPartitions in Scala to add a new column of dimension keys (surrogates) to a DataFrame. It receives a DataFrame and attaches a new column named DIM_ResultSurrogate. The goal is to prepare the dimension data so that, when the fact data is eventually added to the database, we can associate every fact record with its related dimension. DIM_ResultSurrogate is a surrogate record identifier that represents a combination of a given "InvoiceTypeCode", "InvoiceJournalizationCode", and "PostedFlag".

Notice that I don't need to care how the records are partitioned, and I don't need to do any grouping; I just operate on whatever partitions are in place. I don't show the full code (e.g. how I create transformedIterable), but it should give you the general idea. Within the mapPartitions delegate I can work with a large batch of records (p_Iterator), and I can also do preparation and cleanup work at the top and bottom of the delegate (represented here by opening and closing a database connection).
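The attached sample code isn't reproduced in this export, so here is a minimal sketch of the pattern described above, simulated in plain Python with no Spark dependency: per-partition setup, a streaming transformation that attaches a DIM_ResultSurrogate key to each record, and per-partition teardown. The `FakeConnection` class and the surrogate lookup are illustrative assumptions, not the author's actual code.

```python
# Sketch of the mapPartitions pattern: per-partition setup/teardown
# around a record-by-record transformation. Plain-Python stand-in, not Spark.

def map_partitions(partitions, func):
    """Apply `func` to each partition's iterator, like RDD.mapPartitions."""
    return [list(func(iter(part))) for part in partitions]

class FakeConnection:
    """Illustrative stand-in for a database connection (assumption)."""
    def __init__(self):
        self.open = True
    def lookup_surrogate(self, invoice_type, journal_code, posted):
        # Pretend each distinct key combination maps to a surrogate id.
        return hash((invoice_type, journal_code, posted)) % 1000
    def close(self):
        self.open = False

def attach_surrogates(p_iterator):
    conn = FakeConnection()              # preparation at the top of the delegate
    try:
        for row in p_iterator:
            key = conn.lookup_surrogate(
                row["InvoiceTypeCode"],
                row["InvoiceJournalizationCode"],
                row["PostedFlag"],
            )
            yield {**row, "DIM_ResultSurrogate": key}
    finally:
        conn.close()                     # cleanup at the bottom

partitions = [
    [{"InvoiceTypeCode": "A", "InvoiceJournalizationCode": "J1", "PostedFlag": True}],
    [{"InvoiceTypeCode": "B", "InvoiceJournalizationCode": "J2", "PostedFlag": False}],
]
result = map_partitions(partitions, attach_surrogates)
```

Note how no grouping or repartitioning is needed: each partition is processed wherever it already sits, and the connection is opened once per partition rather than once per record.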
-
Hi! Are there any plans to support
-
I see that one of the topmost goals on the roadmap is improving .NET support for DataFrame operations.
But I don't know if I can wait. I'm pretty stuck without the ability to run mapPartitions. This is critical functionality that we use regularly in Scala/Python; we rely heavily on methods like mapPartitions and flatMapGroups. These don't appear to be available in .NET for Spark yet, unless I'm missing something. I only found a few references to these method names in the original RDD code, which appears to be deprecated as of version 1.
As a side question, how do people use that RDD code if they really, really need it? Are they forking the entire repo and changing all the internals to public? Or maybe they are using reflection to execute that stuff?
I did find the DataFrame methods for GroupBy/Apply, and these may allow me to accomplish some portion of the work, but they aren't a perfect fit. Ideally there would be a way to send mapping operations out to whatever partitions already exist, where they sit.
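To illustrate why GroupBy/Apply is only a partial substitute, here is a rough plain-Python sketch (no Spark, hypothetical data and functions): a group-then-apply operation has to collect all records sharing a key before the function runs, whereas a mapPartitions-style operation processes each partition exactly where it already lies.

```python
# Contrast: group-then-apply vs. operate-on-partitions-in-place.
from collections import defaultdict

def group_by_apply(rows, key_func, func):
    """GroupBy/Apply style: shuffle rows into groups first, then apply."""
    groups = defaultdict(list)
    for row in rows:                 # implies moving every row to its group
        groups[key_func(row)].append(row)
    return {k: func(v) for k, v in groups.items()}

def map_partitions(partitions, func):
    """mapPartitions style: each partition is processed where it sits."""
    return [func(part) for part in partitions]

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]

# GroupBy/Apply must regroup by key before summing:
totals = group_by_apply(rows, lambda r: r["k"], lambda g: sum(r["v"] for r in g))
# totals == {"a": 4, "b": 2}

# mapPartitions just transforms whatever rows happen to share a partition:
tagged = map_partitions([rows[:2], rows[2:]],
                        lambda p: [dict(r, seen=True) for r in p])
```

The grouping step is what forces a shuffle in a real cluster; when the work doesn't depend on key boundaries, the mapPartitions form avoids that cost entirely.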