Guidance in relation to RelationalGroupedDataset/FlatMapGroupsInPandas #899

dbeavon · 2021-04-09T15:59:24Z

Is "FlatMapGroupsInPandas" still experimental, or are others using it successfully?

I'm not having any luck with it. There are two examples, one using fxdataframe and one using Arrow recordbatch, but neither works for me. I see the same error when running both samples:

Microsoft.Spark.Examples.Sql.Batch.VectorUdfs and
Microsoft.Spark.Examples.Sql.Batch.VectorDataFrameUdfs

I'm running spark-3.0.0-bin-hadoop3.2 on Windows.

I get the NullPointerException described here:
#468

I can try to gather more details about this from my machine. My first question is whether I should focus my effort on the fxdataframe variety of the UDF or the recordbatch variety. I think fxdataframe is the higher level of abstraction, so I was going to avoid it for now.

The stack trace looks like this. The NPE only happens after my UDF has executed (I'm able to reach my breakpoints in .NET):

java.lang.NullPointerException
	at org.apache.spark.sql.vectorized.ArrowColumnVector.getChild(ArrowColumnVector.java:131)
	at org.apache.spark.sql.execution.python.PandasGroupUtils$.$anonfun$executePython$2(PandasGroupUtils.scala:50)
	at org.apache.spark.sql.execution.python.PandasGroupUtils$.$anonfun$executePython$2$adapted(PandasGroupUtils.scala:50)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.sql.execution.python.PandasGroupUtils$.$anonfun$executePython$1(PandasGroupUtils.scala:50)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)

I have only tested this locally on Windows, but I also have access to Azure Databricks in case it behaves differently there.

Any pointers would be appreciated (aside from null ones).

dbeavon · 2021-04-09T16:36:21Z

FYI, I did a bit more testing with a newer version. As of now I no longer get the error after upgrading to Microsoft.Spark.Worker.1.1.1 and the jar from the new version of the nuget package (microsoft-spark-3-0_2.12-1.1.1.jar).

One thing I'm still not clear on is how to pick between similar functionality when the same underlying feature (e.g. FlatMapGroupsInPandas) can be used in two separate ways. Should I focus my efforts on the fxdataframe variety of the UDF or the recordbatch variety? Which is likely to be more reliable and less prone to breaking changes in future versions of .NET UDFs?

It would help to have some guidelines for picking which path to take.

Given that this is the only version I can get to work, is using 1.1.1 in production a reasonable thing to do? Or should I find another way to accomplish my groupings (perhaps by collecting to the driver)?

dbeavon Apr 10, 2021
Author

It would help to have some guidelines for picking which path to take (recordbatch vs fxdataframe).

My 2 cents, now that I've tried both:

Use recordbatch first. It just works. Then, if you have a couple of days to spare, you can struggle with fxdataframe (which can throw lots of unusual exceptions and can be challenging to use with certain datatypes).

I wish someone would tell @bamurtaugh to not be so passionate about these fxdataframes. Here is the blog that convinced me to start working with fxdataframe before recordbatch.
https://devblogs.microsoft.com/dotnet/net-for-apache-spark-in-memory-dataframe-support/

But let's face it: the Apache.Arrow nuget is at v3.0.0 and the Microsoft.Data.Analysis nuget is only at v0.4.0. Not only that, but I believe fxdataframe relies on Apache.Arrow internally, so Arrow needs to be extremely stable or they both would suffer!

It is true that you can save a few lines of code with fxdataframe, but the trade-off is a lot more time and effort trying to get it to work properly with a range of datatypes (like strings).

Fxdataframe does look like it has the potential to be a more natural API to use in the long run, but I'm sticking with recordbatch for now!
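
For anyone landing here without context on what the feature does, below is a rough conceptual sketch of the grouped-map contract behind FlatMapGroupsInPandas: rows are grouped by a key, each group is handed to the UDF as one columnar batch, and the UDF may return a batch with any number of rows. This is plain Python with no Spark, Arrow or .NET involved. `flat_map_groups`, `count_characters`, the `Batch` alias and the sample data are all hypothetical names made up for illustration, not part of any library API.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A "batch" is a plain dict of column name -> list of values.
# It only stands in for an Arrow record batch; this is an assumption
# made for illustration, not how Spark represents the data.
Batch = Dict[str, List]


def flat_map_groups(rows: List[dict], key: str,
                    udf: Callable[[Batch], Batch]) -> List[dict]:
    """Group rows by `key`, pass each group to `udf` as a columnar
    batch, and concatenate the rows that each call returns."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    out: List[dict] = []
    for group_rows in groups.values():
        # Pivot the group's rows into columns, like a record batch.
        columns = list(group_rows[0].keys())
        batch = {c: [r[c] for r in group_rows] for c in columns}
        result = udf(batch)
        # The UDF may return zero, one, or many rows per group.
        names = list(result.keys())
        n = len(result[names[0]]) if names else 0
        for i in range(n):
            out.append({c: result[c][i] for c in names})
    return out


def count_characters(batch: Batch) -> Batch:
    """Hypothetical UDF: one output row per group holding the group key
    and the total length of all names in that group."""
    total = sum(len(name) for name in batch["name"])
    return {"age": batch["age"][:1], "name_char_count": [total]}


rows = [
    {"age": 30, "name": "Andy"},
    {"age": 19, "name": "Justin"},
    {"age": 30, "name": "Michael"},
]
result = flat_map_groups(rows, "age", count_characters)
print(result)
```

The output schema (here `age` and `name_char_count`) is decided by the UDF rather than by the input, which is why the real API asks you to declare a return schema up front. The recordbatch and fxdataframe varieties discussed above differ in how the batch is exposed to your function, not in this overall contract.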
