Is "FlatMapGroupsInPandas" still experimental, or are others using it successfully? I'm not having any luck with it. There are two examples, one using fxdataframe and one using Arrow RecordBatch, but neither works for me. I see the same error when running the samples:
I'm running spark-3.0.0-bin-hadoop3.2 on Windows. I get the NullPointerException described here: I can try to gather more details about this from my machine. My first question is whether I should focus my effort on the fxdataframe variety of the UDF or the RecordBatch variety. I think fxdataframe is a higher level of abstraction, so I was going to avoid it for now. The stack looks like this. The NPE only happens after my UDF has executed (I'm able to reach my breakpoints in .NET):
I have only tested this locally on Windows, but I also have access to Azure Databricks in case it behaves differently there. Any pointers would be appreciated (aside from null ones).
Replies: 1 comment 1 reply
FYI, I did a bit more testing with a newer version. As of now I am no longer getting the error after upgrading to Microsoft.Spark.Worker.1.1.1 and the new version of nuget (microsoft-spark-3-0_2.12-1.1.1.jar).
One thing I'm still not clear on is how to choose when the same underlying feature (e.g. FlatMapGroupsInPandas) can be used in two separate ways. Should I focus my efforts on the fxdataframe variety of the UDF or the RecordBatch variety? Which is likely to be more reliable and less prone to breaking changes in future versions of .NET UDFs?
It would help to have some guidelines for picking which path to take.
Given that this is the only version I can get to work, is using 1.1.1 in production a reasonable thing to do? Or should I find another way to accomplish my groupings (perhaps by collecting to the driver)?