I realize this has been asked before in the context of other Spark environments, but I can't quite find the answer for Azure Databricks: how do I configure the environment variable DOTNET_ASSEMBLY_SEARCH_PATHS on Azure Databricks? I'm using a job that is configured with the 7.3 LTS runtime.

I'm following these instructions: I upload the zip and run the job, and everything works up to the point where it tries to run a UDF. Then it fails with a message and stack trace (below) that appears to be related to an inability to find assemblies. The instructions didn't say what to do with the environment variable; they only said to zip everything into /dbfs/spark-dotnet/publish.zip and pass that as an argument to DotnetRunner in microsoft-spark-3-0_2.12-1.0.0.jar.

I'd like to understand where the zip is actually extracted at runtime; maybe that would tell me how to set the environment variable. Ideally it would be extracted somewhere the Microsoft.Spark.Worker can find the contained assemblies. Any help would be greatly appreciated.

Below is the stack. When I google Microsoft.Spark.Utils.UdfSerDe.DeserializeType, System.Collections.Concurrent.ConcurrentDictionary, and NullReferenceException, I find hints that the environment variable DOTNET_ASSEMBLY_SEARCH_PATHS is set incorrectly.
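For reference, here is roughly how I assume the variable would be wired up if the job cluster is created through the Databricks CLI; the spark_env_vars block is the part I'm unsure about. The search path, node type, and DotnetRunner parameters are placeholders from my own setup, not something the instructions spell out:

```bash
# Sketch of a job definition that sets the environment variable on the job cluster via
# spark_env_vars (the same thing the "Environment Variables" box under
# Advanced Options > Spark does in the UI). Paths and parameters are placeholders.
databricks jobs create --json '{
  "name": "spark-dotnet-udf-job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "spark_env_vars": {
      "DOTNET_ASSEMBLY_SEARCH_PATHS": "/dbfs/spark-dotnet/publish"
    }
  },
  "libraries": [
    { "jar": "dbfs:/spark-dotnet/microsoft-spark-3-0_2.12-1.0.0.jar" }
  ],
  "spark_jar_task": {
    "main_class_name": "org.apache.spark.deploy.dotnet.DotnetRunner",
    "parameters": [ "/dbfs/spark-dotnet/publish.zip", "MySparkApp" ]
  }
}'
```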
Another update. I turned on Databricks logging, and the driver's log4j output clearly shows where the unzipping happens.

Based on that, I set DOTNET_ASSEMBLY_SEARCH_PATHS=/databricks/driver/publish, but that doesn't help. The worker logs make it very clear that the DLL isn't found once execution reaches the UDF.

I noticed that the zip is extracted into a folder that explicitly references the "driver" (i.e. /databricks/driver/publish). I'm starting to believe that my UDF assembly should not be in this zip to begin with, or that it also needs to be placed somewhere else. When I upload the DLLs and place them next to the zip, then I can set the environment variable accordingly (a sketch of the working setup is below) and everything works. Perhaps the files in the zip are simply never accessible to the worker nodes. Is that so? There are so many moving parts... I was hoping the zip would hold all my code, whether driver-related or worker-related.
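For anyone who hits the same wall, this is a rough sketch of the layout and setting that worked for me. The exact paths are from my setup (the binaries sit next to publish.zip under dbfs:/spark-dotnet), so treat them as an example rather than the one true location:

```bash
# Copy the published UDF assemblies next to publish.zip on DBFS so every node can reach
# them through the /dbfs FUSE mount (paths here are from my setup, adjust as needed).
databricks fs cp --recursive ./bin/Release/netcoreapp3.1/publish/ dbfs:/spark-dotnet/

# Then point the cluster-wide environment variable at that DBFS path, e.g. via
# Advanced Options > Spark > Environment Variables (or spark_env_vars in the job spec):
#
#   DOTNET_ASSEMBLY_SEARCH_PATHS=/dbfs/spark-dotnet
```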