I realize this has been asked before in the context of other Spark environments, but I can't quite find the answer for Azure Databricks: how do I configure the environment variable DOTNET_ASSEMBLY_SEARCH_PATHS on Azure Databricks? I'm using a job that is configured with the 7.3 LTS runtime.

I'm following these instructions: I upload the zip and run the job, and everything works up to the point where it tries to run a UDF. Then it fails with a message and stack trace (below) that appears to be related to an inability to find assemblies. The instructions didn't say what to do with the environment variable; they only said to zip everything into /dbfs/spark-dotnet/publish.zip and pass that as an argument to DotnetRunner in microsoft-spark-3-0_2.12-1.0.0.jar.

I'd like to understand where the zip is actually extracted at runtime; maybe that would tell me how to set the environment variable. Ideally it would be extracted somewhere the Microsoft.Spark.Worker can find the contained assemblies. Any help would be greatly appreciated.

Below is the stack. When I google Microsoft.Spark.Utils.UdfSerDe.DeserializeType, System.Collections.Concurrent.ConcurrentDictionary, and NullReferenceException, I find hints that the environment variable DOTNET_ASSEMBLY_SEARCH_PATHS is set incorrectly.
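For reference, here is roughly how I assume the variable would be wired up if the job cluster is created through the Databricks CLI; the spark_env_vars block is the part I'm unsure about. The search path, node type, and DotnetRunner parameters are placeholders from my own setup, not something the instructions spell out:

```bash
# Sketch of a job definition that sets the environment variable on the job cluster via
# spark_env_vars (the same thing the "Environment Variables" box under
# Advanced Options > Spark does in the UI). Paths and parameters are placeholders.
databricks jobs create --json '{
  "name": "spark-dotnet-udf-job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "spark_env_vars": {
      "DOTNET_ASSEMBLY_SEARCH_PATHS": "/dbfs/spark-dotnet/publish"
    }
  },
  "libraries": [
    { "jar": "dbfs:/spark-dotnet/microsoft-spark-3-0_2.12-1.0.0.jar" }
  ],
  "spark_jar_task": {
    "main_class_name": "org.apache.spark.deploy.dotnet.DotnetRunner",
    "parameters": [ "/dbfs/spark-dotnet/publish.zip", "MySparkApp" ]
  }
}'
```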
Another update. I turned on Databricks logging, and the driver's log4j output clearly shows where the unzipping happens.

Based on that, I set DOTNET_ASSEMBLY_SEARCH_PATHS=/databricks/driver/publish, but that doesn't help. The worker logs make it very clear that the DLL isn't found once execution reaches the UDF.

I noticed that the zip is extracted into a folder that explicitly references the "driver" (i.e. /databricks/driver/publish). I'm starting to believe that my UDF assembly should not be in this zip to begin with, or that it also needs to be placed somewhere else. When I upload the DLLs and place them next to the zip, then I can set the environment variable accordingly (a sketch of the working setup is below) and everything works. Perhaps the files in the zip are simply never accessible to the worker nodes. Is that so? There are so many moving parts... I was hoping the zip would hold all my code, whether driver-related or worker-related.
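For anyone who hits the same wall, this is a rough sketch of the layout and setting that worked for me. The exact paths are from my setup (the binaries sit next to publish.zip under dbfs:/spark-dotnet), so treat them as an example rather than the one true location:

```bash
# Copy the published UDF assemblies next to publish.zip on DBFS so every node can reach
# them through the /dbfs FUSE mount (paths here are from my setup, adjust as needed).
databricks fs cp --recursive ./bin/Release/netcoreapp3.1/publish/ dbfs:/spark-dotnet/

# Then point the cluster-wide environment variable at that DBFS path, e.g. via
# Advanced Options > Spark > Environment Variables (or spark_env_vars in the job spec):
#
#   DOTNET_ASSEMBLY_SEARCH_PATHS=/dbfs/spark-dotnet
```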