Replies: 1 comment
I am finding a few clues to explain the 40-second window when "nothing" seems to be happening. In the Databricks UI there is a "log4j" window that shows logging from the driver. It had very limited detail, and the information that was available didn't span the entire 40-second window, so I assumed it would be difficult to investigate that interval. However, I discovered a link to a file, "log4j-active.log", and it turns out to have different contents than what is shown in the UI (confusingly enough). It has quite a bit more detail to explain the 40-second window when "nothing" is happening.

In general, there is a lot of setup work happening during this interval. I thought this would have been done previously, in the "instance pool", but apparently that was not the case; this setup work must depend on the specific configuration of my job cluster. Much of it is proportional to the number of Java assemblies that are loaded, and some of it is related to the .NET side (e.g. unzipping the driver code). Most of this setup work seems reasonable, but I noticed that a few seconds are spent preparing to use R (RDriverLocal). That seems like something that should be opt-in. Can someone tell me how to disable it? I haven't been able to Google this very well, probably because the letter R is ignored in my searches.
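One thing I may try next, though I have no idea yet whether it actually skips the RDriverLocal setup: Databricks documents a cluster Spark conf, spark.databricks.repl.allowedLanguages, for restricting which languages a cluster exposes. In the cluster spec that would look something like the following (the value shown is only a guess at a combination that excludes R):

```json
{
  "spark_conf": {
    "spark.databricks.repl.allowedLanguages": "python,sql"
  }
}
```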
I'm using Azure Databricks to host a basic .NET for Spark application. I use their "cluster pool" feature to try to improve startup performance; it keeps VMs that are already warmed up and ready for use.
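For anyone not familiar with .NET for Spark, a minimal application looks roughly like the sketch below; the app name and input path are placeholders, not my actual code.

```csharp
using Microsoft.Spark.Sql;

namespace BasicSparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Attach to (or create) the Spark session provided by the cluster.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("basic-dotnet-spark")
                .GetOrCreate();

            // A trivial read-and-filter, just enough to trigger a Spark job.
            DataFrame people = spark.Read().Json("/tmp/people.json"); // placeholder path
            people.Filter(people["age"] > 21).Show();

            spark.Stop();
        }
    }
}
```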
Executing Spark applications is never snappy by any means, but Databricks claims their platform runs more efficiently than most.
I can't understand what is happening in the forty seconds between when my executors are added and when the first job is started.
I will try to dig into this more deeply today, but I was wondering if anyone already knows off the top of their head. Is this specific to our configuration, or is there some design principle that would explain why Databricks should take 40 seconds to start running my code in a job cluster (one that is created from a pool and already has executors)?
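For concreteness, the job cluster in question is defined with a spec along these lines, pointing at the pool (field names are from the Databricks Clusters API; the pool ID and worker count below are placeholders rather than my real values):

```json
{
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "instance_pool_id": "<my-pool-id>",
    "num_workers": 2
  }
}
```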
My on-prem stand-alone cluster doesn't behave this way. The main differences are that Azure Databricks is running in Azure, is running the 7.3 LTS runtime (rather than open-source Apache Spark 3.0.0), and its VMs are all hosted on a VNet.
Please let me know if anyone has ideas; otherwise I will keep digging and open a support ticket.
Thanks for letting me vent. Here is the code that is submitted in this example.
Thanks, David