-
I have a small standalone cluster running on my machine with 2 workers, each allowed up to 8 cores (my machine has 8 cores). I occasionally submit a few applications to the cluster at the same time. They do a minimal amount of work (about 30 seconds' worth), yet I'm amazed by how much they swamp the CPU and RAM of my machine. I suppose worker nodes are more heavyweight than I realized.

The workers are highlighted in yellow below. Those 2 Java worker processes spawn executors (purple lines in the image) and also spawn .NET processes (Microsoft.Spark.Worker.exe, in green). Here is a discussion where I learned that the .NET processes are respawned for each task: #852 ... that seemed problematic on its own. But what is even worse are these Java "executors": taken together, they consume a massive amount of CPU and RAM (purple lines).

So my question is: does this behave approximately the same on Windows as on Linux? Or are a lot of corners cut on Windows, so that everything runs more poorly than it would on Linux? The thing that discourages me about Windows is that Spark seems to treat it as a second-class platform. E.g. see the following: https://issues.apache.org/jira/browse/SPARK-23015
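For context, this is roughly how the standalone pieces get launched; a worker's cores and memory can be capped at launch so the two workers don't both claim the whole machine. This is just a sketch: the master URL, core count, and memory values are illustrative, not what I actually used.

```
REM Sketch (Windows cmd syntax); spark://localhost:7077 and the caps are illustrative.
REM Start the standalone master (listens on 7077, web UI on 8080 by default).
bin\spark-class org.apache.spark.deploy.master.Master

REM Start a worker attached to that master, capped at 4 cores and 4 GB
REM instead of letting it advertise all 8 cores.
bin\spark-class org.apache.spark.deploy.worker.Worker --cores 4 --memory 4g spark://localhost:7077
```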
Is it an accurate assessment that Windows will generally perform worse than Linux, and by a substantial margin (say ~5% or more)? For the sake of making comparisons to production, I would like to use a local standalone cluster while I'm doing my development on Windows in Visual Studio. But maybe I should just go back to running Spark in local mode instead (e.g. --master local[8]). Or maybe I should try to get databricks-connect working... Thoughts?
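For concreteness, by "local mode" I mean submitting the same app without the standalone master/worker processes, roughly like this (a sketch; the jar name and MyApp.dll are placeholders, not my actual files):

```
REM Sketch (Windows cmd syntax); the jar version and MyApp.dll are placeholders.
REM The Spark side runs in a single JVM with 8 threads; no separate master,
REM worker, or executor JVM processes are started.
spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master local[8] ^
  microsoft-spark-<spark_version>_<scala_version>-<package_version>.jar ^
  dotnet MyApp.dll
```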
-
Yes, this is how the interop works (similar to PySpark). How else would you implement it?
The numbers look reasonable to me. You should configure your job so that it fits your machine.
I did a benchmark about two years ago, and Windows was slower (probably 10-15%?) even with Windows Defender turned off.
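By "configure your job" I mean capping what each submitted application may take from the cluster, along these lines (a sketch; the jar name, MyApp.dll, and the numbers are placeholders you would tune for your machine):

```
REM Sketch (Windows cmd syntax); jar name, MyApp.dll, and the numbers are placeholders.
REM spark.cores.max       - total cores this application may claim on the standalone cluster
REM spark.executor.cores  - cores per executor JVM
REM spark.executor.memory - heap per executor JVM
spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master spark://localhost:7077 ^
  --conf spark.cores.max=4 ^
  --conf spark.executor.cores=2 ^
  --conf spark.executor.memory=1g ^
  microsoft-spark-<spark_version>_<scala_version>-<package_version>.jar ^
  dotnet MyApp.dll
```

With a couple of applications submitted at once, caps like these keep the executors from collectively grabbing every core and all the RAM on the box.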