-
I have a small standalone cluster running on my machine with 2 workers, each allowed up to 8 cores (my machine has 8 cores). I occasionally submit a few applications to the cluster at the same time. They do a minimal amount of work (about 30 seconds' worth), yet I'm amazed by how much they swamp the CPU and RAM of my machine. I suppose worker nodes are more heavyweight than I realized.

The workers are highlighted in yellow below. Those 2 Java worker processes spawn executors (purple lines in the image) and also spawn .NET processes (Microsoft.Spark.Worker.exe, in green). Here is a discussion where I learned that the .NET processes are respawned for each task: #852 ... that seemed problematic on its own. But what is even worse are these Java "executors": taken together, they consume a massive amount of CPU and RAM (purple lines).

So my question is: does this behave approximately the same on Windows as on Linux? Or are a lot of corners cut on Windows, so that everything runs more poorly than it would on Linux? The thing that discourages me about Windows is that Spark seems to treat it as a second-class platform. E.g. see the following: https://issues.apache.org/jira/browse/SPARK-23015
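For context, this is roughly how the standalone pieces get launched; a worker's cores and memory can be capped at launch so the two workers don't both claim the whole machine. This is just a sketch: the master URL, core count, and memory values are illustrative, not what I actually used.

```
REM Sketch (Windows cmd syntax); spark://localhost:7077 and the caps are illustrative.
REM Start the standalone master (listens on 7077, web UI on 8080 by default).
bin\spark-class org.apache.spark.deploy.master.Master

REM Start a worker attached to that master, capped at 4 cores and 4 GB
REM instead of letting it advertise all 8 cores.
bin\spark-class org.apache.spark.deploy.worker.Worker --cores 4 --memory 4g spark://localhost:7077
```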
Is it an accurate assessment that Windows will generally perform worse than Linux, and by a substantial margin (say ~5% or more)? For the sake of making comparisons to production, I would like to use a local standalone cluster while I'm doing my development on Windows in Visual Studio. But maybe I should just go back to running Spark in local mode instead (e.g. --master local[8]). Or maybe I should try to get databricks-connect working... Thoughts?
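For concreteness, by "local mode" I mean submitting the same app without the standalone master/worker processes, roughly like this (a sketch; the jar name and MyApp.dll are placeholders, not my actual files):

```
REM Sketch (Windows cmd syntax); the jar version and MyApp.dll are placeholders.
REM The Spark side runs in a single JVM with 8 threads; no separate master,
REM worker, or executor JVM processes are started.
spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master local[8] ^
  microsoft-spark-<spark_version>_<scala_version>-<package_version>.jar ^
  dotnet MyApp.dll
```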
-
Yes, this is how the interop works (similar to PySpark). How else would you implement it?
The numbers look reasonable to me. You should configure your job so that it fits your machine.
I did a benchmark about two years ago, and Windows was slower (probably 10-15%?) even with Windows Defender turned off.
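By "configure your job" I mean capping what each submitted application may take from the cluster, along these lines (a sketch; the jar name, MyApp.dll, and the numbers are placeholders you would tune for your machine):

```
REM Sketch (Windows cmd syntax); jar name, MyApp.dll, and the numbers are placeholders.
REM spark.cores.max       - total cores this application may claim on the standalone cluster
REM spark.executor.cores  - cores per executor JVM
REM spark.executor.memory - heap per executor JVM
spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master spark://localhost:7077 ^
  --conf spark.cores.max=4 ^
  --conf spark.executor.cores=2 ^
  --conf spark.executor.memory=1g ^
  microsoft-spark-<spark_version>_<scala_version>-<package_version>.jar ^
  dotnet MyApp.dll
```

With a couple of applications submitted at once, caps like these keep the executors from collectively grabbing every core and all the RAM on the box.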