[SPARK-22959][PYTHON] Configuration to select the modules for daemon and worker in PySpark #20151
Conversation
@holdenk, @rxin, @JoshRosen and @ueshin, as you all might already know, I am working on Python coverage. On top of this PR, I think we can leave the main codes intact while we properly track the coverage within worker processes. I believe this also partly covers SPARK-20368. What do you guys think about this configuration?
Test build #85677 has finished for PR 20151 at commit
generally LGTM
val useDaemon = {
  val useDaemonEnabled = SparkEnv.get.conf.getBoolean("spark.python.use.daemon", true)

  // This flag is ignored on Windows as it's unable to fork.
  !System.getProperty("os.name").startsWith("Windows") && useDaemonEnabled
}

// This configuration indicates the module to run the daemon to execute its Python workers.
val daemonModule = SparkEnv.get.conf.get("spark.python.daemon.module", "pyspark.daemon")
Generally, I thought we use the name "command" for what we call the thing to execute.
Ah, yup, that's true in general. But please let me stick to "module" here since that's what we execute, as `python -m` describes:

python --help
...
-m mod : run library module as a script (terminates option list)
...
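For illustration, a minimal sketch of selecting a different daemon module through this configuration from application code; the module name `my_daemon` is hypothetical and would have to be importable on the worker side:

```python
from pyspark import SparkConf, SparkContext

# A sketch: the configured value is a module name that the daemon launcher
# effectively runs the way `python -m <module>` would. 'my_daemon' is a
# hypothetical module used only for illustration. The setting is ignored on
# Windows and only applies when 'spark.python.use.daemon' is enabled (the default).
conf = (
    SparkConf()
    .set("spark.python.use.daemon", "true")
    .set("spark.python.daemon.module", "my_daemon")  # default: "pyspark.daemon"
)
sc = SparkContext(conf=conf)
```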
Hey @rxin, I think I need your sign-off too as it's related to SPARK-7721.
The changes LGTM.
I manually tested after setting the configuration: `>>> spark.range(1).rdd.map(lambda x: x).collect()`
Looks good. Let's wait for @rxin's response.
Yup, thanks for all the review @felixcheung and @ueshin BTW.
@rxin or @JoshRosen, could you guys take a quick look and see if it makes sense?
So I think this could be the basis for solving a lot of related problems, and I like the minimally invasive approach to it. I think the error message for setting it to a bad module rather than a nonexistent module is probably going to be very confusing. I think it would be good to make it clear that this is an advanced setting we don't expect most users to modify directly.
+1... this is an "undocumented" conf, sooo it's an expert one :)
Yup, will write up some more warnings saying that it's an expert-only, experimental, and rather internal configuration. Also, I will note that we should be super careful. Will update tonight (KST) :).
logInfo(
  s"Python daemon module in PySpark is set to [$value] in 'spark.python.daemon.module', " +
  "using this to start the daemon up. Note that this configuration only has an effect when " +
  "'spark.python.use.daemon' is enabled and the platform is not Windows.")
Just double-checked that it shows the log only when the configuration is explicitly set:
18/01/10 21:23:24 INFO PythonWorkerFactory: Python daemon module in PySpark is set to [pyspark.daemon] in
'spark.python.daemon.module', using this to start the daemon up. Note that this configuration only has an
effect when 'spark.python.use.daemon' is enabled and the platform is not Windows.
Test build #85917 has finished for PR 20151 at commit
Test build #85918 has finished for PR 20151 at commit
// as expert-only option, and shouldn't be used before knowing what it means exactly.

// This configuration indicates the module to run the daemon to execute its Python workers.
val daemonModule = SparkEnv.get.conf.getOption("spark.python.daemon.module").map { value =>
Do we need to restrict the module's package to only allow something like `pyspark.*`?
Hm, actually we could also check, for example, whether it's an empty string. But I wrote "shouldn't be used before knowing what it means exactly." above, so I think it's fine.
LGTM, I think there have been many times that this would have been incredibly useful for me, thanks!
Will merge this one in a few days if there isn't any objection. I believe this doesn't affect the existing code path anyway.
retest this please
Test build #86075 has finished for PR 20151 at commit
Merged to master.
## What changes were proposed in this pull request?

Note that this PR was made based on the top of #20151, so it almost leaves the main codes intact.

This PR proposes to add a script for the preparation of automatic PySpark coverage generation. Now, it's difficult to check the actual coverage in case of PySpark. With this script, it allows to run tests the way we did via the `run-tests` script before. The usage is exactly the same as the `run-tests` script as this basically wraps it.

This script and PR alone should also be useful. I was asked about how to run this before, and it seems some reviewers (including me) need this. It would also be useful to run it manually.

It usually requires a small diff in normal Python projects, but PySpark cases are a bit different because apparently we are unable to track the coverage after it's forked. So, here, I made a custom worker that forces the coverage, based on the top of #20151.

I made a simple demo. Please take a look - https://spark-test.github.io/pyspark-coverage-site.

To show the structure, this PR adds the files as below:

```
python
├── .coveragerc                 # Runtime configuration when we run the script.
├── run-tests-with-coverage     # The script that has coverage support and wraps run-tests script.
└── test_coverage               # Directories that have files required when running coverage.
    ├── conf
    │   └── spark-defaults.conf # Having the configuration 'spark.python.daemon.module'.
    ├── coverage_daemon.py      # A daemon having custom fix and wrapping our daemon.py
    └── sitecustomize.py        # Initiate coverage with COVERAGE_PROCESS_START
```

Note that this PR has a minor nit: [This scope](https://github.com/apache/spark/blob/04e44b37cc04f62fbf9e08c7076349e0a4d12ea8/python/pyspark/daemon.py#L148-L169) in `daemon.py` is not in the coverage results, as basically I am producing the coverage results in `worker.py` separately and then merging them. I believe it's not a big deal.

In a follow-up, I might have a site that has a single up-to-date PySpark coverage from the master branch as the fallback / default, or a site that has multiple PySpark coverages where the site link will be left on each pull request.

## How was this patch tested?

Manually tested. Usage is the same as the existing Python test script - `./python/run-tests`. For example,

```
sh run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```

Running this will generate HTMLs under `./python/test_coverage/htmlcov`.

Console output example:

```
sh run-tests-with-coverage --python-executables=python3,python --modules=pyspark-core
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3', 'python']
Will test the following Python modules: ['pyspark-core']
Starting test(python): pyspark.tests
Starting test(python3): pyspark.tests
...
Tests passed in 231 seconds
Combining collected coverage data under /.../spark/python/test_coverage/coverage_data
Reporting the coverage data at /...spark/python/test_coverage/coverage_data/coverage
Name                    Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------------
pyspark/__init__.py        41      0      8      2    96%
...
pyspark/profiler.py        74     11     22      5    83%
pyspark/rdd.py            871     40    303     32    93%
pyspark/rddsampler.py      68     10     32      2    82%
...
--------------------------------------------------------------
TOTAL                    8521   3077   2748    191    59%
Generating HTML files for PySpark coverage under /.../spark/python/test_coverage/htmlcov
```

Author: hyukjinkwon <[email protected]>

Closes #20204 from HyukjinKwon/python-coverage.
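For context, initiating coverage with `COVERAGE_PROCESS_START` in a `sitecustomize.py` usually boils down to coverage.py's documented subprocess hook; a minimal sketch (the actual file added by this follow-up may differ in detail):

```python
# sitecustomize.py -- imported automatically by Python at interpreter start
# when it is on sys.path. coverage.process_startup() is coverage.py's
# documented hook for measuring subprocesses; it is a no-op unless the
# COVERAGE_PROCESS_START environment variable points at a coverage config file.
import coverage

coverage.process_startup()
```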
## What changes were proposed in this pull request?

We are now forced to use `pyspark/daemon.py` and `pyspark/worker.py` in PySpark. This doesn't allow a custom modification for them (well, maybe we can still do this in a super hacky way, for example, by setting a Python executable that has the custom modification). Because of this, for example, it's sometimes hard to debug what happens inside Python worker processes.

This is actually related to SPARK-7721 too, as somehow Coverage is unable to detect the coverage from `os.fork`. If we have some custom fixes to force the coverage, it works fine.

This is also related to SPARK-20368. That JIRA describes Sentry support, which (roughly) needs some changes on the worker side.

With this configuration, advanced users will be able to do a lot of pluggable workarounds, and we can meet such potential needs in the future.

As an example, let's say I configure the module `coverage_daemon` and have `coverage_daemon.py` in the Python path. Then I can track the coverage on the worker side too (see the sketch below).
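A rough sketch of what such a `coverage_daemon` module could look like, assuming `pyspark.daemon` exposes `worker_main` (the entry point it calls in each forked worker) and `manager()` (its main loop); the real `coverage_daemon.py` added in the follow-up may differ in detail:

```python
# coverage_daemon.py -- a sketch of a custom daemon module for
# 'spark.python.daemon.module'. It reuses pyspark.daemon and only swaps in a
# worker entry point that runs the real one under coverage. Assumes
# pyspark.daemon.worker_main and pyspark.daemon.manager exist as described
# in the lead-in; names starting with _covered_ are illustrative.
import os

import coverage

from pyspark import daemon, worker


def _covered_worker_main(*args, **kwargs):
    # Start coverage around the normal worker entry point and save the data
    # so it can be combined with the driver-side results later.
    cov = coverage.Coverage(config_file=os.environ["COVERAGE_PROCESS_START"])
    cov.start()
    try:
        return worker.main(*args, **kwargs)
    finally:
        cov.stop()
        cov.save()


if "COVERAGE_PROCESS_START" in os.environ:
    daemon.worker_main = _covered_worker_main

if __name__ == "__main__":
    # Otherwise behave exactly like `python -m pyspark.daemon`.
    daemon.manager()
```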
More importantly, we can leave the main code intact but allow some workarounds.
## How was this patch tested?
Manually tested.