-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation #14567
Conversation
Although there has generally been some resistance to large style-only changes, we do enforce import order in Scala/Java including checks. So it seems pretty reasonable to do the same in one big go for Python. |
from pyspark.rdd import RDD | ||
from pyspark.rdd import _load_from_socket | ||
from pyspark.rdd import ignore_unicode_prefix | ||
from pyspark.serializers import AutoBatchedSerializer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expanding these multiple imports seems counter-productive. We don't do it in Scala (and in Java you can only import one thing or everything). Is this important/canonical for PEP8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed this is a deviation from Pep recommendation. I encourage this behavior since it simplifies a lot file maintainance over its lifetime, since once this "isort" stuff is setup, developers does not have any more latitude in the placement of the "import" statements.
On our Buildbot based project, we use to have lot of conflict involving changes on import statement: on 2 differents branches (say: prod and main), we often had to import either the same import (so merge might add this line twice when the two developers have placed them in two different places) or not so easy to solve conflict (when two developers add theo import two different object from the same module).
Once we setup the "one import per line" rule, no more conflict on this lines, ever. This helped us a lot automatizing an auto merger from the release branches to the "master" branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we get that in the Scala code for sure, although you get conflicts even if you change merely adjacent lines anyway, so it only saves so much conflict. It's a decent open question, I wonder what others think? the downside is the extra duplication of import boilerplate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can see in https://pypi.python.org/pypi/isort, there a several way to format this multiple import statements. At worst, at least I recommende to enforce the sort of this multi import lines so there is no ambiguity where to place any "import" (and isort will correct the change from the developer)
84320dd
to
5590796
Compare
Rebased, sorry I had to force push this PR. |
5590796
to
9d52994
Compare
Can you create a JIRA ticket for this? This is too large to go in without a JIRA ticket. |
BTW this is actually a non-trivial change and would require very careful look, since Python imports are not side effect free. |
@stibbons connect this to your JIRA in the title |
Done. Linked to #14180. Is it possible to test this PR with SparkPullRequestBuilder ? |
9d52994
to
067ab4a
Compare
Jenkins test this please |
Jenkins add to whitelist |
Test build #63533 has finished for PR 14567 at commit
|
Test build #63534 has finished for PR 14567 at commit
|
Test build #63536 has finished for PR 14567 at commit
|
Test build #63544 has finished for PR 14567 at commit
|
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
[pep8] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is another pep8config file at ./dev/toxi.ini - seems like it would be good to have a single file (also unify the ignore lists)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the rename of tox.ini addressing this -- I wasn't sure how
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, tox.ini looks more similar to this pep8rc than isort.cfg, but github doesn't show it that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something but should they be in the same file or are they separate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if they can be merged. Will try it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ignorant of the config here of course, but it seems like we've replaced 1 config file (tox.ini) with 4 in total (pep8rc, isort.cfg, style.yapf.ini, .editorconfig). Maybe it would help to explain what the differences are, because maybe only some of them are really worth dealing with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- tox.ini kind of becames this pep8rc, indeed.
- isort only deals with sorting "import" statement, not checked by pep8, but pylint has some checks on it
- style.yapf.ini is the configuration for yapf, which is a more aggressive version of 'autopep8' made by google for their chrominum projects, to strictly format all their python files. I don't recommend to enforce the usage of it, but if one want to use it on its python file in the Spark project, he can execute it and review the style before commiting. In an ideal wold, this would be a mandatory formatting (on post commit hook for example), but this may take some times before reaching this point
- .editorconfig only deals with tab space, like I said, might be put outside of this PR if you want
I am not sure if yapf can sort imports like isort do
b3a176c
to
ff549b6
Compare
Test build #63689 has finished for PR 14567 at commit
|
ff549b6
to
be39a83
Compare
Test build #63690 has finished for PR 14567 at commit
|
be39a83
to
46e83f2
Compare
Test build #63691 has finished for PR 14567 at commit
|
46e83f2
to
d3eab9f
Compare
Test build #63693 has finished for PR 14567 at commit
|
d3eab9f
to
500b659
Compare
1a2aa35
to
b81c38f
Compare
Rebased |
Test build #65762 has finished for PR 14567 at commit
|
b81c38f
to
a02f634
Compare
Test build #66043 has finished for PR 14567 at commit
|
a02f634
to
4a3cde7
Compare
Test build #66389 has finished for PR 14567 at commit
|
Use a virtualenv for isolation and easy installation. This basically reverts 85a50a6 Might have been a solution to SPARK-9385. Lot of new test disabled. I propose to fix issues in various pull requests (obviously most 'import' order errors should be fixed by my other pull requests such as apache#14830 for documentation examples, which is part of the effort on code style described in apache#14567). Each subsequent pull request will fix one or more error and reenable the according pylint check. List of new disabled checks: - bad-super-call - consider-iterating-dictionary - consider-using-enumerate - eval-used - exec-used - invalid-length-returned - misplaced-comparison-constant - raising-bad-type - redefined-variable-type - trailing-newlines - trailing-whitespace - ungrouped-imports - unnecessary-pass - unneeded-not - wrong-import-order - wrong-import-position Signed-off-by: Gaetan Semet <[email protected]>
Use a virtualenv for isolation and easy installation. This basically reverts 85a50a6 Might have been a solution to SPARK-9385. Lot of new test disabled. I propose to fix issues in various pull requests (obviously most 'import' order errors should be fixed by my other pull requests such as apache#14830 for documentation examples, which is part of the effort on code style described in apache#14567). Each subsequent pull request will fix one or more error and reenable the according pylint check. List of new disabled checks: - bad-super-call - consider-iterating-dictionary - consider-using-enumerate - eval-used - exec-used - invalid-length-returned - misplaced-comparison-constant - raising-bad-type - redefined-variable-type - trailing-newlines - trailing-whitespace - ungrouped-imports - unnecessary-pass - unneeded-not - wrong-import-order - wrong-import-position Signed-off-by: Gaetan Semet <[email protected]>
+ quiet mode for sphinx
4a3cde7
to
1252d08
Compare
Test build #66847 has finished for PR 14567 at commit
|
Use a virtualenv for isolation and easy installation. This basically reverts 85a50a6 Might have been a solution to SPARK-9385. Lot of new test disabled. I propose to fix issues in various pull requests (obviously most 'import' order errors should be fixed by my other pull requests such as apache#14830 for documentation examples, which is part of the effort on code style described in apache#14567). Each subsequent pull request will fix one or more error and reenable the according pylint check. List of new disabled checks: - bad-super-call - consider-iterating-dictionary - consider-using-enumerate - eval-used - exec-used - invalid-length-returned - misplaced-comparison-constant - raising-bad-type - redefined-variable-type - trailing-newlines - trailing-whitespace - ungrouped-imports - unnecessary-pass - unneeded-not - wrong-import-order - wrong-import-position Signed-off-by: Gaetan Semet <[email protected]>
Hi, are you still working on this? |
What changes were proposed in this pull request?
This patch adds a code style validation script named
dev/py-validate.sh
to enforce pep8 recommendations.The execution of this script will format a lot of files. I did not put them into this PR. I plan to execute this smoothly and reformat the code in various Pull Requests to ease reviews.
First one is automatic formatting of the examples in the documentations: See #14830.
Features of the current patch:
python/.editconfig
file (Scala files use 2 space indentation, while Python files uses 4) for compatible editors (almost every editors has a plugin to support .editconfig file)autopep8
to fix basic pep8 mistakesisort
to automatically split "import" statement and organise them into logically linked order (see doc here. The most important thing is that it split import statements that imports more than one object into several lines. This will increase the number of line of the file, but this facilitates a lot file maintainance and file merges if needed.py-validate.sh
script in order to automatise the correction (need isort and autopep8 installed inside a virtualenv that has the right version of the tools, this is documented in the script)You can see similar script in prod in the Buildbot project
How was this patch tested?
Simple tests on my machines has been done (local mode only). There should not have any regression or feature change at all with this pull request.