Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation #14567

Closed
wants to merge 4 commits into from

Conversation

gsemet
Copy link
Contributor

@gsemet gsemet commented Aug 9, 2016

What changes were proposed in this pull request?

This patch adds a code style validation script named dev/py-validate.sh to enforce pep8 recommendations.

The execution of this script will format a lot of files. I did not put them into this PR. I plan to execute this smoothly and reformat the code in various Pull Requests to ease reviews.

First one is automatic formatting of the examples in the documentations: See #14830.

Features of the current patch:

  • add a python/.editconfig file (Scala files use 2 space indentation, while Python files uses 4) for compatible editors (almost every editors has a plugin to support .editconfig file)
  • use autopep8 to fix basic pep8 mistakes
  • use isort to automatically split "import" statement and organise them into logically linked order (see doc here. The most important thing is that it split import statements that imports more than one object into several lines. This will increase the number of line of the file, but this facilitates a lot file maintainance and file merges if needed.
  • add a py-validate.sh script in order to automatise the correction (need isort and autopep8 installed inside a virtualenv that has the right version of the tools, this is documented in the script)

You can see similar script in prod in the Buildbot project

How was this patch tested?

Simple tests on my machines has been done (local mode only). There should not have any regression or feature change at all with this pull request.

@srowen
Copy link
Member

srowen commented Aug 9, 2016

Although there has generally been some resistance to large style-only changes, we do enforce import order in Scala/Java including checks. So it seems pretty reasonable to do the same in one big go for Python.

from pyspark.rdd import RDD
from pyspark.rdd import _load_from_socket
from pyspark.rdd import ignore_unicode_prefix
from pyspark.serializers import AutoBatchedSerializer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanding these multiple imports seems counter-productive. We don't do it in Scala (and in Java you can only import one thing or everything). Is this important/canonical for PEP8?

Copy link
Contributor Author

@gsemet gsemet Aug 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed this is a deviation from Pep recommendation. I encourage this behavior since it simplifies a lot file maintainance over its lifetime, since once this "isort" stuff is setup, developers does not have any more latitude in the placement of the "import" statements.

On our Buildbot based project, we use to have lot of conflict involving changes on import statement: on 2 differents branches (say: prod and main), we often had to import either the same import (so merge might add this line twice when the two developers have placed them in two different places) or not so easy to solve conflict (when two developers add theo import two different object from the same module).
Once we setup the "one import per line" rule, no more conflict on this lines, ever. This helped us a lot automatizing an auto merger from the release branches to the "master" branch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we get that in the Scala code for sure, although you get conflicts even if you change merely adjacent lines anyway, so it only saves so much conflict. It's a decent open question, I wonder what others think? the downside is the extra duplication of import boilerplate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see in https://pypi.python.org/pypi/isort, there a several way to format this multiple import statements. At worst, at least I recommende to enforce the sort of this multi import lines so there is no ambiguity where to place any "import" (and isort will correct the change from the developer)

@gsemet gsemet force-pushed the python_import_reorg branch from 84320dd to 5590796 Compare August 9, 2016 17:03
@gsemet
Copy link
Contributor Author

gsemet commented Aug 9, 2016

Rebased, sorry I had to force push this PR.

@gsemet gsemet force-pushed the python_import_reorg branch from 5590796 to 9d52994 Compare August 9, 2016 17:16
@rxin
Copy link
Contributor

rxin commented Aug 9, 2016

Can you create a JIRA ticket for this? This is too large to go in without a JIRA ticket.

@rxin
Copy link
Contributor

rxin commented Aug 9, 2016

BTW this is actually a non-trivial change and would require very careful look, since Python imports are not side effect free.

@srowen
Copy link
Member

srowen commented Aug 10, 2016

@stibbons connect this to your JIRA in the title
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

@gsemet gsemet changed the title Python import reorg [SPARK-16992] Python Pep8 formatting and import reorganisation Aug 10, 2016
@gsemet
Copy link
Contributor Author

gsemet commented Aug 10, 2016

Done. Linked to #14180. Is it possible to test this PR with SparkPullRequestBuilder ?

@gsemet gsemet changed the title [SPARK-16992] Python Pep8 formatting and import reorganisation [SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation Aug 10, 2016
@gsemet gsemet force-pushed the python_import_reorg branch from 9d52994 to 067ab4a Compare August 10, 2016 11:24
@srowen
Copy link
Member

srowen commented Aug 10, 2016

Jenkins test this please

@srowen
Copy link
Member

srowen commented Aug 10, 2016

Jenkins add to whitelist

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63533 has finished for PR 14567 at commit 067ab4a.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63534 has finished for PR 14567 at commit 6f17382.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63536 has finished for PR 14567 at commit 239344b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 10, 2016

Test build #63544 has finished for PR 14567 at commit b3a176c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# See the License for the specific language governing permissions and
# limitations under the License.

[pep8]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another pep8config file at ./dev/toxi.ini - seems like it would be good to have a single file (also unify the ignore lists)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the rename of tox.ini addressing this -- I wasn't sure how

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, tox.ini looks more similar to this pep8rc than isort.cfg, but github doesn't show it that way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something but should they be in the same file or are they separate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if they can be merged. Will try it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ignorant of the config here of course, but it seems like we've replaced 1 config file (tox.ini) with 4 in total (pep8rc, isort.cfg, style.yapf.ini, .editorconfig). Maybe it would help to explain what the differences are, because maybe only some of them are really worth dealing with.

Copy link
Contributor Author

@gsemet gsemet Aug 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • tox.ini kind of becames this pep8rc, indeed.
  • isort only deals with sorting "import" statement, not checked by pep8, but pylint has some checks on it
  • style.yapf.ini is the configuration for yapf, which is a more aggressive version of 'autopep8' made by google for their chrominum projects, to strictly format all their python files. I don't recommend to enforce the usage of it, but if one want to use it on its python file in the Spark project, he can execute it and review the style before commiting. In an ideal wold, this would be a mandatory formatting (on post commit hook for example), but this may take some times before reaching this point
  • .editorconfig only deals with tab space, like I said, might be put outside of this PR if you want

I am not sure if yapf can sort imports like isort do

@gsemet gsemet force-pushed the python_import_reorg branch from b3a176c to ff549b6 Compare August 12, 2016 14:03
@SparkQA
Copy link

SparkQA commented Aug 12, 2016

Test build #63689 has finished for PR 14567 at commit ff549b6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from ff549b6 to be39a83 Compare August 12, 2016 14:13
@SparkQA
Copy link

SparkQA commented Aug 12, 2016

Test build #63690 has finished for PR 14567 at commit be39a83.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from be39a83 to 46e83f2 Compare August 12, 2016 15:11
@SparkQA
Copy link

SparkQA commented Aug 12, 2016

Test build #63691 has finished for PR 14567 at commit 46e83f2.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from 46e83f2 to d3eab9f Compare August 12, 2016 15:33
@SparkQA
Copy link

SparkQA commented Aug 12, 2016

Test build #63693 has finished for PR 14567 at commit d3eab9f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from d3eab9f to 500b659 Compare August 12, 2016 15:41
@gsemet gsemet force-pushed the python_import_reorg branch from 1a2aa35 to b81c38f Compare September 22, 2016 07:31
@gsemet
Copy link
Contributor Author

gsemet commented Sep 22, 2016

Rebased

@SparkQA
Copy link

SparkQA commented Sep 22, 2016

Test build #65762 has finished for PR 14567 at commit b81c38f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from b81c38f to a02f634 Compare September 28, 2016 16:35
@SparkQA
Copy link

SparkQA commented Sep 28, 2016

Test build #66043 has finished for PR 14567 at commit a02f634.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet gsemet force-pushed the python_import_reorg branch from a02f634 to 4a3cde7 Compare October 5, 2016 12:17
@SparkQA
Copy link

SparkQA commented Oct 5, 2016

Test build #66389 has finished for PR 14567 at commit 4a3cde7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gsemet added a commit to gsemet/spark that referenced this pull request Oct 12, 2016
Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6

Might have been a solution to SPARK-9385.

Lot of new test disabled. I propose to fix issues in various
pull requests (obviously most 'import' order errors should be
fixed by my other pull requests such as apache#14830 for documentation
examples, which is part of the effort on code style described
in apache#14567). Each subsequent pull request will fix one or more
error and reenable the according pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6

Might have been a solution to SPARK-9385.

Lot of new test disabled. I propose to fix issues in various
pull requests (obviously most 'import' order errors should be
fixed by my other pull requests such as apache#14830 for documentation
examples, which is part of the effort on code style described
in apache#14567). Each subsequent pull request will fix one or more
error and reenable the according pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
+ quiet mode for sphinx
@gsemet gsemet force-pushed the python_import_reorg branch from 4a3cde7 to 1252d08 Compare October 12, 2016 21:45
@SparkQA
Copy link

SparkQA commented Oct 13, 2016

Test build #66847 has finished for PR 14567 at commit 1252d08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gsemet added a commit to gsemet/spark that referenced this pull request Dec 22, 2016
Use a virtualenv for isolation and easy installation.

This basically reverts 85a50a6

Might have been a solution to SPARK-9385.

Lot of new test disabled. I propose to fix issues in various
pull requests (obviously most 'import' order errors should be
fixed by my other pull requests such as apache#14830 for documentation
examples, which is part of the effort on code style described
in apache#14567). Each subsequent pull request will fix one or more
error and reenable the according pylint check.

List of new disabled checks:
- bad-super-call
- consider-iterating-dictionary
- consider-using-enumerate
- eval-used
- exec-used
- invalid-length-returned
- misplaced-comparison-constant
- raising-bad-type
- redefined-variable-type
- trailing-newlines
- trailing-whitespace
- ungrouped-imports
- unnecessary-pass
- unneeded-not
- wrong-import-order
- wrong-import-position

Signed-off-by: Gaetan Semet <[email protected]>
@ueshin
Copy link
Member

ueshin commented Jun 26, 2017

Hi, are you still working on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants