[SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples #14830

gsemet · 2016-08-26T11:55:26Z

This is a set of files that has been formatted by the script defined in #14567.

Not all files are formatted, only the documentation examples, for information sake.

This Pull Request can be merged alone, but it makes more sens to merge it once #14567 is accepted and merged (comes on top of it)

SparkQA · 2016-08-26T12:08:32Z

Test build #64473 has finished for PR 14830 at commit 2e28dd6.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-08-26T12:18:13Z

Test build #64471 has finished for PR 14830 at commit 493ae5c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-08-26T12:51:57Z

It's a lot of change but I tend to favor biting the bullet and standardizing, especially if we can enforce it going forward. Thoughts @davies @holdenk (time permitting) @MLnick

holdenk · 2016-08-27T07:53:39Z

examples/src/main/python/ml/binarizer_example.py

 # $example on$
 from pyspark.ml.feature import Binarizer
+from pyspark.sql import SparkSession


So we might want to move the $example off$ tag/comment up above this so that we keep the example text the same.

ok. what does this tag do ?

Some of the examples files are used in generating the website documentation, and the "example on" and "example off" tags are used to determine which parts get pulled in to the website (in this case this is done since we don't want to have the same boiler plate imports for each example - rather showing the ones specific to that). You can take a look at ./docs/ml-features.md which includes this file to see how its used in markdown and the generated website documentation at http://spark.apache.org/docs/latest/ml-features.html#binarizer .

The instructions for building the docs locally are located at ./docs/README.md - let me know if you need any help with that - the documentation build is sometimes a bit overlooked since many of the developers don't build it manually often.

yes I see, makes perfectly sense !

So we probably want to fix that here and in other places.

holdenk · 2016-08-27T08:02:02Z

Thank for taking the time to do this @stibbons I think its great progress. Doing a quick skim it seems like there are a number of places where the import reordering may have inadvertanly changed what the users will see in the examples we have in the documentation - which is probably not what was intended.

I've left line comments in some of the places where I noticed them but there are probably quite a few others since it was just a quick first skim.

I'd suggest doing a quick audit yourself and then consider building the documentation to verify that it hasn't changed in any unintended ways by your change.

Once again thanks for taking on this task! :)

gsemet · 2016-08-27T09:36:29Z

yes, i will try to understand how it works and make it beautiful. The goal is to move toward an automation of such code housework, but it may take some time. I'll continue to submit part of this code style work next week, so we can see "small" changes like this.

I really like "yapf", a formatting tool from google that almost do the job, better that autopep8. it works a bit aggressively, that why I do not recommend to enforce using it, but it helps identifying and rework most pep8 errors in Python.

gsemet · 2016-08-29T12:34:06Z

examples/src/main/python/ml/aft_survival_regression.py

 # $example on$
 from pyspark.ml.regression import AFTSurvivalRegression
-from pyspark.ml.linalg import Vectors


I actually prefer this line be in the doc

In that case, move the # $example on$ comment up above the from pyspark.ml.linalg import Vectors

gsemet · 2016-08-29T13:27:55Z

Here is a new proposal. I've taken into account your remark, hope all $on/$off things are ok, and added some minor rework with the multiline syntax (I find using \ weird and inelegant, using parenthesis "()" make is more readable, TMHO).

Tell me what you think about this

holdenk · 2016-08-29T13:37:53Z

For what its worth pep8 says:

The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.

So this sounds like keeping in line with the general more pep8ification of the code - but I am a little concerned about just how many files this touches now that it isn't just an autogenerated change*, but I'll try and set aside some time this week to review it (I'm currently ~13 hours off my regular timezone so my review times may be a little erratic).

gsemet · 2016-08-29T13:40:36Z

Cool I wasn't sure of it.

No pbl, I can even split it into several PR

gsemet · 2016-08-29T13:42:22Z

examples/src/main/python/als.py

-        .builder\
-        .appName("PythonALS")\
-        .getOrCreate()
+    spark = (SparkSession


I have not changed all this initilization lines, since they do not appear most of the time in the documentation

+ manual editing (replace '\' by parenthesis for multiline syntax)

SparkQA · 2017-01-09T12:09:00Z

Test build #71079 has finished for PR 14830 at commit 78b66d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

It seems I let this slip my radar (sorry). Some minor comments, but if you're ok with updating this to master I can now merge Python PRs and it would be nice to have our examples cleaned up in this way. Sorry @stibbons for the delay.

holdenk · 2016-12-28T05:22:26Z

examples/src/main/python/mllib/bisecting_k_means_example.py

 from pyspark import SparkContext
 # $example on$
 from pyspark.mllib.clustering import BisectingKMeans, BisectingKMeansModel
 # $example off$
+#


Whats this for?

holdenk · 2016-12-28T05:26:59Z

examples/src/main/python/logistic_regression.py

@@ -29,7 +29,6 @@
 import numpy as np
 from pyspark.sql import SparkSession



Why did you remove the double newlines after the end of the imports?

gsemet

Fixed your remarks. The extra line has been emptied (no need for the '#'). It is the pep8 recomendation to have 2 empty lines after imports.

I have fixed the other remark as well

thanks

SparkQA · 2017-02-14T08:03:51Z

Test build #72861 has finished for PR 14830 at commit 31cea6d.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-02-14T08:31:24Z

Test build #72862 has finished for PR 14830 at commit 582c822.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2017-02-14T17:55:22Z

Great, thanks for updating this :) Would be good to see if @HyukjinKwon has anything to say otherwise I'll do another pass through this tomorrow and hopefully its really close :)

HyukjinKwon · 2017-02-14T22:53:22Z

Thank you for cc'ing me @holdenk. Let me try to take a look within tomorrow too at my best.

HyukjinKwon

I left several comments. In general, I think we should minimise the changes as possible as we can. Could we check if they really are recommended changes all (at least the ones I commented)?

I know it sounds a bit demanding but I a bit suspect some changes are not really explicitly required/recommended and some removed lines are not explicitly discouraged. I worry if it is worth sweeping all.

HyukjinKwon · 2017-02-15T10:23:40Z

examples/src/main/python/ml/cross_validator.py

-        (11, "hadoop software", 0.0)
-    ], ["id", "text", "label"])
+    training = spark.createDataFrame(
+        [


It'd great if we have some references or quotes.

HyukjinKwon · 2017-02-15T11:23:36Z

examples/src/main/python/ml/count_vectorizer_example.py

+        [
+            (0, "a b c".split(" ")),
+            (1, "a b b c a".split(" "))
+        ],


Could you double check if it really does not follow pep8? I have seen the removed syntax more often (e.g., numpy).

Indeed, this is a recommendation not an obligation. I see it to be more looking like Scala multi-line code, and I prefer it. It is a personal opinion, and I don't think there is a pylint/pep8 check to prevent using .

HyukjinKwon · 2017-02-15T11:27:20Z

examples/src/main/python/ml/decision_tree_classification_example.py

@@ -65,8 +67,9 @@
    predictions.select("prediction", "indexedLabel", "features").show(5)

    # Select (prediction, true label) and compute test error
-    evaluator = MulticlassClassificationEvaluator(
-        labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")


Hm.. dose pep8 has a different argument location rule for class and function? It seems this one is already fine and seems inconsistent with https://github.com/apache/spark/pull/14830/files#diff-82fe155d22aaaf433e949193d262c736R43

pep8 tool does this automatically if line is > 100 char. There is indeed no preference between this format and:

evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")

I would say both are equivalent. I tend to prefere this one (the latter)

HyukjinKwon · 2017-02-15T11:30:50Z

examples/src/main/python/streaming/network_wordjoinsentiments.py

-        .transform(lambda rdd: rdd.sortByKey(False))
+    happiest_words = (word_counts
+                      .map(lambda word_tuples: (word_tuples[0],
+                                                float(word_tuples[1][0]) * word_tuples[1][1]))


(Personally, I think it is not more readable..)

I agree, if you prefer I can change all at once. But like I said, I don't know any autoformat that does it automatically

HyukjinKwon · 2017-02-15T11:39:08Z

examples/src/main/python/mllib/streaming_linear_regression_example.py

-from pyspark.mllib.regression import LabeledPoint
-from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
+from pyspark.mllib.regression import (LabeledPoint,
+                                      StreamingLinearRegressionWithSGD)


This does not exceed 100 line length? Up to my knowledge Spark limits it 100 (not default 80).

I actually prefer having a single import per line (this simplifies a lot file management, multi branch merges,...). I can revert this change

HyukjinKwon · 2017-02-15T11:42:14Z

examples/src/main/python/mllib/naive_bayes_example.py


 from pyspark import SparkContext
 # $example on$
 from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
 from pyspark.mllib.util import MLUtils
-
-


Could I ask to check if the example rendered in doc still complies pep8?

If you happen to be not able to build the python doc, I will check tomorrow to help.

yes, because the 2 empty lines are after

# $example off$

HyukjinKwon · 2017-02-15T12:06:43Z

@stibbons are there maybe some options in autopep8 to minimise the changes? (just in case I believe we ignore some rules such as E402,E731,E241,W503 and E226 in Spark).

gsemet · 2017-02-15T12:15:00Z

Hello. This is actually the execution of the pylint/autopep8 config proposed in #14963. I can minimize a little bit more this PR by ignoring indeed more rules.

HyukjinKwon · 2017-02-15T12:18:12Z

Thanks @stibbons. FWIW, I won't stay against but am just neutral. Let me please defer to @holdenk and @srowen.

holdenk · 2017-02-24T22:24:15Z

lets do a jenkins re-run just to make sure everything is up to date and I'll try and get a final pass done soon. I think it would be good to improve our examples to be closer to pep8 style for the sake of readability for people coming from different Python code bases trying to learn PySpark.

holdenk · 2017-02-24T22:24:22Z

Jenkins retest this please.

SparkQA · 2017-02-24T22:38:54Z

Test build #73445 has finished for PR 14830 at commit 582c822.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2017-04-09T11:12:21Z

Jenkins retest this please.

SparkQA · 2017-04-09T11:28:21Z

Test build #75632 has finished for PR 14830 at commit 582c822.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gsemet · 2017-04-09T11:31:55Z

I guess a rebased will be welcomed, I can do it by tomorow if you want

holdenk · 2017-04-11T19:15:40Z

Sure, if you have a chance to rebase & check if any other changes are needed that would be useful.

ueshin · 2017-06-26T23:44:12Z

Hi, are you still working on this?

holdenk · 2017-07-02T02:14:23Z

Gentle follow up ping. I've got some bandwith next week.

gsemet · 2017-07-02T11:04:01Z

Hello. Sadly I cannot work on this we are in a middle of a big restructuration at work.

gsemet mentioned this pull request Aug 26, 2016

[SPARK-16992][PYSPARK] Python Pep8 formatting and import reorganisation #14567

Closed

gsemet force-pushed the python_import_reorg_plus_exec branch from 7848c92 to 493ae5c Compare August 26, 2016 11:57

gsemet changed the title ~~[SPARK-16992][PYSPARK] [DO NOT MERGE] #14567 execution example~~ [SPARK-16992][PYSPARK] autopep8 on documentation example Aug 26, 2016

gsemet force-pushed the python_import_reorg_plus_exec branch 2 times, most recently from 54c5fdf to 2e28dd6 Compare August 26, 2016 12:05

gsemet changed the title ~~[SPARK-16992][PYSPARK] autopep8 on documentation example~~ [SPARK-16992][PYSPARK] autopep8 on documentation examples Aug 26, 2016

holdenk reviewed Aug 27, 2016
View reviewed changes

gsemet reviewed Aug 29, 2016
View reviewed changes

gsemet force-pushed the python_import_reorg_plus_exec branch 2 times, most recently from 50fc56e to 2635dcb Compare August 29, 2016 13:21

gsemet changed the title ~~[SPARK-16992][PYSPARK] autopep8 on documentation examples~~ [SPARK-16992][PYSPARK] PEP8 on documentation examples Aug 29, 2016

gsemet reviewed Aug 29, 2016
View reviewed changes

execute isort/pep8 on example files

78b66d8

+ manual editing (replace '\' by parenthesis for multiline syntax)

gsemet force-pushed the python_import_reorg_plus_exec branch from ff6aabf to 78b66d8 Compare January 9, 2017 11:44

holdenk reviewed Feb 13, 2017

View reviewed changes

gsemet added 2 commits February 14, 2017 08:53

Undo add extra space

4af5966

Fix with 2 empty lines after import

ef1306e

gsemet commented Feb 14, 2017

View reviewed changes

Merge branch 'master' into python_import_reorg_plus_exec

31cea6d

Fix pep8 empty lines

582c822

HyukjinKwon requested changes Feb 15, 2017

View reviewed changes

HyukjinKwon mentioned this pull request Jul 31, 2017

[INFRA] Close stale PRs #18780

Closed

asfgit closed this in 3a45c7f Aug 5, 2017

		@@ -29,7 +29,6 @@
		import numpy as np
		from pyspark.sql import SparkSession

[SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples #14830

[SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples #14830

Conversation

gsemet commented Aug 26, 2016 • edited Loading

SparkQA commented Aug 26, 2016

SparkQA commented Aug 26, 2016

srowen commented Aug 26, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

holdenk commented Aug 27, 2016

gsemet commented Aug 27, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsemet commented Aug 29, 2016

holdenk commented Aug 29, 2016

gsemet commented Aug 29, 2016

Choose a reason for hiding this comment

SparkQA commented Jan 9, 2017

holdenk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsemet left a comment

Choose a reason for hiding this comment

SparkQA commented Feb 14, 2017

SparkQA commented Feb 14, 2017

holdenk commented Feb 14, 2017

HyukjinKwon commented Feb 14, 2017

HyukjinKwon left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsemet Feb 15, 2017 • edited Loading

Choose a reason for hiding this comment

HyukjinKwon Feb 15, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HyukjinKwon Feb 15, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HyukjinKwon commented Feb 15, 2017

gsemet commented Feb 15, 2017

HyukjinKwon commented Feb 15, 2017

holdenk commented Feb 24, 2017

holdenk commented Feb 24, 2017

SparkQA commented Feb 24, 2017

holdenk commented Apr 9, 2017

SparkQA commented Apr 9, 2017

gsemet commented Apr 9, 2017

holdenk commented Apr 11, 2017

ueshin commented Jun 26, 2017

holdenk commented Jul 2, 2017

gsemet commented Jul 2, 2017

gsemet commented Aug 26, 2016 •

edited

Loading

HyukjinKwon left a comment •

edited

Loading

gsemet Feb 15, 2017 •

edited

Loading

HyukjinKwon Feb 15, 2017 •

edited

Loading

HyukjinKwon Feb 15, 2017 •

edited

Loading