
[SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples #14830

Closed · wants to merge 5 commits

Conversation

@gsemet (Contributor) commented Aug 26, 2016

This is a set of files that have been formatted by the script defined in #14567.

Not all files are formatted, only the documentation examples, for information's sake.

This pull request can be merged on its own, but it makes more sense to merge it once #14567 is accepted and merged (it comes on top of it).

@gsemet force-pushed the python_import_reorg_plus_exec branch from 7848c92 to 493ae5c on August 26, 2016 at 11:57
@gsemet changed the title from "[SPARK-16992][PYSPARK] [DO NOT MERGE] #14567 execution example" to "[SPARK-16992][PYSPARK] autopep8 on documentation example" on Aug 26, 2016
@gsemet force-pushed the python_import_reorg_plus_exec branch 2 times, most recently from 54c5fdf to 2e28dd6, on August 26, 2016 at 12:05
@gsemet changed the title from "[SPARK-16992][PYSPARK] autopep8 on documentation example" to "[SPARK-16992][PYSPARK] autopep8 on documentation examples" on Aug 26, 2016
@SparkQA commented Aug 26, 2016

Test build #64473 has finished for PR 14830 at commit 2e28dd6.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 26, 2016

Test build #64471 has finished for PR 14830 at commit 493ae5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Aug 26, 2016

It's a lot of change, but I tend to favor biting the bullet and standardizing, especially if we can enforce it going forward. Thoughts, @davies @holdenk (time permitting) @MLnick?

# $example on$
from pyspark.ml.feature import Binarizer
from pyspark.sql import SparkSession
Contributor:

So we might want to move the $example off$ tag/comment up above this so that we keep the example text the same.

Contributor (Author):

OK. What does this tag do?

Contributor:

Some of the example files are used to generate the website documentation, and the "example on" and "example off" tags determine which parts get pulled into the website (this is done because we don't want to repeat the same boilerplate imports for each example; rather, we show only the parts specific to that example). You can take a look at ./docs/ml-features.md, which includes this file, to see how it's used in the markdown, and at the generated website documentation at http://spark.apache.org/docs/latest/ml-features.html#binarizer .

The instructions for building the docs locally are in ./docs/README.md - let me know if you need any help with that; the documentation build is sometimes a bit overlooked, since many of the developers don't build it manually very often.
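For illustration, a minimal sketch of how these tags typically delimit the example-specific code in a PySpark example file (the Binarizer data and app name below are assumed placeholders, not taken verbatim from this PR):

from pyspark.sql import SparkSession
# $example on$
from pyspark.ml.feature import Binarizer
# $example off$

if __name__ == "__main__":
    spark = SparkSession.builder.appName("BinarizerExample").getOrCreate()

    # $example on$
    # Only the code between the tags is pulled into the generated documentation.
    continuousDataFrame = spark.createDataFrame([
        (0, 0.1),
        (1, 0.8),
        (2, 0.2)
    ], ["id", "feature"])

    binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature")
    binarizer.transform(continuousDataFrame).show()
    # $example off$

    spark.stop()

Keeping the SparkSession import outside the tags, as suggested above, leaves the rendered example text unchanged.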

Contributor (Author):

Yes, I see; that makes perfect sense!

Contributor:

So we probably want to fix that here and in other places.

@holdenk (Contributor) commented Aug 27, 2016

Thanks for taking the time to do this, @stibbons; I think it's great progress. Doing a quick skim, it seems like there are a number of places where the import reordering may have inadvertently changed what users will see in the examples in our documentation, which is probably not what was intended.

I've left line comments in some of the places where I noticed this, but there are probably quite a few others, since it was just a quick first skim.

I'd suggest doing a quick audit yourself and then building the documentation to verify that your change hasn't altered it in any unintended ways.

Once again, thanks for taking on this task! :)

@gsemet (Contributor, Author) commented Aug 27, 2016

Yes, I will try to understand how it works and make it clean. The goal is to move toward automating this kind of code housekeeping, but it may take some time. I'll keep submitting parts of this code-style work next week, so we can review "small" changes like this.

I really like "yapf", a formatting tool from Google that almost does the whole job, and better than autopep8. It works a bit aggressively, which is why I do not recommend enforcing it, but it helps identify and rework most pep8 errors in Python.

# $example on$
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors
Contributor (Author):

I actually prefer this line to be in the doc.

Contributor:

In that case, move the # $example on$ comment up above the from pyspark.ml.linalg import Vectors

@gsemet force-pushed the python_import_reorg_plus_exec branch 2 times, most recently from 50fc56e to 2635dcb, on August 29, 2016 at 13:21
@gsemet (Contributor, Author) commented Aug 29, 2016

Here is a new proposal. I've taken your remarks into account, and I hope all the $on/$off tags are OK. I also added some minor rework of the multiline syntax (I find using \ weird and inelegant; using parentheses "()" makes it more readable, IMHO).

Tell me what you think.

@holdenk (Contributor) commented Aug 29, 2016

For what it's worth, pep8 says:

The preferred way of wrapping long lines is by using Python's implied line continuation inside parentheses, brackets and braces. Long lines can be broken over multiple lines by wrapping expressions in parentheses. These should be used in preference to using a backslash for line continuation.

So this sounds in line with the general pep8ification of the code - but I am a little concerned about just how many files this touches now that it isn't just an autogenerated change. I'll try to set aside some time this week to review it (I'm currently ~13 hours off my regular timezone, so my review times may be a little erratic).
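As a concrete illustration of the two continuation styles that quote contrasts (the values here are arbitrary placeholders):

# Backslash continuation, discouraged by pep8:
total = 1 + 2 + 3 + \
    4 + 5

# Implied continuation inside parentheses, preferred by pep8:
total = (1 + 2 + 3 +
         4 + 5)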

@gsemet (Contributor, Author) commented Aug 29, 2016

Cool, I wasn't sure about that.

No problem, I can even split it into several PRs.

@gsemet changed the title from "[SPARK-16992][PYSPARK] autopep8 on documentation examples" to "[SPARK-16992][PYSPARK] PEP8 on documentation examples" on Aug 29, 2016
-.builder\
-.appName("PythonALS")\
-.getOrCreate()
+spark = (SparkSession
Contributor (Author):

I have not changed all of these initialization lines, since most of the time they do not appear in the documentation.
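For reference, a minimal sketch of the two SparkSession initialization styles involved in this change (the appName comes from the snippet above; the rest is an assumed, self-contained reconstruction rather than the exact file contents):

from pyspark.sql import SparkSession

# Original style: explicit backslash line continuations.
spark = SparkSession\
    .builder\
    .appName("PythonALS")\
    .getOrCreate()

# Reworked style in this PR: implied continuation inside parentheses.
spark = (SparkSession
         .builder
         .appName("PythonALS")
         .getOrCreate())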

+ manual editing
(replace '\' by parenthesis for multiline syntax)
@gsemet force-pushed the python_import_reorg_plus_exec branch from ff6aabf to 78b66d8 on January 9, 2017 at 11:44
@SparkQA commented Jan 9, 2017

Test build #71079 has finished for PR 14830 at commit 78b66d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) left a review comment:

It seems I let this slip off my radar (sorry). Some minor comments, but if you're OK with updating this against master, I can now merge Python PRs, and it would be nice to have our examples cleaned up in this way. Sorry for the delay, @stibbons.

from pyspark import SparkContext
# $example on$
from pyspark.mllib.clustering import BisectingKMeans, BisectingKMeansModel
# $example off$
#
Contributor:

What's this for?

@@ -29,7 +29,6 @@
import numpy as np
from pyspark.sql import SparkSession

Contributor:

Why did you remove the double newlines after the end of the imports?

@gsemet (Contributor, Author) left a review comment:

I've addressed your remarks. The extra line has been emptied (there's no need for the '#'); pep8 recommends having two empty lines after the imports.

I have fixed the other remark as well.

Thanks.

@SparkQA commented Feb 14, 2017

Test build #72861 has finished for PR 14830 at commit 31cea6d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 14, 2017

Test build #72862 has finished for PR 14830 at commit 582c822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Feb 14, 2017

Great, thanks for updating this :) It would be good to see if @HyukjinKwon has anything to say; otherwise I'll do another pass through this tomorrow, and hopefully it's really close :)

@HyukjinKwon (Member) commented:
Thank you for cc'ing me, @holdenk. Let me try my best to take a look by tomorrow as well.

@HyukjinKwon (Member) left a review comment:

I left several comments. In general, I think we should minimise the changes as much as we can. Could we check whether all of them really are recommended changes (at least the ones I commented on)?

I know it sounds a bit demanding, but I somewhat suspect that some changes are not really explicitly required/recommended and that some removed lines are not explicitly discouraged. I am not sure it is worth sweeping everything.

(11, "hadoop software", 0.0)
], ["id", "text", "label"])
training = spark.createDataFrame(
[
Member:

It'd be great if we had some references or quotes.

[
(0, "a b c".split(" ")),
(1, "a b b c a".split(" "))
],
Member:

Could you double-check whether it really does not follow pep8? I have seen the removed syntax more often (e.g., in numpy).

Contributor (Author):

Indeed, this is a recommendation, not an obligation. I find it looks more like Scala multi-line code, and I prefer it. It is a personal opinion, and I don't think there is a pylint/pep8 check that forbids either form.
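For illustration, a minimal sketch of the two list layouts under discussion; the surrounding createDataFrame call and the column names are assumed for context rather than taken from the PR:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LayoutSketch").getOrCreate()

# Layout commonly seen elsewhere (e.g., in numpy code): the data list opens on the call line.
df_a = spark.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

# Layout produced by the reformatting in this PR: each bracket level sits on its own line.
df_b = spark.createDataFrame(
    [
        (0, "a b c".split(" ")),
        (1, "a b b c a".split(" "))
    ],
    ["id", "words"])

spark.stop()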

@@ -65,8 +67,9 @@
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
Member:

Hm... does pep8 have a different argument-location rule for classes and functions? It seems this one is already fine, and it seems inconsistent with https://github.com/apache/spark/pull/14830/files#diff-82fe155d22aaaf433e949193d262c736R43

@gsemet (Contributor, Author) commented Feb 15, 2017:

The pep8 tool does this automatically if a line is > 100 characters. There is indeed no preference between this format and:

evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
                                              predictionCol="prediction",
                                              metricName="accuracy")

I would say both are equivalent. I tend to prefer this one (the latter).
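For comparison, a minimal sketch of the two equivalent wrappings discussed here (both keep every line under the 100-character limit):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Wrapping produced by the autopep8 run: break right after the opening parenthesis,
# then indent the argument list.
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")

# Equivalent wrapping: align each argument with the opening parenthesis.
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
                                              predictionCol="prediction",
                                              metricName="accuracy")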

.transform(lambda rdd: rdd.sortByKey(False))
happiest_words = (word_counts
.map(lambda word_tuples: (word_tuples[0],
float(word_tuples[1][0]) * word_tuples[1][1]))
@HyukjinKwon (Member) commented Feb 15, 2017:

(Personally, I think it is not more readable.)

Contributor (Author):

I agree; if you prefer, I can change them all at once. But as I said, I don't know of any autoformatter that does this automatically.

-from pyspark.mllib.regression import LabeledPoint
-from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
+from pyspark.mllib.regression import (LabeledPoint,
+                                      StreamingLinearRegressionWithSGD)
Member:

This does not exceed the 100-character line length, does it? To my knowledge, Spark limits lines to 100 characters (not the default 80).

Contributor (Author):

I actually prefer having a single import per line (it simplifies file management, multi-branch merges, ...). I can revert this change.


from pyspark import SparkContext
# $example on$
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils


@HyukjinKwon (Member) commented Feb 15, 2017:

Could I ask you to check whether the example rendered in the docs still complies with pep8?

Member:

If you happen to be unable to build the Python docs, I will check tomorrow to help.

Contributor (Author):

Yes, because the two empty lines come after:

 # $example off$

@HyukjinKwon (Member) commented:
@stibbons, are there maybe some options in autopep8 to minimise the changes? (Just in case: I believe we ignore some rules such as E402, E731, E241, W503, and E226 in Spark.)

@gsemet (Contributor, Author) commented Feb 15, 2017

Hello. This is actually the result of running the pylint/autopep8 config proposed in #14963. I can indeed minimize this PR a bit more by ignoring additional rules.

@HyukjinKwon (Member) commented:
Thanks @stibbons. FWIW, I won't stand against it; I am just neutral. Let me defer to @holdenk and @srowen.

@holdenk (Contributor) commented Feb 24, 2017

Let's do a Jenkins re-run just to make sure everything is up to date, and I'll try to get a final pass done soon. I think it would be good to move our examples closer to pep8 style for the sake of readability for people coming from different Python code bases who are trying to learn PySpark.

@holdenk (Contributor) commented Feb 24, 2017

Jenkins retest this please.

@SparkQA commented Feb 24, 2017

Test build #73445 has finished for PR 14830 at commit 582c822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Apr 9, 2017

Jenkins retest this please.

@SparkQA commented Apr 9, 2017

Test build #75632 has finished for PR 14830 at commit 582c822.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gsemet (Contributor, Author) commented Apr 9, 2017

I guess a rebase would be welcome; I can do it by tomorrow if you want.

@holdenk (Contributor) commented Apr 11, 2017

Sure, if you have a chance to rebase and check whether any other changes are needed, that would be useful.

@ueshin (Member) commented Jun 26, 2017

Hi, are you still working on this?

@holdenk (Contributor) commented Jul 2, 2017

Gentle follow-up ping. I've got some bandwidth next week.

@gsemet (Contributor, Author) commented Jul 2, 2017

Hello. Sadly, I cannot work on this; we are in the middle of a big restructuring at work.

6 participants