[SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on Pyspark examples #14830
Conversation
Force-pushed from 7848c92 to 493ae5c
Force-pushed from 54c5fdf to 2e28dd6
Test build #64473 has finished for PR 14830 at commit
Test build #64471 has finished for PR 14830 at commit
# $example on$
from pyspark.ml.feature import Binarizer
from pyspark.sql import SparkSession
So we might want to move the
OK. What does this tag do?
Some of the example files are used to generate the website documentation, and the "example on" and "example off" tags determine which parts get pulled into the website. In this case that's done because we don't want to repeat the same boilerplate imports for each example; we show only the ones specific to it. You can take a look at ./docs/ml-features.md, which includes this file, to see how it's used in Markdown, and at the generated website documentation at http://spark.apache.org/docs/latest/ml-features.html#binarizer .
The instructions for building the docs locally are at ./docs/README.md
- let me know if you need any help with that; the documentation build is sometimes a bit overlooked, since many of the developers don't build it manually very often.
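The tag mechanism described above can be sketched in a few lines of plain Python. This is only an illustration of the idea (the real docs build uses a Jekyll include plugin, not this code); the `extract_example` helper and the sample text are made up for the sketch:

```python
def extract_example(source):
    """Keep only the lines between the $example on$ and $example off$
    markers, the way the docs build pulls snippets into the website."""
    keep, inside = [], False
    for line in source.splitlines():
        if "$example on$" in line:
            inside = True
        elif "$example off$" in line:
            inside = False
        elif inside:
            keep.append(line)
    return "\n".join(keep)


# Boilerplate imports outside the markers are dropped from the rendered doc.
sample = """\
from pyspark.sql import SparkSession
# $example on$
from pyspark.ml.feature import Binarizer
# $example off$
"""
print(extract_example(sample))
```

This is why reordering imports across the markers changes what readers see on the website, even though the script still runs the same.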
Yes, I see; that makes perfect sense!
So we probably want to fix that here and in other places.
Thanks for taking the time to do this @stibbons; I think it's great progress. From a quick skim, it seems there are a number of places where the import reordering may have inadvertently changed what users will see in the examples in our documentation, which is probably not what was intended. I've left line comments in some of the places I noticed, but there are probably quite a few others, since this was just a quick first skim. I'd suggest doing a quick audit yourself, and then building the documentation to verify that your change hasn't altered it in any unintended ways. Once again, thanks for taking on this task! :)
Yes, I will try to understand how it works and make it beautiful. The goal is to move toward automating this kind of code housekeeping, but it may take some time. I'll continue to submit parts of this code-style work next week, so we can see "small" changes like this one. I really like "yapf", a formatting tool from Google that almost does the whole job, better than autopep8. It works a bit aggressively, which is why I do not recommend enforcing it, but it helps identify and rework most PEP 8 errors in Python.
# $example on$
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors
I actually prefer that this line be in the doc.
In that case, move the # $example on$ comment up above the from pyspark.ml.linalg import Vectors line.
Force-pushed from 50fc56e to 2635dcb
Here is a new proposal. I've taken your remark into account; I hope all the $on/$off markers are OK now. I also added some minor rework of the multi-line syntax (I find using '\' weird and inelegant; using parentheses "()" makes it more readable, IMHO). Tell me what you think about this.
For what it's worth, pep8 says:
So this sounds in line with the general pep8-ification of the code, but I am a little concerned about just how many files this touches now that it isn't just an autogenerated change. I'll try to set aside some time this week to review it (I'm currently ~13 hours off my regular timezone, so my review times may be a little erratic).
Cool, I wasn't sure about that. No problem, I can even split it into several PRs.
.builder\
    .appName("PythonALS")\
    .getOrCreate()
spark = (SparkSession
I have not changed all these initialization lines, since most of the time they do not appear in the documentation.
+ manual editing (replace '\' with parentheses for the multi-line syntax)
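The difference between the two continuation styles under discussion can be shown with plain Python; a toy string chain stands in here for the SparkSession builder, so the snippet is self-contained:

```python
# Backslash continuation, as the examples were originally written:
upper_backslash = "spark"\
    .upper()\
    .strip()

# Implicit continuation inside parentheses, as this PR proposes;
# PEP 8 prefers this over backslashes for wrapping long lines.
upper_parens = ("spark"
                .upper()
                .strip())

print(upper_backslash, upper_parens)
```

Both forms are equivalent at runtime; the change is purely stylistic.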
Force-pushed from ff6aabf to 78b66d8
Test build #71079 has finished for PR 14830 at commit
It seems I let this slip off my radar (sorry). Some minor comments, but if you're OK with updating this to master, I can now merge Python PRs, and it would be nice to have our examples cleaned up in this way. Sorry @stibbons for the delay.
from pyspark import SparkContext
# $example on$
from pyspark.mllib.clustering import BisectingKMeans, BisectingKMeansModel
# $example off$
#
What's this for?
@@ -29,7 +29,6 @@
import numpy as np
from pyspark.sql import SparkSession
Why did you remove the double newlines after the end of the imports?
Fixed your remarks. The extra line has been emptied (no need for the '#'). The PEP 8 recommendation is to have two empty lines after the imports.
I have fixed the other remark as well.
Thanks!
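The two-empty-line recommendation mentioned above (PEP 8 rule E305) looks like this in practice; `circle_area` is just a stand-in definition for the sketch:

```python
import math


# PEP 8 expects two blank lines between the import block (or the
# "# $example off$" marker that follows it) and the first
# top-level definition or statement.
def circle_area(radius):
    return math.pi * radius * radius


print(circle_area(2.0))
```

pycodestyle flags a violation of this rule as E305 ("expected 2 blank lines after class or function definition" family), which is why autopep8 adds the blank lines automatically.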
Test build #72861 has finished for PR 14830 at commit
Test build #72862 has finished for PR 14830 at commit
Great, thanks for updating this :) It would be good to see if @HyukjinKwon has anything to say; otherwise I'll do another pass through this tomorrow, and hopefully it's really close :)
Thank you for cc'ing me @holdenk. Let me try to take a look by tomorrow too, as best I can.
I left several comments. In general, I think we should minimise the changes as much as we can. Could we check whether these really are all recommended changes (at least the ones I commented on)?
I know it sounds a bit demanding, but I somewhat suspect that some changes are not really explicitly required/recommended, and that some removed lines are not explicitly discouraged. I worry whether it is worth sweeping them all.
(11, "hadoop software", 0.0)
], ["id", "text", "label"])
training = spark.createDataFrame(
    [
It'd be great if we had some references or quotes.
[
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
],
Could you double-check whether this really does not follow PEP 8? I have seen the removed syntax more often (e.g., in numpy).
Indeed, this is a recommendation, not an obligation. I see it as looking more like Scala multi-line code, and I prefer it. It is a personal opinion, and I don't think there is a pylint/pep8 check to prevent using it.
@@ -65,8 +67,9 @@
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
Hm, does pep8 have a different argument-location rule for classes and functions? It seems this one is already fine, and it seems inconsistent with https://github.com/apache/spark/pull/14830/files#diff-82fe155d22aaaf433e949193d262c736R43
The pep8 tool does this automatically if a line is > 100 characters. There is indeed no preference between that format and:
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
predictionCol="prediction",
metricName="accuracy")
I would say both are equivalent. I tend to prefer this one (the latter).
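The two wrapping styles being compared can be shown side by side. This is only an illustration; a dummy function stands in for the MulticlassClassificationEvaluator constructor, and its return value is made up for the sketch:

```python
# Stand-in for a constructor such as MulticlassClassificationEvaluator;
# the body is illustrative only.
def evaluator(labelCol, predictionCol, metricName):
    return (labelCol, predictionCol, metricName)


# Style the pep8 tool produces when a call exceeds the line limit:
# break after the opening parenthesis, indent the arguments once.
a = evaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")

# Equivalent hanging-indent style, aligned with the opening parenthesis:
b = evaluator(labelCol="indexedLabel",
              predictionCol="prediction",
              metricName="accuracy")

print(a == b)
```

PEP 8 allows both; the choice is a matter of project convention.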
.transform(lambda rdd: rdd.sortByKey(False))
happiest_words = (word_counts
                  .map(lambda word_tuples: (word_tuples[0],
                                            float(word_tuples[1][0]) * word_tuples[1][1])))
(Personally, I think it is not more readable.)
I agree; if you prefer, I can change them all at once. But as I said, I don't know any autoformatter that does this automatically.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
from pyspark.mllib.regression import (LabeledPoint,
                                      StreamingLinearRegressionWithSGD)
Doesn't this stay under the 100-character line length? To my knowledge, Spark limits lines to 100 characters (not the default 80).
I actually prefer having a single import per line (it greatly simplifies file management, multi-branch merges, and so on). I can revert this change.
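The two import styles under discussion, sketched with a stdlib module instead of pyspark so the snippet is self-contained:

```python
# One import per line: each name gets its own diff line, which keeps
# merges and rebases across branches simpler.
from os.path import basename
from os.path import dirname

# Parenthesized multi-import: the PEP 8-compliant way to wrap a single
# import statement that would exceed the project's 100-character limit.
from os.path import (basename,
                     dirname)

print(basename("/tmp/example.py"))
```

Both forms bind the same names; the trade-off is diff granularity versus compactness.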
from pyspark import SparkContext
# $example on$
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.util import MLUtils
Could I ask you to check whether the example rendered in the docs still complies with PEP 8?
If you happen to be unable to build the Python docs, I will check tomorrow to help.
Yes, because the 2 empty lines come after # $example off$
@stibbons are there maybe some options in autopep8 to minimise the changes? (Just in case: I believe we ignore some rules, such as E402, E731, E241, W503 and E226, in Spark.)
Hello. This is actually the result of running the pylint/autopep8 config proposed in #14963. I can indeed minimise this PR a little more by ignoring more rules.
Let's do a Jenkins re-run just to make sure everything is up to date, and I'll try to get a final pass done soon. I think it would be good to bring our examples closer to PEP 8 style, for the sake of readability for people coming from other Python code bases who are trying to learn PySpark.
Jenkins retest this please.
Test build #73445 has finished for PR 14830 at commit
Jenkins retest this please.
Test build #75632 has finished for PR 14830 at commit
I guess a rebase would be welcome; I can do it by tomorrow if you want.
Sure, if you have a chance to rebase and check whether any other changes are needed, that would be useful.
Hi, are you still working on this?
Gentle follow-up ping. I've got some bandwidth next week.
Hello. Sadly, I cannot work on this; we are in the middle of a big restructuring at work.
This is a set of files that has been formatted by the script defined in #14567.
Not all files are formatted, only the documentation examples, for information's sake.
This pull request can be merged alone, but it makes more sense to merge it once #14567 is accepted and merged (it comes on top of it).