-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character #22782
Conversation
dev/run-tests.py
Outdated
@@ -551,7 +551,8 @@ def main(): | |||
if not changed_files or any(f.endswith(".scala") | |||
or f.endswith("scalastyle-config.xml") | |||
for f in changed_files): | |||
run_scala_style_checks() | |||
# run_scala_style_checks() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change necessary? Or just tentative workaround?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tentative workaround. yup yup. I will revert all back once the tests pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it
python/pyspark/__init__.py
Outdated
@@ -16,7 +16,7 @@ | |||
# | |||
|
|||
""" | |||
PySpark is the Python API for Spark. | |||
PySpark is the Python API for Spark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intentional change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup. I will revert this one too. just intended to test Python tests only since Scala tests takes long.
Thank you for this hot fix. I found
The build error occurs due to
|
LGTM after reverting workaround, pending Jenkins |
I will visit here tomorrow morning in Japan. |
@@ -79,7 +79,7 @@ function build { | |||
fi | |||
|
|||
# Verify that Spark has actually been built/is a runnable distribution | |||
# i.e. the Spark JARs that the Docker files will place into the image are present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, some text editors insert non-breakable spaces and probably that's why. I dealt with similar problems at md files before. I think we have scalastyle for nonascii but looks not for scripts ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. We need some automatic check for scripts too. Anyway, thank you for fixing this, @HyukjinKwon ! This blocks almost everything PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, this was edited in Xcode on OS X so almost certainly defaulting to UTF-8 encoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, this was edited in Xcode on OS X so almost certainly defaulting to UTF-8 encoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was more because non-breakable space was used instead instead of space. It was in utf-8 but non ascii compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obvious question but why are we still using ASCII encoding for anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the issue itself, It's related to a historical reason for Python. Python 2 supported str
type as bytes like string. It looked a mistake that confuses users about the concept between bytes and string, and then Python 3 introduced str
as unicode strings concepts like other programing languages.
open(...).read()
reads it as str
(which is bytes) in Python 2 but it's read in unicode strings in Python 3 - where we need an implicit conversion between bytes and strings. Looks it had to be to minimise the breaking changes in users codes.
So, bytes to string conversion happened here and unfortunately our Jenkins's system default encoding is set to ascii (even though arguably UTF-8 is common).
For non-ascii itself, please see the justification at http://www.scalastyle.org/rules-dev.html in ScalaStyle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Therefore, using non-breakable spaces in the codes is obviously not a good practice. Please avoid next time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't as if I used a non-breaking space intentionally, I just used my OSes default editor for Shell scripts!
pip packaging tests got passed. Let me merge this one since it blocks almost every PR. |
Merged to master. |
Test build #97658 has finished for PR 22782 at commit
|
Test build #97656 has finished for PR 22782 at commit
|
Test build #97657 has finished for PR 22782 at commit
|
Test build #97659 has finished for PR 22782 at commit
|
…acter ## What changes were proposed in this pull request? PIP installation requires to package bin scripts together. https://github.com/apache/spark/blob/master/python/setup.py#L71 The recent fix introduced non-ascii compatible (non-breackable space I guess) at apache@ec96d34 fix. This is usually not the problem but looks Jenkins's default encoding is `ascii` and during copying the script, there looks implicit conversion between bytes and strings - where the default encoding is used https://github.com/pypa/setuptools/blob/v40.4.3/setuptools/command/develop.py#L185-L189 ## How was this patch tested? Jenkins Closes apache#22782 from HyukjinKwon/pip-failure-fix. Authored-by: hyukjinkwon <[email protected]> Signed-off-by: hyukjinkwon <[email protected]>
…tion ### What changes were proposed in this pull request? This PR replaces the non-ASCII characters to ASCII characters when possible in PySpark documentation ### Why are the changes needed? To avoid unnecessarily using other non-ASCII characters which could lead to the issue such as #32047 or #22782 ### Does this PR introduce _any_ user-facing change? Virtually no. ### How was this patch tested? Found via (Mac OS): ```bash # In Spark root directory cd python pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .` ``` Closes #32048 from HyukjinKwon/minor-fix. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: Max Gekk <[email protected]>
What changes were proposed in this pull request?
PIP installation requires to package bin scripts together.
https://github.com/apache/spark/blob/master/python/setup.py#L71
The recent fix introduced non-ascii compatible (non-breackable space I guess) at ec96d34 fix.
This is usually not the problem but looks Jenkins's default encoding is
ascii
and during copying the script, there looks implicit conversion between bytes and strings - where the default encoding is usedhttps://github.com/pypa/setuptools/blob/v40.4.3/setuptools/command/develop.py#L185-L189
How was this patch tested?
Jenkins