
[HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character #22782

Closed
wants to merge 7 commits into apache:master from HyukjinKwon:pip-failure-fix

Conversation

@HyukjinKwon (Member) commented Oct 20, 2018

What changes were proposed in this pull request?

pip installation requires packaging the bin scripts together.

https://github.com/apache/spark/blob/master/python/setup.py#L71

The recent fix (ec96d34) introduced a non-ASCII-compatible character (a non-breaking space, I guess).

This is usually not a problem, but Jenkins's default encoding appears to be ascii, and while the scripts are copied there is an implicit conversion between bytes and strings that uses the default encoding:

https://github.com/pypa/setuptools/blob/v40.4.3/setuptools/command/develop.py#L185-L189
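
A minimal sketch of the failure mode (the file name is hypothetical, and an ASCII locale like the Jenkins workers is assumed):

```python
# Sketch: reading a script that contains a UTF-8 non-breaking space while the
# locale's preferred encoding is ASCII (e.g. LANG=C) fails in Python 3,
# because open() without an explicit encoding decodes with the locale default.
with open("some-bin-script", "wb") as f:   # hypothetical file name
    f.write(b"# \xc2\xa0comment\n")        # UTF-8 bytes for U+00A0

try:
    open("some-bin-script").read()         # what setuptools effectively does
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc2 ... (only under an ASCII locale)
```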

How was this patch tested?

Jenkins

@HyukjinKwon HyukjinKwon changed the title [WIP][HOTFIX] PIP failure fix [WIP][HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character Oct 20, 2018
@HyukjinKwon HyukjinKwon changed the title [WIP][HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character [HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character Oct 20, 2018
dev/run-tests.py Outdated
@@ -551,7 +551,8 @@ def main():
     if not changed_files or any(f.endswith(".scala")
                                 or f.endswith("scalastyle-config.xml")
                                 for f in changed_files):
-        run_scala_style_checks()
+        # run_scala_style_checks()
Member

Is this change necessary? Or is it just a tentative workaround?

Member Author

Tentative workaround, yup. I will revert it all once the tests pass.

Member

Got it

@@ -16,7 +16,7 @@
 #

 """
-PySpark is the Python API for Spark.
+PySpark is the Python API for Spark
Member

Is this an intentional change?

Member Author

Yup, I will revert this one too. It is just intended to run only the Python tests, since the Scala tests take long.

@kiszk (Member) commented Oct 20, 2018

Thank you for this hotfix. I found 0xc2 and 0xa0 after # in docker-image-tool.sh, at the spot @HyukjinKwon fixed.

> git log | head -1
commit fc9ba9dcc6ad47fbd05f093b94e7e13580000d5f
> /home/ishizaki/Spark/PR/tmp/spark > od -c -t x1 bin/docker-image-tool.sh  | grep -A 4 -B 4 c2
         75  61  6c  6c  79  20  62  65  65  6e  20  62  75  69  6c  74
0005200   /   i   s       a       r   u   n   n   a   b   l   e       d
         2f  69  73  20  61  20  72  75  6e  6e  61  62  6c  65  20  64
0005220   i   s   t   r   i   b   u   t   i   o   n  \n           # 302
         69  73  74  72  69  62  75  74  69  6f  6e  0a  20  20  23  c2
0005240 240   i   .   e   .       t   h   e       S   p   a   r   k    
         a0  69  2e  65  2e  20  74  68  65  20  53  70  61  72  6b  20
0005260   J   A   R   s       t   h   a   t       t   h   e       D   o
         4a  41  52  73  20  74  68  61  74  20  74  68  65  20  44  6f

The build error occurs due to the 0xc2 byte in a script under the bin directory:

Installing collected packages: py4j, pyspark
  Running setup.py develop for pyspark
    Complete output from command /tmp/tmp.EWtmCOYUBn/3.5/bin/python -c "import setuptools, tokenize;__file__='/home/jenkins/workspace/SparkPullRequestBuilder/python/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps:
    running develop
    running egg_info
    writing dependency_links to pyspark.egg-info/dependency_links.txt
    writing pyspark.egg-info/PKG-INFO
    writing requirements to pyspark.egg-info/requires.txt
    writing top-level names to pyspark.egg-info/top_level.txt
    Could not import pypandoc - required to package PySpark
    package init file 'deps/bin/__init__.py' not found (or not a regular file)
    package init file 'deps/jars/__init__.py' not found (or not a regular file)
    package init file 'pyspark/python/pyspark/__init__.py' not found (or not a regular file)
    package init file 'lib/__init__.py' not found (or not a regular file)
    package init file 'deps/data/__init__.py' not found (or not a regular file)
    package init file 'deps/licenses/__init__.py' not found (or not a regular file)
    package init file 'deps/examples/__init__.py' not found (or not a regular file)
    reading manifest file 'pyspark.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
    warning: no previously-included files matching '__pycache__' found anywhere in distribution
    warning: no previously-included files matching '.DS_Store' found anywhere in distribution
    writing manifest file 'pyspark.egg-info/SOURCES.txt'
    running build_ext
    Creating /tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/pyspark.egg-link (link to .)
    Adding pyspark 3.0.0.dev0 to easy-install.pth file
    Installing load-spark-env.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-submit script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-class.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing beeline.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing find-spark-home.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing run-example script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-shell2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing pyspark script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing sparkR script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-sql script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-submit.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-shell script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing beeline script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-submit2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing find-spark-home script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing sparkR.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing run-example.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing sparkR2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-shell.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-sql.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-class2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/jenkins/workspace/SparkPullRequestBuilder/python/setup.py", line 224, in <module>
        'Programming Language :: Python :: Implementation :: PyPy']
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/__init__.py", line 140, in setup
        return distutils.core.setup(**attrs)
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/command/develop.py", line 38, in run
        self.install_for_development()
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/command/develop.py", line 154, in install_for_development
        self.process_distribution(None, self.dist, not self.no_deps)
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/command/easy_install.py", line 729, in process_distribution
        self.install_egg_scripts(dist)
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/command/develop.py", line 189, in install_egg_scripts
        script_text = strm.read()
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2719: ordinal not in range(128)
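
For reference, the two bytes above are simply the UTF-8 encoding of U+00A0 (no-break space); a minimal sketch that reproduces the decode error from the log:

```python
# 0xc2 0xa0 is the UTF-8 encoding of U+00A0 (NO-BREAK SPACE):
print("\u00a0".encode("utf-8"))  # b'\xc2\xa0'

# Decoding those bytes as ASCII reproduces the failure in the build log:
try:
    b"# \xc2\xa0i.e. the Spark JARs".decode("ascii")
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc2 in position 2: ...
```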

@kiszk (Member) commented Oct 20, 2018

LGTM after reverting the workaround, pending Jenkins.

@kiszk (Member) commented Oct 20, 2018

I will check back here tomorrow morning, Japan time.

@@ -79,7 +79,7 @@ function build {
fi

# Verify that Spark has actually been built/is a runnable distribution
# i.e. the Spark JARs that the Docker files will place into the image are present
Member

Hi, @rvesse. Do you have any clue about the special characters after '#' that @kiszk found here? I'm wondering if we can avoid this situation in the future. Do you have any ideas?

Member Author

Yea, some text editors insert non-breaking spaces, and that's probably why. I have dealt with similar problems in md files before. I think we have a Scalastyle rule for non-ASCII characters, but apparently not for scripts.

Member

Right. We need some automatic check for scripts too. Anyway, thank you for fixing this, @HyukjinKwon! This blocks almost every PR.
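
As a rough sketch of what such a check could look like (a hypothetical helper, not part of this PR), a small Python script that flags non-ASCII bytes in the scripts under bin, similar in spirit to the Scalastyle non-ASCII rule:

```python
#!/usr/bin/env python
# Hypothetical lint helper (not part of this PR): report any non-ASCII byte
# in the shell scripts under bin/, with its file, line, and column.
import os
import sys

failed = False
for root, _, files in os.walk("bin"):
    for name in files:
        path = os.path.join(root, name)
        with open(path, "rb") as f:                       # read raw bytes
            for lineno, line in enumerate(f, 1):
                for col, byte in enumerate(bytearray(line), 1):
                    if byte > 0x7F:                       # outside ASCII range
                        print("%s:%d:%d: non-ASCII byte 0x%02x"
                              % (path, lineno, col, byte))
                        failed = True
sys.exit(1 if failed else 0)
```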

Member

Huh, this was edited in Xcode on OS X, so it was almost certainly defaulting to UTF-8 encoding.

Member Author

It was more because a non-breaking space was used instead of a regular space. It was valid UTF-8, but not ASCII-compatible.

Member

Obvious question, but why are we still using ASCII encoding for anything?

Member Author

For the issue itself, it's down to a historical quirk of Python. Python 2's str type behaved like bytes. That turned out to be a mistake that confused users about the distinction between bytes and strings, so Python 3 introduced str as a proper Unicode string type, like other programming languages.

open(...).read() returns str (which is bytes) in Python 2, but in Python 3 it returns a Unicode string - which requires an implicit conversion from bytes to string. It looks like it had to be this way to minimise breaking changes in user code.

So a bytes-to-string conversion happens here, and unfortunately our Jenkins system's default encoding is set to ascii (even though UTF-8 is arguably more common).

For non-ASCII characters in general, please see the justification for the Scalastyle rule at http://www.scalastyle.org/rules-dev.html.
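
A minimal sketch of that Python 2 vs Python 3 difference (the file name is hypothetical; the ASCII locale matches the Jenkins setup described above):

```python
import locale

with open("script.sh", "wb") as f:        # hypothetical file name
    f.write(b"#\xc2\xa0comment\n")        # UTF-8 non-breaking space after '#'

# Python 2: open(path).read() returns str (bytes) - no decoding, no error.
# Python 3: open(path).read() returns str (unicode) - the bytes are decoded
# implicitly with the locale's preferred encoding:
print(locale.getpreferredencoding(False))  # 'ascii' on the Jenkins workers

# Under an ASCII locale (e.g. LANG=C) this raises UnicodeDecodeError in
# Python 3, exactly like the setuptools 'develop' step in the build log:
text = open("script.sh").read()
```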

Member Author

Therefore, using non-breaking spaces in code is obviously not good practice. Please avoid it next time.

Member

It wasn't as if I used a non-breaking space intentionally; I just used my OS's default editor for shell scripts!

This reverts commit cd5cab6.
This reverts commit 5439ff2.
This reverts commit 38abee5.
@HyukjinKwon (Member Author)

The pip packaging tests passed. Let me merge this one, since it blocks almost every PR.

@HyukjinKwon (Member Author)

Merged to master.

@asfgit asfgit closed this in 5330c19 Oct 20, 2018
@SparkQA commented Oct 20, 2018

Test build #97658 has finished for PR 22782 at commit 38abee5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2018

Test build #97656 has finished for PR 22782 at commit cd5cab6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2018

Test build #97657 has finished for PR 22782 at commit 5439ff2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2018

Test build #97659 has finished for PR 22782 at commit 114fe0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character

## What changes were proposed in this pull request?

pip installation requires packaging the bin scripts together.

https://github.com/apache/spark/blob/master/python/setup.py#L71

The recent fix at apache@ec96d34 introduced a non-ASCII-compatible character (a non-breaking space, I guess).

This is usually not a problem, but Jenkins's default encoding is `ascii`, and while the scripts are copied there is an implicit conversion between bytes and strings that uses the default encoding:

https://github.com/pypa/setuptools/blob/v40.4.3/setuptools/command/develop.py#L185-L189

## How was this patch tested?

Jenkins

Closes apache#22782 from HyukjinKwon/pip-failure-fix.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the pip-failure-fix branch March 3, 2020 01:20
MaxGekk pushed a commit that referenced this pull request Apr 4, 2021
…tion

### What changes were proposed in this pull request?

This PR replaces non-ASCII characters with ASCII characters, where possible, in the PySpark documentation.

### Why are the changes needed?

To avoid unnecessarily using non-ASCII characters, which can lead to issues such as #32047 or #22782.

### Does this PR introduce _any_ user-facing change?

Virtually no.

### How was this patch tested?

Found via (Mac OS):

```bash
# In Spark root directory
cd python
pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .`
```

Closes #32048 from HyukjinKwon/minor-fix.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Max Gekk <[email protected]>