
[HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character #22782

Closed
wants to merge 7 commits into apache:master from HyukjinKwon:pip-failure-fix

Conversation

@HyukjinKwon (Member) commented Oct 20, 2018

What changes were proposed in this pull request?

pip installation requires packaging the bin scripts together.

https://github.com/apache/spark/blob/master/python/setup.py#L71

The recent fix (ec96d34) introduced a non-ASCII-compatible character (a non-breaking space, I guess).

This is usually not a problem, but Jenkins's default encoding appears to be ascii, and while the scripts are copied there is an implicit conversion between bytes and strings that uses the default encoding:

https://github.com/pypa/setuptools/blob/v40.4.3/setuptools/command/develop.py#L185-L189
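
A minimal sketch of the failure mode (the file name is hypothetical, and an ASCII locale like the Jenkins workers is assumed):

```python
# Sketch: reading a script that contains a UTF-8 non-breaking space while the
# locale's preferred encoding is ASCII (e.g. LANG=C) fails in Python 3,
# because open() without an explicit encoding decodes with the locale default.
with open("some-bin-script", "wb") as f:   # hypothetical file name
    f.write(b"# \xc2\xa0comment\n")        # UTF-8 bytes for U+00A0

try:
    open("some-bin-script").read()         # what setuptools effectively does
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc2 ... (only under an ASCII locale)
```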

How was this patch tested?

Jenkins

@HyukjinKwon HyukjinKwon changed the title [WIP][HOTFIX] PIP failure fix [WIP][HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character Oct 20, 2018
@HyukjinKwon HyukjinKwon changed the title [WIP][HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character [HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character Oct 20, 2018
dev/run-tests.py Outdated
@@ -551,7 +551,8 @@ def main():
     if not changed_files or any(f.endswith(".scala")
                                 or f.endswith("scalastyle-config.xml")
                                 for f in changed_files):
-        run_scala_style_checks()
+        # run_scala_style_checks()
Member

Is this change necessary? Or is it just a tentative workaround?

Member Author

Tentative workaround, yup. I will revert it all once the tests pass.

Member

Got it

@@ -16,7 +16,7 @@
 #

 """
-PySpark is the Python API for Spark.
+PySpark is the Python API for Spark
Member

Is this an intentional change?

Member Author

Yup, I will revert this one too. It is just intended to run only the Python tests, since the Scala tests take long.

@kiszk (Member) commented Oct 20, 2018

Thank you for this hotfix. I found 0xc2 and 0xa0 after # in docker-image-tool.sh, at the spot @HyukjinKwon fixed.

> git log | head -1
commit fc9ba9dcc6ad47fbd05f093b94e7e13580000d5f
> /home/ishizaki/Spark/PR/tmp/spark > od -c -t x1 bin/docker-image-tool.sh  | grep -A 4 -B 4 c2
         75  61  6c  6c  79  20  62  65  65  6e  20  62  75  69  6c  74
0005200   /   i   s       a       r   u   n   n   a   b   l   e       d
         2f  69  73  20  61  20  72  75  6e  6e  61  62  6c  65  20  64
0005220   i   s   t   r   i   b   u   t   i   o   n  \n           # 302
         69  73  74  72  69  62  75  74  69  6f  6e  0a  20  20  23  c2
0005240 240   i   .   e   .       t   h   e       S   p   a   r   k    
         a0  69  2e  65  2e  20  74  68  65  20  53  70  61  72  6b  20
0005260   J   A   R   s       t   h   a   t       t   h   e       D   o
         4a  41  52  73  20  74  68  61  74  20  74  68  65  20  44  6f

The build error occurs due to the 0xc2 byte in a script under the bin directory:

Installing collected packages: py4j, pyspark
  Running setup.py develop for pyspark
    Complete output from command /tmp/tmp.EWtmCOYUBn/3.5/bin/python -c "import setuptools, tokenize;__file__='/home/jenkins/workspace/SparkPullRequestBuilder/python/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps:
    running develop
    running egg_info
    writing dependency_links to pyspark.egg-info/dependency_links.txt
    writing pyspark.egg-info/PKG-INFO
    writing requirements to pyspark.egg-info/requires.txt
    writing top-level names to pyspark.egg-info/top_level.txt
    Could not import pypandoc - required to package PySpark
    package init file 'deps/bin/__init__.py' not found (or not a regular file)
    package init file 'deps/jars/__init__.py' not found (or not a regular file)
    package init file 'pyspark/python/pyspark/__init__.py' not found (or not a regular file)
    package init file 'lib/__init__.py' not found (or not a regular file)
    package init file 'deps/data/__init__.py' not found (or not a regular file)
    package init file 'deps/licenses/__init__.py' not found (or not a regular file)
    package init file 'deps/examples/__init__.py' not found (or not a regular file)
    reading manifest file 'pyspark.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
    warning: no previously-included files matching '__pycache__' found anywhere in distribution
    warning: no previously-included files matching '.DS_Store' found anywhere in distribution
    writing manifest file 'pyspark.egg-info/SOURCES.txt'
    running build_ext
    Creating /tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/pyspark.egg-link (link to .)
    Adding pyspark 3.0.0.dev0 to easy-install.pth file
    Installing load-spark-env.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-submit script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-class.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing beeline.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing find-spark-home.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing run-example script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-shell2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing pyspark script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing sparkR script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-sql script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-submit.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-shell script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing beeline script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-submit2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing find-spark-home script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing sparkR.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing run-example.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing sparkR2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-shell.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-sql.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Installing spark-class2.cmd script to /tmp/tmp.EWtmCOYUBn/3.5/bin
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/jenkins/workspace/SparkPullRequestBuilder/python/setup.py", line 224, in <module>
        'Programming Language :: Python :: Implementation :: PyPy']
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/__init__.py", line 140, in setup
        return distutils.core.setup(**attrs)
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/command/develop.py", line 38, in run
        self.install_for_development()
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/command/develop.py", line 154, in install_for_development
        self.process_distribution(None, self.dist, not self.no_deps)
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/command/easy_install.py", line 729, in process_distribution
        self.install_egg_scripts(dist)
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/site-packages/setuptools/command/develop.py", line 189, in install_egg_scripts
        script_text = strm.read()
      File "/tmp/tmp.EWtmCOYUBn/3.5/lib/python3.5/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2719: ordinal not in range(128)
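
For reference, the two bytes above are simply the UTF-8 encoding of U+00A0 (no-break space); a minimal sketch that reproduces the decode error from the log:

```python
# 0xc2 0xa0 is the UTF-8 encoding of U+00A0 (NO-BREAK SPACE):
print("\u00a0".encode("utf-8"))  # b'\xc2\xa0'

# Decoding those bytes as ASCII reproduces the failure in the build log:
try:
    b"# \xc2\xa0i.e. the Spark JARs".decode("ascii")
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc2 in position 2: ...
```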

@kiszk (Member) commented Oct 20, 2018

LGTM after reverting the workaround, pending Jenkins.

@kiszk (Member) commented Oct 20, 2018

I will check back here tomorrow morning, Japan time.

@@ -79,7 +79,7 @@ function build {
fi

# Verify that Spark has actually been built/is a runnable distribution
# i.e. the Spark JARs that the Docker files will place into the image are present
Member

Hi, @rvesse. Do you have any clue about the special characters after '#' that @kiszk found here? I'm wondering if we can avoid this situation in the future. Do you have any ideas?

Member Author

Yea, some text editors insert non-breaking spaces, and that's probably why. I have dealt with similar problems in md files before. I think we have a Scalastyle rule for non-ASCII characters, but apparently not for scripts.

Member

Right. We need some automatic check for scripts too. Anyway, thank you for fixing this, @HyukjinKwon! This blocks almost every PR.
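
As a rough sketch of what such a check could look like (a hypothetical helper, not part of this PR), a small Python script that flags non-ASCII bytes in the scripts under bin, similar in spirit to the Scalastyle non-ASCII rule:

```python
#!/usr/bin/env python
# Hypothetical lint helper (not part of this PR): report any non-ASCII byte
# in the shell scripts under bin/, with its file, line, and column.
import os
import sys

failed = False
for root, _, files in os.walk("bin"):
    for name in files:
        path = os.path.join(root, name)
        with open(path, "rb") as f:                       # read raw bytes
            for lineno, line in enumerate(f, 1):
                for col, byte in enumerate(bytearray(line), 1):
                    if byte > 0x7F:                       # outside ASCII range
                        print("%s:%d:%d: non-ASCII byte 0x%02x"
                              % (path, lineno, col, byte))
                        failed = True
sys.exit(1 if failed else 0)
```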

Member

Huh, this was edited in Xcode on OS X, so it was almost certainly defaulting to UTF-8 encoding.

Member Author

It was more because a non-breaking space was used instead of a regular space. It was valid UTF-8, but not ASCII-compatible.

Member

Obvious question, but why are we still using ASCII encoding for anything?

Member Author

For the issue itself, it's down to a historical quirk of Python. Python 2's str type behaved like bytes. That turned out to be a mistake that confused users about the distinction between bytes and strings, so Python 3 introduced str as a proper Unicode string type, like other programming languages.

open(...).read() returns str (which is bytes) in Python 2, but in Python 3 it returns a Unicode string - which requires an implicit conversion from bytes to string. It looks like it had to be this way to minimise breaking changes in user code.

So a bytes-to-string conversion happens here, and unfortunately our Jenkins system's default encoding is set to ascii (even though UTF-8 is arguably more common).

For non-ASCII characters in general, please see the justification for the Scalastyle rule at http://www.scalastyle.org/rules-dev.html.
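
A minimal sketch of that Python 2 vs Python 3 difference (the file name is hypothetical; the ASCII locale matches the Jenkins setup described above):

```python
import locale

with open("script.sh", "wb") as f:        # hypothetical file name
    f.write(b"#\xc2\xa0comment\n")        # UTF-8 non-breaking space after '#'

# Python 2: open(path).read() returns str (bytes) - no decoding, no error.
# Python 3: open(path).read() returns str (unicode) - the bytes are decoded
# implicitly with the locale's preferred encoding:
print(locale.getpreferredencoding(False))  # 'ascii' on the Jenkins workers

# Under an ASCII locale (e.g. LANG=C) this raises UnicodeDecodeError in
# Python 3, exactly like the setuptools 'develop' step in the build log:
text = open("script.sh").read()
```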

Member Author

Therefore, using non-breaking spaces in code is obviously not good practice. Please avoid it next time.

Member

It wasn't as if I used a non-breaking space intentionally; I just used my OS's default editor for shell scripts!

This reverts commit cd5cab6.
This reverts commit 5439ff2.
This reverts commit 38abee5.
@HyukjinKwon (Member Author)

The pip packaging tests passed. Let me merge this one, since it blocks almost every PR.

@HyukjinKwon (Member Author)

Merged to master.

@asfgit asfgit closed this in 5330c19 Oct 20, 2018
@SparkQA commented Oct 20, 2018

Test build #97658 has finished for PR 22782 at commit 38abee5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2018

Test build #97656 has finished for PR 22782 at commit cd5cab6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2018

Test build #97657 has finished for PR 22782 at commit 5439ff2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 20, 2018

Test build #97659 has finished for PR 22782 at commit 114fe0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[HOTFIX] Fix PySpark pip packaging tests by non-ascii compatible character

## What changes were proposed in this pull request?

pip installation requires packaging the bin scripts together.

https://github.com/apache/spark/blob/master/python/setup.py#L71

The recent fix at apache@ec96d34 introduced a non-ASCII-compatible character (a non-breaking space, I guess).

This is usually not a problem, but Jenkins's default encoding is `ascii`, and while the scripts are copied there is an implicit conversion between bytes and strings that uses the default encoding:

https://github.com/pypa/setuptools/blob/v40.4.3/setuptools/command/develop.py#L185-L189

## How was this patch tested?

Jenkins

Closes apache#22782 from HyukjinKwon/pip-failure-fix.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the pip-failure-fix branch March 3, 2020 01:20
MaxGekk pushed a commit that referenced this pull request Apr 4, 2021
…tion

### What changes were proposed in this pull request?

This PR replaces non-ASCII characters with ASCII characters, where possible, in the PySpark documentation.

### Why are the changes needed?

To avoid unnecessarily using non-ASCII characters, which can lead to issues such as #32047 or #22782.

### Does this PR introduce _any_ user-facing change?

Virtually no.

### How was this patch tested?

Found via (Mac OS):

```bash
# In Spark root directory
cd python
pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .`
```

Closes #32048 from HyukjinKwon/minor-fix.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Max Gekk <[email protected]>