
[SPARK-19019][PYTHON][BRANCH-2.0] Fix hijacked collections.namedtuple and port cloudpickle changes for PySpark to work with Python 3.6.0 #17374

Closed


@HyukjinKwon (Member) commented Mar 21, 2017

What changes were proposed in this pull request?

This PR proposes to backport #16429 to branch-2.0 so that Python 3.6.0 works with Spark 2.0.x.

How was this patch tested?

Manually, via

./run-tests --python-executables=python3.6
Finished test(python3.6): pyspark.tests (124s)
Finished test(python3.6): pyspark.accumulators (4s)
Finished test(python3.6): pyspark.broadcast (4s)
Finished test(python3.6): pyspark.conf (3s)
Finished test(python3.6): pyspark.context (15s)
Finished test(python3.6): pyspark.ml.classification (24s)
Finished test(python3.6): pyspark.sql.tests (190s)
Finished test(python3.6): pyspark.mllib.tests (190s)
Finished test(python3.6): pyspark.ml.clustering (14s)
Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
Finished test(python3.6): pyspark.ml.recommendation (18s)
Finished test(python3.6): pyspark.ml.feature (28s)
Finished test(python3.6): pyspark.ml.evaluation (28s)
Finished test(python3.6): pyspark.ml.regression (21s)
Finished test(python3.6): pyspark.ml.tuning (17s)
Finished test(python3.6): pyspark.streaming.tests (239s)
Finished test(python3.6): pyspark.mllib.evaluation (15s)
Finished test(python3.6): pyspark.mllib.classification (24s)
Finished test(python3.6): pyspark.mllib.clustering (37s)
Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
Finished test(python3.6): pyspark.mllib.fpm (19s)
Finished test(python3.6): pyspark.mllib.feature (19s)
Finished test(python3.6): pyspark.mllib.random (8s)
Finished test(python3.6): pyspark.ml.tests (76s)
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
Finished test(python3.6): pyspark.mllib.recommendation (21s)
Finished test(python3.6): pyspark.mllib.linalg.distributed (27s)
Finished test(python3.6): pyspark.mllib.regression (22s)
Finished test(python3.6): pyspark.mllib.stat._statistics (11s)
Finished test(python3.6): pyspark.mllib.tree (16s)
Finished test(python3.6): pyspark.profiler (8s)
Finished test(python3.6): pyspark.shuffle (1s)
Finished test(python3.6): pyspark.mllib.util (17s)
Finished test(python3.6): pyspark.serializers (12s)
Finished test(python3.6): pyspark.rdd (18s)
Finished test(python3.6): pyspark.sql.conf (4s)
Finished test(python3.6): pyspark.sql.catalog (14s)
Finished test(python3.6): pyspark.sql.column (13s)
Finished test(python3.6): pyspark.sql.context (15s)
Finished test(python3.6): pyspark.sql.group (26s)
Finished test(python3.6): pyspark.sql.dataframe (31s)
Finished test(python3.6): pyspark.sql.functions (32s)
Finished test(python3.6): pyspark.sql.types (5s)
Finished test(python3.6): pyspark.sql.streaming (11s)
Finished test(python3.6): pyspark.sql.window (5s)
Finished test(python3.6): pyspark.streaming.util (0s)
Finished test(python3.6): pyspark.sql.session (15s)
Finished test(python3.6): pyspark.sql.readwriter (34s)
Tests passed in 376 seconds

@SparkQA commented Mar 21, 2017

Test build #74970 has finished for PR 17374 at commit 242afc9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author) commented Mar 21, 2017

(I manually tested on an earlier commit, a few commits back, because the build in branch-2.0 seems to be failing.)

… cloudpickle changes for PySpark to work with Python 3.6.0

## What changes were proposed in this pull request?

Currently, PySpark does not work with Python 3.6.0.

Running `./bin/pyspark` simply throws the error below, and PySpark does not work at all:

```
Traceback (most recent call last):
  File ".../spark/python/pyspark/shell.py", line 30, in <module>
    import pyspark
  File ".../spark/python/pyspark/__init__.py", line 46, in <module>
    from pyspark.context import SparkContext
  File ".../spark/python/pyspark/context.py", line 36, in <module>
    from pyspark.java_gateway import launch_gateway
  File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
    import pkgutil
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
    ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
  File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
    cls = _old_namedtuple(*args, **kwargs)
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
```

The root cause seems to be that some arguments of `namedtuple` became keyword-only arguments in Python 3.6.0 (see https://bugs.python.org/issue25628).

We currently copy this function via `types.FunctionType`, which does not carry over the default values of keyword-only arguments (that is, `namedtuple.__kwdefaults__`), so those arguments end up unbound inside the copied function.
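
For illustration, here is a minimal, self-contained sketch (not the PySpark code itself) showing that a function copied this way loses its keyword-only defaults:

```python
import types

def f(a, *, b=10):
    return a + b

# Copy the function object the same way serializers.py copies namedtuple:
# argdefs only sets __defaults__, never __kwdefaults__.
g = types.FunctionType(f.__code__, f.__globals__, f.__name__,
                       f.__defaults__, f.__closure__)

print(f.__kwdefaults__)  # {'b': 10}
print(g.__kwdefaults__)  # None -> g(1) raises TypeError: missing keyword-only argument 'b'
```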

This PR proposes to work around this by passing the keyword-only defaults explicitly via `kwargs`, since `types.FunctionType` does not seem to support setting them.
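
A rough sketch of the workaround (simplified from what `pyspark/serializers.py` does; names such as `_old_kwdefaults` are illustrative, not the actual identifiers):

```python
import collections
import types

def _copy_func(f):
    # types.FunctionType copies positional defaults but not __kwdefaults__.
    return types.FunctionType(f.__code__, f.__globals__, f.__name__,
                              f.__defaults__, f.__closure__)

_old_namedtuple = _copy_func(collections.namedtuple)

# Remember the original keyword-only defaults so they can be passed explicitly.
_old_kwdefaults = collections.namedtuple.__kwdefaults__ or {}

def namedtuple(*args, **kwargs):
    for k, v in _old_kwdefaults.items():
        kwargs.setdefault(k, v)
    return _old_namedtuple(*args, **kwargs)

Point = namedtuple("Point", "x y")
print(Point(1, 2))  # Point(x=1, y=2), instead of the TypeError above
```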

Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.

## How was this patch tested?

Manually tested with Python 2.7.6 and Python 3.6.0.

```
./bin/pyspark
```

manual creation of `namedtuple` both locally and inside an RDD with Python 3.6.0 (see the sketch below),

and Jenkins tests for other Python versions.
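
A manual check along those lines might look like the following (a hypothetical snippet for illustration; the master URL and app name are assumptions, not part of the PR):

```python
from pyspark import SparkContext          # importing pyspark patches collections.namedtuple
from collections import namedtuple

sc = SparkContext("local[2]", "namedtuple-check")

# Local creation: this calls the patched namedtuple, which failed on Python 3.6.0 before the fix.
Point = namedtuple("Point", "x y")
print(Point(1, 2))

# Creation and use inside an RDD: namedtuple instances are pickled to and from workers.
print(sc.parallelize(range(3)).map(lambda i: Point(i, i * 2)).collect())

sc.stop()
```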

Also,

```
./run-tests --python-executables=python3.6
```

```
Will test against the following Python executables: ['python3.6']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Finished test(python3.6): pyspark.sql.tests (192s)
Finished test(python3.6): pyspark.accumulators (3s)
Finished test(python3.6): pyspark.mllib.tests (198s)
Finished test(python3.6): pyspark.broadcast (3s)
Finished test(python3.6): pyspark.conf (2s)
Finished test(python3.6): pyspark.context (14s)
Finished test(python3.6): pyspark.ml.classification (21s)
Finished test(python3.6): pyspark.ml.evaluation (11s)
Finished test(python3.6): pyspark.ml.clustering (20s)
Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
Finished test(python3.6): pyspark.streaming.tests (240s)
Finished test(python3.6): pyspark.tests (240s)
Finished test(python3.6): pyspark.ml.recommendation (19s)
Finished test(python3.6): pyspark.ml.feature (36s)
Finished test(python3.6): pyspark.ml.regression (37s)
Finished test(python3.6): pyspark.ml.tuning (28s)
Finished test(python3.6): pyspark.mllib.classification (26s)
Finished test(python3.6): pyspark.mllib.evaluation (18s)
Finished test(python3.6): pyspark.mllib.clustering (44s)
Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
Finished test(python3.6): pyspark.mllib.feature (26s)
Finished test(python3.6): pyspark.mllib.fpm (23s)
Finished test(python3.6): pyspark.mllib.random (8s)
Finished test(python3.6): pyspark.ml.tests (92s)
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
Finished test(python3.6): pyspark.mllib.recommendation (24s)
Finished test(python3.6): pyspark.mllib.regression (26s)
Finished test(python3.6): pyspark.profiler (9s)
Finished test(python3.6): pyspark.mllib.tree (16s)
Finished test(python3.6): pyspark.shuffle (1s)
Finished test(python3.6): pyspark.mllib.util (18s)
Finished test(python3.6): pyspark.serializers (11s)
Finished test(python3.6): pyspark.rdd (20s)
Finished test(python3.6): pyspark.sql.conf (8s)
Finished test(python3.6): pyspark.sql.catalog (17s)
Finished test(python3.6): pyspark.sql.column (18s)
Finished test(python3.6): pyspark.sql.context (18s)
Finished test(python3.6): pyspark.sql.group (27s)
Finished test(python3.6): pyspark.sql.dataframe (33s)
Finished test(python3.6): pyspark.sql.functions (35s)
Finished test(python3.6): pyspark.sql.types (6s)
Finished test(python3.6): pyspark.sql.streaming (13s)
Finished test(python3.6): pyspark.streaming.util (0s)
Finished test(python3.6): pyspark.sql.session (16s)
Finished test(python3.6): pyspark.sql.window (4s)
Finished test(python3.6): pyspark.sql.readwriter (35s)
Tests passed in 433 seconds
```

Author: hyukjinkwon <[email protected]>

Closes apache#16429 from HyukjinKwon/SPARK-19019.
@HyukjinKwon force-pushed the SPARK-19019-backport branch from 242afc9 to affc3b2 on March 21, 2017 14:01
@SparkQA commented Mar 21, 2017

Test build #74982 has finished for PR 17374 at commit affc3b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author)

cc @davies

@davies (Contributor) commented Mar 22, 2017

lgtm

@jbloom22

Our users (https://hail.is) are running into this bug. Will the backport be merged soon? Thanks!

@holdenk (Contributor) commented Apr 17, 2017

LGTM I'll merge this today.

@holdenk (Contributor) commented Apr 17, 2017

@jbloom22, is there any particular reason you're waiting on this in the 2.X branch?

asfgit pushed a commit that referenced this pull request Apr 17, 2017
…e` and port cloudpickle changes for PySpark to work with Python 3.6.0

## What changes were proposed in this pull request?

This PR proposes to backport #16429 to branch-2.0 so that Python 3.6.0 works with Spark 2.0.x.

## How was this patch tested?

Manually, via

```
./run-tests --python-executables=python3.6
```

```
Finished test(python3.6): pyspark.tests (124s)
Finished test(python3.6): pyspark.accumulators (4s)
Finished test(python3.6): pyspark.broadcast (4s)
Finished test(python3.6): pyspark.conf (3s)
Finished test(python3.6): pyspark.context (15s)
Finished test(python3.6): pyspark.ml.classification (24s)
Finished test(python3.6): pyspark.sql.tests (190s)
Finished test(python3.6): pyspark.mllib.tests (190s)
Finished test(python3.6): pyspark.ml.clustering (14s)
Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
Finished test(python3.6): pyspark.ml.recommendation (18s)
Finished test(python3.6): pyspark.ml.feature (28s)
Finished test(python3.6): pyspark.ml.evaluation (28s)
Finished test(python3.6): pyspark.ml.regression (21s)
Finished test(python3.6): pyspark.ml.tuning (17s)
Finished test(python3.6): pyspark.streaming.tests (239s)
Finished test(python3.6): pyspark.mllib.evaluation (15s)
Finished test(python3.6): pyspark.mllib.classification (24s)
Finished test(python3.6): pyspark.mllib.clustering (37s)
Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
Finished test(python3.6): pyspark.mllib.fpm (19s)
Finished test(python3.6): pyspark.mllib.feature (19s)
Finished test(python3.6): pyspark.mllib.random (8s)
Finished test(python3.6): pyspark.ml.tests (76s)
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
Finished test(python3.6): pyspark.mllib.recommendation (21s)
Finished test(python3.6): pyspark.mllib.linalg.distributed (27s)
Finished test(python3.6): pyspark.mllib.regression (22s)
Finished test(python3.6): pyspark.mllib.stat._statistics (11s)
Finished test(python3.6): pyspark.mllib.tree (16s)
Finished test(python3.6): pyspark.profiler (8s)
Finished test(python3.6): pyspark.shuffle (1s)
Finished test(python3.6): pyspark.mllib.util (17s)
Finished test(python3.6): pyspark.serializers (12s)
Finished test(python3.6): pyspark.rdd (18s)
Finished test(python3.6): pyspark.sql.conf (4s)
Finished test(python3.6): pyspark.sql.catalog (14s)
Finished test(python3.6): pyspark.sql.column (13s)
Finished test(python3.6): pyspark.sql.context (15s)
Finished test(python3.6): pyspark.sql.group (26s)
Finished test(python3.6): pyspark.sql.dataframe (31s)
Finished test(python3.6): pyspark.sql.functions (32s)
Finished test(python3.6): pyspark.sql.types (5s)
Finished test(python3.6): pyspark.sql.streaming (11s)
Finished test(python3.6): pyspark.sql.window (5s)
Finished test(python3.6): pyspark.streaming.util (0s)
Finished test(python3.6): pyspark.sql.session (15s)
Finished test(python3.6): pyspark.sql.readwriter (34s)
Tests passed in 376 seconds
```

Author: hyukjinkwon <[email protected]>

Closes #17374 from HyukjinKwon/SPARK-19019-backport.
@holdenk (Contributor) commented Apr 17, 2017

Merged into branch-2.0.

Thanks for doing this @HyukjinKwon, and sorry for my hesitation with merging backports (these are the first pure backport PRs I've merged, rather than simultaneously merging into an old branch as well).

@HyukjinKwon (Member, Author)

I do understand the concern. Thank you @holdenk.

@HyukjinKwon deleted the SPARK-19019-backport branch on April 17, 2017 20:09