[SPARK-19019][PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #16429

HyukjinKwon · 2016-12-29T02:46:55Z

What changes were proposed in this pull request?

Currently, PySpark does not work with Python 3.6.0.

Running ./bin/pyspark simply throws the error as below and PySpark does not work at all:

Traceback (most recent call last):
  File ".../spark/python/pyspark/shell.py", line 30, in <module>
    import pyspark
  File ".../spark/python/pyspark/__init__.py", line 46, in <module>
    from pyspark.context import SparkContext
  File ".../spark/python/pyspark/context.py", line 36, in <module>
    from pyspark.java_gateway import launch_gateway
  File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
    import pkgutil
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
    ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
  File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
    cls = _old_namedtuple(*args, **kwargs)
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

The root cause seems because some arguments of namedtuple are now completely keyword-only arguments from Python 3.6.0 (See https://bugs.python.org/issue25628).

We currently copy this function via types.FunctionType which does not set the default values of keyword-only arguments (meaning namedtuple.__kwdefaults__) and this seems causing internally missing values in the function (non-bound arguments).

This PR proposes to work around this by manually setting it via kwargs as types.FunctionType seems not supporting to set this.

Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.

How was this patch tested?

Manually tested with Python 2.7.6 and Python 3.6.0.

./bin/pyspsark

, manual creation of namedtuple both in local and rdd with Python 3.6.0,

and Jenkins tests for other Python versions.

Also,

./run-tests --python-executables=python3.6

Will test against the following Python executables: ['python3.6']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Finished test(python3.6): pyspark.sql.tests (192s)
Finished test(python3.6): pyspark.accumulators (3s)
Finished test(python3.6): pyspark.mllib.tests (198s)
Finished test(python3.6): pyspark.broadcast (3s)
Finished test(python3.6): pyspark.conf (2s)
Finished test(python3.6): pyspark.context (14s)
Finished test(python3.6): pyspark.ml.classification (21s)
Finished test(python3.6): pyspark.ml.evaluation (11s)
Finished test(python3.6): pyspark.ml.clustering (20s)
Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
Finished test(python3.6): pyspark.streaming.tests (240s)
Finished test(python3.6): pyspark.tests (240s)
Finished test(python3.6): pyspark.ml.recommendation (19s)
Finished test(python3.6): pyspark.ml.feature (36s)
Finished test(python3.6): pyspark.ml.regression (37s)
Finished test(python3.6): pyspark.ml.tuning (28s)
Finished test(python3.6): pyspark.mllib.classification (26s)
Finished test(python3.6): pyspark.mllib.evaluation (18s)
Finished test(python3.6): pyspark.mllib.clustering (44s)
Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
Finished test(python3.6): pyspark.mllib.feature (26s)
Finished test(python3.6): pyspark.mllib.fpm (23s)
Finished test(python3.6): pyspark.mllib.random (8s)
Finished test(python3.6): pyspark.ml.tests (92s)
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
Finished test(python3.6): pyspark.mllib.recommendation (24s)
Finished test(python3.6): pyspark.mllib.regression (26s)
Finished test(python3.6): pyspark.profiler (9s)
Finished test(python3.6): pyspark.mllib.tree (16s)
Finished test(python3.6): pyspark.shuffle (1s)
Finished test(python3.6): pyspark.mllib.util (18s)
Finished test(python3.6): pyspark.serializers (11s)
Finished test(python3.6): pyspark.rdd (20s)
Finished test(python3.6): pyspark.sql.conf (8s)
Finished test(python3.6): pyspark.sql.catalog (17s)
Finished test(python3.6): pyspark.sql.column (18s)
Finished test(python3.6): pyspark.sql.context (18s)
Finished test(python3.6): pyspark.sql.group (27s)
Finished test(python3.6): pyspark.sql.dataframe (33s)
Finished test(python3.6): pyspark.sql.functions (35s)
Finished test(python3.6): pyspark.sql.types (6s)
Finished test(python3.6): pyspark.sql.streaming (13s)
Finished test(python3.6): pyspark.streaming.util (0s)
Finished test(python3.6): pyspark.sql.session (16s)
Finished test(python3.6): pyspark.sql.window (4s)
Finished test(python3.6): pyspark.sql.readwriter (35s)
Tests passed in 433 seconds

HyukjinKwon · 2016-12-29T02:48:06Z

cc @davies and @JoshRosen. I know both of you are insightful in this area. I am not too sure if this is in the right direction (or rather the best fix) as it seems not even fixed in some other Python thirdparty libraries about function serialization yet (but an issue seems open by its maintainer). Do you mind if I ask to take a look?

SparkQA · 2016-12-29T02:48:40Z

Test build #70699 has finished for PR 16429 at commit fb04979.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-29T03:31:43Z

Test build #70702 has finished for PR 16429 at commit f4c56c8.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-29T03:41:29Z

Test build #70701 has finished for PR 16429 at commit 9ac9d01.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-29T04:22:50Z

Test build #70705 has finished for PR 16429 at commit b688e89.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-12-30T04:22:19Z

python/pyspark/serializers.py


    def _copy_func(f):
        return types.FunctionType(f.__code__, f.__globals__, f.__name__,
                                  f.__defaults__, f.__closure__)

+    def _kwdefaults(f):
+        kargs = getattr(f, "__kwdefaults__", None)


__kwdefaults__ can be None or not existing.

Could you put this comment into code?

HyukjinKwon · 2017-01-03T00:00:48Z

gentle ping..

azmras · 2017-01-03T08:16:10Z

after applying patch can you try to run
sc.parallelize(range(100), 8)
and confirm that it is working, because for me it is not...
and serialisation of objects goes crazy..

had to go back to python 3.5.2 for now..

Thanks for your efforts

HyukjinKwon · 2017-01-03T09:18:36Z

Thanks for your interests @azmras. I just checked it as below:

sc.parallelize(range(100), 8)

Traceback (most recent call last):
  File ".../spark/python/pyspark/cloudpickle.py", line 107, in dump
    return Pickler.dump(self, obj)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File ".../spark/python/pyspark/cloudpickle.py", line 214, in save_function
    self.save_function_tuple(obj)
  File ".../spark/python/pyspark/cloudpickle.py", line 244, in save_function_tuple
    code, f_globals, defaults, closure, dct, base_globals = self.extract_func_data(func)
  File ".../spark/python/pyspark/cloudpickle.py", line 306, in extract_func_data
    func_global_refs = self.extract_code_globals(code)
  File ".../spark/python/pyspark/cloudpickle.py", line 288, in extract_code_globals
    out_names.add(names[oparg])
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/rdd.py", line 198, in __repr__
    return self._jrdd.toString()
  File ".../spark/python/pyspark/rdd.py", line 2438, in _jrdd
    self._jrdd_deserializer, profiler)
  File ".../spark/python/pyspark/rdd.py", line 2371, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File ".../spark/python/pyspark/rdd.py", line 2357, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File ".../spark/python/pyspark/serializers.py", line 452, in dumps
    return cloudpickle.dumps(obj, 2)
  File ".../spark/python/pyspark/cloudpickle.py", line 667, in dumps
    cp.dump(obj)
  File ".../spark/python/pyspark/cloudpickle.py", line 115, in dump
    if "'i' format requires" in e.message:
AttributeError: 'IndexError' object has no attribute 'message'

It looks another issue with Python 3.6.0 which seems related with cloudpickle module.

We should port cloudpipe/cloudpickle@cbd3f34

HyukjinKwon · 2017-01-03T09:23:56Z

Hi @JoshRosen and @davies, do you think that should be ported in this PR? I am worried of making this PR harder to review by porting it here.

HyukjinKwon · 2017-01-03T11:28:20Z

Hi @azmras, now it should work fine for your case as well.

SparkQA · 2017-01-03T12:09:09Z

Test build #70812 has finished for PR 16429 at commit 7b96546.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

azmras · 2017-01-03T13:39:49Z

Thanks.. it worked...
while you are at it...
sc.parallelize(range(1000), 20).take(5)
is problemetic..

the original problem is back when you do some action on RDD
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

Will be thankful if you can look into it..

HyukjinKwon · 2017-01-03T13:44:12Z

@azmras Could you maybe double check? It works okay in my local as below:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Python version 3.6.0 (default, Dec 24 2016 00:01:50)
SparkSession available as 'spark'.
>>> sc.parallelize(range(100), 8).take(5)
[0, 1, 2, 3, 4]
>>> sc.parallelize(range(1000), 20).take(5)
[0, 1, 2, 3, 4]
>>>

azmras · 2017-01-04T05:30:40Z

Spark version 2.1.0
Using Python version 3.6.0 (default, Dec 24 2016 08:01:42)
SparkSession available as 'spark'.

sc.parallelize(range(1000), 20).take(5)
[0, 1, 2, 3, 4]

Thanks a lot it is working now.. had to patch zipped lib too.

HyukjinKwon · 2017-01-04T05:32:59Z

@azmras Thank you for confirming this.

azmras · 2017-01-04T05:45:12Z

just checked other things, ml, sql etc... everything is looking fine... I can safely say goodbye to python 3.5 now...

Thank you.

nbys · 2017-01-04T15:52:42Z

After I applied your patch I get this error:

Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/cloudpickle.py", line 107, in dump
    return Pickler.dump(self, obj)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/cloudpickle.py", line 214, in save_function
    self.save_function_tuple(obj)
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/cloudpickle.py", line 251, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 736, in save_tuple
    save(element)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 781, in save_list
    self._batch_appends(obj)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 805, in _batch_appends
    save(x)
  File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/cloudpickle.py", line 214, in save_function
    self.save_function_tuple(obj)
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/cloudpickle.py", line 244, in save_function_tuple
    code, f_globals, defaults, closure, dct, base_globals = self.extract_func_data(func)
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/cloudpickle.py", line 306, in extract_func_data
    func_global_refs = self.extract_code_globals(code)
  File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/cloudpickle.py", line 288, in extract_code_globals
    out_names.add(names[oparg])
IndexError: tuple index out of range

Could you please take a look?

Regards,
Nikolay

HyukjinKwon · 2017-01-04T16:01:41Z

Could you maybe check if this patch is applied properly? That error is exactly what this PR fixes and it seems the line number in the errors is not matched to the one in this PR.

azmras · 2017-01-05T04:50:44Z

@cxww107
Try to update both patched files in the following locations
/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark
/usr/local/Cellar/apache-spark/2.1.0/libexec/python/lib/pyspark.zip
See if it works..

Thanks

HyukjinKwon · 2017-01-06T00:36:11Z

ping @JoshRosen and @davies

HyukjinKwon · 2017-01-15T12:34:02Z

Thank you @davies. The only added comments are as blow in the rebased commits:

# __kwdefaults__ contains the default values of keyword-only arguments which are
# introduced from Python 3. The possible cases for __kwdefaults__ in namedtuple
# are as below:
#
# - Does not exist in Python 2.
# - Returns None in <= Python 3.5.x.
# - Returns a dictionary containing the default values to the keys from Python 3.6.x
#    (See https://bugs.python.org/issue25628).

HyukjinKwon · 2017-01-15T12:45:04Z

(I re-ran ./run-tests --python-executables=python3.6 for sure)

SparkQA · 2017-01-15T13:09:46Z

Test build #71396 has finished for PR 16429 at commit 6458d41.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-01-15T13:11:39Z

retest this please

SparkQA · 2017-01-15T13:54:29Z

Test build #71397 has finished for PR 16429 at commit 6458d41.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-01-17T03:39:42Z

@davies, Could this be merged by any change maybe?

davies · 2017-01-17T17:52:38Z

lgtm, merging into master and 2.1 branch.

… cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? Currently, PySpark does not work with Python 3.6.0. Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all: ``` Traceback (most recent call last): File ".../spark/python/pyspark/shell.py", line 30, in <module> import pyspark File ".../spark/python/pyspark/__init__.py", line 46, in <module> from pyspark.context import SparkContext File ".../spark/python/pyspark/context.py", line 36, in <module> from pyspark.java_gateway import launch_gateway File ".../spark/python/pyspark/java_gateway.py", line 31, in <module> from py4j.java_gateway import java_import, JavaGateway, GatewayClient File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module> File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module> import pkgutil File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple cls = _old_namedtuple(*args, **kwargs) TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module' ``` The root cause seems because some arguments of `namedtuple` are now completely keyword-only arguments from Python 3.6.0 (See https://bugs.python.org/issue25628). We currently copy this function via `types.FunctionType` which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`) and this seems causing internally missing values in the function (non-bound arguments). This PR proposes to work around this by manually setting it via `kwargs` as `types.FunctionType` seems not supporting to set this. Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0. ## How was this patch tested? Manually tested with Python 2.7.6 and Python 3.6.0. ``` ./bin/pyspsark ``` , manual creation of `namedtuple` both in local and rdd with Python 3.6.0, and Jenkins tests for other Python versions. Also, ``` ./run-tests --python-executables=python3.6 ``` ``` Will test against the following Python executables: ['python3.6'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] Finished test(python3.6): pyspark.sql.tests (192s) Finished test(python3.6): pyspark.accumulators (3s) Finished test(python3.6): pyspark.mllib.tests (198s) Finished test(python3.6): pyspark.broadcast (3s) Finished test(python3.6): pyspark.conf (2s) Finished test(python3.6): pyspark.context (14s) Finished test(python3.6): pyspark.ml.classification (21s) Finished test(python3.6): pyspark.ml.evaluation (11s) Finished test(python3.6): pyspark.ml.clustering (20s) Finished test(python3.6): pyspark.ml.linalg.__init__ (0s) Finished test(python3.6): pyspark.streaming.tests (240s) Finished test(python3.6): pyspark.tests (240s) Finished test(python3.6): pyspark.ml.recommendation (19s) Finished test(python3.6): pyspark.ml.feature (36s) Finished test(python3.6): pyspark.ml.regression (37s) Finished test(python3.6): pyspark.ml.tuning (28s) Finished test(python3.6): pyspark.mllib.classification (26s) Finished test(python3.6): pyspark.mllib.evaluation (18s) Finished test(python3.6): pyspark.mllib.clustering (44s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.feature (26s) Finished test(python3.6): pyspark.mllib.fpm (23s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.ml.tests (92s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.linalg.distributed (25s) Finished test(python3.6): pyspark.mllib.stat._statistics (15s) Finished test(python3.6): pyspark.mllib.recommendation (24s) Finished test(python3.6): pyspark.mllib.regression (26s) Finished test(python3.6): pyspark.profiler (9s) Finished test(python3.6): pyspark.mllib.tree (16s) Finished test(python3.6): pyspark.shuffle (1s) Finished test(python3.6): pyspark.mllib.util (18s) Finished test(python3.6): pyspark.serializers (11s) Finished test(python3.6): pyspark.rdd (20s) Finished test(python3.6): pyspark.sql.conf (8s) Finished test(python3.6): pyspark.sql.catalog (17s) Finished test(python3.6): pyspark.sql.column (18s) Finished test(python3.6): pyspark.sql.context (18s) Finished test(python3.6): pyspark.sql.group (27s) Finished test(python3.6): pyspark.sql.dataframe (33s) Finished test(python3.6): pyspark.sql.functions (35s) Finished test(python3.6): pyspark.sql.types (6s) Finished test(python3.6): pyspark.sql.streaming (13s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.session (16s) Finished test(python3.6): pyspark.sql.window (4s) Finished test(python3.6): pyspark.sql.readwriter (35s) Tests passed in 433 seconds ``` Author: hyukjinkwon <[email protected]> Closes #16429 from HyukjinKwon/SPARK-19019. (cherry picked from commit 20e6280) Signed-off-by: Davies Liu <[email protected]>

… cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? Currently, PySpark does not work with Python 3.6.0. Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all: ``` Traceback (most recent call last): File ".../spark/python/pyspark/shell.py", line 30, in <module> import pyspark File ".../spark/python/pyspark/__init__.py", line 46, in <module> from pyspark.context import SparkContext File ".../spark/python/pyspark/context.py", line 36, in <module> from pyspark.java_gateway import launch_gateway File ".../spark/python/pyspark/java_gateway.py", line 31, in <module> from py4j.java_gateway import java_import, JavaGateway, GatewayClient File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module> File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module> import pkgutil File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple cls = _old_namedtuple(*args, **kwargs) TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module' ``` The root cause seems because some arguments of `namedtuple` are now completely keyword-only arguments from Python 3.6.0 (See https://bugs.python.org/issue25628). We currently copy this function via `types.FunctionType` which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`) and this seems causing internally missing values in the function (non-bound arguments). This PR proposes to work around this by manually setting it via `kwargs` as `types.FunctionType` seems not supporting to set this. Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0. ## How was this patch tested? Manually tested with Python 2.7.6 and Python 3.6.0. ``` ./bin/pyspsark ``` , manual creation of `namedtuple` both in local and rdd with Python 3.6.0, and Jenkins tests for other Python versions. Also, ``` ./run-tests --python-executables=python3.6 ``` ``` Will test against the following Python executables: ['python3.6'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] Finished test(python3.6): pyspark.sql.tests (192s) Finished test(python3.6): pyspark.accumulators (3s) Finished test(python3.6): pyspark.mllib.tests (198s) Finished test(python3.6): pyspark.broadcast (3s) Finished test(python3.6): pyspark.conf (2s) Finished test(python3.6): pyspark.context (14s) Finished test(python3.6): pyspark.ml.classification (21s) Finished test(python3.6): pyspark.ml.evaluation (11s) Finished test(python3.6): pyspark.ml.clustering (20s) Finished test(python3.6): pyspark.ml.linalg.__init__ (0s) Finished test(python3.6): pyspark.streaming.tests (240s) Finished test(python3.6): pyspark.tests (240s) Finished test(python3.6): pyspark.ml.recommendation (19s) Finished test(python3.6): pyspark.ml.feature (36s) Finished test(python3.6): pyspark.ml.regression (37s) Finished test(python3.6): pyspark.ml.tuning (28s) Finished test(python3.6): pyspark.mllib.classification (26s) Finished test(python3.6): pyspark.mllib.evaluation (18s) Finished test(python3.6): pyspark.mllib.clustering (44s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.feature (26s) Finished test(python3.6): pyspark.mllib.fpm (23s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.ml.tests (92s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.linalg.distributed (25s) Finished test(python3.6): pyspark.mllib.stat._statistics (15s) Finished test(python3.6): pyspark.mllib.recommendation (24s) Finished test(python3.6): pyspark.mllib.regression (26s) Finished test(python3.6): pyspark.profiler (9s) Finished test(python3.6): pyspark.mllib.tree (16s) Finished test(python3.6): pyspark.shuffle (1s) Finished test(python3.6): pyspark.mllib.util (18s) Finished test(python3.6): pyspark.serializers (11s) Finished test(python3.6): pyspark.rdd (20s) Finished test(python3.6): pyspark.sql.conf (8s) Finished test(python3.6): pyspark.sql.catalog (17s) Finished test(python3.6): pyspark.sql.column (18s) Finished test(python3.6): pyspark.sql.context (18s) Finished test(python3.6): pyspark.sql.group (27s) Finished test(python3.6): pyspark.sql.dataframe (33s) Finished test(python3.6): pyspark.sql.functions (35s) Finished test(python3.6): pyspark.sql.types (6s) Finished test(python3.6): pyspark.sql.streaming (13s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.session (16s) Finished test(python3.6): pyspark.sql.window (4s) Finished test(python3.6): pyspark.sql.readwriter (35s) Tests passed in 433 seconds ``` Author: hyukjinkwon <[email protected]> Closes apache#16429 from HyukjinKwon/SPARK-19019.

…e` and port cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? This PR proposes to backports #16429 to branch-1.6 so that Python 3.6.0 works with Spark 1.6.x. ## How was this patch tested? Manually, via ``` ./run-tests --python-executables=python3.6 ``` ``` Finished test(python3.6): pyspark.conf (5s) Finished test(python3.6): pyspark.broadcast (7s) Finished test(python3.6): pyspark.accumulators (9s) Finished test(python3.6): pyspark.rdd (16s) Finished test(python3.6): pyspark.shuffle (0s) Finished test(python3.6): pyspark.serializers (11s) Finished test(python3.6): pyspark.profiler (5s) Finished test(python3.6): pyspark.context (21s) Finished test(python3.6): pyspark.ml.clustering (12s) Finished test(python3.6): pyspark.ml.feature (16s) Finished test(python3.6): pyspark.ml.classification (16s) Finished test(python3.6): pyspark.ml.recommendation (16s) Finished test(python3.6): pyspark.ml.tuning (14s) Finished test(python3.6): pyspark.ml.regression (16s) Finished test(python3.6): pyspark.ml.evaluation (12s) Finished test(python3.6): pyspark.ml.tests (17s) Finished test(python3.6): pyspark.mllib.classification (18s) Finished test(python3.6): pyspark.mllib.evaluation (12s) Finished test(python3.6): pyspark.mllib.feature (19s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.fpm (12s) Finished test(python3.6): pyspark.mllib.clustering (31s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.mllib.linalg.distributed (17s) Finished test(python3.6): pyspark.mllib.recommendation (23s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.stat._statistics (13s) Finished test(python3.6): pyspark.mllib.regression (22s) Finished test(python3.6): pyspark.mllib.util (9s) Finished test(python3.6): pyspark.mllib.tree (14s) Finished test(python3.6): pyspark.sql.types (9s) Finished test(python3.6): pyspark.sql.context (16s) Finished test(python3.6): pyspark.sql.column (14s) Finished test(python3.6): pyspark.sql.group (16s) Finished test(python3.6): pyspark.sql.dataframe (25s) Finished test(python3.6): pyspark.tests (164s) Finished test(python3.6): pyspark.sql.window (6s) Finished test(python3.6): pyspark.sql.functions (19s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.readwriter (24s) Finished test(python3.6): pyspark.sql.tests (38s) Finished test(python3.6): pyspark.mllib.tests (133s) Finished test(python3.6): pyspark.streaming.tests (189s) Tests passed in 380 seconds ``` Author: hyukjinkwon <[email protected]> Closes #17375 from HyukjinKwon/SPARK-19019-backport-1.6.

…e` and port cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? This PR proposes to backports #16429 to branch-2.0 so that Python 3.6.0 works with Spark 2.0.x. ## How was this patch tested? Manually, via ``` ./run-tests --python-executables=python3.6 ``` ``` Finished test(python3.6): pyspark.tests (124s) Finished test(python3.6): pyspark.accumulators (4s) Finished test(python3.6): pyspark.broadcast (4s) Finished test(python3.6): pyspark.conf (3s) Finished test(python3.6): pyspark.context (15s) Finished test(python3.6): pyspark.ml.classification (24s) Finished test(python3.6): pyspark.sql.tests (190s) Finished test(python3.6): pyspark.mllib.tests (190s) Finished test(python3.6): pyspark.ml.clustering (14s) Finished test(python3.6): pyspark.ml.linalg.__init__ (0s) Finished test(python3.6): pyspark.ml.recommendation (18s) Finished test(python3.6): pyspark.ml.feature (28s) Finished test(python3.6): pyspark.ml.evaluation (28s) Finished test(python3.6): pyspark.ml.regression (21s) Finished test(python3.6): pyspark.ml.tuning (17s) Finished test(python3.6): pyspark.streaming.tests (239s) Finished test(python3.6): pyspark.mllib.evaluation (15s) Finished test(python3.6): pyspark.mllib.classification (24s) Finished test(python3.6): pyspark.mllib.clustering (37s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.fpm (19s) Finished test(python3.6): pyspark.mllib.feature (19s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.ml.tests (76s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.recommendation (21s) Finished test(python3.6): pyspark.mllib.linalg.distributed (27s) Finished test(python3.6): pyspark.mllib.regression (22s) Finished test(python3.6): pyspark.mllib.stat._statistics (11s) Finished test(python3.6): pyspark.mllib.tree (16s) Finished test(python3.6): pyspark.profiler (8s) Finished test(python3.6): pyspark.shuffle (1s) Finished test(python3.6): pyspark.mllib.util (17s) Finished test(python3.6): pyspark.serializers (12s) Finished test(python3.6): pyspark.rdd (18s) Finished test(python3.6): pyspark.sql.conf (4s) Finished test(python3.6): pyspark.sql.catalog (14s) Finished test(python3.6): pyspark.sql.column (13s) Finished test(python3.6): pyspark.sql.context (15s) Finished test(python3.6): pyspark.sql.group (26s) Finished test(python3.6): pyspark.sql.dataframe (31s) Finished test(python3.6): pyspark.sql.functions (32s) Finished test(python3.6): pyspark.sql.types (5s) Finished test(python3.6): pyspark.sql.streaming (11s) Finished test(python3.6): pyspark.sql.window (5s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.session (15s) Finished test(python3.6): pyspark.sql.readwriter (34s) Tests passed in 376 seconds ``` Author: hyukjinkwon <[email protected]> Closes #17374 from HyukjinKwon/SPARK-19019-backport.

…e` and port cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? This PR proposes to backports apache#16429 to branch-1.6 so that Python 3.6.0 works with Spark 1.6.x. ## How was this patch tested? Manually, via ``` ./run-tests --python-executables=python3.6 ``` ``` Finished test(python3.6): pyspark.conf (5s) Finished test(python3.6): pyspark.broadcast (7s) Finished test(python3.6): pyspark.accumulators (9s) Finished test(python3.6): pyspark.rdd (16s) Finished test(python3.6): pyspark.shuffle (0s) Finished test(python3.6): pyspark.serializers (11s) Finished test(python3.6): pyspark.profiler (5s) Finished test(python3.6): pyspark.context (21s) Finished test(python3.6): pyspark.ml.clustering (12s) Finished test(python3.6): pyspark.ml.feature (16s) Finished test(python3.6): pyspark.ml.classification (16s) Finished test(python3.6): pyspark.ml.recommendation (16s) Finished test(python3.6): pyspark.ml.tuning (14s) Finished test(python3.6): pyspark.ml.regression (16s) Finished test(python3.6): pyspark.ml.evaluation (12s) Finished test(python3.6): pyspark.ml.tests (17s) Finished test(python3.6): pyspark.mllib.classification (18s) Finished test(python3.6): pyspark.mllib.evaluation (12s) Finished test(python3.6): pyspark.mllib.feature (19s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.fpm (12s) Finished test(python3.6): pyspark.mllib.clustering (31s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.mllib.linalg.distributed (17s) Finished test(python3.6): pyspark.mllib.recommendation (23s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.stat._statistics (13s) Finished test(python3.6): pyspark.mllib.regression (22s) Finished test(python3.6): pyspark.mllib.util (9s) Finished test(python3.6): pyspark.mllib.tree (14s) Finished test(python3.6): pyspark.sql.types (9s) Finished test(python3.6): pyspark.sql.context (16s) Finished test(python3.6): pyspark.sql.column (14s) Finished test(python3.6): pyspark.sql.group (16s) Finished test(python3.6): pyspark.sql.dataframe (25s) Finished test(python3.6): pyspark.tests (164s) Finished test(python3.6): pyspark.sql.window (6s) Finished test(python3.6): pyspark.sql.functions (19s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.readwriter (24s) Finished test(python3.6): pyspark.sql.tests (38s) Finished test(python3.6): pyspark.mllib.tests (133s) Finished test(python3.6): pyspark.streaming.tests (189s) Tests passed in 380 seconds ``` Author: hyukjinkwon <[email protected]> Closes apache#17375 from HyukjinKwon/SPARK-19019-backport-1.6. (cherry picked from commit 6b315f3)

See apache/spark#16429

jamal119 · 2018-10-30T03:09:17Z

hi,HyukjinKwon,have you solved your problem，my error is exactly the same as yours，looking forward to your reply.

HyukjinKwon · 2018-10-30T03:14:42Z

This is fixed from Spark 1.6.4, 2.0.3, 2.1.1 and 2.2.0.

jamal119 · 2018-10-30T11:31:20Z

How did you end up solving that problem，reinstall the spark ？ Which version did you install，Can you offer a final solution，thanks

HyukjinKwon · 2018-10-30T13:23:56Z

Install Spark I mentioned above.

HyukjinKwon changed the title ~~[WIP][SPARK-19019][PYTHON] Fix hijected collections.namedtuple to be serialized with keyword-only arguments~~ [WIP][SPARK-19019][PYTHON] Fix hijected collections.namedtuple to be serialized with keyword-only arguments in Python 3.6.0 Dec 29, 2016

HyukjinKwon changed the title ~~[WIP][SPARK-19019][PYTHON] Fix hijected collections.namedtuple to be serialized with keyword-only arguments in Python 3.6.0~~ [WIP][SPARK-19019][PYTHON] Fix hijacked collections.namedtuple to be serialized with keyword-only arguments in Python 3.6.0 Dec 29, 2016

HyukjinKwon force-pushed the SPARK-19019 branch from f4c56c8 to b688e89 Compare December 29, 2016 03:43

HyukjinKwon changed the title ~~[WIP][SPARK-19019][PYTHON] Fix hijacked collections.namedtuple to be serialized with keyword-only arguments in Python 3.6.0~~ [SPARK-19019][PYTHON] Fix hijacked collections.namedtuple to be serialized with keyword-only arguments in Python 3.6.0 Dec 29, 2016

HyukjinKwon commented Dec 30, 2016

View reviewed changes

HyukjinKwon force-pushed the SPARK-19019 branch from b688e89 to 7b96546 Compare January 3, 2017 11:28

HyukjinKwon changed the title ~~[SPARK-19019][PYTHON] Fix hijacked collections.namedtuple and port cloudpickle for PySpark to work with Python 3.6.0~~ [SPARK-19019][PYTHON] Fix hijacked collections.namedtuple and port cloudpickle changes for PySpark to work with Python 3.6.0 Jan 3, 2017

Add comments for __kwdeaults__

6458d41

HyukjinKwon force-pushed the SPARK-19019 branch from 7d98e35 to 6458d41 Compare January 15, 2017 12:31

asfgit closed this in 20e6280 Jan 17, 2017

HyukjinKwon mentioned this pull request Mar 21, 2017

[SPARK-19019][PYTHON][BRANCH-2.0] Fix hijacked collections.namedtuple and port cloudpickle changes for PySpark to work with Python 3.6.0 #17374

Closed

HyukjinKwon mentioned this pull request Mar 21, 2017

[SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked collections.namedtuple and port cloudpickle changes for PySpark to work with Python 3.6.0 #17375

Closed

liudangyi mentioned this pull request Apr 4, 2017

apache-spark 2.1.0: patch to work with python 3.6 Homebrew/homebrew-core#12059

Closed

4 tasks

ilovezfs pushed a commit to liudangyi/homebrew-core that referenced this pull request Apr 18, 2017

apache-spark: patch to work with python 3.6

9f46eef

See apache/spark#16429

HyukjinKwon deleted the SPARK-19019 branch January 2, 2018 03:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-19019][PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #16429

[SPARK-19019][PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #16429

HyukjinKwon commented Dec 29, 2016 •

edited

Loading

HyukjinKwon commented Dec 29, 2016 •

edited

Loading

SparkQA commented Dec 29, 2016

SparkQA commented Dec 29, 2016

SparkQA commented Dec 29, 2016

SparkQA commented Dec 29, 2016

HyukjinKwon Dec 30, 2016

davies Jan 15, 2017

HyukjinKwon commented Jan 3, 2017

azmras commented Jan 3, 2017

HyukjinKwon commented Jan 3, 2017 •

edited

Loading

HyukjinKwon commented Jan 3, 2017 •

edited

Loading

HyukjinKwon commented Jan 3, 2017

SparkQA commented Jan 3, 2017

azmras commented Jan 3, 2017 •

edited

Loading

HyukjinKwon commented Jan 3, 2017

azmras commented Jan 4, 2017 •

edited

Loading

HyukjinKwon commented Jan 4, 2017

azmras commented Jan 4, 2017

nbys commented Jan 4, 2017

HyukjinKwon commented Jan 4, 2017

azmras commented Jan 5, 2017

HyukjinKwon commented Jan 6, 2017

HyukjinKwon commented Jan 15, 2017 •

edited

Loading

HyukjinKwon commented Jan 15, 2017

SparkQA commented Jan 15, 2017

HyukjinKwon commented Jan 15, 2017

SparkQA commented Jan 15, 2017

HyukjinKwon commented Jan 17, 2017

davies commented Jan 17, 2017

jamal119 commented Oct 30, 2018

HyukjinKwon commented Oct 30, 2018

jamal119 commented Oct 30, 2018

HyukjinKwon commented Oct 30, 2018

[SPARK-19019][PYTHON] Fix hijacked collections.namedtuple and port cloudpickle changes for PySpark to work with Python 3.6.0 #16429

[SPARK-19019][PYTHON] Fix hijacked collections.namedtuple and port cloudpickle changes for PySpark to work with Python 3.6.0 #16429

Conversation

HyukjinKwon commented Dec 29, 2016 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

HyukjinKwon commented Dec 29, 2016 • edited Loading

SparkQA commented Dec 29, 2016

SparkQA commented Dec 29, 2016

SparkQA commented Dec 29, 2016

SparkQA commented Dec 29, 2016

HyukjinKwon Dec 30, 2016

Choose a reason for hiding this comment

davies Jan 15, 2017

Choose a reason for hiding this comment

HyukjinKwon commented Jan 3, 2017

azmras commented Jan 3, 2017

HyukjinKwon commented Jan 3, 2017 • edited Loading

HyukjinKwon commented Jan 3, 2017 • edited Loading

HyukjinKwon commented Jan 3, 2017

SparkQA commented Jan 3, 2017

azmras commented Jan 3, 2017 • edited Loading

HyukjinKwon commented Jan 3, 2017

azmras commented Jan 4, 2017 • edited Loading

HyukjinKwon commented Jan 4, 2017

azmras commented Jan 4, 2017

nbys commented Jan 4, 2017

HyukjinKwon commented Jan 4, 2017

azmras commented Jan 5, 2017

HyukjinKwon commented Jan 6, 2017

HyukjinKwon commented Jan 15, 2017 • edited Loading

HyukjinKwon commented Jan 15, 2017

SparkQA commented Jan 15, 2017

HyukjinKwon commented Jan 15, 2017

SparkQA commented Jan 15, 2017

HyukjinKwon commented Jan 17, 2017

davies commented Jan 17, 2017

jamal119 commented Oct 30, 2018

HyukjinKwon commented Oct 30, 2018

jamal119 commented Oct 30, 2018

HyukjinKwon commented Oct 30, 2018

[SPARK-19019][PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #16429

[SPARK-19019][PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #16429

HyukjinKwon commented Dec 29, 2016 •

edited

Loading

HyukjinKwon commented Dec 29, 2016 •

edited

Loading

HyukjinKwon commented Jan 3, 2017 •

edited

Loading

HyukjinKwon commented Jan 3, 2017 •

edited

Loading

azmras commented Jan 3, 2017 •

edited

Loading

azmras commented Jan 4, 2017 •

edited

Loading

HyukjinKwon commented Jan 15, 2017 •

edited

Loading