
[SPARK-19926][PYSPARK] Make pyspark exception more user-friendly #17267

Closed
wants to merge 3 commits

Conversation

uncleGen (Contributor) commented Mar 12, 2017

What changes were proposed in this pull request?

The exception output in PySpark is a little difficult to read.

Before this PR, it looked like this:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in start
    return self._sq(self._jwrite.start())
  File "/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;\nAggregate [window#17, word#5], [window#17 AS window#11, word#5, count(1) AS count#16L]\n+- Filter ((t#6 >= window#17.start) && (t#6 < window#17.end))\n   +- Expand [ArrayBuffer(named_struct(start, ((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 30000000) + 0), end, (((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 30000000) + 0) + 30000000)), word#5, t#6-T30000ms), ArrayBuffer(named_struct(start, ((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 30000000) + 0), end, (((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 30000000) + 0) + 30000000)), word#5, t#6-T30000ms)], [window#17, word#5, t#6-T30000ms]\n      +- EventTimeWatermark t#6: timestamp, interval 30 seconds\n         +- Project [cast(word#0 as string) AS word#5, cast(t#1 as timestamp) AS t#6]\n            +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@c4079ca,csv,List(),Some(StructType(StructField(word,StringType,true), StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> /tmp/data),None), FileSource[/tmp/data], [word#0, t#1]\n'

After this PR:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in start
    return self._sq(self._jwrite.start())
  File "/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Aggregate [window#17, word#5], [window#17 AS window#11, word#5, count(1) AS count#16L]
+- Filter ((t#6 >= window#17.start) && (t#6 < window#17.end))
   +- Expand [ArrayBuffer(named_struct(start, ((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 30000000) + 0), end, (((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 30000000) + 0) + 30000000)), word#5, t#6-T30000ms), ArrayBuffer(named_struct(start, ((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 30000000) + 0), end, (((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 30000000) + 0) + 30000000)), word#5, t#6-T30000ms)], [window#17, word#5, t#6-T30000ms]
      +- EventTimeWatermark t#6: timestamp, interval 30 seconds
         +- Project [cast(word#0 as string) AS word#5, cast(t#1 as timestamp) AS t#6]
            +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@5265083b,csv,List(),Some(StructType(StructField(word,StringType,true), StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> /tmp/data),None), FileSource[/tmp/data], [word#0, t#1]

IMHO, the root cause is that the repr is not user-friendly:

class CapturedException(Exception):
    def __init__(self, desc, stackTrace):
        self.desc = desc
        self.stackTrace = stackTrace

    def __str__(self):
        return repr(self.desc)  # <-- the culprit

This PR changes the repr to a str:

str():  makes an object readable; generates output for the end user
repr(): needs code that reproduces the object; generates output for the developer
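To make the table concrete, here is a minimal editorial sketch (not from the PR) of the two functions applied to a message with an embedded newline:

>>> desc = 'Append output mode not supported;;\nAggregate [window#17, word#5]'
>>> print(repr(desc))   # developer-oriented: quoted, newline escaped
'Append output mode not supported;;\nAggregate [window#17, word#5]'
>>> print(str(desc))    # user-oriented: newline rendered literally
Append output mode not supported;;
Aggregate [window#17, word#5]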

How was this patch tested?

Jenkins

uncleGen (Contributor, Author) commented Mar 12, 2017

Maybe @viirya and @davies can give some suggestions.

SparkQA commented Mar 12, 2017

Test build #74403 has finished for PR 17267 at commit 273c1bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya (Member) commented Mar 13, 2017

Thanks for working on this. LGTM.

uncleGen (Contributor, Author) commented Mar 13, 2017

@viirya Thanks for your review.
cc @srowen. IIUC, the other uses of repr are fine as they are.

srowen (Member) commented Mar 13, 2017

What's the difference between the two, briefly? I don't know enough to evaluate it, though the effect looks positive. Is this the only place this should change?

uncleGen (Contributor, Author) commented Mar 13, 2017

IMHO, yes. And @viirya is the original author.

str():  makes an object readable; generates output for the end user
repr(): needs code that reproduces the object; generates output for the developer

@@ -24,7 +24,7 @@ def __init__(self, desc, stackTrace):
         self.stackTrace = stackTrace
 
     def __str__(self):
-        return repr(self.desc)
+        return str(self.desc)
HyukjinKwon (Member) commented Mar 13, 2017

Hm... does this work for unicode in Python 2, for example, spark.range(1).select("아")? To my knowledge, converting it to ASCII directly throws an exception:

>>> str(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)
>>> repr(u"아")
"u'\\uc544'"

Maybe we should check if this is unicode and do .encode.

HyukjinKwon (Member) commented Mar 13, 2017

To help, I just tested with this change as below:

  • before
>>> try:
...     spark.range(1).select(u"아")
... except Exception as e:
...     print e
u"cannot resolve '`\uc544`' given input columns: [id];;\n'Project ['\uc544]\n+- Range (0, 1, step=1, splits=Some(8))\n"
>>> spark.range(1).select(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`\uc544`' given input columns: [id];;\n'Project ['\uc544]\n+- Range (0, 1, step=1, splits=Some(8))\n"
  • after
>>> try:
...     spark.range(1).select(u"아")
... except Exception as e:
...     print e
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File ".../spark/python/pyspark/sql/utils.py", line 27, in __str__
    return str(self.desc)
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 17: ordinal not in range(128)
>>> spark.range(1).select(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException

A Member commented:

@uncleGen, could you double-check in case I did something wrong?

A Member commented:

We can add a check under Python 2: if it is unicode, just encode it with UTF-8.
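For reference, here is what that suggestion looks like in a Python 2 REPL (editorial sketch; a UTF-8 terminal is assumed):

>>> u"아".encode('utf-8')          # encoding yields a plain byte string (str in Python 2)
'\xec\x95\x84'
>>> print(u"아".encode('utf-8'))   # a UTF-8 terminal renders the bytes correctly
아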

A Member commented:

@HyukjinKwon Good catch!

A Member commented:

Ah, thank you for the confirmation. I thought I was mistaken :).

A Member commented:

Maybe another benefit of this change: before it, the error log in your example looked like this:

u"cannot resolve '\uc544' given input columns: [id];;\n'Project ['\uc544]

repr will show unicode escape characters like \uc544. Even if you encode it, you will see its binary representation. str can show the correct "아" if it is encoded with UTF-8.

If I tested it correctly.

A Member commented:

Yea, I support this change and tested some more cases with that encode.

uncleGen (Contributor, Author) commented:

Based on the latest commit:

>>> df.select("아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 75, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: cannot resolve '`아`' given input columns: [age, name];;
'Project ['아]
+- Relation[age#0L,name#1] json

uncleGen (Contributor, Author) commented:

Thanks @HyukjinKwon, good catch! I missed that case. Thanks @viirya for your suggestion.

SparkQA commented Mar 14, 2017

Test build #74487 has finished for PR 17267 at commit 5bc1d8e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 14, 2017

Test build #74490 has finished for PR 17267 at commit 6c55e02.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 14, 2017

Test build #74491 has finished for PR 17267 at commit 7b96e97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -16,6 +16,10 @@
 #
 
 import py4j
+import sys
+
+if sys.version > '3':
A Member commented:

I think it should be >=.
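As an aside, a sturdier guard would compare numbers rather than strings (editorial sketch; the PY3 helper name is hypothetical, and the truncated hunk above is left as-is):

import sys

# Comparing the version string lexicographically is fragile; comparing the
# major version number in sys.version_info is not.
PY3 = sys.version_info[0] >= 3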

SparkQA commented Mar 14, 2017

Test build #74503 has finished for PR 17267 at commit edf9b12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

uncleGen (Contributor, Author) commented Mar 15, 2017

ping @viirya and @HyukjinKwon

viirya (Member) commented Mar 15, 2017

LGTM. cc @davies @holdenk

uncleGen (Contributor, Author) commented:

@srowen Could you please take a look and help merge?

srowen (Member) commented Mar 18, 2017

I'm not reviewing this patch; people who know it better should merge it.

holdenk (Contributor) commented Mar 21, 2017

I'll take a look at reviewing this later this week, @uncleGen. Two minor things we can do in the meantime: one is to make the JIRA description a bit clearer as to what the proposed change is; the other is that this change isn't really tested by Jenkins (there are no tests that look at the formatting of the error strings), so maybe consider adding a test or updating the description on the PR.

uncleGen changed the title from "[SPARK-19926][PYSPARK] Make pyspark exception more readable" to "[SPARK-19926][PYSPARK] Make pyspark exception more user-friendly" on Mar 21, 2017
gatorsmile (Member) commented:

cc @ueshin

HyukjinKwon (Member) commented:

LGTM too, but I hope there will be a test if possible.

ueshin (Member) commented Jun 13, 2017

Correct me if I'm wrong, but I got the following message after this patch in Python 3.6:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/pyspark/spark/python/pyspark/sql/dataframe.py", line 1049, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/ueshin/workspace/pyspark/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/Users/ueshin/workspace/pyspark/spark/python/pyspark/sql/utils.py", line 77, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: b"cannot resolve '`\xec\x95\x84`' given input columns: [id];;\n'Project ['\xec\x95\x84]\n+- Range (0, 1, step=1, splits=Some(8))\n"

I guess this message is not desirable?

ueshin (Member) commented Jun 13, 2017

+1 for adding a test.

-        return repr(self.desc)
+        desc = self.desc
+        if isinstance(desc, unicode):
+            return str(desc.encode('utf-8'))
A Member commented:

@ueshin, you are right, and I misread the code. We need to:

  • unicode in Python 2 => u.encode("utf-8").
  • others in Python 2 => return str(s).
  • others in Python 3 => return str(s).

The root cause for #17267 (comment) looks to be that, in Python 3, encode on a string (the equivalent of unicode in Python 2) produces 8-bit bytes, b"..." (which in Python 2 is the same as a normal string: "..." and b"..." are the same type there, with the b ignored). And the str function works differently, as below:

Python 2

>>> str(b"aa")
'aa'
>>> b"aa"
'aa'

Python 3

>>> str(b"aa")
"b'aa'"
>>> "aa"
'aa'
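Putting those three cases together, a minimal editorial sketch of the resulting __str__ (an assumed shape, not the exact merged code):

import sys

class CapturedException(Exception):
    def __init__(self, desc, stackTrace):
        self.desc = desc
        self.stackTrace = stackTrace

    def __str__(self):
        desc = self.desc
        # unicode in Python 2: encode to UTF-8 bytes, which print readably;
        # returning it unencoded would trip the default ascii codec.
        if sys.version_info[0] < 3 and isinstance(desc, unicode):
            return desc.encode('utf-8')
        # Everything else (Python 3, or plain str in Python 2): str() is
        # already readable; .encode() here would print b'...' on Python 3.
        return str(desc)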

A Member commented:

Good catch! I previously thought str in Python 3 works like it does in Python 2.

A Member commented:

cc @zero323 and @davies too. Would you have some time to take a look at this one? This is a typical annoying problem between unicode and byte strings. There are many similar PRs (at least, I can identify a few PRs trying to handle this problem). One good example might help resolve the other PRs too.

viirya (Member) commented Jun 14, 2017

+1 We should add a test for this.
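Such a test might look roughly like this (editorial sketch; the function name and the spark session argument are assumptions, and the eventually merged test may differ):

# -*- coding: utf-8 -*-
from pyspark.sql.utils import AnalysisException

def check_exception_is_readable(spark):
    # Selecting a non-existent unicode column should raise, and str() of the
    # exception should contain the column name itself, not \u or b'...' escapes.
    try:
        spark.range(1).select(u"아")
        raise RuntimeError("expected AnalysisException")
    except AnalysisException as e:
        msg = str(e)
        if isinstance(msg, bytes):  # Python 2: __str__ returns UTF-8 bytes
            msg = msg.decode('utf-8')
        assert u"아" in msg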

holdenk (Contributor) commented Jul 2, 2017

Hey @uncleGen, any time to add a test for this?

jiangxb1987 (Contributor) commented:

@dataknocker, do you want to take over this one? Then we can continue with #18324.

HyukjinKwon pushed a commit that referenced this pull request Sep 18, 2019
…endly

### What changes were proposed in this pull request?
The str of `CapturedException` is now produced by str(self.desc) rather than repr(self.desc), which is more user-friendly. It also specially handles unicode under Python 2.

### Why are the changes needed?
This is an improvement that makes exceptions more human-readable on the Python side.

### Does this PR introduce any user-facing change?
Before this PR, selecting `中文字段` threw an exception like the one below:

```
Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: u"cannot resolve '`\u4e2d\u6587\u5b57\u6bb5`' given input columns: []; line 1 pos 7;\n'Project ['\u4e2d\u6587\u5b57\u6bb5]\n+- OneRowRelation\n"
```

After this PR:
```
Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: cannot resolve '`中文字段`' given input columns: []; line 1 pos 7;
'Project ['中文字段]
+- OneRowRelation

```
### How was this patch tested?
Added a new test to verify that unicode is correctly converted, plus manual checks of thrown exceptions.

Credit for this PR should go to uncleGen; it is based on #17267.

Closes #25814 from advancedxy/python_exception_19926_and_21045.

Authored-by: Xianjin YE <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>