
[SPARK-19926][PYSPARK] Make pyspark exception more user-friendly #17267

Closed
wants to merge 3 commits

Conversation

uncleGen (Contributor) commented Mar 12, 2017

What changes were proposed in this pull request?

The exception output in PySpark is a little difficult to read.

Before this PR, it looked like this:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in start
    return self._sq(self._jwrite.start())
  File "/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;\nAggregate [window#17, word#5], [window#17 AS window#11, word#5, count(1) AS count#16L]\n+- Filter ((t#6 >= window#17.start) && (t#6 < window#17.end))\n   +- Expand [ArrayBuffer(named_struct(start, ((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 30000000) + 0), end, (((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 30000000) + 0) + 30000000)), word#5, t#6-T30000ms), ArrayBuffer(named_struct(start, ((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 30000000) + 0), end, (((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 30000000) + 0) + 30000000)), word#5, t#6-T30000ms)], [window#17, word#5, t#6-T30000ms]\n      +- EventTimeWatermark t#6: timestamp, interval 30 seconds\n         +- Project [cast(word#0 as string) AS word#5, cast(t#1 as timestamp) AS t#6]\n            +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@c4079ca,csv,List(),Some(StructType(StructField(word,StringType,true), StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> /tmp/data),None), FileSource[/tmp/data], [word#0, t#1]\n'

After this PR:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in start
    return self._sq(self._jwrite.start())
  File "/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Aggregate [window#17, word#5], [window#17 AS window#11, word#5, count(1) AS count#16L]
+- Filter ((t#6 >= window#17.start) && (t#6 < window#17.end))
   +- Expand [ArrayBuffer(named_struct(start, ((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 30000000) + 0), end, (((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(0 as bigint)) - cast(1 as bigint)) * 30000000) + 0) + 30000000)), word#5, t#6-T30000ms), ArrayBuffer(named_struct(start, ((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 30000000) + 0), end, (((((CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(30000000 as double))) + cast(1 as bigint)) - cast(1 as bigint)) * 30000000) + 0) + 30000000)), word#5, t#6-T30000ms)], [window#17, word#5, t#6-T30000ms]
      +- EventTimeWatermark t#6: timestamp, interval 30 seconds
         +- Project [cast(word#0 as string) AS word#5, cast(t#1 as timestamp) AS t#6]
            +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@5265083b,csv,List(),Some(StructType(StructField(word,StringType,true), StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> /tmp/data),None), FileSource[/tmp/data], [word#0, t#1]

IMHO, the root cause is that the repr is not user-friendly:

class CapturedException(Exception):
    def __init__(self, desc, stackTrace):
        self.desc = desc
        self.stackTrace = stackTrace

    def __str__(self):
        return repr(self.desc)  # <-- the culprit

This PR changes the repr to a str:

str():  makes an object readable; generates output for the end user
repr(): needs code that reproduces the object; generates output for the developer
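To make the table concrete, here is a minimal editorial sketch (not from the PR) of the two functions applied to a message with an embedded newline:

>>> desc = 'Append output mode not supported;;\nAggregate [window#17, word#5]'
>>> print(repr(desc))   # developer-oriented: quoted, newline escaped
'Append output mode not supported;;\nAggregate [window#17, word#5]'
>>> print(str(desc))    # user-oriented: newline rendered literally
Append output mode not supported;;
Aggregate [window#17, word#5]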

How was this patch tested?

Jenkins

uncleGen (Contributor, Author) commented Mar 12, 2017

Maybe @viirya and @davies can give some suggestions.

SparkQA commented Mar 12, 2017

Test build #74403 has finished for PR 17267 at commit 273c1bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya (Member) commented Mar 13, 2017

Thanks for working on this. LGTM.

uncleGen (Contributor, Author) commented Mar 13, 2017

@viirya Thanks for your review.
cc @srowen. IIUC, the other uses of repr are fine as they are.

srowen (Member) commented Mar 13, 2017

What's the difference between the two, briefly? I don't know enough to evaluate it, though the effect looks positive. Is this the only place this should change?

uncleGen (Contributor, Author) commented Mar 13, 2017

IMHO, yes. And @viirya is the original author.

str():  makes an object readable; generates output for the end user
repr(): needs code that reproduces the object; generates output for the developer

@@ -24,7 +24,7 @@ def __init__(self, desc, stackTrace):
         self.stackTrace = stackTrace
 
     def __str__(self):
-        return repr(self.desc)
+        return str(self.desc)
HyukjinKwon (Member) commented Mar 13, 2017

Hm... does this work for unicode in Python 2, for example, spark.range(1).select("아")? To my knowledge, converting it to ASCII directly throws an exception:

>>> str(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)
>>> repr(u"아")
"u'\\uc544'"

Maybe we should check if this is unicode and do .encode.

HyukjinKwon (Member) commented Mar 13, 2017

To help, I just tested with this change as below:

  • before
>>> try:
...     spark.range(1).select(u"아")
... except Exception as e:
...     print e
u"cannot resolve '`\uc544`' given input columns: [id];;\n'Project ['\uc544]\n+- Range (0, 1, step=1, splits=Some(8))\n"
>>> spark.range(1).select(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`\uc544`' given input columns: [id];;\n'Project ['\uc544]\n+- Range (0, 1, step=1, splits=Some(8))\n"
  • after
>>> try:
...     spark.range(1).select(u"아")
... except Exception as e:
...     print e
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File ".../spark/python/pyspark/sql/utils.py", line 27, in __str__
    return str(self.desc)
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 17: ordinal not in range(128)
>>> spark.range(1).select(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException

A Member commented:

@uncleGen, could you double-check in case I did something wrong?

A Member commented:

We can add a check under Python 2: if it is unicode, just encode it with UTF-8.
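For reference, here is what that suggestion looks like in a Python 2 REPL (editorial sketch; a UTF-8 terminal is assumed):

>>> u"아".encode('utf-8')          # encoding yields a plain byte string (str in Python 2)
'\xec\x95\x84'
>>> print(u"아".encode('utf-8'))   # a UTF-8 terminal renders the bytes correctly
아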

A Member commented:

@HyukjinKwon Good catch!

A Member commented:

Ah, thank you for the confirmation. I thought I was mistaken :).

A Member commented:

Maybe another benefit of this change: before it, the error log in your example looked like this:

u"cannot resolve '\uc544' given input columns: [id];;\n'Project ['\uc544]

repr will show unicode escape characters like \uc544. Even if you encode it, you will see its binary representation. str can show the correct "아" if it is encoded with UTF-8.

If I tested it correctly.

A Member commented:

Yea, I support this change and tested some more cases with that encode.

uncleGen (Contributor, Author) commented:

Based on the latest commit:

>>> df.select("아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/pyspark/sql/utils.py", line 75, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: cannot resolve '`아`' given input columns: [age, name];;
'Project ['아]
+- Relation[age#0L,name#1] json

uncleGen (Contributor, Author) commented:

Thanks @HyukjinKwon, good catch! I missed that case. Thanks @viirya for your suggestion.

SparkQA commented Mar 14, 2017

Test build #74487 has finished for PR 17267 at commit 5bc1d8e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 14, 2017

Test build #74490 has finished for PR 17267 at commit 6c55e02.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 14, 2017

Test build #74491 has finished for PR 17267 at commit 7b96e97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -16,6 +16,10 @@
 #
 
 import py4j
+import sys
+
+if sys.version > '3':
A Member commented:

I think it should be >=.
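As an aside, a sturdier guard would compare numbers rather than strings (editorial sketch; the PY3 helper name is hypothetical, and the truncated hunk above is left as-is):

import sys

# Comparing the version string lexicographically is fragile; comparing the
# major version number in sys.version_info is not.
PY3 = sys.version_info[0] >= 3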

SparkQA commented Mar 14, 2017

Test build #74503 has finished for PR 17267 at commit edf9b12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

uncleGen (Contributor, Author) commented Mar 15, 2017

ping @viirya and @HyukjinKwon

viirya (Member) commented Mar 15, 2017

LGTM. cc @davies @holdenk

uncleGen (Contributor, Author) commented:

@srowen Could you please take a look and help merge?

srowen (Member) commented Mar 18, 2017

I'm not reviewing this patch; people who know it better should merge it.

holdenk (Contributor) commented Mar 21, 2017

I'll take a look at reviewing this later this week, @uncleGen. Two minor things we can do in the meantime: one is to make the JIRA description a bit clearer as to what the proposed change is; the other is that this change isn't really tested by Jenkins (there are no tests that look at the formatting of the error strings), so maybe consider adding a test or updating the description on the PR.

uncleGen changed the title from "[SPARK-19926][PYSPARK] Make pyspark exception more readable" to "[SPARK-19926][PYSPARK] Make pyspark exception more user-friendly" on Mar 21, 2017
gatorsmile (Member) commented:

cc @ueshin

HyukjinKwon (Member) commented:

LGTM too, but I hope there will be a test if possible.

ueshin (Member) commented Jun 13, 2017

Correct me if I'm wrong, but I got the following message after this patch in Python 3.6:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/pyspark/spark/python/pyspark/sql/dataframe.py", line 1049, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/ueshin/workspace/pyspark/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/Users/ueshin/workspace/pyspark/spark/python/pyspark/sql/utils.py", line 77, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: b"cannot resolve '`\xec\x95\x84`' given input columns: [id];;\n'Project ['\xec\x95\x84]\n+- Range (0, 1, step=1, splits=Some(8))\n"

I guess this message is not desirable?

ueshin (Member) commented Jun 13, 2017

+1 for adding a test.

-        return repr(self.desc)
+        desc = self.desc
+        if isinstance(desc, unicode):
+            return str(desc.encode('utf-8'))
A Member commented:

@ueshin, you are right, and I misread the code. We need to:

  • unicode in Python 2 => u.encode("utf-8").
  • others in Python 2 => return str(s).
  • others in Python 3 => return str(s).

The root cause for #17267 (comment) looks to be that, in Python 3, encode on a string (the equivalent of unicode in Python 2) produces 8-bit bytes, b"..." (which in Python 2 is the same as a normal string: "..." and b"..." are the same type there, with the b ignored). And the str function works differently, as below:

Python 2

>>> str(b"aa")
'aa'
>>> b"aa"
'aa'

Python 3

>>> str(b"aa")
"b'aa'"
>>> "aa"
'aa'
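Putting those three cases together, a minimal editorial sketch of the resulting __str__ (an assumed shape, not the exact merged code):

import sys

class CapturedException(Exception):
    def __init__(self, desc, stackTrace):
        self.desc = desc
        self.stackTrace = stackTrace

    def __str__(self):
        desc = self.desc
        # unicode in Python 2: encode to UTF-8 bytes, which print readably;
        # returning it unencoded would trip the default ascii codec.
        if sys.version_info[0] < 3 and isinstance(desc, unicode):
            return desc.encode('utf-8')
        # Everything else (Python 3, or plain str in Python 2): str() is
        # already readable; .encode() here would print b'...' on Python 3.
        return str(desc)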

A Member commented:

Good catch! I previously thought str in Python 3 works like it does in Python 2.

A Member commented:

cc @zero323 and @davies too. Would you have some time to take a look at this one? This is a typical annoying problem between unicode and byte strings. There are many similar PRs (at least, I can identify a few PRs trying to handle this problem). One good example might help resolve the other PRs too.

viirya (Member) commented Jun 14, 2017

+1 We should add a test for this.
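Such a test might look roughly like this (editorial sketch; the function name and the spark session argument are assumptions, and the eventually merged test may differ):

# -*- coding: utf-8 -*-
from pyspark.sql.utils import AnalysisException

def check_exception_is_readable(spark):
    # Selecting a non-existent unicode column should raise, and str() of the
    # exception should contain the column name itself, not \u or b'...' escapes.
    try:
        spark.range(1).select(u"아")
        raise RuntimeError("expected AnalysisException")
    except AnalysisException as e:
        msg = str(e)
        if isinstance(msg, bytes):  # Python 2: __str__ returns UTF-8 bytes
            msg = msg.decode('utf-8')
        assert u"아" in msg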

holdenk (Contributor) commented Jul 2, 2017

Hey @uncleGen, any time to add a test for this?

jiangxb1987 (Contributor) commented:

@dataknocker, do you want to take over this one? Then we can continue with #18324.

HyukjinKwon pushed a commit that referenced this pull request Sep 18, 2019
…endly

### What changes were proposed in this pull request?
The str of `CapturedException` is now produced by str(self.desc) rather than repr(self.desc), which is more user-friendly. It also specially handles unicode under Python 2.

### Why are the changes needed?
This is an improvement that makes exceptions more human-readable on the Python side.

### Does this PR introduce any user-facing change?
Before this PR, selecting `中文字段` threw an exception like the one below:

```
Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: u"cannot resolve '`\u4e2d\u6587\u5b57\u6bb5`' given input columns: []; line 1 pos 7;\n'Project ['\u4e2d\u6587\u5b57\u6bb5]\n+- OneRowRelation\n"
```

After this PR:
```
Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: cannot resolve '`中文字段`' given input columns: []; line 1 pos 7;
'Project ['中文字段]
+- OneRowRelation

```
### How was this patch tested?
Added a new test to verify that unicode is correctly converted, plus manual checks of thrown exceptions.

Credit for this PR should go to uncleGen; it is based on #17267.

Closes #25814 from advancedxy/python_exception_19926_and_21045.

Authored-by: Xianjin YE <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>