[SPARK-19926][PYSPARK] Make pyspark exception more user-friendly #17267
Conversation
Test build #74403 has finished for PR 17267 at commit
Thanks for working on this. LGTM.
What's the difference between the two, briefly? I don't know enough to evaluate it, though the effect looks positive. Is this the only place this should change?
IMHO, yes. And @viirya is the original author.
python/pyspark/sql/utils.py (Outdated)

@@ -24,7 +24,7 @@ def __init__(self, desc, stackTrace):
         self.stackTrace = stackTrace

     def __str__(self):
-        return repr(self.desc)
+        return str(self.desc)
Hm.. does this work for unicode in Python 2? For example, spark.range(1).select("아"). To my knowledge, converting it to ascii directly throws an exception:

>>> str(u"아")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 0: ordinal not in range(128)
>>> repr(u"아")
"u'\\uc544'"

Maybe we should check if this is unicode and do .encode.
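A minimal sketch of that suggestion (the helper name `safe_str` is hypothetical, not PySpark code): guard the conversion so non-ASCII unicode does not raise under Python 2.

```python
import sys

def safe_str(desc):
    # Hypothetical helper illustrating the suggestion above.
    # On Python 2, str() on a non-ASCII unicode object raises
    # UnicodeEncodeError, so encode to UTF-8 bytes first.
    if sys.version_info < (3,) and isinstance(desc, unicode):  # noqa: F821
        return desc.encode('utf-8')
    # On Python 3 (or for non-unicode values), str() is safe as-is.
    return str(desc)

print(safe_str(u"cannot resolve '아'"))
```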
I just tested with this change as below to help:
- before
>>> try:
... spark.range(1).select(u"아")
... except Exception as e:
... print e
u"cannot resolve '`\uc544`' given input columns: [id];;\n'Project ['\uc544]\n+- Range (0, 1, step=1, splits=Some(8))\n"
>>> spark.range(1).select(u"아")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
jdf = self._jdf.select(self._jcols(*cols))
File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve '`\uc544`' given input columns: [id];;\n'Project ['\uc544]\n+- Range (0, 1, step=1, splits=Some(8))\n"
- after
>>> try:
... spark.range(1).select(u"아")
... except Exception as e:
... print e
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File ".../spark/python/pyspark/sql/utils.py", line 27, in __str__
return str(self.desc)
UnicodeEncodeError: 'ascii' codec can't encode character u'\uc544' in position 17: ordinal not in range(128)
>>> spark.range(1).select(u"아")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
jdf = self._jdf.select(self._jcols(*cols))
File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File ".../spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException
@uncleGen, could you double check if I did something wrong maybe?
We can add a check under Python 2: if it is unicode, just encode it with UTF-8.
@HyukjinKwon Good catch!
Ah, thank you for confirmation. I thought I was mistaken :).
Maybe another benefit of this change: before it, the error log in your example looks like:

u"cannot resolve '\uc544' given input columns: [id];;\n'Project ['\uc544]

repr will show the unicode escape characters \uc544. Even if you encode it, you will see its binary representation. str can show the correct "아" if it is encoded with utf-8. If I tested it correctly.
Yea, I support this change and tested some more cases with that encode.
Based on the latest commit:
>>> df.select("아")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../spark/python/pyspark/sql/dataframe.py", line 992, in select
jdf = self._jdf.select(self._jcols(*cols))
File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File ".../spark/python/pyspark/sql/utils.py", line 75, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException
: cannot resolve '`아`' given input columns: [age, name];;
'Project ['아]
+- Relation[age#0L,name#1] json
Thanks @HyukjinKwon, good catch! I missed that case. Thanks @viirya for your suggestion.
Test build #74487 has finished for PR 17267 at commit

Test build #74490 has finished for PR 17267 at commit

Test build #74491 has finished for PR 17267 at commit
python/pyspark/sql/utils.py (Outdated)

@@ -16,6 +16,10 @@
 #

+import py4j
+import sys
+
+if sys.version > '3':
I think it should be >=.
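As an aside, comparing the version banner string lexicographically works for CPython's "3.x" banners but is easy to get subtly wrong; comparing `sys.version_info` tuples is the explicit idiom (a sketch, not necessarily what the PR ultimately uses):

```python
import sys

# sys.version is a free-form banner string like "3.6.0 (default, ...)",
# so `sys.version > '3'` relies on lexicographic ordering. The tuple
# comparison below states the intent directly.
if sys.version_info >= (3,):
    unicode = str  # Python 3 has no separate unicode type

print(isinstance(u"test", unicode))
```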
Test build #74503 has finished for PR 17267 at commit
ping @viirya and @HyukjinKwon

@srowen Could you please take a look and help to merge?

I'm not reviewing this patch. People who know better should merge it.

I'll take a look at reviewing this later this week, @uncleGen. Two minor things we can do in the meantime: make the JIRA description a bit clearer about what the proposed change is, and address that this change isn't really tested by Jenkins, since there are no tests that look at the formatting of the error strings; maybe consider adding a test or updating the description on the PR.

cc @ueshin

LGTM too, but I hope there will be a test if possible.
Correct me if I'm wrong, but I got the following message after this patch in Python 3.6:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/pyspark/spark/python/pyspark/sql/dataframe.py", line 1049, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/Users/ueshin/workspace/pyspark/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/Users/ueshin/workspace/pyspark/spark/python/pyspark/sql/utils.py", line 77, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: b"cannot resolve '`\xec\x95\x84`' given input columns: [id];;\n'Project ['\xec\x95\x84]\n+- Range (0, 1, step=1, splits=Some(8))\n"

I guess this message is not desirable?
+1 for adding a test.
-        return repr(self.desc)
+        desc = self.desc
+        if isinstance(desc, unicode):
+            return str(desc.encode('utf-8'))
@ueshin, you are right and I misread the code. We need to:

- unicode in Python 2 => u.encode("utf-8")
- others in Python 2 => return str(s)
- others in Python 3 => return str(s)

The root cause for #17267 (comment) looks to be that encode on a string in Python 3 (the counterpart of unicode in Python 2) produces 8-bit bytes, b"..." (in Python 2 this is the same as a normal string "...", where the b prefix is ignored). And the str function works differently, as below:

Python 2

>>> str(b"aa")
'aa'
>>> b"aa"
'aa'

Python 3

>>> str(b"aa")
"b'aa'"
>>> "aa"
'aa'
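Putting those three cases together, a version-aware `__str__` could look like this sketch (the class shape is assumed from the diff under review; the real pyspark.sql.utils class carries more context):

```python
import sys

class CapturedException(Exception):
    # Minimal stand-in mirroring the shape of the class in
    # pyspark/sql/utils.py under discussion.
    def __init__(self, desc, stackTrace=None):
        self.desc = desc
        self.stackTrace = stackTrace

    def __str__(self):
        desc = self.desc
        # Case 1: unicode on Python 2. Encode to UTF-8 bytes so str()
        # does not raise UnicodeEncodeError.
        if sys.version_info < (3,) and isinstance(desc, unicode):  # noqa: F821
            return str(desc.encode('utf-8'))
        # Cases 2 and 3: everything else on Python 2, and all strings
        # on Python 3 (already unicode), pass through str() directly,
        # avoiding the b"..." wrapping that str(bytes) causes on Python 3.
        return str(desc)

print(str(CapturedException(u"cannot resolve '아'")))
```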
Good catch! I previously thought str worked like Python 2.
+1. We should add a test for this.
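A sketch of such a test (unittest-based; the class and test names are assumptions, not the test that was eventually added):

```python
import sys
import unittest

class CapturedException(Exception):
    # Stand-in for pyspark.sql.utils.CapturedException, reduced to
    # the __str__ behavior under discussion.
    def __init__(self, desc, stackTrace=None):
        self.desc = desc
        self.stackTrace = stackTrace

    def __str__(self):
        desc = self.desc
        if sys.version_info < (3,) and isinstance(desc, unicode):  # noqa: F821
            return str(desc.encode('utf-8'))
        return str(desc)

class UserFriendlyExceptionTest(unittest.TestCase):
    def test_non_ascii_message_is_readable(self):
        e = CapturedException(u"cannot resolve '`아`' given input columns: [id]")
        rendered = str(e)
        # The rendered message should contain the literal character,
        # not an escape sequence like \uc544.
        self.assertIn(u"아", rendered)
        self.assertNotIn("\\uc544", rendered)

if __name__ == "__main__":
    unittest.main(exit=False)
```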
Hey @uncleGen, any time to add a test for this?

@dataknocker, do you want to take over this one? Then we can continue with #18324.
…endly

### What changes were proposed in this pull request?

The str of `CapturedException` is now returned by str(self.desc) rather than repr(self.desc), which is more user-friendly. It also handles unicode under Python 2 specially.

### Why are the changes needed?

This is an improvement that makes exceptions more human-readable on the Python side.

### Does this PR introduce any user-facing change?

Before this PR, selecting `中文字段` throws an exception like:

```
Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: u"cannot resolve '`\u4e2d\u6587\u5b57\u6bb5`' given input columns: []; line 1 pos 7;\n'Project ['\u4e2d\u6587\u5b57\u6bb5]\n+- OneRowRelation\n"
```

After this PR:

```
Traceback (most recent call last):
  File "/Users/advancedxy/code_workspace/github/spark/python/pyspark/sql/tests/test_utils.py", line 34, in test_capture_user_friendly_exception
    raise e
AnalysisException: cannot resolve '`中文字段`' given input columns: []; line 1 pos 7;
'Project ['中文字段]
+- OneRowRelation
```

### How was this patch tested?

Added a new test to verify unicode is correctly converted, plus manual checks of thrown exceptions.

This PR's credits should go to uncleGen and it is based on #17267.

Closes #25814 from advancedxy/python_exception_19926_and_21045.

Authored-by: Xianjin YE <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?

Exceptions in PySpark are a little difficult to read.

before PR:

after PR:

IMHO, the root cause is that `repr` is not user-friendly. This PR changes `repr` to `str`.

How was this patch tested?

Jenkins
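The difference is easy to see in isolation (Python 3 shown; under Python 2, `repr` additionally renders non-ASCII characters as `\u` escapes):

```python
msg = u"cannot resolve '`아`' given input columns: [id]"

# repr() wraps the message in quotes (and, under Python 2, escapes
# non-ASCII characters), which is what made the old exception text noisy.
print(repr(msg))

# str() yields the message itself, which reads naturally.
print(str(msg))
```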