[SPARK-19454][PYTHON][SQL] DataFrame.replace improvements #16793
@@ -1591,6 +1591,78 @@ def test_replace(self):

```python
        self.assertEqual(row.age, 10)
        self.assertEqual(row.height, None)

        # replace with lists
        row = self.spark.createDataFrame(
            [(u'Alice', 10, 80.1)], schema).replace([u'Alice'], [u'Ann']).first()
        self.assertTupleEqual(row, (u'Ann', 10, 80.1))

        # replace with dict
        row = self.spark.createDataFrame(
            [(u'Alice', 10, 80.1)], schema).replace({10: 11}).first()
        self.assertTupleEqual(row, (u'Alice', 11, 80.1))
```
> **Review comment:** This is the only test of "new" functionality (excluding error cases), correct?

> **Reply:** These tests are mostly a side effect of discussions related to #16792. Right now test coverage is low, and we depend on a certain behavior of Py4j and the Scala counterpart. I also wanted to be sure that all the expected types are still accepted after the changes I've made. So maybe not necessary, but I will argue it is a good idea to have these.

> **Reply:** I think (and I could be wrong) that @nchammas was suggesting it might make sense to have some more tests with dict, not that the other additional new tests are bad.
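For context on the dict form under discussion: a minimal, hypothetical sketch of how a dict `to_replace` can be normalized into the parallel-lists form that the other call styles use. The helper name `normalize_replace_args` is illustrative only and is not PySpark's actual internal API.

```python
def normalize_replace_args(to_replace, value=None):
    """Normalize replace() arguments into parallel (keys, values) lists.

    Illustrative sketch: when to_replace is a dict, its items supply both
    sides and any explicit `value` is ignored (the backward-compat dummy).
    """
    if isinstance(to_replace, dict):
        keys = list(to_replace.keys())
        values = list(to_replace.values())
    elif isinstance(to_replace, (list, tuple)):
        keys = list(to_replace)
        values = list(value)
    else:
        # scalar to_replace pairs with a scalar value
        keys = [to_replace]
        values = [value]
    if len(keys) != len(values):
        raise ValueError(
            "to_replace and value lists should be of the same length")
    return keys, values
```

This matches why the `dummy_value` test below passes: with a dict argument, the second positional argument never influences the result.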
```python
        # test backward compatibility with dummy value
        dummy_value = 1
        row = self.spark.createDataFrame(
            [(u'Alice', 10, 80.1)], schema).replace({'Alice': 'Bob'}, dummy_value).first()
        self.assertTupleEqual(row, (u'Bob', 10, 80.1))

        # test dict with mixed numerics
        row = self.spark.createDataFrame(
            [(u'Alice', 10, 80.1)], schema).replace({10: -10, 80.1: 90.5}).first()
        self.assertTupleEqual(row, (u'Alice', -10, 90.5))

        # replace with tuples
        row = self.spark.createDataFrame(
            [(u'Alice', 10, 80.1)], schema).replace((u'Alice', ), (u'Bob', )).first()
        self.assertTupleEqual(row, (u'Bob', 10, 80.1))

        # replace multiple columns
        row = self.spark.createDataFrame(
            [(u'Alice', 10, 80.0)], schema).replace((10, 80.0), (20, 90)).first()
        self.assertTupleEqual(row, (u'Alice', 20, 90.0))

        # test for mixed numerics
        row = self.spark.createDataFrame(
            [(u'Alice', 10, 80.0)], schema).replace((10, 80), (20, 90.5)).first()
        self.assertTupleEqual(row, (u'Alice', 20, 90.5))

        row = self.spark.createDataFrame(
            [(u'Alice', 10, 80.0)], schema).replace({10: 20, 80: 90.5}).first()
        self.assertTupleEqual(row, (u'Alice', 20, 90.5))

        # replace with boolean
        row = (self
               .spark.createDataFrame([(u'Alice', 10, 80.0)], schema)
               .selectExpr("name = 'Bob'", 'age <= 15')
               .replace(False, True).first())
        self.assertTupleEqual(row, (True, True))

        # should fail if subset is not a list, tuple, or None
        with self.assertRaises(ValueError):
            self.spark.createDataFrame(
                [(u'Alice', 10, 80.1)], schema).replace({10: 11}, subset=1).first()

        # should fail if to_replace and value have different lengths
        with self.assertRaises(ValueError):
            self.spark.createDataFrame(
                [(u'Alice', 10, 80.1)], schema).replace(["Alice", "Bob"], ["Eve"]).first()

        # should fail when given an unexpected type
        with self.assertRaises(ValueError):
            from datetime import datetime
            self.spark.createDataFrame(
                [(u'Alice', 10, 80.1)], schema).replace(datetime.now(), datetime.now()).first()

        # should fail if given mixed type replacements
        with self.assertRaises(ValueError):
            self.spark.createDataFrame(
                [(u'Alice', 10, 80.1)], schema).replace(["Alice", 10], ["Eve", 20]).first()

        with self.assertRaises(ValueError):
            self.spark.createDataFrame(
                [(u'Alice', 10, 80.1)], schema).replace({u"Alice": u"Bob", 10: 20}).first()

    def test_capture_analysis_exception(self):
        self.assertRaises(AnalysisException, lambda: self.spark.sql("select abc"))
        self.assertRaises(AnalysisException, lambda: self.df.selectExpr("a + b"))
```
> **Review comment:** Maybe give this a docstring to clarify what `all_of` does; even though it's not user-facing, it's better to have a docstring than not.