
[SPARK-19454][PYTHON][SQL] DataFrame.replace improvements #16793

Closed
wants to merge 5 commits

Conversation

zero323
Member

@zero323 zero323 commented Feb 3, 2017

What changes were proposed in this pull request?

  • Allows skipping value argument if to_replace is a dict:
     df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
     df.replace({"Alice": "Bob"}).show()
  • Adds validation step to ensure homogeneous values / replacements.
  • Simplifies internal control flow.
  • Improves unit test coverage.
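The dict shorthand in the first bullet can be illustrated in plain Python, independent of Spark. The following is a hypothetical sketch (the helper name `normalize_replace_args` is invented for illustration, not Spark's actual internals) of how a `{old: new}` mapping might be normalized into parallel to-replace/replacement lists, which is why the separate `value` argument becomes optional:

```python
def normalize_replace_args(to_replace, value=None):
    """Hypothetical sketch: normalize replace() arguments into two
    parallel lists. A dict supplies both sides at once; scalars are
    wrapped in lists, and a scalar value is broadcast to match."""
    if isinstance(to_replace, dict):
        # Keys are the values to replace, values are the replacements.
        return list(to_replace.keys()), list(to_replace.values())
    if not isinstance(to_replace, (list, tuple)):
        to_replace = [to_replace]
    if not isinstance(value, (list, tuple)):
        value = [value] * len(to_replace)
    return list(to_replace), list(value)
```

For example, `normalize_replace_args({"Alice": "Bob"})` yields `(["Alice"], ["Bob"])`, matching the `df.replace({"Alice": "Bob"})` call shown above.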

How was this patch tested?

Existing unit tests, additional unit tests, manual testing.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72318 has finished for PR 16793 at commit 904db24.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72319 has finished for PR 16793 at commit a3a3127.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72390 has finished for PR 16793 at commit a7b6dba.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72392 has finished for PR 16793 at commit f61b782.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72389 has finished for PR 16793 at commit c06b97c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72393 has finished for PR 16793 at commit a02e4ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@nchammas nchammas left a comment

All the new tests were intimidating at first, but then I realized that they were all testing for existing functionality. Are they all necessary?

Fixing the value API wart looks good to me. I can take a closer look at the reworked flow and all_of_ logic if necessary.

# replace with dict
row = self.spark.createDataFrame(
[(u'Alice', 10, 80.1)], schema).replace({10: 11}).first()
self.assertTupleEqual(row, (u'Alice', 11, 80.1))
Contributor

This is the only test of "new" functionality (excluding error cases), correct?

Member Author

@zero323 zero323 Feb 12, 2017

These tests are mostly a side effect of discussions related to #16792. Right now, test coverage is low and we depend on a certain behavior of Py4j and the Scala counterpart. Also, I wanted to be sure that all the expected types are still accepted after the changes I've made.

So maybe not strictly necessary, but I would argue it is a good idea to have them.

Contributor

I think (and I could be wrong) that @nchammas was suggesting it might make sense to have some more tests with dict, not that the other additional new tests are bad.

@zero323
Member Author

zero323 commented Feb 25, 2017

cc @holdenk

Contributor

@holdenk holdenk left a comment

Thanks for working on this! :) I've done a quick first pass and I've got a few questions/comments - let me know what you think and I'll follow up with a more thorough read through :)

@@ -1307,43 +1307,66 @@ def replace(self, to_replace, value, subset=None):
|null| null|null|
+----+------+----+
"""
if not isinstance(to_replace, (float, int, long, basestring, list, tuple, dict)):
# Helper functions
def all_of(types):
Contributor

Maybe give this a docstring to clarify what all_of does. Even though it's not user-facing, it's better to have a docstring than not.

subset = [subset]

if not isinstance(subset, (list, tuple)):
raise ValueError("subset should be a list or tuple of column names")
# Check if we won't pass mixed type generics
Contributor

This reads a bit awkwardly. How about "Verify we were not passed in mixed type generics."?
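The check under discussion, verifying that we were not passed mixed-type generics, can be sketched in isolation. This is a hypothetical standalone version (the function name `verify_homogeneous` is invented, and it is written in Python 3 terms, whereas the PR targeted a Python 2/3 codebase using `basestring`):

```python
def verify_homogeneous(xs):
    """Hypothetical sketch of the homogeneity check: reject a
    sequence that mixes strings with numbers, since one replace
    call maps onto a single column type on the JVM side."""
    all_strings = all(isinstance(x, str) for x in xs)
    all_numbers = all(isinstance(x, (int, float)) for x in xs)
    if not (all_strings or all_numbers):
        raise ValueError("Mixed type replacements are not supported")
```

For example, `["a", "b"]` and `[1, 2.0]` pass, while `["a", 1]` raises.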

@zero323
Member Author

zero323 commented Feb 25, 2017

I think (and I could be wrong) that @nchammas was suggesting it might make sense to have some more tests with dict, not that the other additional new tests are bad.

I am like Python - you have to be explicit :) I'll try to figure out some useful tests and get back to you. Thanks for the feedback, @holdenk and @nchammas.

@zero323
Member Author

zero323 commented Feb 27, 2017

Jenkins, retest this please

@SparkQA

SparkQA commented Feb 27, 2017

Test build #73517 has finished for PR 16793 at commit e014867.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2017

Test build #73524 has finished for PR 16793 at commit 17e6820.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if not isinstance(to_replace, (float, int, long, basestring, list, tuple, dict)):
# Helper functions
def all_of(types):
"""Given a type or tuple of types
Contributor

The formatting of this docstring seems odd here. Also, I'd clarify that all_of returns a function which you can use for the check, rather than doing the check itself.
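In that spirit, the helper might read as follows. This is a sketch consistent with the review comment (a closure returning a predicate), not necessarily the code that was merged:

```python
def all_of(types):
    """Given a type or a tuple of types, return a function that
    checks whether all elements of a sequence are instances of
    the given type(s)."""
    def all_of_(xs):
        return all(isinstance(x, types) for x in xs)
    return all_of_
```

Usage: `all_of(str)(["a", "b"])` returns `True`, while `all_of(int)(["a", "b"])` returns `False`.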

rep_dict = to_replace
if value is not None:
warnings.warn("to_replace is a dict, but value is not None. "
Contributor

Does this need to be split?

Member Author

Maybe not.


if not isinstance(value, (float, int, long, basestring, list, tuple)):
raise ValueError("value should be a float, int, long, string, list, or tuple")
if (not isinstance(value, valid_types) and
Contributor

This seems like a weird split.

@SparkQA

SparkQA commented Mar 8, 2017

Test build #74146 has finished for PR 16793 at commit 03303df.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 8, 2017

Test build #74153 has finished for PR 16793 at commit 03303df.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Mar 8, 2017

Jenkins retest this please (47b2f68).

@SparkQA

SparkQA commented Mar 8, 2017

Test build #74163 has finished for PR 16793 at commit 03303df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Mar 16, 2017

@holdenk Do you think it is realistic to see this merged into 2.2?

@holdenk
Contributor

holdenk commented Apr 3, 2017

Let me try and take a look tonight. It seems like there are some small formatting issues still at a quick glance but I feel like this should be close.

@holdenk
Contributor

holdenk commented Apr 5, 2017

LGTM

@holdenk
Contributor

holdenk commented Apr 5, 2017

Merged to master

@asfgit asfgit closed this in e277399 Apr 5, 2017
@zero323
Member Author

zero323 commented Apr 6, 2017

Thanks @holdenk

@zero323 zero323 deleted the SPARK-19454 branch April 6, 2017 10:57
@rxin
Contributor

rxin commented Feb 2, 2018

Sorry, I object to this change. Why would we put null as the default replace value, in a function called replace? That seems very counterintuitive and error-prone.

@rxin
Contributor

rxin commented Feb 2, 2018

Also, the implementation doesn't match what was proposed in https://issues.apache.org/jira/browse/SPARK-19454

Having a null value as the default in a function called replace is too risky and error-prone.

@HyukjinKwon
Member

I think the actual root cause is that we allowed a dictionary for to_replace in the first place.

So, would you prefer to have:

def replace(self, to_replace, value, subset=None):
    ...

But in this case, if to_replace is a dictionary, we would have to call it as below.

 df.replace({"Alice": "Bob"}, 1).show()

@HyukjinKwon
Member

Otherwise, please give me a few days. Let me give it a shot with def replace(self, to_replace, *args, **kwargs): and see if I can resolve it, if we are okay with that, although I guess pydoc will show a less pretty doc.
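The *args/**kwargs idea can be sketched as a standalone argument parser. This is purely hypothetical (the name `parse_replace_args` is invented; the real logic would live inside DataFrame.replace): it keeps `value` positional for the existing call style while making it optional only when to_replace is a dict.

```python
def parse_replace_args(to_replace, *args, **kwargs):
    """Hypothetical parser for the proposed
    replace(self, to_replace, *args, **kwargs) signature:
    `value` stays required unless to_replace is a dict."""
    if args:
        value = args[0]
    elif "value" in kwargs:
        value = kwargs["value"]
    elif isinstance(to_replace, dict):
        value = None  # the dict already carries the replacements
    else:
        raise TypeError("value is required unless to_replace is a dict")
    return to_replace, value
```

Under this sketch, both `replace({"Alice": "Bob"})` and the classic `replace(10, 11)` parse cleanly, while `replace(10)` fails fast with a TypeError.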

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 3, 2018
…na.replace in PySpark"

This reverts commit 0fcde87.

See the discussion in [SPARK-21658](https://issues.apache.org/jira/browse/SPARK-21658),  [SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454) and apache#16793

Author: hyukjinkwon <[email protected]>

Closes apache#20496 from HyukjinKwon/revert-SPARK-21658.
asfgit pushed a commit that referenced this pull request Feb 3, 2018
…na.replace in PySpark"

This reverts commit 0fcde87.

See the discussion in [SPARK-21658](https://issues.apache.org/jira/browse/SPARK-21658),  [SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454) and #16793

Author: hyukjinkwon <[email protected]>

Closes #20496 from HyukjinKwon/revert-SPARK-21658.

(cherry picked from commit 551dff2)
Signed-off-by: gatorsmile <[email protected]>