[SPARK-14352][SQL] approxQuantile should support multi columns #12135
Conversation
Test build #54800 has finished for PR 12135 at commit
So if we are going to add this, it might make sense to also add the Python API at the same time (or at least create a JIRA to track adding this to the Python API so it doesn't slip through the cracks).
@holdenk Ok, I will add the Python API to this PR.
Force-pushed from 67947af to 6cf073d (compare)
Test build #55100 has finished for PR 12135 at commit
Test build #55101 has finished for PR 12135 at commit
Force-pushed from 7348d49 to c9ebfef (compare)
Test build #55307 has finished for PR 12135 at commit
Test build #55326 has finished for PR 12135 at commit
@holdenk The wrapper for PySpark has been added.
@zhengruifeng @viirya this is a duplicate of #12207
@viirya could you review this PR and add any items from yours that might be missing?
@MLnick What should I do?
@zhengruifeng we will review and aim to merge this PR. ping @viirya @jkbradley @holdenk
/**
 * Python-friendly version of [[approxQuantile()]]
 */
private[spark] def approxQuantileMultiCols(
You don't need a new method name. Since this multi-column version has different input types, it can use the same method name `approxQuantile`.
@viirya Thanks, I have updated this PR according to your comments.
Test build #56091 has finished for PR 12135 at commit
If we only call the multi-column version from PySpark, do we still need the single column version Python API?
Both versions are in use now.
Test build #56368 has finished for PR 12135 at commit
cc @MLnick?
Test this please
python/pyspark/sql/dataframe.py
Outdated
if not isinstance(col, (str, list, tuple)):
    raise ValueError("col should be a string, list or tuple.")

isStr = isinstance(col, str)
Super minor and subjective - but you could also just wrap it as a list here instead of propagating isStr down to the return.
Thanks for helping to review this PR; it has been quite a while.
The type of `col` determines the type of the return value. If I make `col = [col]` here, I will not know whether to return a list or a list of lists. Like this:
>>> dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
>>> dataset.stat.approxQuantile(['label'], [0.1, 0.2], 0.1)
[[0.0, 1.0]]
>>> dataset.stat.approxQuantile('label', [0.1, 0.2], 0.1)
[0.0, 1.0]
That's true, but if we can get rid of one of the unused private methods on the Scala side I'm all for that. Can we not simply return the first element of the result if it is length 1, otherwise return the result?
@MLnick Sorry, I am not very sure about your opinion. Do you mean that if the input col is a list of only one element, the output should be a single element (not a list)?
@MLnick Ok, I will remove the unused Python-friendly version of [[approxQuantile()]] with a single col.
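The shape contract being debated above is the crux: a string `col` should yield a flat list of quantiles, while a list/tuple should yield a list of lists. A minimal plain-Python sketch of that dispatch, with a hypothetical `_multi_col_core` standing in for the JVM-side multi-column call (this is illustrative, not the actual PySpark code):

```python
def _multi_col_core(cols, probabilities, relative_error):
    # Stand-in for the JVM-side multi-column approxQuantile call:
    # returns one list of quantiles per requested column.
    return [[0.0 for _ in probabilities] for _ in cols]

def approx_quantile(col, probabilities, relative_error):
    if not isinstance(col, (str, list, tuple)):
        raise ValueError("col should be a string, list or tuple.")
    is_str = isinstance(col, str)
    # Wrap a single name into a list so only one core path is needed.
    cols = [col] if is_str else list(col)
    result = _multi_col_core(cols, probabilities, relative_error)
    # Unwrap only when the caller passed a single column name, so that
    # 'label' -> [q1, q2] while ['label'] -> [[q1, q2]].
    return result[0] if is_str else result
```

Wrapping into `[col]` and unwrapping on return keeps a single multi-column code path on the Scala side, which is the direction the review settled on.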
Force-pushed from bae4053 to 785a667 (compare)
Test build #63208 has finished for PR 12135 at commit
Can any admin verify this PR? It's been a long time and I really need this feature...
python/pyspark/sql/dataframe.py
Outdated
@@ -1256,18 +1256,33 @@ def approxQuantile(self, col, probabilities, relativeError):
     Space-efficient Online Computation of Quantile Summaries]]
     by Greenwald and Khanna.

-    :param col: the name of the numerical column
+    :param col: the name of the numerical column, or a list/tuple of
OK, I will update this.
Force-pushed from 6517f21 to 29a691f (compare)
Test build #72070 has finished for PR 12135 at commit
ping @holdenk?
Sorry, my weekend ended up super busy. I'll try and take a look tomorrow :) Also thanks for adding more tests <3 tests :)
Two minor pluralizations I noticed while going back over it; I will double-check with Nick for any other issues. If you have a chance to fix the docstrings that would be great, otherwise I can do that during the merge.
python/pyspark/sql/dataframe.py
Outdated
 :param probabilities: a list of quantile probabilities
     Each number must belong to [0, 1].
     For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
 :param relativeError: The relative target precision to achieve
     (>= 0). If set to zero, the exact quantiles are computed, which
     could be very expensive. Note that values greater than 1 are
     accepted but give the same result as 1.
-:return: the approximate quantiles at the given probabilities
+:return: the approximate quantiles at the given probabilities. If
+    the input `col` is a string, the output is a list of float. If the
float should be pluralized (e.g. is a list of floats)
python/pyspark/sql/dataframe.py
Outdated
    the input `col` is a string, the output is a list of float. If the
    input `col` is a list or tuple of strings, the output is also a
    list, but each element in it is a list of float, i.e., the output
    is a list of list of float.
Also float -> floats here
@holdenk Updated! Thanks for your careful checking.
Test build #72237 has finished for PR 12135 at commit
* @param probabilities a list of quantile probabilities
*   Each number must belong to [0, 1].
*   For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
* @param relativeError The relative target precision to achieve (>= 0).
As a note for future changes, and because javadoc8 is super easy to break: javadoc8 complains about this as below:
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error] * @see {@link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>'
[error] * @param relativeError The relative target precision to achieve (>= 0).
[error]
We could write this as
@param relativeError The relative target precision to achieve (greater or equal to 0).
and fix the link as below, if there is no better choice:
@see `DataFrameStatsFunctions.approxQuantile` for detailed description.
Just FYI, there are several such cases in #16013
Are these just warnings generated? It would be nice to know during Jenkins testing if javadoc8 (or scaladoc for that matter) breaks.
The 2nd case links nicely to the single-arg version of the method, which contains the detailed doc, in Scaladoc. Pity it won't work with javadoc - is there another way to link it correctly? I suspect that what will work for javadoc will break the link for scaladoc...
Yea.. so, @jkbradley kindly opened a JIRA here - http://issues.apache.org/jira/browse/SPARK-18692
Actually, they are errors that make the documentation build fail on javadoc8. Many of us had a hard time figuring out a good way, AFAIK (honestly, I have tried every combination I could think of; to make it worse, it seems to be case-by-case from my observation and tests), and it kind of ended up with the one above, as we are anyway going to drop Java 7 support in the near future, to my knowledge.
I will ping you if I happen to find another good way to make the links work for both.
(BTW, IMHO, at least for now, building javadoc every time might be good to do but not required. We can avoid these errors as best we can in our PRs, and then sweep them up when the release is close or in other related PRs if there are any.)
Should we create an issue to build javadoc with Java 8 on Jenkins, then?
Ah, that JIRA is actually here - https://issues.apache.org/jira/browse/SPARK-18692 if we are talking about the same thing :)
Ah yes, sorry, the comments imply building it separately from the main Jenkins build, but if we want to avoid breaking Java 8 unidoc I was thinking that building it as part of the normal PR build process would be better. Regardless, let's move discussion over to that JIRA :)
approxQuantile(col, probabilities.toArray, relativeError).toList.asJava
relativeError: Double): java.util.List[java.util.List[Double]] = {
approxQuantile(cols.toArray, probabilities.toArray, relativeError)
.map(_.toList.asJava).toList.asJava
The indent is not right.
@holdenk When you do the code merge, you need to leave a comment to explain which branch you merged.
@HyukjinKwon @gatorsmile Thanks for pointing out those issues. I will create a follow-up PR to fix them ASAP.
* @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
*   detailed description.
*
* Note that rows containing any null or NaN values values will be removed before
`values values` -> `values`
@zhengruifeng Could you submit a follow-up PR to add test cases for null values?
@zhengruifeng Actually, I still have a few comments about this PR. I will leave the comments soon. Thanks!
* For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
* @param relativeError The relative target precision to achieve (>= 0).
*   If set to zero, the exact quantiles are computed, which could be very expensive.
*   Note that values greater than 1 are accepted but give the same result as 1.
It sounds like you did not add any test case to verify it.
* Each number must belong to [0, 1].
* For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
* @param relativeError The relative target precision to achieve (>= 0).
*   If set to zero, the exact quantiles are computed, which could be very expensive.
This case is also missing a test.
Actually, you also need to consider illegal cases, like negative values.
* calculation.
* @param cols the names of the numerical columns
* @param probabilities a list of quantile probabilities
*   Each number must belong to [0, 1].
What happens if users provide a number outside this boundary? Do we have a test case to verify the behavior?
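A minimal sketch of the boundary checks being asked about, as a hypothetical plain-Python helper mirroring the documented contract (each probability in [0, 1], `relativeError >= 0`, values above 1 behaving the same as 1); this is illustrative of the test cases to add, not Spark's actual implementation:

```python
def check_quantile_args(probabilities, relative_error):
    """Validate arguments per the documented contract and return the
    effective relative error (values above 1 behave the same as 1)."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probability %s is out of range [0, 1]" % p)
    if relative_error < 0:
        raise ValueError("relativeError must be >= 0, got %s" % relative_error)
    return min(relative_error, 1.0)
```

Tests exercising exactly these branches (out-of-range probabilities, negative relative error, and the above-1 clamp) are the coverage the reviewer is requesting.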
@zhengruifeng Please try to improve the test case coverage in the follow-up PRs. You might find some bugs when you add these test cases. Thanks for your work!
Thanks for the reminder @gatorsmile (it wasn't in the list of things to do when merging, so I'll follow up and update the http://spark.apache.org/committers.html docs to add that as a follow-up step along with the JIRA update). For the record: merged to master in b098576
@gatorsmile it's a good point about the tests. However, this JIRA & PR was for exposing the multi-column functionality of the existing private API. @zhengruifeng could you create a JIRA ticket and link your new PR to that one also?
@MLnick I created SPARK-19436 for it.
I am fine with creating a separate one, but normally, in Spark SQL, we do not create a separate JIRA for improving the related test cases if the original ones are missing.
## What changes were proposed in this pull request?
1. Add multi-column support based on the current private API.
2. Add multi-column support to PySpark.

## How was this patch tested?
Unit tests.

Author: Zheng RuiFeng <[email protected]>
Author: Ruifeng Zheng <[email protected]>

Closes apache#12135 from zhengruifeng/quantile4multicols.
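To show the merged behavior end to end without a Spark cluster, here is a hypothetical pure-Python analogue over a list of dicts: exact nearest-rank quantiles stand in for Spark's Greenwald-Khanna approximation, rows with null/NaN in a requested column are dropped (as the Scala doc above notes), and the single-column vs. multi-column return shapes match the PySpark API added by this PR. All names here are illustrative:

```python
import math

def _column_quantiles(values, probabilities):
    # Exact nearest-rank quantiles: a stand-in for Spark's
    # Greenwald-Khanna approximation (relativeError is ignored here).
    s = sorted(values)
    return [s[min(int(p * (len(s) - 1) + 0.5), len(s) - 1)]
            for p in probabilities]

def approx_quantile(rows, col, probabilities, relative_error):
    # String col -> flat list of quantiles; list/tuple of cols ->
    # list of per-column quantile lists, as in the PySpark API.
    cols = [col] if isinstance(col, str) else list(col)
    result = []
    for c in cols:
        # Drop rows whose value is null (None) or NaN, mirroring the
        # documented behavior of the Scala implementation.
        vals = [r[c] for r in rows
                if r[c] is not None and not math.isnan(r[c])]
        result.append(_column_quantiles(vals, probabilities))
    return result[0] if isinstance(col, str) else result
```

For example, `approx_quantile(rows, 'label', [0.1, 0.2], 0.1)` returns a flat list, while passing `['label']` returns a one-element list of lists, mirroring the doctest shown in the review discussion.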