[SPARK-15892][ML] Incorrectly merged AFTAggregator with zero total count #13619
Conversation
cc @jkbradley Could you please take a look?
FYI, the example was passing in branch-1.6 because it does not check whether the count is 0 or not. This check was introduced in 101663f.
Test build #60345 has finished for PR 13619 at commit
@HyukjinKwon Thanks! The fix looks correct. Could you please add a tiny unit test which fails before your fix and works afterwards?
@jkbradley Sure!
Test build #60353 has finished for PR 13619 at commit
Test build #60357 has finished for PR 13619 at commit
Test build #60359 has finished for PR 13619 at commit
LGTM
## What changes were proposed in this pull request?

Currently, `AFTAggregator` is not merged correctly. For example, if the data contains any empty partition, this creates an `AFTAggregator` with a zero total count, which causes the exception below:

```
IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.'
```

Please see [AFTSurvivalRegression.scala#L573-L575](https://github.com/apache/spark/blob/6ecedf39b44c9acd58cdddf1a31cf11e8e24428c/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L573-L575) as well.

Just to be clear, the Python example `aft_survival_regression.py` appears to use 5 rows. So, if there are more than 5 partitions, it throws the exception above, since some partitions are empty and an empty partition results in an incorrectly merged `AFTAggregator`.

Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with more than 5 CPU cores fails because, with default configurations, some of the created tasks have empty partitions (AFAIK, the parallelism level is set to the number of CPU cores).

## How was this patch tested?

A unit test in `AFTSurvivalRegressionSuite.scala`, and manually by running `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`.

Author: hyukjinkwon <[email protected]>
Author: Hyukjin Kwon <[email protected]>

Closes #13619 from HyukjinKwon/SPARK-15892.

(cherry picked from commit e355460)
Signed-off-by: Joseph K. Bradley <[email protected]>
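The fix amounts to guarding the merge so that an aggregator built from an empty partition (one with a zero instance count) does not corrupt the combined statistics. A minimal, language-agnostic sketch of that pattern in Python — the class and field names here are illustrative, not Spark's actual `AFTAggregator` API:

```python
class Aggregator:
    """Toy treeAggregate-style accumulator illustrating the zero-count merge guard."""

    def __init__(self):
        self.count = 0
        self.loss_sum = 0.0

    def add(self, value):
        self.count += 1
        self.loss_sum += value
        return self

    def merge(self, other):
        # Guard: an aggregator produced by an empty partition has count == 0
        # and must be skipped rather than merged into the running statistics.
        if other.count > 0:
            self.count += other.count
            self.loss_sum += other.loss_sum
        return self


# Simulate per-partition aggregation where one partition is empty,
# as happens when the parallelism exceeds the number of rows.
partitions = [[1.0, 2.0], [], [3.0]]
per_partition = []
for part in partitions:
    agg = Aggregator()
    for v in part:
        agg.add(v)
    per_partition.append(agg)

total = Aggregator()
for agg in per_partition:
    total.merge(agg)

assert total.count == 3          # the empty partition did not break the merge
assert total.loss_sum == 6.0
```

Without the `other.count > 0` guard, a downstream `require(count > 0)` check (or any statistic normalized by the count) can fail or be skewed whenever an empty partition's aggregator participates in the merge.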
// the parallelism is bigger than that. Because the issue was about `AFTAggregator`s
// being merged incorrectly when it has an empty partition, running the code below
// should not throw an exception.
val dataset = spark.createDataFrame(
@HyukjinKwon value `spark` is not found here, it should be `sqlContext`, right?
I compiled it in branch-1.6.
With branch-2.0 it is OK; I think this PR should not be merged into branch-1.6 directly.
Oh, it seems this is merged into branch-1.6 too. Yes, it should be `sqlContext` for branch-1.6.
@HyukjinKwon OK, but this PR should be reverted first.
Yes, it seems it is still failing. I can change my PR to revert this if @jkbradley is busy for now; otherwise, I will close mine.
@HyukjinKwon, could you revert this PR for branch-1.6?
@HyukjinKwon, just saw it, thanks.
My apologies for the slow response. I'll revert it now. Thank you for pinging!
Reverted in branch-1.6 in commit 2f3e327
… 1.6

## What changes were proposed in this pull request?

This PR backports #13619. The original test added in branch-2.0 failed in branch-1.6. This seems to be because the behaviour was changed in 101663f. The failure occurred while calculating Euler's number, which ends up as infinity regardless of this path. So, I brought the dataset from `AFTSurvivalRegressionExample` to make sure this works, and then wrote the test.

I ran the test before and after creating empty partitions. `model.scale` becomes `1.0` with empty partitions and `1.547` without them. After this patch, it is always `1.547`.

## How was this patch tested?

Unit test in `AFTSurvivalRegressionSuite`.

Author: hyukjinkwon <[email protected]>

Closes #13725 from HyukjinKwon/SPARK-15892-1-6.