
[SPARK-19234][MLLib] AFTSurvivalRegression should fail fast when any labels are zero #16652

Closed
wants to merge 6 commits

Conversation

admackin (Contributor)

What changes were proposed in this pull request?

If any labels of 0.0 (which are invalid) are supplied, AFTSurvivalRegression now fails immediately with a clear error, rather than emitting hard-to-interpret warnings and zero-valued coefficients in the output.
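The fail-fast behaviour can be sketched roughly as a `require` guard on the label when training points are constructed (a sketch only; the exact placement, field order, and message in the patch may differ):

```scala
import org.apache.spark.ml.linalg.Vector

// Sketch: validate the label as soon as a training point is built, so that
// fit() fails immediately instead of emitting warnings and producing
// zero-valued coefficients. Field order here is an assumption.
private[regression] case class AFTPoint(features: Vector, label: Double, censor: Double) {
  require(label > 0.0, s"label of AFTPoint must be positive but got $label")
  require(censor == 1.0 || censor == 0.0,
    s"censor of class AFTPoint must be 1.0 or 0.0 but got $censor")
}
```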

How was this patch tested?

Verified against the current test suite. (One test needed to be updated because it supplied zero-valued labels and therefore failed after this patch.)

Please review http://spark.apache.org/contributing.html before opening a pull request.

@@ -18,20 +18,26 @@
package org.apache.spark.ml.regression

import scala.util.Random

Member

(Leave the blank)

Member

This is still causing a style check failure @admackin

@@ -415,4 +421,40 @@ object AFTSurvivalRegressionSuite {
"maxIter" -> 2,
"tol" -> 0.01
)

private[AFTSurvivalRegressionSuite] def checkNumericTypes[M <: Model[M], T <: Estimator[M]](
Member

This is being copied in from MLUtils? why is it necessary?

@admackin (Contributor, Author)

Yes, the version in MLUtils had labels of zero in its test cases, which caused tests to fail after my patch. There didn't seem to be a way to fix that in place, so I thought it better to make a patch that didn't affect potentially dozens of other packages. Any other thoughts on how to achieve this? I could add a 'minLabel' param to the MLUtils methods, but that seems overly specific to this one package.


srowen commented Jan 21, 2017

I imagine there's a better way to do this without copying code. Do you mean the common test code assumes 0 labels are permitted? then maybe it should just not do that, because it's just using any old value to test.

If it doesn't actually assume 0 labels are permitted, then its logic should still work. It's just that this test would need some additional logic to verify that 0 labels cause an exception.
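A test along these lines might be sketched as follows (assuming a ScalaTest suite with a SparkSession `spark` in scope; the dataset values mirror the snippet quoted later in this thread, and the test name is illustrative):

```scala
// Sketch of a test verifying that a zero label fails fast.
// Assumes `spark` (a SparkSession) and ScalaTest's intercept are available.
import org.apache.spark.SparkException
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression

test("AFTSurvivalRegression should fail fast when any label is zero") {
  import spark.implicits._
  val dataset = Seq(
    (0.000, 0.0, Vectors.dense(0.346, 2.158)), // zero label: invalid
    (4.199, 0.0, Vectors.dense(0.795, -0.226))
  ).toDF("label", "censor", "features")
  val aft = new AFTSurvivalRegression()
  intercept[SparkException] {
    aft.fit(dataset)
  }
}
```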


admackin commented Jan 21, 2017 via email


srowen commented Jan 22, 2017

It looks like there is no particular reason that ("0", Vectors.dense(0, 2, 3), 0.0) really needs to set a 0 value as the label. You could make it 1 and see that it still passes all tests.
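Concretely, the suggested edit to the shared test data would just swap the zero label for any positive value, e.g. (sketch):

```scala
// Before: ("0", Vectors.dense(0, 2, 3), 0.0)  // zero label, now rejected
// After:  any positive label exercises the same code path
("0", Vectors.dense(0, 2, 3), 1.0)
```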


SparkQA commented Jan 22, 2017

Test build #3544 has finished for PR 16652 at commit b07c281.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@admackin (Contributor, Author)

I've addressed all the problems, I think: code style is now fixed, MLTestingUtils is patched (and I verified that all MLlib test cases still pass), and I've added a test case for zero-valued labels.


srowen commented Jan 24, 2017

This is looking OK to me, but it needs a (squash, optionally, and) rebase now before it can test again.


SparkQA commented Jan 26, 2017

Test build #3551 has finished for PR 16652 at commit cfdd286.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

(0.000, 0.0, Vectors.dense(0.346, 2.158)), // ← generates error; zero labels invalid
(4.199, 0.0, Vectors.dense(0.795, -0.226)))).toDF("label", "censor", "features")
val aft = new AFTSurvivalRegression()
intercept[SparkException] {
Contributor

it's recommended to verify the error message using withClue, e.g.:

    withClue("label of AFTPoint must be positive") {
      intercept[SparkException] {
        aft.fit(dataset)
      }
    }

@imatiach-msft (Contributor)

looks good to me too. I just added a small suggestion. Thanks!


SparkQA commented Feb 2, 2017

Test build #3552 has finished for PR 16652 at commit c855976.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 4, 2017

Test build #3554 has finished for PR 16652 at commit c855976.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


srowen commented Feb 7, 2017

The core change is looking OK @admackin, but it seems like it fails Python tests? If you have a moment to look at that, it could be all that's needed to get this over the line.

@HyukjinKwon (Member)

gentle ping @admackin
