
[SPARK-19234][MLLib] AFTSurvivalRegression should fail fast when any labels are zero #16652

Closed
wants to merge 6 commits

Conversation

admackin (Contributor)

What changes were proposed in this pull request?

If any labels of 0.0 (which are invalid) are supplied, AFTSurvivalRegression now fails immediately with a clear error, rather than emitting hard-to-interpret warnings and zero-valued coefficients in the output.
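The fail-fast behaviour can be sketched roughly as a `require` guard on the label when training points are constructed (a sketch only; the exact placement, field order, and message in the patch may differ):

```scala
import org.apache.spark.ml.linalg.Vector

// Sketch: validate the label as soon as a training point is built, so that
// fit() fails immediately instead of emitting warnings and producing
// zero-valued coefficients. Field order here is an assumption.
private[regression] case class AFTPoint(features: Vector, label: Double, censor: Double) {
  require(label > 0.0, s"label of AFTPoint must be positive but got $label")
  require(censor == 1.0 || censor == 0.0,
    s"censor of class AFTPoint must be 1.0 or 0.0 but got $censor")
}
```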

How was this patch tested?

Verified against the current test suite. (One test needed to be updated because it supplied zero-valued labels and therefore failed after this patch.)

Please review http://spark.apache.org/contributing.html before opening a pull request.

@@ -18,20 +18,26 @@
package org.apache.spark.ml.regression

import scala.util.Random

Member

(Leave the blank)

Member

This is still causing a style check failure @admackin

@@ -415,4 +421,40 @@ object AFTSurvivalRegressionSuite {
"maxIter" -> 2,
"tol" -> 0.01
)

private[AFTSurvivalRegressionSuite] def checkNumericTypes[M <: Model[M], T <: Estimator[M]](
Member

This is being copied in from MLUtils? why is it necessary?

@admackin (Contributor, Author)

Yes, the version in MLUtils had labels of zero in its test cases, which caused tests to fail after my patch. There didn't seem to be a way to fix that in place, so I thought it better to make a patch that didn't affect potentially dozens of other packages. Any other thoughts on how to achieve this? I could add a 'minLabel' param to the MLUtils methods, but that seems overly specific to this one package.


srowen commented Jan 21, 2017

I imagine there's a better way to do this without copying code. Do you mean the common test code assumes 0 labels are permitted? then maybe it should just not do that, because it's just using any old value to test.

If it doesn't actually assume 0 labels are permitted, then its logic should still work. It's just that this test would need some additional logic to verify that 0 labels cause an exception.
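A test along these lines might be sketched as follows (assuming a ScalaTest suite with a SparkSession `spark` in scope; the dataset values mirror the snippet quoted later in this thread, and the test name is illustrative):

```scala
// Sketch of a test verifying that a zero label fails fast.
// Assumes `spark` (a SparkSession) and ScalaTest's intercept are available.
import org.apache.spark.SparkException
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression

test("AFTSurvivalRegression should fail fast when any label is zero") {
  import spark.implicits._
  val dataset = Seq(
    (0.000, 0.0, Vectors.dense(0.346, 2.158)), // zero label: invalid
    (4.199, 0.0, Vectors.dense(0.795, -0.226))
  ).toDF("label", "censor", "features")
  val aft = new AFTSurvivalRegression()
  intercept[SparkException] {
    aft.fit(dataset)
  }
}
```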


admackin commented Jan 21, 2017 via email


srowen commented Jan 22, 2017

It looks like there is no particular reason that ("0", Vectors.dense(0, 2, 3), 0.0) really needs to set a 0 value as the label. You could make it 1 and see that it still passes all tests.
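Concretely, the suggested edit to the shared test data would just swap the zero label for any positive value, e.g. (sketch):

```scala
// Before: ("0", Vectors.dense(0, 2, 3), 0.0)  // zero label, now rejected
// After:  any positive label exercises the same code path
("0", Vectors.dense(0, 2, 3), 1.0)
```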


SparkQA commented Jan 22, 2017

Test build #3544 has finished for PR 16652 at commit b07c281.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@admackin (Contributor, Author)

I've addressed all the problems, I think: code style is now fixed, MLTestingUtils is patched (and I verified that all MLlib test cases still pass), and I've added a test case for zero-valued labels.


srowen commented Jan 24, 2017

This is looking OK to me, but it needs a (squash, optionally, and) rebase now before it can test again.


SparkQA commented Jan 26, 2017

Test build #3551 has finished for PR 16652 at commit cfdd286.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

(0.000, 0.0, Vectors.dense(0.346, 2.158)), // ← generates error; zero labels invalid
(4.199, 0.0, Vectors.dense(0.795, -0.226)))).toDF("label", "censor", "features")
val aft = new AFTSurvivalRegression()
intercept[SparkException] {
Contributor

it's recommended to verify the error message using withClue, e.g.:

    withClue("label of AFTPoint must be positive") {
      intercept[SparkException] {
        aft.fit(dataset)
      }
    }

@imatiach-msft (Contributor)

looks good to me too. I just added a small suggestion. Thanks!


SparkQA commented Feb 2, 2017

Test build #3552 has finished for PR 16652 at commit c855976.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 4, 2017

Test build #3554 has finished for PR 16652 at commit c855976.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


srowen commented Feb 7, 2017

The core change is looking OK @admackin, but it seems like it fails Python tests? If you have a moment to look at that, it could be all that's needed to get this over the line.

@HyukjinKwon (Member)

gentle ping @admackin
