SPARK-4111 [MLlib] add regression metrics #2978

Closed
wants to merge 6 commits into from

yanboliang
Contributor

Add RegressionMetrics.scala, which provides regression metrics for model evaluation, and the corresponding test suite RegressionMetricsSuite.scala.

@AmplabJenkins

Can one of the admins verify this patch?

* Computes R^2^, the coefficient of determination.
* @return
*/
def r2_socre(): Double = {
Member

Typo in name

@srowen
Member

srowen commented Oct 28, 2014

Update the title with SPARK-XXXX [MLLIB]

private lazy val summarizer: MultivariateOnlineSummarizer = {
val summarizer: MultivariateOnlineSummarizer = valuesAndPreds.map{
case (value,pred) => Vectors.dense(
Array(value, pred, value - pred, math.abs(value - pred), math.pow(value - pred, 2.0))
Member

I don't see that stats for pred are used?

Contributor Author

Yes, it's not used, and I have removed it in a new commit.

@yanboliang yanboliang changed the title add regression metrics SPARK-4111 [MLlib] add regression metrics Oct 28, 2014
@yanboliang
Contributor Author

Renamed r2_socre() to r2_score() and removed the unused column.

* @return
*/
def r2_score(): Double = {
1 - summarizer.mean(3) * summarizer.count / (summarizer.variance(0) * (summarizer.count - 1))
Member

I think this might be worth a comment to explain what sums of squares you are trying to compute in the numerator and denominator. A link to the definition might be good, here and for explained variance, since they are related.
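
For reference, a minimal sketch of the two sums of squares behind that expression, in plain Scala with no Spark dependency. It assumes column 3 of the summarizer holds the squared error and column 0 the observed value (as in the snippet under review), and that `variance` is the unbiased sample variance, so that `mean(3) * count` is the residual sum of squares and `variance(0) * (count - 1)` is the total sum of squares. `R2Sketch` and `r2` are hypothetical names used only for illustration, not part of the patch.

```scala
object R2Sketch {
  /** R^2 = 1 - SS_res / SS_tot over (prediction, observation) pairs. */
  def r2(predictionAndObservations: Seq[(Double, Double)]): Double = {
    val observations = predictionAndObservations.map(_._2)
    val mean = observations.sum / observations.size
    // Residual sum of squares: sum of (observation - prediction)^2.
    val ssRes = predictionAndObservations.map { case (prediction, observation) =>
      val error = observation - prediction
      error * error
    }.sum
    // Total sum of squares: sum of (observation - mean)^2.
    val ssTot = observations.map { o => (o - mean) * (o - mean) }.sum
    1.0 - ssRes / ssTot
  }
}
```

Explained variance is built from the same total sum of squares, which is why a shared comment or link covering both definitions would help.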

@srowen
Member

srowen commented Oct 28, 2014

This is picky now, but you might write out "meanAbsoluteError" instead of saying "mae". Is "r2_score" stylistically correct, as opposed to "r2Score"? (Sorry, I should have thought of that earlier.) Finally, consider using @return tags in your scaladoc to describe what is being returned, instead of leaving them blank and writing the docs in the body.

private lazy val summarizer: MultivariateOnlineSummarizer = {
val summarizer: MultivariateOnlineSummarizer = valuesAndPreds.map{
case (value,pred) => Vectors.dense(
Array(value, value - pred, math.abs(value - pred), math.pow(value - pred, 2.0))
Member

Also picky but you can avoid math.pow and avoid computing value - pred 3 times here with a local var. Might be cleaner. This LGTM for what it's worth.
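
A minimal sketch of that suggestion, assuming the four-column layout shown in the snippet under review. The difference is computed once into a local val and squared by multiplication rather than `math.pow`. `ErrorColumnsSketch` and `errorColumns` are hypothetical names, and the result is a plain array rather than an MLlib vector so the example stays self-contained.

```scala
object ErrorColumnsSketch {
  /** Builds (value, error, |error|, error^2), computing the error only once. */
  def errorColumns(value: Double, pred: Double): Array[Double] = {
    val error = value - pred
    Array(value, error, math.abs(error), error * error)
  }
}
```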

Contributor

The third and the fourth columns are not necessary. You can use normL1 and normL2 on the second column:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala#L219
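
To illustrate why the extra columns become redundant, here is a rough sketch in plain Scala. It assumes the summarizer's normL1 and normL2 are the per-column L1 norm (sum of absolute values) and Euclidean norm (square root of the sum of squares). The helpers below are hypothetical stand-ins written for this example, not the Spark API.

```scala
object NormMetricsSketch {
  // L1 norm: sum of absolute values.
  def normL1(errors: Seq[Double]): Double = errors.map(e => math.abs(e)).sum
  // L2 norm: square root of the sum of squares.
  def normL2(errors: Seq[Double]): Double = math.sqrt(errors.map(e => e * e).sum)

  /** Mean absolute error from the L1 norm of the error column. */
  def meanAbsoluteError(errors: Seq[Double]): Double =
    normL1(errors) / errors.size

  /** Mean squared error from the squared L2 norm of the error column. */
  def meanSquaredError(errors: Seq[Double]): Double = {
    val l2 = normL2(errors)
    l2 * l2 / errors.size
  }

  /** Root mean squared error from the L2 norm of the error column. */
  def rootMeanSquaredError(errors: Seq[Double]): Double =
    normL2(errors) / math.sqrt(errors.size.toDouble)
}
```

With only the error column tracked, the absolute-error and squared-error statistics both fall out of the two norms and the count.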

@yanboliang
Contributor Author

Renamed parameters and functions to be consistent with Spark naming conventions.
Deleted unused columns and made prediction the first column.
Added an explanation and a reference for r2Score and explained variance.
Other code style cleanups.

@mengxr
Contributor

mengxr commented Oct 29, 2014

ok to test

@mengxr
Contributor

mengxr commented Oct 29, 2014

test this please

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22457 has started for PR 2978 at commit a8ad3e3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22457 timed out for PR 2978 at commit a8ad3e3 after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22457/

* @param predictionAndObservations an RDD of (prediction,observation) pairs.
*/
@Experimental
class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) extends Logging {
Contributor

space after ,

*/
private lazy val summarizer: MultivariateOnlineSummarizer = {
val summarizer: MultivariateOnlineSummarizer = predictionAndObservations.map{
case (prediction,observation) => Vectors.dense(
Contributor

space after ,

@SparkQA

SparkQA commented Oct 30, 2014

Test build #22527 has started for PR 2978 at commit 3d0bec1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 30, 2014

Test build #22528 has started for PR 2978 at commit 730d0a9.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 30, 2014

Test build #22527 has finished for PR 2978 at commit 3d0bec1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) extends Logging

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22527/

@SparkQA

SparkQA commented Oct 30, 2014

Test build #22528 has finished for PR 2978 at commit 730d0a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) extends Logging

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22528/

@mengxr
Contributor

mengxr commented Oct 30, 2014

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in d932719 Oct 30, 2014
@yanboliang yanboliang deleted the regression_metrics branch February 19, 2015 14:21
6 participants