
[SPARK-6332] [MLlib] compute calibration curve for binary classifier #5025

Closed
wants to merge 7 commits

Conversation

robert-dodier

This PR contains an implementation of a calibration method in the class BinaryClassificationMetrics. The code was adapted from the method for ROC curve construction. Tests on small data sets have been added to BinaryClassificationMetricsSuite, and the current version of the code passes those tests.

In this implementation, the return value of the new method is an RDD[((Double, Double), (Double, Long))]. The first pair describes each bin and the second pair describes the content of each bin. In the first pair, the two values are the least and greatest scores in the bin. In the second pair, the two values are the proportion of positive examples in the bin, and the number of examples in the bin. I chose this representation in order to keep as much information as possible. However, a simpler representation might be better; let's talk about that if anyone is interested.
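
For concreteness, here is a minimal sketch of how that return value could be consumed. Note that `calibration` is the method proposed in this PR (the name is taken from the doc comment below), not part of released MLlib, and the sample data is invented:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[*]", "calibration-demo")

    // (score, label) pairs, the input BinaryClassificationMetrics expects.
    val scoreAndLabels: RDD[(Double, Double)] = sc.parallelize(Seq(
      (0.1, 0.0), (0.2, 0.0), (0.4, 1.0), (0.6, 0.0), (0.8, 1.0), (0.9, 1.0)))

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)

    // Proposed method: one record per bin, keyed by the bin's score range.
    val bins: RDD[((Double, Double), (Double, Long))] = metrics.calibration

    bins.collect().foreach { case ((minScore, maxScore), (posFrac, count)) =>
      // For a well-calibrated model, posFrac lies near the bin's score range.
      println(f"scores in [$minScore%.2f, $maxScore%.2f]: " +
        f"$posFrac%.2f positive across $count examples")
    }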

@jkbradley
Member

@robert-dodier Thanks for the PR! I added a couple of clarification questions to the JIRA.

@jkbradley
Member

ok to test

@AmplabJenkins

Can one of the admins verify this patch?

@@ -141,6 +141,7 @@ private[spark] abstract class ProbabilisticClassificationModel[
  *
  * WARNING: Not all models output well-calibrated probability estimates! These probabilities
  * should be treated as confidences, not precise probabilities.
+ * See also BinaryClassificationMetrics.calibration to assess calibration.
Contributor
"[[DoubleBrackets]]" to generate API doc link

@feynmanliang
Contributor

Can you resolve merge conflicts?

I made a quick pass for style and organization; I did not check correctness. Overall, it looks like there is quite a bit of repeated code between here and cumulativeCounts/confusions, which we may be able to refactor out.
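
For illustration, one possible shape for the shared helper that comment alludes to; countsByScore and its body are hypothetical, and only the names cumulativeCounts and confusions come from the comment above. Both the ROC machinery and calibration start by aggregating label counts per distinct score:

    import org.apache.spark.rdd.RDD

    object CalibrationHelpers {
      // Per distinct score: (number of positive labels, total examples),
      // sorted by decreasing score. This is essentially the aggregation the
      // existing ROC code performs before computing cumulative counts, and
      // calibration could reuse it before grouping scores into bins.
      def countsByScore(
          scoreAndLabels: RDD[(Double, Double)]): RDD[(Double, (Long, Long))] = {
        scoreAndLabels
          .combineByKey(
            (label: Double) => (if (label > 0.5) 1L else 0L, 1L),
            (c: (Long, Long), label: Double) =>
              (c._1 + (if (label > 0.5) 1L else 0L), c._2 + 1L),
            (c1: (Long, Long), c2: (Long, Long)) =>
              (c1._1 + c2._1, c1._2 + c2._2))
          .sortByKey(ascending = false)
      }
    }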

Robert Dodier added 7 commits October 5, 2015 15:40
 o ProbabilisticClassifier.scala:
    mention calibration in comments

 o BinaryClassificationMetrics.scala:
    adapting code for ROC to calibration; incomplete and commented
    out for now

 o BinaryClassificationMetricsSuite.scala:
    tests for calibration
 o ... types to what calibration actually returns.
@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

5 participants