[SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationMetrics #364

Closed
mengxr wants to merge 21 commits from mengxr/auc

@mengxr mengxr commented Apr 9, 2014

This PR implements a generic version of AreaUnderCurve using the RDD.sliding implementation from #136. It also includes a refactoring of #160 for binary classification evaluation.
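For readers unfamiliar with the approach, here is a minimal sketch of computing the area under a curve with the trapezoidal rule over a sliding window of two consecutive points. It uses a local `Seq` and its built-in `sliding` rather than the `RDD.sliding` from #136, and the names `AreaUnderCurveSketch`, `trapezoid`, and `of` are assumptions for illustration, not the code in this PR.

```scala
// Local-collection sketch only; the PR itself works on RDDs via RDD.sliding.
object AreaUnderCurveSketch {

  /** Area of the trapezoid formed by two consecutive curve points (x, y). */
  def trapezoid(p: (Double, Double), q: (Double, Double)): Double =
    (q._1 - p._1) * (p._2 + q._2) / 2.0

  /** Sums trapezoids over consecutive point pairs; the curve must be sorted by x.
    * Windows with fewer than two points (empty or single-point curves) contribute 0.
    */
  def of(curve: Seq[(Double, Double)]): Double =
    curve.sliding(2).collect { case Seq(p, q) => trapezoid(p, q) }.sum

  def main(args: Array[String]): Unit = {
    // Hypothetical ROC points, given as (false positive rate, true positive rate).
    val roc = Seq((0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0))
    println(of(roc))
  }
}
```

Each window only needs its two neighbouring points, which is why a sliding window fits a partitioned dataset once the boundaries between partitions are handled.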

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13921/

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@mengxr mengxr changed the title [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationEvaluator [SPARK-1225, 1241] [MLLIB] [WIP] Add AreaUnderCurve and BinaryClassificationEvaluator Apr 9, 2014
@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13924/

@mengxr mengxr changed the title [SPARK-1225, 1241] [MLLIB] [WIP] Add AreaUnderCurve and BinaryClassificationEvaluator [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationEvaluator Apr 9, 2014
@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13994/

@pwendell

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13997/

@mateiz mateiz commented Apr 10, 2014

Jenkins, test this please

totalCount: LabelCounter) extends BinaryConfusionMatrix with Serializable {

/** number of true positives */
override def numTruePositives: Long = count.numPositives
Contributor

Just a minor question: do you want to call these numTruePositives or just truePositives? Anyway, I'm happy to merge it as is; I just felt truePositives would be shorter.

Contributor Author

It is shorter, but it does not carry the exact meaning. Similarly, I prefer numCols over cols for a matrix.
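To make the naming point concrete, here is a small sketch of a confusion-matrix interface where every count carries the `num` prefix. Only `BinaryConfusionMatrix`, `numTruePositives`, and `numPositives` appear in the snippet under review; the other members, the `Counts` case class, and the `precision` and `recall` helpers (including returning 0.0 when the denominator is zero) are assumptions made for illustration.

```scala
object ConfusionMatrixSketch {

  /** Counts use a `num` prefix so each name reads as a count, not as a collection. */
  trait BinaryConfusionMatrix {
    def numTruePositives: Long
    def numFalsePositives: Long
    def numFalseNegatives: Long
    def numTrueNegatives: Long

    /** Number of actual positives. */
    def numPositives: Long = numTruePositives + numFalseNegatives

    /** Number of actual negatives. */
    def numNegatives: Long = numFalsePositives + numTrueNegatives
  }

  /** Hypothetical concrete holder for the four counts. */
  final case class Counts(
      numTruePositives: Long,
      numFalsePositives: Long,
      numFalseNegatives: Long,
      numTrueNegatives: Long) extends BinaryConfusionMatrix

  /** Fraction of predicted positives that are correct; 0.0 if nothing was predicted positive. */
  def precision(m: BinaryConfusionMatrix): Double = {
    val predictedPositives = m.numTruePositives + m.numFalsePositives
    if (predictedPositives == 0L) 0.0 else m.numTruePositives.toDouble / predictedPositives
  }

  /** Fraction of actual positives that were found; 0.0 if there are no positives. */
  def recall(m: BinaryConfusionMatrix): Double =
    if (m.numPositives == 0L) 0.0 else m.numTruePositives.toDouble / m.numPositives

  def main(args: Array[String]): Unit = {
    val m = Counts(8L, 2L, 4L, 6L)
    println(s"precision=${precision(m)} recall=${recall(m)}")
  }
}
```

With the prefix, `m.numTruePositives` is clearly a `Long` count, whereas `m.truePositives` could be read as the set of true-positive examples.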

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14020/

@mengxr mengxr commented Apr 11, 2014

The test failure was caused by nondeterministic behavior in RDDSuite, which is fixed in #387.

@mengxr mengxr commented Apr 11, 2014

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14047/

@mateiz mateiz commented Apr 11, 2014

Thanks Xiangrui! Merged into both master and branch-1.0.

@asfgit asfgit closed this in f5ace8d Apr 12, 2014
asfgit pushed a commit that referenced this pull request Apr 12, 2014
…nMetrics

This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from #136 . It also contains refactoring of #160 for binary classification evaluation.

Author: Xiangrui Meng <[email protected]>

Closes #364 from mengxr/auc and squashes the following commits:

a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names
3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr
fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics
b1b7dab [Xiangrui Meng] fix code styles
9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator
ca31da5 [Xiangrui Meng] remove PredictionAndResponse
3d71525 [Xiangrui Meng] move binary evalution classes to evaluation.binary
8f78958 [Xiangrui Meng] add PredictionAndResponse
dda82d5 [Xiangrui Meng] add confusion matrix
aa7e278 [Xiangrui Meng] add initial version of binary classification evaluator
221ebce [Xiangrui Meng] add a new test to sliding
a920865 [Xiangrui Meng] Merge branch 'sliding' into auc
a9b250a [Xiangrui Meng] move sliding to mllib
cab9a52 [Xiangrui Meng] use last for the last element
db6cb30 [Xiangrui Meng] remove unnecessary toSeq
9916202 [Xiangrui Meng] change RDD.sliding return type to RDD[Seq[T]]
284d991 [Xiangrui Meng] change SlidedRDD to SlidingRDD
c1c6c22 [Xiangrui Meng] add AreaUnderCurve
65461b2 [Xiangrui Meng] Merge branch 'sliding' into auc
5ee6001 [Xiangrui Meng] add TODO
d2a600d [Xiangrui Meng] add sliding to rdd

(cherry picked from commit f5ace8d)
Signed-off-by: Matei Zaharia <[email protected]>
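
Commit 3f42e98 above adds (0, 0) and (1, 1) to the ROC curve. The following is a rough local sketch of what that padding looks like, assuming input pairs of (score, label) with labels 0.0 or 1.0 and one curve point per distinct score used as a threshold. The object and method names are hypothetical, and this is not the RDD-based code from the PR.

```scala
object RocSketch {

  /** Returns ROC points (false positive rate, true positive rate), one per distinct
    * score threshold in descending order, padded with (0, 0) in front and (1, 1) at the end.
    */
  def roc(scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, Double)] = {
    val numPos = scoreAndLabels.count(_._2 > 0.5).toDouble
    val numNeg = scoreAndLabels.size - numPos
    require(numPos > 0 && numNeg > 0, "need at least one positive and one negative example")

    // Highest threshold first, so the curve moves from the origin towards (1, 1).
    val thresholds = scoreAndLabels.map(_._1).distinct.sorted.reverse

    val inner = thresholds.map { t =>
      // Everything scored at or above the threshold is predicted positive.
      val predictedPositive = scoreAndLabels.filter(_._1 >= t)
      val tp = predictedPositive.count(_._2 > 0.5)
      val fp = predictedPositive.size - tp
      (fp / numNeg, tp / numPos)
    }

    // Padding makes the curve span the whole [0, 1] range on the x axis.
    (0.0, 0.0) +: inner :+ (1.0, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq((0.9, 1.0), (0.6, 0.0), (0.4, 1.0), (0.1, 0.0))
    roc(data).foreach(println)
  }
}
```

One likely motivation for the padding is that an area computed over the curve then always covers the full interval from 0 to 1, regardless of which thresholds occur in the data.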
@mengxr mengxr deleted the auc branch May 7, 2014 00:09
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
tangzhankun pushed a commit to tangzhankun/spark that referenced this pull request Jul 25, 2017
* Adding PySpark Submit functionality. Launching Python from JVM

* Addressing scala idioms related to PR351

* Removing extends Logging which was necessary for LogInfo

* Refactored code to leverage the ContainerLocalizedFileResolver

* Modified Unit tests so that they would pass

* Modified Unit Test input to pass Unit Tests

* Setup working environent for integration tests for PySpark

* Comment out Python thread logic until Jenkins has python in Python

* Modifying PythonExec to pass on Jenkins

* Modifying python exec

* Added unit tests to ClientV2 and refactored to include pyspark submission resources

* Modified unit test check

* Scalastyle

* PR 348 file conflicts

* Refactored unit tests and styles

* further scala stylzing and logic

* Modified unit tests to be more specific towards Class in question

* Removed space delimiting for methods

* Submission client redesign to use a step-based builder pattern.

This change overhauls the underlying architecture of the submission
client, but it is intended to entirely preserve existing behavior of
Spark applications. Therefore users will find this to be an invisible
change.

The philosophy behind this design is to reconsider the breakdown of the
submission process. It operates off the abstraction of "submission
steps", which are transformation functions that take the previous state
of the driver and return the new state of the driver. The driver's state
includes its Spark configurations and the Kubernetes resources that will
be used to deploy it.

Such a refactor moves away from a features-first API design, which
considers different containers to serve a set of features. The previous
design, for example, had a container files resolver API object that
returned different resolutions of the dependencies added by the user.
However, it was up to the main Client to know how to intelligently
invoke all of those APIs. Therefore the API surface area of the file
resolver became untenably large, and it was not obvious how it was
meant to be used or extended.

This design changes the encapsulation layout; every module is now
responsible for changing the driver specification directly. An
orchestrator builds the correct chain of steps and hands it to the
client, which then calls it verbatim. The main client then makes any
final modifications that put the different pieces of the driver
together, particularly to attach the driver container itself to the pod
and to apply the Spark configuration as command-line arguments.

* Don't add the init-container step if all URIs are local.

* Python arguments patch + tests + docs

* Revert "Python arguments patch + tests + docs"

This reverts commit 4533df2.

* Revert "Don't add the init-container step if all URIs are local."

This reverts commit e103225.

* Revert "Submission client redesign to use a step-based builder pattern."

This reverts commit 5499f6d.

* style changes

* space for styling
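
The "submission steps" idea described in the commit message above (steps as functions from the previous driver state to the new one, chained by an orchestrator) can be sketched roughly as follows. Every name here (`DriverSpec`, `SubmissionStep`, `runSteps`, the two example steps, and the `"driver-service"` label) is hypothetical and chosen for illustration; none of it is taken from the referenced commit, which the same list also records as reverted.

```scala
object SubmissionStepsSketch {

  /** Hypothetical driver state: Spark configuration plus names of resources to deploy. */
  final case class DriverSpec(sparkConf: Map[String, String], resources: Vector[String])

  /** A step takes the previous driver state and returns the new one. */
  type SubmissionStep = DriverSpec => DriverSpec

  /** Example step: sets a configuration entry. */
  val setAppName: SubmissionStep =
    spec => spec.copy(sparkConf = spec.sparkConf + ("spark.app.name" -> "demo"))

  /** Example step: registers an extra resource (placeholder label). */
  val addDriverService: SubmissionStep =
    spec => spec.copy(resources = spec.resources :+ "driver-service")

  /** Applies the steps in order, threading the driver state through each one. */
  def runSteps(initial: DriverSpec, steps: Seq[SubmissionStep]): DriverSpec =
    steps.foldLeft(initial)((spec, step) => step(spec))

  def main(args: Array[String]): Unit = {
    val result = runSteps(DriverSpec(Map.empty, Vector.empty), Seq(setAppName, addDriverService))
    println(result)
  }
}
```

In this shape the caller only needs to know how to run a list of steps, while each step owns its own change to the driver specification.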
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
mccheah pushed a commit to mccheah/spark that referenced this pull request Oct 3, 2018
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019