[SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs #3637

jkbradley · 2014-12-08T20:00:50Z

This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]

UPDATE: Most of the APIs are being kept private[spark] to allow further discussion. Here is a list of changes which are public:

new output columns: rawPrediction, probabilities
- The “score” column is now called “rawPrediction”
Classifiers now provide numClasses
Params.get and .set are now protected instead of private[ml].
ParamMap now has a size method.
new classes: LinearRegression, LinearRegressionModel
LogisticRegression now has an intercept.

Sketch of APIs (most of which are private[spark] for now)

Abstract classes for learning algorithms (+ corresponding Model abstractions):

Classifier (+ ClassificationModel)
ProbabilisticClassifier (+ ProbabilisticClassificationModel)
Regressor (+ RegressionModel)
Predictor (+ PredictionModel)
For all of these:
- There is no strongly typed training-time API.
- There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.

Concrete classes: learning algorithms

LinearRegression
LogisticRegression (updated to use new abstract classes)
- Also, removed "score" in favor of "probability" output column. Changed BinaryClassificationEvaluator to match. (SPARK-5031)

Other updates:

params.scala: Changed Params.set/get to be protected instead of private[ml]
- This was needed for the example of defining a class from outside of the MLlib namespace.
VectorUDT: Will later change from private[spark] to public.
- This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
- Also, added equals() method.f
SPARK-4942 : ML Transformers should allow output cols to be turned on,off
- Update validateAndTransformSchema
- Update transform
(Updated examples, test suites according to other changes)

New examples:

DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
- Added Java version too

Test Suites:

LinearRegressionSuite
LogisticRegressionSuite
+ Java versions of above suites

CC: @mengxr @etrain @shivaram

SparkQA · 2014-12-08T20:05:27Z

Test build #24226 has started for PR 3637 at commit 1e46094.

This patch merges cleanly.

SparkQA · 2014-12-08T20:05:43Z

Test build #24226 has finished for PR 3637 at commit 1e46094.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class LabeledPoint(label: Double, features: Vector, weight: Double)

AmplabJenkins · 2014-12-08T20:05:44Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24226/
Test FAILed.

SparkQA · 2014-12-08T20:20:34Z

Test build #24227 has started for PR 3637 at commit 83109eb.

This patch merges cleanly.

SparkQA · 2014-12-08T21:17:27Z

Test build #24227 has finished for PR 3637 at commit 83109eb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class LabeledPoint(label: Double, features: Vector, weight: Double)

AmplabJenkins · 2014-12-08T21:17:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24227/
Test FAILed.

jkbradley · 2014-12-08T22:50:01Z

The test failure reveals an issue in Spark SQL (ScalaReflection.scala:121 in schemaFor) where it gets confused if the case class includes multiple constructors. The default behavior should probably be to take the constructor with the most arguments, but I'll consult others about this. This PR may be on temporary hold...but feel free to comment since the Spark SQL fix should not require changing this PR!

Lewuathe · 2014-12-08T23:06:56Z

mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala

+ */
+@AlphaComponent
+@BeanInfo
+case class LabeledPoint(label: Double, features: Vector, weight: Double) {


Why is a label of LabeledPoint assumed as only Double? I think there are some cases where label is not Double such as one-of-k encoding. It seems better not to restrict to Double type. If I missed some alternatives, sorry for that and please let me know.

There's some discussion of this in the design doc linked from the JIRA. Basically, there could be a whole range of types, and it's a question of simplicity of the API vs. strong typing. I thought about templatizing LabeledPoint by LabelType and FeaturesType, but it makes developers & users have to write a bunch more boilerplate whenever they specify types. The current plan is to use Double for single labels and eventually some other type (Vector?) for multiple labels.

I'm also more sympathetic with a strongly-typed API here rather than overload floating-point values to represent unordered categories. Are there really so many possibilities? Any continuous or ordinal value really does naturally translate to a double. Categoricals are the only other type of value that needs a separate representation.

I feel like this misses some opportunities to optimize the internal representation (e.g. a Dataset whose feature is known to be one of N values doesn't need a double, but potentially just N bits) and avoid ambiguities of representation (is negative -1? 0?) This is one of those areas where the 'simple' API just seems to push complexity elsewhere or ignore it. An algorithm either has to have its own checks for whether 1.0 is a category or not, or, overlooks the distinction. Same with the caller.

I agree with you about strong types being nice. I think optimizations with using fewer bits should remain internal. For the user-facing API, it's really a question about whether we want to incur some extra overhead with strong types:

Users & developers have to write LabeledPoint[Double, Vector] or LabeledPoint[Int, Vector] instead of LabeledPoint.

As far as I know, we can't have default type parameters, so users could never write LabeledPoint. (a rare time I miss C++)

Algorithm APIs get messier to look at (passing about LabelType and FeaturesType). This is especially annoying with meta-algorithms (boosting & bagging).

Personally, I'd be OK with heavy typing (coming from C++ land), but it might offend some Scala users.

CC: @mengxr since I know you have opinions about this

I think it need not be designed with generic types. In fact it can't really since there are N features of different types. But you can have a Feature class with subclasses for ordered and categorical types. That too has its own tradeoffs.

By the way, I think I'd be OK with the 2nd option above as long as we can come up with simple APIs for users who don't want to think about types. That might involve default conversions between types (like one-hot encoding, and binning).

I mean the former. Yeah, that's probably the downside. Each data element is at least an object, and you can't have it reduce to a double[] under the hood.

In the second example, I think you'd only ever really want MixedFeatures as an abstraction. There's no need to think of all CategoricalFeatures as a special case deserving a unique abstraction.

I suppose if you abstract the entire training example as an object, and allow accessors like getNumericFeature(index: Int), getCategoricalFeature(index: Int) you can still internally optimize the storage while exposing a richer object representation. You get the type safety and optimization opportunity.

Sure, an Array[Double] could easily be translated into one of the more elaborate representations above. I suppose I myself wouldn't want to make it too easy to not think about types!

Anyway, up to your judgment really. There are arguments several ways here. Worth a thought to see if the idea of a bit more abstraction appeals to you.

The main pros & cons I see for having continuous & categorical types for labels & features are:

Pros:

Type safety.

Cons:

Algorithms may need to make copies of the data, depending on how much we expose internals of a features type.

Users may have to worry more about types.

For labels, if we load data from a file without metadata (like libsvm), we may need to assume that everything is continuous. Users will have to explicitly cast labels to categorical for classification.

For features, strong typing implies a stronger contract, where the assumption is that users specify the correct types. I've been wondering about having more "best effort" APIs, where we take suggestions from users (like DecisionTree's categoricalFeaturesInfo) but otherwise try to infer the best types to use under the hood.

These lists started out much more balanced, but I guess I'm voting for the old system where everything is a Double.

All good points. I think there's a "pro" for optimizing storage but that's a bit secondary.

I don't think a caller has to 'translate' 0/1 labels to categorical. These can be labels called 0 and 1. Given schema information, all of this is stuff frameworks can do for you. Is there really a case where the user doesn't know schema types, suggests a type, and lets the framework override it?

So, let's say my feature takes on "foo", "bar", "baz" as values. Doesn't the caller always have to translate to/from numbers? no big deal but is that simpler? I think the schema abstraction is going to help with this, I imagine.

Anyway, not really strongly arguing with you here, just completing the discussion. An array of doubles is kind of raw but certainly does the job.

For optimizing storage, I think we can do it in either case (Vector for weakly typing, or Array[Double] for strong typing).

I too agree both options are good & just wanted to hash out the pros & cons.

Is there really a case where the user doesn't know schema types, suggests a type, and lets the framework override it?

That's not quite what I meant. I was thinking of 2 use cases:

An expert user specifies types. The algorithm should use those types.

A beginner user does not explicitly specify types, but uses the typed API so that loaded data are given default types (probably continuous). Following the first case, the algorithm should use the given types, but following intuition, the algorithm should infer types.
This is less of an issue if the typed interface is designated as an expert API.

I'm starting to wonder if the typed interface should be completely public. I'll move this to a non-inline comment so the rest of our discussion does not get hidden in the future.

jkbradley · 2014-12-09T19:56:33Z

Question: Do people have preferences for the name of what is currently "predictRaw?" Possibilities are:

predictRaw()
predictConfidence()
confidences()

(I did not include predictScore since "score" is already taken by logreg to mean the probability estimate.)

jkbradley · 2014-12-09T22:17:51Z

@srowen @Lewuathe Continuing the above inline discussion...

Question: Should the typed interface be public?

New proposal: Hide the typed interface of Estimators. Leave the typed interface of Transformers exposed.

Argument:

The typed interface loses metadata which SchemaRDD can (but does not yet) store.
- E.g., for Classifiers, it is good to know the number of classes to predict, which features are categorical, and the number of categories for each categorical feature. The current typed train() methods do not have this info; to pass in this info, we'll need either (a) extra parameters in train() which would make Classifiers have a different signature than other Estimators' train() methods or (b) extra embedded parameters in Classifiers which would be ignored when using the fit(SchemaRDD) interface. Neither option sounds good to me.
- We could use a typed interface with stronger typing for features, but that would still not cover metadata like # classes / categories.
- This metadata is important for training, but it is not important for testing. We would just need to make sure that Vectors passed predict() methods had the same feature order as used for training.
I would guess the typed interface would be most useful for Models. This is based on me assuming that:
- Models will be kept for longer and might have predict() methods called multiple times, including on individual instances, and
- Models might need typed APIs for efficiency if used in production.

What do you think?

srowen · 2014-12-09T23:23:01Z

Pardon, which part are you referring to when you say the typed interface? The metadata about what is a categorical feature, what the values are, etc? I assumed that this information is in part derived from what's in the SchemaRDD -- a string must be categorical, for example. The information would live in a Model, either implicitly explicitly: logistic regression always knows its target is categorical; may need to remember how to treat the other features when it receives new data. And to some degree this info would be injected as parameters to an Estimator somehow.

jkbradley · 2014-12-09T23:46:22Z

Oh, apologies for being unclear. I meant this division:

Typed interface: train(RDD[LabeledPoint]), predict(Vector)
SchemaRDD interface: fit(SchemaRDD), transform(SchemaRDD)

I agree metadata is less of an issue for Models since they can get metadata from their parent Estimator. The main issue is specifying metadata for Estimators.

shivaram · 2014-12-10T04:41:41Z

@jkbradley Apologies for the delay - I just read your design doc and am catching up on this discussion.
Sorry if I missed something, but could you clarify the use case here ? I can see two kinds of scenarios

Cases where we just want to use existing classifier like LogisticRegression in a pipeline. I guess the train() interface shouldn't really matter here as they we will be passing around SchemaRDDs in a pipeline and calling fit (thus going through the untyped API ?).
Cases where developers want to implement a new Classification or Regression method. For these cases I think the strongly typed API would help in reducing the amount of cruft code and possible bugs in extracting features, labels etc.

FWIW I agree with the conclusion of keeping LabeledPoint simple as (Double, Array[Double]). And I think predictRaw is also probably fine as meaning of the values returned may vary (as noted in your comment).

Lewuathe · 2014-12-10T07:51:37Z

My initial question was based on the viewpoint of developer api.
Simple api seems to restrict the possibility of implementation of new algorithm. As @shivaram mentioned, developer cannot use optimization api because these datatypes does not support simply multi classification etc. So I think typed interface does not necessarily be public in this context. If you have different viewpoint such as end user of this api, that opinion should be included in this discussion.

srowen · 2014-12-10T10:17:45Z

So, I may not be 100% up to speed with the new API and these changes, so my comments may be a bit off, but:

An Estimator makes a Model. To make a model, you need "raw data" and its interpretation, if you will. a LabeledPoint is "raw data". That alone is not sufficient to train a Classifier (Estimator). Yes, this extra info has to come from somewhere.

I agree that SchemaRDD contains, or could contain, or could be made to deduce, this extra interpretation, so the SchemaRDD API makes sense to me.

If LabeledPoint is to remain the "raw data", given the conversation here, then it has to be parameters or something. I think you still need these for testing, right? you still need to know what the raw data means. Or is it assumed that the built Classifier / Model stores this info?

This is sort of a rehash of the same exchange we just had, in that the question is caused by the input data abstraction not really containing all the input -- the metadata comes along separately. Which could be OK but yes it means this question pops up somewhere else in the API.

Yes, a Model may be able to remember the metadata and accept raw LabeledPoints in the future. You just have to make sure you are feeding raw LabeledPoints that use the same metadata, but that's a given no matter how you design this.

To answer the question: given the question, I'd hide the typed API, I suppose. I think the typed API has to take some other values to contain metadata like the type of features, etc. These could be more parameters, then? it kind of overloads the meaning, since the parameters look like they are intended to be hyper parameters. But it's not crazy.

Transformations: these feel like these could meaningfully operate on raw data, so, typed API makes sense to me and could be public now.

jkbradley · 2014-12-10T19:32:52Z

Thanks everyone for all of the comments!

@shivaram No problem, thanks for checking out the design doc! The 2 main use cases you listed are correct. There is a remaining question about whether the strongly typed API should remain public; my initial plan was to make it public, but the above discussion is making me wonder if part should be private.

@Lewauthe I agree about discussing whether the typed API should be public or private. (But I'm not quite sure what you meant about the optimization API; please clarify.) I hope the updates to the design doc below help.

@srowen It sounds like we're converging to a solution, and I hope some use cases in the design doc will help.

I've updated the design doc for [https://issues.apache.org/jira/browse/SPARK-3702], which is the parent task of [https://issues.apache.org/jira/browse/SPARK-4789]. (I'm asking someone to add the subtask connection.) The design doc is linked from the main JIRA and also here:
[https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/edit?usp=sharing]

Perhaps we can discuss more on the JIRA rather than here.

…ctors Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now. Needed for [#3637] CC: marmbrus Author: Joseph K. Bradley <[email protected]> Closes #3646 from jkbradley/sql-reflection and squashes the following commits: 796b2e4 [Joseph K. Bradley] Modified ScalaReflection.schemaFor to take primary constructor of Product when there are multiple constructors. Added test to suite which failed before but works now.

jkbradley · 2014-12-16T21:52:43Z

This PR is paused until next week, pending some discussion.

jkbradley · 2014-12-30T21:43:11Z

I just pushed an update based on major design refactoring, but I still need to add 1 item (DeveloperApiExample for Java) and update the PR description. Will do soon...

SparkQA · 2014-12-30T21:47:37Z

Test build #24905 has started for PR 3637 at commit 1be9892.

This patch merges cleanly.

SparkQA · 2014-12-30T22:39:14Z

Test build #24905 has finished for PR 3637 at commit 1be9892.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- // where index i corresponds to class i (i = 0, 1).
- new Param(this, "probabilityCol", "column name for predicted class conditional probabilities",
- class VectorUDT extends UserDefinedType[Vector]
- case class Sort(

AmplabJenkins · 2014-12-30T22:39:18Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24905/
Test FAILed.

SparkQA · 2014-12-30T22:47:35Z

Test build #24912 has started for PR 3637 at commit bb02582.

This patch merges cleanly.

SparkQA · 2014-12-30T22:57:31Z

Test build #24913 has started for PR 3637 at commit 1bb315f.

This patch merges cleanly.

jkbradley · 2014-12-30T22:57:55Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

-      1.0 / (1.0 + math.exp(-margin))
+
+    // Output selected columns only.
+    // This is a bit complicated since it tries to avoid repeated computation.


I implemented transform() here to avoid repeated computation. This improves upon the default implementation in ProbabilisticClassificationModel. However, it’s a lot of code, so I would be fine with removing it. There is also a question of whether all algorithms should implement a method which would allow the ProbabilisticClassificationModel.transform implementation to avoid repeated computation:

protected def raw2prob(rawPredictions: Vector): Vector = // compute probabilities from raw predictions

What is duplicated and what duplicate computation is being avoided? Can some refactoring of both be done to make them more modular?

The data flow goes:
features --> rawPrediction --> probabilities --> prediction

If we want all of these values, then it is fastest to compute them in this sequence. If we only want some of these values, then we do not want to pollute the Schema namespace (names of columns) by computing all 4 values. Therefore, we need UDFs to handle all possible downstream links here.

We could abstract the key links: features2raw, raw2prob, prob2pred, raw2pred, and then compose those as needed. Does that sound best?

Thinking more about this, I think abstracting the key links might be best. It will certainly make LogisticRegression much shorter since prediction takes up most of the file.

Is this going to be a part of this PR ? I'm also okay with adding a TODO here to add the functions later

I'll add a TODO for now...but hope I have time to make it part of this PR.

SparkQA · 2015-02-06T01:07:55Z

Test build #26875 has started for PR 3637 at commit fec348a.

This patch merges cleanly.

shivaram · 2015-02-06T01:10:09Z

Thanks @jkbradley . Could you summarize what (if any) public APIs we are releasing as a part of this change ?

jkbradley · 2015-02-06T02:20:57Z

@shivaram I just updated the description at the top with the list of public changes.

jkbradley · 2015-02-06T02:22:03Z

I believe that last commit covers everything. My only question now is whether I should remove all of the MimaExcludes lines for spark.ml since I added a blanket exception (since spark.ml is an alpha component). Or are they nice for the record?
CC: @mengxr

SparkQA · 2015-02-06T02:22:58Z

Test build #26875 has finished for PR 3637 at commit fec348a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaDeveloperApiExample
- // where index i corresponds to class i (i = 0, 1).
- // where index i corresponds to class i (i = 0, 1).
- class DoubleParam(parent: Params, name: String, doc: String, defaultValue: Option[Double])
- class IntParam(parent: Params, name: String, doc: String, defaultValue: Option[Int])
- class FloatParam(parent: Params, name: String, doc: String, defaultValue: Option[Float])
- class LongParam(parent: Params, name: String, doc: String, defaultValue: Option[Long])
- class BooleanParam(parent: Params, name: String, doc: String, defaultValue: Option[Boolean])
- new Param(this, "probabilityCol", "column name for predicted class conditional probabilities",
- class VectorUDT extends UserDefinedType[Vector]

SparkQA · 2015-02-06T02:22:59Z

Test build #26885 has started for PR 3637 at commit 405bfb8.

This patch merges cleanly.

AmplabJenkins · 2015-02-06T02:23:01Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26875/
Test PASSed.

shivaram · 2015-02-06T03:07:02Z

Sounds good. Thanks @jkbradley

SparkQA · 2015-02-06T03:32:35Z

Test build #26885 has finished for PR 3637 at commit 405bfb8.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaDeveloperApiExample
- // where index i corresponds to class i (i = 0, 1).
- * Here, we have a trait to be mixed in with the Estimator and Model (MyLogisticRegression
- * class since the maxIter parameter is only used during training (not in the Model).
- // where index i corresponds to class i (i = 0, 1).
- class DoubleParam(parent: Params, name: String, doc: String, defaultValue: Option[Double])
- class IntParam(parent: Params, name: String, doc: String, defaultValue: Option[Int])
- class FloatParam(parent: Params, name: String, doc: String, defaultValue: Option[Float])
- class LongParam(parent: Params, name: String, doc: String, defaultValue: Option[Long])
- class BooleanParam(parent: Params, name: String, doc: String, defaultValue: Option[Boolean])
- new Param(this, "probabilityCol", "column name for predicted class conditional probabilities",

AmplabJenkins · 2015-02-06T03:32:39Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26885/
Test PASSed.

…ion APIs This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] **UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion. Here is a list of changes which are public: * new output columns: rawPrediction, probabilities * The “score” column is now called “rawPrediction” * Classifiers now provide numClasses * Params.get and .set are now protected instead of private[ml]. * ParamMap now has a size method. * new classes: LinearRegression, LinearRegressionModel * LogisticRegression now has an intercept. ### Sketch of APIs (most of which are private[spark] for now) Abstract classes for learning algorithms (+ corresponding Model abstractions): * Classifier (+ ClassificationModel) * ProbabilisticClassifier (+ ProbabilisticClassificationModel) * Regressor (+ RegressionModel) * Predictor (+ PredictionModel) * *For all of these*: * There is no strongly typed training-time API. * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms. Concrete classes: learning algorithms * LinearRegression * LogisticRegression (updated to use new abstract classes) * Also, removed "score" in favor of "probability" output column. Changed BinaryClassificationEvaluator to match. (SPARK-5031) Other updates: * params.scala: Changed Params.set/get to be protected instead of private[ml] * This was needed for the example of defining a class from outside of the MLlib namespace. * VectorUDT: Will later change from private[spark] to public. * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors. * Also, added equals() method.f * SPARK-4942 : ML Transformers should allow output cols to be turned on,off * Update validateAndTransformSchema * Update transform * (Updated examples, test suites according to other changes) New examples: * DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace) * Added Java version too Test Suites: * LinearRegressionSuite * LogisticRegressionSuite * + Java versions of above suites CC: mengxr etrain shivaram Author: Joseph K. Bradley <[email protected]> Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits: 405bfb8 [Joseph K. Bradley] Last edits based on code review. Small cleanups fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types. 8316d5e [Joseph K. Bradley] fixes after rebasing on master fc62406 [Joseph K. Bradley] fixed test suites after last commit bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame) 9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it 216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged f549e34 [Joseph K. Bradley] Updates based on code review. Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT. * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value. 343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package 82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR). Fixed Java suites 0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites c3c8da5 [Joseph K. Bradley] small cleanup 934f97b [Joseph K. Bradley] Fixed bugs from previous commit. 1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel. Also introduced ProbabilisticClassifier. * This was to support output column “probabilityCol” in transform(). 4e2f711 [Joseph K. Bradley] rat fix bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite 8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others 1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0 adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites 58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap. 57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite e433872 [Joseph K. Bradley] Updated docs. Added LabeledPointSuite to spark.ml 54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly 0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString). Fixed bug in persisting logreg data. Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup). 601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString. Cleaned up classes in class hierarchy, before implementing tests and examples. d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch 52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet bfade12 [Joseph K. Bradley] Added lots of classes for new ML API: (cherry picked from commit dc0c449) Signed-off-by: Xiangrui Meng <[email protected]>

mengxr · 2015-02-06T07:46:15Z

LGTM. Merged into master and branch-1.3. Thanks everyone for the discussion! @jkbradley We can remove mima excludes in another PR.

pgirolami · 2015-02-06T10:16:21Z

After doing a clean pull of master from the repo a couple minutes ago, it fails to build due to commit dc0c449 for this PR .
Build was done using mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -Phive -Phive-thriftserver -Pbigtop-dist -DskipTests package install
If I should report this as a JIRA, please let me know.

[ERROR] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[ERROR] /home/sensorly/spark/spark/mllib/src/test/java/org/apache/spark/ml/classification/JavaLogisticRegressionSuite.java:87: error: incomparable types: Object and int
[ERROR]     assert(model.fittingParamMap().apply(lr.maxIter()) == 10);
[ERROR]                                                        ^
[ERROR] /home/sensorly/spark/spark/mllib/src/test/java/org/apache/spark/ml/classification/JavaLogisticRegressionSuite.java:112: error: incomparable types: Object and int
[ERROR]     assert(model2.fittingParamMap().apply(lr.maxIter()) == 5);
[ERROR]                                                         ^
[ERROR] /home/sensorly/spark/spark/mllib/src/test/java/org/apache/spark/ml/regression/JavaLinearRegressionSuite.java:79: error: incomparable types: Object and int
[ERROR]     assert(model.fittingParamMap().apply(lr.maxIter()) == 10);
[ERROR]                                                        ^
[ERROR] /home/sensorly/spark/spark/mllib/src/test/java/org/apache/spark/ml/regression/JavaLinearRegressionSuite.java:85: error: incomparable types: Object and int
[ERROR]     assert(model2.fittingParamMap().apply(lr.maxIter()) == 5);

jkbradley · 2015-02-06T18:55:16Z

@pgirolami What version of Java are you using? Those work for me. However, I should have used assertEquals(), so I'll submit a patch for that which should fix the compilation problems you're encountering.

srowen · 2015-02-06T18:56:31Z

@jkbradley I am seeing the same compilation errors. I'm compiling with Java 8 FWIW.

pgirolami · 2015-02-07T13:06:53Z

@jkbradley Maven says it's using JDK 6

Philippes-MacBook-Air-3:~ Philippe$ mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T22:58:10+02:00)
Maven home: /Users/Philippe/Documents/apache-maven-3.2.3
Java version: 1.6.0_65, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.9.5", arch: "x86_64", family: "mac"

Philippes-MacBook-Air-3:~ Philippe$ uname -a
Darwin Philippes-MacBook-Air-3.local 13.4.0 Darwin Kernel Version 13.4.0: Sun Aug 17 19:50:11 PDT 2014; root:xnu-2422.115.4~1/RELEASE_X86_64 x86_64

pgirolami · 2015-02-07T14:16:26Z

Same problem using JDK 7 on my Mac

Philippes-MacBook-Air-3:spark Philippe$ mvn -version
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T22:58:10+02:00)
Maven home: /Users/Philippe/Documents/apache-maven-3.2.3
Java version: 1.7.0_75, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_75.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.9.5", arch: "x86_64", family: "mac"

srowen · 2015-02-07T14:18:52Z

BTW this has been fixed already in master. But it looks like it is a small difference in Java 6 vs 7:
http://stackoverflow.com/questions/16119638/differences-in-auto-unboxing-between-java-6-vs-java-7
EDIT: wait, that's the other way around, isn't it? Here, Java 6 works. Color me confused.

pgirolami · 2015-02-07T17:13:45Z

For me it works with neither Java 6 nor Java 7.
Unless, of course, maven -version is lying...

Just pulled master and ran the same maven command (no clean) and indeed, it now compiles mllib (using Java 7). Thanks !

petro-rudenko · 2015-02-09T13:37:08Z

One more issue. In LogisticRegressionWithLBFGS class there's a line:

this.setFeatureScaling(true)

I have feature scaling as a part of pipeline to produce new columns based on scaled columns. But i can't tell to the LogisticRegression class from the new API to set feature scaling to false. Also there's no actual way to set validateData = false in GeneralizedLinearAlgorithm.

jkbradley · 2015-02-16T19:46:06Z

@petro-rudenko (Apologies for the slow response; I've been without Internet for a week.) About feature scaling, I don't think it's a problem in terms of correctness for LogisticRegression since scaling twice should produce the same result as scaling once (since it handles scaling at prediction time too). It is wasted computation though.

For GLMs, you should be able to call setValidateData: [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L156]

petro-rudenko · 2015-02-16T19:57:49Z

@jkbradley i can setValidateData in GLM, but not in the LogisticRegression class from the new API. For my case found a trick to customize anything i want (add org.apache.spark.ml package to my project and extends any class). When this API would be public it would be easier to customize (e.g. use LogisticRegressionWithSGD except for LRWithLBFGS) in user's namespace.

jkbradley · 2015-02-17T00:23:24Z

Oh, I see. It's true there are missing features in spark.ml currently. Please feel free to make a JIRA to prioritize it.

jkbradley mentioned this pull request Dec 8, 2014

[MLLIB] [WIP] [SPARK-3702] Standardizing abstractions and developer API for prediction #3427

Closed

6 tasks

Lewuathe reviewed Dec 8, 2014
View reviewed changes

jkbradley mentioned this pull request Dec 9, 2014

[SPARK-4791] [sql] Infer schema from case class with multiple constructors #3646

Closed

jkbradley mentioned this pull request Dec 9, 2014

[MLLIB] SPARK-4362: Added classProbabilities method for Naive Bayes #3626

Closed

jkbradley force-pushed the ml-api-part1 branch from 83109eb to 1be9892 Compare December 30, 2014 21:42

jkbradley reviewed Dec 30, 2014
View reviewed changes

jkbradley changed the title ~~[SPARK-4789] [mllib] Standardize ML Prediction APIs~~ [SPARK-4789] [SPARK-4942] [mllib] Standardize ML Prediction APIs Dec 30, 2014

Last edits based on code review. Small cleanups

405bfb8

asfgit closed this in dc0c449 Feb 6, 2015

jkbradley mentioned this pull request Feb 23, 2015

[SPARK-5253] [ML] LinearRegression with L1/L2 (ElasticNet) using OWLQN #4259

Closed

[SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs #3637

[SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs #3637

Conversation

jkbradley commented Dec 8, 2014

Sketch of APIs (most of which are private[spark] for now)

SparkQA commented Dec 8, 2014

SparkQA commented Dec 8, 2014

AmplabJenkins commented Dec 8, 2014

SparkQA commented Dec 8, 2014

SparkQA commented Dec 8, 2014

AmplabJenkins commented Dec 8, 2014

jkbradley commented Dec 8, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkbradley commented Dec 9, 2014

jkbradley commented Dec 9, 2014

srowen commented Dec 9, 2014

jkbradley commented Dec 9, 2014

shivaram commented Dec 10, 2014

Lewuathe commented Dec 10, 2014

srowen commented Dec 10, 2014

jkbradley commented Dec 10, 2014

jkbradley commented Dec 16, 2014

jkbradley commented Dec 30, 2014

SparkQA commented Dec 30, 2014

SparkQA commented Dec 30, 2014

AmplabJenkins commented Dec 30, 2014

SparkQA commented Dec 30, 2014

SparkQA commented Dec 30, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SparkQA commented Feb 6, 2015

shivaram commented Feb 6, 2015

jkbradley commented Feb 6, 2015

jkbradley commented Feb 6, 2015

SparkQA commented Feb 6, 2015

SparkQA commented Feb 6, 2015

AmplabJenkins commented Feb 6, 2015

shivaram commented Feb 6, 2015

SparkQA commented Feb 6, 2015

AmplabJenkins commented Feb 6, 2015

mengxr commented Feb 6, 2015

pgirolami commented Feb 6, 2015

jkbradley commented Feb 6, 2015

srowen commented Feb 6, 2015

pgirolami commented Feb 7, 2015

pgirolami commented Feb 7, 2015

srowen commented Feb 7, 2015

pgirolami commented Feb 7, 2015

petro-rudenko commented Feb 9, 2015

jkbradley commented Feb 16, 2015

petro-rudenko commented Feb 16, 2015

jkbradley commented Feb 17, 2015