[SPARK-19155][ML] MLlib GeneralizedLinearRegression family and link should case insensitive #16516

yanboliang · 2017-01-09T13:52:58Z

What changes were proposed in this pull request?

MLlib GeneralizedLinearRegression family and link should be case insensitive. This is consistent with some other MLlib params such as featureSubsetStrategy.

How was this patch tested?

Update corresponding tests.

yanboliang · 2017-01-09T14:10:26Z

cc @felixcheung

SparkQA · 2017-01-09T14:52:43Z

Test build #71082 has finished for PR 16516 at commit f1337d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

imatiach-msft · 2017-01-09T17:02:53Z

This is a nice fix. It looks like some other learners have this issue as well, e.g. LogisticRegression.scala under $(root)/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

imatiach-msft · 2017-01-09T17:07:43Z

Maybe a more generic fix would be to fix the method ParamValidators.inArray to be case insensitive. I see this method used in a lot of places. Doing a simple search brings up not just LogisticRegression.scala but also:
/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala
/mllib/src/main/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.scala
/mllib/src/main/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala
/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
and many others as well, and it looks like they all suffer from the same bug. A more general fix would be preferred I think, especially to make all code consistent and use the same method, no? It doesn't seem like any parameter should be case-sensitive.

felixcheung · 2017-01-09T17:50:24Z

I'd agree with that. Given the wider scope of changes, I'd suggest creating another JIRA to make the scope and impact clear. It wouldn't just affect SparkR.

yanboliang · 2017-01-10T14:55:26Z

@imatiach-msft @felixcheung Sounds good, I opened SPARK-19155 to track and will update this PR soon. Thanks.

felixcheung · 2017-01-11T04:57:06Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

@@ -365,7 +365,7 @@ class LogisticRegression @Since("1.2.0") (
      case None => histogram.length
    }

-    val isMultinomial = $(family) match {
+    val isMultinomial = $(family).toLowerCase match {


Is there a way to store the param as the lowercased version, instead of turning it into lower case each time it is accessed? It might be less error-prone that way.

MLnick · 2017-01-11

It can, but I think it would need to be done in each concrete setXXX method.

yanboliang · 2017-01-11

I don't think we can do that in the setXXX methods, since they are not the only entry point for setting params. We can also set param values through the following API:

    def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): M = {
      val map = new ParamMap()
        .put(firstParamPair)
        .put(otherParamPairs: _*)
      fit(dataset, map)
    }

Please feel free to correct me if I have misunderstood. Thanks.
cc @jkbradley @sethah
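
The point about multiple entry points can be illustrated with a toy model. The sketch below does not use Spark's classes: `ToyEstimator`, `setFamily`, and `put` are hypothetical stand-ins, with `put` playing the role of the ParamMap route shown above. It only shows why normalizing inside a setter would miss values that arrive another way.

```scala
import scala.collection.mutable

// Toy stand-ins for illustration only; these are NOT Spark's Param/ParamMap classes.
object EntryPointsSketch {
  final class ToyEstimator {
    private val params = mutable.Map.empty[String, String]

    // Entry point 1: a concrete setter that lowercases before storing.
    def setFamily(value: String): this.type = {
      params("family") = value.toLowerCase
      this
    }

    // Entry point 2: a generic put, analogous to passing param pairs to fit().
    // It never goes through setFamily, so the value is stored as given.
    def put(name: String, value: String): this.type = {
      params(name) = value
      this
    }

    def getFamily: String = params("family")
  }
}
```

With this toy, `new EntryPointsSketch.ToyEstimator().setFamily("Gamma").getFamily` yields the lowercased value, while `put("family", "Gamma")` leaves it as `Gamma`, so any code that matches on the stored value still has to handle both.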

imatiach-msft · 2017-01-11

Maybe we need a different accessor, used consistently on the transformer/estimator side internally, that would:
1.) change the value to lowercase
2.) trim any whitespace
Changing the setter might cause issues, because when users check that their parameters are set correctly they will see modified values, which is unexpected. The case-insensitive comparison should be done as in this PR, but instead of calling toLowerCase explicitly everywhere we should access the value through some other method that normalizes the parameter internally.

jkbradley · 2017-01-12

@yanboliang is correct that there are other entry points for setting and getting Params. I agree it'd be nice to consolidate them, but that would be quite a bit of work and lower priority than other tech debt we currently have, IMO.
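
A rough sketch of the accessor idea discussed above: keep the stored value exactly as the user set it, and normalize only when reading it internally. `NormalizedAccess`, `normalized`, and `isMultinomial` are hypothetical names, and the family cases in the match are assumed for illustration rather than copied from LogisticRegression.scala.

```scala
import java.util.Locale

// Hypothetical helper; not part of Spark's API.
object NormalizedAccess {
  /** Trims and lowercases (locale-independent) a raw param value for internal use. */
  def normalized(raw: String): String = raw.trim.toLowerCase(Locale.ROOT)

  /** Matches on the normalized value instead of calling toLowerCase at each call site.
    * The set of family names below is an assumption made for this sketch. */
  def isMultinomial(rawFamily: String, numClasses: Int): Boolean =
    normalized(rawFamily) match {
      case "binomial"    => false
      case "multinomial" => true
      case "auto"        => numClasses > 2
      case other =>
        throw new IllegalArgumentException(s"Unsupported family: $other")
    }
}
```

The user-visible getter would keep returning the original string, so someone who set `"Multinomial"` still reads back `"Multinomial"`.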

SparkQA · 2017-01-11T05:35:58Z

Test build #71178 has finished for PR 16516 at commit de6994c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

imatiach-msft

Is it possible to change the ParamValidators.inArray[String] method to verify the given string in a case-insensitive way? Then you wouldn't need to make as many changes.

imatiach-msft · 2017-01-11T21:45:48Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

@@ -91,8 +91,8 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas
  @Since("2.1.0")
  final val family: Param[String] = new Param(this, "family",
    "The name of family which is a description of the label distribution to be used in the " +
-      s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-    ParamValidators.inArray[String](supportedFamilyNames))
+      s"model (case-insensitive). Supported options: ${supportedFamilyNames.mkString(", ")}.",


Is it possible to change the ParamValidators.inArray[String] method to verify the given string in a case-insensitive way? Then you wouldn't need to make as many changes (e.g. this change could be reverted).

Maybe you could add a ParamValidators.inStringArray(supportedFamilyNames) method which would both normalize to lowercase and trim whitespace (?)

yanboliang · 2017-01-12

@imatiach-msft I think we should not change the behavior of ParamValidators.inArray[String], since some other string params that use the original check may be case-sensitive.
Adding a new method sounds reasonable, but I'm a bit worried about whether we should add such a concrete method to the common validation object ParamValidators, which uses generic types. I'm still open on this topic and would like to hear more thoughts. Thanks.
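
As a sketch of the "new method" option, a string-only validator could sit next to the generic one without changing it. The object below is a stand-in and not Spark's ParamValidators; `inStringArray` is the hypothetical name suggested above, and it assumes the supported names are stored in lowercase.

```scala
import java.util.Locale

// Hypothetical sketch; stand-ins only, not Spark's ParamValidators.
object ValidatorSketch {
  /** Generic, case-sensitive membership check (behaviour left unchanged). */
  def inArray[T](allowed: Array[T]): T => Boolean =
    (value: T) => allowed.contains(value)

  /** String-only check that trims and lowercases before comparing.
    * Assumes `allowed` already holds lowercase names. */
  def inStringArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.trim.toLowerCase(Locale.ROOT))
}
```

Params that must stay case-sensitive, such as column names, would keep using the generic check, and only keyword-style params would opt in to the string-specific one.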

imatiach-msft · 2017-01-12

You're right, I searched through the code base and case-sensitivity matters when:
1.) we are specifying some column name as a parameter
2.) R model formula (from RFormula.scala)
3.) Tokenizer.scala regex pattern
In all other cases it doesn't seem like it should matter.

Searching through the code base these are the places where we use Param[String]:

spark-mllib_2.11
org.apache.spark.ml.classification
  LogisticRegression.scala
    final val family: Param[String] = new Param(this, "family",
  MultilayerPerceptronClassifier.scala
    final val solver: Param[String] = new Param[String](this, "solver",
  NaiveBayes.scala
    final val modelType: Param[String] = new Param[String](this, "modelType", "The model type " +
org.apache.spark.ml.clustering
  KMeans.scala
    final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " +
  LDA.scala
    final val optimizer = new Param[String](this, "optimizer", "Optimizer or inference" +
    final val topicDistributionCol = new Param[String](this, "topicDistributionCol", "Output column" +
org.apache.spark.ml.evaluation
  BinaryClassificationEvaluator.scala
    val metricName: Param[String] = {
  MulticlassClassificationEvaluator.scala
    val metricName: Param[String] = {
  RegressionEvaluator.scala
    val metricName: Param[String] = {
org.apache.spark.ml.feature
  Bucketizer.scala
    val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " +
  ChiSqSelector.scala
    final val selectorType = new Param[String](this, "selectorType",
  QuantileDiscretizer.scala
    val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " +
  RFormula.scala
    val formula: Param[String] = new Param(this, "formula", "R model formula")
  SQLTransformer.scala
    final val statement: Param[String] = new Param[String](this, "statement", "SQL statement")
  Tokenizer.scala
    val pattern: Param[String] = new Param(this, "pattern", "regex pattern used for tokenizing")
org.apache.spark.ml.param
  ParamsSuite.scala
    val param = new Param[String](dummy, "name", "doc")
org.apache.spark.ml.param.shared
  sharedParams.scala
    final val featuresCol: Param[String] = new Param[String](this, "featuresCol", "features column name")
    final val labelCol: Param[String] = new Param[String](this, "labelCol", "label column name")
    final val predictionCol: Param[String] = new Param[String](this, "predictionCol", "prediction column name")
    final val rawPredictionCol: Param[String] = new Param[String](this, "rawPredictionCol", "raw prediction (a.k.a. confidence) column name")
    ... P...
    final val varianceCol: Param[String] = new Param[String](this, "varianceCol", "Column name for the biased sample variance of prediction")
    final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
    final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
    ... P...
    final val weightCol: Param[String] = new Param[String](this, "weightCol", "weight column name. If this is not set or empty, we treat all instance weights as 1.0")
    final val solver: Param[String] = new Param[String](this, "solver", "the solver algorithm for optimization. If this is not set or empty, default value is 'auto'")
org.apache.spark.ml.recommendation
  ALS.scala
    val userCol = new Param[String](this, "userCol", "column name for user ids. Ids must be within " +
    val itemCol = new Param[String](this, "itemCol", "column name for item ids. Ids must be within " +
    val ratingCol = new Param[String](this, "ratingCol", "column name for ratings")
    val intermediateStorageLevel = new Param[String](this, "intermediateStorageLevel",
    val finalStorageLevel = new Param[String](this, "finalStorageLevel",
org.apache.spark.ml.regression
  AFTSurvivalRegression.scala
    final val censorCol: Param[String] = new Param(this, "censorCol", "censor column name")
    final val quantilesCol: Param[String] = new Param(this, "quantilesCol", "quantiles column name")
  GeneralizedLinearRegression.scala
    final val family: Param[String] = new Param(this, "family",
    final val link: Param[String] = new Param(this, "link", "The name of link function " +
    final val linkPredictionCol: Param[String] = new Param[String](this, "linkPredictionCol",
org.apache.spark.ml.tree
  treeParams.scala
    final val impurity: Param[String] = new Param[String](this, "impurity", "Criterion used for" +
    final val impurity: Param[String] = new Param[String](this, "impurity", "Criterion used for" +
    final val featureSubsetStrategy: Param[String] = new Param[String](this, "featureSubsetStrategy",
    val lossType: Param[String] = new Param[String](this, "lossType", "Loss function which GBT" +
    val lossType: Param[String] = new Param[String](this, "lossType", "Loss function which GBT" +
org.apache.spark.ml.util
  DefaultReadWriteTest.scala
    final val stringParam: Param[String] = new Param[String](this, "stringParam", "doc")

Maybe we can then add a separate string param validators class to the same params.scala file in the ml folder? There should be a generic function, and the params.scala file seems to be the right place for it.

imatiach-msft · 2017-01-12T18:02:55Z

It looks like you can also update the metric name in the evaluators (binary, regression, multiclass) as well. Those should be case-insensitive too, I think.

yanboliang · 2017-01-13T14:00:06Z

@imatiach-msft I think not all string params should be case-insensitive, such as:

Column name params, like inputCol, should not be case-insensitive.
Params whose values are composed of multiple words, like areaUnderROC.

Please see the PR description.

Also, many of the other string params you found, like impurity, are already case-insensitive.
Other string params, like SQLTransformer.statement, do not need to be updated, since they are not set to single keywords. The backend Spark SQL engine will handle all kinds of SQL statements.

imatiach-msft · 2017-01-13T19:18:53Z

yep, I wrote that in a comment above, I totally agree. Case-sensitivity matters when:
1.) we are specifying some column name as a parameter
2.) R model formula (from RFormula.scala)
3.) Tokenizer.scala regex pattern
For AUC I don't think it should matter though, but it's not too significant.
I still think we should have one method for the check instead of duplicating code, and the same for accessing the value (instead of calling .toLowerCase everywhere in the transformer's/estimator's code).
I believe anywhere there is duplicate code there is room for refactoring. Otherwise, the changes look good to me.

yanboliang · 2017-01-20T14:34:09Z

I found that this involves a lot of problems that need to be defined further and would require refactoring some code, so I will narrow the scope of this PR to only make the GeneralizedLinearRegression family and link case insensitive, since it's a bug that GLM does not accept Gamma as a family name.

imatiach-msft · 2017-01-20T14:55:45Z

Hmm ok, I guess that's fine. I'm just worried this line is duplicated, maybe you could add a method for it and put it in a common place:
(value: String) => supportedFamilyNames.contains(value.toLowerCase))

Otherwise the code looks great to me!

yanboliang · 2017-01-20T15:13:08Z

@imatiach-msft Yeah, that line is also duplicated in some other estimators. I don't think we have a good way to add it to the base class Param, since it's abstract and not bound to a specific type. Do you have a better suggestion? Thanks.

SparkQA · 2017-01-20T15:33:37Z

Test build #71722 has finished for PR 16516 at commit f1f4c89.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-21T06:11:15Z

looks good to me

…hould case insensitive

## What changes were proposed in this pull request?
MLlib ```GeneralizedLinearRegression``` ```family``` and ```link``` should be case insensitive. This is consistent with some other MLlib params such as [```featureSubsetStrategy```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L415).

## How was this patch tested?
Update corresponding tests.

Author: Yanbo Liang <[email protected]>
Closes #16516 from yanboliang/spark-19133.
(cherry picked from commit 3dcad9f)
Signed-off-by: Yanbo Liang <[email protected]>

yanboliang · 2017-01-22T05:17:20Z

Merged into master and branch-2.1. Thanks for all your reviewing.

## What changes were proposed in this pull request?
This is a supplement to PR #16516 which did not make the value from `getFamily` case insensitive. Current tests of poisson/binomial glm with weight fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` checks the value of family retrieved from `getFamily`
```
model.getFamily == Binomial.name || model.getFamily == Poisson.name
```

## How was this patch tested?
Update existing tests for 'Poisson' and 'Binomial'.

yanboliang felixcheung imatiach-msft

Author: actuaryzhang <[email protected]>
Closes #16675 from actuaryzhang/family.
(cherry picked from commit f067ace)
Signed-off-by: Yanbo Liang <[email protected]>
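
To illustrate the failure described in the commit message above, here is a minimal sketch of the exact-match comparison next to a case-insensitive one. `Binomial` and `Poisson` are hypothetical stand-ins that model only a `name` field (assumed to be lowercase), and the case-insensitive variant is one possible fix, not necessarily the one used in #16675.

```scala
// Hypothetical stand-ins for the family objects; only the `name` field is modeled.
object FamilyCheckSketch {
  object Binomial { val name = "binomial" }
  object Poisson { val name = "poisson" }

  // The comparison quoted in the commit message: an exact match,
  // so a user-supplied "Poisson" is not recognized.
  def exactMatch(family: String): Boolean =
    family == Binomial.name || family == Poisson.name

  // One possible fix: compare while ignoring case.
  def caseInsensitiveMatch(family: String): Boolean =
    family.equalsIgnoreCase(Binomial.name) || family.equalsIgnoreCase(Poisson.name)
}
```

Here `exactMatch("Poisson")` is false while `caseInsensitiveMatch("Poisson")` is true, which is the gap between validating case-insensitively and later comparing the raw value from the getter.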

yanboliang mentioned this pull request Jan 9, 2017

[SPARK-19133][SPARKR][ML] fix glm for Gamma, clarify glm family supported #16511

Closed

yanboliang changed the title ~~[SPARK-19133][ML] ML GLR family and link could be uppercase.~~ [WIP][SPARK-19155][ML] ML GLR family and link could be uppercase. Jan 10, 2017

yanboliang changed the title ~~[WIP][SPARK-19155][ML] ML GLR family and link could be uppercase.~~ [SPARK-19155][ML] Make some string params of ML algorithms case insensitive Jan 11, 2017

felixcheung reviewed Jan 11, 2017

imatiach-msft reviewed Jan 11, 2017

ML GLR family and link could be uppercase.

f1f4c89

yanboliang force-pushed the spark-19133 branch from de6994c to f1f4c89 Compare January 20, 2017 14:27

yanboliang changed the title ~~[SPARK-19155][ML] Make some string params of ML algorithms case insensitive~~ [SPARK-19155][ML] MLlib GeneralizedLinearRegression family and link should case insensitive Jan 20, 2017

asfgit closed this in 3dcad9f Jan 22, 2017

yanboliang deleted the spark-19133 branch January 22, 2017 05:19

yanboliang mentioned this pull request Jan 22, 2017

[SPARK-18929][ML] Add Tweedie distribution in GLM #16344

Closed

actuaryzhang mentioned this pull request Jan 23, 2017

[SPARK-19155][ML] Make family case insensitive in GLM #16675

Closed
