SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer #5500

sryza · 2015-04-13T23:50:46Z

This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.

A couple choices made here:

- There's an includeFirst option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
- The user is expected to pass a Seq of category names when instantiating a OneHotEncoder. These can be easily gotten from a StringIndexer. The names are used for the output column names, which take the form colName_categoryName.

SparkQA · 2015-04-14T01:24:34Z

Test build #30206 has finished for PR 5500 at commit 04590bc.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class OneHotEncoder(labelNames: Seq[String], includeFirst: Boolean = true) extends Transformer
This patch does not change any dependencies.

mengxr · 2015-04-15T20:44:04Z

mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala

+import org.apache.spark.sql.types.{StringType, StructType}
+
+@AlphaComponent
+class OneHotEncoder(labelNames: Seq[String], includeFirst: Boolean = true) extends Transformer


Need JavaDoc. For pipeline components, we use default constructor and set parameter values in setters. Since OneHotEncoder is a unary transformer, you can extend UnaryTransformer directly and then overwrite createTransformFunc. Please check the implementation of StringIndexer and see how to define parameters and their default values.
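A minimal sketch of the "default constructor plus setters" convention described above, written in plain Scala so it runs without Spark. `EncoderSketch`, `setNumCategories`, and the other names are hypothetical stand-ins, not Spark's `UnaryTransformer` or `Params` API; the point is only that defaults live with the parameters and the unary mapping is exposed as a function.

```scala
// Plain-Scala stand-in for a pipeline component; all names here are hypothetical.
class EncoderSketch {
  private var includeFirst: Boolean = true // default value lives with the parameter
  private var numCategories: Int = 2

  // Chainable setters replace constructor arguments.
  def setIncludeFirst(value: Boolean): this.type = { includeFirst = value; this }
  def setNumCategories(value: Int): this.type = { numCategories = value; this }
  def getIncludeFirst: Boolean = includeFirst

  // The unary mapping from one input value to one output value.
  def createTransformFunc: Double => Vector[Double] = { value =>
    val size = if (includeFirst) numCategories else numCategories - 1
    val hot = if (includeFirst) value.toInt else value.toInt - 1
    Vector.tabulate(size)(i => if (i == hot) 1.0 else 0.0)
  }
}
```

A caller would then write `new EncoderSketch().setNumCategories(3).setIncludeFirst(false)` instead of passing constructor arguments.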

mengxr · 2015-04-15T20:49:48Z

@sryza Thanks for adding OneHotEncoder to spark.ml. One thing I want to discuss is the expected input of OneHotEncoder. There are two use cases:

1) input a string column and output a vector column with binary values
2) input a column with category indices and output a vector column with binary values

The input to 2) would be the output from StringIndexer we recently merged. I would call 1) OneHotEncoder, which is a UnaryTransformer, and 2) StringVectorizer, which is a combination of StringIndexer and OneHotEncoder. If both of us agree on the semantics, we can implement 1) as OneHotEncoder in this PR and add StringVectorizer in another PR. Does it sound good to you?

sryza · 2015-04-16T17:58:30Z

@mengxr that makes sense to me, but did you mean to switch 1 and 2? I.e. in this PR we should implement OneHotEncoder, which takes an input column with category indices and outputs a vector column, and in a later PR we can implement StringVectorizer which chains StringIndexer with OneHotEncoder.
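The split agreed on here (index first, then encode) can be sketched in plain Scala without Spark. `VectorizerSketch` and its methods are hypothetical names for illustration. The sketch assumes indices are assigned by descending label frequency, and the alphabetical tie-break is an arbitrary choice made only to keep the example deterministic.

```scala
object VectorizerSketch {
  // Step 1 (the StringIndexer role): map each label to an index.
  // Assumption: most frequent label gets index 0; ties broken alphabetically.
  def fitIndex(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .sortBy { case (label, group) => (-group.size, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // Step 2 (the OneHotEncoder role): map a category index to a binary vector.
  def oneHot(index: Int, numCategories: Int): Vector[Double] =
    Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)

  // The combined transformer is just the composition of the two steps.
  def vectorize(values: Seq[String]): Seq[Vector[Double]] = {
    val index = fitIndex(values)
    values.map(v => oneHot(index(v), index.size))
  }
}
```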

sryza · 2015-04-16T21:18:59Z

Updated the patch to conform to the approach described above.

SparkQA · 2015-04-16T22:53:39Z

Test build #30436 has finished for PR 5500 at commit 64da101.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

mengxr · 2015-04-21T22:11:27Z

mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala

+ * (0.0, 0.0, 1.0, 0.0, 0.0). If includeFirst is set to false, the first category is omitted, so the
+ * output vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input value
+ * of 0.0 would map to a vector of all zeros.  Omitting the first category enables the vector
+ * columns to be independent.


Without the first category, the rest could still be dependent. We should say "Including the first category would make the vector columns linearly dependent because they sum up to one."
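A small plain-Scala sketch of the semantics under discussion, assuming five categories and an input value of 2.0 as in the doc comment. `OneHotSketch.encode` is a hypothetical helper, not the code in this patch. It also shows the linear dependence: with the first category included, every encoded row sums to one.

```scala
object OneHotSketch {
  // Encode a category index as a dense binary vector.
  // With includeFirst = false, category 0 is dropped and maps to all zeros.
  def encode(index: Int, numCategories: Int, includeFirst: Boolean = true): Vector[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    val size = if (includeFirst) numCategories else numCategories - 1
    val hot = if (includeFirst) index else index - 1 // -1 means no slot is set
    Vector.tabulate(size)(i => if (i == hot) 1.0 else 0.0)
  }
}
```

Because each row produced with `includeFirst = true` has exactly one entry set, the columns always add up to a constant column of ones, which is the dependence the suggested wording describes.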

SparkQA · 2015-04-22T05:17:37Z

Test build #30725 has finished for PR 5500 at commit 7e53579.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

SparkQA · 2015-04-27T18:19:23Z

Test build #31010 has started for PR 5500 at commit 7e53579.

sryza · 2015-04-29T21:21:07Z

Hey @mengxr this should be ready for review again.

mengxr · 2015-04-29T21:36:55Z

@sryza I think we can add label names later. For this PR, if the input column carries a nominal attribute with values, we can use it for names. Otherwise, we put no names in the output column. Please also organize imports in the test suite.
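One way to read this suggestion, sketched in plain Scala: names are produced only when the input column carries label values, and are otherwise absent. `OutputNamesSketch.outputNames` is a hypothetical helper, and modelling the nominal attribute as an `Option[Seq[String]]` is an assumption made for illustration. The `colName_categoryName` form comes from the PR description.

```scala
object OutputNamesSketch {
  // labels is Some(values) when the input column carries a nominal attribute
  // with values, and None otherwise (an assumption about how to model it).
  def outputNames(
      inputCol: String,
      labels: Option[Seq[String]],
      includeFirst: Boolean = true): Option[Seq[String]] =
    labels.map { values =>
      // Dropping the first category also drops its output name.
      val kept = if (includeFirst) values else values.drop(1)
      kept.map(v => s"${inputCol}_$v")
    }
}
```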

mengxr · 2015-05-01T15:33:01Z

@sryza Could you address the comments? If you are busy, I can send you an update.

SparkQA · 2015-05-05T03:08:02Z

Test build #31803 timed out for PR 5500 at commit f383250 after a configured wait of 150m.

mengxr · 2015-05-05T03:48:24Z

test this please

SparkQA · 2015-05-05T03:55:37Z

Test build #31826 has finished for PR 5500 at commit f383250.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2015-05-05T17:07:15Z

test this please

SparkQA · 2015-05-05T18:57:34Z

Test build #31889 has finished for PR 5500 at commit f383250.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.

A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be easily gotten from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.

Author: Sandy Ryza <[email protected]>

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer

(cherry picked from commit 47728db)
Signed-off-by: Xiangrui Meng <[email protected]>

mengxr · 2015-05-05T19:34:30Z

LGTM. Merged into master and branch-1.4. Thanks!


mengxr reviewed Apr 15, 2015

sryza force-pushed the sandy-spark-5888 branch from 04590bc to 64da101 Compare April 16, 2015 21:17

mengxr reviewed Apr 21, 2015

sryza force-pushed the sandy-spark-5888 branch from 64da101 to 7e53579 Compare April 22, 2015 03:38

sryza added 4 commits May 4, 2015 13:02

SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer

1c182dd

Vector transformers

7c539cf

Review comments

6e257b9

Infer label names automatically

f383250

sryza force-pushed the sandy-spark-5888 branch from 7e53579 to f383250 Compare May 5, 2015 00:33

asfgit closed this in 47728db May 5, 2015
