SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer #5500

Closed
wants to merge 4 commits into from

@sryza sryza commented Apr 13, 2015

This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.

A couple choices made here:

  • There's an includeFirst option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
  • The user is expected to pass a Seq of category names when instantiating a OneHotEncoder. These can easily be obtained from a StringIndexer. The names are used for the output column names, which take the form colName_categoryName.
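
To make the includeFirst behavior concrete, here is a minimal plain-Scala sketch with no Spark dependency. The object name OneHotSketch and the method encode are hypothetical and are not part of the patch; they only illustrate how the output width changes with the flag.

```scala
object OneHotSketch {
  /** Encodes a category index as a dense one-hot array (illustrative only).
    * With includeFirst = false, category 0 is dropped and maps to all zeros. */
  def encode(index: Int, numCategories: Int, includeFirst: Boolean = true): Array[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    // numCategories columns when the first category is kept, one fewer otherwise.
    val size = if (includeFirst) numCategories else numCategories - 1
    val out = Array.fill(size)(0.0)
    // Shift positions left by one when the first category is omitted.
    val pos = if (includeFirst) index else index - 1
    if (pos >= 0) out(pos) = 1.0
    out
  }
}
```

For example, OneHotSketch.encode(2, 5) yields a five-element array with the third entry set, while OneHotSketch.encode(2, 5, includeFirst = false) yields a four-element array with the second entry set.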

SparkQA commented Apr 14, 2015

Test build #30206 has finished for PR 5500 at commit 04590bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class OneHotEncoder(labelNames: Seq[String], includeFirst: Boolean = true) extends Transformer
  • This patch does not change any dependencies.

import org.apache.spark.sql.types.{StringType, StructType}

@AlphaComponent
class OneHotEncoder(labelNames: Seq[String], includeFirst: Boolean = true) extends Transformer

Needs JavaDoc. For pipeline components, we use a default constructor and set parameter values through setters. Since OneHotEncoder is a unary transformer, you can extend UnaryTransformer directly and override createTransformFunc. Please check the implementation of StringIndexer to see how to define parameters and their default values.
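
The convention described in this comment (a default constructor, parameters set through chained setters, and a single function that maps one input value to one output value) can be sketched without Spark. The classes below, UnarySketch and OneHotEncoderSketch, are hypothetical stand-ins and not Spark's UnaryTransformer or the code in this patch; setNumCategories in particular is an illustrative parameter only, and Vector here is Scala's immutable collection rather than Spark's vector type.

```scala
// Hypothetical stand-in for a unary transformer: one input value in, one output value out.
abstract class UnarySketch[IN, OUT] {
  protected def createTransformFunc: IN => OUT
  def transform(values: Seq[IN]): Seq[OUT] = values.map(createTransformFunc)
}

// Default constructor; all configuration happens through setters that return this.type
// so that calls can be chained.
class OneHotEncoderSketch extends UnarySketch[Double, Vector[Double]] {
  private var includeFirst: Boolean = true
  private var numCategories: Int = 0

  def setIncludeFirst(value: Boolean): this.type = { includeFirst = value; this }
  def setNumCategories(value: Int): this.type = { numCategories = value; this }

  override protected def createTransformFunc: Double => Vector[Double] = { label =>
    val index = label.toInt
    require(index >= 0 && index < numCategories, s"index $index out of range")
    // When the first category is omitted, every remaining position shifts left by one.
    val offset = if (includeFirst) 0 else 1
    Vector.tabulate(numCategories - offset)(i => if (i + offset == index) 1.0 else 0.0)
  }
}
```

A caller would then write something like new OneHotEncoderSketch().setNumCategories(3).setIncludeFirst(false) and pass a sequence of category indices to transform.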

mengxr commented Apr 15, 2015

@sryza Thanks for adding OneHotEncoder to spark.ml. One thing I want to discuss is the expected input of OneHotEncoder. There are two use cases:

  1. input a string column and output a vector column with binary values
  2. input a column with category indices and output a vector column with binary values

The input to 2) would be the output from StringIndexer we recently merged. I would call 1) OneHotEncoder, which is a UnaryTransformer, and 2) StringVectorizer, which is a combination of StringIndexer and OneHotEncoder. If both of us agree on the semantics, we can implement 1) as OneHotEncoder in this PR and add StringVectorizer in another PR. Does it sound good to you?

sryza commented Apr 16, 2015

@mengxr that makes sense to me, but did you mean to switch 1 and 2? I.e. in this PR we should implement OneHotEncoder, which takes an input column with category indices and outputs a vector column, and in a later PR we can implement StringVectorizer which chains StringIndexer with OneHotEncoder.
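
The two-stage design discussed here can be sketched in plain Scala: first map strings to category indices, then map indices to binary vectors. The names VectorizeSketch, indexLabels, oneHot, and vectorize are hypothetical, and the frequency-based index ordering (with alphabetical tie-breaking) is an assumption made for the sketch rather than something this thread specifies.

```scala
object VectorizeSketch {
  // Stage 1 (indexer): give each distinct label an index.
  // Assumed ordering: most frequent label first, ties broken alphabetically.
  def indexLabels(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // Stage 2 (one-hot encoder): turn a category index into a binary vector.
  def oneHot(index: Int, numCategories: Int): Vector[Double] =
    Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)

  // Chaining both stages: strings in, binary vectors out.
  def vectorize(labels: Seq[String]): Seq[Vector[Double]] = {
    val index = indexLabels(labels)
    labels.map(label => oneHot(index(label), index.size))
  }
}
```

Keeping the stages separate means the encoder only ever sees numeric indices, which is the input contract settled on in this exchange.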

sryza commented Apr 16, 2015

Updated the patch to conform to the approach described above.

SparkQA commented Apr 16, 2015

Test build #30436 has finished for PR 5500 at commit 64da101.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

* (0.0, 0.0, 1.0, 0.0, 0.0). If includeFirst is set to false, the first category is omitted, so the
* output vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input value
* of 0.0 would map to a vector of all zeros. Omitting the first category enables the vector
* columns to be independent.

Without the first category the rest could still be dependent. We should say "Including the first category would make the vector columns linearly dependent because they sum up to one."
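
The "sum up to one" argument can be written out as follows. This is a sketch under the assumption that the downstream model also contains an intercept (all-ones) column, which the thread does not state explicitly; the symbols x_{i,j} and k are introduced here for illustration.

```latex
% x_{i,j} \in \{0,1\}: entry j of the encoded vector for row i, with k categories.
% Keeping all k columns, exactly one entry per row is set:
\sum_{j=0}^{k-1} x_{i,j} = 1 \quad \text{for every row } i
\quad\Longrightarrow\quad
\mathbf{1} - \sum_{j=0}^{k-1} \mathbf{x}_j = \mathbf{0}.

% Dropping category 0, the row sum is no longer constant:
\sum_{j=1}^{k-1} x_{i,j} = 1 - x_{i,0} \in \{0, 1\}.
```

The first line is a nontrivial linear relation between the intercept column and the k encoded columns. The second line shows that omitting one category removes that particular relation; as the comment notes, it does not by itself guarantee that the remaining columns are independent.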

SparkQA commented Apr 22, 2015

Test build #30725 has finished for PR 5500 at commit 7e53579.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

SparkQA commented Apr 27, 2015

Test build #31010 has started for PR 5500 at commit 7e53579.

sryza commented Apr 29, 2015

Hey @mengxr this should be ready for review again.

mengxr commented Apr 29, 2015

@sryza I think we can add label names later. For this PR, if the input column carries a nominal attribute with values, we can use it for names. Otherwise, we put no names in the output column. Please also organize imports in the test suite.
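
The naming rule suggested here can be sketched as a function that produces output names only when category values are known. The object OutputNamesSketch and its method outputNames are hypothetical; representing the nominal attribute's values as an Option[Seq[String]] and reusing the colName_categoryName form from the PR description are assumptions for illustration, not the merged implementation.

```scala
object OutputNamesSketch {
  /** Returns output names if the input column carries category values, otherwise None.
    * categoryValues stands in for the values of a nominal attribute (assumed shape). */
  def outputNames(
      inputCol: String,
      categoryValues: Option[Seq[String]],
      includeFirst: Boolean = true): Option[Seq[String]] =
    categoryValues.map { values =>
      // Dropping the first category also drops its name.
      val kept = if (includeFirst) values else values.drop(1)
      kept.map(value => s"${inputCol}_$value")
    }
}
```

With no category values available the function returns None, which corresponds to putting no names in the output column.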

mengxr commented May 1, 2015

@sryza Could you address the comments? If you are busy, I can send you an update.

SparkQA commented May 5, 2015

Test build #31803 timed out for PR 5500 at commit f383250 after a configured wait of 150m.

mengxr commented May 5, 2015

test this please

SparkQA commented May 5, 2015

Test build #31826 has finished for PR 5500 at commit f383250.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

mengxr commented May 5, 2015

test this please

SparkQA commented May 5, 2015

Test build #31889 has finished for PR 5500 at commit f383250.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 5, 2015
This patch adds a one hot encoder for categorical features.  Planning to add documentation and another test after getting feedback on the approach.

A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns.  The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`.  These can be easily gotten from a `StringIndexer`.  The names are used for the output column names, which take the form colName_categoryName.

Author: Sandy Ryza <[email protected]>

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer

(cherry picked from commit 47728db)
Signed-off-by: Xiangrui Meng <[email protected]>

mengxr commented May 5, 2015

LGTM. Merged into master and branch-1.4. Thanks!

@asfgit asfgit closed this in 47728db May 5, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015