SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer #5500
Test build #30206 has finished for PR 5500 at commit
import org.apache.spark.sql.types.{StringType, StructType}

@AlphaComponent
class OneHotEncoder(labelNames: Seq[String], includeFirst: Boolean = true) extends Transformer
Need JavaDoc. For pipeline components, we use the default constructor and set parameter values in setters. Since OneHotEncoder is a unary transformer, you can extend `UnaryTransformer` directly and then override `createTransformFunc`. Please check the implementation of `StringIndexer` to see how to define parameters and their default values.
@sryza Thanks for adding
The input to 2) would be the output from
@mengxr that makes sense to me, but did you mean to switch 1 and 2? I.e. in this PR we should implement
Updated the patch to conform to the approach described above.
Test build #30436 has finished for PR 5500 at commit
 * (0.0, 0.0, 1.0, 0.0, 0.0). If includeFirst is set to false, the first category is omitted, so the
 * output vector for the previous example would be (0.0, 1.0, 0.0, 0.0) and an input value
 * of 0.0 would map to a vector of all zeros. Omitting the first category enables the vector
 * columns to be independent.
Without the first category the rest could be still dependent. We should say "Including the first category would make the vector columns linearly dependent because they sum up to one."
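To make the point above concrete, here is a minimal standalone sketch of the encoding that the doc comment describes. It is plain Scala with no Spark dependency, and `OneHotSketch.encode` is a hypothetical helper written for illustration, not the code in this PR. With `includeFirst = true` the entries of every output vector sum to one, which is the linear dependence the review comment refers to.

```scala
object OneHotSketch {
  // Hypothetical helper (an assumption for illustration, not the PR's implementation):
  // encodes a category index as a dense one-hot array.
  def encode(index: Int, numCategories: Int, includeFirst: Boolean): Array[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    // Dropping the first category shortens the vector by one and shifts positions left.
    val size = if (includeFirst) numCategories else numCategories - 1
    val offset = if (includeFirst) 0 else 1
    val out = Array.fill(size)(0.0)
    val pos = index - offset
    // When includeFirst is false, index 0 has no slot and stays all zeros.
    if (pos >= 0) out(pos) = 1.0
    out
  }
}
```

For the five-category example in the doc comment, `OneHotSketch.encode(2, 5, includeFirst = true)` gives (0.0, 0.0, 1.0, 0.0, 0.0) and `OneHotSketch.encode(2, 5, includeFirst = false)` gives (0.0, 1.0, 0.0, 0.0), while index 0 with `includeFirst = false` gives all zeros.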
Test build #30725 has finished for PR 5500 at commit
Test build #31010 has started for PR 5500 at commit
Hey @mengxr this should be ready for review again.
@sryza I think we can add label names later. For this PR, if the input column carries a nominal attribute with values, we can use it for names. Otherwise, we put no names in the output column. Please also organize imports in the test suite.
@sryza Could you address the comments? If you are busy, I can send you an update.
Test build #31803 timed out for PR 5500 at commit
test this please
Test build #31826 has finished for PR 5500 at commit
test this please
Test build #31889 has finished for PR 5500 at commit
This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.

A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be easily gotten from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.

Author: Sandy Ryza <[email protected]>

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer

(cherry picked from commit 47728db)
Signed-off-by: Xiangrui Meng <[email protected]>

LGTM. Merged into master and branch-1.4. Thanks!
This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.

A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be easily gotten from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.
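As a rough illustration of the naming scheme in the description, the sketch below builds output column names of the form colName_categoryName. `ColumnNames.outputNames` and its parameters are hypothetical names chosen for this example and are not taken from the patch. It also assumes that the first name is dropped when `includeFirst` is false, matching the numCategories - 1 column count.

```scala
object ColumnNames {
  // Hypothetical helper (an assumption, not the PR's code): derives one output
  // column name per kept category, in the form colName_categoryName.
  def outputNames(
      colName: String,
      categoryNames: Seq[String],
      includeFirst: Boolean): Seq[String] = {
    // Assumption: omitting the first category also omits its column name.
    val kept = if (includeFirst) categoryNames else categoryNames.drop(1)
    kept.map(name => s"${colName}_$name")
  }
}
```

With a hypothetical input column `color` and categories red, green and blue, `includeFirst = true` produces three names and `includeFirst = false` produces two. Note that the later commit "Infer label names automatically" suggests the merged version no longer requires the user to pass the names.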