Commit

Add new lines for Spark XGBoost missing values section (#5180)
cpfarrell authored and trivialfis committed Jan 7, 2020
1 parent ee28780 commit 9049c7c
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions doc/jvm/xgboost4j_spark_tutorial.rst
@@ -188,9 +188,11 @@ Example of setting a missing value (e.g. -999) to the "missing" parameter in XGB
doing this with missing values encoded as NaN, you will want to set ``setHandleInvalid = "keep"`` on VectorAssembler
in order to keep the NaN values in the dataset. You would then set the "missing" parameter to whatever you want to be
treated as missing. However this may cause a large amount of memory use if your dataset is very sparse.

2. Before calling VectorAssembler you can transform the values you want to represent missing into an irregular value
that is not 0, NaN, or Null and set the "missing" parameter to 0. The irregular value should ideally be chosen to be
outside the range of values that your features have.

3. Do not use the VectorAssembler class and instead use a custom way of constructing a SparseVector that allows for
specifying sparsity to indicate a non-zero value. You can then set the "missing" parameter to whatever sparsity
indicates in your Dataset. If this approach is taken you can pass the parameter
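
The ``VectorAssembler`` setting mentioned in the first option of the hunk above can be sketched as follows. This is a minimal illustration, not the tutorial's own code: it assumes Spark 2.4 or later (where ``VectorAssembler`` has a ``handleInvalid`` parameter), and the column names, sample rows, and object name are hypothetical. It only shows that NaN survives assembly; configuring the "missing" parameter on the XGBoost estimator is not shown.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object KeepNaNSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("keep-nan-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: the first row has a missing value encoded as NaN in f2.
    val df = Seq(
      (1.0, Double.NaN, 0.0),
      (3.0, 2.0, 1.0)
    ).toDF("f1", "f2", "label")

    // The default handleInvalid mode is "error", which fails on NaN or null input.
    // "keep" passes NaN through into the assembled feature vector instead.
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
      .setHandleInvalid("keep")

    val assembled = assembler.transform(df)
    assembled.show(truncate = false)

    spark.stop()
  }
}
```

The assembled ``features`` column would then be passed to the XGBoost estimator, with its "missing" parameter set to the value that should be treated as missing.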
