Does not handle NaN #218

Roh-codeur · 2023-03-30T22:35:12Z

Hi

I am trying out EvoTrees for a binary classification task, as below. It turns out, however, that it does not support NaNs. Is there a reason why it doesn't? I currently use XGBoost for my model, and it handles NaNs. The primary issue is that I have missing data in my dataset, and some of the features are NaN for some observations (correctly so).

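using EvoTrees

# Logistic loss on a 0/1 target for binary classification.
config = EvoTreeRegressor(
    loss = :logistic,
    metric = :logloss,
    nrounds = 100,
    nbins = 32,
    lambda = 0.5,
    gamma = 0.1,
    eta = 0.1,
    max_depth = 6,
    rowsample = 0.5,
    colsample = 1.0)

# X, Y, testX, testY are my training and evaluation data.
model = fit_evotree(config; x_train=X, y_train=Y, x_eval=testX, y_eval=testY, print_every_n = 25)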

Is there a plan for EvoTrees to handle NaNs? I am also curious how others handle NaNs: imputing with the mean or median? Those approaches won't work for me, I am afraid. Is there anything else I can try?

ta!

jeremiedb · 2023-03-31T02:07:53Z

Support for NAs/missings wasn't on my radar. Typically, imputing would work (mean/median, or min/max), or otherwise creating an indicator variable [0, 1] that flags whether the value is missing. Is there any reason why such a method wouldn't be applicable to your situation?
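As a rough illustration of the imputation options mentioned above, here is a minimal sketch using only Base Julia and the Statistics standard library. The helper name `impute` is hypothetical (it is not part of EvoTrees), and it assumes each feature has at least one observed value.

```julia
using Statistics

# Hypothetical helper (not an EvoTrees function): replace NaN/missing entries
# with a statistic computed from the observed values only.
function impute(v::AbstractVector, stat::Function = median)
    isgap(x) = ismissing(x) || isnan(x)
    observed = [x for x in v if !isgap(x)]   # assumes at least one observed value
    fill_value = stat(observed)
    return Float64[isgap(x) ? fill_value : x for x in v]
end

v = [NaN, 1.0, 4.0, 2.0]
by_median = impute(v)            # gap filled with the median of 1.0, 4.0, 2.0
by_min    = impute(v, minimum)   # gap filled with the smallest observed value
```

Passing `mean`, `minimum` or `maximum` as the second argument switches between the variants listed in the comment.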

Roh-codeur · 2023-03-31T10:37:10Z

Imputing wouldn’t work. I have NaNs or missing in instances where I don’t have enough data to calculate features. For instance, the first few features would be NaN or missing.

I am afraid I am not aware of indicator variables - can you please elaborate? My features are continuous values.

I hope we can come up with a solution for this. I am quite impressed with the performance and am looking forward to using this package.

Thanks!

jeremiedb · 2023-04-01T04:57:26Z

By indicator variable, I mean something similar to:

julia> df = DataFrame(v1 = [missing, rand(3)...])
4×1 DataFrame
 Row │ v1
     │ Float64?
─────┼─────────────────
   1 │ missing
   2 │       0.777368
   3 │       0.0461273
   4 │       0.71682

julia> transform!(df, "v1" => ByRow(ismissing) => "v1_flag")
4×2 DataFrame
 Row │ v1               v1_flag
     │ Float64?         Bool
─────┼──────────────────────────
   1 │ missing             true
   2 │       0.777368     false
   3 │       0.0461273    false
   4 │       0.71682      false

Then you need to impute the original variable; the fill value could be 0, the mean, or another relevant value:

julia> transform!(df, "v1" => ByRow(x -> ismissing(x) ? 0.0 : x) => "v1")

4×2 DataFrame
 Row │ v1         v1_flag
     │ Float64    Bool
─────┼────────────────────
   1 │ 0.0           true
   2 │ 0.777368     false
   3 │ 0.0461273    false
   4 │ 0.71682      false

This kind of approach should typically cover most use cases.
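The same two steps can be sketched for several columns at once, treating NaN like missing, since the original question is about NaN values. The names `isgap` and `flag_and_fill!` are hypothetical helpers, not functions from DataFrames or EvoTrees, and the fill value of 0.0 is only a placeholder.

```julia
using DataFrames

# Hypothetical helpers, not part of DataFrames or EvoTrees.
# A "gap" is either `missing` or a floating-point NaN.
isgap(x) = ismissing(x) || (x isa AbstractFloat && isnan(x))

function flag_and_fill!(df::DataFrame, cols; fill_value = 0.0)
    for col in cols
        # Step 1: indicator column recording where the gaps were.
        df[!, "$(col)_flag"] = isgap.(df[!, col])
        # Step 2: replace the gaps so the column is plain Float64.
        df[!, col] = Float64[isgap(x) ? fill_value : x for x in df[!, col]]
    end
    return df
end

df = DataFrame(v1 = [NaN, 0.5, 0.25], v2 = [1.0, missing, 3.0])
flag_and_fill!(df, ["v1", "v2"])
```

Because the flag column records which rows were filled, a filled 0.0 can still be told apart from a genuine 0.0 in the data. The result can then be converted with `Matrix{Float64}(df)` before being passed as a feature matrix.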

Roh-codeur · 2023-04-01T13:18:33Z

Ahh, I understand. Thanks for this. I will give this a shot, although I will have to think more about the replacement values. I am working with financial data, so 0s are valid values, and imputing with the mean, median, etc. would be misleading.

thanks
Roh

jeremiedb closed this as completed Jun 28, 2023
