[FEA] get dask_cudf.Series from xgb.dask.predict() #4525

Closed
rnyak opened this issue Mar 16, 2020 · 1 comment
Labels
feature request New feature or request

Comments

@rnyak
Contributor

rnyak commented Mar 16, 2020

Is your feature request related to a problem? Please describe.
I would like to obtain the prediction result from prediction = xgb.dask.predict(client, output, dtrain) as a dask_cudf.Series instead of a dask.array.core.Array.

I am using the RAPIDS 0.13 nightly in a conda environment, with Dask 2.12.0. Here is a minimal reproducible example:

import cudf
import dask_cudf
from dask.distributed import wait  # `client` below is assumed to be an existing dask.distributed.Client

cdf = cudf.DataFrame()
cdf['day'] = [15, 10, 20, 20, 21, 25, 28, 29]
cdf['hour'] = [19, 20, 20, 21, 18, 12, 15, 13]
cdf['passenger_count'] = [1, 1, 2, 2, 3, 3, 4, 2]
cdf['fare_amount'] = [5.0, 3.5, 12.5, 4.5, 9.0, 5.0, 3.5, 7.5]
ddf = dask_cudf.from_cudf(cdf, npartitions=2)

%%time
X_train = ddf.query('day < 25').persist()

# create a Y_train ddf with just the target variable
Y_train = X_train[['fare_amount']].persist()
# drop the target variable from the training ddf
X_train = X_train[X_train.columns.difference(['fare_amount'])]

# this won't return until all data is in GPU memory
done = wait([X_train, Y_train])

import xgboost as xgb
dtrain = xgb.dask.DaskDMatrix(client, X_train, Y_train)

%%time
trained_model = xgb.dask.train(
    client,
    {
        'learning_rate': 0.3,
        'max_depth': 8,
        'objective': 'reg:squarederror',
        'subsample': 0.6,
        'gamma': 1,
        'silent': True,
        'verbose_eval': True,
        'tree_method': 'gpu_hist',
        'n_gpus': 1,
    },
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train')],
)

def drop_empty_partitions(df):
    lengths = df.map_partitions(len).compute()
    nonempty = [length > 0 for length in lengths]
    return df.partitions[nonempty]

X_test = ddf.query('day >= 25').persist()
X_test = drop_empty_partitions(X_test)

# Create Y_test with just the fare amount
Y_test = X_test[['fare_amount']]

# Drop the fare amount from X_test
X_test = X_test[X_test.columns.difference(['fare_amount'])]

dtest = xgb.dask.DaskDMatrix(client, X_test, Y_test)
prediction = xgb.dask.predict(client, trained_model['booster'], dtest)
type(prediction)
dask.array.core.Array

Describe the solution you'd like

prediction = xgb.dask.predict(client, trained_model['booster'], dtest)
type(prediction)
dask_cudf.core.Series

import math

# I'd like to be able to calculate the RMSE as follows:
Y_test['squared_error'] = (prediction - Y_test['fare_amount'])**2
math.sqrt(Y_test.squared_error.mean().compute())
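
The RMSE calculation requested above reduces to the arithmetic below. This is a minimal sketch in plain Python, not the dask_cudf code path. The `rmse` helper is an illustrative name and is not part of XGBoost or dask_cudf. The `actual` values are the fare_amount entries for the rows with day >= 25 in the example data. The `predicted` values are hypothetical, since the issue does not report any model output.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    # Element-wise squared error, as in (prediction - fare_amount)**2
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    # Mean of the squared errors, then the square root
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# fare_amount for the rows with day >= 25 in the example data
actual = [5.0, 3.5, 7.5]
# Hypothetical predictions, for illustration only
predicted = [5.0, 4.5, 6.5]

print(rmse(predicted, actual))
```

With a dask_cudf.Series returned from predict, the same three steps (subtract, square, mean then square root) would run lazily on the GPU partitions, with a single compute() at the end.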
rnyak added the Needs Triage and feature request labels Mar 16, 2020
rnyak changed the title from "get dask_cudf.Series from xgb.dask.predict()" to "[FEA] get dask_cudf.Series from xgb.dask.predict()" Mar 17, 2020
rnyak closed this as completed Mar 17, 2020
@rnyak
Contributor Author

rnyak commented Mar 17, 2020

Moved to dmlc/xgboost.

bdice removed the Needs Triage label Mar 4, 2024