Starting the change for XGBoost integration into EVADb. #1232

Merged
merged 8 commits into georgia-tech-db:staging on Oct 18, 2023

Conversation

jineetd
Contributor

@jineetd jineetd commented Sep 28, 2023

No description provided.

@jineetd
Contributor Author

jineetd commented Sep 28, 2023

Unit test added is failing with the following error:

ERROR    evadb.utils.logging_manager:plan_executor.py:186 `best_iteration` is only defined when early stopping is used.
Traceback (most recent call last):
  File "/Users/jineetdesai/evadb/evadb/executor/plan_executor.py", line 182, in execute_plan
    yield from output
  File "/Users/jineetdesai/evadb/evadb/executor/create_function_executor.py", line 447, in exec
    ) = self.handle_xgboost_function()
  File "/Users/jineetdesai/evadb/evadb/executor/create_function_executor.py", line 196, in handle_xgboost_function
    model.fit(
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/automl/automl.py", line 1928, in fit
    self._search()
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/automl/automl.py", line 2482, in _search
    self._search_sequential()
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/automl/automl.py", line 2318, in _search_sequential
    analysis = tune.run(
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/tune/tune.py", line 808, in run
    result = evaluation_function(trial_to_run.config)
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/automl/state.py", line 302, in _compute_with_config_base
    ) = compute_estimator(
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/automl/ml.py", line 369, in compute_estimator
    val_loss, metric_for_logging, train_time, pred_time = task.evaluate_model_CV(
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/automl/task/generic_task.py", line 737, in evaluate_model_CV
    val_loss_i, metric_i, train_time_i, pred_time_i = get_val_loss(
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/automl/ml.py", line 494, in get_val_loss
    estimator.fit(X_train, y_train, budget=budget, free_mem_ratio=free_mem_ratio, **fit_kwargs)
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/automl/model.py", line 1652, in fit
    return super().fit(X_train, y_train, budget, free_mem_ratio, **kwargs)
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/flaml/automl/model.py", line 1415, in fit
    self._model.get_booster().best_iteration
  File "/Users/jineetdesai/evadb/test_evadb_venv/lib/python3.8/site-packages/xgboost/core.py", line 2602, in best_iteration
    raise AttributeError(
AttributeError: `best_iteration` is only defined when early stopping is used.

Will need to check this further.
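
The traceback ends with FLAML reading `best_iteration` from a booster that was trained without early stopping, which this XGBoost version rejects with an `AttributeError`. As a minimal sketch of reading that attribute defensively (not FLAML's or EvaDB's actual fix; `FakeBooster` and `safe_best_iteration` are hypothetical stand-ins so the snippet runs without xgboost installed):

```python
from typing import Optional


class FakeBooster:
    """Hypothetical stand-in that mimics the property behavior in the traceback."""

    def __init__(self, best_iteration: Optional[int] = None) -> None:
        self._best_iteration = best_iteration

    @property
    def best_iteration(self) -> int:
        if self._best_iteration is None:
            # Same message as the error reported in the traceback above.
            raise AttributeError(
                "`best_iteration` is only defined when early stopping is used."
            )
        return self._best_iteration


def safe_best_iteration(booster, default: Optional[int] = None) -> Optional[int]:
    """Return booster.best_iteration, or `default` when it is not defined."""
    try:
        return booster.best_iteration
    except AttributeError:
        return default
```

Calling `safe_best_iteration(FakeBooster())` returns `None` instead of raising, while a booster that does carry a best iteration returns it unchanged. Whether a guard like this or a dependency version constraint is the right remedy is a separate decision.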

@xzdandy
Collaborator

xzdandy commented Sep 28, 2023

Please merge the latest staging. Thanks!

@xzdandy xzdandy added the AI Engines Features, Bugs, related to AI Engines label Sep 29, 2023
@xzdandy xzdandy added this to the v0.3.7 milestone Sep 29, 2023
@xzdandy xzdandy linked an issue Sep 29, 2023 that may be closed by this pull request
2 tasks
@xzdandy xzdandy modified the milestones: v0.3.7, v0.3.8 Sep 30, 2023
@xzdandy
Collaborator

xzdandy commented Oct 11, 2023

Thanks @jineetd. I will add the dependency requirements.

@xzdandy
Collaborator

xzdandy commented Oct 11, 2023

Hi @jineetd, please update the documentation. You can use https://evadb.readthedocs.io/en/stable/source/reference/ai/model-train-sklearn.html as the reference.

@jineetd
Contributor Author

jineetd commented Oct 11, 2023

Sure @xzdandy, I shall update the documentation for XGBoost.

def forward(self, frames: pd.DataFrame) -> pd.DataFrame:
    # Last column is the value to predict, hence don't pass that to the
    # predict method.
    predictions = self.model.predict(frames.iloc[:, :-1])
Collaborator

I don't think there is any guarantee that the last column is the value to predict in this case. We need to store the column to predict. You can check the Ludwig integration again for an example.

Contributor Author

But it seems that the auto_train methods for Ludwig and XGBoost are different. In Ludwig, you provide the entire dataset (X + Y) to the auto train method and then specify the column that is supposed to act as Y. For XGBoost, by contrast, auto train takes the feature matrix X and the prediction variable Y separately.

Collaborator

Could you elaborate on what the difference is?

What I meant originally is that the following query will not work, because the predict column is not the last one.

CREATE FUNCTION IF NOT EXISTS PredictRent FROM
( SELECT number_of_rooms, number_of_bathrooms, days_on_market, rental_price FROM HomeRentals )
TYPE XGBoost
PREDICT 'number_of_rooms';

Contributor Author

As discussed, we now pass the prediction column to the .py model files
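
As a rough illustration of the point settled in this thread, the target can be separated by name rather than by position. The sketch below uses plain dicts in place of DataFrame rows so it runs without pandas; `split_features_and_target` is a hypothetical helper, not code from this PR, and the sample values are made up:

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def split_features_and_target(
    rows: List[Row], predict_column: str
) -> Tuple[List[Row], List[float]]:
    """Separate the target column by name, wherever it sits in the row."""
    features = [
        {name: value for name, value in row.items() if name != predict_column}
        for row in rows
    ]
    targets = [row[predict_column] for row in rows]
    return features, targets


# Column order follows the query in this thread; the target is the FIRST column.
rows = [
    {"number_of_rooms": 2, "number_of_bathrooms": 1,
     "days_on_market": 10, "rental_price": 1500},
    {"number_of_rooms": 3, "number_of_bathrooms": 2,
     "days_on_market": 4, "rental_price": 2300},
]
features, targets = split_features_and_target(rows, "number_of_rooms")
```

Because the split is keyed on the stored column name, `PREDICT 'number_of_rooms'` behaves the same as `PREDICT 'rental_price'`, regardless of where the column appears in the `SELECT` list.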

PREDICT 'rental_price';
In the above query, you are creating a new customized function by training a model from the ``HomeRentals`` table using the ``Flaml XGBoost`` framework.
The ``rental_price`` column will be the target column for prediction, while the remaining columns from the ``SELECT`` query are the inputs.
Collaborator

We need to add documentation on all the parameters XGBoost supports. time_limit and metric are the two parameters we support now.
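
The traceback earlier in this conversation shows training going through FLAML's `AutoML.fit`. As a sketch of how the two options could be forwarded, assuming `time_limit` maps onto FLAML's `time_budget` keyword (an assumption, not confirmed in this thread), with a hypothetical helper name and illustrative defaults:

```python
from typing import Any, Dict


def build_automl_settings(arg_map: Dict[str, Any]) -> Dict[str, Any]:
    """Translate user-facing options into keyword arguments for AutoML.fit."""
    return {
        # Defaults below are assumed for illustration only.
        "time_budget": arg_map.get("time_limit", 120),
        "metric": arg_map.get("metric", "r2"),
        # Restrict the search to XGBoost and treat the task as regression.
        "estimator_list": ["xgboost"],
        "task": "regression",
    }


settings = build_automl_settings({"time_limit": 30, "metric": "rmse"})
```

The resulting dict would then be unpacked into the fit call, for example `model.fit(X_train, y_train, **settings)`. Documenting the accepted values and defaults next to the query syntax would spare users from reading the executor code.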

Collaborator

@xzdandy xzdandy left a comment

Thanks! Could you fix the merge conflicts?

@xzdandy xzdandy merged commit 201f901 into georgia-tech-db:staging Oct 18, 2023
7 checks passed
a0x8o pushed a commit to alexxx-db/eva that referenced this pull request Oct 30, 2023
a0x8o pushed a commit to alexxx-db/eva that referenced this pull request Nov 22, 2023
Development

Successfully merging this pull request may close these issues.

xgboost Integration
2 participants