docs: add snippet for creating boosted tree model #1142

Merged
merged 9 commits into from
Nov 19, 2024
26 changes: 25 additions & 1 deletion samples/snippets/classification_boosted_tree_model_test.py
@@ -14,7 +14,7 @@


def test_boosted_tree_model(random_model_id: str) -> None:
# your_model_id = random_model_id
your_model_id = random_model_id
# [START bigquery_dataframes_bqml_boosted_tree_prepare]
import bigframes.pandas as bpd

@@ -39,4 +39,28 @@ def test_boosted_tree_model(random_model_id: str) -> None:
)
del input_data["functional_weight"]
# [END bigquery_dataframes_bqml_boosted_tree_prepare]
# [START bigquery_dataframes_bqml_boosted_tree_create]
from bigframes.ml import ensemble

# input_data is defined in an earlier step.
training_data = input_data[input_data["dataframe"] == "training"]
Collaborator

No action needed, but something to consider for the future: it would be nice to update the prepare section above to work without referencing an index (e.g. when ordering mode = "partial").

We have a few options, but the easiest will be to start with a string column and add (True, "training") as the last entry in the list of cases.

Aside: we have an issue open (349926559) to allow selecting any column in the dataframe (such as functional_weight, which would be a natural choice in this example) even if it's a different type, so long as a True (default) case is provided.
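
To illustrate the idea of ending the case list with a catch-all (True, "training") entry, here is a minimal plain-Python sketch of first-match-wins case semantics. It does not use the bigframes API; `assign_split`, the `Case` alias, and the modulo-based conditions are hypothetical names and rules chosen only for illustration, since the actual cases in the prepare section are not shown in this excerpt.

```python
from typing import Callable, Sequence

# Each case pairs a predicate with the label to assign when it matches.
Case = tuple[Callable[[int], bool], str]


def assign_split(value: int, cases: Sequence[Case]) -> str:
    """Return the label of the first case whose predicate accepts value."""
    for predicate, label in cases:
        if predicate(value):
            return label
    raise ValueError("no case matched; add a catch-all case at the end")


# Hypothetical split rules, for illustration only.
cases: list[Case] = [
    (lambda v: v % 10 == 8, "evaluation"),
    (lambda v: v % 10 == 9, "prediction"),
    (lambda v: True, "training"),  # catch-all, standing in for (True, "training")
]

labels = [assign_split(v, cases) for v in (18, 29, 30)]
```

Because the catch-all sits last, every row receives a label without needing a pre-filled default column built from an index, which is the property the comment is after.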

X = training_data.drop(columns=["income_bracket", "dataframe"])
y = training_data["income_bracket"]
Collaborator

Presumably you ran this code sample and it worked OK? I remember we had some bugs in the past where y had to be a DataFrame, not a Series, so just double-checking.

Contributor Author


# create and train the model
census_model = ensemble.XGBClassifier(
    n_estimators=1,
    booster="gbtree",
    tree_method="hist",
    max_iterations=1,  # For a more accurate model, try 50 iterations.
    subsample=0.85,
)
census_model.fit(X, y)

census_model.to_gbq(
    your_model_id,  # For example: "your-project.census.census_model"
    replace=True,
)
# [END bigquery_dataframes_bqml_boosted_tree_create]
assert input_data is not None
assert census_model is not None