docs: add snippet for creating boosted tree model #1142

Merged
merged 9 commits into from
Nov 19, 2024

@rey-esp rey-esp commented Nov 11, 2024

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. samples Issues that are directly related to samples. labels Nov 11, 2024
@product-auto-label product-auto-label bot added size: s Pull request size is small. and removed size: m Pull request size is medium. labels Nov 11, 2024
import bigframes.ml.linear_model

# input_data is defined in an earlier step.
training_data = input_data[input_data["dataframe"] == "training"]

No action needed, but something to consider for the future: it would be nice to update the prepare section above so that it works without referencing an index (e.g. when ordering mode = "partial").

We have a few options, but the easiest will be to start with a string column and add (True, "training") as the last in the list of cases.

Aside: we have an issue open (349926559) to allow selecting any column in the dataframe (such as functional_weight, which would be a natural choice in this example) even if it's a different type, so long as a True (default) case is provided.

# input_data is defined in an earlier step.
training_data = input_data[input_data["dataframe"] == "training"]
X = training_data.drop(columns=["income_bracket", "dataframe"])
y = training_data["income_bracket"]

Presumably you ran this code sample and it worked OK? I remember we had some bugs in the past where y had to be a DataFrame, not a Series, so I'm just double-checking.

Comment on lines 52 to 56
census_model = bigframes.ml.linear_model.LogisticRegression(
    # model_type="BOOSTED_TREE_CLASSIFIER",
    # booster_type="gbtree",
    max_iterations=50,
)

I don't think we should be doing LogisticRegression here. In the SQL we do use model_type='BOOSTED_TREE_CLASSIFIER', but in BigQuery DataFrames we normally use separate Python classes to represent the different model types.

A few ways to discover which class to use:

These should give you some strong hints as to which class to use instead.
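
One more way to look for a candidate class, sketched here as an illustration rather than as one of the reviewer's suggestions, is to list the public classes that a `bigframes.ml` submodule defines. The helper name `public_classes` is hypothetical, and the choice of submodule to inspect (such as `bigframes.ml.ensemble`, which a later comment in this thread points to) is an assumption.

```python
import importlib
import inspect


def public_classes(module_name: str) -> list[str]:
    """Return the sorted names of public classes defined in `module_name`.

    Hypothetical helper for exploring a library; not part of bigframes.
    """
    module = importlib.import_module(module_name)
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        # Skip private names and classes merely imported from elsewhere.
        if not name.startswith("_")
        and (obj.__module__ or "").startswith(module.__name__)
    )


# Example (assumes bigframes is installed; importing the module alone
# should not start a BigQuery session):
#   print(public_classes("bigframes.ml.ensemble"))
```

The helper only imports and inspects the module, so it can be tried on any installed package before running it against `bigframes.ml`.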

@tswast tswast Nov 18, 2024

Copying this from an internal comment I made for visibility:

Just like scikit-learn, it's one of the "ensemble" methods: https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.ensemble

Normally we try to use the scikit-learn class names too, but I think we may have added this class before GradientBoostingClassifier was in scikit-learn.
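
A minimal sketch of what the snippet might look like with an ensemble class in place of `LogisticRegression`. The thread does not name the class, so `bigframes.ml.ensemble.XGBClassifier` is an assumption here, as are the wrapper function `train_boosted_tree_classifier` (hypothetical) and the choice of keyword arguments, which simply mirror the commented-out `booster_type` and the `max_iterations=50` from the snippet under review. It is not necessarily the code that was merged.

```python
def train_boosted_tree_classifier(input_data, max_iterations=50):
    """Fit a boosted tree classifier on the "training" rows of `input_data`.

    `input_data` is assumed to be the BigQuery DataFrames DataFrame built in
    the earlier prepare step, with "dataframe" and "income_bracket" columns.
    """
    # Imported lazily so this module loads even where bigframes is absent.
    from bigframes.ml import ensemble

    training_data = input_data[input_data["dataframe"] == "training"]
    X = training_data.drop(columns=["income_bracket", "dataframe"])
    y = training_data["income_bracket"]

    # Assumption: XGBClassifier is the ensemble class the reviewer means.
    census_model = ensemble.XGBClassifier(
        booster="gbtree",
        max_iterations=max_iterations,
    )
    census_model.fit(X, y)
    return census_model
```

Calling the function needs an authenticated BigQuery DataFrames session, since `fit` runs the training job in BigQuery.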

@tswast tswast marked this pull request as ready for review November 19, 2024 20:42
@tswast tswast requested review from a team as code owners November 19, 2024 20:42
@tswast tswast requested a review from m-strzelczyk November 19, 2024 20:42
snippet-bot bot commented Nov 19, 2024

Here is the summary of changes.

You are about to add 1 region tag.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add snippet-bot:force-run label or use the checkbox below:

  • Refresh this comment

@rey-esp rey-esp added the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 19, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Nov 19, 2024
@rey-esp rey-esp merged commit a972668 into main Nov 19, 2024
23 checks passed
@rey-esp rey-esp deleted the b338872698-bigframes-v1 branch November 19, 2024 22:32
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. samples Issues that are directly related to samples. size: s Pull request size is small.
3 participants