docs: add snippet for creating boosted tree model #1142
Conversation
import bigframes.ml.linear_model

# input_data is defined in an earlier step.
training_data = input_data[input_data["dataframe"] == "training"]
No action needed, but something to consider for the future: it would be nice to update the prepare section above to work without referencing an index (e.g. when ordering mode = "partial"). We have a few options, but the easiest will be to start with a string column and add (True, "training") as the last in the list of cases.

Aside: we have an issue open (349926559) to allow selecting any column in the dataframe (such as functional_weight, which would be a natural choice in this example) even if it's a different type, so long as a True (default) case is provided.
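A minimal sketch of what that suggestion might look like, assuming the prepare step builds the split column with Series.case_when and that a literal True condition is accepted as the default case (both assumptions based on the comment above). The table ID and column names are taken from the census tutorial this sample mirrors; the choice of marital_status as the base string column is arbitrary, since the True default overrides every remaining row:

```python
import bigframes.pandas as bpd

# Load the census sample data (table assumed from the published tutorial).
census_df = bpd.read_gbq("bigquery-public-data.ml_datasets.census_adult_income")
input_data = census_df.copy()

# Derive the split column without referencing the index, so the sample also
# works with ordering_mode="partial". Start from an existing string column and
# end with a (True, "training") default case, as suggested above.
bucket = input_data["functional_weight"] % 10
input_data["dataframe"] = input_data["marital_status"].case_when(
    [
        (bucket == 8, "evaluation"),
        (bucket == 9, "prediction"),
        (True, "training"),  # default case (literal True condition is an assumption)
    ]
)
```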
# input_data is defined in an earlier step.
training_data = input_data[input_data["dataframe"] == "training"]
X = training_data.drop(columns=["income_bracket", "dataframe"])
y = training_data["income_bracket"]
Presumably you ran this code sample and it worked OK? I remember we had some bugs in the past where y had to be a DataFrame rather than a Series, so just double-checking.
The code sample seems to run! Not sure if I did it right so here's the colab: https://colab.sandbox.google.com/drive/10jA6zSRiptXWrTkCcmyCT_sYBjLqGJx0?resourcekey=0-0TrIkmDzAJw_F6ONFikwaA#scrollTo=wU367u1SAj3Y
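For reference, a hedged fallback in case fitting with a Series ever regresses: selecting the label with a list keeps y as a one-column DataFrame, which is what the older behavior described above expected. The fit call is assumed from the rest of the sample, not part of this diff:

```python
# Keep y as a one-column DataFrame instead of a Series.
y = training_data[["income_bracket"]]

census_model.fit(X, y)
```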
census_model = bigframes.ml.linear_model.LogisticRegression(
    # model_type="BOOSTED_TREE_CLASSIFIER",
    # booster_type="gbtree",
    max_iterations=50,
)
I don't think we should be using LogisticRegression here. In the SQL we do use model_type='BOOSTED_TREE_CLASSIFIER', but in BigQuery DataFrames we normally use separate Python classes to represent the different model types.

A few ways to discover which class to use:

- Search our code for BOOSTED_TREE_CLASSIFIER
- Google search for boosted trees BigFrames

These should give you some strong hints as to which class to use instead.
Copying this from an internal comment I made, for visibility:

Just like scikit-learn, it's one of the "ensemble" methods: https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.ensemble

Normally we try to use the scikit-learn class names too, but I think we may have added this class before GradientBoostingClassifier was in scikit-learn.
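A hedged sketch of what the comments above are pointing at: the boosted-tree classifier lives in bigframes.ml.ensemble as XGBClassifier, which creates a BQML model with model_type='BOOSTED_TREE_CLASSIFIER' under the hood. The parameter names below (booster, max_iterations) are assumptions from the current reference docs, not from this PR:

```python
import bigframes.ml.ensemble

# Use the ensemble module instead of bigframes.ml.linear_model.
census_model = bigframes.ml.ensemble.XGBClassifier(
    booster="gbtree",      # assumed equivalent of booster_type="gbtree"
    max_iterations=50,
)

census_model.fit(X, y)
```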
Here is the summary of changes. You are about to add 1 region tag.
This comment is generated by snippet-bot.
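For context, sample repos mark snippet boundaries with region-tag comments, which is what snippet-bot counts here. A sketch of the convention (the tag name below is hypothetical, not the one added in this PR):

```python
# [START bigquery_dataframes_bqml_boosted_tree_example]
import bigframes.ml.ensemble

model = bigframes.ml.ensemble.XGBClassifier(max_iterations=50)
# [END bigquery_dataframes_bqml_boosted_tree_example]
```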