docs: add predict sample to samples/snippets/bqml_getting_started_test.py #388

Merged
merged 27 commits into from
Mar 8, 2024
Changes from 13 commits
Commits
4cf9a0e docs: Add a sample to demonstrate the evaluation results (DevStephanie, Jan 31, 2024)
ffcf185 Adding comments explaining logistic regression results (DevStephanie, Feb 1, 2024)
8e5ba68 editing read_gbd explanation (DevStephanie, Feb 5, 2024)
202bf76 docs: add predict sample to samples/snippets/bqml_getting_started_tes… (DevStephanie, Feb 23, 2024)
ca3783f Merge remote-tracking branch 'origin/main' into bqml_predict1 (DevStephanie, Feb 23, 2024)
d3a8d8d Merge branch 'main' into bqml_predict1 (DevStephanie, Feb 26, 2024)
7198e7f Merge branch 'main' into bqml_predict1 (DevStephanie, Feb 27, 2024)
4984cfc Merge branch 'main' into bqml_predict1 (DevStephanie, Feb 27, 2024)
b89f30b Merge branch 'main' of https://github.com/googleapis/python-bigquery-… (DevStephanie, Feb 28, 2024)
0aba4d2 Merge branch 'main' into bqml_predict1 (DevStephanie, Feb 28, 2024)
fb79526 correcting variable names (DevStephanie, Feb 28, 2024)
b6d6430 Merge remote-tracking branch 'refs/remotes/origin/main' into bqml_pre… (DevStephanie, Feb 28, 2024)
262661c Merge remote-tracking branch 'origin/bqml_predict1' into bqml_predict1 (DevStephanie, Feb 28, 2024)
ad398ad Correcting python variables (DevStephanie, Mar 4, 2024)
f0eaa6c Merge branch 'main' into bqml_predict1 (DevStephanie, Mar 4, 2024)
7f06521 Merge branch 'main' into bqml_predict2 (DevStephanie, Mar 4, 2024)
ca17b39 feat: add predict by visit to samples/snippets/bqml_getting_started_t… (DevStephanie, Mar 6, 2024)
190cf9e Merge branch 'bqml_predict2' into bqml_predict1 (DevStephanie, Mar 6, 2024)
9df8bdd file (DevStephanie, Mar 6, 2024)
1a25f5f file (DevStephanie, Mar 6, 2024)
daa3bdb file (DevStephanie, Mar 6, 2024)
bde7a12 Merge branch 'main' into bqml_predict1 (tswast, Mar 6, 2024)
3613489 Merge branch 'main' into bqml_predict1 (tswast, Mar 6, 2024)
249631c Merge branch 'main' into bqml_predict1 (tswast, Mar 7, 2024)
6ef78bb Merge branch 'main' into bqml_predict1 (tswast, Mar 7, 2024)
aa6d323 Merge branch 'main' into bqml_predict1 (tswast, Mar 7, 2024)
defabf8 Merge branch 'main' into bqml_predict1 (tswast, Mar 8, 2024)
87 changes: 57 additions & 30 deletions samples/snippets/bqml_getting_started_test.py
@@ -26,17 +26,12 @@ def test_bqml_getting_started(random_model_id):
     # https://github.com/googleapis/python-bigquery-dataframes/issues/169
     # for updates to `read_gbq` to support wildcard tables.

-    df = bpd.read_gbq(
-        """
-        -- Since the order of rows isn't useful for the model training,
-        -- generate a random ID to use as the index for the DataFrame.
-        SELECT GENERATE_UUID() AS rowindex, *
-        FROM
-            `bigquery-public-data.google_analytics_sample.ga_sessions_*`
-        WHERE
-            _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
-        """,
-        index_col="rowindex",
+    df = bpd.read_gbq_table(
+        "bigquery-public-data.google_analytics_sample.ga_sessions_*",
+        filters=[
+            ("_table_suffix", ">=", "20170701"),
+            ("_table_suffix", "<=", "20170801"),
+        ],
     )

     # Extract the total number of transactions within
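
The `_table_suffix` bounds in this hunk are compared as strings. A minimal sketch, in plain Python with made-up suffix values, of why that works for zero-padded `YYYYMMDD` suffixes: lexicographic order and chronological order coincide. The helper names `in_suffix_range` and `in_date_range` are hypothetical and are not part of BigQuery DataFrames.

```python
from datetime import datetime


def in_suffix_range(suffix: str, start: str, end: str) -> bool:
    """Return True if a YYYYMMDD suffix falls in [start, end], compared as strings."""
    return start <= suffix <= end


def in_date_range(suffix: str, start: str, end: str) -> bool:
    """Same check, but parsing each value as a calendar date first."""
    fmt = "%Y%m%d"
    parsed = datetime.strptime(suffix, fmt)
    return datetime.strptime(start, fmt) <= parsed <= datetime.strptime(end, fmt)


# Hypothetical suffixes, chosen only to exercise both sides of the range.
suffixes = ["20160731", "20160801", "20170630", "20170701"]
string_result = [in_suffix_range(s, "20160801", "20170630") for s in suffixes]
date_result = [in_date_range(s, "20160801", "20170630") for s in suffixes]
```

Both lists agree, so the string comparison selects the same tables a date comparison would. This only holds while every suffix has the same zero-padded width.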
@@ -56,11 +51,11 @@ def test_bqml_getting_started(random_model_id):
     label = transactions.notnull().map({True: 1, False: 0})

     # Extract the operating system of the visitor's device.
-    operatingSystem = df["device"].struct.field("operatingSystem")
-    operatingSystem = operatingSystem.fillna("")
+    operating_system = df["device"].struct.field("operatingSystem")
+    operating_system = operating_system.fillna("")

     # Extract whether the visitor's device is a mobile device.
-    isMobile = df["device"].struct.field("isMobile")
+    is_mobile = df["device"].struct.field("isMobile")

     # Extract the country from which the sessions originated, based on the IP address.
     country = df["geoNetwork"].struct.field("country").fillna("")
@@ -72,8 +67,8 @@ def test_bqml_getting_started(random_model_id):
     # to use as training data.
     features = bpd.DataFrame(
         {
-            "os": operatingSystem,
-            "is_mobile": isMobile,
+            "os": operating_system,
+            "isMobile": is_mobile,
Collaborator

The "is_mobile" string didn't need to be changed, but if you do, you must change it everywhere.

             "country": country,
             "pageviews": pageviews,
         }
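
The review comment above is about keeping the feature column names identical everywhere the features dict is built. A minimal sketch of a guard for that, assuming the training and prediction features are available as plain dicts; `check_same_columns` is a hypothetical helper written for illustration, not a BigQuery DataFrames API.

```python
def check_same_columns(training: dict, prediction: dict) -> list:
    """Return column names that appear in only one of the two feature sets."""
    return sorted(set(training) ^ set(prediction))


# Hypothetical feature dicts; the None values stand in for the real Series.
training_features = {"os": None, "is_mobile": None, "country": None, "pageviews": None}
prediction_features = {"os": None, "isMobile": None, "country": None, "pageviews": None}

# Any name listed here was renamed in one place but not the other.
mismatched = check_same_columns(training_features, prediction_features)
```

An empty result means both dicts use the same keys; a non-empty one lists the names to reconcile.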
@@ -107,27 +102,24 @@ def test_bqml_getting_started(random_model_id):
     # of the model. It was collected in the month immediately following the time
Collaborator

This comment needs to be updated too. You aren't using a WHERE clause anymore.

     # period spanned by the training data.

-    df = bpd.read_gbq(
-        """
-        SELECT GENERATE_UUID() AS rowindex, *
-        FROM
-            `bigquery-public-data.google_analytics_sample.ga_sessions_*`
-        WHERE
-            _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
-        """,
-        index_col="rowindex",
+    df = bpd.read_gbq_table(
+        "bigquery-public-data.google_analytics_sample.ga_sessions_*",
+        filters=[
+            ("_table_suffix", ">=", "20170701"),
+            ("_table_suffix", "<=", "20170801"),
+        ],
     )
     transactions = df["totals"].struct.field("transactions")
     label = transactions.notnull().map({True: 1, False: 0})
-    operatingSystem = df["device"].struct.field("operatingSystem")
-    operatingSystem = operatingSystem.fillna("")
-    isMobile = df["device"].struct.field("isMobile")
+    operating_system = df["device"].struct.field("operatingSystem")
+    operating_system = operating_system.fillna("")
+    is_mobile = df["device"].struct.field("isMobile")
     country = df["geoNetwork"].struct.field("country").fillna("")
     pageviews = df["totals"].struct.field("pageviews").fillna(0)
     features = bpd.DataFrame(
         {
-            "os": operatingSystem,
-            "is_mobile": isMobile,
+            "os": operating_system,
+            "isMobile": is_mobile,
             "country": country,
             "pageviews": pageviews,
         }
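
The `filters` argument in the added lines takes over the role of the SQL `WHERE` clause in the removed query. As a rough illustration only, the sketch below renders a list of `(column, operator, value)` tuples as the `WHERE` text they correspond to. `filters_to_where` is a hypothetical helper written for this note; it is not how BigQuery DataFrames builds queries internally, and it assumes string values joined with `AND`.

```python
def filters_to_where(filters):
    """Render (column, operator, value) tuples as a SQL-style WHERE clause.

    Illustration only: assumes string values and joins conditions with AND.
    """
    allowed_ops = {"=", "!=", "<", "<=", ">", ">="}
    conditions = []
    for column, op, value in filters:
        if op not in allowed_ops:
            raise ValueError(f"unsupported operator: {op!r}")
        # Escape single quotes so the value stays inside its SQL string literal.
        escaped = value.replace("'", "\\'")
        conditions.append(f"{column} {op} '{escaped}'")
    return "WHERE " + " AND ".join(conditions)


where_clause = filters_to_where(
    [
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ]
)
```

Because `BETWEEN` is inclusive on both ends, the rendered condition selects the same suffix range as `_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'` in the removed query.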
@@ -164,5 +156,40 @@ def test_bqml_getting_started(random_model_id):
     # [END bigquery_dataframes_bqml_getting_started_tutorial_evaluate]

     # [START bigquery_dataframes_bqml_getting_started_tutorial_predict]
Collaborator

There are two "predict" samples in https://cloud.google.com/bigquery/docs/create-machine-learning-model#use_your_model_to_predict_outcomes and https://cloud.google.com/bigquery/docs/create-machine-learning-model#predict_purchases_per_user

We'll need one for each. Let's disambiguate. I believe you are doing the first one, so bigquery_dataframes_bqml_getting_started_tutorial_predict_by_country may be a good, more specific region tag.
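
Region tags mark the span of a sample that documentation can embed, which is why each sample needs its own tag. A minimal sketch of that kind of extraction, assuming the `# [START tag]` / `# [END tag]` comment convention seen in this file; `extract_region`, the shortened tag names, and the sample source text are all hypothetical and do not reflect the actual docs tooling.

```python
def extract_region(source: str, tag: str) -> str:
    """Return the lines between '# [START tag]' and '# [END tag]'."""
    start_marker = f"# [START {tag}]"
    end_marker = f"# [END {tag}]"
    collected = []
    inside = False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == start_marker:
            inside = True
        elif stripped == end_marker:
            inside = False
        elif inside:
            collected.append(line)
    return "\n".join(collected)


# Hypothetical file contents with two distinct predict regions.
sample_source = """\
# [START tutorial_predict_by_country]
by_country = 1
# [END tutorial_predict_by_country]
# [START tutorial_predict_by_visitor]
by_visitor = 2
# [END tutorial_predict_by_visitor]
"""

snippet = extract_region(sample_source, "tutorial_predict_by_country")
```

With one shared tag, both regions would be collected into the same snippet; distinct tags keep the two predict samples separate.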

+    df = bpd.read_gbq_table(
+        "bigquery-public-data.google_analytics_sample.ga_sessions_*",
+        filters=[
+            ("_table_suffix", ">=", "20170701"),
+            ("_table_suffix", "<=", "20170801"),
+        ],
+    )
+
+    operating_system = df["device"].struct.field("operatingSystem")
+    operating_system = operating_system.fillna("")
+    is_mobile = df["device"].struct.field("isMobile")
+    country = df["geoNetwork"].struct.field("country").fillna("")
+    pageviews = df["totals"].struct.field("pageviews").fillna(0)
+    features = bpd.DataFrame(
+        {
+            "os": operating_system,
+            "isMobile": is_mobile,
+            "country": country,
+            "pageviews": pageviews,
+        }
+    )
+    # Use Logistic Regression predict method to, find more information here in
Collaborator

Incomplete sentence.

+    # [BigFrames](/bigframes/latest/bigframes.ml.linear_model.LogisticRegression#bigframes_ml_linear_model_LogisticRegression_predict)
Contributor

Would this result in a clickable link leading to the docs.google.com documentation? Asking because in the other place (line 157) we are using an absolute https://... path.

Collaborator

It will not. We should use an absolute path here in comments.

Contributor Author

Yes, corrected.

Collaborator

FYI: If this has been corrected, your change hasn't been pushed to GitHub yet.

+    predictions = model.predict(features)
+    countries = predictions.groupby(["country"])[["predicted_transactions"]].sum()
+
+    countries.sort_values(ascending=False).head(10)
+
+    predictions = model.predict(features)
+
+    total_predicted_purchases = predictions.groupby(["country"])[
+        ["predicted_transactions"]
+    ].sum()
+
+    total_predicted_purchases.sort_values(ascending=False).head(10)
Collaborator

Looks like you have some redundant code here.

Suggested change
-    countries = predictions.groupby(["country"])[["predicted_transactions"]].sum()
-    countries.sort_values(ascending=False).head(10)
-    predictions = model.predict(features)
-    total_predicted_purchases = predictions.groupby(["country"])[
-        ["predicted_transactions"]
-    ].sum()
-    total_predicted_purchases.sort_values(ascending=False).head(10)
+    total_predicted_purchases = predictions.groupby(["country"])[
+        ["predicted_transactions"]
+    ].sum()
+    total_predicted_purchases.sort_values(ascending=False).head(10)

Also, let's explain these lines with a comment. For example,

The GROUP BY and ORDER BY clauses group the results by country and order them by the sum of the predicted purchases in descending order.

The LIMIT clause is used here to display only the top 10 results.

is the equivalent in the SQL explanation. Please translate that to a Python comment.


DevStephanie marked this conversation as resolved.
     # [END bigquery_dataframes_bqml_getting_started_tutorial_predict]
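
The last review thread asks for a Python comment explaining the group, sum, sort, and top-10 steps. A minimal plain-Python sketch of that same logic, using only the standard library; the rows are invented for illustration and are not model output.

```python
from collections import defaultdict

# Hypothetical prediction rows: (country, predicted_transactions).
rows = [
    ("United States", 1),
    ("Taiwan", 1),
    ("United States", 1),
    ("Canada", 0),
    ("Taiwan", 0),
    ("United States", 0),
]

# Group the rows by country and sum the predicted purchases per group
# (the GROUP BY and SUM step).
totals = defaultdict(int)
for country, predicted in rows:
    totals[country] += predicted

# Order the countries by total predicted purchases, largest first
# (the ORDER BY ... DESC step), then keep only the top 10 (the LIMIT step).
top_countries = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:10]
```

In the sample itself, `groupby(...).sum()`, `sort_values(ascending=False)`, and `head(10)` play these three roles in the same order.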