docs: add a code sample for creating a kmeans model #267
Changes from 25 commits
@@ -0,0 +1,134 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


def test_kmeans_sample():
    # [START bigquery_dataframes_bqml_kmeans]
    import datetime

    import bigframes
    import bigframes.pandas as bpd

    bigframes.options.bigquery.project = "salemb-testing"
    # You must compute in the EU multi-region to query the London bicycles dataset.
    bigframes.options.bigquery.location = "EU"

    # Extract the information you'll need to train the k-means model later in this tutorial. Use the
    # read_gbq function to represent cycle hires data as a DataFrame.
    h = bpd.read_gbq(
        "bigquery-public-data.london_bicycles.cycle_hire",
        col_order=["start_station_name", "start_station_id", "start_date", "duration"],
    ).rename(
        columns={
            "start_station_name": "station_name",
            "start_station_id": "station_id",
        }
    )

    s = bpd.read_gbq(
        # Use ST_GEOGPOINT and ST_DISTANCE to analyze geographical data.
        # These functions determine spatial relationships between the geographical features.
        # ST_DISTANCE returns meters, so dividing by 1000 gives kilometers.
        """
        SELECT
        id,
        ST_DISTANCE(
            ST_GEOGPOINT(s.longitude, s.latitude),
            ST_GEOGPOINT(-0.1, 51.5)
        ) / 1000 AS distance_from_city_center
        FROM
        `bigquery-public-data.london_bicycles.cycle_stations` s
        """
    )

    # Define Python datetime objects in the UTC timezone for range comparison, because BigQuery stores
    # timestamp data in the UTC timezone.
    sample_time = datetime.datetime(2015, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)
    sample_time2 = datetime.datetime(2016, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)

    h = h.loc[(h["start_date"] >= sample_time) & (h["start_date"] <= sample_time2)]

    # Replace each day-of-the-week number with the corresponding "weekday" or "weekend" label by using the
    # Series.map method.
    h = h.assign(
        isweekday=h.start_date.dt.dayofweek.map(
            {
                0: "weekday",
                1: "weekday",
                2: "weekday",
                3: "weekday",
                4: "weekday",
                5: "weekend",
                6: "weekend",
            }
        )
    )

    # Supplement each trip in "h" with the station distance information from "s" by
    # merging the two DataFrames by station ID.
    merged_df = h.merge(
        right=s,
        how="inner",
        left_on="station_id",
        right_on="id",
    )

    # Engineer features to cluster the stations. For each station, find the average trip duration, number of
    # trips, and distance from city center.
    stationstats = merged_df.groupby(["station_name", "isweekday"]).agg(
        {"duration": ["mean", "count"], "distance_from_city_center": "max"}
    )
    stationstats.columns = ["duration", "num_trips", "distance_from_city_center"]
    stationstats = stationstats.sort_values(
        by="distance_from_city_center", ascending=True
    ).reset_index()

    # Expected output results: >>> stationstats.head(3)
    # station_name      isweekday  duration  num_trips  distance_from_city_center
    # Borough Road...   weekday    1110      5749       0.12624
    # Borough Road...   weekend    2125      1774       0.12624
    # Webber Street...  weekday    795       6517       0.164021
    # 3 rows × 5 columns

    # [END bigquery_dataframes_bqml_kmeans]

    # [START bigquery_dataframes_bqml_kmeans_fit]

    from bigframes.ml.cluster import KMeans

    # To determine an optimal number of clusters, fit the model with different values of n_clusters
    # (the num_clusters option of a CREATE MODEL query), find the error measure for each, and pick the
    # value at which the error measure is at its minimum.
    cluster_model = KMeans(n_clusters=4)
    cluster_model.fit(stationstats)

Let's do a It should look very similar to the getting started tutorial:
Added to_gbq() to save the model.
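For reference, the save step discussed in this thread can be sketched as a small helper. This is a minimal sketch and not the code in this PR: `fit_and_save` and the model ID in the usage comment are hypothetical names. It assumes that `fit()` returns the estimator itself, as `bigframes.ml` estimators do, so that `to_gbq(model_name, replace=...)` can be chained onto it.

```python
def fit_and_save(model, training_data, model_id):
    """Fit an estimator, then save it under model_id, replacing any existing model.

    `model` is expected to expose fit() and to_gbq(), as bigframes.ml estimators do.
    """
    # fit() returns the estimator, so the save call can be chained directly.
    return model.fit(training_data).to_gbq(model_id, replace=True)


# Usage with the objects from the sample (requires BigQuery access, so it is
# left commented out; the model ID below is a made-up placeholder):
#
# from bigframes.ml.cluster import KMeans
# saved_model = fit_and_save(
#     KMeans(n_clusters=4),
#     stationstats,
#     "bqml_tutorial.london_station_clusters",
# )
```

Passing `replace=True` keeps the sample re-runnable, because a second run overwrites the saved model instead of failing on the existing name.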
    # [END bigquery_dataframes_bqml_kmeans_fit]

    # [START bigquery_dataframes_bqml_kmeans_predict]

    # Use the str.contains method to select the stations whose names contain the string "Kennington",
    # then predict which cluster each of those stations belongs to.
    stationstats = stationstats.loc[
        stationstats["station_name"].str.contains("Kennington")
    ]

    result = cluster_model.predict(stationstats)

    # Expected output results: >>> result.head(3)
    # CENTROID_ID  NEAREST_CENTROIDS...                  station_name  isweekday  duration  num_trips  distance...
    # 1            [{'CENTROID_ID': 1, 'DISTANCE': 2     Borough...    weekday    1110      5749       0.13
    # 2            [{'CENTROID_ID': 2, 'DISTANCE': 2     Borough...    weekend    2125      1774       0.13
    # 1            [{'CENTROID_ID': 1, 'DISTANCE': 2     Webber...     weekday    795       6517       0.16
    # 3 rows × 7 columns

    # [END bigquery_dataframes_bqml_kmeans_predict]

    assert result is not None
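Two steps in the sample above rely on plain Python semantics that can be checked without BigQuery: the UTC-aware range bounds and the day-of-week numbering. The following is a minimal standard-library sketch of both, not part of the PR; `in_sample_range` and `day_type` are hypothetical helper names. It assumes only that `datetime.weekday()` uses the same Monday=0 through Sunday=6 numbering as the `dt.dayofweek` accessor used in the sample.

```python
import datetime

UTC = datetime.timezone.utc

# Same bounds as sample_time and sample_time2 in the sample.
RANGE_START = datetime.datetime(2015, 1, 1, tzinfo=UTC)
RANGE_END = datetime.datetime(2016, 1, 1, tzinfo=UTC)


def in_sample_range(ts):
    """Return True if ts falls inside the sample's range (both ends inclusive).

    ts must be timezone-aware: Python raises TypeError when an aware
    datetime is ordered against a naive one.
    """
    return RANGE_START <= ts <= RANGE_END


def day_type(ts):
    """Label a datetime "weekday" (Monday to Friday) or "weekend"."""
    # weekday() numbers days Monday=0 ... Sunday=6.
    return "weekday" if ts.weekday() < 5 else "weekend"
```

Because the upper bound uses `<=`, a trip that starts at exactly midnight UTC on 2016-01-01 is still included in the filtered data.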
It's 2024 now. These headers should reflect when the text was first written.