docs: add a code sample for creating a kmeans model #267
@@ -0,0 +1,91 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


def test_kmeans_sample():
    # [START bigquery_dataframes_bqml_kmeans]
    import datetime

    import bigframes
    import bigframes.pandas as bpd
    from bigframes import dataframe

    # Load data from BigQuery.
    # rename() is applied to the loaded DataFrame, not passed to read_gbq().
    h = bpd.read_gbq("bigquery-public-data.london_bicycles.cycle_hire").rename(
        columns={"start_station_name": "station_name", "start_station_id": "station_id"}
    )
    s = bpd.read_gbq(
        """
        SELECT
            id,
            ST_DISTANCE(
                ST_GEOGPOINT(s.longitude, s.latitude),
                ST_GEOGPOINT(-0.1, 51.5)
            ) / 1000 AS distance_from_city_center
        FROM
            `bigquery-public-data.london_bicycles.cycle_stations` s
        """
    )

    # Keep only the trips that started between 2015-01-01 and 2016-01-01 (UTC), inclusive.
    sample_time = datetime.datetime(2015, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)
    sample_time2 = datetime.datetime(2016, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)

    h = h.loc[(h["start_date"] >= sample_time) & (h["start_date"] <= sample_time2)]

    # Store the weekday/weekend label in a new "isweekday" column.
    h = h.assign(
        isweekday=h.start_date.dt.dayofweek.map(
            {
                0: "weekday",
                1: "weekday",
                2: "weekday",
                3: "weekday",
                4: "weekday",
                5: "weekend",
                6: "weekend",
            }
        )
    )

Review comment: This just runs the mapping and discards the results. You're going to need to save this somewhere. You probably want something like:
Reply: Corrected.
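
One way the mapping result might be kept rather than discarded is sketched below. It uses plain pandas as a local stand-in for `bigframes.pandas` (an assumption, so that the snippet runs without BigQuery access). The `trips` frame and its rows are made-up sample data, and `isweekday` is the column name used later in this review.

```python
import pandas as pd

# Made-up sample trips; the real data comes from BigQuery.
trips = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            ["2015-01-05 08:00", "2015-01-10 12:30", "2015-01-11 09:15"], utc=True
        ),
        "duration": [600, 1200, 900],
    }
)

# dayofweek is 0 for Monday through 6 for Sunday.
day_type = {
    0: "weekday",
    1: "weekday",
    2: "weekday",
    3: "weekday",
    4: "weekday",
    5: "weekend",
    6: "weekend",
}

# Series.map returns a new Series, so the result must be assigned to be kept.
trips = trips.assign(isweekday=trips["start_date"].dt.dayofweek.map(day_type))
```

Assigning through `assign` keeps the original columns and adds the label as a regular column, which is what a later group-by on `isweekday` would need.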

    # Merge dataframes h and s.
    merged_df = h.merge(
        right=s,
        how="inner",
        left_on="station_id",
        right_on="id",
    )
    # Create new dataframe variable from merge: 'stationstats'
    stationstats = merged_df.groupby(["station_name", "isweekday"]).agg(
        {"duration": ["mean", "count"], "distance_from_city_center": "max"}
    )

Review comment: I think there's actually a mistake in the SQL version of this tutorial. We actually want to group by both "station_name" and "isweekday", like the actual query:

SELECT
  station_name,
  isweekday,
  AVG(duration) AS duration,
  COUNT(duration) AS num_trips,
  MAX(distance_from_city_center) AS distance_from_city_center
FROM
  hs
GROUP BY
  station_name, isweekday)

https://cloud.google.com/bigquery/docs/kmeans-tutorial#run_the_query_2

Reply: Added "isweekday" to groupby() function.
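
For comparison, the SQL quoted in the review can be approximated with a two-key `groupby`. This is a sketch in plain pandas, used here as a local stand-in for `bigframes.pandas` so it runs without BigQuery access. The `merged` rows are invented sample data, and the named-aggregation form is an illustrative choice rather than what the PR itself uses.

```python
import pandas as pd

# Invented rows standing in for the joined hire/station data.
merged = pd.DataFrame(
    {
        "station_name": ["A", "A", "A", "B"],
        "isweekday": ["weekday", "weekday", "weekend", "weekday"],
        "duration": [600, 1200, 900, 300],
        "distance_from_city_center": [1.5, 1.5, 1.5, 4.0],
    }
)

# One output row per (station_name, isweekday) pair, mirroring
# GROUP BY station_name, isweekday in the SQL.
stats = merged.groupby(["station_name", "isweekday"], as_index=False).agg(
    duration=("duration", "mean"),
    num_trips=("duration", "count"),
    distance_from_city_center=("distance_from_city_center", "max"),
)
```

Named aggregation yields flat column names that match the SQL aliases (`duration`, `num_trips`, `distance_from_city_center`), whereas the dict-of-lists form produces two-level column labels.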
    # [END bigquery_dataframes_bqml_kmeans]

    # [START bigquery_dataframes_bqml_kmeans_fit]

    # import the KMeans model from bigframes.ml to cluster the data

Review comment: No need for this comment.
Reply: Removed.

    from bigframes.ml.cluster import KMeans

    cluster_model = KMeans(n_clusters=4)

    # Placeholder destination: replace with your own "dataset.model_name".
    your_model_id = "your_dataset.london_station_clusters"
    cluster_model = cluster_model.fit(stationstats).to_gbq(your_model_id, replace=True)

Review comment: Let's do a It should look very similar to the getting started tutorial:
Reply: Added to_gbq() to save the model.

    # [END bigquery_dataframes_bqml_kmeans_fit]

    # [START bigquery_dataframes_bqml_kmeans_predict]

    # Turn the group keys (including station_name) back into regular columns.
    stationstats = stationstats.reset_index()

    # Use the 'contains' function to find all stations whose name contains "Kennington".
    stationstats = stationstats.loc[
        stationstats["station_name"].str.contains("Kennington")
    ]

    # Predict using the model.
    result = cluster_model.predict(stationstats)

    # [END bigquery_dataframes_bqml_kmeans_predict]

    assert result is not None
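
The row filter by station name can be tried locally as sketched below. Plain pandas is used as a stand-in for `bigframes.pandas` (an assumption, so that no BigQuery access is needed), and the station names and durations are illustrative sample values rather than query results.

```python
import pandas as pd

# Illustrative sample rows; not actual query output.
stats = pd.DataFrame(
    {
        "station_name": [
            "Kennington Road, Vauxhall",
            "Hyde Park Corner, Hyde Park",
            "Kennington Oval, Oval",
        ],
        "duration": [950.0, 2100.0, 1010.0],
    }
)

# The .str accessor is defined on a string column (a Series), not on the
# whole DataFrame, so the column is selected first.
mask = stats["station_name"].str.contains("Kennington", regex=False)

# Boolean indexing with .loc keeps only the matching rows.
kennington = stats.loc[mask]
```

Passing `regex=False` treats the search term as a literal substring, which is enough for a plain name match.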

Review comment: It's 2024 now. These headers should reflect when the text was first written.