Python SDK create_dataset is actually creating dataset in BQ #201

Closed
budi opened this issue May 27, 2019 · 1 comment · Fixed by #208

Comments

budi commented May 27, 2019

Expected Behavior

In the SDK, a "dataset" means a collection of features selected by the user. Although BigQuery uses the same term, the SDK should not create a BigQuery dataset every time this function is called; it should only create a view in the feast dataset.

Current Behavior

It creates a new BigQuery Dataset.

Steps to reproduce

Use the quickstart:

# Assumes FeatureSet, fs (the Feast client) and STAGING_LOCATION are already defined, as in the quickstart.
feature_set = FeatureSet(
    entity="ride",
    features=[
        "ride.log_trip_duration",
        "ride.distance_haversine",
        "ride.distance_dummy_manhattan",
        "ride.direction",
        "ride.month",
        "ride.day_of_month",
        "ride.hour",
        "ride.day_of_week",
        "ride.vi_1",
        "ride.vi_2",
        "ride.sf_n",
        "ride.sf_y",
    ],
)
# This call is the one that creates a new BigQuery dataset.
dataset_info = fs.create_dataset(feature_set, "2016-06-01", "2016-08-01")
dataset = fs.download_dataset_to_df(dataset_info, staging_location=STAGING_LOCATION)

dataset.head()

Specifications

Possible Solution

  • Fix create_dataset so that it creates a view in the feast dataset instead of a new BigQuery dataset
  • Write down the definition of "dataset"
  • Rename it to create_view?

woop commented Jun 5, 2019

Hey @budi. This issue is a bit confusing.

To the client, a dataset should be a materialization of a feature set: a collection of rows for specific columns (features). We should not care how things are named in BQ (BQ table vs. BQ dataset).

In this case I think it is important that we create a table as a snapshot of the data (not a BQ dataset), so that the dataset is immutable. A view would not provide that, because a view is re-evaluated against the underlying data every time it is queried. If the user wants new data, they should create a new dataset using the same features and time range query.
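
To illustrate the snapshot-versus-view point, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for BigQuery, so it only demonstrates the general semantics, not Feast's implementation or BigQuery's exact behavior. The table and column names (ride_features, dataset_view, dataset_snapshot) are hypothetical and chosen only for this example.

```python
import sqlite3


def snapshot_vs_view_demo():
    """Return (view_rows, snapshot_rows) after the source table gains a row."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical source table standing in for the stored feature data.
    cur.execute(
        "CREATE TABLE ride_features (ride_id INTEGER, distance_haversine REAL)"
    )
    cur.executemany(
        "INSERT INTO ride_features VALUES (?, ?)", [(1, 1.5), (2, 3.2)]
    )

    # Option A: a view, which is re-evaluated on every query.
    cur.execute("CREATE VIEW dataset_view AS SELECT * FROM ride_features")
    # Option B: a snapshot table, which copies the rows once at creation time.
    cur.execute("CREATE TABLE dataset_snapshot AS SELECT * FROM ride_features")

    # New data arrives after the "dataset" was created.
    cur.execute("INSERT INTO ride_features VALUES (3, 0.7)")

    view_rows = cur.execute("SELECT COUNT(*) FROM dataset_view").fetchone()[0]
    snapshot_rows = cur.execute(
        "SELECT COUNT(*) FROM dataset_snapshot"
    ).fetchone()[0]
    conn.close()
    return view_rows, snapshot_rows


view_rows, snapshot_rows = snapshot_vs_view_demo()
print(view_rows, snapshot_rows)
```

The view picks up the row inserted later, while the snapshot table still holds only the rows that existed when it was created. That is the immutability property argued for above: two users reading the same snapshot see the same data regardless of later ingestion.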
