Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Incorporate substrait ODFVs into ibis-based offline store queries #4102

Merged
merged 1 commit into from
Apr 19, 2024

Conversation

tokoko
Copy link
Collaborator

@tokoko tokoko commented Apr 16, 2024

What this PR does / why we need it:

This PR changes the offline flow for ODFV execution in ibis-based offline stores (currently only duckdb). Instead of collecting raw offline store output into a pyarrow table and applying a substrait transformation with acero, ibis-based offline stores now get ibis functions and directly extend ibis logical plan, meaning ODFVs will be executed by the offline store engine itself.

The feature necessitated a couple of underlying changes in substrait ODFVs:

  1. substrait ODVS now also store ibis python functions in protos to directly apply these functions in transform_ibis. This can probably be avoided in the future if there's a good substrait to ibis compiler in place, but currently is the simplest solution. Note: there are other problems here as well, substrait plans need to know precisely what input columns they will consume, this is problematic in the offline flow as there's no way to know beforehand which features the user will request in get_historical_features call. To correctly apply ODFVs, one would need to apply each ODFV to a relevant subset of columns and then join all tables back together (kind of like how it's done in the online flow), which is a horrible waste of resources.
  2. Unlike pandas transformations, ibis functions must now return all the columns that are passed as an input to the function (in addition to on demand features) as opposed to returning only the on demand features that they generate. This is to avoid an extra join back to the main table after ibis functions are applied to the output. This way multiple ODFVs can easily be chained, each mutating the offline store output to append the relevant columns.

With this PR, all of the planned ibis/substrait features will now be in place in python sdk. The next steps will be to enable native substrait ODFV execution in non-python sdks as well.

Which issue(s) this PR fixes:

Fixes #3979

Copy link
Collaborator

@HaoXuAI HaoXuAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, maybe a bad question, does it mean the substrait transformation is only available in ibis offline store? Is it possible to use it with other offline store? Suppose I have a s3 file sources and want to use the substrait transform directly

@tokoko
Copy link
Collaborator Author

tokoko commented Apr 16, 2024

@HaoXuAI No, it's applicable for all offline stores, the difference is that for non-ibis offline stores the whole dataset will be collected to a single process as an arrow table and transform_arrow method will be used to apply transformation. (This is true for both substrait and pandas transformations)

def transform_arrow(self, pa_table: pyarrow.Table) -> pyarrow.Table:

What this PR adds on top of this is that for ibis-based offline stores collect to arrow is no longer necessary and instead of transform_arrow method, newly added transform_ibis will be used to directly apply transformation with the offline store engine.

P.S. duckdb offline store which is for now the the only ibis-based implementation can work with s3 file sources even now.

Copy link
Collaborator

@HaoXuAI HaoXuAI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@HaoXuAI HaoXuAI merged commit c3a102f into feast-dev:master Apr 19, 2024
21 checks passed
@tokoko tokoko deleted the duckdb-substrait-odfv branch April 19, 2024 17:51
franciscojavierarceo pushed a commit that referenced this pull request May 24, 2024
# [0.38.0](v0.37.0...v0.38.0) (2024-05-24)

### Bug Fixes

* Add vector database doc ([#4165](#4165)) ([37f36b6](37f36b6))
* Change checkout action back to v3 from v5 which isn't released yet ([#4147](#4147)) ([9523fff](9523fff))
* Change numpy version <1.25 dependency to <2 in setup.py ([#4085](#4085)) ([2ba71ff](2ba71ff)), closes [#4084](#4084)
* Changed the code the way mysql container is initialized.  ([#4140](#4140)) ([8b5698f](8b5698f)), closes [#4126](#4126)
* Correct nightly install command, move all installs to uv ([#4164](#4164)) ([c86d594](c86d594))
* Default value is not set in Redis connection string using environment variable ([#4136](#4136)) ([95acfb4](95acfb4)), closes [#3669](#3669)
* Get container host addresses from testcontainers (java) ([#4125](#4125)) ([9184dde](9184dde))
* Get rid of empty string `name_alias` during feature view projection deserialization  ([#4116](#4116)) ([65056ce](65056ce))
* Helm chart `feast-feature-server`, improve Service template name ([#4161](#4161)) ([dedc164](dedc164))
* Improve the code related to on-demand-featureview. ([#4203](#4203)) ([d91d7e0](d91d7e0))
* Integration tests for async sdk method ([#4201](#4201)) ([08c44ae](08c44ae))
* Make sure schema is used when calling `get_table_query_string` method for Snowflake datasource ([#4131](#4131)) ([c1579c7](c1579c7))
* Make sure schema is used when generating `from_expression` for Snowflake ([#4177](#4177)) ([5051da7](5051da7))
* Pass native input values to `get_online_features` from feature server ([#4117](#4117)) ([60756cb](60756cb))
* Pass region to S3 client only if set (Java) ([#4151](#4151)) ([b8087f7](b8087f7))
* Pgvector patch ([#4108](#4108)) ([ad45bb4](ad45bb4))
* Update doc ([#4153](#4153)) ([e873636](e873636))
* Update master-only benchmark bucket name due to credential update ([#4183](#4183)) ([e88f1e3](e88f1e3))
* Updating the instructions for quickstart guide. ([#4120](#4120)) ([0c30e96](0c30e96))
* Upgrading the test container so that local tests works with updated d… ([#4155](#4155)) ([93ddb11](93ddb11))

### Features

* Add a Kubernetes Operator for the Feast Feature Server ([#4145](#4145)) ([4a696dc](4a696dc))
* Add delta format to `FileSource`, add support for it in ibis/duckdb ([#4123](#4123)) ([2b6f1d0](2b6f1d0))
* Add materialization support to ibis/duckdb ([#4173](#4173)) ([369ca98](369ca98))
* Add optional private key params to Snowflake config ([#4205](#4205)) ([20f5419](20f5419))
* Add s3 remote storage export for duckdb ([#4195](#4195)) ([6a04c48](6a04c48))
* Adding DatastoreOnlineStore 'database' argument. ([#4180](#4180)) ([e739745](e739745))
* Adding get_online_features_async to feature store sdk ([#4172](#4172)) ([311efc5](311efc5))
* Adding support for dictionary writes to online store  ([#4156](#4156)) ([abfac01](abfac01))
* Elasticsearch vector database ([#4188](#4188)) ([bf99640](bf99640))
* Enable other distance metrics for Vector DB and Update docs ([#4170](#4170)) ([ba9f4ef](ba9f4ef))
* Feast/IKV datetime edgecase errors ([#4211](#4211)) ([bdae562](bdae562))
* Feast/IKV documenation language changes ([#4149](#4149)) ([690a621](690a621))
* Feast/IKV online store contrib plugin integration ([#4068](#4068)) ([f2b4eb9](f2b4eb9))
* Feast/IKV online store documentation ([#4146](#4146)) ([73601e4](73601e4))
* Feast/IKV upgrade client version ([#4200](#4200)) ([0e42150](0e42150))
* Incorporate substrait ODFVs into ibis-based offline store queries ([#4102](#4102)) ([c3a102f](c3a102f))
* Isolate input-dependent calculations in `get_online_features` ([#4041](#4041)) ([2a6edea](2a6edea))
* Make arrow primary interchange for online ODFV execution ([#4143](#4143)) ([3fdb716](3fdb716))
* Move data source validation entrypoint to offline store ([#4197](#4197)) ([a17725d](a17725d))
* Upgrading python version to 3.11, adding support for 3.11 as well. ([#4159](#4159)) ([4b1634f](4b1634f)), closes [#4152](#4152) [#4114](#4114)

### Reverts

* Reverts "fix: Using version args to install the correct feast version" ([#4112](#4112)) ([b66baa4](b66baa4)), closes [#3953](#3953)
franciscojavierarceo pushed a commit that referenced this pull request May 27, 2024
# [0.38.0](v0.37.0...v0.38.0) (2024-05-24)

### Bug Fixes

* Add vector database doc ([#4165](#4165)) ([37f36b6](37f36b6))
* Change checkout action back to v3 from v5 which isn't released yet ([#4147](#4147)) ([9523fff](9523fff))
* Change numpy version <1.25 dependency to <2 in setup.py ([#4085](#4085)) ([2ba71ff](2ba71ff)), closes [#4084](#4084)
* Changed the code the way mysql container is initialized.  ([#4140](#4140)) ([8b5698f](8b5698f)), closes [#4126](#4126)
* Correct nightly install command, move all installs to uv ([#4164](#4164)) ([c86d594](c86d594))
* Default value is not set in Redis connection string using environment variable ([#4136](#4136)) ([95acfb4](95acfb4)), closes [#3669](#3669)
* Get container host addresses from testcontainers (java) ([#4125](#4125)) ([9184dde](9184dde))
* Get rid of empty string `name_alias` during feature view projection deserialization  ([#4116](#4116)) ([65056ce](65056ce))
* Helm chart `feast-feature-server`, improve Service template name ([#4161](#4161)) ([dedc164](dedc164))
* Improve the code related to on-demand-featureview. ([#4203](#4203)) ([d91d7e0](d91d7e0))
* Integration tests for async sdk method ([#4201](#4201)) ([08c44ae](08c44ae))
* Make sure schema is used when calling `get_table_query_string` method for Snowflake datasource ([#4131](#4131)) ([c1579c7](c1579c7))
* Make sure schema is used when generating `from_expression` for Snowflake ([#4177](#4177)) ([5051da7](5051da7))
* Pass native input values to `get_online_features` from feature server ([#4117](#4117)) ([60756cb](60756cb))
* Pass region to S3 client only if set (Java) ([#4151](#4151)) ([b8087f7](b8087f7))
* Pgvector patch ([#4108](#4108)) ([ad45bb4](ad45bb4))
* Update doc ([#4153](#4153)) ([e873636](e873636))
* Update master-only benchmark bucket name due to credential update ([#4183](#4183)) ([e88f1e3](e88f1e3))
* Updating the instructions for quickstart guide. ([#4120](#4120)) ([0c30e96](0c30e96))
* Upgrading the test container so that local tests works with updated d… ([#4155](#4155)) ([93ddb11](93ddb11))

### Features

* Add a Kubernetes Operator for the Feast Feature Server ([#4145](#4145)) ([4a696dc](4a696dc))
* Add delta format to `FileSource`, add support for it in ibis/duckdb ([#4123](#4123)) ([2b6f1d0](2b6f1d0))
* Add materialization support to ibis/duckdb ([#4173](#4173)) ([369ca98](369ca98))
* Add optional private key params to Snowflake config ([#4205](#4205)) ([20f5419](20f5419))
* Add s3 remote storage export for duckdb ([#4195](#4195)) ([6a04c48](6a04c48))
* Adding DatastoreOnlineStore 'database' argument. ([#4180](#4180)) ([e739745](e739745))
* Adding get_online_features_async to feature store sdk ([#4172](#4172)) ([311efc5](311efc5))
* Adding support for dictionary writes to online store  ([#4156](#4156)) ([abfac01](abfac01))
* Elasticsearch vector database ([#4188](#4188)) ([bf99640](bf99640))
* Enable other distance metrics for Vector DB and Update docs ([#4170](#4170)) ([ba9f4ef](ba9f4ef))
* Feast/IKV datetime edgecase errors ([#4211](#4211)) ([bdae562](bdae562))
* Feast/IKV documenation language changes ([#4149](#4149)) ([690a621](690a621))
* Feast/IKV online store contrib plugin integration ([#4068](#4068)) ([f2b4eb9](f2b4eb9))
* Feast/IKV online store documentation ([#4146](#4146)) ([73601e4](73601e4))
* Feast/IKV upgrade client version ([#4200](#4200)) ([0e42150](0e42150))
* Incorporate substrait ODFVs into ibis-based offline store queries ([#4102](#4102)) ([c3a102f](c3a102f))
* Isolate input-dependent calculations in `get_online_features` ([#4041](#4041)) ([2a6edea](2a6edea))
* Make arrow primary interchange for online ODFV execution ([#4143](#4143)) ([3fdb716](3fdb716))
* Move data source validation entrypoint to offline store ([#4197](#4197)) ([a17725d](a17725d))
* Upgrading python version to 3.11, adding support for 3.11 as well. ([#4159](#4159)) ([4b1634f](4b1634f)), closes [#4152](#4152) [#4114](#4114)

### Reverts

* Reverts "fix: Using version args to install the correct feast version" ([#4112](#4112)) ([b66baa4](b66baa4)), closes [#3953](#3953)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Abstract ibis-based Offline Store
2 participants