feat: Add column reordering to write_to_offline_store #2876

Merged: 11 commits into feast-dev:master on Jun 30, 2022

felixwang9817 (Collaborator) commented Jun 28, 2022

What this PR does / why we need it: In addition to adding column reordering logic, this PR adds logic for extracting the latest feature values into the SparkKafkaProcessor.
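
To make the two pieces of logic named above concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the helper names `reorder_columns` and `latest_rows`, the dict-based table and row shapes, and the column names are all assumptions chosen so the example runs without Feast, Spark, or pandas.

```python
from datetime import datetime
from typing import Dict, List

# Hypothetical stand-ins: a "table" maps column name -> list of values,
# and a "row" maps column name -> value.


def reorder_columns(table: Dict[str, list], expected: List[str]) -> Dict[str, list]:
    """Return the table with its columns in the expected order."""
    if set(table) != set(expected):
        raise ValueError(
            f"Columns {sorted(table)} do not match expected columns {sorted(expected)}."
        )
    # Dicts preserve insertion order, so rebuilding in `expected` order reorders columns.
    return {name: table[name] for name in expected}


def latest_rows(rows: List[dict], join_keys: List[str], timestamp_field: str) -> List[dict]:
    """Keep only the most recent row for each combination of join-key values."""
    latest: Dict[tuple, dict] = {}
    for row in rows:
        key = tuple(row[k] for k in join_keys)
        if key not in latest or row[timestamp_field] > latest[key][timestamp_field]:
            latest[key] = row
    return list(latest.values())


# Illustrative data only; these column names are not taken from the PR.
table = {"conv_rate": [0.5], "driver_id": [1001], "event_timestamp": [datetime(2022, 6, 28)]}
reordered = reorder_columns(table, ["driver_id", "event_timestamp", "conv_rate"])

rows = [
    {"driver_id": 1001, "event_timestamp": datetime(2022, 6, 27), "conv_rate": 0.1},
    {"driver_id": 1001, "event_timestamp": datetime(2022, 6, 28), "conv_rate": 0.5},
    {"driver_id": 1002, "event_timestamp": datetime(2022, 6, 28), "conv_rate": 0.9},
]
newest = latest_rows(rows, ["driver_id"], "event_timestamp")
```

The sketch rejects input whose column set differs from the expected one instead of silently dropping or padding columns; whether the merged implementation behaves the same way is not shown on this page.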

Which issue(s) this PR fixes:

Fixes #

codecov-commenter commented Jun 28, 2022

Codecov Report

Merging #2876 (e54ea6e) into master (86e9efd) will decrease coverage by 0.08%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2876      +/-   ##
==========================================
- Coverage   80.68%   80.59%   -0.09%     
==========================================
  Files         176      176              
  Lines       15670    15663       -7     
==========================================
- Hits        12643    12624      -19     
- Misses       3027     3039      +12     
| Flag | Coverage Δ |
|---|---|
| integrationtests | 70.75% `<100.00%>` (-0.17%) ⬇️ |
| unittests | 59.34% `<8.69%>` (-0.01%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| sdk/python/feast/infra/offline_stores/bigquery.py | 87.69% `<ø>` (ø) |
| sdk/python/feast/infra/offline_stores/file.py | 93.99% `<ø>` (-0.43%) ⬇️ |
| sdk/python/feast/infra/offline_stores/redshift.py | 91.58% `<ø>` (-0.50%) ⬇️ |
| sdk/python/feast/infra/offline_stores/snowflake.py | 90.52% `<ø>` (ø) |
| sdk/python/feast/feature_store.py | 87.12% `<100.00%>` (+0.10%) ⬆️ |
| ...ests/integration/e2e/test_python_feature_server.py | 100.00% `<100.00%>` (ø) |
| ...ts/integration/offline_store/test_offline_write.py | 100.00% `<100.00%>` (ø) |
| ...ation/offline_store/test_push_offline_retrieval.py | 100.00% `<100.00%>` (ø) |
| ...gration/online_store/test_push_online_retrieval.py | 100.00% `<100.00%>` (ø) |
| sdk/python/tests/utils/online_read_write_test.py | 93.54% `<0.00%>` (-6.46%) ⬇️ |

... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


 class SparkProcessorConfig(ProcessorConfig):
     spark_session: SparkSession
-    processing_time: str
-    query_timeout: int
+    processing_time: str = "30 seconds"
Collaborator

I think we shouldn't set a default here, since we have no way of knowing what the correct window should be. We should force the user to set the processing window.

Collaborator Author

done

-    processing_time: str
-    query_timeout: int
+    processing_time: str = "30 seconds"
+    query_timeout: int = 15
Collaborator

same here

Collaborator Author

done
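
The effect of dropping the defaults, as requested in the two threads above, can be sketched with a standard-library dataclass. This is an illustration only: `ExampleProcessorConfig` is a hypothetical stand-in, the `spark_session` field is left out so the snippet runs without Spark, and the real `SparkProcessorConfig` may rely on a different validation mechanism.

```python
from dataclasses import dataclass


@dataclass
class ExampleProcessorConfig:
    # Hypothetical stand-in for SparkProcessorConfig (not the Feast class).
    processing_time: str  # no default, so the caller must choose a window
    query_timeout: int  # no default either


# Callers now have to state both values explicitly.
config = ExampleProcessorConfig(processing_time="30 seconds", query_timeout=15)

# Omitting the required fields fails immediately at construction time.
error = None
try:
    ExampleProcessorConfig()
except TypeError as exc:
    error = str(exc)
```

Failing at construction time surfaces a missing window or timeout before any stream is started, which is the outcome the reviewer asked for.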

from feast.infra.contrib.stream_processor import (
    ProcessorConfig,
    StreamProcessor,
    StreamTable,
)
from feast.stream_feature_view import StreamFeatureView

if TYPE_CHECKING:
Collaborator

Per offline conversation, this is dangerous. If we ever want to move the functionality into a supported passthrough function in feature store, this is a circular dependency.

Collaborator Author

I realized it actually isn't circular lol, updating
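
For readers unfamiliar with the guard being discussed, the following sketch shows what an `if TYPE_CHECKING:` import does in general. It uses `decimal.Decimal` purely as a stand-in for a class whose module might import the current one back; it does not reproduce the import that was in this diff.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers; this branch never runs at runtime,
    # so it cannot take part in a runtime import cycle.
    from decimal import Decimal  # stand-in for a potentially circular import


def describe(value: "Decimal") -> str:
    # The annotation is a string, so Decimal does not need to exist at runtime.
    return f"value={value}"


result = describe(3)
```

The trade-off is that the guarded name is unavailable at runtime, so it can appear only in annotations. Once the import turned out not to be circular, a plain import was the simpler option, which matches the author's "updating" reply.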

)
source_columns = [column for column, _ in column_names_and_types]
source_columns = [
    column for column in source_columns if not re.match("__|__$", column)
Collaborator

If there are columns with underscores, what is the behavior here? Does it just fail automatically? I'm confused about why we need this check; are we not writing to these internal columns?

Collaborator Author

yeah this isn't necessary; good catch
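
As a small illustration of the reviewer's question, this sketch applies the quoted pattern to a few made-up column names (the names are assumptions, not columns from the PR). Because `re.match` only matches at the start of a string, the filter drops names that begin with a double underscore and keeps names that merely end with one.

```python
import re

PATTERN = "__|__$"  # the pattern quoted in the diff above

# Hypothetical column names, chosen to exercise each case.
columns = ["driver_id", "__log_timestamp", "conv_rate__", "__", "event_timestamp"]

# re.match anchors at the start of the string, so the first alternative ("__")
# already covers every name that begins with two underscores.
kept = [column for column in columns if not re.match(PATTERN, column)]
```

Here `kept` holds `driver_id`, `conv_rate__`, and `event_timestamp`. Matching a trailing double underscore as well would need `re.search` with an explicit `^` on the first alternative.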

kevjumba (Collaborator) left a comment

/lgtm

kevjumba (Collaborator) left a comment

/lgtm

@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: felixwang9817, kevjumba

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [felixwang9817,kevjumba]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot feast-ci-bot merged commit 8abc2ef into feast-dev:master Jun 30, 2022
felixwang9817 added a commit to felixwang9817/feast that referenced this pull request Jul 1, 2022
* Add feature extraction logic to batch writer
* Enable StreamProcessor to write to both online and offline stores
* Fix incorrect columns error message
* Reorder columns in _write_to_offline_store
* Make _write_to_offline_store a public method
* Import FeatureStore correctly
* Remove defaults for `processing_time` and `query_timeout`
* Clean up `test_offline_write.py`
* Do not do any custom logic for double underscore columns
* Lint
* Switch entity values for all tests using push sources to not affect other tests

Signed-off-by: Felix Wang <[email protected]>
felixwang9817 pushed a commit that referenced this pull request Aug 2, 2022
# [0.23.0](v0.22.0...v0.23.0) (2022-08-02)

### Bug Fixes

* Add dummy alias to pull_all_from_table_or_query ([#2956](#2956)) ([5e45228](5e45228))
* Bump version of Guava to mitigate cve ([#2896](#2896)) ([51df8be](51df8be))
* Change numpy version on setup.py and upgrade it to resolve dependabot warning ([#2887](#2887)) ([80ea7a9](80ea7a9))
* Change the feature store plan method to public modifier ([#2904](#2904)) ([0ec7d1a](0ec7d1a))
* Deprecate 3.7 wheels and fix verification workflow ([#2934](#2934)) ([040c910](040c910))
* Do not allow same column to be reused in data sources ([#2965](#2965)) ([661c053](661c053))
* Fix build wheels workflow to install apache-arrow correctly ([#2932](#2932)) ([bdeb4ae](bdeb4ae))
* Fix file offline store logic for feature views without ttl ([#2971](#2971)) ([26f6b69](26f6b69))
* Fix grpc and update protobuf ([#2894](#2894)) ([86e9efd](86e9efd))
* Fix night ci syntax error and update readme ([#2935](#2935)) ([b917540](b917540))
* Fix nightly ci again ([#2939](#2939)) ([1603c9e](1603c9e))
* Fix the go build and use CgoArrowAllocator to prevent incorrect garbage collection ([#2919](#2919)) ([130746e](130746e))
* Fix typo in CONTRIBUTING.md ([#2955](#2955)) ([8534f69](8534f69))
* Fixing broken links to feast documentation on java readme and contribution ([#2892](#2892)) ([d044588](d044588))
* Fixing Spark min / max entity df event timestamps range return order ([#2735](#2735)) ([ac55ce2](ac55ce2))
* Move gcp back to 1.47.0 since grpcio-tools 1.48.0 got yanked from pypi ([#2990](#2990)) ([fc447eb](fc447eb))
* Refactor testing and sort out unit and integration tests ([#2975](#2975)) ([2680f7b](2680f7b))
* Remove hard-coded integration test setup for AWS & GCP ([#2970](#2970)) ([e4507ac](e4507ac))
* Resolve small typo in README file ([#2930](#2930)) ([16ae902](16ae902))
* Revert "feat: Add snowflake online store ([#2902](#2902))" ([#2909](#2909)) ([38fd001](38fd001))
* Snowflake_online_read fix ([#2988](#2988)) ([651ce34](651ce34))
* Spark source support table with pattern "db.table" ([#2606](#2606)) ([3ce5139](3ce5139)), closes [#2605](#2605)
* Switch mysql log string to use regex ([#2976](#2976)) ([5edf4b0](5edf4b0))
* Update gopy to point to fork to resolve github annotation errors. ([#2940](#2940)) ([ba2dcf1](ba2dcf1))
* Version entity serialization mechanism and fix issue with int64 vals ([#2944](#2944)) ([d0d27a3](d0d27a3))

### Features

* Add an experimental lambda-based materialization engine ([#2923](#2923)) ([6f79069](6f79069))
* Add column reordering to `write_to_offline_store` ([#2876](#2876)) ([8abc2ef](8abc2ef))
* Add custom JSON table tab w/ formatting ([#2851](#2851)) ([0159f38](0159f38))
* Add CustomSourceOptions to SavedDatasetStorage ([#2958](#2958)) ([23c09c8](23c09c8))
* Add Go option to `feast serve` command ([#2966](#2966)) ([a36a695](a36a695))
* Add interfaces for batch materialization engine ([#2901](#2901)) ([38b28ca](38b28ca))
* Add pages for individual Features to the Feast UI ([#2850](#2850)) ([9b97fca](9b97fca))
* Add snowflake online store ([#2902](#2902)) ([f758f9e](f758f9e)), closes [#2903](#2903)
* Add Snowflake online store (again) ([#2922](#2922)) ([2ef71fc](2ef71fc)), closes [#2903](#2903)
* Add to_remote_storage method to RetrievalJob ([#2916](#2916)) ([109ee9c](109ee9c))
* Support retrieval from multiple feature views with different join keys ([#2835](#2835)) ([056cfa1](056cfa1))