feat: Implement spark materialization engine #3184
Codecov Report. Base: 67.02% // Head: 58.28% // Decreases project coverage by -8.74%.

```
@@            Coverage Diff             @@
##           master    #3184      +/-   ##
==========================================
- Coverage   67.02%   58.28%    -8.74%
==========================================
  Files         175      210       +35
  Lines       15942    17689     +1747
==========================================
- Hits        10685    10310      -375
- Misses       5257     7379     +2122
```
sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py
```python
spark_df = offline_job.to_spark_df()
if self.repo_config.batch_engine.partitions != 0:
    spark_df = spark_df.repartition(
        self.repo_config.batch_engine.partitions
    )

spark_df.foreachPartition(
    lambda x: _process_by_partition(x, spark_serialized_artifacts)
)
```
💯
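The snippet above hands each partition to `_process_by_partition`. For context, here is a minimal sketch of what such a handler could look like; the pandas/Arrow conversion and the `_convert_arrow_to_proto` helper are assumptions for illustration, not necessarily this PR's exact code:

```python
import pandas as pd
import pyarrow as pa


def _process_by_partition(rows, spark_serialized_artifacts):
    """Write one Spark partition to the online store (illustrative sketch only)."""
    # Materialize the partition iterator of pyspark Rows; skip empty partitions.
    dicts = [row.asDict() for row in rows]
    if not dicts:
        return
    table = pa.Table.from_pandas(pd.DataFrame(dicts))

    # Rehydrate the driver-side objects that were serialized into the closure.
    feature_view, online_store, repo_config = spark_serialized_artifacts.unserialize()

    # Convert Arrow rows to protos and write them in one batch.
    rows_to_write = _convert_arrow_to_proto(table, feature_view)  # hypothetical helper
    online_store.online_write_batch(
        repo_config, feature_view, rows_to_write, progress=None
    )
```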
@niklasvm Can you fix the integration tests? We are also waiting on this PR and would like to use it. Did you get a chance to test it on a cluster? Feast would have to be provided to the worker nodes, since they deserialize the config. We can add an example of how to do it.
@ckarwicki I didn't realise the integration test failed. It looks like the issue is related to the […]. I have not tested this on a cluster, only in Spark local mode. What type of cluster are you using?
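One generic way to provide Feast to worker nodes (not something this PR prescribes) is Spark's standard Python dependency management: pack the driver's virtualenv with `venv-pack` and ship it via `spark.archives` (Spark 3.1+). The archive name and paths below are placeholders:

```python
# Pack the driver's virtualenv once, e.g.:  venv-pack -o feast_env.tar.gz
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Ship the packed env to every executor; Spark unpacks it as ./environment
    .config("spark.archives", "feast_env.tar.gz#environment")
    # Point executor Python at the unpacked env so `import feast` resolves there
    .config("spark.pyspark.python", "./environment/bin/python")
    .getOrCreate()
)
```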
```python
# unserialize artifacts
feature_view, online_store, repo_config = spark_serialized_artifacts.unserialize()

if feature_view.batch_source.field_mapping is not None:
```
Since lines 249 to 257 are also used in `feature_store.write_to_online_store`, maybe it makes sense to refactor this into a util method?
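A minimal sketch of such a util method, assuming the shared logic is applying a batch source's field mapping as a column rename over an Arrow table; the name `apply_field_mapping` is illustrative:

```python
from typing import Dict

import pyarrow as pa


def apply_field_mapping(table: pa.Table, field_mapping: Dict[str, str]) -> pa.Table:
    """Rename Arrow columns according to a batch source's field mapping."""
    return table.rename_columns(
        [field_mapping.get(name, name) for name in table.column_names]
    )
```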
```python
)


class SparkMaterializationEngineConfig(FeastConfigBaseModel):
```
Probably makes sense to throw an error somewhere if the offline store is not the `SparkOfflineStore`?
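A minimal sketch of where such a check could live, assuming the engine's constructor receives the offline store; the class body here is heavily trimmed:

```python
from feast.infra.offline_stores.contrib.spark_offline_store.spark import (
    SparkOfflineStore,
)


class SparkMaterializationEngine:  # collaborators trimmed for brevity
    def __init__(self, *, offline_store, **kwargs):
        # Fail fast: this engine pulls data via to_spark_df(), which only the
        # SparkOfflineStore provides.
        if not isinstance(offline_store, SparkOfflineStore):
            raise ValueError(
                "SparkMaterializationEngine is only compatible with the SparkOfflineStore"
            )
        self.offline_store = offline_store
```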
I'm also looking forward to using this!
@adchia What is left before this can be merged? I see there is one test failing; however, the failures are unrelated to this PR.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: adchia, niklasvm. The full list of commands accepted by this bot can be found here. The pull request process is described here.
# [0.25.0](v0.24.0...v0.25.0) (2022-09-20)

### Bug Fixes

* Broken Feature Service Link ([#3227](#3227)) ([e117082](e117082))
* Feature-server image is missing mysql dependency for mysql registry ([#3223](#3223)) ([ae37b20](ae37b20))
* Fix handling of TTL in Go server ([#3232](#3232)) ([f020630](f020630))
* Fix materialization when running on Spark cluster. ([#3166](#3166)) ([175fd25](175fd25))
* Fix push API to respect feature view's already inferred entity types ([#3172](#3172)) ([7c50ab5](7c50ab5))
* Fix release workflow ([#3144](#3144)) ([20a9dd9](20a9dd9))
* Fix Shopify timestamp bug and add warnings to help with debugging entity registration ([#3191](#3191)) ([de75971](de75971))
* Handle complex Spark data types in SparkSource ([#3154](#3154)) ([5ddb83b](5ddb83b))
* Local staging location provision ([#3195](#3195)) ([cdf0faf](cdf0faf))
* Remove bad snowflake offline store method ([#3204](#3204)) ([dfdd0ca](dfdd0ca))
* Remove opening file object when validating S3 parquet source ([#3217](#3217)) ([a906018](a906018))
* Snowflake config file search error ([#3193](#3193)) ([189afb9](189afb9))
* Update Snowflake Online docs ([#3206](#3206)) ([7bc1dff](7bc1dff))

### Features

* Add `to_remote_storage` functionality to `SparkOfflineStore` ([#3175](#3175)) ([2107ce2](2107ce2))
* Add ability to give boto extra args for registry config ([#3219](#3219)) ([fbc6a2c](fbc6a2c))
* Add health endpoint to py server ([#3202](#3202)) ([43222f2](43222f2))
* Add snowflake support for date & number with scale ([#3148](#3148)) ([50e8755](50e8755))
* Add tag kwarg to set Snowflake online store table path ([#3176](#3176)) ([39aeea3](39aeea3))
* Add workgroup to athena offline store config ([#3139](#3139)) ([a752211](a752211))
* Implement spark materialization engine ([#3184](#3184)) ([a59c33a](a59c33a))
What this PR does / why we need it:

Implements `SparkMaterializationEngine`, which parallelizes writing to the online store across Spark executors. This introduces a `spark` batch engine type.

How

`foreachPartition` is called on the Spark data frame, and each partition of data is processed on the worker nodes.

Usage

The `SparkMaterializationEngine` is intended to work only with the `SparkOfflineStore` and an online store that supports parallel writes (not sqlite).

e.g. `feature_store.yaml`:
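A minimal sketch of such a configuration; the Redis online store, the `spark.engine` type string, and the local registry path are illustrative assumptions:

```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
    type: spark
    spark_conf:
        spark.master: "local[*]"
online_store:
    type: redis
    connection_string: "localhost:6379"
batch_engine:
    type: spark.engine
    partitions: 10
```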
Some considerations

* […] `spark.offline` and `spark.engine`?
* Unit and integration tests are running successfully; however, this process should be tested on a larger set of data to ensure parallelization is working appropriately.
Which issue(s) this PR fixes:
Fixes #3167