Coalesce FeatureRows for improved "latest" value consistency in serving stores #88
Comments
/assign tims
Should we have this enabled or disabled by default?
I don't think you should coalesce rows for the warehouse store. The warehouse store should have the most complete dataset possible, for all timestamps, regardless of whether the feature rows were late or not.
I think it should be ON by default, as it impacts the correctness of ingestion results.
Pushed a change to the PR so it defaults to true. Also, @zhilingc, the PR no longer coalesces rows written to the warehouse; they get written as-is.
Closed with merge of Coalesce rows PR #89 |
Is your feature request related to a problem? Please describe.
Now that we plan to store features in the serving store as latest-only instead of the full history, we need to add logic to ensure that writes of feature values from older rows don't clobber newer ones. This is especially a problem for batch, but we'd like to solve it generally for streaming as well.
Describe the solution you'd like
For batch streams, we will combine feature rows by key, where the key is entity name + entity key.
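A minimal Apache Beam sketch of that batch path, assuming a `FeatureRow` type with `getEntityName()`, `getEntityKey()`, and `getEventTimestampMillis()` accessors (these names are illustrative stand-ins, not Feast's actual classes):

```java
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

class CoalesceFeatureRows {
  // Reduces two rows for the same entity key to the one with the newer
  // event timestamp, so late rows can't clobber fresher values.
  static class LatestRowFn extends Combine.BinaryCombineFn<FeatureRow> {
    @Override
    public FeatureRow apply(FeatureRow left, FeatureRow right) {
      return left.getEventTimestampMillis() >= right.getEventTimestampMillis()
          ? left : right;
    }
  }

  static PCollection<FeatureRow> coalesce(PCollection<FeatureRow> rows) {
    return rows
        // Key each row by entityName + entityKey, the grouping described above.
        .apply(WithKeys.of((FeatureRow r) -> r.getEntityName() + "/" + r.getEntityKey())
            .withKeyType(TypeDescriptors.strings()))
        // Collapse each key's rows down to the single latest row.
        .apply(Combine.perKey(new LatestRowFn()))
        .apply(Values.create());
  }
}
```

Note that a timestamp tie arbitrarily keeps the left-hand row here; real code would want a deterministic tie-breaker, since combiner argument ordering is not guaranteed.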
For streaming, we will do the same, but emit rows on a timer, keeping the previous value in state for future combines. We will evict state with a second timeout timer, which allows very large, infrequently updated key spaces to be less demanding.
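A rough sketch of that streaming path as a Beam stateful DoFn, again with hypothetical `FeatureRow` accessors and made-up flush/eviction delays:

```java
import org.apache.beam.sdk.coders.BooleanCoder;
import org.apache.beam.sdk.coders.SerializableCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Keeps the latest FeatureRow per key in state, emits it on a short flush
// timer, and clears state after an idle timeout. The delays are illustrative;
// FeatureRow is the same stand-in as in the batch sketch.
class StreamingCoalesceFn extends DoFn<KV<String, FeatureRow>, FeatureRow> {
  private static final Duration FLUSH_DELAY = Duration.standardSeconds(10);
  private static final Duration EVICT_AFTER = Duration.standardHours(1);

  @StateId("latest")
  private final StateSpec<ValueState<FeatureRow>> latestSpec =
      StateSpecs.value(SerializableCoder.of(FeatureRow.class));

  @StateId("flushArmed")
  private final StateSpec<ValueState<Boolean>> flushArmedSpec =
      StateSpecs.value(BooleanCoder.of());

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @TimerId("evict")
  private final TimerSpec evictSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      ProcessContext ctx,
      @StateId("latest") ValueState<FeatureRow> latest,
      @StateId("flushArmed") ValueState<Boolean> flushArmed,
      @TimerId("flush") Timer flush,
      @TimerId("evict") Timer evict) {
    FeatureRow incoming = ctx.element().getValue();
    FeatureRow current = latest.read();
    // Combine with the value held in state; older rows are dropped here.
    if (current == null
        || incoming.getEventTimestampMillis() >= current.getEventTimestampMillis()) {
      latest.write(incoming);
      // Arm the flush timer once per burst of updates, rather than resetting
      // it on every element, so a busy key still emits periodically.
      if (!Boolean.TRUE.equals(flushArmed.read())) {
        flush.offset(FLUSH_DELAY).setRelative();
        flushArmed.write(true);
      }
    }
    // Each update pushes eviction further out; quiet keys eventually expire.
    evict.offset(EVICT_AFTER).setRelative();
  }

  @OnTimer("flush")
  public void onFlush(
      OnTimerContext ctx,
      @StateId("latest") ValueState<FeatureRow> latest,
      @StateId("flushArmed") ValueState<Boolean> flushArmed) {
    flushArmed.write(false);
    FeatureRow row = latest.read();
    if (row != null) {
      ctx.output(row);
    }
  }

  @OnTimer("evict")
  public void onEvict(
      @StateId("latest") ValueState<FeatureRow> latest,
      @StateId("flushArmed") ValueState<Boolean> flushArmed) {
    // State for this key is dropped entirely; a later row will rebuild it.
    latest.clear();
    flushArmed.clear();
  }
}
```

This would consume the same keyed input as the batch sketch, applied via `ParDo.of(new StreamingCoalesceFn())`.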
The feature rows should be stored in the warehouse store in the combined form, not as their raw separate rows, because we want the warehouse to reflect the rows as they were written to the serving stores. This ensures that the equivalent of a SELECT * AS OF query would reflect what was available in the serving stores.
We should also make sure that we can turn this feature off as an option in each import spec, since coalescing could introduce performance hits in very high-throughput streams with large backlogs, and it isn't required for batch jobs that are known to have only one row per entity key.
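As a sketch of how that toggle might be wired up, assuming the import spec carries an options map and inventing a `coalesceRows` key purely for illustration:

```java
import java.util.Map;

// Hypothetical toggle: "coalesceRows" is an invented option name, not an
// actual Feast import spec field. Defaults to enabled per the discussion above.
Map<String, String> options = importSpec.getOptionsMap();
boolean coalesceRows = Boolean.parseBoolean(options.getOrDefault("coalesceRows", "true"));

PCollection<FeatureRow> forServing =
    coalesceRows
        ? CoalesceFeatureRows.coalesce(rows)  // batch path from the sketch above
        : rows;                               // pass rows through unchanged
```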
Describe alternatives you've considered
The alternative we've considered is to keep each feature row separately in the serving stores and combine them in the serving API (which is what we started with). This is not as performant as many use cases require, and we'd prefer to optimize for serving speed. We are willing to accept that we cannot guarantee true consistency with the coalescing approach, since data already in the store may conflict with data being ingested by a new job, but we will assume that incoming feature data is eventually consistent and will overwrite any issues.
Additional context
The PR I'm working on for this issue is WIP and will require #87 to be finished first, as we should be grouping on {entityName}+{entityKey}, not on event timestamp + granularity. It would be more work to deal with the latest values in the serving stores separately, so it's best we wait.