-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize _populate_result_rows_from_feature_view
#2223
Conversation
7207717
to
307158b
Compare
307158b
to
6cf4d0f
Compare
Codecov Report
@@ Coverage Diff @@
## master #2223 +/- ##
==========================================
+ Coverage 84.94% 84.96% +0.02%
==========================================
Files 105 105
Lines 8496 8515 +19
==========================================
+ Hits 7217 7235 +18
- Misses 1279 1280 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
6cf4d0f
to
75a6758
Compare
Whats the usecase for having duplicate entities? Why not e.g. dedup before sending? |
Seems like we adding a lot of complexity and really unnecessary optimization for the use case that should be solved on client side. |
These would be example entity rows: The reason for not deduping before the request is that I'm requesting features that have In this case the features for the This change means that only the unique entities for each FeatureView are actually retrieved. For very large numbers of rows (~9k in my test case) this can reduce the time spent interfacing with RedisOnlineStore by almost half from about 600ms down to 300ms because for one of the two FeatureViews I'm retrieving features from only a single Entity is needed rather than 9k. Does that make sense? |
Absolutely disagree. See example in reply above. There are cases where doing that would require multiple calls to Feast which would somewhat defeat the purpose. To me this is an obvious optimization and the current implementation is actually rather naive. I think the issue here is probably one of terminology rather than an actual disagreement, however? The terminology perhaps being the distinction between an I suppose the case I'm handling here is actually duplicate There is still an additional optimization that could be done here. Imagine I'm getting features from 2 |
But is it so bad thing (I mean multiple calls)? It will be converted to multiple calls to the backend in any case. Also, the user knows much better their data model and all additional logic proposed in this PR can be expressed on the user side simpler:
I'm trying to say that we cannot push all optimizations for all possible use cases to the Feast side. The code in IMO, we should move towards simplifying logic on Complain MODE ON Regarding already existing complexity, let me give a few examples:
EntityKeyProto is redundant and not even used as a key anymore, because proto serialization doesn't guarantee stability. And each implementation converts this proto into its own key. So we do a lot of not very useful job by massaging those input from user, packing it into different protos and then throw it away mostly. Complain MODE OFF Thanks for listening :) |
I'd say yes. Feast is supposed to make the serving logic simpler not require the user to know what Feast is going to do internally and optimize around that. It is in no way intuitive that if I provide
The data model is Feast's data model - Feast chooses how to store the data in the OnlineStore. Feast should be able to query its own data model efficiently, surely? The whole point of Feast / a FeatureStore is for the user to be able to query any features in the store for any set of given entities. Feast should be able to do this efficiently. The user should not have to understand the internal workings of the SDK! It is not intuitive that Feast will behave naively to duplicates. And in fact from the user's perspective they are not duplicate - a
You've neglected the
Furthermore, without Pandas this becomes much more cumbersome and it is (in my experience) a bad idea to put Pandas in a latency sensitive user request path.
I don't think this is 'an edge case'. This would be incredibly common for any recommender system.
If you're going to go this far then why not limit the user to only querying one FeatureView at a time? Or better yet they can just directly query the OnlineStore? The user can use and understands Redis or Datastore, right?
There are pros and cons here aren't there? It was my impression that one of the core features of Feast is the ease of implementing new Online and Offline Stores? If this is considered a core feature then the logic of these should be simple - so as to encourage the community to develop them? Therefore, I believe that they should not be expected to deduplicate. Whether or not they fetch by FeatureView or Entity is a more interesting question but isn't where we are right now. I do, however, agree that the limitation of one call per FeatureView is problematic. But I don't think this is an argument against not asking for the same data multiple times when we don't have to.
Okay, then that optimization doesn't need to be done. It can just naturally fall out when/if
I honestly don't think that the logic implemented in this PR is actually very complex:
As long as the OnlineStore always returns the Feature data in the same order it is requested then this is really a very simple algorithm. Do the work once and copy it. It's 9 lines of additional code. |
@judahrand, I see your point that Feast should make lives easier, and ideally, users should not think about internal implementation details. Totally agree. But I would argue that it's two different things: (a) provide comprehensive functionality, so
in a reasonable time (and that time should be expected to grow with the growing complexity of the request) and (b) do some "magic" to run retrieval efficiently for 9K rows with features attached to different entities (with duplicated values in some keys) in the same request. And in the latter case, I'd say - yes it's fine to know some details of implementation and do this optimization in user code (as I proposed above). That being said, I could be wrong and this optimization is indeed critical.
|
It really isn't magic. It's a really simple algorithm. It also doesn't only apply to the case of 9K rows, it is also more efficient with 2 rows.
This is a fair comment, perhaps I need to add some tests which test for the goal of this optimization. This should be completely doable using
Unfortunately, this approach cannot work with the way that I once again draw your attention to This information can currently only be gleaned inside I'm all for making this more modular, tested, etc. I think the modularity will be an issue given how the rest of Another option would be to completely overhaul how
I'm not sure I entirely understand what you're getting at here, do you mind if I reach out on Slack to learn more as I'm not sure this is directly relevant to this PR? |
I got this. That's why there's projection before deduplication. Please pay attention to these lines:
Yes, we would need to group feature_refs, so we will need to have another util function:
but that would be trivial to implement (plus we already inside FeatureStore and have access to the registry)
Yeah, that seems similar to what I proposed.
Yeah, this is a reasonable concern. And it's not alone. The whole
Sure, this is not relevant to this PR. |
Signed-off-by: Judah Rand <[email protected]>
Signed-off-by: Judah Rand <[email protected]>
180dedb
to
0d91f6b
Compare
Signed-off-by: Judah Rand <[email protected]>
This is great, @judahrand . Just one request: can you please add a unit test for |
Signed-off-by: Judah Rand <[email protected]>
Signed-off-by: Judah Rand <[email protected]>
0485b92
to
c81dc27
Compare
@pyalex I've added a simple unittest and just as well too! I hadn't realized that Thanks for all the feedback and working towards a maintainable solution with me. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED Approval requirements bypassed by manually added approval. This pull-request has been approved by: judahrand, pyalex The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@@ -245,9 +245,9 @@ def get_local_server_port(self) -> int: | |||
|
|||
def table_name_from_data_source(ds: DataSource) -> Optional[str]: | |||
if hasattr(ds, "table_ref"): | |||
return ds.table_ref | |||
return ds.table_ref # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick, but why are we adding an ignore statement here? Was this an omission? We've not had to use ignores elsewhere in the code base, so this stood out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was because I believe neither table
not table_ref
are actually attributes of DataSource
- only its subclasses. The other option would have been a complex cast
, I believe. This might be preferable but especially as it was in a test util I decided not to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There area handful of other # type: ignore
comments around the codebase and I've also discovered that the typing around Protobufs is tricky and often seemingly wrong or unhelpful.
Signed-off-by: Judah Rand [email protected]
What this PR does / why we need it:
This commit optimizes the fetching of features by only fetching
the features for each unique Entity once and then expands the result
to the shape of input EntityKeys.
Previously, if an Entity occurred twice the features would be fetched
from the OnlineStore twice. This can be hugely inefficient.
The only assumption that this makes is that the OnlineStore will return
the feature data in the same order as the EntityKeyProtos are provided.
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: