Feature retrieval latency scales with number of feature views, doesn't appear to cache #3442

Closed
ecotner opened this issue Jan 11, 2023 · 3 comments
Labels
kind/bug, priority/p2, wontfix (This will not be worked on)

Comments

ecotner commented Jan 11, 2023

I'd like to preface this by saying that I'm new to feast. I'm also more of a downstream user: feast is a subsystem of a larger ecosystem at my job, and I don't interact with the library directly, only with an API that sits between it and me. So hopefully my understanding of how it works is correct, but if not, I would love to understand more about how this system works!

Expected Behavior

So my understanding is that "projects" and "feature views" are roughly analogous to schemas and tables in a database. The project acts as a namespace for feature views, which in turn act as a second-tier namespace for features, in the same way that a schema acts as a namespace for tables, which act as a second tier for columns. In this mental model, I expect that when I use FeatureStore.get_online_features, it will do some kind of fast lookup keyed on the combination of (project, feature view) (maybe a hash table?) to figure out where the features are located, and then retrieve the data. If that map is not available locally, it can be fetched from the remote registry and then cached for future repeated use. Once cached, I would expect the overhead of figuring out which feature view or data location to use to drop to near zero, while the process of retrieving the data itself would be unchanged (i.e. scale with the size of the data plus normal network latency).
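For concreteness, the call pattern I have in mind looks roughly like this (a minimal sketch; the repo path, feature view, feature names, and entity key below are made up). The project is fixed by the repository's feature_store.yaml, and each feature reference names a feature view plus a feature inside it:

    from feast import FeatureStore

    # Placeholder repo path; the project comes from feature_store.yaml in that repo.
    store = FeatureStore(repo_path=".")

    # Placeholder feature references ("<feature view>:<feature>") and entity key.
    online_features = store.get_online_features(
        features=[
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:avg_daily_trips",
        ],
        entity_rows=[{"driver_id": 1001}],
    ).to_dict()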

Current Behavior

What I observe is that when querying for data using FeatureStore.get_online_features from a particular project I'm working with, it appears that searching for the appropriate feature view takes a significant amount of time when first run (as expected if the registry has not been cached yet; our registry is stored in a GCS bucket, for reference). I measure this by profiling my code (using line_profiler) and looking at time spent in the function FeatureStore._get_feature_views_to_use. Then there is some time for the data itself to be received, which I measure by looking at time spent in FeatureStore._read_from_online_store (in our case, online feature data is stored on a redis instance). On successive calls to get_online_features, the time spent querying the registry for the feature view metadata remains unchanged, counter to what I would expect (less time spent) if this data had been cached. Furthermore, successive calls result in significantly less time spent retrieving the actual feature data from redis (which I would expect to be unchanged).
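For anyone who wants to reproduce the measurement, the profiling setup was roughly along these lines (a sketch; the feature references and entity rows are placeholders, and the two methods being profiled are private APIs that may differ between feast versions):

    from line_profiler import LineProfiler
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")  # placeholder repo path

    profiler = LineProfiler()
    # Register the internal methods whose runtime we want broken down line by line.
    profiler.add_function(FeatureStore._get_feature_views_to_use)
    profiler.add_function(FeatureStore._read_from_online_store)

    # Wrap the public entry point so the profiler is active for the whole call.
    profiled_get_online_features = profiler(store.get_online_features)
    profiled_get_online_features(
        features=["some_feature_view:some_feature"],  # placeholder feature reference
        entity_rows=[{"entity_id": 1}],               # placeholder entity key
    )
    profiler.print_stats()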

In the process of diagnosing this, I tried querying feature data from a different project to see if it also had the same problems. The main difference is that this second project is significantly "smaller" than the first. The first project has maybe 20 feature views, which have anywhere between 2 and 32 features each. The second project only has two feature views with two features each. What I noticed was that when switching to project two, the time spent retrieving the registry dropped significantly (to like 10% of the time it takes to do the same for project one), although it still did not appear to be caching correctly as the time did not change on successive calls. Time retrieving the actual data was similar in both projects. See the table below for a summary of what we tried/observed:

project | # features in view | ran before profiling | _get_feature_views_to_use | _read_from_online_store
1       | 32                 | no                   | 101 ms                    | 312 ms
1       | 32                 | yes                  | 99 ms                     | 117 ms
1       | 2                  | no                   | 98 ms                     | 318 ms
1       | 2                  | yes                  | 96 ms                     | 101 ms
2       | 2                  | no                   | 6.4 ms                    | 303 ms
2       | 2                  | yes                  | 7.0 ms                    | 103 ms

In addition, we found that within _get_feature_views_to_use, all the time was being taken up by the following loop:

        for fv in self._registry.list_feature_views(
            self.project, allow_cache=allow_cache
        ):
            if hide_dummy_entity and fv.entities[0] == DUMMY_ENTITY_NAME:
                fv.entities = []
                fv.entity_columns = []
            feature_views.append(fv)

In particular, it looks like the body of the loop had negligible runtime, but executing Registry.list_feature_views for project 1 returned an iterator of 18 items (equal to the number of feature views in that project) over 102 ms, which works out to about 5.6 ms per iteration, roughly on par with the total time to retrieve the registry for project 2. I initially suspected that this function might be calling out to GCS sequentially, retrieving the registry data one feature view at a time rather than getting it all in one go, but I was able to trace the time all the way down to calls to FeatureView.from_proto, which may suggest that deserializing the protobuf is the bottleneck. My profiler wasn't able to go any deeper for some reason, and I feel like I'm already in the weeds here anyway, so I'll leave it at that.
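To time the registry call in isolation (outside of get_online_features), something like the following can be used; it's a rough sketch that assumes the public FeatureStore.registry property and the list_feature_views signature shown in the loop above:

    import time

    from feast import FeatureStore

    store = FeatureStore(repo_path=".")  # placeholder repo path

    # Time the same registry listing that _get_feature_views_to_use performs,
    # with and without the cache allowed.
    for allow_cache in (False, True):
        start = time.perf_counter()
        feature_views = list(
            store.registry.list_feature_views(store.project, allow_cache=allow_cache)
        )
        elapsed_ms = (time.perf_counter() - start) * 1000
        per_view_ms = elapsed_ms / max(len(feature_views), 1)
        print(
            f"allow_cache={allow_cache}: {len(feature_views)} feature views "
            f"in {elapsed_ms:.1f} ms (~{per_view_ms:.1f} ms per view)"
        )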

Steps to reproduce

  • Have a registry in a GCS bucket
  • Have a redis online store
  • Create a project with a bunch of feature views and features
  • Create another project with very few feature views and features
  • Query the data using FeatureStore.get_online_features and profile the functions FeatureStore._get_feature_views_to_use and FeatureStore._read_from_online_store

Specifications

  • Version: python==3.9.16, feast==0.28.0
  • Platform: mac, linux
  • Subsystem: ?

Possible Solution

I think the best solution is to figure out why the registry isn't being cached and fix that. All of our TTL parameters are already set to very long periods of time, though, so I'm not sure why caching isn't happening.
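For what it's worth, the registry cache TTL in feast is configured via cache_ttl_seconds in the registry section of feature_store.yaml; a quick sanity check of what the running process actually picked up might look like this (a sketch, assuming the loaded config exposes a RegistryConfig object rather than a bare path string):

    from feast import FeatureStore
    from feast.repo_config import RegistryConfig

    store = FeatureStore(repo_path=".")  # placeholder repo path

    registry_config = store.config.registry
    if isinstance(registry_config, RegistryConfig):
        print("registry path:", registry_config.path)
        print("cache TTL (seconds):", registry_config.cache_ttl_seconds)
    else:
        # Older-style configs may hold only the registry path as a string,
        # in which case feast falls back to its default TTL.
        print("registry path:", registry_config)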

Given the above observations, a potential workaround might be to limit the number of feature views in each of our projects. This is less than ideal though since we have leaned heavily into using the project and feature view names as a hierarchical namespace (different groups/teams have separate projects, different services and ML models within each team have separate feature views).

jerive (Contributor) commented Feb 9, 2023

Seems related to #3090

stale bot commented Jun 10, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix (This will not be worked on) label on Jun 10, 2023
phil-park (Contributor) commented:

@ecotner
I completely agree with you.
I also experienced a severe slowdown as the number of feature views increased.
As a stopgap, we implemented a properly functioning cache with minimal code changes; check out this PR (#3702).

This is only a temporary fix, though, not the right long-term solution. I think feast should stop serializing these objects as protocol buffers.
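For what it's worth, the proto round-trip cost is straightforward to measure directly. Here is a rough sketch (it assumes the FeatureView.to_proto/from_proto methods and the FeatureStore.registry property on a recent feast version):

    import time

    from feast import FeatureStore, FeatureView

    store = FeatureStore(repo_path=".")  # placeholder repo path
    feature_views = list(
        store.registry.list_feature_views(store.project, allow_cache=True)
    )

    # Round-trip each feature view through its protobuf representation to estimate
    # how much of the per-view latency is pure (de)serialization.
    start = time.perf_counter()
    for fv in feature_views:
        FeatureView.from_proto(fv.to_proto())
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"proto round-trip for {len(feature_views)} feature views: {elapsed_ms:.1f} ms")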

stale bot closed this as completed on Mar 17, 2024