feat: Added Remote offline server and client using arrow flight server #3
Conversation
f"grpc://{config.offline_store.host}:{config.offline_store.port}" | ||
) | ||
# Put API parameters | ||
self._put_parameters(feature_refs, entity_df) |
I'm not 100% sure, but I think this should also go in the `_to_arrow_internal` method. That's the behavior in other offline stores: nothing actually happens until the user calls one of the "action" methods on `RetrievalJob`.
Can you clarify more? What I understand is that you're asking to move `self._put_parameters(feature_refs, entity_df)` into the `_to_arrow_internal` function.
Yes, take the BigQuery engine for example. It's a bit more complicated, but note that the `_upload_entity_df` method (which essentially does the same thing as `self._put_parameters(feature_refs, entity_df)`) is not called during the creation of the `RetrievalJob`; it's actually part of the `query_generator` function and is called from `_to_arrow_internal`/`_to_df_internal`.
Agreed, we can move the parameter exchange a bit later in the flow. Here we only wanted to create the flight at request time, but there's actually no reason for that; it can surely be postponed.
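For illustration, a minimal sketch of what that postponement could look like. The class and helper names here mirror the snippets in this thread but are assumptions, not the PR's actual code:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.flight as fl


class RemoteRetrievalJob:
    def __init__(self, client: fl.FlightClient, feature_refs, entity_df: pd.DataFrame):
        self.client = client
        self.feature_refs = feature_refs
        # Nothing is sent to the server yet; the entity df is kept locally
        self.entity_df = entity_df

    def _put_parameters(self, feature_refs, entity_df):
        descriptor = fl.FlightDescriptor.for_command(b"get_historical_features")
        table = pa.Table.from_pandas(entity_df)
        writer, _ = self.client.do_put(descriptor, table.schema)
        writer.write_table(table)
        writer.close()

    def _to_arrow_internal(self) -> pa.Table:
        # The parameter exchange happens only when an "action" method runs
        self._put_parameters(self.feature_refs, self.entity_df)
        return self.client.do_get(fl.Ticket(b"get_historical_features")).read_all()
```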
writer.write_table(entity_df_table)
writer.close()

features_array = pa.array(feature_refs)
I don't think we should be doing multiple `do_put`s here; `do_put` should be called for the entity dataset only. I would treat features as a parameter and try to pass it along either in the metadata of the previous `do_put` call or, if that proves too hard, I think we will have to encode all the parameters in the `FlightDescriptor` itself (it could contain a uuid and all params, some binary encoding of ours or maybe a proto message). wdyt?
I think yes, we can pass `feature_refs` as a JSON string in the `entity_df_table` metadata, something like this:
features_json = json.dumps(feature_refs)
entity_df_table = pa.Table.from_pandas(entity_df)
writer, _ = self.client.do_put(
    historical_flight_descriptor,
    entity_df_table.schema.with_metadata(
        {
            "command": self.command,
            "api": "get_historical_features",
            "param": "entity_df",
            "features": features_json,
        }
    ),
)
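For completeness, the server side could read those values back from the schema metadata along these lines. This is a sketch with assumed names; note that pyarrow delivers metadata keys and values as bytes:

```python
import json

import pyarrow.flight as fl


class OfflineServer(fl.FlightServerBase):
    def do_put(self, context, descriptor, reader, writer):
        # Metadata attached by the client travels with the schema, as bytes
        metadata = reader.schema.metadata or {}
        api = metadata[b"api"].decode("utf-8")
        feature_refs = json.loads(metadata[b"features"].decode("utf-8"))
        # Read the entity dataset that the client streamed in
        entity_df_table = reader.read_all()
        # ... dispatch to the offline store API based on `api` ...
```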
Sure, that looks better. I still anticipate some problems with this choice, though. For example, there is an option to pass the entity dataset as a query rather than a pandas dataframe (which I'm not sure we'd like to support, frankly). In that case, we probably wouldn't make any `do_put` calls whatsoever, as there's no data to be passed to the server. That's why putting all of this into a flight descriptor seems more appropriate to me. Having said that, we can try it like this first and revisit the choice later if necessary...
We can try to minimize the `do_put` calls, that's a good catch.
For parameters that have a double nature, like `entity_df: Union[pd.DataFrame, str]`, or those that are of primitive type, like `project: str` and `full_feature_names: bool = False`, we can try to encode them in the descriptor as you said, probably together with an identifier of the actual API, like `get_historical_features`.
Before trying the proto message option (and the related overhead), I would try to encode them somehow.
Agreed, proto would be overkill for now (although that's probably where we will end up down the line 😄). Just some dict -> json -> base64 or something like that will do.
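Something along these lines, for example; the parameter names and values here are illustrative assumptions:

```python
import base64
import json

import pyarrow.flight as fl

# Client side: pack all API parameters into the descriptor command
params = {
    "api": "get_historical_features",
    "features": ["driver_stats:conv_rate"],  # hypothetical feature refs
    "full_feature_names": False,
}
command = base64.b64encode(json.dumps(params).encode("utf-8"))
descriptor = fl.FlightDescriptor.for_command(command)

# Server side: decode the command back into the original dict
decoded = json.loads(base64.b64decode(descriptor.command))
```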
sdk/python/feast/offline_server.py
tuple(descriptor.path or tuple()),
)

# TODO: since we cannot anticipate the get_historical_features call here, what data should we return?
This is a good point. I think we will have to extend the `RetrievalJob` interface to expose the schema of the result set in some way.
Is this really relevant for our use case, BTW? Our protocol uses `get_flight_info` only to retrieve the ticket before invoking `do_get`; it's not the regular use case the Arrow Flight protocol was initially designed for: there is no existing data set from which we can extract the schema and metadata until we invoke the API on the store, and our client code is probably not interested in inspecting the descriptor.
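In other words, the client flow is essentially just the following; a minimal sketch, assuming a local server on port 8815:

```python
import pyarrow.flight as fl

client = fl.FlightClient("grpc://localhost:8815")
descriptor = fl.FlightDescriptor.for_command(b"get_historical_features")

# get_flight_info is used only to obtain the ticket...
info = client.get_flight_info(descriptor)
ticket = info.endpoints[0].ticket

# ...which do_get then redeems to stream the result set
table = client.do_get(ticket).read_all()
```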
yup, probably not relevant right now if our client implementation can do without it. It might become more important if we manage to polish the API to the degree that it becomes usable even without a feast client.
P.S. I wouldn't say it's not the regular use case for that particular reason. True, the dataset isn't ready yet, but that can be said of any SQL query to any database, right? Actually, it's probably pretty trivial to extract the schema without running the actual retrieval job: we know the schema of the entity df and also the data types of all the features.
Yep, the schema is resolvable by means of the registry for the given features. We can use -1 for the other fields, which would require running the query:
total_bytes: the size in bytes of the data in this flight, or -1 if unknown.
total_records: the total record count of this flight, or -1 if unknown.
Some of the suggested changes are available in PR #4.
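A possible shape for that `get_flight_info` implementation is sketched below; the schema-resolution helper is a hypothetical registry lookup, not an existing method:

```python
import pyarrow.flight as fl


class OfflineServer(fl.FlightServerBase):
    def get_flight_info(self, context, descriptor):
        # Resolve the result schema from the registry (hypothetical helper)
        schema = self._resolve_schema(descriptor)
        endpoint = fl.FlightEndpoint(fl.Ticket(descriptor.command), [])
        return fl.FlightInfo(
            schema,
            descriptor,
            [endpoint],
            total_records=-1,  # unknown until the query actually runs
            total_bytes=-1,
        )
```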
I'll answer here to keep the discussion contained 😄 If you're encoding the whole thing anyway, wouldn't this be simpler? It would avoid `do_put` metadata entirely.
@dmartinol you also don't need to call ...
I tried the same before, with the goal of removing the put step, but translating the df ... Anyway, I updated PR #4 with the latest proposals.
sdk/python/feast/offline_server.py
self.store = store

@classmethod
def descriptor_to_key(self, descriptor):
Do we really need this to be a class method rather than an instance method? Or, at the very least, shouldn't it be in a util file?
Just followed the implementation approach from Arrow Flight Repo - here
I'm just saying that I don't like all these non-object-oriented concepts that are currently going on in the code.
+1 for changing it to `@staticmethod` or extracting it as a separate function
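i.e. something like the following, with the body adapted from the Arrow Flight example the thread links to:

```python
import pyarrow.flight as fl


class OfflineServer(fl.FlightServerBase):
    @staticmethod
    def descriptor_to_key(descriptor: fl.FlightDescriptor):
        # Build a hashable key out of the descriptor's type, command and path
        return (
            descriptor.descriptor_type.value,
            descriptor.command,
            tuple(descriptor.path or tuple()),
        )
```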
@dmartinol hey, you can also reuse some of the existing offline store tests by creating a custom `DataSourceCreator`.
You mean to develop an integration test, right?
@dmartinol sure, it's a bit of an unusual case because of the additional server-side feature store, but other configs also require external resources here and there. For example, the s3+duckdb config brings up minio in a container during startup. I was thinking we could do something similar here (ideally in a container, but it might be hard to copy the current code in there). Our version of `DataSourceCreator` would: ...
I can give it a try later today and give you some draft version, if you're fine with that...
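For reference, a very rough sketch of what such a creator might look like. Everything here (class names, config fields, the import path) is an assumption based on Feast's universal test framework, not the draft mentioned above:

```python
# All names below are assumptions for illustration only.
from tests.integration.feature_repos.universal.data_source_creator import (
    DataSourceCreator,
)


class RemoteOfflineStoreDataSourceCreator(DataSourceCreator):
    def __init__(self, project_name: str, *args, **kwargs):
        super().__init__(project_name)
        # Bring up the server-side feature store and the offline server here,
        # similar to how the s3+duckdb config starts minio during startup.

    def create_offline_store_config(self):
        # Point the client-side store at the spawned offline server
        # (RemoteOfflineStoreConfig is a hypothetical config class)
        return RemoteOfflineStoreConfig(host="localhost", port=8815)

    def teardown(self):
        # Stop the server and clean up any resources it created
        ...
```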
@tokoko sure, any help is appreciated. BTW: what about using the regular ...
@dmartinol We could do it that way (similar to the feature server), but we would be missing out on easily reusing the offline store tests in that case. The reason they decided to treat the feature server as a special case was probably that they weren't planning to put the feature server behind "normal" feast APIs. Once ... There's not too much difference between ...
Turns out you're right, it's more complicated than I thought initially; in particular, the registry configuration for the server-side feature store is impossible to acquire now, ...
@dmartinol Forgive me if this is unnecessary 😄 After the changes in 4210, you should be able to do something like this: ...
Thanks @tokoko for your prompt reaction! I will give it a try tomorrow and let you know!
@tokoko sorry for the delayed reply, but I faced some issues with this implementation. ...
Any hints?
I see your point, and I found the problem (not the fix).

feature_views: List[FeatureView],
feature_refs: List[str],

The remote offline store cannot rebuild the original feature store API, so it transfers the input params to the Arrow server as-is. On the server we invoke the offline store API, bypassing the feature store (it's not in the current PR, but a recent change to be committed). The problem, as you said, is that when there are feature services, the ...
Oh... right, got it. Trying to pass the FV objects could be a workaround, but definitely not a long-term solution. If the server accepted arbitrary objects from the client instead of retrieving them from the registry, there would be no way to access additional info like auth/permissions in a trusted way. We shouldn't waste time on that, I think. Looks like we'll have to redesign the OfflineStore API sooner rather than later; it should accept both FeatureServices and FeatureViews/Features as simple python objects. P.S. I don't think there will be many tests testing for FeatureServices with aliases; let's simply disable them for the remote offline store for the time being.
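That matches the direction the next revision takes (see the next comment): the client sends feature view names, and the server resolves the actual objects from its own registry, keeping auth/permissions under server control. A sketch of that resolution, with an assumed helper name:

```python
from typing import List

from feast import FeatureStore
from feast.feature_view import FeatureView


def resolve_feature_views(
    store: FeatureStore, feature_view_names: List[str]
) -> List[FeatureView]:
    # The server trusts only its own registry: look each view up by name
    # instead of accepting arbitrary FeatureView objects from the client.
    return [store.get_feature_view(name) for name in feature_view_names]
```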
@tokoko I reviewed the server implementation to pass feature view names and disabled the ITs using feature services. It will be merged into this PR after internal review.
[IMO the first 2 points deserve their own GH issue]
What this PR does / why we need it:
This PR anticipates the changes needed to implement an offline store using an Arrow Flight server for data transfer.

Please note that we're using a fork of the upstream repo under the RHEcosystemAppEng organization to facilitate collaboration among the team's developers: once we agree with the community members on the final approach, we'll complete the missing functionality and send a PR to the community repo.

Which issue(s) this PR fixes:
Partially addresses #4032
Notes
The PR includes a working example that the reviewers can run, with a remote offline store server and the related client.
This example is only meant for explanatory purposes and will not be part of the final PR.
The example includes a README file with some details on the proposed parameter transfer protocol, which explains how we decided to transfer the parameters of the OfflineStore APIs to the remote server. Again, this will not be committed as-is to the upstream repo; we'll create a dedicated section in the user guide instead.