Add storage interfaces, basic file structure #529
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: zhilingc.
/retest
1 similar comment
/retest
Force-pushed 03960e0 to 66e9520
/retest
Force-pushed 4e461a1 to 11a7083
Force-pushed 11a7083 to 1bd20c5
lgtm
/lgtm
* Add storage interfaces, basic file structure
* Apply spotless, add comments
* Move parseResponse and isEmpty to response object
* Make changes to write interface to be more beam-like
* Pass feature specs to the retriever
* Pass feature specs to online retriever
* Add FeatureSetRequest
* Add mistakenly removed TestUtil
* Add mistakenly removed TestUtil
Post-merge review, hope it can still be useful.
I have a major concern about a single storage-api module having Beam dependencies, which serving should not have. We see it coming to life in #553, which adds an exclusion on org.apache.beam:*.
@zhilingc had some thoughts that there may be unavoidable realities. It seems like a nuanced discussion; maybe we should take it to the public Feast Slack or community calls.
Although writing via Beam IOs is very much part of "storage", I feel it's vital that the storage API needs of serving and ingestion are decoupled at a module level.
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.12</version>
We can omit the JUnit version from all of these; 4.12 is the one applied by our <dependencyManagement>, via the BOM imported from Spring. Apache commons-lang3 is also there, at version 3.7.
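To illustrate the suggestion: assuming the parent pom does pin JUnit 4.12 in <dependencyManagement> (e.g. via the imported Spring BOM), a child module's declaration could drop the version entirely. A sketch, not the actual Feast pom:

```xml
<!-- Sketch: <version> omitted; 4.12 is inherited from the parent's <dependencyManagement> -->
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
</dependency>
```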
gotcha
import java.util.List;

/** Interface for implementing user defined retrieval functionality from Batch/historical stores. */
public interface BatchRetriever {
Nitpick: "Interface for implementing" is redundant; it is (statically) an interface, and those are always for implementing.
Personally I find it helpful, as a reader and writer, to read/think in terms of the abstraction that's represented, as a noun: what an instance would be/do. One of the first sentences of the java.nio.channels.Channel interface Javadoc, for instance:
A channel represents an open connection to an entity such as a hardware device, a file, a network socket, or a program component that is capable of performing one or more distinct I/O operations, for example reading or writing.
This is harder when you get into things that are architectural glue more than domain concepts. I try to defer defining those for as long as possible… we're necessarily near the edge of it here.
I'll take a riff: /** A BatchRetriever fetches historical feature values from storage. */
Thinking about it, I wonder if HistoricalRetriever is a more expressive name. It's really what conceptually, and technically, distinguishes it from online retrieval in Feast. "Offline" and "batch" are more overloaded, fuzzier terms. Something like Druid or various streaming SQL engines could make online historical usage feasible. Whether that has a place in Feast is beyond the subject here, but hopefully my point stands regardless.
Honestly, I liked your first iteration with the getFeatures naming, and I agree with your instinct that single implementations like class SQLLiteRetriever implements BatchRetriever, OnlineRetriever are going to be the exception much more than the norm. And they would still be possible with the overloads, even elegant. It's subjective taste at that point though.
> Thinking about it, I wonder if HistoricalRetriever is a more expressive name.

Makes sense. We used to call it historical serving, but for some reason it just didn't have the right ring to it. It's definitely more accurate than simply batch.
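To make the naming and Javadoc suggestion concrete, here is a hypothetical sketch; the signature and return types are illustrative only, not the actual Feast interface:

```java
import java.util.List;
import java.util.Map;

/** A HistoricalRetriever fetches historical feature values from storage. */
interface HistoricalRetriever {
  /** Returns, for each entity row, a map of feature reference to retrieved value. */
  List<Map<String, Object>> getHistoricalFeatures(List<String> featureRefs);
}
```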
 * contains the error to be returned to the user.
 */
ServingAPIProto.Job getBatchFeatures(
    ServingAPIProto.GetBatchFeaturesRequest request, List<FeatureSetRequest> featureSetRequests);
This may be controversial because it might imply many new classes that are very close to proto-generated ones, plus a translation layer somewhere with some tedious boilerplate, but I would really like to see "api" modules working with domain model types that are not proto wire format. The newly defined FeatureSetRequest in this PR as part of "api" is a step in the right direction to me, but it's essentially just a wrapper of ImmutableSet<FeatureReference>, so I wonder if we could go further and use that directly here instead of a wrapper (although FeatureReference still comes from proto, maybe that's acceptable when it's more of a fundamental data model type and not a request object, DTO, etc.).
Concretely, I'm skeptical of ServingAPIProto.GetBatchFeaturesRequest and ServingAPIProto.Job at this layer: the former is a request object that should go no deeper than the RPC controller layer of serving, IMO. Could the latter be served here by feast.core.model.Job? Granted that having storage-api depend on core is not ideal, but it may be that ubiquitous models that cross these new layers need to move (perhaps datatypes-java starts to carry source code beyond only generated code, or "job" is really fundamentally an Ingestion domain concept and a Job model could live in an "ingestion-api" that helps bridge us from Core to Serving doing job management…).
It's a bit odd to me that ServingAPIProto.Job is defined where it is currently anyhow, though I guess this might also be part of the inclination to move job management. And I acknowledge it might be justified to have different contracts for what a client making a historical query cares about regarding status, versus more internal concerns of ingestion job management. But it's like a serving-specific View Model of the core domain model then. I digress; this is a tangent from this already rambling comment, but a good one to carry on elsewhere.
It feels redundant that this interface method takes both GetBatchFeaturesRequest and List<FeatureSetRequest> as arguments; there's substantial overlap, isn't there? It seems like incidental complexity for the caller to provide both; all the information should be in the feature references. If we boil this down to essentials, forgetting any existing types that happen to be defined, I think the method requires these things:
- Set/list of (qualified) feature references
- Set of entities, paired with a datetime
(Do we also handle/support a query-supplied max age per feature set, for batch? Proto schemas allow it but Python API doesn't seem to).
As an API interface, what would it look like to simplify this to domain model and basic collection types?
Sorry for such a long and meandering comment…
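To sketch what "domain model and basic collection types" could look like for the two essentials listed above; every type name here is hypothetical, not the actual Feast API:

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

// Illustrative domain types, not the actual Feast classes.
record FeatureReference(String featureSet, String name) {}

record EntityRow(Instant entityTimestamp, Map<String, Object> fields) {}

interface BatchRetriever {
  /** Returns an identifier for the asynchronous historical retrieval job. */
  String getBatchFeatures(List<FeatureReference> features, List<EntityRow> entityRows);
}
```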
/**
 * Get the featureset associated with this response.
 *
 * @return String featureset reference in format featureSet:version
I've alluded to the same in TestUtil, but the references dearly need to become first-class types instead of strings, IMO. Again, something we should address outside this PR (though quite possibly before storage-refactor merges to mainline; their absence is bearing on these fundamental interfaces we're designing).
import feast.types.FeatureRowProto;

/** Response from an online store. */
public interface OnlineRetrieverResponse {
Similar to my earlier remark about FeatureSetRequest, I'm questioning the value at this layer of a wrapper over a simple collection of FeatureRows. They carry the information of what feature set they're associated with.
Oops. I think this is an artifact of a previous iteration that I forgot to remove.
 * @return list of {@link OnlineRetrieverResponse} for each entity row
 */
List<List<FeatureRow>> getOnlineFeatures(
    List<EntityRow> entityRows, List<FeatureSetRequest> featureSetRequests);
I think the types used in the signature here, in contrast to getBatchFeatures, lend support to my arguments there.
 *
 * @return {@link PTransform}
 */
PTransform<PCollection<FeatureRow>, WriteResult> write();
Bikeshedding, but the method name feels a little odd for what it does: it sounds imperative. Maybe writer() solves that?
I guess this is a common convention in Beam, especially for the Write transform implementations of IOs, my bad.
It's fine; most other implementations do it so that the API is fluent: it usually follows with the write destination, e.g. SomeIO.write().to(somewhere). In this case we don't supply such an option, so writer() makes more sense.
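The fluent convention being described can be sketched in miniature; this is not actual Beam code, just an illustration of why write() reads naturally when a destination follows:

```java
/** Toy sketch (not Beam) of the fluent IO naming convention: SomeIO.write().to(dest). */
class SomeIO {

  /** Returns an unconfigured write transform, to be completed fluently. */
  static Write write() {
    return new Write(null);
  }

  static class Write {
    final String destination;

    Write(String destination) {
      this.destination = destination;
    }

    /** Returns a copy of this transform configured with the write destination. */
    Write to(String destination) {
      return new Write(destination);
    }
  }
}
```

Without a to(…)-style follow-up, the imperative reading of write() has nothing to resolve it, which is the argument for writer() here.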
import org.apache.beam.sdk.values.PCollection;

/** Interface for implementing user defined feature sink functionality. */
public interface FeatureSink extends Serializable {
/** A FeatureSink defines how feature values are written to storage, through a Beam transform. */
Or something like that… The current doc tells me practically nothing, as an implementer, of what these are for (the method docs are good!). It's also a muddy description of a contract; it sounds like we're uncommitted to keeping this abstraction well-defined as time goes on.
…567)
* Add storage interfaces, basic file structure (#529)
  * Add storage interfaces, basic file structure
  * Apply spotless, add comments
  * Move parseResponse and isEmpty to response object
  * Make changes to write interface to be more beam-like
  * Pass feature specs to the retriever
  * Pass feature specs to online retriever
  * Add FeatureSetRequest
  * Add mistakenly removed TestUtil
  * Add mistakenly removed TestUtil
* Add BigQuery storage (#546)
* Add Redis storage implementation (#547)
  * Add Redis storage
  * Remove staleness check; can be checked at the service level
  * Remove staleness related tests
  * Add dependencies to top level pom
  * Clean up code
* Change serving and ingestion to use storage API (#553)
  * Change serving and ingestion to use storage API
  * Remove extra exclusion clause
* Storage refactor API and docstring tweaks (#569)
  * API and docstring tweaks
  * Fix javadoc linting errors
  * Apply spotless
  * Fix javadoc formatting
  * Drop result from HistoricalRetrievalResult constructors
* Change pipeline to use DeadletterSink API (#586)
* Add better code docs to storage refactor (#601)
  * Add better code documentation, make GetFeastServingInfo independent of retriever
  * Make getStagingLocation method of historical retriever
  * Apply spotless
* Clean up dependencies, remove exclusions at serving (#607)
* Clean up OnlineServingService code (#605)
  * Clean up OnlineServingService code to be more readable
  * Revert Metrics
* Rename storage API packages to nouns
What this PR does / why we need it:
This PR adds the proposed storage interfaces for the storage refactor mentioned in issue #402. Implementations of the interfaces defined should go into individual modules in storage/connectors. The introduction of these interfaces and their implementations will hopefully allow for greater extensibility and less code duplication, as well as unspaghettify the Serving codebase. Making this PR for discussion before I move the Redis and BQ code.
Notes:
* OnlineRetriever closely follows the implementation written by @smadarasmi for the Cassandra store. It is less generic than the BatchRetriever interface so that implementations don't have to implement the code tracking missing/stale features. EDIT: I've made the methods a lot more high-level.
* BatchRetriever methods should be more granular. If we want to support a wide range of stores, this might be as granular as it gets.
* FeatureSink writes emit WriteResult objects that support the retrieval of successful and failed attempts, in anticipation of "Shift ingestion feature metrics to after store write" #489