Add storage interfaces, basic file structure #529
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: zhilingc.
/retest
1 similar comment
/retest
Force-pushed 03960e0 to 66e9520
/retest
Force-pushed 4e461a1 to 11a7083
Force-pushed 11a7083 to 1bd20c5
lgtm
/lgtm
* Add storage interfaces, basic file structure
* Apply spotless, add comments
* Move parseResponse and isEmpty to response object
* Make changes to write interface to be more beam-like
* Pass feature specs to the retriever
* Pass feature specs to online retriever
* Add FeatureSetRequest
* Add mistakenly removed TestUtil
* Add mistakenly removed TestUtil
Post-merge review, hope it can still be useful.
I have a major concern about a single storage-api module having Beam dependencies, which serving should not have. We see it coming to life in #553, which adds an exclusion on org.apache.beam:*.
@zhilingc had some thoughts that there may be unavoidable realities. It seems like a nuanced discussion; maybe we should take it to the public Feast Slack or community calls.
Although writing via Beam IOs is very much part of "storage", I feel it's vital that the storage API needs of serving and ingestion are decoupled at a module level.
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.12</version>
We can omit the JUnit version from all of these; 4.12 is the one applied by our <dependencyManagement>, via the BOM imported from Spring. Apache commons-lang3 is also there, at version 3.7.
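To illustrate the suggestion: assuming the parent pom does pin JUnit 4.12 in <dependencyManagement> (e.g. via the imported Spring BOM), a child module's declaration could drop the version entirely. A sketch, not the actual Feast pom:

```xml
<!-- Sketch: <version> omitted; 4.12 is inherited from the parent's <dependencyManagement> -->
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
</dependency>
```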
gotcha
import java.util.List;

/** Interface for implementing user defined retrieval functionality from Batch/historical stores. */
public interface BatchRetriever {
Nitpick: "Interface for implementing" is redundant; it is (statically) an interface, and those are always for implementing.
Personally I find it helpful, as a reader and writer, to read/think in terms of the abstraction that's represented, as a noun: what an instance would be/do. One of the first sentences of the java.nio.channels.Channel interface Javadoc, for instance:
A channel represents an open connection to an entity such as a hardware device, a file, a network socket, or a program component that is capable of performing one or more distinct I/O operations, for example reading or writing.
This is harder when you get into things that are architectural glue more than domain concepts. I try to defer defining those for as long as possible… we're necessarily near the edge of it here.
I'll take a riff: /** A BatchRetriever fetches historical feature values from storage. */
Thinking about it, I wonder if HistoricalRetriever is a more expressive name. It's really what conceptually, and technically, distinguishes it from online retrieval in Feast. "Offline" and "batch" are more overloaded, fuzzier terms. Something like Druid or various streaming SQL engines could make online historical usage feasible. Whether that has a place in Feast is beyond the subject here, but hopefully my point stands regardless.
Honestly, I liked your first iteration with the getFeatures naming, and I agree with your instinct that single implementations like class SQLLiteRetriever implements BatchRetriever, OnlineRetriever are going to be the exception much more than the norm. And they would still be possible with the overloads, even elegant. It's subjective taste at that point though.
> Thinking about it, I wonder if HistoricalRetriever is a more expressive name.

Makes sense. We used to call it historical serving, but for some reason it just didn't have the right ring to it. It's definitely more accurate than simply batch.
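To make the naming and Javadoc suggestion concrete, here is a hypothetical sketch; the signature and return types are illustrative only, not the actual Feast interface:

```java
import java.util.List;
import java.util.Map;

/** A HistoricalRetriever fetches historical feature values from storage. */
interface HistoricalRetriever {
  /** Returns, for each entity row, a map of feature reference to retrieved value. */
  List<Map<String, Object>> getHistoricalFeatures(List<String> featureRefs);
}
```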
 * contains the error to be returned to the user.
 */
ServingAPIProto.Job getBatchFeatures(
    ServingAPIProto.GetBatchFeaturesRequest request, List<FeatureSetRequest> featureSetRequests);
This may be controversial because it might imply many new classes that are very close to proto-generated ones, plus a translation layer somewhere with some tedious boilerplate, but I would really like to see "api" modules working with domain model types that are not proto wire format. The newly defined FeatureSetRequest in this PR as part of "api" is a step in the right direction to me, but it's essentially just a wrapper of ImmutableSet<FeatureReference>, so I wonder if we could go further and use that directly here instead of a wrapper (although FeatureReference still comes from proto, maybe that's acceptable when it's more of a fundamental data model type and not a request object, DTO, etc.).
Concretely, I'm skeptical of ServingAPIProto.GetBatchFeaturesRequest and ServingAPIProto.Job at this layer: the former is a request object that should go no deeper than the RPC controller layer of serving, IMO. Could the latter be served here by feast.core.model.Job? Granted that having storage-api depend on core is not ideal, but it may be that ubiquitous models that cross these new layers need to move (perhaps datatypes-java starts to carry source code beyond only generated code, or "job" is really fundamentally an Ingestion domain concept and a Job model could live in an "ingestion-api" that helps bridge us from Core to Serving doing job management…).
It's a bit odd to me that ServingAPIProto.Job is defined where it is currently anyhow, though I guess this might also be part of the inclination to move job management. And I acknowledge it might be justified to have different contracts for what a client making a historical query cares about regarding status, versus more internal concerns of ingestion job management. But it's like a serving-specific View Model of the core domain model then. I digress; this is a tangent from this already rambling comment, but a good one to carry on elsewhere.
It feels redundant that this interface method takes both GetBatchFeaturesRequest and List<FeatureSetRequest> as arguments; there's substantial overlap, isn't there? It seems like incidental complexity for the caller to provide both; all the information should be in the feature references. If we boil this down to essentials, forgetting any existing types that happen to be defined, I think the method requires these things:
- Set/list of (qualified) feature references
- Set of entities, paired with a datetime
(Do we also handle/support a query-supplied max age per feature set, for batch? Proto schemas allow it but Python API doesn't seem to).
As an API interface, what would it look like to simplify this to domain model and basic collection types?
Sorry for such a long and meandering comment…
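To sketch what "domain model and basic collection types" could look like for the two essentials listed above; every type name here is hypothetical, not the actual Feast API:

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

// Illustrative domain types, not the actual Feast classes.
record FeatureReference(String featureSet, String name) {}

record EntityRow(Instant entityTimestamp, Map<String, Object> fields) {}

interface BatchRetriever {
  /** Returns an identifier for the asynchronous historical retrieval job. */
  String getBatchFeatures(List<FeatureReference> features, List<EntityRow> entityRows);
}
```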
/**
 * Get the featureset associated with this response.
 *
 * @return String featureset reference in format featureSet:version
I've alluded to the same in TestUtil, but the references dearly need to become first-class types instead of strings, IMO. Again, something we should address outside this PR (though quite possibly before storage-refactor merges to mainline; their absence is bearing on these fundamental interfaces we're designing).
import feast.types.FeatureRowProto;

/** Response from an online store. */
public interface OnlineRetrieverResponse {
Similar to my earlier remark about FeatureSetRequest, I'm questioning the value at this layer of a wrapper over a simple collection of FeatureRows. They carry the information of what feature set they're associated with.
Oops. I think this is an artifact of a previous iteration that I forgot to remove.
 * @return list of {@link OnlineRetrieverResponse} for each entity row
 */
List<List<FeatureRow>> getOnlineFeatures(
    List<EntityRow> entityRows, List<FeatureSetRequest> featureSetRequests);
I think the types used in the signature here, in contrast to getBatchFeatures, lend support to my arguments there.
 *
 * @return {@link PTransform}
 */
PTransform<PCollection<FeatureRow>, WriteResult> write();
Bikeshedding, but the method name feels a little odd for what it does: it sounds imperative. Maybe writer() solves that?
I guess this is a common convention in Beam, especially for the Write transform implementations of IOs, my bad.
It's fine; most other implementations do it so that the API is fluent: it usually follows with the write destination, e.g. SomeIO.write().to(somewhere). In this case we don't supply such an option, so writer() makes more sense.
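The fluent convention being described can be sketched in miniature; this is not actual Beam code, just an illustration of why write() reads naturally when a destination follows:

```java
/** Toy sketch (not Beam) of the fluent IO naming convention: SomeIO.write().to(dest). */
class SomeIO {

  /** Returns an unconfigured write transform, to be completed fluently. */
  static Write write() {
    return new Write(null);
  }

  static class Write {
    final String destination;

    Write(String destination) {
      this.destination = destination;
    }

    /** Returns a copy of this transform configured with the write destination. */
    Write to(String destination) {
      return new Write(destination);
    }
  }
}
```

Without a to(…)-style follow-up, the imperative reading of write() has nothing to resolve it, which is the argument for writer() here.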
import org.apache.beam.sdk.values.PCollection;

/** Interface for implementing user defined feature sink functionality. */
public interface FeatureSink extends Serializable {
/** A FeatureSink defines how feature values are written to storage, through a Beam transform. */
Or something like that… The current doc tells me practically nothing, as an implementer, of what these are for (the method docs are good!). It's also a muddy description of a contract; it sounds like we're uncommitted to keeping this abstraction well-defined as time goes on.
…567)
* Add storage interfaces, basic file structure (#529)
  * Add storage interfaces, basic file structure
  * Apply spotless, add comments
  * Move parseResponse and isEmpty to response object
  * Make changes to write interface to be more beam-like
  * Pass feature specs to the retriever
  * Pass feature specs to online retriever
  * Add FeatureSetRequest
  * Add mistakenly removed TestUtil
  * Add mistakenly removed TestUtil
* Add BigQuery storage (#546)
* Add Redis storage implementation (#547)
  * Add Redis storage
  * Remove staleness check; can be checked at the service level
  * Remove staleness related tests
  * Add dependencies to top level pom
  * Clean up code
* Change serving and ingestion to use storage API (#553)
  * Change serving and ingestion to use storage API
  * Remove extra exclusion clause
* Storage refactor API and docstring tweaks (#569)
  * API and docstring tweaks
  * Fix javadoc linting errors
  * Apply spotless
  * Fix javadoc formatting
  * Drop result from HistoricalRetrievalResult constructors
* Change pipeline to use DeadletterSink API (#586)
* Add better code docs to storage refactor (#601)
  * Add better code documentation, make GetFeastServingInfo independent of retriever
  * Make getStagingLocation method of historical retriever
  * Apply spotless
* Clean up dependencies, remove exclusions at serving (#607)
* Clean up OnlineServingService code (#605)
  * Clean up OnlineServingService code to be more readable
  * Revert Metrics
* Rename storage API packages to nouns
What this PR does / why we need it:
This PR adds the proposed storage interfaces for the storage refactor mentioned in issue #402. Implementations of the interfaces defined should go into individual modules in storage/connectors. The introduction of these interfaces and their implementations will hopefully allow for greater extensibility and less code duplication, as well as unspaghettify the Serving codebase. Making this PR for discussion before I move the Redis and BQ code.
Notes:
* OnlineRetriever closely follows the implementation written by @smadarasmi for the Cassandra store. It is less generic than the BatchRetriever interface so that implementations don't have to implement the code tracking missing/stale features. EDIT: I've made the methods a lot more high-level.
* BatchRetriever methods should be more granular. If we want to support a wide range of stores, this might be as granular as it gets.
* FeatureSink writes emit WriteResult objects that support the retrieval of successful and failed attempts, in anticipation of "Shift ingestion feature metrics to after store write" #489