Feast API: Sources #633
Comments
By "we" do you mean your current team? I'd love to get a better understanding of how you are currently doing it, especially if it means we can improve sources or stores.
I agree 100%. Part of the reason why we made sources optional was exactly this. We didn't want a data scientist to have to think about configuring a source. In hindsight this created a lot of complexity that I think we should have delayed, but here we are.

Sources today are a glorified "connection string". If a source were only ever to stay a connection string, then I don't really see the point in exposing it to users. It can easily be configured behind the scenes by administrators and exposed through source names or some more human-friendly scheme. Users can then select the source they want to use. If I understand correctly, this is the largest part of what you consider a bad separation of concerns.

However, I do think there is value in having users configure some aspects of the sourcing of data. Perhaps source isn't the right term, but we can discuss that later. Most of our data still lives in data lakes or data warehouses, which means federation is a natural next step. In a federated model we would probably opt to extend sources to allow new data sources to be accessed through Feast, especially without users having to export and reimport into Feast, and with Feast being lazy towards retrieval and exports (no long-running jobs).

The next question is then: how does this influence the Feast API? One extreme is the Uber approach towards data transformation. There are parts of that API that I like, and parts that I don't. I think we are in agreement that we shouldn't be building a general-purpose data transformation system, for example. However, I see massive value in allowing users to define SQL queries, and this is the direction that I would like to take sources (or, if not sources, another part of feature sets).

One of the main reasons why I see this as valuable is that all of our users are familiar with SQL. Using SQL improves the Feast user experience because users are able to validate and prove that their query works without Feast in the loop. They can then bring that query to Feast, publish it, and see the results. If there is a failure, the problem is likely Feast. SQL is also supported by virtually all sources and stores.
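To make this concrete, here is a rough sketch of what a SQL-defined source could look like in the Python SDK. This is purely illustrative: the BigQuerySource class, its query parameter, and the project/dataset/table names are all hypothetical, since Feast today only supports KafkaSource.

```python
from feast import Entity, Feature, FeatureSet, ValueType

# Hypothetical: a source defined by a SQL query instead of a
# connection string. BigQuerySource and its `query` parameter do not
# exist in Feast today; they only sketch the proposal.
driver_stats = FeatureSet(
    name="driver_stats",
    entities=[Entity(name="driver_id", dtype=ValueType.INT64)],
    features=[
        Feature(name="trips_today", dtype=ValueType.INT32),
        Feature(name="acceptance_rate", dtype=ValueType.FLOAT),
    ],
    source=BigQuerySource(  # hypothetical class
        query="""
            SELECT driver_id, trips_today, acceptance_rate, event_timestamp
            FROM my_project.telemetry.driver_trips
        """
    ),
)
```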
There is a bit of leakiness here in that the project, dataset, and table are leaked through. In theory we could abstract all of this away in the SQL, but it would require a bit more development effort and detract from the development experience. Would love to have your thoughts on this @ches.
Thanks for the thoughtful reply @woop. I didn't pay close attention to the 0.5 milestone on #632 so thanks for diverting it here, makes sense to open this discussion issue that parallels other API ones.
Yes, at Agoda our Feast deployment is perhaps more centralized, in the sense that storage clusters and Kafka topics are shared, centrally operated infrastructure rather than per-team deployments.
As I'm sure is true for Gojek too, some core entity types in our business domain have very high cardinality (e.g. customers). Most client teams serving online will use some features of these, and it isn't practical or economical for us to deploy many storage cluster islands that can support the scale. Cassandra is massively scalable; Kafka throughput is massively scalable; we have dedicated teams expert at operating them. There's also the merit that new clients don't need to provision new infrastructure to start using the system; this is one of the key problems we're solving from the status quo before Feast. (We track cost attribution in other ways, if anyone wonders about that aspect.)

The one point above that I imagine could become more flexible over time is the Kafka topics. There may be use cases for special-purpose / priority ones, and I believe it should be straightforward to support that if the need arises. That brings up a notable distinction in regard to the current Source configuration, I think: if we did support this, it would be useful to declaratively associate feature sets with source topics (as Feast already allows). However, users will never need to think about the brokers: they will differ for the same topic name across DCs, and our SDK wrapper and Ingestion get them from service discovery. I think this speaks to your thought that "there is value in having users configure some aspects of the sourcing of data".
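To sketch that division of labor (the wrapper and the discover_brokers helper are hypothetical, assuming the current KafkaSource): the user declares only the topic, and a thin wrapper resolves brokers from service discovery before handing Feast a complete source.

```python
from feast import FeatureSet
from feast.source import KafkaSource

def discover_brokers(topic: str) -> str:
    # Hypothetical: look up the broker list for this topic in the
    # current DC via service discovery.
    raise NotImplementedError

def build_feature_set(name: str, topic: str) -> FeatureSet:
    # Users declare only the topic; the wrapper supplies brokers, so
    # operational details never appear in user configuration.
    return FeatureSet(
        name=name,
        source=KafkaSource(brokers=discover_brokers(topic), topic=topic),
    )
```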
Yes I believe we're on the same page then. Roughly, an abstraction over operational or environmental details of infrastructure. Operators of a Feast deployment could plug service discovery into this abstraction, potentially.
I feel federation is an elegant idea in theory, but I'm initially skeptical of how it will work out in practice. That's not to say it isn't worth trying or to discourage it; I just would urge breaking it off into an MVP to validate without disrupting Feast's ongoing technical improvement and data model refinement. It could be a year spent rearchitecting for such a pivot in vision, with considerable risk that it doesn't work out well or serve users markedly better. Some of my concerns with it:
We may learn differently with more experience, but at the outset in our org I think we are content to bring data into managed feature store storage. There's a cultural expectation that it is "special" data, expected to be subject to higher quality standards, stable maintenance, etc., which federated sources may not be.
I'm on board with looking for ways to use SQL as an interface to the system. It does make barriers vanish for many potential users, especially data scientists/engineers/analysts who can contribute new data sets to the feature store without more specialized development knowledge/skills. Indeed, something that happened even before Feast went live for us was another team eagerly building an integration with an in-house ETL tool we have to move data between engines with—you guessed it—SQL expressions of the input. So we've already "solved" this, in a proprietary way and with some of the overhead of redundant import/exports that you refer to with federation.

We (at least I 😇) have a vision/dream of a streaming platform where users express Beam/Flink/Spark/whatever SQL, with the ability to include/join feature store data and (optionally) to ingest results into the feature store in the same engine DAG (no extra hop out through Kafka or the like). In theory we are not that far from the query part; the data is already in tables the engine can make available. I may have lost the thread a little bit there, but hopefully it gives color to the ubiquity of SQL.
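For flavor, a minimal sketch of the streaming-SQL part of that dream, using Apache Beam's SqlTransform (one of the engines mentioned). It covers only the query step; the feature store join and in-DAG ingestion are exactly the parts that do not exist yet, and SqlTransform itself runs through Beam's Java-based SQL expansion service.

```python
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

# Sketch: express a feature computation as SQL inside the engine's own
# DAG. Joining feature store data and writing results back into the
# store in the same pipeline is the part that remains a dream.
with beam.Pipeline() as pipeline:
    features = (
        pipeline
        | beam.Create([beam.Row(driver_id=1, trips=3)])
        | SqlTransform("SELECT driver_id, trips * 2 AS trips_doubled FROM PCOLLECTION")
    )
```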
This really strikes a chord on another note for me, perhaps outside the scope of this issue: the notion of centralization versus decentralization. The question in my mind is whether we ever even need multiple historical stores per Feast deployment.

Why is this important? Well, what happens when we move job management out of Feast Core? With a single historical store that picture stays simple. However, it falls apart in another way: if a user ingests data, where do they ingest to? Do they ingest straight to one local store? Also, where would a user interface go? If it sits on Feast Core, the same question applies.

Anyway, the bottom line is that I think a lot of problems could be solved by only having a single historical store per Feast deployment.
I wish this was the case for us. We currently have to manage our own storage clusters unfortunately, which is becoming a little bit of a pain. We'll probably be doubling down on Redis Cluster once 0.5 goes live, and try to keep things as simple as possible.
In some ways this ties to what I was saying above about having one historical store, and what the architecture looks like when/if job management moves out of Feast Core.
Given our track record of hitting deadlines, I am just as wary about scope creep here. Just to be clear, I would not pick up federation as the sole way of dealing with data, and I would not start it if we can't build an MVP off the critical path.
This is a valid concern. I'd want to limit the scope of this implementation as much as possible, and potentially add constraints. Let's take the BQ example from earlier. MVP: for every user that creates a feature set with an external BQ source, we create (1) a BigQuery view over the external table and (2) jobs that periodically update online stores from this view. What does that give us? It saves us a lot of money on Dataflow jobs and means we manage less infrastructure. It means users don't have to ingest data (they can simply plug in their queries). Nothing changes for either online or batch serving.
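A minimal sketch of step (1) using the google-cloud-bigquery client; the project, dataset, and table names are made up, and the step (2) refresh jobs would be scheduled separately.

```python
from google.cloud import bigquery

client = bigquery.Client()

# (1) Register a view over the user's external table, so batch access
# reads the data in place with no ingestion job. Names are illustrative.
view = bigquery.Table("feast-project.feast_views.driver_stats")
view.view_query = """
    SELECT driver_id, trips_today, acceptance_rate, event_timestamp
    FROM `external-project.telemetry.driver_trips`
"""
client.create_table(view)

# (2) A separately scheduled job would periodically load rows from
# this view into the online store; omitted here.
```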
I agree with this, but this problem doesn't go away in Gojek's case at least. Essentially we are adding one more step to "federation" by having the user extract the data and publish it. However, if the user isn't savvy then they will just propagate the problems from the source straight to Feast. Then, when something goes wrong they will ask why Feast doesn't have the right data. Tracing this back upstream then becomes harder because Feast doesn't have any context outside of rows landing on a stream. I'm not saying one approach is definitely superior, but I do think that there are trade-offs here.
I think federation in and of itself doesn't really add much value to end users. It adds value to the operators. How Feast implements "external data sourcing" can change over time; as long as the development API that we expose is more useful to our users, they will be happy. I can see that being the case with us exposing SQL through sources, especially on the batch-only side where our users are less savvy.
I like this. I think we have touched on this idea before. I am hoping that the tooling we develop around statistics and validation will eventually lend itself to higher trust from our users here, but many of our data scientists have asked for more tools around ingestion. They don't want to push bad data into Feast, so they want to know whether it is good or bad prior to pushing. Or they want tools to undo a bad ingestion once it fails or if the statistics look off. Something to ponder.
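As a hedged sketch of "check before pushing" with the current Python SDK: validate the DataFrame locally and only call ingest when the checks pass. The column names and checks here are illustrative, not a built-in Feast feature.

```python
import pandas as pd
from feast import Client

def safe_ingest(client: Client, feature_set, df: pd.DataFrame) -> None:
    # Illustrative pre-ingestion checks; not built into Feast.
    if df["event_timestamp"].isnull().any():
        raise ValueError("rows with missing event_timestamp")
    if not df["acceptance_rate"].between(0, 1).all():
        raise ValueError("acceptance_rate outside [0, 1]")
    # Only data that passed the checks reaches Feast.
    client.ingest(feature_set, df)
```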
This sounds somewhat similar to what we have internally right now, but the integration with Feast does require that extra hop through Kafka. The value add I can see right now is not so much on the streaming side, however, even though that aspect is a dream state to aim for. We have enough tools to allow users to create new streams and streaming transformations and get them into Feast. I think the part that is still a bit painful is the export/import flow, but perhaps that can be made easier through better controls at ingestion time.
Oh yeah, I should have made clearer that in that dream vision I don't see Feast being the whole thing, but a component of it.
I have been thinking about some recurring issues related to sources, job management, stores, and streams. These issues are not isolated; they have bled through in the questions people ask on various proposals. Please see this document for architecture diagrams that illustrate the distinctions I am making, along with more details on the problems and the proposal itself.
This issue can be used to discuss the role of sources in Feast, and how we see the concept evolving in future versions.

Status quo

Feast currently supports only a single Source type, KafkaSource. This can be defined through a Feature Set or omitted. If users omit the source, Feast Core will fill in a default for the user.

Contributor comment from @ches (link)