
Modularize ingestion distributed compute engine support #444

Closed · ches opened this issue Jan 27, 2020 · 4 comments

ches (Member) commented Jan 27, 2020

This is a companion to #402 and the larger topic of storage engine modularization, which was realized in #529 and subsequent PRs that implemented the new interfaces.

Just as adding support for new storage engines tends to cause a dependency explosion in Feast ingestion & serving, the same is true of the Beam Runner / job-management adapter glue in core (under future plans this could all move to serving, but that won't change the fundamental problem this issue is about).

So for both storage and compute engines, I feel that some modularity strategy is needed: loose binding at build time, configurable at runtime. The goals would be to:

  • Minimize the dependency pain that developers and contributors to Feast have to deal with when they are not actively working on a particular stack. The dependency trees are often large and fragile, especially in the Hadoop ecosystem (e.g. Hive and Spark).
  • Reduce deployment bloat for operators who wish to package Feast internally with only the module JARs needed to support their organization's stack. (IIRC, last I checked, hadoop-common or hadoop-client leave you with close to 200MB of JARs, and beam-runners-spark and beam-sdks-java-io-hcatalog among others have these deps [as provided scope, but the point stands, I believe].)

Possibilities might be OSGi or java.util.ServiceLoader (with Spring integration, or alternatives thereof). Open to other ideas!
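
For concreteness, here is a minimal sketch of the ServiceLoader route. Everything in it is hypothetical — the JobRunner SPI, its methods, and the artifact layout are illustrative, not an existing Feast interface:

```java
import java.util.ServiceLoader;

// Hypothetical SPI that each compute-engine module would implement. It would
// live in a thin, engine-free API artifact; each module JAR then registers its
// implementation class in META-INF/services/<fully.qualified.JobRunner>.
interface JobRunner {
  String name();                // matched against runtime config, e.g. "DataflowRunner"
  void submit(String jobSpec);  // the module pulls in its own Beam runner deps
}

class RunnerBinding {
  /** Resolve the configured runner from whatever modules are on the classpath. */
  static JobRunner forName(String configured) {
    for (JobRunner runner : ServiceLoader.load(JobRunner.class)) {
      if (runner.name().equals(configured)) {
        return runner;
      }
    }
    throw new IllegalStateException("No JobRunner module found for: " + configured);
  }
}
```

Operators would then ship only the module JARs they need on the classpath, and core binds the configured implementation at startup without compiling against any engine.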

Relates to #362

woop (Member) commented Jan 29, 2020

Agreed that this is a problem. I think we should define the exact extension points very clearly, though.

It can be hard to talk about this in the abstract, so the questions I see are:

  1. Which specific compute or storage engines do we already see a need to cover?
  2. At which specific points in the code base do we need to integrate a modularization layer?

Beyond that, I can see the introduction of this layer adding a lot of overhead and complexity in the short term, even though it will pay dividends if teams are starting to fork the code base now (which may already be the case with 0.3). I would want to make sure we have alignment on the future direction of Feast so that the architecture is stable before we solidify these modularization points.

stale bot commented Mar 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Mar 29, 2020
woop added the keep-open label Mar 29, 2020
stale bot removed the wontfix label Mar 29, 2020
dr3s (Collaborator) commented Jun 5, 2020

I'm extremely wary of this type of complexity within the service. I'm pretty biased, but I would prefer something more along the lines of these options:

  • Optimize for one open-source implementation (Flink and JDBC) that is packaged with Feast, plus connectors for managed services such as Dataflow and BigQuery. That should minimize the client libraries.
  • Keep modularity at the microservice boundary. For instance, Spark could be supported by running https://github.com/spark-jobserver/spark-jobserver, which bundles the Spark libraries (see the sketch below). Even Dataflow and BigQuery could be separate microservices and optional parts of the installation.

I don't know what the state of the art is these days with OSGi or other SPI frameworks, but I don't think the complexity they bring is worth it.
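
To make the microservice-boundary option concrete: core would speak plain HTTP to a job server process that bundles the Spark libraries, rather than linking them in. A rough sketch — the endpoint shape, parameters, and class here are illustrative assumptions, not spark-jobserver's actual API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Core submits ingestion jobs over HTTP; all Spark dependencies live in the
// job server's own JVM. URL shape and payload are assumptions for illustration.
public class JobServerClient {
  private final HttpClient http = HttpClient.newHttpClient();

  String submit(String jobServerUrl, String appName, String jobConfig) throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(jobServerUrl + "/jobs?appName=" + appName))
        .POST(HttpRequest.BodyPublishers.ofString(jobConfig))
        .build();
    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
    return response.body();  // e.g. a job id or status document
  }
}
```

The trade-off versus in-process plugins: dependency isolation is complete (separate JVMs), at the cost of an extra deployable and a network hop.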

woop (Member) commented Feb 8, 2021

Closing this issue since it is now stale. The Job Service manages jobs, and we have different launcher implementations available. Currently we are using Spark exclusively.

woop closed this as completed Feb 8, 2021