High availability support for Federation #349
Comments
upvoteeeee 👏 👏 👏 👏 👏 |
It sounds to me like managed federation solves both of these issues? In managed mode, you will rarely need to re-deploy the gateway. Once it's up and running, it will get updated as federated services notify it via |
Managed isn't an option for every team, mine included. It would be nice to have this built into the gateway or a service that can be deployed alongside the gateway |
@shaneu can you explain why managed isn't an option for your team? I am just curious and playing the devil's advocate :) |
@rtymchyk I work for an enterprise bank that, due to internal policy and compliance rules, must manage all infrastructure on prem. Unless I'm incorrect, Apollo Graph Manager does not have an on-prem solution. I'm sure my team's situation in that regard is not unique. |
I'm also blocked by this... If a federated container needs to restart then the gateway becomes pretty much useless since it won't try to reconnect/reload and will just start throwing |
@alanhoff Is this an issue with federation? If your container needs to restart, you should be doing rolling restarts (or rolling updates for deployments), i.e. there shouldn't be a state where your service is down during a restart/deploy, regardless of whether or not it is used for federation. |
@rtymchyk When spinning up new ephemeral environments, for example, Apollo federation induces a startup order on services. In very large organizations with hundreds of microservices, this kind of hard coded ordering of services is a PAIN to maintain. As another example, when doing continuous deployment, a new version of a federated service means that you need to remember to redeploy the Apollo gateway(s) that depend on that service, hard coding this interdependency in the build pipeline, rather than in the service code itself. Like @shaneu mentioned above, a managed-off-prem dependency is a non-starter for us. Ideally, Apollo federation would have a mechanism to indicate "hey, the schema graph you constructed on startup may be out of date, please re-fetch from the same services.", to allow us to orchestrate the refresh policy in a way that meets our own needs. |
Managed federation still doesn't reinitialize if it fails to pick up the schema. The polling timer never gets started, so you just have a dud running service. |
@shaneu You said that you manage the availability of your services using Kubernetes. How do you update the gateway when a service goes down or a new service becomes available? I'm asking because I'm also using Kubernetes, and to manage these services I dynamically update the array Apollo Gateway uses for serviceList. Take this code as an example:
Once a service goes down or goes up, I update Is there a better way to do this? |
@rtymchyk Does the gateway immediately update its configuration whenever the registry is updated in managed mode? I thought the gateway would poll the registry at a fixed interval, which could introduce short disruptions on schema changes. |
@Nicoowr It's not immediate, but it's close. It shouldn't matter too much though if your API evolution is always backwards compatible. |
@rtymchyk very clear thanks. As a side note, breaking changes happen from time to time in our case, hope this won't be a problem though. |
After a year, I don't see any improvement on this one. I am curious why this is not the highest priority / critical. Does anyone use federated GraphQL in production at high scale without hitting this issue as a blocker? Can anyone point me to a best-practices example repo for deploying GraphQL services with a CI/CD pipeline that accommodates re-initialisation if some services are offline? |
Are you talking about initial boot or mid lifecycle? |
@rhzs The relevant change here is in Apollo Server. Starting with AS v2.22, the |
@glasser's change to |
Apart from going down the path of managed federation, is there any way to prevent the entire GQL gateway from crashing if one of the federated services is not available? |
Initial boot. Mid lifecycle works (when a service goes down and comes back up again).
I have noticed the improvement in polling for handling service unavailability (kudos to the team), but that only works after all the services have started up successfully. |
Yes, your server can't start unless it knows its schema, which requires knowing all the subgraph schemas or being started directly with a supergraph schema. Polling a bunch of individual subgraphs is certainly a big process which is part of why we've developed managed federation and other mechanisms of getting schemas into Gateway. |
@glasser that should at least be in this project's README / the Apollo docs on Apollo Federation scalability. There aren't many articles discussing Apollo Federation scalability; I only found this out by experiencing it on my own. |
@glasser I am curious to know: can't we just check the GraphQL schema hash (using SHA) and cache the schema in a local file, rather than fetching all remote schemas again, and only fetch the necessary remote schema (or maybe just part of it)? |
Sure, that is something you could build with the primitives. |
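The hash-and-cache idea from the two comments above could be sketched roughly as follows. This is a minimal illustration, not anything the gateway provides: `CachedSchema`, `DEFAULT_CACHE_PATH`, `updateCache`, `loadSdl` and the `fetchSdl` callback are all hypothetical names, and how the SDL is actually fetched from a subgraph is left to the caller.

```typescript
import { createHash } from "node:crypto";
import { existsSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical cache entry; not part of any Apollo API.
interface CachedSchema {
  hash: string; // SHA-256 of the SDL text
  sdl: string;
}
type SchemaCache = Record<string, CachedSchema>;

// Assumed default location; pick whatever path suits your deployment.
const DEFAULT_CACHE_PATH = join(tmpdir(), "subgraph-sdl-cache.json");

function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

function readCache(path: string): SchemaCache {
  return existsSync(path)
    ? (JSON.parse(readFileSync(path, "utf8")) as SchemaCache)
    : {};
}

// Stores the SDL for `name`; returns true only if it differs from the cached copy.
function updateCache(path: string, name: string, sdl: string): boolean {
  const cache = readCache(path);
  const hash = sha256(sdl);
  if (cache[name]?.hash === hash) return false;
  cache[name] = { hash, sdl };
  writeFileSync(path, JSON.stringify(cache, null, 2));
  return true;
}

// Tries the live service first and falls back to the cached SDL if it is down.
async function loadSdl(
  path: string,
  name: string,
  fetchSdl: () => Promise<string>,
): Promise<string> {
  try {
    const sdl = await fetchSdl();
    updateCache(path, name, sdl);
    return sdl;
  } catch (err) {
    const cached = readCache(path)[name];
    if (!cached) throw err; // nothing cached yet, so surface the failure
    return cached.sdl;
  }
}
```

The boolean returned by `updateCache` is what would let a caller skip recomposing the schema when nothing changed.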
tl;dr: Please allow cached federated service configuration for HA.

@rhzs That makes way too much sense to do though. =P Just like most GraphQL implementations allow you to emit a schema file (useful for development & having other libraries utilize the schema, i.e. client-side generators, etc). If Apollo Federation did something similar, that'd make a ton more sense. Yes, maybe in development mode the gateway can fail if a service isn't live, but in production mode, if you can use a cached file describing the services to fall back on in case of a service being down, then to me that would be a whole lot better than the gateway failing.

Likely scenario: I have a gateway running on GCR (docker) and all my services running on GCR. If a service goes down and GCR scales my gateway instances up, I'll end up with a bunch of failed docker containers when I'd rather have the service queue up requests and serve gateway timeout errors after a certain time, ideally still servicing requests for all other services that are up.

I get that yeah, we can have this entirely huge E2E test suite that runs on CI before going live, but sometimes we have to do a rushed push (configuration changes, etc). Also, let's say that there are less mission-critical services (admin analytics) and mission-critical services (payment processing) and we break the admin analytics somehow or have downtime on that service.... our entire application is down? We're not making money? This is one of the bigger reasons people move to a microservice architecture.

I keep seeing "devil's advocate" responses pushing people to the managed services instead of actually realizing this can be an issue, though I get your reasonings for not pushing people to a good solution? I worked all day on moving to Federation but think I'm going to go back to a hand-coded GraphQL "gateway" with gRPC microservices because I ran into this issue and can see where it'll become a PITA in production and cause revenue loss. |
@rhzs I found a public attempt at re-creating a schema registry: https://github.com/pipedrive/graphql-schema-registry It's pretty much the route I was attempting (the example gateway). Though they're extending ApolloGateway, and recent changes making everything private will likely break a library like this, as the Apollo team doesn't really want people to extend ApolloGateway, etc (which I get, they just need to support things like this better and cleanly). After playing with caching attempts, I'd like to figure out how to do the following (pseudo-code):

```ts
class ApolloGatewayFactory {
  protected gateway: ApolloGateway;
  protected pubsub: PubSub;

  async create(): Promise<ApolloGateway> {
    const { supergraphSdl } = await this.load();
    this.gateway = new ApolloGateway({ supergraphSdl });
    // Reload whenever something announces a schema change
    this.pubsub.subscribe('schema_changed', async () => this.update());
    return this.gateway;
  }

  async load(): Promise<SupergraphSdlGatewayConfig> {
    // .. get service from wherever you'd like, however you'd like
  }

  async update(): Promise<void> {
    const { supergraphSdl } = await this.load();
    // Wished-for method: swap in the new supergraph without a restart
    this.gateway.update({ supergraphSdl });
  }
}

async function start(): Promise<void> {
  const factory = new ApolloGatewayFactory();
  const gateway = await factory.create();
  const server = new ApolloServer({
    gateway
  });
  // ...
}
```

This is one example. Here, I could build my own polling. I could create an endpoint to force reload the gateway. I could use etcd, consul, etc. I could use managed federation but create a fallback for higher availability just in case managed Apollo sees downtime. Personally, I'd like to use "Managed Federation" just for CI and not service discovery. I'd like to be able to submit my schema changes and run validations, backwards-compatibility checks, etc. Then on success (or on failure if I'm feeling antsy), I want to deploy services however I want. For us, we just don't rely on SaaS products simply because we've lost thousands and thousands of $ from downtime. Once we did a $250k USD media buy for a single day of traffic and a service that we relied on went down. Long story short, we lost about $100k due to that. |
There is another OSS registry https://github.com/StarpTech/graphql-registry. Just FYI. |
@kindermax That's cool too. Good to see people see the issue. There just needs to be better baked in support into |
@kindermax Ironically, their examples already show similar code as what I mentioned: https://mercurius.dev/#/docs/federation

```js
setTimeout(async () => {
  const schema = await server.graphql.gateway.refresh()
  if (schema !== null) {
    server.graphql.replaceSchema(schema)
  }
}, 10000)
```
|
Looking at how Here is the code behind that:

```ts
return dataSource
  .process({
    kind: GraphQLDataSourceRequestKind.LOADING_SCHEMA,
    request,
    context: {},
  })
  .then(({ data, errors }): ServiceDefinition => {
    if (data && !errors) {
      const typeDefs = data._service.sdl as string;
      const previousDefinition = serviceSdlCache.get(name);
      // this lets us know if any downstream service has changed
      // and we need to recalculate the schema
      if (previousDefinition !== typeDefs) {
        isNewSchema = true;
      }
      serviceSdlCache.set(name, typeDefs);
      return {
        name,
        url,
        typeDefs: parse(typeDefs),
      };
    }
    throw new Error(errors?.map((e) => e.message).join('\n'));
  })
  .catch((err) => {
    const errorMessage =
      `Couldn't load service definitions for "${name}" at ${url}` +
      (err && err.message ? ': ' + err.message || err : '');
    throw new Error(errorMessage);
  });
```

Then if you have 100 services and service number 58 fails, it will only try to re-retrieve the schema for that service instead of trying to also re-retrieve services 1 - 57. Happy to submit a PR if I am not missing something. |
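The "only re-retrieve the failed service" idea in the comment above might look something like this in isolation. It is a sketch under assumptions, not the gateway's code: `ServiceEndpoint`, `SdlFetcher`, `loadAllSdl` and `maxAttempts` are hypothetical names, and the fetcher that actually talks to a subgraph is injected by the caller.

```typescript
// Hypothetical description of a subgraph; mirrors the name/url pair in the thread.
interface ServiceEndpoint {
  name: string;
  url: string;
}

// Caller-supplied function that returns the SDL for one service or rejects.
type SdlFetcher = (service: ServiceEndpoint) => Promise<string>;

// Loads SDL for every service, retrying only the ones that failed.
async function loadAllSdl(
  services: ServiceEndpoint[],
  fetchSdl: SdlFetcher,
  maxAttempts = 3,
): Promise<Map<string, string>> {
  const loaded = new Map<string, string>();
  let pending = services;

  for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
    const current = pending;
    const results = await Promise.allSettled(current.map((s) => fetchSdl(s)));
    const failed: ServiceEndpoint[] = [];
    results.forEach((result, i) => {
      if (result.status === "fulfilled") {
        loaded.set(current[i].name, result.value); // keep successes for good
      } else {
        failed.push(current[i]); // only these are tried again
      }
    });
    pending = failed;
  }

  if (pending.length > 0) {
    const names = pending.map((s) => s.name).join(", ");
    throw new Error(`Couldn't load service definitions for: ${names}`);
  }
  return loaded;
}
```

A real version would probably wait between attempts; that delay is left out here to keep the retry bookkeeping visible.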
@shaneu I believe the gateway now behaves in the way you'd like (fails to start on init if subgraphs are unavailable). Additionally, the somewhat recent introduction of the I see a handful of other concerns in this issue which I believe are also addressed by @Borduhh that sounds reasonable, would you mind opening a separate issue for discussion? Thanks all. I'll gladly reopen this issue if the original issue is in fact not resolved. |
First let me say we love federation on our team. We split our monolith into approx. 11 microservices and have been very pleased with the ease of the refactor and the results.
There were some issues around availability that we had to roll ourselves, and it led me to think maybe these would be better included in the federation/gateway package.
First of all, the gateway attempts to connect to the federated services, and if it is unable to, it still gets initialized, only without the services it failed to connect to. This is troublesome for an application that is deployed automatically, as it requires human intervention to monitor whether the gateway initialization was successful.
Our workaround was to create a `readinessProbe` that performs the same query the gateway makes to the federated services, with the goal being to make sure the services are available and ready to receive traffic. This probe has a configurable number of retries and a timeout before throwing. If the probe is unsuccessful we let the application crash, as it doesn't make sense for us to allow it to initialize without all services being available.
Perhaps this feature could be built into the gateway, along with a configurable number of retries? If it is unsuccessful it crashes and does not init. Perhaps that too could be a configuration option, `mustConnect: true/false` (I'm sure there's a better name for the property).
Other than that we let Kubernetes manage the availability of the other services via liveness probes, but for teams not using a deployment orchestration manager like Kubernetes, perhaps there is a way by which the gateway could check the availability of its downstream services?
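A probe with a configurable retry count and timeout, as described above, could be sketched like this. It is an illustration only: `ProbeOptions`, `withTimeout` and `waitUntilReady` are hypothetical names, and the `check` callback stands in for whatever request you send to a service (for instance the schema query the gateway itself issues).

```typescript
// Hypothetical options; the names are illustrative only.
interface ProbeOptions {
  retries: number;   // extra attempts after the first one
  delayMs: number;   // pause between attempts
  timeoutMs: number; // per-attempt time limit
}

// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Resolves once `check` succeeds; throws after the last attempt fails.
async function waitUntilReady(
  check: () => Promise<void>,
  opts: ProbeOptions,
): Promise<void> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      await withTimeout(check(), opts.timeoutMs);
      return;
    } catch (err) {
      lastError = err;
      if (attempt < opts.retries) {
        await new Promise((resolve) => setTimeout(resolve, opts.delayMs));
      }
    }
  }
  throw new Error(
    `service not ready after ${opts.retries + 1} attempts: ${String(lastError)}`,
  );
}
```

Letting the final error propagate out of the startup path is what makes the process crash instead of initializing with services missing.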
Also, a periodic reloading of the schema would be helpful. Let's take the case where we update a schema in one of our services and redeploy it. The gateway wouldn't know the service's schema had changed, so we must restart the gateway for it to get the changes. This hurts availability by causing downtime while the gateway is redeployed. If the gateway refreshed its schema on a configurable polling interval, that issue could be avoided. We've developed a hack using helm charts to tell the gateway to redeploy when one of the downstream services has been updated, but that still causes downtime.
If any of these ideas seems like it would be a good addition to the project I would be more than happy to make a PR. I truly appreciate the work the Apollo team is doing and I'd love to contribute.
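Refreshing on a configurable polling interval, as proposed above, could be sketched as below. This is a generic illustration rather than a gateway feature: `PollerOptions`, `createSchemaPoller`, `fetchSdl` and `onChange` are hypothetical names, and what `onChange` does with the new schema is up to the caller.

```typescript
// Hypothetical options; none of these names come from the gateway itself.
interface PollerOptions {
  fetchSdl: () => Promise<string>;                // returns the current schema text
  onChange: (sdl: string) => void | Promise<void>; // called only when it changed
  intervalMs: number;
}

function createSchemaPoller({ fetchSdl, onChange, intervalMs }: PollerOptions) {
  let lastSdl: string | undefined;
  let timer: ReturnType<typeof setInterval> | undefined;

  // One polling step; resolves to true if a new schema was handed to onChange.
  async function tick(): Promise<boolean> {
    let sdl: string;
    try {
      sdl = await fetchSdl();
    } catch {
      return false; // service unreachable: keep serving the last known schema
    }
    if (sdl === lastSdl) return false;
    lastSdl = sdl;
    await onChange(sdl);
    return true;
  }

  return {
    tick,
    start(): void {
      if (timer === undefined) {
        // Errors thrown by onChange are swallowed here; real code should log them.
        timer = setInterval(() => { tick().catch(() => undefined); }, intervalMs);
      }
    },
    stop(): void {
      if (timer !== undefined) clearInterval(timer);
      timer = undefined;
    },
  };
}
```

Because a failed fetch leaves the previous schema in place, a redeploying service does not force the gateway itself to restart.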