Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[validation] Add "onError" option to allow for custom error handling behavior when performing validation #2074

Merged
merged 3 commits into from
Aug 20, 2019

Conversation

skevy
Copy link
Contributor

@skevy skevy commented Aug 6, 2019

TL;DR: This PR adds an "onError" argument to the validate function that allows for custom error handling when performing validation. A user can use this in many ways -- for instance, they can use this callback to bail early, and prevent performing potentially expensive validation operations hundreds or thousands of times in environments where detailed error messages are less important.

We should fix the root cause of the slowness that I describe below, however -- it's risky to have a utility like this enabled in production, given that it gets slower and slower the more types you have and the more errors you have in your query.

Read on for the story of why I decided to implement this. 😀


At Airbnb, we have a GraphQL server that acts as a gateway to many other GraphQL servers -- the schemas are stitched together from these downstream services to build a single schema that's used by our clients.

Today, in our staging environment, one of the downstream services published a broken schema that caused queries to the gateway to start failing with validation errors. When this happened, we saw the gateway's measured overhead (essentially the time it takes to do anything GraphQL-related that isn't fetching data from downstream services) jump by about 3000%:

image

After getting alerted on a this issue, I captured a trace. This is what I saw:

image

Zooming in a bit:

image

It turned out that the suggestionList utility, which uses the Levenshtein distance algorithm to build a list of suggestions for a given type or argument name, was being called many times as the validation visitor failed to find the types that were lost in the main schema due to the broken schema from the downstream service.

To give an idea of scale -- we have a moderately large GraphQL schema with about ~3000 types, and many of the type names are very long, with an average length of 27 characters.

After some investigation (and building a repro case), it seemed like the options were to either fix the suggestionList algorithm to do something more efficient, or allow bailing out early. This PR is an implementation of the latter.

At Airbnb, we disable introspection on our production GraphQL endpoint and point our internal GraphiQL instance at a production mirror that's only available internally. The idea with this PR is that we will make validation bail early so that we can take steps to make sure the validation time doesn't increase with the number of errors in the query.

@skevy skevy force-pushed the add-validate-bail-option branch 2 times, most recently from 1eadb9f to cea8a25 Compare August 6, 2019 06:10
@IvanGoncharov
Copy link
Member

@skevy Thanks for PR 👍
I think it's definitely something we need to address.

After some investigation (and building a repro case), it seemed like the options were to either fix the suggestionList algorithm to do something more efficient, or allow bailing out early. This PR is an implementation of the latter.

We definitely should make suggestionList algorithm faster since it can be critical even in development time for very big schemas.
At the same, it would always compromise between better DX in development and better performance in runtime.

I also think you uncovered potential DoS vector, e.g.:

{
  usre
  usre
  # ...
}

For very small query size (gzip will optimize all repetition) it will produce a huge number of same error but with a different location.
So it very similar to #1753

At the same time, limiting the number of the possible error to 1 is too restrictive especially for public APIs that need to protect them self against DoS attacks but don't want to sacrifice DX too much.

I think the correct solution would be to add onError option similar to #2062
it will allow implementing more flexible criteria e.g. first 50 errors or under 2s

const validationStart = Date.now();
let errors;

try {
  errors = validate(schema, documentAST, rules, typeInfo, (error, context) => {
    const msSpend = Date.now() - start;
    if (context.getErrors().length > 50 || msSpend > 2000) {
      errors = context.getErrors();
      throw new ValidationBailEarlyError();
    }
  })
} catch(e) {
  if (!(e instanceof ValidationBailEarlyError)) {
    throw e;
  }
}

Also, it can be used in other situations, e.g. if validation is done in a separate thread (e.g. WebWorker) onError can send message to the main thread that will display errors immediately after they are detected.

@skevy What do think? Will onError work for your use case?

@skevy
Copy link
Contributor Author

skevy commented Aug 6, 2019

@IvanGoncharov thanks for the thoughtful answer!

Yes, I’m happy to do a more flexible onError approach.

I think in any comms about this we shouldn’t recommend limiting the time of execution, as this blocks the event loop and as you said is a DoS vector. But certainly allowing more than one error to be thrown seems reasonable.

I’ll update the PR today.

@skevy
Copy link
Contributor Author

skevy commented Aug 6, 2019

Btw, regarding optimizing the suggestionList algorithm, we may want to take a look at an approach like http://stevehanov.ca/blog/index.php?id=114, which uses a prebuilt trie for fast lookups of edit distance.

@skevy skevy force-pushed the add-validate-bail-option branch 3 times, most recently from be3051b to 6632da6 Compare August 6, 2019 18:40
@skevy skevy changed the title [validation] Add "bail" option to allow for bailing early when validating a document [validation] Add "onError" option to allow for custom error handling behavior when performing validation Aug 6, 2019
@skevy
Copy link
Contributor Author

skevy commented Aug 6, 2019

@IvanGoncharov updated the PR accordingly.

@skevy skevy force-pushed the add-validate-bail-option branch from 6632da6 to 4edebb5 Compare August 6, 2019 18:55
@skevy skevy force-pushed the add-validate-bail-option branch from 4edebb5 to 953adaa Compare August 6, 2019 19:11
IvanGoncharov added a commit to IvanGoncharov/graphql-js that referenced this pull request Aug 7, 2019
IvanGoncharov added a commit that referenced this pull request Aug 8, 2019
@IvanGoncharov
Copy link
Member

updated the PR accordingly.

@skevy Thanks 👍 Look good.
One issue: Can you please remove unrelated changes in the order of imports.

Since it's public API I want to take a couple more days to think about alternatives.
Is it ok and do you have any deadlines for this feature?

@IvanGoncharov IvanGoncharov merged commit 4339864 into graphql:master Aug 20, 2019
@IvanGoncharov
Copy link
Member

@skevy Sorry for the delay.
I made some changes on top of your PR.
I also understood that the primary use case is to limit the number of validation errors to a certain number so I just added maxErrors option exactly for that.

@IvanGoncharov
Copy link
Member

@skevy I plan to release 14.5.0 tomorrow including this change.
Big thanks for patience 🙇

abernix added a commit to apollographql/apollo-tooling that referenced this pull request Jan 3, 2020
We'll use the new `onError` callback which was introduced in
`[email protected]` by graphql/graphql-js#2074 only if
the `getErrors` method that existed prior to that addition no longer exists.
trevor-scheer added a commit to apollographql/apollo-tooling that referenced this pull request Jul 7, 2020
This PR makes the minimal number of changes necessary for the apollo-tooling's
repository's tests to pass for the planned [email protected] release.

Notable changes to implementations are as follows:

an adjustment to account for the new onError callback introduced by
graphql/graphql-js#2074, which deprecated the previously used
ValidationContext.prototype.getErrors method.

utilize the GraphQLEnumValueConfig type instead of GraphQLEnumValue
(which lead to the removal of the name field, which isn't actually needed when
building the values for the GraphQLEnumType constructor.

Ref: graphql/graphql-js#2303

Co-authored-by: Trevor Scheer <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR: feature 🚀 requires increase of "minor" version number
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants