resolve major search perf regression in v3.2.0 (add redis cache to git commit OID resolution) #3685
Conversation
Codecov Report
Can you add a benchmark to show that it really helps?
Then we can also find out if the sync.Once logic really needs to be there.
What precisely would it benchmark?
This is not possible without a significant undertaking. In order to verify this, we would need to vet or add benchmarks for every downstream consumer of this method: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/frontend/graphqlbackend/git_commit.go#L86:29&tab=references
To clarify some things in my last comment:
I mean to say that any (currently writeable) Go benchmark would not be sufficient here. Most of the perf harm would be done doing the HTTP requests outlined in the issue description, and for a Go test we would have to mock those out, so it wouldn't be representative of the real world. What would prove it is proper search load testing, but I am not convinced we should block this PR on having that.
While we could vet and benchmark all of these downstream consumers, the time would be better spent refactoring code so that Zoekt hands us the indexed revision as part of search results (this is the more correct long-term fix for this issue). I'll see if I can throw together a manual search benchmark that proves this change is a good one to alleviate your concern, but I'm not convinced we should block this PR on any of the above (if it was your intent to say that). Very open to hearing counter-points here! 😃
Wow, thanks for pushing me to prove out the perf gain on this; I was 100% wrong about how straightforward this change was :) Working on a better cache solution now.
Your argument for this change makes sense to me, but at the same time it's notoriously hard to know in advance what effect a change will have on performance. We can see that by looking at what happened as a result of #2318. I would really like to see some performance numbers before this goes in; otherwise we're flying blind and don't know which performance-oriented code is helping and which isn't. I'm happy to help get those numbers. Ah, just saw your reply. Yep, that's how this stuff is.
@ijt PTAL:
Looking now.
LGTM
cmd/frontend/graphqlbackend/git_commit_resolution_cache_test.go
Can you either flesh out the "Test plan:" bit in the description or remove it? It's a bit confusing.
Filed https://github.com/sourcegraph/sourcegraph/issues/3750 for the long-term fix / tech debt we should tackle here.
Addressed all points; will merge once CI passes.
resolve major search perf regression in v3.2.0 (add redis cache to git commit OID resolution) (#3685)
* refactor OID resolution logic (review with whitespace off)
* add oid resolution cache
* add observability
* CHANGELOG: update
* custom cache implementation
* enterprise/dev/Procfile: fix bug where zoekt is not listening
* remove redundant return stmt
* resolveCommitIODUncached
* faster test
* Update CHANGELOG.md
search: completely resolve indexed search performance regression in v3.2.0 (#4048)

In #3685 I resolved a major indexed search performance regression that was introduced in v3.2.0. For a search returning 100 repositories and 100 results per repository:

1. After the regression: we would do 100*100 == 10,000 Git OID resolutions (which would overload Gitserver and cause searches to time out).
2. After my previous resolution, we no longer did duplicate work and thus only did 100 Git OID resolutions (1 per repository).
   - This was shown in benchmarks and load tests on customer instances to be a substantial improvement.
   - On one customer instance, we've found this still hasn't been enough of an improvement.
   - On another customer instance, we've found we could be doing better.
3. After *this* resolution, we are 100% at the performance level we were at prior to v3.2.0 because we are no longer doing *any* additional Git OID resolutions. Zoekt already has this information and we already acquire it prior to executing searches; we just weren't correctly reusing it.

Should fully resolve an outstanding issue https://app.hubspot.com/contacts/2762526/company/464956351 is facing. Helps https://app.hubspot.com/contacts/2762526/company/407948923

* remove now unnecessary Git OID resolution & caching logic
* CHANGELOG
* fixup reversion
* fixup
* update tests
In Sourcegraph v3.2.0 a major search performance regression was introduced by accident. It affected all instances using indexed search. It went unnoticed because we currently don't have sufficient load testing of search (which is something we are addressing very soon).
The regression was introduced by accident in https://github.com/sourcegraph/sourcegraph/pull/2318 (cc @ijsnow @keegancsmith) due to some complex code interaction. The change assumed that a `sync.Once` associated with the `*gitCommitResolver` was enough to prevent doing too much duplicative work. However, since there is a one-to-one mapping between search results and `gitCommitResolver`s, we regressed significantly on search performance: after that change, for every search result we would have gitserver execute `git symbolic-ref HEAD` (and in some cases execute a second command), with the `sync.Once` only deduplicating calls within a single resolver (i.e. a single search result).

This is all to say: after https://github.com/sourcegraph/sourcegraph/pull/2318 was merged, we would do a substantial amount of work per search result.
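The per-result duplication described above can be illustrated with a minimal sketch. The types and the fake gitserver call below are simplified stand-ins (not the real Sourcegraph code): because each search result gets its own resolver, a per-resolver `sync.Once` deduplicates nothing across results for the same repository.

```go
package main

import (
	"fmt"
	"sync"
)

// Counts how many times we "call gitserver" (a stand-in for the HTTP
// request that runs `git symbolic-ref HEAD`).
var gitserverCalls int

func resolveHEAD(repo string) string {
	gitserverCalls++
	return "refs/heads/master"
}

// Simplified resolver: one is created per search result.
type gitCommitResolver struct {
	repo string
	once sync.Once
	ref  string
}

// OID resolves lazily, but the sync.Once only guards THIS resolver.
func (r *gitCommitResolver) OID() string {
	r.once.Do(func() { r.ref = resolveHEAD(r.repo) })
	return r.ref
}

func main() {
	// 100 results from the same repository -> 100 resolvers -> 100 calls.
	for i := 0; i < 100; i++ {
		r := &gitCommitResolver{repo: "github.com/sourcegraph/sourcegraph"}
		r.OID()
	}
	fmt.Println(gitserverCalls) // prints 100: one call per result, not per repository
}
```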
This PR resolves the issue by putting in place a Redis cache with a TTL of 60s. This makes computing this information a simple cache lookup, which should be substantially faster. The `sync.Once` logic remains because we may still call `OID` repetitively on the same `*gitCommitObject` (I didn't vet all call sites).

Additionally, this PR adds metrics for us to monitor (which should've probably been added earlier as part of #2318).
Test plan:
Manual load testing performed by benchmarking search with vegeta (see "Raw data / details on how tests were run" at the end for the exact steps I performed to test this change).
Comparison
All tests performed with this search query returning only 10 results:
repo:^github\\.com/sourcegraph/sourcegraph$ error count:10 timeout:60s
Initially I tried higher result counts (11,000 and 1,000), but the perf before this change was so bad that ALL requests would time out at just 1 QPS, and vegeta can't send requests slower than that (or I couldn't find the option).
All tests
Before: Lots of warnings:
gitserver | WARN Long exec request, repo: github.com/sourcegraph/sourcegraph, args: [symbolic-ref HEAD], duration: 3.183s
After: No warnings.
Raw data / details on how tests were run
Details
`search.body.json` is the exact GraphQL search query we send when a user does a search request, copied from the Chrome network tab.

1 QPS for 60s
Before:
After:
5 QPS for 60s
Before:
After:
50 QPS for 5s
Before:
After: