
resolve major search perf regression in v3.2.0 (add redis cache to git commit OID resolution) #3685

Merged: 10 commits merged into master from sg/oid-cache on May 1, 2019

Conversation

@slimsag (Member) commented May 1, 2019

In Sourcegraph v3.2.0 a major search performance regression was introduced by accident. It affected all instances using indexed search. It went unnoticed because we currently don't have sufficient load testing of search (which is something we are addressing very soon).

The regression was introduced in https://github.com/sourcegraph/sourcegraph/pull/2318 (cc @ijsnow @keegancsmith) by accident, due to a complex code interaction. That change assumed that a sync.Once associated with the *gitCommitResolver was enough to prevent doing too much duplicative work. However, since there is a one-to-one mapping between search results and gitCommitResolvers, search performance regressed significantly: after that change, for every search result we would:

  1. Resolve the default branch of the repo, which requires making an HTTP request to gitserver to ask it to execute git symbolic-ref HEAD (and in some cases executing a second command).
  2. Ask zoekt to list indexed git refs for the repo over HTTP (expensive given it is also wrapped in sync.Once).
  3. Resolve the git ref zoekt says is indexed back to a concrete git OID.

This is all to say: after https://github.com/sourcegraph/sourcegraph/pull/2318 was merged, we would do a substantial amount of work per search result. The toy sketch below illustrates the interaction.
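
Roughly, with illustrative names rather than the actual resolver code: because every search result constructs its own resolver, a per-resolver sync.Once does not deduplicate the expensive work across results.

```go
package main

import (
	"fmt"
	"sync"
)

// gitCommitResolver is a stand-in for the real resolver type: each search
// result gets its own resolver, and therefore its own sync.Once.
type gitCommitResolver struct {
	repo string
	once sync.Once
	oid  string
}

// expensiveResolve stands in for the gitserver + zoekt HTTP work above.
func expensiveResolve(repo string) string {
	fmt.Println("resolving OID for", repo)
	return "deadbeefcafe"
}

// OID is deduplicated only across repeated calls on the *same* resolver.
func (r *gitCommitResolver) OID() string {
	r.once.Do(func() { r.oid = expensiveResolve(r.repo) })
	return r.oid
}

func main() {
	// One resolver per search result: three results means three expensive
	// resolutions, even though they all target the same repository.
	for i := 0; i < 3; i++ {
		r := &gitCommitResolver{repo: "github.com/sourcegraph/sourcegraph"}
		_ = r.OID()
	}
}
```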

This PR resolves the issue by putting in place a Redis cache with a TTL of 60s. This reduces computing this information to a simple cache lookup, which should be substantially faster. The sync.Once logic remains because we may still call OID repeatedly on the same *gitCommitResolver (I didn't vet all call sites). Roughly, the cached path looks like the sketch below.
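
(Identifiers and the cache key format here are illustrative, not the exact code added in this PR.)

```go
// Illustrative sketch only; the real code lives in cmd/frontend/graphqlbackend
// and may differ in detail.
package gitcommit

import "context"

// kvCache is whatever TTL'd key/value store backs the lookup (the PR
// description says Redis with a 60s TTL; see later comments for how the
// implementation evolved).
type kvCache interface {
	Get(key string) (value string, ok bool)
	Set(key, value string)
}

type resolver struct {
	repo     string
	cache    kvCache
	uncached func(ctx context.Context, repo string) (string, error)
}

// oid consults the cache first so that repeated search results for the same
// repository only pay for one gitserver/zoekt round trip per TTL window.
func (r *resolver) oid(ctx context.Context) (string, error) {
	key := "oid-resolution:" + r.repo
	if oid, ok := r.cache.Get(key); ok {
		return oid, nil // cache hit: no HTTP requests at all
	}
	oid, err := r.uncached(ctx, r.repo) // symbolic-ref HEAD, zoekt ref listing, etc.
	if err != nil {
		return "", err
	}
	r.cache.Set(key, oid) // cached for the TTL (60s)
	return oid, nil
}
```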

Additionally, this PR adds metrics for us to monitor (which probably should have been added earlier as part of #2318); a minimal instrumentation sketch follows the list:

  • How much time we spend computing this information.
  • How often we hit / miss the cache.
  • How much time we spend performing cache lookups (in case a Redis cache isn't sufficient and we need a process-local cache in the future).
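
For illustration, instrumentation covering those three bullets could look roughly like this; the metric names and labels here are hypothetical, not necessarily what this PR registers.

```go
// Illustrative Prometheus instrumentation; metric names are hypothetical.
package gitcommit

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// How much time we spend computing this information overall.
	resolutionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "src_git_commit_oid_resolution_duration_seconds",
		Help: "Time spent resolving a git commit OID (cached or uncached).",
	})
	// How often we hit / miss the cache.
	cacheRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "src_git_commit_oid_cache_requests_total",
		Help: "Cache lookups for git commit OID resolution, by outcome.",
	}, []string{"outcome"}) // "hit" or "miss"
	// How much time the cache lookup itself takes.
	cacheLookupDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "src_git_commit_oid_cache_lookup_duration_seconds",
		Help: "Time spent performing the cache lookup itself.",
	})
)

func init() {
	prometheus.MustRegister(resolutionDuration, cacheRequests, cacheLookupDuration)
}

// observeResolution wraps a cache lookup plus fallback resolution with the
// metrics above.
func observeResolution(lookup func() (string, bool), resolve func() (string, error)) (string, error) {
	start := time.Now()
	defer func() { resolutionDuration.Observe(time.Since(start).Seconds()) }()

	lookupStart := time.Now()
	oid, ok := lookup()
	cacheLookupDuration.Observe(time.Since(lookupStart).Seconds())
	if ok {
		cacheRequests.WithLabelValues("hit").Inc()
		return oid, nil
	}
	cacheRequests.WithLabelValues("miss").Inc()
	return resolve()
}
```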

Test plan:

Manual load testing performed by benchmarking search with vegeta (see "Raw data / details on how tests were run" at the end for the exact steps I performed to test this change).

Comparison

All tests performed with this search query returning only 10 results: repo:^github\.com/sourcegraph/sourcegraph$ error count:10 timeout:60s

Initially I tried higher result counts (11,000 and 1,000), but the perf before this change was so bad that ALL requests would time out at just 1 QPS, and vegeta can't send requests slower than that (or I couldn't find the option).

| QPS | Mean change (before -> after) | P95 change (before -> after) | before: non-200s / total req | after: non-200s / total req |
|-----|-------------------------------|------------------------------|------------------------------|-----------------------------|
| 1   | (-92.98%) 228ms -> 16ms       | (-93.67%) 332ms -> 21ms      | 0/60                         | 0/60                        |
| 5   | (-99.37%) 2,230ms -> 14ms     | (-99.60%) 4,321ms -> 17ms    | 0/300                        | 0/300                       |
| 50  | (-99.85%) 60,000ms -> 92ms    | (-99.92%) 63,543ms -> 52ms   | 166/250 (502 timeouts)       | 0/250                       |

  • A negative percentage above indicates e.g. 92% less time spent after the change.
  • I believe the "before" times in the last row are not very accurate because the high number of timeouts skewed the request duration mean / P95.

All tests

Before: Lots of warnings: gitserver | WARN Long exec request, repo: github.com/sourcegraph/sourcegraph, args: [symbolic-ref HEAD], duration: 3.183s

After: No warnings.

Raw data / details on how tests were run


  • search.body.json is the exact GraphQL search query we send when a user does a search request, copied from the Chrome network tab.

1 QPS for 60s

Before:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=1qps -rate=1 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            60, 1.02
Duration      [total, attack, wait]    59.1882914s, 59.004040464s, 184.250936ms
Latencies     [mean, 50, 95, 99, max]  228.168332ms, 213.124139ms, 332.409ms, 554.912644ms, 575.35818ms
Bytes In      [total, mean]            508206, 8470.10
Bytes Out     [total, mean]            327300, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:60  
Error Set:

After:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=1qps -rate=1 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            60, 1.02
Duration      [total, attack, wait]    59.01640499s, 59.004437167s, 11.967823ms
Latencies     [mean, 50, 95, 99, max]  16.430629ms, 13.723219ms, 20.86999ms, 133.612645ms, 145.057061ms
Bytes In      [total, mean]            508201, 8470.02
Bytes Out     [total, mean]            327300, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:60  
Error Set:

5 QPS for 60s

Before:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=5qps -rate=5 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    1m2.116713558s, 59.803600375s, 2.313113183s
Latencies     [mean, 50, 95, 99, max]  2.231654701s, 2.644444083s, 4.321173379s, 4.960104982s, 5.056616823s
Bytes In      [total, mean]            2541017, 8470.06
Bytes Out     [total, mean]            1636500, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

After:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=5qps -rate=5 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.815219217s, 59.804088472s, 11.130745ms
Latencies     [mean, 50, 95, 99, max]  13.718213ms, 13.242097ms, 16.508382ms, 25.167756ms, 129.288757ms
Bytes In      [total, mean]            2541003, 8470.01
Bytes Out     [total, mean]            1636500, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

50 QPS for 5s

Before:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=50qps -rate=50 -duration=5s | tee results.bin | vegeta report
Requests      [total, rate]            250, 50.18
Duration      [total, attack, wait]    1m7.318174155s, 4.982542856s, 1m2.335631299s
Latencies     [mean, 50, 95, 99, max]  1m0.103715429s, 1m0.957090646s, 1m3.543040632s, 1m4.220043201s, 1m4.250048319s
Bytes In      [total, mean]            736409, 2945.64
Bytes Out     [total, mean]            1363750, 5455.00
Success       [ratio]                  33.60%
Status Codes  [code:count]             200:84  502:166  
Error Set:
502 Bad Gateway

After:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=50qps -rate=50 -duration=5s | tee results.bin | vegeta report
Requests      [total, rate]            250, 50.17
Duration      [total, attack, wait]    4.993111979s, 4.983242598s, 9.869381ms
Latencies     [mean, 50, 95, 99, max]  92.333272ms, 13.51449ms, 52.175179ms, 1.841178441s, 1.869576507s
Bytes In      [total, mean]            2117508, 8470.03
Bytes Out     [total, mean]            1363750, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:250  
Error Set:

codecov bot commented May 1, 2019

Codecov Report

Merging #3685 into master will increase coverage by 0.01%.
The diff coverage is 52.23%.

| Impacted Files | Coverage Δ |
|----------------|------------|
| cmd/frontend/graphqlbackend/git_commit.go | 17.47% <6.66%> (-2.53%) ⬇️ |
| ...tend/graphqlbackend/git_commit_resolution_cache.go | 89.18% <89.18%> (ø) |

@ijt (Contributor) left a comment

Can you add a benchmark to show that it really helps?
Then we can also find out if the sync.Once logic really needs to be there.

@slimsag (Member, Author) commented May 1, 2019

> Can you add a benchmark to show that it really helps?

What precisely would it benchmark?

> Then we can also find out if the sync.Once logic really needs to be there.

This is not possible without a significant undertaking. In order to verify this, we would need to vet or add benchmarks for every downstream consumer of this method: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/frontend/graphqlbackend/git_commit.go#L86:29&tab=references

@slimsag (Member, Author) commented May 1, 2019

To clarify some things in my last comment:

> Can you add a benchmark to show that it really helps?
>
> What precisely would it benchmark?

I mean to say that any (currently writeable) Go benchmark would not be sufficient here. Most of the perf harm comes from the HTTP requests outlined in the issue description, and for a Go test we would have to mock those out, so it wouldn't be representative of the real world. What would prove it is proper search load testing, but I'm not convinced we should block this PR on having that.

> Then we can also find out if the sync.Once logic really needs to be there.

This is not possible without a significant undertaking. In order to verify this, we would need to vet or add benchmarks for every downstream consumer of this method: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/frontend/graphqlbackend/git_commit.go#L86:29&tab=references

While we could vet and benchmark all of these downstream consumers, the time would be better spent refactoring code so that Zoekt hands us the indexed revision as part of search results (this is the more correct long-term fix for this issue).

I'll see if I can throw together a manual search benchmark to prove this change is a good one and alleviate your concern, but I'm not convinced we should block this PR on any of the above (if it was your intent to say that). Very open to hearing counter-points here! 😃

@slimsag (Member, Author) commented May 1, 2019

Wow, so thanks for pushing me to prove out the perf gain on this; I was 100% wrong about how straightforward this change was :) Turns out that rcache would just cause us to become redis-connection-bound, with `failed to execute redis command, cmd: GET, error: read tcp 127.0.0.1:54758->127.0.0.1:6379: read: connection reset by peer` errors.

Working on a better cache solution now.
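
For reference, the kind of process-local TTL cache floated in the description ("in case a Redis cache isn't sufficient") could look roughly like the sketch below. It is illustrative only, not the cache implementation this PR ships.

```go
// Illustrative process-local TTL cache; not the exact implementation
// committed in this PR.
package oidcache

import (
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresAt time.Time
}

// Cache is a tiny in-memory key/value store with per-entry expiry, so cache
// lookups never leave the process (no Redis round trip per search result).
type Cache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, m: make(map[string]entry)}
}

// Get returns a cached value if it is present and not yet expired.
func (c *Cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok {
		return "", false
	}
	if time.Now().After(e.expiresAt) {
		delete(c.m, key) // lazily evict expired entries
		return "", false
	}
	return e.value, true
}

// Set stores a value for the cache's TTL (60s for OID resolution in this PR).
func (c *Cache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}
```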

@ijt (Contributor) commented May 1, 2019

Your argument for this change makes sense to me, but at the same time it's notoriously hard to know in advance what effect a change will have on performance. We can see that by looking at what happened as a result of #2318.

I would really like to see some performance numbers before this goes in. Otherwise we're flying blind and we don't know which performance-oriented code is helping and which isn't.

I'm happy to help get those numbers. Ah, just saw your reply. Yep. That's how this stuff is.

@slimsag (Member, Author) commented May 1, 2019

@ijt PTAL:

  • Two new commits (& rebased against master to resolve conflict).
  • PR description updated with perf testing results for this new implementation. Quite staggering. Thank you for pushing me to perf test & keeping me honest, I was definitely wrong to not do so! 😸

@ijt (Contributor) commented May 1, 2019

Looking now.

@ijt (Contributor) left a comment

LGTM

Resolved review threads: cmd/frontend/graphqlbackend/git_commit.go (outdated), enterprise/dev/Procfile
@ijt (Contributor) commented May 1, 2019

Can you either flesh out the "Test plan:" bit in the description or remove it? It's a bit confusing.

@slimsag (Member, Author) commented May 1, 2019

Filed https://github.com/sourcegraph/sourcegraph/issues/3750 for the long-term fix / tech debt we should tackle here.

@slimsag (Member, Author) commented May 1, 2019

Addressed all points, will merge once CI passes.

Resolved review thread: CHANGELOG.md (outdated)
@slimsag mentioned this pull request May 1, 2019
@slimsag merged commit 4fbdf0c into master May 1, 2019
@slimsag deleted the sg/oid-cache branch May 1, 2019 20:28
slimsag added a commit that referenced this pull request May 1, 2019
…t commit OID resolution) (#3685)

* refactor OID resolution logic (review with whitespace off)

* add oid resolution cache

* add observability

* CHANGELOG: update

* custom cache implementation

* enterprise/dev/Procfile: fix bug where zoekt is not listening

* remove redundant return stmt

* resolveCommitIODUncached

* faster test

* Update CHANGELOG.md
slimsag added a commit that referenced this pull request May 2, 2019
…t commit OID resolution) (#3685)

ijsnow pushed a commit that referenced this pull request May 7, 2019
…t commit OID resolution) (#3685)

slimsag added a commit that referenced this pull request May 15, 2019
…3.2.0

In #3685 I resolved a major indexed search performance regression that was
introduced in v3.2.0.

For a search returning 100 repositories and 100 results per repository:

1. After the regression: we would do 100*100 == 10,000 Git OID resolutions (which would overload Gitserver and cause searches to time out)

2. After my previous resolution, we would no longer do duplicate work and thus only did 100 Git OID resolutions (1 per repository).
    - This was shown in benchmarks and load tests on customer instances to be a substantial improvement.
    - On one customer instance, we've found this still hasn't been enough of an improvement.
    - On another customer instance, we've found we could be doing better.

3. After *this* resolution, we are 100% at the performance level we were at prior to v3.2.0 because we are no longer doing *any* additional Git OID resolutions. Zoekt already has this information and we already acquire this information prior to executing searches; we just weren't correctly reusing this information.

Should fully resolve an outstanding issue https://app.hubspot.com/contacts/2762526/company/464956351 is facing.
Helps https://app.hubspot.com/contacts/2762526/company/407948923
slimsag added a commit that referenced this pull request May 16, 2019
…4048)

* search: completely resolve indexed search performance regression in v3.2.0

In #3685 I resolved a major indexed search performance regression that was
introduced in v3.2.0.

For a search returning 100 repositories and 100 results per repository:

1. After the regression: we would do 100*100 == 10,000 Git OID resolutions (which would overload Gitserver and cause searches to time out)

2. After my previous resolution, we would no longer do duplicate work and thus only did 100 Git OID resolutions (1 per repository).
    - This was shown in benchmarks and load tests on customer instances to be a substantial improvement.
    - On one customer instance, we've found this still hasn't been enough of an improvement.
    - On another customer instance, we've found we could be doing better.

3. After *this* resolution, we are 100% at the performance level we were at prior to v3.2.0 because we are no longer doing *any* additional Git OID resolutions. Zoekt already has this information and we already acquire this information prior to executing searches; we just weren't correctly reusing this information.

Should fully resolve an outstanding issue https://app.hubspot.com/contacts/2762526/company/464956351 is facing.
Helps https://app.hubspot.com/contacts/2762526/company/407948923

* remove now unnecessary Git OID resolution & caching logic

* CHANGELOG

* fixup reversion

* fixup

* update tests
slimsag added a commit that referenced this pull request May 16, 2019
…4048)

slimsag added a commit that referenced this pull request May 20, 2019
…4048)

slimsag added a commit that referenced this pull request May 20, 2019
…4048)
