
resolve major search perf regression in v3.2.0 (add redis cache to git commit OID resolution) #3685

Merged: 10 commits merged into master from sg/oid-cache on May 1, 2019

Conversation

@slimsag (Member) commented May 1, 2019

In Sourcegraph v3.2.0 a major search performance regression was introduced by accident. It affected all instances using indexed search. It went unnoticed because we currently don't have sufficient load testing of search (which is something we are addressing very soon).

The regression was introduced in https://github.com/sourcegraph/sourcegraph/pull/2318 (cc @ijsnow @keegancsmith) by accident, due to a complex code interaction. That change assumed that a sync.Once associated with the *gitCommitResolver was enough to prevent doing too much duplicative work. However, since there is a one-to-one mapping between search results and gitCommitResolvers, search performance regressed significantly: after that change, for every search result we would:

  1. Resolve the default branch of the repo, which requires making an HTTP request to gitserver to ask it to execute git symbolic-ref HEAD (and in some cases executing a second command).
  2. Ask zoekt to list indexed git refs for the repo over HTTP (expensive given it is also wrapped in sync.Once).
  3. Resolve the git ref zoekt says is indexed back to a concrete git OID.

This is all to say: after https://github.com/sourcegraph/sourcegraph/pull/2318 was merged, we would do a substantial amount of work per search result. The toy sketch below illustrates the interaction.
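
Roughly, with illustrative names rather than the actual resolver code: because every search result constructs its own resolver, a per-resolver sync.Once does not deduplicate the expensive work across results.

```go
package main

import (
	"fmt"
	"sync"
)

// gitCommitResolver is a stand-in for the real resolver type: each search
// result gets its own resolver, and therefore its own sync.Once.
type gitCommitResolver struct {
	repo string
	once sync.Once
	oid  string
}

// expensiveResolve stands in for the gitserver + zoekt HTTP work above.
func expensiveResolve(repo string) string {
	fmt.Println("resolving OID for", repo)
	return "deadbeefcafe"
}

// OID is deduplicated only across repeated calls on the *same* resolver.
func (r *gitCommitResolver) OID() string {
	r.once.Do(func() { r.oid = expensiveResolve(r.repo) })
	return r.oid
}

func main() {
	// One resolver per search result: three results means three expensive
	// resolutions, even though they all target the same repository.
	for i := 0; i < 3; i++ {
		r := &gitCommitResolver{repo: "github.com/sourcegraph/sourcegraph"}
		_ = r.OID()
	}
}
```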

This PR resolves the issue by putting in place a Redis cache with a TTL of 60s. This reduces computing this information to a simple cache lookup, which should be substantially faster. The sync.Once logic remains because we may still call OID repeatedly on the same *gitCommitResolver (I didn't vet all call sites). Roughly, the cached path looks like the sketch below.
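
(Identifiers and the cache key format here are illustrative, not the exact code added in this PR.)

```go
// Illustrative sketch only; the real code lives in cmd/frontend/graphqlbackend
// and may differ in detail.
package gitcommit

import "context"

// kvCache is whatever TTL'd key/value store backs the lookup (the PR
// description says Redis with a 60s TTL; see later comments for how the
// implementation evolved).
type kvCache interface {
	Get(key string) (value string, ok bool)
	Set(key, value string)
}

type resolver struct {
	repo     string
	cache    kvCache
	uncached func(ctx context.Context, repo string) (string, error)
}

// oid consults the cache first so that repeated search results for the same
// repository only pay for one gitserver/zoekt round trip per TTL window.
func (r *resolver) oid(ctx context.Context) (string, error) {
	key := "oid-resolution:" + r.repo
	if oid, ok := r.cache.Get(key); ok {
		return oid, nil // cache hit: no HTTP requests at all
	}
	oid, err := r.uncached(ctx, r.repo) // symbolic-ref HEAD, zoekt ref listing, etc.
	if err != nil {
		return "", err
	}
	r.cache.Set(key, oid) // cached for the TTL (60s)
	return oid, nil
}
```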

Additionally, this PR adds metrics for us to monitor (which probably should have been added earlier as part of #2318); a minimal instrumentation sketch follows the list:

  • How much time we spend computing this information.
  • How often we hit / miss the cache.
  • How much time we spend performing cache lookups (in case a Redis cache isn't sufficient and we need a process-local cache in the future).
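
For illustration, instrumentation covering those three bullets could look roughly like this; the metric names and labels here are hypothetical, not necessarily what this PR registers.

```go
// Illustrative Prometheus instrumentation; metric names are hypothetical.
package gitcommit

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// How much time we spend computing this information overall.
	resolutionDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "src_git_commit_oid_resolution_duration_seconds",
		Help: "Time spent resolving a git commit OID (cached or uncached).",
	})
	// How often we hit / miss the cache.
	cacheRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "src_git_commit_oid_cache_requests_total",
		Help: "Cache lookups for git commit OID resolution, by outcome.",
	}, []string{"outcome"}) // "hit" or "miss"
	// How much time the cache lookup itself takes.
	cacheLookupDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "src_git_commit_oid_cache_lookup_duration_seconds",
		Help: "Time spent performing the cache lookup itself.",
	})
)

func init() {
	prometheus.MustRegister(resolutionDuration, cacheRequests, cacheLookupDuration)
}

// observeResolution wraps a cache lookup plus fallback resolution with the
// metrics above.
func observeResolution(lookup func() (string, bool), resolve func() (string, error)) (string, error) {
	start := time.Now()
	defer func() { resolutionDuration.Observe(time.Since(start).Seconds()) }()

	lookupStart := time.Now()
	oid, ok := lookup()
	cacheLookupDuration.Observe(time.Since(lookupStart).Seconds())
	if ok {
		cacheRequests.WithLabelValues("hit").Inc()
		return oid, nil
	}
	cacheRequests.WithLabelValues("miss").Inc()
	return resolve()
}
```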

Test plan:

Manual load testing performed by benchmarking search with vegeta (see "Raw data / details on how tests were run" at the end for the exact steps I performed to test this change).

Comparison

All tests performed with this search query returning only 10 results: repo:^github\.com/sourcegraph/sourcegraph$ error count:10 timeout:60s

Initially I tried higher result counts (11,000 and 1,000), but the perf before this change was so bad that ALL requests would time out at just 1 QPS, and vegeta can't send requests slower than that (or I couldn't find the option).

| QPS | Mean change (before -> after) | P95 change (before -> after) | before: non-200s / total req | after: non-200s / total req |
|-----|-------------------------------|------------------------------|------------------------------|-----------------------------|
| 1   | (-92.98%) 228ms -> 16ms       | (-93.67%) 332ms -> 21ms      | 0/60                         | 0/60                        |
| 5   | (-99.37%) 2,230ms -> 14ms     | (-99.60%) 4,321ms -> 17ms    | 0/300                        | 0/300                       |
| 50  | (-99.85%) 60,000ms -> 92ms    | (-99.92%) 63,543ms -> 52ms   | 166/250 (502 timeouts)       | 0/250                       |

  • A negative percentage above indicates e.g. 92% less time spent after the change.
  • I believe the "before" times in the last row are not very accurate because the high number of timeouts skewed the request duration mean / P95.

All tests

Before: Lots of warnings: gitserver | WARN Long exec request, repo: github.com/sourcegraph/sourcegraph, args: [symbolic-ref HEAD], duration: 3.183s

After: No warnings.

Raw data / details on how tests were run


  • search.body.json is the exact GraphQL search query we send when a user does a search request, copied from the Chrome network tab.

1 QPS for 60s

Before:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=1qps -rate=1 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            60, 1.02
Duration      [total, attack, wait]    59.1882914s, 59.004040464s, 184.250936ms
Latencies     [mean, 50, 95, 99, max]  228.168332ms, 213.124139ms, 332.409ms, 554.912644ms, 575.35818ms
Bytes In      [total, mean]            508206, 8470.10
Bytes Out     [total, mean]            327300, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:60  
Error Set:

After:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=1qps -rate=1 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            60, 1.02
Duration      [total, attack, wait]    59.01640499s, 59.004437167s, 11.967823ms
Latencies     [mean, 50, 95, 99, max]  16.430629ms, 13.723219ms, 20.86999ms, 133.612645ms, 145.057061ms
Bytes In      [total, mean]            508201, 8470.02
Bytes Out     [total, mean]            327300, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:60  
Error Set:

5 QPS for 60s

Before:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=5qps -rate=5 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    1m2.116713558s, 59.803600375s, 2.313113183s
Latencies     [mean, 50, 95, 99, max]  2.231654701s, 2.644444083s, 4.321173379s, 4.960104982s, 5.056616823s
Bytes In      [total, mean]            2541017, 8470.06
Bytes Out     [total, mean]            1636500, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

After:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=5qps -rate=5 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.815219217s, 59.804088472s, 11.130745ms
Latencies     [mean, 50, 95, 99, max]  13.718213ms, 13.242097ms, 16.508382ms, 25.167756ms, 129.288757ms
Bytes In      [total, mean]            2541003, 8470.01
Bytes Out     [total, mean]            1636500, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

50 QPS for 5s

Before:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=50qps -rate=50 -duration=5s | tee results.bin | vegeta report
Requests      [total, rate]            250, 50.18
Duration      [total, attack, wait]    1m7.318174155s, 4.982542856s, 1m2.335631299s
Latencies     [mean, 50, 95, 99, max]  1m0.103715429s, 1m0.957090646s, 1m3.543040632s, 1m4.220043201s, 1m4.250048319s
Bytes In      [total, mean]            736409, 2945.64
Bytes Out     [total, mean]            1363750, 5455.00
Success       [ratio]                  33.60%
Status Codes  [code:count]             200:84  502:166  
Error Set:
502 Bad Gateway

After:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=50qps -rate=50 -duration=5s | tee results.bin | vegeta report
Requests      [total, rate]            250, 50.17
Duration      [total, attack, wait]    4.993111979s, 4.983242598s, 9.869381ms
Latencies     [mean, 50, 95, 99, max]  92.333272ms, 13.51449ms, 52.175179ms, 1.841178441s, 1.869576507s
Bytes In      [total, mean]            2117508, 8470.03
Bytes Out     [total, mean]            1363750, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:250  
Error Set:

codecov bot commented May 1, 2019

Codecov Report

Merging #3685 into master will increase coverage by 0.01%.
The diff coverage is 52.23%.

| Impacted Files | Coverage Δ |
|----------------|------------|
| cmd/frontend/graphqlbackend/git_commit.go | 17.47% <6.66%> (-2.53%) ⬇️ |
| ...tend/graphqlbackend/git_commit_resolution_cache.go | 89.18% <89.18%> (ø) |

@ijt (Contributor) left a comment

Can you add a benchmark to show that it really helps?
Then we can also find out if the sync.Once logic really needs to be there.

@slimsag (Member, Author) commented May 1, 2019

> Can you add a benchmark to show that it really helps?

What precisely would it benchmark?

> Then we can also find out if the sync.Once logic really needs to be there.

This is not possible without a significant undertaking. In order to verify this, we would need to vet or add benchmarks for every downstream consumer of this method: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/frontend/graphqlbackend/git_commit.go#L86:29&tab=references

@slimsag (Member, Author) commented May 1, 2019

To clarify some things in my last comment:

> Can you add a benchmark to show that it really helps?
>
> What precisely would it benchmark?

I mean to say that any (currently writeable) Go benchmark would not be sufficient here. Most of the perf harm comes from the HTTP requests outlined in the issue description, and for a Go test we would have to mock those out, so it wouldn't be representative of the real world. What would prove it is proper search load testing, but I'm not convinced we should block this PR on having that.

> Then we can also find out if the sync.Once logic really needs to be there.

This is not possible without a significant undertaking. In order to verify this, we would need to vet or add benchmarks for every downstream consumer of this method: https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/frontend/graphqlbackend/git_commit.go#L86:29&tab=references

While we could vet and benchmark all of these downstream consumers, the time would be better spent refactoring code so that Zoekt hands us the indexed revision as part of search results (this is the more correct long-term fix for this issue).

I'll see if I can throw together a manual search benchmark to prove this change is a good one and alleviate your concern, but I'm not convinced we should block this PR on any of the above (if it was your intent to say that). Very open to hearing counter-points here! 😃

@slimsag (Member, Author) commented May 1, 2019

Wow, so thanks for pushing me to prove out the perf gain on this; I was 100% wrong about how straightforward this change was :) Turns out that rcache would just cause us to become redis-connection-bound, with `failed to execute redis command, cmd: GET, error: read tcp 127.0.0.1:54758->127.0.0.1:6379: read: connection reset by peer` errors.

Working on a better cache solution now.
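
For reference, the kind of process-local TTL cache floated in the description ("in case a Redis cache isn't sufficient") could look roughly like the sketch below. It is illustrative only, not the cache implementation this PR ships.

```go
// Illustrative process-local TTL cache; not the exact implementation
// committed in this PR.
package oidcache

import (
	"sync"
	"time"
)

type entry struct {
	value     string
	expiresAt time.Time
}

// Cache is a tiny in-memory key/value store with per-entry expiry, so cache
// lookups never leave the process (no Redis round trip per search result).
type Cache struct {
	mu  sync.Mutex
	ttl time.Duration
	m   map[string]entry
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, m: make(map[string]entry)}
}

// Get returns a cached value if it is present and not yet expired.
func (c *Cache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[key]
	if !ok {
		return "", false
	}
	if time.Now().After(e.expiresAt) {
		delete(c.m, key) // lazily evict expired entries
		return "", false
	}
	return e.value, true
}

// Set stores a value for the cache's TTL (60s for OID resolution in this PR).
func (c *Cache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}
```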

@ijt (Contributor) commented May 1, 2019

Your argument for this change makes sense to me, but at the same time it's notoriously hard to know in advance what effect a change will have on performance. We can see that by looking at what happened as a result of #2318.

I would really like to see some performance numbers before this goes in. Otherwise we're flying blind and we don't know which performance-oriented code is helping and which isn't.

I'm happy to help get those numbers. Ah, just saw your reply. Yep. That's how this stuff is.

@slimsag (Member, Author) commented May 1, 2019

@ijt PTAL:

  • Two new commits (& rebased against master to resolve conflict).
  • PR description updated with perf testing results for this new implementation. Quite staggering. Thank you for pushing me to perf test & keeping me honest, I was definitely wrong to not do so! 😸

@ijt (Contributor) commented May 1, 2019

Looking now.

@ijt (Contributor) left a comment

LGTM

Resolved review threads: cmd/frontend/graphqlbackend/git_commit.go (outdated), enterprise/dev/Procfile
@ijt (Contributor) commented May 1, 2019

Can you either flesh out the "Test plan:" bit in the description or remove it? It's a bit confusing.

@slimsag (Member, Author) commented May 1, 2019

Filed https://github.com/sourcegraph/sourcegraph/issues/3750 for the long-term fix / tech debt we should tackle here.

@slimsag (Member, Author) commented May 1, 2019

Addressed all points, will merge once CI passes.

Resolved review thread: CHANGELOG.md (outdated)
@slimsag mentioned this pull request May 1, 2019
@slimsag merged commit 4fbdf0c into master May 1, 2019
@slimsag deleted the sg/oid-cache branch May 1, 2019 20:28
slimsag added a commit that referenced this pull request May 1, 2019
…t commit OID resolution) (#3685)

* refactor OID resolution logic (review with whitespace off)

* add oid resolution cache

* add observability

* CHANGELOG: update

* custom cache implementation

* enterprise/dev/Procfile: fix bug where zoekt is not listening

* remove redundant return stmt

* resolveCommitIODUncached

* faster test

* Update CHANGELOG.md
slimsag added a commit that referenced this pull request May 2, 2019
…t commit OID resolution) (#3685)

ijsnow pushed a commit that referenced this pull request May 7, 2019
…t commit OID resolution) (#3685)

slimsag added a commit that referenced this pull request May 15, 2019
…3.2.0

In #3685 I resolved a major indexed search performance regression that was
introduced in v3.2.0.

For a search returning 100 repositories and 100 results per repository:

1. After the regression: we would do 100*100 == 10,000 Git OID resolutions (which would overload Gitserver and cause searches to time out)

2. After my previous resolution, we would no longer do duplicate work and thus only did 100 Git OID resolutions (1 per repository).
    - This was shown in benchmarks and load tests on customer instances to be a substantial improvement.
    - On one customer instance, we've found this still hasn't been enough of an improvement.
    - On another customer instance, we've found we could be doing better.

3. After *this* resolution, we are 100% at the performance level we were at prior to v3.2.0 because we are no longer doing *any* additional Git OID resolutions. Zoekt already has this information and we already acquire this information prior to executing searches; we just weren't correctly reusing this information.

Should fully resolve an outstanding issue https://app.hubspot.com/contacts/2762526/company/464956351 is facing.
Helps https://app.hubspot.com/contacts/2762526/company/407948923
slimsag added a commit that referenced this pull request May 16, 2019
…4048)

* search: completely resolve indexed search performance regression in v3.2.0

In #3685 I resolved a major indexed search performance regression that was
introduced in v3.2.0.

For a search returning 100 repositories and 100 results per repository:

1. After the regression: we would do 100*100 == 10,000 Git OID resolutions (which would overload Gitserver and cause searches to time out)

2. After my previous resolution, we would no longer do duplicate work and thus only did 100 Git OID resolutions (1 per repository).
    - This was shown in benchmarks and load tests on customer instances to be a substantial improvement.
    - On one customer instance, we've found this still hasn't been enough of an improvement.
    - On another customer instance, we've found we could be doing better.

3. After *this* resolution, we are 100% at the performance level we were at prior to v3.2.0 because we are no longer doing *any* additional Git OID resolutions. Zoekt already has this information and we already acquire this information prior to executing searches; we just weren't correctly reusing this information.

Should fully resolve an outstanding issue https://app.hubspot.com/contacts/2762526/company/464956351 is facing.
Helps https://app.hubspot.com/contacts/2762526/company/407948923

* remove now unnecessary Git OID resolution & caching logic

* CHANGELOG

* fixup reversion

* fixup

* update tests
slimsag added a commit that referenced this pull request May 16, 2019
…4048)

slimsag added a commit that referenced this pull request May 20, 2019
…4048)

slimsag added a commit that referenced this pull request May 20, 2019
…4048)
