GetOrGenerateKeys locking leads to significant service degradation with high request saturation #3863
Comments
Nice find! SGTM
Since Hydra can scale horizontally, maybe it makes sense to use a distributed lock in this situation? It seems like optimistic locking might fit this scenario best, considering there are many readers but few writers. The algorithm I'm thinking of for generating a keyset is:
To de-duplicate this operation within a single Hydra instance, it may make sense to utilize something like singleflight (https://pkg.go.dev/golang.org/x/sync/singleflight). Additionally, I could see this being utilized to ease implementing automatic time-based key rolling. Nice find btw!
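For illustration, a minimal sketch of how singleflight could coalesce concurrent lookups within one instance; the wrapper name getKeySetDeduplicated and the loadKeySet callback are assumptions, not Hydra's API:

```go
package jwkcache

import (
	"context"

	"golang.org/x/sync/singleflight"
)

// group collapses concurrent lookups for the same key set into one call.
var group singleflight.Group

// getKeySetDeduplicated runs loadKeySet (a stand-in for the database fetch or
// key generation) at most once per key set at a time within this process;
// concurrent callers for the same set block and share the single result.
func getKeySetDeduplicated(ctx context.Context, set string, loadKeySet func(context.Context, string) (any, error)) (any, error) {
	v, err, _ := group.Do(set, func() (any, error) {
		return loadKeySet(ctx, set)
	})
	return v, err
}
```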
I like the idea, but it’s not needed. The lock is just there to check the local state (are we initialized or not). It could probably be done with an atomic int.
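As a rough sketch of that idea (assumed types and names, not Hydra's current code), the lock-guarded "are we initialized" check could become an atomic flag, with sync.Once guarding the one-time setup itself:

```go
package jwkcache

import (
	"sync"
	"sync/atomic"
)

type localState struct {
	initialized atomic.Bool // set once the in-memory keyset has been populated
	initOnce    sync.Once
}

// isInitialized replaces a lock-guarded boolean check with a single atomic load.
func (s *localState) isInitialized() bool {
	return s.initialized.Load()
}

// initialize runs initFn at most once per process even under concurrent calls;
// afterwards the fast-path check above is lock-free.
func (s *localState) initialize(initFn func()) {
	s.initOnce.Do(func() {
		initFn()
		s.initialized.Store(true)
	})
}
```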
That’s fair. It seems like there's a small race condition for initialization with two or more Hydra instances, but I think everything would continue to work even if it were triggered. The only real impact is that extra keys would be generated and potentially used for signing.
Preflight checklist
Ory Network Project
No response
Describe the bug
When operating Hydra in production, it has been observed through tracing that GetOrGenerateKeys is called routinely for the Authorization, Token, and WellKnown endpoints. That function leverages two RWMutexes to guard the keysets stored in memory. As identified in #3662, there can be problems when the tail latency of the database query to retrieve JWKs is above a certain threshold under highly concurrent conditions.
Additionally, when considering the locking logic for GetOrGenerateKeys, there may be an opportunity to leverage RLock() when reading from the database and acquire the write lock only when generating, to enable more throughput for this endpoint.
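For context, a simplified sketch of the contention pattern described above; the keyCache type and its fetch/generate stand-ins are hypothetical, not Hydra's actual code. When the exclusive lock is held across the database fetch, concurrent requests serialize behind the slow query.

```go
package jwkcache

import "sync"

type keyCache struct {
	mu       sync.RWMutex
	fetch    func(set string) ([]byte, bool) // stand-in for the JWK database query (>= 25ms tail latency)
	generate func(set string) []byte         // stand-in for key generation
}

// getOrGenerate mirrors the locking shape described in the report: the write
// lock is taken before the potentially slow fetch, so N concurrent callers can
// wait roughly N times the query latency in the worst case.
func (c *keyCache) getOrGenerate(set string) []byte {
	c.mu.Lock()
	defer c.mu.Unlock()

	if keys, ok := c.fetch(set); ok {
		return keys
	}
	return c.generate(set)
}
```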
Reproducing the bug
To recreate the issue locally:
Strategy
Run hydra with an artificially set latency for the persister_jwk's GetKeySet that mimics real-world conditions (>= 25ms), then load test the /.well-known/openid-configuration endpoint.
Steps
1. Install bombardier: go install github.com/codesenberg/bombardier@latest
2. Start the database: docker compose -f quickstart.yml -f quickstart-cockroach.yml up -d cockroachd
3. Run the migrations: docker compose -f quickstart.yml -f quickstart-cockroach.yml run hydra-migrate
4. Add an artificial delay to GetKeySet in persister_jwk.go: time.Sleep(25 * time.Millisecond)
5. Start hydra: DSN='cockroach://root@localhost:26257/defaultdb?sslmode=disable&max_conns=20&max_idle_conns=4' go run . serve all --dev -c contrib/quickstart/5-min/hydra.yml
6. Load test the well-known endpoint with 270-300 concurrent connections: bombardier -d 30s -t 30s -a -c 270 -l http://localhost:4444/.well-known/openid-configuration
Expected Outcome
Request saturation would result in moderate service degradation. Here's a test with only 100 concurrent connections:
Actual Results
If the logarithmic growth in request handler timing reaches 10s, request latency degrades significantly.
Potential Fix
If GetOrGenerateKeys were to leverage a read lock and acquire a write lock only when it needs to generate, this request saturation would not have the same effect, but this may not provide the desired synchronization guarantees intended by the authors.
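A minimal sketch of that read-lock-first idea, reusing the hypothetical keyCache type from the earlier sketch; this is not a definitive implementation, it does not coordinate multiple Hydra instances, and it may not preserve all of the synchronization guarantees the current code intends.

```go
// (continues the jwkcache sketch above)

// getOrGenerateReadMostly lets concurrent requests fetch under the shared read
// lock and takes the exclusive lock only when generation may be required.
func (c *keyCache) getOrGenerateReadMostly(set string) []byte {
	// Fast path: many requests can hold RLock at once, so the slow database
	// fetch no longer serializes the endpoint.
	c.mu.RLock()
	keys, ok := c.fetch(set)
	c.mu.RUnlock()
	if ok {
		return keys
	}

	// Slow path: exclusive lock only for the (rare) generation case.
	c.mu.Lock()
	defer c.mu.Unlock()

	// Re-check: another goroutine may have generated and stored the set while
	// we were waiting for the write lock.
	if keys, ok := c.fetch(set); ok {
		return keys
	}
	return c.generate(set)
}
```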
Relevant log output
No response
Relevant configuration
No response
Version
v2.2.0
On which operating system are you observing this issue?
Linux
In which environment are you deploying?
Kubernetes
Additional Context
This was observed in a production Kubernetes deployment where 4 replicas of the hydra pod were receiving thousands of requests per second, which led to a significant increase in latency. Upon examination of the traces, the large gap between spans before persistence.sql.GetKeySet was found to be a signature of the issue.