storage: very frequent near identical lease acquisition log messages #28947
Comments
Something I want to point out is that you could hope to explain this via Raft reproposals, but these aren't Raft reproposals: the ProposedTimestamp increases. So someone (presumably n2, assuming there are no lease transfers involved) is requesting leases in a hot loop for a bit.
I guess one way this could happen is:
This still seems weird, and I wonder if there's a different way. I did guess that perhaps there was a contended mutex below that line that could've held up requests, but as far as I can tell we don't acquire any other mutexes there. Another possibility is that for some period of time, all lease renewals failed back to the client but actually applied. But this would need similar "cohorting" in that method, because once the first such proposal applies, any future requests see it, and large portions of that code run under the replica mu. Hmm, I'm still scratching my head.
This is really easy to reproduce. Just run TPCC and you'll get these.
Looking at the logs more, it seems that this isn't an isolated incident. It's happening for lots and lots of ranges, and during server startup. (The log starts a minute before the incident).
and that's just a single busy second. The activity extends well beyond this window. The reproposals are spaced at 10-30ms intervals, and it seems it must be driven by something about replica ticks. My first instinct is that this has to do with our cancelling of lease proposals in this code: `cockroach/pkg/storage/replica.go`, lines 4746 to 4756 at `88379ae`.
However, you'd expect that to fire once every RaftElectionTimeoutTicks ticks, which defaults to 15; with a tick (by default) every 200ms, that comes out to once every 3s.
@jordanlewis you'd expect a flurry of them at startup, but are you sure you get these duplicates? I assume you will, as there's no way this is something special you're doing. What TPCC were you running? I know it's three nodes.
@nstewart I'm not sure why you added S-3-erroneous-edge-case here. Can you elaborate? |
@nvanbenschoten could I put this on your plate? Basically just keep an eye out for the lease acquisition log spam when you run tpcc the next time (which I assume you do all the time). There's hopefully nothing to fix, but perhaps we want to cut down on the number of these messages. |
Sure thing, although I probably won't have time to investigate in the next week or so. I'm hoping there's an easy win here.
Sure, this isn't the highest priority. My main concern here is improving the logging UX (well, and making sure there isn't something terrible going on with these leases). |
Resolved by #31644. |
Seen in the logs while debugging #28918.
There are 19 lease proposal messages for n2/r604 in a single second. The messages all have an identical lease, but different proposed timestamps. It's unclear to me why that should happen, so we likely have something to fix. And even in the unlikely case in which this is expected, the log is extremely spammy and needs to be throttled.