
storage: undo quota acquisition on failed proposals #17856

Merged
irfansharif merged 1 commit into cockroachdb:master from qpool-err-release on Aug 23, 2017

Conversation

irfansharif
Contributor

This was an oversight introduced in #15802: `undoQuotaAcquisition`
was intended to be called on every proposal error, but we skipped the
very first one. That gap is a relic of the earlier structure, where on
proposal errors the acquired quota was released within
`Replica.propose` itself rather than left to the caller.

Fixes #17826 (hopefully; still being tested), but this should be
corrected either way.

+cc @benesch
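To make the bug shape concrete, here is a minimal, hypothetical Go sketch of the pattern the fix restores (the `quotaPool` type, `propose` function, and callbacks below are stand-ins, not the actual `Replica` code): once quota is acquired, every error return has to undo the acquisition, including the very first proposal error.

```go
package main

import (
	"errors"
	"fmt"
)

// quotaPool is a stand-in for the replica's proposal quota pool.
type quotaPool struct{ avail int64 }

func (q *quotaPool) acquire(n int64) (int64, error) {
	if n > q.avail {
		return 0, errors.New("proposal quota exhausted")
	}
	q.avail -= n
	return n, nil
}

func (q *quotaPool) release(n int64) { q.avail += n }

// propose sketches the control flow this PR fixes: once quota has been
// acquired, every error return must undo the acquisition -- including the
// very first proposal error, which the pre-fix code skipped.
func propose(q *quotaPool, size int64, makeProposal, submit func() error) error {
	acquired, err := q.acquire(size)
	if err != nil {
		return err // nothing acquired yet, nothing to undo
	}
	undoQuotaAcquisition := func() { q.release(acquired) }

	if err := makeProposal(); err != nil {
		undoQuotaAcquisition() // the previously-skipped error path: quota leaked here
		return err
	}
	if err := submit(); err != nil {
		undoQuotaAcquisition()
		return err
	}
	return nil
}

func main() {
	q := &quotaPool{avail: 16 << 20}
	err := propose(q, 2<<20,
		func() error { return errors.New("proposal error") },
		func() error { return nil })
	fmt.Println(err, q.avail == 16<<20) // quota fully restored despite the error
}
```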

@irfansharif irfansharif requested review from tbg and a team August 23, 2017 17:27

@benesch
Contributor

benesch commented Aug 23, 2017

If anyone wants to follow the 2tb restore's progress: http://35.196.229.18:8080/#/cluster/all/replication

@benesch
Contributor

benesch commented Aug 23, 2017

And thanks for tracking this down, @irfansharif! 🎉

@irfansharif
Contributor Author

irfansharif commented Aug 23, 2017

It now spews only

W170823 17:36:37.227820 289758 storage/node_liveness.go:370 [replica consistency checker,n1,s1,r2169/3:/Table/51/1/114{34795…-40181…}] slow heartbeat took 60.0s

which seems unrelated to what it got wedged on last time. Tentatively claiming this resolves #17826.

@benesch: I did not track down exactly why proposals (AddSSTable or otherwise) failed here. Failures may have been just as frequent as during regular operation but only got wedged because of the sizes of the acquired quota; something to look into, perhaps.

Additionally, as discussed in person and briefly on #17826, using the quota pool for AddSSTable adds another layer of rate limiting. This may or may not be something we want, but assuming a configured COCKROACH_RAFT_LOG_MAX_SIZE of 64 MiB, the defaultProposalQuota will be 16 MiB. With 2 MiB SSTs, that allows at most 8 in-flight AddSSTable commands addressed to the same replica (assuming no other ongoing writes; throughput for both regular writes and AddSSTables degrades further if there are). Again, this may or may not be something we want, but the initial quota pool size of maxRaftLogSize / 4 was not tuned with workloads that include AddSSTable commands in mind. A back-of-the-envelope sketch of this arithmetic follows.
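For concreteness, the arithmetic above as a standalone illustration (not CockroachDB code; the 2 MiB SST size is the assumption from this comment):

```go
package main

import "fmt"

func main() {
	const (
		mib            = 1 << 20
		maxRaftLogSize = 64 * mib           // COCKROACH_RAFT_LOG_MAX_SIZE
		proposalQuota  = maxRaftLogSize / 4 // defaultProposalQuota sizing mentioned above
		sstSize        = 2 * mib            // assumed AddSSTable payload size
	)
	fmt.Printf("proposal quota: %d MiB\n", proposalQuota/mib)                     // 16
	fmt.Printf("max in-flight AddSSTable commands: %d\n", proposalQuota/sstSize) // 8
}
```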

@irfansharif irfansharif requested a review from bdarnell August 23, 2017 18:56
@benesch
Contributor

benesch commented Aug 23, 2017

Thanks to stellar work by @tschottdorf, we've absolved the quota pool of any blame for the most recent 2tb failure. Running another 2tb restore now: http://35.185.63.138:8080/#/cluster/all/overview

@bdarnell
Contributor

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved, all commit checks successful.



@irfansharif
Contributor Author

🤞

@irfansharif irfansharif merged commit f652468 into cockroachdb:master Aug 23, 2017
@irfansharif irfansharif deleted the qpool-err-release branch November 18, 2019 18:20