Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: limit goroutine creation in intent resolution #5219

Merged
merged 9 commits into from
Mar 14, 2016

Conversation

bdarnell
Copy link
Contributor

The method formerly known as Replica.handleSkippedIntents will no longer try to push transactions for which a PushTxn call is already in flight. Additionally, there is now a hard limit on the number of async tasks that can be created in this area.

Move methods related to intent resolution from Store and Replica to a new intentResolver type. Various cleanups, renamings, and refactorings to support these changes.

Fixes #4925


This change is Review on Reviewable

Rename some methods, clarify comments, and other minor API
updates (remove unused arg and avoid mutating the passed-in error).
An explosion of these calls for the same transaction (or a small number
of transactions) was responsible for cockroachdb#4925

Fixes cockroachdb#4925
This method adds channel-based backpressure to RunAsyncTask, to avoid
spinning up an unbounded number of goroutines.
This is a last line of defense against issues like cockroachdb#4925.
@bdarnell bdarnell added this to the Beta milestone Mar 14, 2016
@tamird
Copy link
Contributor

tamird commented Mar 14, 2016

:lgtm:


Reviewed 3 of 3 files at r1, 5 of 5 files at r2, 3 of 3 files at r3, 2 of 2 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, 2 of 2 files at r7, 1 of 1 files at r8.
Review status: all files reviewed at latest revision, 1 unresolved discussion.


storage/intent_resolver.go, line 43 [r6] (raw file):
these should be dereferenced on the way in - they are pointers in Transaction so that they can be nil before the transaction is initialized.


Comments from the review on Reviewable.io

@tbg
Copy link
Member

tbg commented Mar 14, 2016

Can you add (relatively) prominent logging in the case that the rate limiting or the transaction cache are triggered, at least for now? Is it certified that #4925 is still the intent issue? I haven't had a chance to investigate it after I disarmed the timestamp cache (so in the case I had seen, there were a million synchronous pushes triggered by the GC queue).

:lgtm:


Reviewed 3 of 3 files at r1, 5 of 5 files at r2, 3 of 3 files at r3, 2 of 2 files at r4, 1 of 1 files at r5, 1 of 1 files at r6, 2 of 2 files at r7, 1 of 1 files at r8.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


storage/intent_resolver.go, line 59 [r3] (raw file):
nit: there's something weird about the look of the . (if ....).


storage/intent_resolver.go, line 162 [r3] (raw file):
s/own intents/own external intents/


storage/intent_resolver.go, line 192 [r3] (raw file):
why did this move below the call to resolveIntents?
I think wiErr == nil can be removed (as we decided to drop this case under the rug).


storage/intent_resolver.go, line 156 [r6] (raw file):
If the other goroutine is pushing for timestamp but we need it aborted, that doesn't help us.

Some more comments here on the expected outcomes would be good (i.e. we assume the other guy's push goes through in time for us retrying the read/write).


storage/intent_resolver.go, line 237 [r6] (raw file):
why true? I thought it would make most sense to have false here.


Comments from the review on Reviewable.io

@nvanbenschoten
Copy link
Member

LGTM


Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


storage/intent_resolver.go, line 47 [r8] (raw file):
s/goroutines/goroutines./


storage/intent_resolver.go, line 155 [r8] (raw file):
If there's a lot of contention for this lock, we may want to consider preallocating outside of the lock. We also can populate the resolveIntents slice outside of the lock if we're willing to do a second pass. Again, only needed if this becomes a bottleneck.


Comments from the review on Reviewable.io

@petermattis
Copy link
Collaborator

Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


storage/intent_resolver.go, line 75 [r8] (raw file):
Holy mega-line-length!


Comments from the review on Reviewable.io

@bdarnell
Copy link
Contributor Author

@tschottdorf The immediate problem in #4925 was goroutines created by handleSkippedIntents. There was a root cause that I don't think we've identified, but might have been the timestamp cache. I don't think there's enough info in #4925 to identify any culprit besides this one.

Added logging at V(1) for a push that is skipped because it is in flight (it would be pretty noisy otherwise - it shows up 250 times in a run of make test PKG=./storage) and stdlib log.Printf for tasks blocked by the semaphore (we can't use leveled logging from the stop package).

All other comments addressed.


Review status: 3 of 7 files reviewed at latest revision, 9 unresolved discussions.


storage/intent_resolver.go, line 192 [r3] (raw file):
Previously, resolveWriteIntentError could return a WriteIntentError and intents to resolve, so I changed it so that any intents that get returned should be resolved whether the push succeeded or not. I'm not sure whether it matters any more, though: I've removed the WriteIntentError quirks and I don't think it's possible for intents to make it to this point with the status of their transaction known in advance (and even if they did, we weren't returning them along with any errors but the WriteIntentError; see TODOs above)


storage/intent_resolver.go, line 43 [r6] (raw file):
Good catch; I guess this wasn't actually doing anything before since the pointers would never be the same. I can't figure out a good way to test this part, though (other than manually testing with logging, which I've now done). It's hard to test for the absence of something that, if it happened, would be asynchronous.


storage/intent_resolver.go, line 156 [r6] (raw file):
Yes, that's why the only time we pass skipInFlight is for PUSH_TOUCH (that's the one case where we didn't have an effective means of backpressure before and goroutines could really get out of control). Pushes that are actually trying to modify the transaction are always allowed through. We could do a better job there of determining whether the push is fully covered by some in-flight operation, but then we'd also need to determine whether the intents are the same or whether it's a different intent from the same transaction, and if the latter to synchronize the resolution of that intent with the return from the push. I had originally envisioned reworking this file into an explicit hierarchy of dependencies between transactions to push, intents to resolve, and transactions to GC, but left that for another day.


storage/intent_resolver.go, line 237 [r6] (raw file):
This is the only place where skipInFlight is true, because PUSH_TOUCH are the only types of pushes that are safe to drop. This is for intents that are skipped by an inconsistent read; there are no retries to worry about.


storage/intent_resolver.go, line 75 [r8] (raw file):
Did a pass over the entire diff to wrap at 100.


storage/intent_resolver.go, line 155 [r8] (raw file):
I doubt it'll be a problem. Allocation by itself is cheap; the main cost of an allocation is that each allocation implies eventual work for the GC.


Comments from the review on Reviewable.io

@tamird
Copy link
Contributor

tamird commented Mar 14, 2016

Reviewed 4 of 4 files at r9.
Review status: all files reviewed at latest revision, 8 unresolved discussions.


Comments from the review on Reviewable.io

@tbg
Copy link
Member

tbg commented Mar 14, 2016

:lgtm:


Reviewed 4 of 4 files at r9.
Review status: all files reviewed at latest revision, 5 unresolved discussions.


storage/intent_resolver.go, line 156 [r6] (raw file):
Oh, I misread the code. skipInFlight suggested to me that the in-flight map lookup is skipped, but really it's simply not pushing transactions for which a previous push is in flight. If that's likely to happen to someone else, maybe rename, but with the comment on the method it ought to be fine.


Comments from the review on Reviewable.io

@bdarnell
Copy link
Contributor Author

Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


storage/intent_resolver.go, line 156 [r6] (raw file):
Renamed to skipIfInFlight. I almost spelled it that way originally.


Comments from the review on Reviewable.io

Change map keys from pointers to values.
Add logging.
Wrap long lines and update comments.
@tbg
Copy link
Member

tbg commented Mar 14, 2016

Reviewed 1 of 1 files at r10.
Review status: all files reviewed at latest revision, 5 unresolved discussions.


Comments from the review on Reviewable.io

bdarnell added a commit that referenced this pull request Mar 14, 2016
storage: limit goroutine creation in intent resolution
@bdarnell bdarnell merged commit 5d08c85 into cockroachdb:master Mar 14, 2016
@bdarnell bdarnell deleted the intent-resolver branch March 14, 2016 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants