concurrency: produce tracing metadata #55583

Closed
tbg opened this issue Oct 15, 2020 · 5 comments
Labels: A-kv-observability, A-tracing (Relating to tracing in CockroachDB), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)

Comments

@tbg
Member

tbg commented Oct 15, 2020

Is your feature request related to a problem? Please describe.
In the context of always-on tracing, we want to emit structured metadata describing contention events. Among other consumers, #55243 will use these events for inclusion in the statements page, and they will also figure prominently in SQL session tracing.

Describe the solution you'd like
Introduce a protobuf that is emitted whenever a transaction conflicts with another transaction during replica evaluation. The guiding principle here should be that, given the information in the protobuf, the SQL layer can construct an explanation of the conflict that is instructive to the average app developer. This definitely needs to include the following (a rough sketch follows below):

  1. the key at which the conflict occurred
  2. when the conflict was first detected
  3. the conflicting transaction (after resolution)

NB: since the metadata is emitted right when the conflict ends, the duration of the conflict is implicitly known.
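
As a non-authoritative sketch of what such a payload could carry (field names and types here are illustrative only, not a proposed definition):

// Illustrative only: one possible shape for the contention payload,
// mirroring the three pieces of information listed above.
type ContentionEvent struct {
	// Key is the key at which the conflict occurred.
	Key roachpb.Key
	// FirstDetected is when the conflict was first detected; since the
	// event is emitted right when the conflict ends, the duration is
	// implied by the emission time.
	FirstDetected time.Time
	// ConflictingTxn is the conflicting transaction, after resolution.
	ConflictingTxn roachpb.Transaction
}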

The action will take place in the lock table waiter, which manages a single conflict here:

select {
case <-newStateC:
	timerC = nil
	state := guard.CurState()
	switch state.kind {
	case waitFor, waitForDistinguished:
		if req.WaitPolicy == lock.WaitPolicy_Error {
			// If the waiter has an Error wait policy, resolve the conflict
			// immediately without waiting. If the conflict is a lock then
			// push the lock holder's transaction using a PUSH_TOUCH to
			// determine whether the lock is abandoned or whether its holder
			// is still active. If the conflict is a reservation holder,
			// raise an error immediately; we know the reservation holder is
			// active.
			if state.held {
				err = w.pushLockTxn(ctx, req, state)
			} else {
				err = newWriteIntentErr(state)
			}
			if err != nil {
				return err
			}
			continue
		}
		// waitFor indicates that the request is waiting on another
		// transaction. This transaction may be the lock holder of a
		// conflicting lock or the head of a lock-wait queue that the
		// request is a part of.
		//
		// waitForDistinguished is like waitFor, except it instructs the
		// waiter to quickly push the conflicting transaction after a short
		// liveness push delay instead of waiting out the full deadlock
		// detection push delay. The lockTable guarantees that there is
		// always at least one request in the waitForDistinguished state for
		// each lock that has any waiters.
		//
		// The purpose of the waitForDistinguished state is to avoid waiting
		// out the longer deadlock detection delay before recognizing and
		// recovering from the failure of a transaction coordinator for
		// *each* of that transaction's previously written intents.
		livenessPush := state.kind == waitForDistinguished
		deadlockPush := true
		// If the conflict is a reservation holder and not a held lock then
		// there's no need to perform a liveness push - the request must be
		// alive or its context would have been canceled and it would have
		// exited its lock wait-queues.
		if !state.held {
			livenessPush = false
		}
		// For non-transactional requests, there's no need to perform
		// deadlock detection because a non-transactional request can
		// not be part of a dependency cycle. Non-transactional requests
		// cannot hold locks or reservations.
		if req.Txn == nil {
			deadlockPush = false
		}
		// If the request doesn't want to perform a push for either
		// reason, continue waiting.
		if !livenessPush && !deadlockPush {
			continue
		}
		// If we know that a lock holder is already finalized (COMMITTED
		// or ABORTED), there's no reason to push it again. Instead, we
		// can skip directly to intent resolution.
		//
		// As an optimization, we defer the intent resolution until we're
		// done waiting on all conflicting locks in this function. This
		// allows us to accumulate a group of intents to resolve and send
		// them together as a batch.
		//
		// Remember that if the lock is held, there will be at least one
		// waiter with livenessPush = true (the distinguished waiter),
		// so at least one request will enter this branch and perform
		// the cleanup on behalf of all other waiters.
		if livenessPush {
			if pusheeTxn, ok := w.finalizedTxnCache.get(state.txn.ID); ok {
				resolve := roachpb.MakeLockUpdate(pusheeTxn, roachpb.Span{Key: state.key})
				deferredResolution = append(deferredResolution, resolve)
				// Inform the LockManager that the lock has been updated with a
				// finalized status so that it gets removed from the lockTable
				// and we are allowed to proceed.
				//
				// For unreplicated locks, this is all that is needed - the
				// lockTable is the source of truth so, once removed, the
				// unreplicated lock is gone. It is perfectly valid for us to
				// instruct the lock to be released because we know that the
				// lock's owner is finalized.
				//
				// For replicated locks, this is a bit of a lie. The lock hasn't
				// actually been updated yet, but we will be conducting intent
				// resolution in the future (before we observe the corresponding
				// MVCC state). This is safe because we already handle cases
				// where locks exist only in the MVCC keyspace and not in the
				// lockTable.
				//
				// In the future, we'd like to make this more explicit.
				// Specifically, we'd like to augment the lockTable with an
				// understanding of finalized but not yet resolved locks. These
				// locks will allow conflicting transactions to proceed with
				// evaluation without the need to first remove all traces of
				// them via a round of replication. This is discussed in more
				// detail in #41720. Specifically, see mention of "contention
				// footprint" and COMMITTED_BUT_NOT_REMOVABLE.
				w.lm.OnLockUpdated(ctx, &deferredResolution[len(deferredResolution)-1])
				continue
			}
		}
		// The request should push to detect abandoned locks due to
		// failed transaction coordinators, detect deadlocks between
		// transactions, or both, but only after delay. This delay
		// avoids unnecessary push traffic when the conflicting
		// transaction is continuing to make forward progress.
		delay := time.Duration(math.MaxInt64)
		if livenessPush {
			delay = minDuration(delay, LockTableLivenessPushDelay.Get(&w.st.SV))
		}
		if deadlockPush {
			delay = minDuration(delay, LockTableDeadlockDetectionPushDelay.Get(&w.st.SV))
		}
		// However, if the pushee has the minimum priority or if the
		// pusher has the maximum priority, push immediately.
		// TODO(nvanbenschoten): flesh these interactions out more and
		// add some testing.
		if hasMinPriority(state.txn) || hasMaxPriority(req.Txn) {
			delay = 0
		}
		if delay > 0 {
			if timer == nil {
				timer = timeutil.NewTimer()
				defer timer.Stop()
			}
			timer.Reset(delay)
			timerC = timer.C
		} else {
			// If we don't want to delay the push, don't use a real timer.
			// Doing so is both a waste of resources and, more importantly,
			// makes TestConcurrencyManagerBasic flaky because there's no
			// guarantee that the timer will fire before the goroutine enters
			// a "select" waiting state on the next iteration of this loop.
			timerC = closedTimerC
		}
		timerWaitingState = state
	case waitElsewhere:
		// The lockTable has hit a memory limit and is no longer maintaining
		// proper lock wait-queues.
		if !state.held {
			// If the lock is not held, exit immediately. Requests will
			// be ordered when acquiring latches.
			return nil
		}
		// The waiting request is still not safe to proceed with
		// evaluation because there is still a transaction holding the
		// lock. It should push the transaction it is blocked on
		// immediately to wait in that transaction's txnWaitQueue. Once
		// this completes, the request should stop waiting on this
		// lockTableGuard, as it will no longer observe lock-table state
		// transitions.
		return w.pushLockTxn(ctx, req, state)
	case waitSelf:
		// Another request from the same transaction is the reservation
		// holder of this lock wait-queue. This can only happen when the
		// request's transaction is sending multiple requests concurrently.
		// Proceed with waiting without pushing anyone.
	case doneWaiting:
		// The request has waited for all conflicting locks to be released
		// and is at the front of any lock wait-queues. It can now stop
		// waiting, re-acquire latches, and check the lockTable again for
		// any new conflicts. If it finds none, it can proceed with
		// evaluation.
		return nil
	default:
		panic("unexpected waiting state")
	}
case <-timerC:
	// If the request was in the waitFor or waitForDistinguished states
	// and did not observe any update to its state for the entire delay,
	// it should push. It may be the case that the transaction is part
	// of a dependency cycle or that the lock holder's coordinator node
	// has crashed.
	timerC = nil
	if timer != nil {
		timer.Read = true
	}
	// If the request is conflicting with a held lock then it pushes its
	// holder synchronously - there is no way it will be able to proceed
	// until the lock's transaction undergoes a state transition (either
	// completing or being pushed) and then updates the lock's state
	// through intent resolution. The request has a dependency on the
	// entire conflicting transaction.
	//
	// However, if the request is conflicting with another request (a
	// reservation holder) then it pushes the reservation holder
	// asynchronously while continuing to listen to state transitions in
	// the lockTable. This allows the request to cancel its push if the
	// conflicting reservation exits the lock wait-queue without leaving
	// behind a lock. In this case, the request has a dependency on the
	// conflicting request but not necessarily the entire conflicting
	// transaction.
	if timerWaitingState.held {
		err = w.pushLockTxn(ctx, req, timerWaitingState)
	} else {
		// It would be more natural to launch an async task for the push
		// and continue listening on this goroutine for lockTable state
		// transitions, but doing so is harder to test against. Instead,
		// we launch an async task to listen to lockTable state and
		// synchronously push. If the watcher goroutine detects a
		// lockTable change, it cancels the context on the push.
		pushCtx, pushCancel := context.WithCancel(ctx)
		go w.watchForNotifications(pushCtx, pushCancel, newStateC)
		err = w.pushRequestTxn(pushCtx, req, timerWaitingState)
		if errors.Is(pushCtx.Err(), context.Canceled) {
			// Ignore the context canceled error. If this was for the
			// parent context then we'll notice on the next select.
			err = nil
		}
		pushCancel()
	}
	if err != nil {
		return err
	}
case <-ctxDoneC:
	return roachpb.NewError(ctx.Err())
case <-shouldQuiesceC:
	return roachpb.NewError(&roachpb.NodeUnavailableError{})
}

We'll want to create the protobuf at the top of the method, register it with the registry of inflight operations (also part of the always-on tracing work), and emit it to the trace span (if any) when the conflict has been handled.
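
A minimal sketch of that emission step, under the assumption that ContentionEvent is the payload sketched above and that the tracing side exposes something along the lines of tracing.SpanFromContext and Span.RecordStructured (the registry hookup is elided):

// Sketch only, not the actual integration: a helper that WaitOn could call
// once a conflict has been handled. WaitOn would record firstDetected at
// the top of the method (or when the conflict is first observed) and pass
// it in here, making the duration implicit.
func emitContentionEvent(
	ctx context.Context, key roachpb.Key, conflictingTxn roachpb.Transaction, firstDetected time.Time,
) {
	sp := tracing.SpanFromContext(ctx)
	if sp == nil {
		// Not tracing; nothing to emit.
		return
	}
	// Assumed API: attach the structured payload to the current Span.
	sp.RecordStructured(&ContentionEvent{
		Key:            key,
		FirstDetected:  firstDetected,
		ConflictingTxn: conflictingTxn,
	})
}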

Note that the concurrency manager also has a second path that is taken for conflict resolution, via write intent errors. This also has a subcase that needs to emit an event, though it is not thought to be commonly hit (~only when the lock table grows too large).

if wait {
	for i := range t.Intents {
		intent := &t.Intents[i]
		if err := m.ltw.WaitOnLock(ctx, g.Req, intent); err != nil {
			m.FinishReq(g)
			return nil, err
		}
	}
}

The existing tests in the concurrency package lend themselves well to verifying that the correct events are emitted. The datadriven tests should be updated to automatically print any events that are emitted, which means the existing tests will already give us good coverage.

Describe alternatives you've considered

Additional context

@tbg tbg added A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) labels Oct 15, 2020
@tbg
Member Author

tbg commented Oct 28, 2020

Additional context will follow, but as a starting point, consumers of this can roughly expect to get handed a slice of (protobuf)

type ContentionEvent struct {
  Key roachpb.Key
  Txn roachpb.Transaction // pushee at the end of conflict resolution
  Duration time.Duration // time spent contending on the key against that txn in that one instance
  Outcome enum // pushed?
}

@jordanlewis
Member

@tbg, is the Txn in this contention event the transaction that was contended with, or the transaction that is contending? In other words, if I'm a transaction and I get this event, is the Txn my same transaction or another transaction?

@tbg
Member Author

tbg commented Nov 3, 2020

The other transaction. My expectation is that the "own transaction" is separately available, but if this turns out not to be the case for some unexpected reason (or the exact snapshot of the transaction at the moment of the conflict matters) it's not a problem to include it in the event as well.

craig bot pushed a commit that referenced this issue Dec 1, 2020
56906: sql: produce mock ContentionEvents and display contention time in EXPLAIN ANALYZE r=RaduBerinde,tbg a=asubiotto

Please take a look at individual commits for details

Release note: None

Closes #56612 

@tbg could you take a look at the first commit which defines a `ContentionEvent` protobuf? It's close to what's described in #55583 (minus some fields that can be added later).

Co-authored-by: Alfonso Subiotto Marques <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Jan 5, 2021
This is a (very dirty) first stab at cockroachdb#55583 that I am posting to answer
the following questions:

1. is this basic approach sane or should contention metadata be
generated elsewhere (i.e. not in `WaitOn`)? `WaitOn` does not
necessarily see all updates, though I would take the point of
view that a contending transaction that a request never ends
up waiting on (presumably because it is busy handling other
conflicts) is not one we need to observe.
2. how to best structure the code so that contention events
are tracked. Completing the current approach would likely
result in spaghetti code that is easy to get wrong. We
seem to want to keep a bit of state around; more than we
do with the current structure which intentionally minimizes
state held across iterations of the wait loop. In particular,
it's unclear to me whether `state.txn` and/or `state.key` can
change within a single invocation of `WaitOn`. I think this
all means that I need to understand this code (in particular
that on the other side of newStateC) better.

Release note: None
@tbg
Member Author

tbg commented Jan 6, 2021

#58444 will address the bulk of the work here, sans the outcome. As far as I can tell (just talked this through with Nathan) the outcome is either

  • some unknown hard error occurred (not much to do with those)
  • WriteIntentError if WaitPolicy_ERROR was used
  • the contending transaction was taken care of, i.e. it no longer is in the way via an intent or in-mem lock (though its writes may lead to our txn requiring a restart later)

Leaving this issue open to add the outcome enum later.
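
As a rough, non-authoritative sketch of what such an Outcome enum could enumerate, following the cases above (the names are made up for illustration):

// Sketch only: illustrative outcome values mirroring the cases above.
type ContentionOutcome int32

const (
	// An unknown hard error occurred while resolving the conflict.
	OutcomeError ContentionOutcome = iota
	// A WriteIntentError was returned because WaitPolicy_ERROR was used.
	OutcomeWriteIntentError
	// The contending transaction was taken care of and is no longer in the
	// way (though its writes may still force a restart of our txn later).
	OutcomeResolved
)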

tbg added a commit to tbg/cockroach that referenced this issue Jan 6, 2021
This change attaches a protobuf payload to the current Span whenever a
request conflicts with another transaction. The payload contains the
contending txn (i.e. the pushee) at the time at which it was first
encountered, the key on which the conflict occurred (note that this is
not necessarily the key at which the pushee is anchored) and the time
spent waiting on the conflict (excluding intent resolution).

This enables cockroachdb#57114.

Touches cockroachdb#55583. I am not closing that issue yet because we also want
to potentially track the outcome of the conflict.

Release note: None
@tbg tbg added the A-tracing Relating to tracing in CockroachDB. label Jan 20, 2021
craig bot pushed a commit that referenced this issue Jan 21, 2021
58444: concurrency: emit structured contention information to trace r=irfansharif,nvanbenschoten a=tbg

This change attaches a protobuf payload to the current Span whenever a
request conflicts with another transaction. The payload contains the
contending txn (i.e. the pushee) at the time at which it was first
encountered, the key on which the conflict occurred (note that this is
not necessarily the key at which the pushee is anchored) and the time
spent waiting on the conflict (excluding intent resolution).

This enables #57114.

Touches #55583. I am not closing that issue yet because we also want
to potentially track the outcome of the conflict.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
@tbg
Member Author

tbg commented Feb 24, 2021

We produce the metadata now, and while I said earlier that I wanted to leave this issue open to track improvements, I no longer consider that valuable. We can file new issues as we ship the current contention metadata and determine steps to improve it.

@tbg tbg closed this as completed Feb 24, 2021