PDP 20 (TXN Timeouts)

Derek Moore edited this page Jul 9, 2021 · 1 revision

Revisiting transaction timeouts

Status: Under discussion

Related issues:

Motivation

Transactions enable writers to perform multiple writes to a stream atomically. The current API call to begin a transaction looks like this:

Transaction<Type> beginTxn(long transactionTimeout, 
                           long maxExecutionTime,
                           long scaleGracePeriod);

The three parameters configure different timeouts for a txn:

  1. transactionTimeout is a lease timeout. If no client pings the controller for this transaction within the specified period, then the txn aborts.
  2. maxExecutionTime is the maximum amount of time a txn is allowed to remain open, independent of whether a client is pinging the controller for the given txn or not.
  3. scaleGracePeriod is the maximum amount of time a txn is allowed to block a stream scaling event without aborting. Scaling events currently need to wait until outstanding txns complete before they can proceed. Consequently, choosing a larger value for scaleGracePeriod means that stream scaling events can be blocked for a longer time waiting on a txn.
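
To make the interaction of the three timeouts concrete, here is a minimal sketch of the abort decision they imply. The class `TxnTimeoutCheck`, its `AbortReason` enum, and the `check` method are hypothetical names introduced for illustration and are not part of the Pravega code base. The sketch assumes all values are in milliseconds and that a scale wait start of -1 means no scaling event is waiting; the order in which the timeouts are checked is also an assumption.

```java
import java.util.Optional;

/** Hypothetical sketch: decides which timeout, if any, would abort a txn. */
final class TxnTimeoutCheck {
    enum AbortReason { LEASE_EXPIRED, MAX_EXECUTION_TIME_EXCEEDED, SCALE_GRACE_PERIOD_EXCEEDED }

    private final long transactionTimeout; // lease: longest allowed gap between pings
    private final long maxExecutionTime;   // longest time the txn may stay open
    private final long scaleGracePeriod;   // longest time the txn may block a scaling event

    TxnTimeoutCheck(long transactionTimeout, long maxExecutionTime, long scaleGracePeriod) {
        this.transactionTimeout = transactionTimeout;
        this.maxExecutionTime = maxExecutionTime;
        this.scaleGracePeriod = scaleGracePeriod;
    }

    /**
     * @param beginTime      time the txn was opened
     * @param lastPingTime   time of the most recent ping for the txn
     * @param scaleWaitStart time a scaling event started waiting on the txn, or -1 if none
     * @param now            current time
     * @return the reason to abort, or empty if the txn may stay open
     */
    Optional<AbortReason> check(long beginTime, long lastPingTime, long scaleWaitStart, long now) {
        if (now - lastPingTime > transactionTimeout) {
            return Optional.of(AbortReason.LEASE_EXPIRED);
        }
        if (now - beginTime > maxExecutionTime) {
            return Optional.of(AbortReason.MAX_EXECUTION_TIME_EXCEEDED);
        }
        if (scaleWaitStart >= 0 && now - scaleWaitStart > scaleGracePeriod) {
            return Optional.of(AbortReason.SCALE_GRACE_PERIOD_EXCEEDED);
        }
        return Optional.empty();
    }
}
```

Even in this simplified form, a caller has to reason about three independent conditions to predict when a txn will abort.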

One problem with configuring new txns this way is that the values are difficult to choose, as there are three different timeouts to reason about. A second, and perhaps more severe, problem is the choice of the scale grace period, which has conflicting goals. A longer timeout favors correctness, as the application does not want a txn to time out prematurely. A shorter timeout favors scaling: once the system determines that scaling needs to happen, the application gets access to the new segments sooner.

Current transaction timeouts

  1. Lease timeouts
  2. Scale grace period
  3. Maximum execution time

Lease timeouts

The main goal of txn leases is to enable the system to reclaim resources quickly when a txn has been left open because, for example, the client crashed. This mechanism works by having the writer client ping the txn periodically to keep it open.

The recommendation in this proposal is to keep the lease mechanism but make the pings internal to the client. We also recommend moving the lease timeout into configuration rather than passing it on each API call.
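
One way the client could keep a lease alive internally is sketched below. This is an illustration under stated assumptions, not Pravega's implementation: `TxnLeaseRenewer` and `PingAction` are hypothetical names, the ping callback stands in for whatever controller call renews the lease, and pinging at one third of the lease timeout is an arbitrary choice made for this sketch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical client-side helper that renews a txn lease in the background. */
final class TxnLeaseRenewer implements AutoCloseable {
    /** Hypothetical stand-in for the controller call that renews the lease. */
    interface PingAction {
        void ping();
    }

    private final ScheduledExecutorService scheduler;
    private final ScheduledFuture<?> task;

    TxnLeaseRenewer(long leaseTimeoutMillis, PingAction action) {
        // Daemon thread so that an abandoned renewer does not keep the JVM alive.
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "txn-lease-renewer");
            t.setDaemon(true);
            return t;
        });
        // Assumption: ping at one third of the lease so a single delayed ping
        // does not let the lease expire.
        long interval = Math.max(1, leaseTimeoutMillis / 3);
        Runnable safePing = () -> {
            try {
                action.ping();
            } catch (RuntimeException e) {
                // Swallow so the schedule keeps running; a real client would log or retry.
            }
        };
        this.task = scheduler.scheduleAtFixedRate(safePing, interval, interval, TimeUnit.MILLISECONDS);
    }

    /** Stops pinging, for example when the txn commits or aborts. */
    @Override
    public void close() {
        task.cancel(false);
        scheduler.shutdownNow();
    }
}
```

With a helper like this, the application never sees the lease timeout at the call site: the client would read it from configuration, start the renewer when the txn begins, and close it when the txn commits or aborts. If the client crashes, the pings stop and the lease expires as before.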

Scale grace period

We propose to remove the scale grace period by implementing rolling transactions. With rolling transactions, outstanding txns do not block scaling events. The current grace period for stream scaling has the drawback described in the Motivation section.

Maximum execution time

We propose to either remove this timeout or move it to configuration.
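
Both options can be covered by a single configuration object, sketched below under stated assumptions. `TxnTimeoutConfig` and its members are hypothetical names, not an existing Pravega class; values are assumed to be in milliseconds, and an empty maximum execution time models the "remove" option.

```java
import java.util.OptionalLong;

/** Hypothetical configuration holding the timeouts that leave the beginTxn call. */
final class TxnTimeoutConfig {
    private final long leaseTimeoutMillis;
    private final OptionalLong maxExecutionTimeMillis; // empty means no cap

    TxnTimeoutConfig(long leaseTimeoutMillis, OptionalLong maxExecutionTimeMillis) {
        if (leaseTimeoutMillis <= 0) {
            throw new IllegalArgumentException("lease timeout must be positive");
        }
        if (maxExecutionTimeMillis.isPresent() && maxExecutionTimeMillis.getAsLong() <= 0) {
            throw new IllegalArgumentException("max execution time must be positive");
        }
        this.leaseTimeoutMillis = leaseTimeoutMillis;
        this.maxExecutionTimeMillis = maxExecutionTimeMillis;
    }

    long leaseTimeoutMillis() {
        return leaseTimeoutMillis;
    }

    /** True if a txn opened at beginTimeMillis has outlived the configured cap. */
    boolean exceedsMaxExecutionTime(long beginTimeMillis, long nowMillis) {
        return maxExecutionTimeMillis.isPresent()
                && nowMillis - beginTimeMillis > maxExecutionTimeMillis.getAsLong();
    }
}
```

With configuration of this shape, the call to begin a txn would no longer need any timeout arguments.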

Open questions

Discarded approaches
