…db#70095 cockroachdb#70102 cockroachdb#70118

69300: jobs: retry non-cancelable running and all reverting jobs r=ajwerner a=sajjadrizvi

Previously, non-cancelable jobs were retried in the running
state only if their errors were marked as retryable, and only
non-cancelable reverting jobs were retried by default. This
commit makes non-cancelable jobs always retry in the running
state unless their error is marked as permanent. In addition,
it makes all reverting jobs retry when they fail. As a result,
non-cancelable running jobs and all reverting jobs no longer
fail due to transient errors.

Release justification: low-risk updates to new functionality.

Release note (general change): Non-cancelable jobs, such as
schema-change GC jobs, now fail only when they encounter a
permanent error; if they hit a transient error, they are
retried with exponential backoff. Furthermore, jobs that are
reverting no longer fail outright: if an error is encountered
while reverting, the job is retried with exponential backoff,
so transient errors do not impact reverting jobs.

Fixes: cockroachdb#66685
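
For illustration, here is a minimal, self-contained Go sketch of the retry policy described above. The `permanentError` type, `markAsPermanent`, `isPermanent`, and `runWithBackoff` are hypothetical stand-ins for the jobs package's machinery, not its actual API (the tests in this change use the real `jobs.MarkAsPermanentJobError`):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// permanentError is a hypothetical marker type standing in for the jobs
// package's notion of a permanent job error.
type permanentError struct{ err error }

func (e *permanentError) Error() string { return e.err.Error() }
func (e *permanentError) Unwrap() error { return e.err }

func markAsPermanent(err error) error { return &permanentError{err: err} }

func isPermanent(err error) bool {
	var pe *permanentError
	return errors.As(err, &pe)
}

// runWithBackoff keeps retrying fn with exponential backoff until it either
// succeeds or returns a permanent error, mirroring the behavior described in
// the release note for non-cancelable running jobs and all reverting jobs.
func runWithBackoff(fn func() error, initial, max time.Duration) error {
	delay := initial
	for {
		err := fn()
		if err == nil || isPermanent(err) {
			return err
		}
		// Transient error: back off and try again.
		time.Sleep(delay)
		if delay *= 2; delay > max {
			delay = max
		}
	}
}

func main() {
	attempts := 0
	err := runWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("transient: lease transfer in progress") // retried
		}
		return markAsPermanent(errors.New("descriptor validation failed")) // gives up
	}, 10*time.Millisecond, 100*time.Millisecond)
	fmt.Printf("finished after %d attempts: %v\n", attempts, err)
}
```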

69982: docs/tech-notes: admission control overview r=sumeerbhola a=sumeerbhola

Release justification: Non-production code change.
Release note: None

70094: tenantcostserver: fix erroneous panic in tests r=RaduBerinde a=RaduBerinde

The test-only code that checks the invariants of the `tenant_usage`
table inadvertently panics if the query hits an error (such as one
that would be expected if the server is shutting down). We now just
log the error instead.
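
Roughly, the new behavior looks like the following sketch; `checkInvariants` and its `query` parameter here are hypothetical stand-ins, and the real change is in the `system_table.go` diff below:

```go
package main

import (
	"context"
	"errors"
	"log"
)

// checkInvariants is a hypothetical stand-in for the test-only tenant_usage
// invariant check. A failed query is logged (unless our own context was
// canceled) and swallowed, instead of being returned to a caller that would
// panic on it.
func checkInvariants(ctx context.Context, query func(context.Context) ([]string, error)) error {
	rows, err := query(ctx)
	if err != nil {
		if ctx.Err() == nil {
			log.Printf("checkInvariants query failed: %v", err)
		}
		// Query errors are expected during server shutdown; don't escalate.
		return nil
	}
	if len(rows) == 0 {
		return nil
	}
	// ... verify that the per-tenant and per-instance rows are consistent ...
	return nil
}

func main() {
	failing := func(context.Context) ([]string, error) {
		return nil, errors.New("server is shutting down")
	}
	if err := checkInvariants(context.Background(), failing); err != nil {
		panic(err) // never reached: the query error is logged, not returned
	}
	log.Println("no panic")
}
```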

Fixes cockroachdb#70089.

Release note: None

Release justification: non-production code change to fix test failure.

70095: tenantcostclient: restrict allowed configuration from the tenant side r=RaduBerinde a=RaduBerinde

This change restricts the configuration of tenant cost control from
the tenant side. In the future, we will want settings whose values
come from the host cluster, but we don't have that infrastructure
today.

Because tenants can currently set their own settings, they could
easily sabotage the cost control mechanisms. This change restricts
the allowed values for the target period and the CPU usage allowance,
and pins the cost model configuration to the default.
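
As a standalone illustration of the validator pattern used here (the real `checkDurationInRange` helper and the settings it guards appear in the `tenant_side.go` diff below):

```go
package main

import (
	"fmt"
	"time"
)

// checkDurationInRange returns a validator that rejects values outside
// [min, max]; attached to a tenant-settable cluster setting, it bounds how
// far a tenant can push the knob.
func checkDurationInRange(min, max time.Duration) func(time.Duration) error {
	return func(v time.Duration) error {
		if v < min || v > max {
			return fmt.Errorf("value %s out of range (%s, %s)", v, min, max)
		}
		return nil
	}
}

func main() {
	validate := checkDurationInRange(5*time.Second, 120*time.Second)
	fmt.Println(validate(10 * time.Second)) // <nil>: accepted
	fmt.Println(validate(time.Second))      // rejected: too aggressive a request period
}
```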

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

70102: sql: clean up large row errors and events r=knz,yuzefovich a=michae2

Addresses: cockroachdb#67400, cockroachdb#69477

Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done
for CommonTxnRowsLimitDetails in cockroachdb#69945.

Also remove the SafeDetails methods from CommonLargeRowDetails,
txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we
need them.

Release note: None (there was no release between the introduction of the
LargeRow and LargeRowInternal events and this commit).

70118: kv: lock mutexes for `TxnCoordSender.Epoch()` and `Txn.status()` r=ajwerner a=erikgrinaker

### kvcoord: lock mutex in `TxnCoordSender.Epoch()`

Methods that access `TxnCoordSender.mu` fields must lock the mutex
first. `Epoch()` didn't.

Resolves cockroachdb#70071.

Release note: None

### kv: fix mutex locking for `Txn.status`

`Txn.status()` fetches the transaction status from the mutex-protected
`Txn.mu.sender` field, but callers did not take out the mutex lock when
calling it.

This patch renames the method to `Txn.statusLocked()`, and updates all
callers to take out the lock before calling it.
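
A minimal sketch of the locking convention behind both fixes, using hypothetical stand-in types rather than the real `TxnCoordSender`/`Txn`:

```go
package main

import (
	"fmt"
	"sync"
)

// txn is a hypothetical stand-in: its state lives behind a mutex, so an
// accessor must either take the lock itself or carry a "Locked" suffix and
// require the caller to hold it.
type txn struct {
	mu struct {
		sync.Mutex
		epoch  int
		status string
	}
}

// Epoch locks the mutex before reading the protected field, as the fixed
// TxnCoordSender.Epoch() now does.
func (t *txn) Epoch() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.mu.epoch
}

// statusLocked assumes t.mu is already held by the caller; the "Locked"
// suffix documents that contract, mirroring the Txn.statusLocked() rename.
func (t *txn) statusLocked() string {
	return t.mu.status
}

func main() {
	t := &txn{}

	t.mu.Lock()
	t.mu.status = "PENDING"
	fmt.Println(t.statusLocked()) // caller already holds the lock
	t.mu.Unlock()

	fmt.Println(t.Epoch()) // Epoch takes and releases the lock itself
}
```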

Release note: None

Co-authored-by: Sajjad Rizvi <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
6 people committed Sep 13, 2021
7 parents 71d0b36 + d7ab27e + e64d7fa + 50021ff + d3b4801 + 309c3cb + 0d7c407 commit e19274a
Showing 25 changed files with 885 additions and 377 deletions.
8 changes: 3 additions & 5 deletions docs/generated/eventlog.md
@@ -1882,7 +1882,7 @@ Events in this category are logged to the `SQL_PERF` channel.
### `large_row`

An event of type `large_row` is recorded when a statement tries to write a row larger than
cluster setting `sql.mutations.max_row_size.log` to the database. Multiple
cluster setting `sql.guardrails.max_row_size_log` to the database. Multiple
LargeRow events will be recorded for statements writing multiple large rows.
LargeRow events are recorded before the transaction commits, so in the case
of transaction abort there will not be a corresponding row in the database.
@@ -1900,7 +1900,6 @@ of transaction abort there will not be a corresponding row in the database.
| `TableID` | | no |
| `FamilyID` | | no |
| `PrimaryKey` | | yes |
| `ViolatesMaxRowSizeErr` | | no |

### `slow_query`

@@ -2008,8 +2007,8 @@ Events in this category are logged to the `SQL_INTERNAL_PERF` channel.
### `large_row_internal`

An event of type `large_row_internal` is recorded when an internal query tries to write a row
larger than cluster settings `sql.mutations.max_row_size.log` or
`sql.mutations.max_row_size.err` to the database.
larger than cluster settings `sql.guardrails.max_row_size_log` or
`sql.guardrails.max_row_size_err` to the database.



@@ -2024,7 +2023,6 @@ larger than cluster settings `sql.mutations.max_row_size.log` or
| `TableID` | | no |
| `FamilyID` | | no |
| `PrimaryKey` | | yes |
| `ViolatesMaxRowSizeErr` | | no |

### `slow_query_internal`

343 changes: 343 additions & 0 deletions docs/tech-notes/admission_control.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions pkg/ccl/multiregionccl/region_test.go
@@ -266,7 +266,7 @@ func TestRegionAddDropEnclosingRegionalByRowOps(t *testing.T) {
<-rbrOpFinished
if !regionAlterCmd.shouldSucceed {
// Trigger a roll-back.
return errors.New("boom")
return jobs.MarkAsPermanentJobError(errors.New("boom"))
}
// Trod on.
return nil
@@ -544,7 +544,7 @@ func TestDroppingPrimaryRegionAsyncJobFailure(t *testing.T) {
knobs := base.TestingKnobs{
SQLTypeSchemaChanger: &sql.TypeSchemaChangerTestingKnobs{
RunBeforeExec: func() error {
return errors.New("yikes")
return jobs.MarkAsPermanentJobError(errors.New("yikes"))
},
RunAfterOnFailOrCancel: func() error {
mu.Lock()
@@ -609,7 +609,7 @@ func TestRollbackDuringAddDropRegionAsyncJobFailure(t *testing.T) {
knobs := base.TestingKnobs{
SQLTypeSchemaChanger: &sql.TypeSchemaChangerTestingKnobs{
RunBeforeMultiRegionUpdates: func() error {
return errors.New("boom")
return jobs.MarkAsPermanentJobError(errors.New("boom"))
},
},
// Decrease the adopt loop interval so that retries happen quickly.
@@ -692,7 +692,7 @@ func TestRollbackDuringAddDropRegionPlacementRestricted(t *testing.T) {
knobs := base.TestingKnobs{
SQLTypeSchemaChanger: &sql.TypeSchemaChangerTestingKnobs{
RunBeforeMultiRegionUpdates: func() error {
return errors.New("boom")
return jobs.MarkAsPermanentJobError(errors.New("boom"))
},
},
// Decrease the adopt loop interval so that retries happen quickly.
23 changes: 19 additions & 4 deletions pkg/ccl/multitenantccl/tenantcostclient/tenant_side.go
@@ -32,7 +32,7 @@ var TargetPeriodSetting = settings.RegisterDurationSetting(
"tenant_cost_control_period",
"target duration between token bucket requests from tenants (requires restart)",
10*time.Second,
settings.PositiveDuration,
checkDurationInRange(5*time.Second, 120*time.Second),
)

// CPUUsageAllowance is exported for testing purposes.
@@ -42,9 +42,22 @@ var CPUUsageAllowance = settings.RegisterDurationSetting(
"doesn't contribute to consumption; for example, if it is set to 10ms, "+
"that corresponds to 1% of a CPU",
10*time.Millisecond,
settings.NonNegativeDuration,
checkDurationInRange(0, 10*time.Millisecond),
)

// checkDurationInRange returns a function used to validate duration cluster
// settings. Because these values are currently settable by the tenant, we need
// to restrict the allowed values to avoid possible sabotage of the cost control
// mechanisms.
func checkDurationInRange(min, max time.Duration) func(v time.Duration) error {
return func(v time.Duration) error {
if v < min || v > max {
return errors.Errorf("value %s out of range (%s, %s)", v, min, max)
}
return nil
}
}

// mainLoopUpdateInterval is the period at which we collect CPU usage and
// evaluate whether we need to send a new token request.
const mainLoopUpdateInterval = 1 * time.Second
@@ -93,8 +106,10 @@ func newTenantSideCostController(
}
c.limiter.Init(c.timeSource, c.lowRUNotifyChan)

// TODO(radu): support changing the tenant configuration at runtime.
c.costCfg = tenantcostmodel.ConfigFromSettings(&st.SV)
// TODO(radu): these settings can currently be changed by the tenant (see
// #47918), which would make it very easy to evade cost control. For now, use
// the hardcoded default values.
c.costCfg = tenantcostmodel.DefaultConfig()
return c, nil
}

1 change: 1 addition & 0 deletions pkg/ccl/multitenantccl/tenantcostserver/BUILD.bazel
@@ -24,6 +24,7 @@ go_library(
"//pkg/sql/sem/tree",
"//pkg/sql/sessiondata",
"//pkg/util",
"//pkg/util/log",
"//pkg/util/metric",
"//pkg/util/metric/aggmetric",
"//pkg/util/syncutil",
14 changes: 10 additions & 4 deletions pkg/ccl/multitenantccl/tenantcostserver/system_table.go
@@ -24,6 +24,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/sql/sem/tree"
"github.com/cockroachdb/cockroach/pkg/sql/sessiondata"
"github.com/cockroachdb/cockroach/pkg/util"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/errors"
)

@@ -392,16 +393,16 @@ func (h *sysTableHelper) accomodateNewInstance(tenant *tenantState, instance *in

// maybeCheckInvariants checks the invariants for the system table with a random
// probability and only if this is a test build.
func (h *sysTableHelper) maybeCheckInvariants() error {
func (h *sysTableHelper) maybeCheckInvariants(ctx context.Context) error {
if util.CrdbTestBuild && rand.Intn(10) == 0 {
return h.checkInvariants()
return h.checkInvariants(ctx)
}
return nil
}

// checkInvariants reads all rows in the system table for the given tenant and
// checks that the state is consistent.
func (h *sysTableHelper) checkInvariants() error {
func (h *sysTableHelper) checkInvariants(ctx context.Context) error {
// Read the two rows for the per-tenant state (instance_id = 0) and the
// per-instance state.
rows, err := h.ex.QueryBufferedEx(
Expand Down Expand Up @@ -430,7 +431,12 @@ func (h *sysTableHelper) checkInvariants() error {
h.tenantID.ToUint64(),
)
if err != nil {
return err
if h.ctx.Err() == nil {
log.Warningf(ctx, "checkInvariants query failed: %v", err)
}
// We don't want to cause a panic for a query error (which is expected
// during shutdown).
return nil
}
if len(rows) == 0 {
return nil
2 changes: 1 addition & 1 deletion pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go
@@ -79,7 +79,7 @@ func (s *instance) TokenBucketRequest(
return err
}

if err := h.maybeCheckInvariants(); err != nil {
if err := h.maybeCheckInvariants(ctx); err != nil {
panic(err)
}
consumption = tenant.Consumption