Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
105256: sql: adjust DistSQL physical planning in multi-tenant mode r=yuzefovich a=yuzefovich

This commit removes the logic that might reassign some spans to the
gateway that can be applied in the multi-tenant deployment mode. This
logic was introduced in cockroachdb#80353 with the following [rationale](cockroachdb#80353 (review)):
```
This will probably go away once we support locality-aware distribution,
at least in this form. There's two reasons why I did this right now:
1. It avoids an extra hop (KV->assignee->gateway->client vs
KV->gateway->client), and this is an optimization we sometimes do in
the non-MT code path though at a later stage,
2. It makes the assignments and distribution type deterministic in
testing when we expect to assign to only a single pod.
```
Since then the locality-aware planning has been implemented (addressing
first half of point 1.). Also, the second half of point 1. applies both
to single-tenant and multi-tenant modes (`maybeMoveSingleFlowToGateway`).
Point 2. is a bit unfortunate and I'm not sure what to do about it yet
(we'll need to figure it out if we ever make separate-process
multi-tenant the default mode for running tests that check DistSQL
planning). For now only a single test needed an adjustment to make it
deterministic.

Additionally, running TPCH queries experimentally has shown that this
reassigning of single TableReader can make the query latency
significantly different (sometimes reduce it, sometimes increase it)
(see [here](cockroachdb#104379 (comment))).

All of these reasons suggest that we should just get rid off this logic
to unify the DistSQL physical planner more between single-tenant and
multi-tenant modes.

Addresses: cockroachdb#104379.
Epic: CRDB-26687

Release note: None

105569: roachtest: harden and extend cancel roachtest r=yuzefovich a=yuzefovich

This commit hardens `cancel` roachtest. In particular, this test
involves two goroutines: the runner that is executing longer running
TPCH query and the main goroutine that cancels that query. Previously,
in order to ensure that the main goroutine attempts to cancel the query
at the right moment, we slept for 250ms. Then, we would cancel all
running non-internal queries other than `SHOW CLUSTER QUERIES` itself.
This was problematic for a couple of reasons:
- the TPCH query might not have started yet (due some Go scheduling
delays)
- we could actually try to cancel one of the setup queries (the runner
does `USE tpch;` and `SET distsql = off;` before running the TPCH
query).

In order to address the first reason, this commit adjusts the runner to
notify the main goroutine only after the setup queries are done and
introduces the polling loop to wait until the TPCH query shows up. That
polling loop will now randomly sleep for a random duration up to 1000ms
(in order to improve the test coverage of both the optimizer and the
execution engine). Note that we only check that the cancellation
occurred within 3s (used to be 5s before this commit), so we don't
sufficiently exercise the optimizer cancellation (which isn't the
primary goal of this test anyway).

The second reason is addressed by blocking the main goroutine until the
setup queries are done.

Fixes: cockroachdb#105528.

Release note: None

105817: docs-issue-generation: add related PRs link to product change issue body r=nickvigilante a=nickvigilante

Fixes DEVINF-447

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nick Vigilante <[email protected]>
  • Loading branch information
3 people committed Jun 29, 2023
4 parents c9f4e8d + bdda6a0 + f80b346 + 75a6f1f commit da4a10b
Show file tree
Hide file tree
Showing 6 changed files with 205 additions and 131 deletions.
9 changes: 9 additions & 0 deletions pkg/cmd/docs-issue-generation/docs_issue_generation.go
Original file line number Diff line number Diff line change
Expand Up @@ -867,6 +867,15 @@ func formatReleaseNotes(
)
rnBodySlice = append(rnBodySlice, strings.TrimSuffix(rnBody, "\n"))
}
if len(rnBodySlice) > 1 {
relatedProductChanges := "Related product changes: " +
"https://cockroachlabs.atlassian.net/issues/?jql=project%20%3D%20%22DOC%22%20and%20%22Doc%20Type%5BDropdown%5D" +
"%22%20%3D%20%22Product%20Change%22%20AND%20description%20~%20%22commit%2F" +
crdbSha + "%22%20ORDER%20BY%20created%20DESC\n\n---"
for i, rn := range rnBodySlice {
rnBodySlice[i] = strings.Replace(rn, "\n---", relatedProductChanges, -1)
}
}
return rnBodySlice
}

Expand Down
82 changes: 82 additions & 0 deletions pkg/cmd/docs-issue-generation/docs_issue_generation_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -360,6 +360,88 @@ This increases troubleshootability.
Release note: None`,
rns: []string{},
},
{
prNum: "104265",
sha: "d756dec1b9d7245305ab706e68e2ec3de0e61ffc",
commitMessage: `Release note (cli change): The log output formats ` + "`crdb-v1`" + ` and
` + "`crdb-v2`" + ` now support the format option ` + "`timezone`" + `. When specified,
the corresponding time zone is used to produce the timestamp column.
For example:
` + "```" + `yaml
file-defaults:
format: crdb-v2
format-options: {timezone: america/new_york}
` + "```" + `
Example logging output:
` + "```" + `
I230606 12:43:01.553407-040000 1 1@cli/start.go:575 ⋮ [n?] 4 soft memory limit of Go runtime is set to 35 GiB
^^^^^^^ indicates GMT-4 was used.
` + "```" + `
The timezone offset is also always included in the format if it is not
zero (e.g. for non-UTC time zones). This is necessary to ensure that
the times can be read back precisely.
Release note (cli change): The command ` + "`cockroach debug merge-log`" + ` was
adapted to understand time zones in input files read with format
` + "`crdb-v1`" + ` or ` + "`crdb-v2`" + `.
Release note (backward-incompatible change): When a deployment is
configured to use a time zone (new feature) for log file output using
formats ` + "`crdb-v1`" + ` or ` + "`crdb-v2`" + `, it becomes impossible to process the
new output log files using the ` + "`cockroach debug merge-log`" + ` command
from a previous version. The newest ` + "`cockroach debug merge-log`" + ` code
must be used instead.`,
rns: []string{`Related PR: https://github.com/cockroachdb/cockroach/pull/104265
Commit: https://github.com/cockroachdb/cockroach/commit/d756dec1b9d7245305ab706e68e2ec3de0e61ffc
Related product changes: https://cockroachlabs.atlassian.net/issues/?jql=project%20%3D%20%22DOC%22%20and%20%22Doc%20Type%5BDropdown%5D%22%20%3D%20%22Product%20Change%22%20AND%20description%20~%20%22commit%2Fd756dec1b9d7245305ab706e68e2ec3de0e61ffc%22%20ORDER%20BY%20created%20DESC
---
Release note (cli change): The log output formats ` + "`crdb-v1`" + ` and
` + "`crdb-v2`" + ` now support the format option ` + "`timezone`" + `. When specified,
the corresponding time zone is used to produce the timestamp column.
For example:
` + "```" + `yaml
file-defaults:
format: crdb-v2
format-options: {timezone: america/new_york}
` + "```" + `
Example logging output:
` + "```" + `
I230606 12:43:01.553407-040000 1 1@cli/start.go:575 ⋮ [n?] 4 soft memory limit of Go runtime is set to 35 GiB
^^^^^^^ indicates GMT-4 was used.
` + "```" + `
The timezone offset is also always included in the format if it is not
zero (e.g. for non-UTC time zones). This is necessary to ensure that
the times can be read back precisely.`,
`Related PR: https://github.com/cockroachdb/cockroach/pull/104265
Commit: https://github.com/cockroachdb/cockroach/commit/d756dec1b9d7245305ab706e68e2ec3de0e61ffc
Related product changes: https://cockroachlabs.atlassian.net/issues/?jql=project%20%3D%20%22DOC%22%20and%20%22Doc%20Type%5BDropdown%5D%22%20%3D%20%22Product%20Change%22%20AND%20description%20~%20%22commit%2Fd756dec1b9d7245305ab706e68e2ec3de0e61ffc%22%20ORDER%20BY%20created%20DESC
---
Release note (cli change): The command ` + "`cockroach debug merge-log`" + ` was
adapted to understand time zones in input files read with format
` + "`crdb-v1`" + ` or ` + "`crdb-v2`" + `.`,
`Related PR: https://github.com/cockroachdb/cockroach/pull/104265
Commit: https://github.com/cockroachdb/cockroach/commit/d756dec1b9d7245305ab706e68e2ec3de0e61ffc
Related product changes: https://cockroachlabs.atlassian.net/issues/?jql=project%20%3D%20%22DOC%22%20and%20%22Doc%20Type%5BDropdown%5D%22%20%3D%20%22Product%20Change%22%20AND%20description%20~%20%22commit%2Fd756dec1b9d7245305ab706e68e2ec3de0e61ffc%22%20ORDER%20BY%20created%20DESC
---
Release note (backward-incompatible change): When a deployment is
configured to use a time zone (new feature) for log file output using
formats ` + "`crdb-v1`" + ` or ` + "`crdb-v2`" + `, it becomes impossible to process the
new output log files using the ` + "`cockroach debug merge-log`" + ` command
from a previous version. The newest ` + "`cockroach debug merge-log`" + ` code
must be used instead.`},
},
}
for _, tc := range testCases {
t.Run(tc.prNum, func(t *testing.T) {
Expand Down
144 changes: 93 additions & 51 deletions pkg/cmd/roachtest/tests/cancel.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/cockroach/pkg/util/cancelchecker"
"github.com/cockroachdb/cockroach/pkg/util/randutil"
"github.com/cockroachdb/cockroach/pkg/util/timeutil"
"github.com/cockroachdb/cockroach/pkg/workload/tpch"
"github.com/cockroachdb/errors"
Expand Down Expand Up @@ -58,63 +59,104 @@ func registerCancel(r registry.Registry) {
t.Fatal(err)
}

queryPrefix := "USE tpch; "
if !useDistsql {
queryPrefix += "SET distsql = off; "
}

t.Status("running queries to cancel")
rng, _ := randutil.NewTestRand()
for _, queryNum := range tpchQueriesToRun {
// sem is used to indicate that the query-runner goroutine has
// been spawned up.
sem := make(chan struct{})
// Any error regarding the cancellation (or of its absence) will
// be sent on errCh.
errCh := make(chan error, 1)
go func(queryNum int) {
defer close(errCh)
query := tpch.QueriesByNumber[queryNum]
t.L().Printf("executing q%d\n", queryNum)
sem <- struct{}{}
close(sem)
_, err := conn.Exec(queryPrefix + query)
if err == nil {
errCh <- errors.New("query completed before it could be canceled")
} else {
fmt.Printf("query failed with error: %s\n", err)
// Note that errors.Is() doesn't work here because
// lib/pq wraps the query canceled error.
if !strings.Contains(err.Error(), cancelchecker.QueryCanceledError.Error()) {
errCh <- errors.Wrap(err, "unexpected error")
// Run each query 5 times to increase the test coverage.
for run := 0; run < 5; run++ {
// sem is used to indicate that the query-runner goroutine
// has been spawned up and has done preliminary setup.
sem := make(chan struct{})
// An error will always be sent on errCh by the runner
// (either query execution error or an error indicating the
// absence of expected cancellation error).
errCh := make(chan error, 1)
go func(queryNum int) {
runnerConn := c.Conn(ctx, t.L(), 1)
defer runnerConn.Close()
setupQueries := []string{"USE tpch;"}
if !useDistsql {
setupQueries = append(setupQueries, "SET distsql = off;")
}
for _, setupQuery := range setupQueries {
t.L().Printf("executing setup query %q", setupQuery)
if _, err := runnerConn.Exec(setupQuery); err != nil {
errCh <- err
close(sem)
return
}
}
query := tpch.QueriesByNumber[queryNum]
t.L().Printf("executing q%d\n", queryNum)
close(sem)
_, err := runnerConn.Exec(query)
if err == nil {
err = errors.New("query completed before it could be canceled")
}
errCh <- err
}(queryNum)

// Wait for the query-runner goroutine to start as well as
// to execute setup queries.
<-sem

// Continuously poll until we get the queryID that we want
// to cancel. We expect it to show up within 10 seconds.
var queryID, query string
timeoutCh := time.After(10 * time.Second)
pollingStartTime := timeutil.Now()
for {
// Sleep for some random duration up to 1000ms. This
// allows us to sometimes find the query when it's in
// the planning stage while in most cases it's in the
// execution stage already.
toSleep := time.Duration(rng.Intn(1001)) * time.Millisecond
t.Status(fmt.Sprintf("sleeping for %s", toSleep))
time.Sleep(toSleep)
rows, err := conn.Query(`SELECT query_id, query FROM [SHOW CLUSTER QUERIES] WHERE query NOT LIKE '%SHOW CLUSTER QUERIES%'`)
if err != nil {
t.Fatal(err)
}
if rows.Next() {
if err = rows.Scan(&queryID, &query); err != nil {
t.Fatal(err)
}
break
}
if err = rows.Close(); err != nil {
t.Fatal(err)
}
select {
case err = <-errCh:
t.Fatalf("received an error from the runner goroutine before the query could be canceled: %v", err)
case <-timeoutCh:
t.Fatal(errors.New("didn't see the query to cancel within 10 seconds"))
default:
}
}
}(queryNum)

// Wait for the query-runner goroutine to start.
<-sem

// The cancel query races with the execution of the query it's trying to
// cancel, which may result in attempting to cancel the query before it
// has started. To be more confident that the query is executing, wait
// a bit before attempting to cancel it.
time.Sleep(250 * time.Millisecond)

const cancelQuery = `CANCEL QUERIES
SELECT query_id FROM [SHOW CLUSTER QUERIES] WHERE query not like '%SHOW CLUSTER QUERIES%'`
c.Run(ctx, c.Node(1), `./cockroach sql --insecure -e "`+cancelQuery+`"`)
cancelStartTime := timeutil.Now()

select {
case err, ok := <-errCh:
if ok {
t.Fatal(err)

t.Status(fmt.Sprintf("canceling the query after waiting for %s", timeutil.Since(pollingStartTime)))
_, err := conn.Exec(`CANCEL QUERY $1`, queryID)
if err != nil {
t.Status(fmt.Sprintf("%s: %q", queryID, query))
t.Fatalf("encountered an error when canceling %q with queryID=%s: %v", query, queryID, err)
}
// If errCh is closed, then the cancellation was successful.
timeToCancel := timeutil.Since(cancelStartTime)
fmt.Printf("canceling q%d took %s\n", queryNum, timeToCancel)
cancelStartTime := timeutil.Now()

case <-time.After(5 * time.Second):
t.Fatal("query took too long to respond to cancellation")
select {
case err := <-errCh:
t.Status(err)
if !strings.Contains(err.Error(), cancelchecker.QueryCanceledError.Error()) {
// Note that errors.Is() doesn't work here because
// lib/pq wraps the query canceled error.
t.Fatal(errors.Wrap(err, "unexpected error"))
}
timeToCancel := timeutil.Since(cancelStartTime)
t.Status(fmt.Sprintf("canceling q%d took %s\n", queryNum, timeToCancel))

case <-time.After(3 * time.Second):
t.Fatal("query took too long to respond to cancellation")
}
}
}

Expand Down
Loading

0 comments on commit da4a10b

Please sign in to comment.