Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
90451: Roachtest SSH Retries r=smg260 a=smg260

roachtest:  This change introduces a default retry when encountering an error of `255` (SSH). It is transparent to callers. If desired, it would be relatively simple to expose this via the `cluster` interface and allow callers to specify a retry handler/options on a per command basis, perhaps something like:

``` 
// retry if we get an exit code of 12
retryOpts { 
	opt: retry.Options { .. } , 
	shouldRetryFn: func(res RunResultDetails) bool { return res.RemoteExitStatus == 12 }
}
c.RunE(ctx, nodes, retryOpts, "my_cmd")
```
but this PR does not introduce that.


Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from `f(i int) ([]byte, error)` to `f(i int) RunResultDetails, error`. `RunResultDetails` has been expanded to capture information about the execution of a command, such as `stderr/our, exit code, error, attempt number`. Any roachprod error will result in an `error` being returned, and set in `RunResultDetails.Err`. Any command error would be only set in `RunResultDetails.Err` with the returned error being nil. This allows callers to differentiate between a roachprod and a command error - something which existing code depends on.

Retry handling code can use a function's returned `RunResultDetails` to determine whether a retry should take place. This PR has default retry handling on `RunResultDetails.RemoteExitStatus == 255`.

Notable changes
- Retry on `255` exit code for any function executed via `ParallelE`
- Consolidated `Run/RunWithDetails` (significant duplication of code)
- Classify exit error at the session.go level so that it is available for all commands and to all callers
- Remove `"echo $?"` option to print exit code at end of cmd. It is not clear why this was done. (Perhaps for visibility in logs?) Accompanying function to parse out the status has also been removed. The exit code is still accessible via the `RunResultDetails`
- Update roachtests to check for SSH marker error instead of checking for `!install.NonZeroExitCode`

Resolves: #73542
Release note: None
Epic: [CRDB-21386](https://cockroachlabs.atlassian.net/browse/CRDB-21386)

91765: utilccl: fix a comment r=andreimatei a=andreimatei

It was missing a key subject.

Release note: None
Epic: None

91884: changfeedccl: deflake TestAlterChangefeedTelemetry  r=stevendanna a=stevendanna

The job system clears the lease asyncronously after cancelation. This
lease clearing transaction can cause a restart in the alter changefeed
transaction, which will lead to different feature counter
counts. Thus, we want to wait for the lease clear. However, the lease
clear isn't guaranteed to happen, so we only wait a few seconds for
it.

Epic: None

Release note: None

91885: roachpb, server, ui: add fingerprint ID to details pages r=maryliag a=maryliag

**Commit 1**
ui: use format from cluster-ui
The format.js file was duplicated on db console,
making it hard to maintain.
This commit replaces all usages of the format
functions on db-console package, to the ones
from cluster-ui.

Epic: None

Release note: None

**Commit 2**
ui: add leading zeros to hex value with less than 16 chars
Previously, some fingerprint IDs in hex format would have
a leading 0, that was not being returning from the DB calls.
This was causing confusion when seeing the value on the UI
and trying to find the same fingerprint on our tables.

This commits checks the hex values used and add the missing
leading zeros back.

Part Of: #91763

Release note (bug fix): Add leading zeros to fingerprint IDs
with less than 16 characters.

**Commit 3**
roachpb, server, ui: add fingerprint ID to details pages
Previously, there was no easy way to get a statement or
fingerprint id from their respective details pages,
allowing then the value to be found on our internal tables
using cli.
This commit adds the fingerprint id on the response of
the statement details page.
It also adds the fingerprint ID to the Statement Details
page and Transaction Details page.

Fixes #91763

<img width="758" alt="Screen Shot 2022-11-14 at 6 53 40 PM" src="https://user-images.githubusercontent.com/1017486/201797428-a04fcbce-e36e-426b-aeb9-bb9250adc4d6.png">
<img width="1524" alt="Screen Shot 2022-11-14 at 6 53 58 PM" src="https://user-images.githubusercontent.com/1017486/201797451-8c7035da-d2bb-4261-ad79-88618f03e7c5.png">




Release note (ui change): Add fingerprint ID in hex format
to the Statement Details page and Transaction Details page.

Co-authored-by: Miral Gadani <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: maryliag <[email protected]>
  • Loading branch information
5 people committed Nov 15, 2022
5 parents 00245e3 + ae0ad5c + 2f0313e + a4ab9b6 + 66730b8 commit 5a24e11
Show file tree
Hide file tree
Showing 52 changed files with 745 additions and 743 deletions.
1 change: 0 additions & 1 deletion pkg/ccl/changefeedccl/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -250,7 +250,6 @@ go_test(
"//pkg/sql/types",
"//pkg/storage",
"//pkg/testutils",
"//pkg/testutils/jobutils",
"//pkg/testutils/serverutils",
"//pkg/testutils/skip",
"//pkg/testutils/sqlutils",
Expand Down
28 changes: 25 additions & 3 deletions pkg/ccl/changefeedccl/alter_changefeed_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,6 @@ import (
"github.com/cockroachdb/cockroach/pkg/sql/execinfra"
"github.com/cockroachdb/cockroach/pkg/sql/tests"
"github.com/cockroachdb/cockroach/pkg/testutils"
"github.com/cockroachdb/cockroach/pkg/testutils/jobutils"
"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
"github.com/cockroachdb/cockroach/pkg/testutils/skip"
"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
Expand Down Expand Up @@ -423,11 +422,34 @@ func TestAlterChangefeedTelemetry(t *testing.T) {

testFeed := feed(t, f, `CREATE CHANGEFEED FOR foo, bar WITH diff`)
defer closeFeed(t, testFeed)

feed := testFeed.(cdctest.EnterpriseTestFeed)

require.NoError(t, feed.Pause())
jobutils.WaitForJobToHaveNoLease(t, sqlDB, feed.JobID())

// The job system clears the lease asyncronously after
// cancellation. This lease clearing transaction can
// cause a restart in the alter changefeed
// transaction, which will lead to different feature
// counter counts. Thus, we want to wait for the lease
// clear. However, the lease clear isn't guaranteed to
// happen, so we only wait a few seconds for it.
waitForNoLease := func() {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
for {
if ctx.Err() != nil {
return
}
var sessionID []byte
sqlDB.QueryRow(t, `SELECT claim_session_id FROM system.jobs WHERE id = $1`, feed.JobID()).Scan(&sessionID)
if sessionID == nil {
return
}
time.Sleep(250 * time.Millisecond)
}
}

waitForNoLease()
sqlDB.Exec(t, fmt.Sprintf(`ALTER CHANGEFEED %d DROP bar, foo ADD baz UNSET diff SET resolved, format=json`, feed.JobID()))

counts := telemetry.GetFeatureCounts(telemetry.Raw, telemetry.ResetCounts)
Expand Down
6 changes: 4 additions & 2 deletions pkg/ccl/utilccl/license_check.go
Original file line number Diff line number Diff line change
Expand Up @@ -113,7 +113,8 @@ func ApplyTenantLicense() error {
// IsEnterpriseEnabled() instead.
//
// The ClusterID argument should be the tenant-specific logical
// cluster ID. is not used for the check itself; it is merely embedded
// cluster ID.
// `feature` is not used for the check itself; it is merely embedded
// in the URL displayed in the error message.
func CheckEnterpriseEnabled(st *cluster.Settings, cluster uuid.UUID, feature string) error {
return checkEnterpriseEnabledAt(st, timeutil.Now(), cluster, feature, true /* withDetails */)
Expand All @@ -125,7 +126,8 @@ func CheckEnterpriseEnabled(st *cluster.Settings, cluster uuid.UUID, feature str
// hot paths.
//
// The ClusterID argument should be the tenant-specific logical
// cluster ID. is not used for the check itself; it is merely embedded
// cluster ID.
// `feature` is not used for the check itself; it is merely embedded
// in the URL displayed in the error message.
func IsEnterpriseEnabled(st *cluster.Settings, cluster uuid.UUID, feature string) bool {
return checkEnterpriseEnabledAt(
Expand Down
1 change: 1 addition & 0 deletions pkg/cmd/roachtest/tests/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -191,6 +191,7 @@ go_library(
"//pkg/roachpb",
"//pkg/roachprod",
"//pkg/roachprod/config",
"//pkg/roachprod/errors",
"//pkg/roachprod/install",
"//pkg/roachprod/logger",
"//pkg/roachprod/prometheus",
Expand Down
15 changes: 5 additions & 10 deletions pkg/cmd/roachtest/tests/activerecord.go
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
rperrors "github.com/cockroachdb/cockroach/pkg/roachprod/errors"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/errors"
)
Expand Down Expand Up @@ -181,16 +182,10 @@ func registerActiveRecord(r registry.Registry) {
`sudo RUBYOPT="-W0" TESTOPTS="-v" bundle exec rake test`,
)

// Expected to fail but we should still scan the error to check if
// there's an SSH/roachprod error.
if err != nil {
// install.NonZeroExitCode includes unrelated to SSH errors ("255")
// or roachprod errors, so we call t.Fatal if the error is not an
// install.NonZeroExitCode error
commandError := (*install.NonZeroExitCode)(nil)
if !errors.As(err, &commandError) {
t.Fatal(err)
}
// Fatal for a roachprod or SSH error. A roachprod error is when result.Err==nil.
// Proceed for any other (command) errors
if err != nil && (result.Err == nil || errors.Is(err, rperrors.ErrSSH255)) {
t.Fatal(err)
}

// Result error contains stdout, stderr, and any error content returned by exec package.
Expand Down
15 changes: 5 additions & 10 deletions pkg/cmd/roachtest/tests/django.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/spec"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
rperrors "github.com/cockroachdb/cockroach/pkg/roachprod/errors"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/errors"
)
Expand Down Expand Up @@ -186,16 +187,10 @@ func registerDjango(r registry.Registry) {
t.Status("Running django test app ", testName)
result, err := c.RunWithDetailsSingleNode(ctx, t.L(), node, fmt.Sprintf(djangoRunTestCmd, testName))

// Expected to fail but we should still scan the error to check if
// there's an SSH/roachprod error.
if err != nil {
// install.NonZeroExitCode includes unrelated to SSH errors ("255")
// or roachprod errors, so we call t.Fatal if the error is not an
// install.NonZeroExitCode error
commandError := (*install.NonZeroExitCode)(nil)
if !errors.As(err, &commandError) {
t.Fatal(err)
}
// Fatal for a roachprod or SSH error. A roachprod error is when result.Err==nil.
// Proceed for any other (command) errors
if err != nil && (result.Err == nil || errors.Is(err, rperrors.ErrSSH255)) {
t.Fatal(err)
}

rawResults := []byte(result.Stdout + result.Stderr)
Expand Down
29 changes: 9 additions & 20 deletions pkg/cmd/roachtest/tests/gopg.go
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
rperrors "github.com/cockroachdb/cockroach/pkg/roachprod/errors"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/errors"
)
Expand Down Expand Up @@ -115,16 +116,10 @@ func registerGopg(r registry.Registry) {
destPath, removeColorCodes, resultsFilePath),
)

// Expected to fail but we should still scan the error to check if
// there's an SSH/roachprod error.
if err != nil {
// install.NonZeroExitCode includes unrelated to SSH errors ("255")
// or roachprod errors, so we call t.Fatal if the error is not an
// install.NonZeroExitCode error
commandError := (*install.NonZeroExitCode)(nil)
if !errors.As(err, &commandError) {
t.Fatal(err)
}
// Fatal for a roachprod or SSH error. A roachprod error is when result.Err==nil.
// Proceed for any other (command) errors
if err != nil && (result.Err == nil || errors.Is(err, rperrors.ErrSSH255)) {
t.Fatal(err)
}

rawResults := []byte(result.Stdout + result.Stderr)
Expand Down Expand Up @@ -152,16 +147,10 @@ func registerGopg(r registry.Registry) {
destPath, goPath, resultsFilePath, goPath),
)

// Expected to fail but we should still scan the error to check if
// there's an SSH/roachprod error.
if err != nil {
// install.NonZeroExitCode includes unrelated to SSH errors ("255")
// or roachprod errors, so we call t.Fatal if the error is not an
// install.NonZeroExitCode error
commandError := (*install.NonZeroExitCode)(nil)
if !errors.As(err, &commandError) {
t.Fatal(err)
}
// Fatal for a roachprod or SSH error. A roachprod error is when result.Err==nil.
// Proceed for any other (command) errors
if err != nil && (result.Err == nil || errors.Is(err, rperrors.ErrSSH255)) {
t.Fatal(err)
}

xmlResults := []byte(result.Stdout + result.Stderr)
Expand Down
2 changes: 1 addition & 1 deletion pkg/cmd/roachtest/tests/jepsen.go
Original file line number Diff line number Diff line change
Expand Up @@ -165,7 +165,7 @@ func initJepsen(ctx context.Context, t test.Test, c cluster.Cluster, j jepsenCon
ctx, t.L(), controller, "sh", "-c",
`"sudo DEBIAN_FRONTEND=noninteractive apt-get -qqy install openjdk-8-jre openjdk-8-jre-headless libjna-java gnuplot > /dev/null 2>&1"`,
); err != nil {
if result.RemoteExitStatus == "100" {
if result.RemoteExitStatus == 100 {
t.Skip("apt-get failure (#31944)", result.Stdout+result.Stderr)
}
t.Fatal(err)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -84,14 +84,14 @@ func validateCorpusFile(
// Detect validation failures in standard output first, and dump those out.
failureRegex := regexp.MustCompile(`failed to validate.*`)
if matches := failureRegex.FindAllString(details.Stdout, -1); len(matches) > 0 {
t.Fatalf("Validation of corpus has failed (exit status %s): \n%s",
t.Fatalf("Validation of corpus has failed (exit status %d): \n%s",
details.RemoteExitStatus,
strings.Join(matches, "\n"))
}

// If no error is logged dump out both stdout and std error.
if details.RemoteExitStatus != "0" {
t.Fatalf("Validation command failed with exist status %s, output:\n %s\n%s\n",
if details.RemoteExitStatus != 0 {
t.Fatalf("Validation command failed with exit status %d, output:\n %s\n%s\n",
details.RemoteExitStatus,
details.Stdout,
details.Stderr,
Expand Down
15 changes: 5 additions & 10 deletions pkg/cmd/roachtest/tests/nodejs_postgres.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
rperrors "github.com/cockroachdb/cockroach/pkg/roachprod/errors"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/errors"
"github.com/stretchr/testify/require"
Expand Down Expand Up @@ -141,16 +142,10 @@ PGSSLCERT=$HOME/certs/client.%s.crt PGSSLKEY=$HOME/certs/client.%s.key PGSSLROOT
),
)

// Expected to fail but we should still scan the error to check if
// there's an SSH/roachprod error.
if err != nil {
// install.NonZeroExitCode includes unrelated to SSH errors ("255")
// or roachprod errors, so we call t.Fatal if the error is not an
// install.NonZeroExitCode error
commandError := (*install.NonZeroExitCode)(nil)
if !errors.As(err, &commandError) {
t.Fatal(err)
}
// Fatal for a roachprod or SSH error. A roachprod error is when result.Err==nil.
// Proceed for any other (command) errors
if err != nil && (result.Err == nil || errors.Is(err, rperrors.ErrSSH255)) {
t.Fatal(err)
}

rawResultsStr := result.Stdout + result.Stderr
Expand Down
15 changes: 5 additions & 10 deletions pkg/cmd/roachtest/tests/pgx.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
rperrors "github.com/cockroachdb/cockroach/pkg/roachprod/errors"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/errors"
)
Expand Down Expand Up @@ -121,16 +122,10 @@ func registerPgx(r registry.Registry) {
"`go env GOPATH`/bin/go-junit-report",
)

// Expected to fail but we should still scan the error to check if
// there's an SSH/roachprod error.
if err != nil {
// install.NonZeroExitCode includes unrelated to SSH errors ("255")
// or roachprod errors, so we call t.Fatal if the error is not an
// install.NonZeroExitCode error
commandError := (*install.NonZeroExitCode)(nil)
if !errors.As(err, &commandError) {
t.Fatal(err)
}
// Fatal for a roachprod or SSH error. A roachprod error is when result.Err==nil.
// Proceed for any other (command) errors
if err != nil && (result.Err == nil || errors.Is(err, rperrors.ErrSSH255)) {
t.Fatal(err)
}

// Result error contains stdout, stderr, and any error content returned by exec package.
Expand Down
15 changes: 5 additions & 10 deletions pkg/cmd/roachtest/tests/psycopg.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
rperrors "github.com/cockroachdb/cockroach/pkg/roachprod/errors"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/errors"
)
Expand Down Expand Up @@ -124,16 +125,10 @@ func registerPsycopg(r registry.Registry) {
make check PYTHON_VERSION=3`,
)

// Expected to fail but we should still scan the error to check if
// there's an SSH/roachprod error.
if err != nil {
// install.NonZeroExitCode includes unrelated to SSH errors ("255")
// or roachprod errors, so we call t.Fatal if the error is not an
// install.NonZeroExitCode error
commandError := (*install.NonZeroExitCode)(nil)
if !errors.As(err, &commandError) {
t.Fatal(err)
}
// Fatal for a roachprod or SSH error. A roachprod error is when result.Err==nil.
// Proceed for any other (command) errors
if err != nil && (result.Err == nil || errors.Is(err, rperrors.ErrSSH255)) {
t.Fatal(err)
}

// Result error contains stdout, stderr, and any error content returned by exec package.
Expand Down
15 changes: 5 additions & 10 deletions pkg/cmd/roachtest/tests/ruby_pg.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
rperrors "github.com/cockroachdb/cockroach/pkg/roachprod/errors"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/errors"
"github.com/stretchr/testify/require"
Expand Down Expand Up @@ -152,16 +153,10 @@ func registerRubyPG(r registry.Registry) {
`cd /mnt/data1/ruby-pg/ && bundle exec rake compile test`,
)

// Expected to fail but we should still scan the error to check if
// there's an SSH/roachprod error.
if err != nil {
// install.NonZeroExitCode includes unrelated to SSH errors ("255")
// or roachprod errors, so we call t.Fatal if the error is not an
// install.NonZeroExitCode error
commandError := (*install.NonZeroExitCode)(nil)
if !errors.As(err, &commandError) {
t.Fatal(err)
}
// Fatal for a roachprod or SSH error. A roachprod error is when result.Err==nil.
// Proceed for any other (command) errors
if err != nil && (result.Err == nil || errors.Is(err, rperrors.ErrSSH255)) {
t.Fatal(err)
}

rawResults := []byte(result.Stdout + result.Stderr)
Expand Down
15 changes: 5 additions & 10 deletions pkg/cmd/roachtest/tests/sqlalchemy.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/option"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
rperrors "github.com/cockroachdb/cockroach/pkg/roachprod/errors"
"github.com/cockroachdb/cockroach/pkg/roachprod/install"
"github.com/cockroachdb/errors"
)
Expand Down Expand Up @@ -171,16 +172,10 @@ func runSQLAlchemy(ctx context.Context, t test.Test, c cluster.Cluster) {
test/test_suite_sqlalchemy.py
`)

// Expected to fail but we should still scan the error to check if
// there's an SSH/roachprod error.
if err != nil {
// install.NonZeroExitCode includes unrelated to SSH errors ("255")
// or roachprod errors, so we call t.Fatal if the error is not an
// install.NonZeroExitCode error
commandError := (*install.NonZeroExitCode)(nil)
if !errors.As(err, &commandError) {
t.Fatal(err)
}
// Fatal for a roachprod or SSH error. A roachprod error is when result.Err==nil.
// Proceed for any other (command) errors
if err != nil && (result.Err == nil || errors.Is(err, rperrors.ErrSSH255)) {
t.Fatal(err)
}

rawResults := []byte(result.Stdout + result.Stderr)
Expand Down
1 change: 1 addition & 0 deletions pkg/roachpb/app_stats.proto
Original file line number Diff line number Diff line change
Expand Up @@ -224,6 +224,7 @@ message AggregatedStatementMetadata {
optional int64 full_scan_count = 10 [(gogoproto.nullable) = false];
optional int64 vec_count = 11 [(gogoproto.nullable) = false];
optional int64 total_count = 12 [(gogoproto.nullable) = false];
optional string fingerprint_id = 13 [(gogoproto.nullable) = false, (gogoproto.customname) = "FingerprintID"];
}

// CollectedStatementStatistics wraps collected timings and metadata for some
Expand Down
Loading

0 comments on commit 5a24e11

Please sign in to comment.