Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
76516: server: improve visibility of ranges that fail to move during decommissioning r=knz,aayushshah15 a=cameronnunez

Fixes #76249. Informs #74158.

This patch makes it so that when a decommission is slow or stalls, the 
descriptions of some "stuck" replicas are printed to the operator.

Release note (cli change): if decommissioning is slow or stalls, decommissioning
replicas are printed to the operator.

Release justification: low risk, high benefit changes to existing functionality

78433: ci: report failure to generate code as a test failure to teamcity r=rail a=rickystewart

Closes #78368.

Release note: None

78495: cluster: Revert "cluster: use WaitConditionNextExit" r=otan a=rickystewart

This reverts commit 6543749.
That commit was an (unsuccessful) attempt to fix #58955, and in the
presence of this change the `acceptance` tests are very likely to hang
forever under Ubuntu 20.04 due to a race condition where the container
exits before we begin waiting on it.

Release note: None

78525: dev: add `ui clean` subcommand, update `Makefile` to point to `dev` r=rail a=rickystewart

Add `ui clean` and `ui clean --all`; the former does approximately the
same as `make ui-clean`, the latter does approximately the same as
`make ui-maintainer-clean`.

Release note: None

78561: sqlsmith: fix deadlock during generation r=rharding6373 a=rharding6373

Fixes a deadlock when sqlsmith generates join expressions.

Fixes: #78555

Release note: None

78602: roachpb,kvserver: tweak ReplicaUnavailableError rendering r=erikgrinaker a=tbg

Release justification: minor UX tweakto an error message
Release note: None


78617: roachtest: wait for 3x replication in disk-full r=tbg a=nicktrav

Improve on #78456 by waiting fro 3x replication, rather than 2x.

Release note: None.

78629: ci: make sure we're streaming test output when `stress`ing via Bazel r=mari-crl a=rickystewart

Without this, you don't see the helpful "ran X iterations so far, Y
failures" messages from `stress`.

Release note: None

Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
  • Loading branch information
6 people committed Mar 28, 2022
9 parents 2edffa9 + 920d94c + 37d3c8c + c4b415f + d2aeb57 + 25dcdb6 + f1a8909 + 9e35a71 + b59b3bc commit f8e03e2
Show file tree
Hide file tree
Showing 22 changed files with 358 additions and 31 deletions.
5 changes: 5 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -1390,11 +1390,13 @@ pkg/ui/workspaces/db-console/dist/%.ccl.dll.js pkg/ui/workspaces/db-console/%.cc

.PHONY: ui-test
ui-test: $(UI_CCL_DLLS) $(UI_CCL_MANIFESTS)
$(info $(yellow)NOTE: consider using `./dev ui test` instead of `make ui-test`$(term-reset))
$(NODE_RUN) -C pkg/ui/workspaces/db-console $(KARMA) start
$(NODE_RUN) -C pkg/ui/workspaces/cluster-ui yarn ci

.PHONY: ui-test-watch
ui-test-watch: $(UI_CCL_DLLS) $(UI_CCL_MANIFESTS)
$(info $(yellow)NOTE: consider using `./dev ui test --watch` instead of `make ui-test-watch`$(term-reset))
$(NODE_RUN) -C pkg/ui/workspaces/db-console $(KARMA) start --no-single-run --auto-watch & \
$(NODE_RUN) -C pkg/ui/workspaces/cluster-ui yarn test

Expand Down Expand Up @@ -1430,11 +1432,13 @@ ui-watch ui-watch-secure: $(UI_CCL_DLLS) pkg/ui/yarn.opt.installed
#
# `node-run.sh` wrapper is removed because this command is supposed to be run in dev environment (not in docker of CI)
# so it is safe to run yarn commands directly to preserve formatting and colors for outputs
$(info $(yellow)NOTE: consider using `./dev ui watch [--secure]` instead of `make ui-watch[-secure]`$(term-reset))
yarn --cwd pkg/ui/workspaces/cluster-ui build:watch & \
yarn --cwd pkg/ui/workspaces/db-console webpack-dev-server --config webpack.app.js --env.dist=ccl --env.WEBPACK_SERVE --port $(PORT) --mode "development" $(WEBPACK_DEV_SERVER_FLAGS)

.PHONY: ui-clean
ui-clean: ## Remove build artifacts.
$(info $(yellow)NOTE: consider using `./dev ui clean` instead of `make ui-clean`$(term-reset))
find pkg/ui/distccl/assets pkg/ui/distoss/assets -mindepth 1 -not -name .gitkeep -delete
rm -rf pkg/ui/assets.ccl.installed pkg/ui/assets.oss.installed
rm -rf pkg/ui/dist_vendor/*
Expand All @@ -1445,6 +1449,7 @@ ui-clean: ## Remove build artifacts.
.PHONY: ui-maintainer-clean
ui-maintainer-clean: ## Like clean, but also remove installed dependencies
ui-maintainer-clean: ui-clean
$(info $(yellow)NOTE: consider using `./dev ui clean --all` instead of `make ui-maintainer-clean`$(term-reset))
rm -rf pkg/ui/node_modules pkg/ui/workspaces/db-console/node_modules pkg/ui/yarn.installed pkg/ui/workspaces/cluster-ui/node_modules

pkg/roachprod/vm/aws/embedded.go: bin/.bootstrap pkg/roachprod/vm/aws/config.json pkg/roachprod/vm/aws/old.json bin/terraformgen
Expand Down
14 changes: 9 additions & 5 deletions build/teamcity-check-genfiles.sh
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,8 @@ tc_prepare

tc_start_block "Ensure generated code is up-to-date"

begin_check_generated_code_tests

# NOTE(ricky): Please make sure any changes to the Bazel-related checks here are
# propagated to build/teamcity/cockroach/ci/tests/check_generated_code_impl.sh
# as well.
Expand All @@ -29,24 +31,26 @@ if grep TODO DEPS.bzl; then
echo "Missing TODO comment in DEPS.bzl. Did you run \`./dev generate bazel --mirror\`?"
exit 1
fi
check_workspace_clean "Run \`./dev generate bazel\` to automatically regenerate these."
check_workspace_clean 'dev_generate_bazel' "Run \`./dev generate bazel\` to automatically regenerate these."
run build/builder.sh make generate &> artifacts/generate.log || (cat artifacts/generate.log && false)
rm artifacts/generate.log
check_workspace_clean "Run \`make generate\` to automatically regenerate these."
check_workspace_clean 'make_generate' "Run \`make generate\` to automatically regenerate these."
run build/builder.sh make buildshort &> artifacts/buildshort.log || (cat artifacts/buildshort.log && false)
rm artifacts/buildshort.log
check_workspace_clean "Run \`make buildshort\` to automatically regenerate these."
check_workspace_clean 'make_buildshort' "Run \`make buildshort\` to automatically regenerate these."
tc_end_block "Ensure generated code is up-to-date"

# generated code can generate new dependencies; check dependencies after generated code.
tc_start_block "Ensure dependencies are up-to-date"
# Run go mod tidy and `make -k vendor_rebuild` and ensure nothing changes.
run build/builder.sh go mod tidy
check_workspace_clean "Run \`go mod tidy\` and \`make -k vendor_rebuild\` to automatically regenerate these."
check_workspace_clean 'go_mod_tidy' "Run \`go mod tidy\` and \`make -k vendor_rebuild\` to automatically regenerate these."
run build/builder.sh make -k vendor_rebuild
cd vendor
check_workspace_clean "Run \`make -k vendor_rebuild\` to automatically regenerate these."
check_workspace_clean 'vendor_rebuild' "Run \`make -k vendor_rebuild\` to automatically regenerate these."
cd ..

end_check_generated_code_tests
tc_end_block "Ensure dependencies are up-to-date"

tc_start_block "Test web UI"
Expand Down
18 changes: 15 additions & 3 deletions build/teamcity-support.sh
Original file line number Diff line number Diff line change
Expand Up @@ -307,17 +307,29 @@ generate_ssh_key() {
fi
}

# Call this function with one argument, the error message to print if the
# workspace is dirty.
begin_check_generated_code_tests() {
echo "##teamcity[testSuiteStarted name='CheckGeneratedCode']"
}

end_check_generated_code_tests() {
echo "##teamcity[testSuiteFinished name='CheckGeneratedCode']"
}

# Call this function with two arguments: the name of the "test" that will be
# reported to teamcity and the error message to print if the workspace is dirty.
check_workspace_clean() {
echo "##teamcity[testStarted name='CheckGeneratedCode/$1' captureStandardOutput='true']"
# The workspace is clean iff `git status --porcelain` produces no output. Any
# output is either an error message or a listing of an untracked/dirty file.
if [[ "$(git status --porcelain 2>&1)" != "" ]]; then
git status >&2 || true
git diff -a >&2 || true
echo "====================================================" >&2
echo "Some automatically generated code is not up to date." >&2
echo $1 >&2
echo $2 >&2
echo "##teamcity[testFailed name='CheckGeneratedCode/$1']"
echo "##teamcity[testFinished name='CheckGeneratedCode/$1']"
exit 1
fi
echo "##teamcity[testFinished name='CheckGeneratedCode/$1']"
}
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@ source "$dir/teamcity-support.sh" # For $root, check_workspace_clean

mkdir -p artifacts

begin_check_generated_code_tests

# Buffer noisy output and only print it on failure.
if ! (./build/bazelutil/check.sh &> artifacts/buildshort.log || (cat artifacts/buildshort.log && false)); then
# The command will output instructions on how to fix the error.
Expand All @@ -24,11 +26,13 @@ if grep TODO DEPS.bzl; then
echo "Missing TODO comment in DEPS.bzl. Did you run \`./dev generate bazel --mirror\`?"
exit 1
fi
check_workspace_clean "Run \`./dev generate bazel\` to automatically regenerate these."
check_workspace_clean 'dev_generate_bazel' "Run \`./dev generate bazel\` to automatically regenerate these."

# Run go mod tidy and ensure nothing changes.
# NB: If files are missing from any packages then `go mod tidy` will
# fail. So we need to make sure that `.pb.go` sources are populated.
bazel run //pkg/gen:go_proto
bazel run @go_sdk//:bin/go --ui_event_filters=-DEBUG,-info,-stdout,-stderr --noshow_progress mod tidy
check_workspace_clean "Run \`go mod tidy\` to automatically regenerate these."
check_workspace_clean 'go_mod_tidy' "Run \`go mod tidy\` to automatically regenerate these."

end_check_generated_code_tests
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ $BAZEL_BIN/pkg/cmd/bazci/bazci_/bazci --config=ci test \
--run_under "@com_github_cockroachdb_stress//:stress -bazel -shardable-artifacts 'GO_TEST_JSON_OUTPUT_FILE=cat,XML_OUTPUT_FILE=$BAZEL_BIN/pkg/cmd/bazci/bazci_/bazci merge-test-xmls' -maxtime 3h -maxfails 1 -stderr -p 1" \
--test_arg -dir --test_arg $ARTIFACTS_DIR \
--test_arg -ops --test_arg "uniform:5000-10000" \
--test_output streamed \
|| exit_status=$?

BAZEL_SUPPORT_EXTRA_GITHUB_POST_ARGS=--formatter=pebble-metamorphic process_test_json \
Expand Down
1 change: 1 addition & 0 deletions build/teamcity/cockroach/nightlies/stress_impl.sh
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@ do
--run_under "@com_github_cockroachdb_stress//:stress -bazel -shardable-artifacts 'GO_TEST_JSON_OUTPUT_FILE=cat,XML_OUTPUT_FILE=$BAZEL_BIN/pkg/cmd/bazci/bazci_/bazci merge-test-xmls' $STRESSFLAGS" \
--define "gotags=$TAGS" \
--nocache_test_results \
--test_output streamed \
${EXTRA_BAZEL_FLAGS} \
|| exit_status=$?
process_test_json \
Expand Down
2 changes: 1 addition & 1 deletion dev
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
set -euo pipefail

# Bump this counter to force rebuilding `dev` on all machines.
DEV_VERSION=25
DEV_VERSION=26

THIS_DIR=$(cd "$(dirname "$0")" && pwd)
BINARY_DIR=$THIS_DIR/bin/dev-versions
Expand Down
32 changes: 32 additions & 0 deletions docs/generated/http/full.md
Original file line number Diff line number Diff line change
Expand Up @@ -6194,6 +6194,7 @@ If no NodeIDs are given, it targets the recipient node.
| ----- | ---- | ----- | ----------- | -------------- |
| node_ids | [int32](#cockroach.server.serverpb.DecommissionRequest-int32) | repeated | | [reserved](#support-status) |
| target_membership | [cockroach.kv.kvserver.liveness.livenesspb.MembershipStatus](#cockroach.server.serverpb.DecommissionRequest-cockroach.kv.kvserver.liveness.livenesspb.MembershipStatus) | | | [reserved](#support-status) |
| num_replica_report | [int32](#cockroach.server.serverpb.DecommissionRequest-int32) | | The number of decommissioning replicas to be reported. | [reserved](#support-status) |



Expand Down Expand Up @@ -6230,6 +6231,21 @@ DecommissionStatusResponse lists decommissioning statuses for a number of NodeID
| replica_count | [int64](#cockroach.server.serverpb.DecommissionStatusResponse-int64) | | The number of replicas on the node, computed by scanning meta2 ranges. | [reserved](#support-status) |
| membership | [cockroach.kv.kvserver.liveness.livenesspb.MembershipStatus](#cockroach.server.serverpb.DecommissionStatusResponse-cockroach.kv.kvserver.liveness.livenesspb.MembershipStatus) | | The membership status of the given node. | [reserved](#support-status) |
| draining | [bool](#cockroach.server.serverpb.DecommissionStatusResponse-bool) | | | [reserved](#support-status) |
| reported_replicas | [DecommissionStatusResponse.Replica](#cockroach.server.serverpb.DecommissionStatusResponse-cockroach.server.serverpb.DecommissionStatusResponse.Replica) | repeated | Decommissioning replicas on the given node to be reported. How many replicas are reported is determined by what was specified in the request. | [reserved](#support-status) |





<a name="cockroach.server.serverpb.DecommissionStatusResponse-cockroach.server.serverpb.DecommissionStatusResponse.Replica"></a>
#### DecommissionStatusResponse.Replica



| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| replica_id | [int32](#cockroach.server.serverpb.DecommissionStatusResponse-int32) | | | [reserved](#support-status) |
| range_id | [int32](#cockroach.server.serverpb.DecommissionStatusResponse-int32) | | | [reserved](#support-status) |



Expand Down Expand Up @@ -6258,6 +6274,7 @@ specified or, if none are specified, all nodes.
| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| node_ids | [int32](#cockroach.server.serverpb.DecommissionStatusRequest-int32) | repeated | | [reserved](#support-status) |
| num_replica_report | [int32](#cockroach.server.serverpb.DecommissionStatusRequest-int32) | | The number of decommissioning replicas to be reported. | [reserved](#support-status) |



Expand Down Expand Up @@ -6294,6 +6311,21 @@ DecommissionStatusResponse lists decommissioning statuses for a number of NodeID
| replica_count | [int64](#cockroach.server.serverpb.DecommissionStatusResponse-int64) | | The number of replicas on the node, computed by scanning meta2 ranges. | [reserved](#support-status) |
| membership | [cockroach.kv.kvserver.liveness.livenesspb.MembershipStatus](#cockroach.server.serverpb.DecommissionStatusResponse-cockroach.kv.kvserver.liveness.livenesspb.MembershipStatus) | | The membership status of the given node. | [reserved](#support-status) |
| draining | [bool](#cockroach.server.serverpb.DecommissionStatusResponse-bool) | | | [reserved](#support-status) |
| reported_replicas | [DecommissionStatusResponse.Replica](#cockroach.server.serverpb.DecommissionStatusResponse-cockroach.server.serverpb.DecommissionStatusResponse.Replica) | repeated | Decommissioning replicas on the given node to be reported. How many replicas are reported is determined by what was specified in the request. | [reserved](#support-status) |





<a name="cockroach.server.serverpb.DecommissionStatusResponse-cockroach.server.serverpb.DecommissionStatusResponse.Replica"></a>
#### DecommissionStatusResponse.Replica



| Field | Type | Label | Description | Support status |
| ----- | ---- | ----- | ----------- | -------------- |
| replica_id | [int32](#cockroach.server.serverpb.DecommissionStatusResponse-int32) | | | [reserved](#support-status) |
| range_id | [int32](#cockroach.server.serverpb.DecommissionStatusResponse-int32) | | | [reserved](#support-status) |



Expand Down
4 changes: 2 additions & 2 deletions pkg/acceptance/cluster/dockercluster.go
Original file line number Diff line number Diff line change
Expand Up @@ -235,7 +235,7 @@ func (l *DockerCluster) OneShot(
if err := l.oneshot.Start(ctx); err != nil {
return err
}
return l.oneshot.Wait(ctx, container.WaitConditionNextExit)
return l.oneshot.Wait(ctx, container.WaitConditionNotRunning)
}

// stopOnPanic is invoked as a deferred function in Start in order to attempt
Expand Down Expand Up @@ -372,7 +372,7 @@ func (l *DockerCluster) initCluster(ctx context.Context) {
// and it'll get in the way of future runs.
l.vols = c
maybePanic(c.Start(ctx))
maybePanic(c.Wait(ctx, container.WaitConditionNextExit))
maybePanic(c.Wait(ctx, container.WaitConditionNotRunning))
}

// cockroachEntryPoint returns the value to be used as
Expand Down
48 changes: 47 additions & 1 deletion pkg/cli/node.go
Original file line number Diff line number Diff line change
Expand Up @@ -457,6 +457,17 @@ func runDecommissionNodeImpl(
MaxBackoff: 20 * time.Second,
}

// Log verbosity is increased when there is possibly a decommission stall.
// If the number of decommissioning replicas does not decrease after some time
// (i.e. the decommission status has not changed after
// sameStatusThreshold iterations), verbosity is automatically set.
// Some decommissioning replicas will be reported to the operator.
const sameStatusThreshold = 15
var (
numReplicaReport = 0
sameStatusCount = 0
)

// Decommissioning a node is driven by a three-step process.
// 1) Mark each node as 'decommissioning'. In doing so, all replicas are
// slowly moved off of these nodes.
Expand All @@ -472,21 +483,41 @@ func runDecommissionNodeImpl(
req := &serverpb.DecommissionRequest{
NodeIDs: nodeIDs,
TargetMembership: livenesspb.MembershipStatus_DECOMMISSIONING,
NumReplicaReport: int32(numReplicaReport),
}
resp, err := c.Decommission(ctx, req)
if err != nil {
fmt.Fprintln(stderr)
return errors.Wrap(err, "while trying to mark as decommissioning")
}

if numReplicaReport > 0 {
printDecommissionReplicas(ctx, *resp)
}

if !reflect.DeepEqual(&prevResponse, resp) {
fmt.Fprintln(stderr)
if err := printDecommissionStatus(*resp); err != nil {
if err = printDecommissionStatus(*resp); err != nil {
return err
}
prevResponse = *resp

// The decommissioning status changed. Set `sameStatusCount` back to zero.
sameStatusCount = 0
numReplicaReport = 0
} else {
// Print a marker indicating that there has been no progress,
// instead of printing the same status.
fmt.Fprintf(stderr, ".")

// Report decommissioning replicas if there's been significant time of
// no progress.
if sameStatusCount >= sameStatusThreshold && numReplicaReport == 0 {
// Configure a number of replicas to report.
numReplicaReport = 5
} else {
sameStatusCount++
}
}

anyActive := false
Expand Down Expand Up @@ -591,6 +622,21 @@ func printDecommissionStatus(resp serverpb.DecommissionStatusResponse) error {
clisqlexec.NewRowSliceIter(decommissionResponseValueToRows(resp.Status), decommissionResponseAlignment()))
}

func printDecommissionReplicas(ctx context.Context, resp serverpb.DecommissionStatusResponse) {
fmt.Fprintln(stderr, "\npossible decommission stall detected; reporting decommissioning replicas")

for _, nodeStatus := range resp.Status {
for _, replica := range nodeStatus.ReportedReplicas {
fmt.Fprintf(stderr,
"n%d still has replica id %d for range r%d",
nodeStatus.NodeID,
replica.ReplicaID,
replica.RangeID,
)
}
}
}

func runRecommissionNode(cmd *cobra.Command, args []string) error {
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
Expand Down
24 changes: 24 additions & 0 deletions pkg/cmd/dev/testdata/datadriven/ui
Original file line number Diff line number Diff line change
Expand Up @@ -73,3 +73,27 @@ cp -r sandbox/pkg/ui/workspaces/cluster-ui/dist crdb-checkout/pkg/ui/workspaces/
bazel info workspace --color=no
bazel run @nodejs//:yarn -- --silent --cwd crdb-checkout/pkg/ui/workspaces/db-console karma:watch
bazel run @nodejs//:yarn -- --silent --cwd crdb-checkout/pkg/ui/workspaces/cluster-ui jest --watch

exec
dev ui clean
----
bazel info workspace --color=no
rm -rf crdb-checkout/pkg/ui/workspaces/db-console/src/js/protos.js
rm -rf crdb-checkout/pkg/ui/workspaces/db-console/src/js/protos.d.ts
rm -rf crdb-checkout/pkg/ui/workspaces/db-console/ccl/src/js/protos.js
rm -rf crdb-checkout/pkg/ui/workspaces/db-console/ccl/src/js/protos.d.ts
rm -rf crdb-checkout/pkg/ui/workspaces/cluster-ui/dist

exec
dev ui clean --all
----
bazel info workspace --color=no
bazel info workspace --color=no
rm -rf crdb-checkout/pkg/ui/workspaces/db-console/src/js/protos.js
rm -rf crdb-checkout/pkg/ui/workspaces/db-console/src/js/protos.d.ts
rm -rf crdb-checkout/pkg/ui/workspaces/db-console/ccl/src/js/protos.js
rm -rf crdb-checkout/pkg/ui/workspaces/db-console/ccl/src/js/protos.d.ts
rm -rf crdb-checkout/pkg/ui/workspaces/cluster-ui/dist
rm -rf crdb-checkout/pkg/ui/node_modules
rm -rf crdb-checkout/pkg/ui/workspaces/db-console/node_modules
rm -rf crdb-checkout/pkg/ui/workspaces/cluster-ui/node_modules
Loading

0 comments on commit f8e03e2

Please sign in to comment.