Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli: add debug send-kv-batch command to run arbitrary BatchRequest #72732

Merged
merged 2 commits into from
Nov 25, 2021

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Nov 14, 2021

This patch adds a CLI command debug send-kv-batch that takes a JSON
BatchRequest as input, proxies it into KV via the connected node's
admin server, and outputs a JSON BatchResponse. An admin user is
required, typically the root user, and the request is logged to the
system event log for auditability.

This is a useful backdoor to access arbitrary KV functionality for
debugging and surgery, e.g. during support escalations.

Resolves #66829.
Resolves #67001.

Release note: None


@knz It does not seem like admin commands generally allow specifying a user (e.g. via --user), and implicitly require the root user, so I followed that convention here as well. Let me know if we need to extend this.

@erikgrinaker erikgrinaker requested review from dt, knz and a team November 14, 2021 15:40
@erikgrinaker erikgrinaker requested a review from a team as a code owner November 14, 2021 15:40
@erikgrinaker erikgrinaker self-assigned this Nov 14, 2021
@erikgrinaker erikgrinaker requested review from a team as code owners November 14, 2021 15:40
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker removed the request for review from a team November 14, 2021 15:40
@erikgrinaker erikgrinaker force-pushed the cli-send-rpc branch 3 times, most recently from 4078316 to f31b9fe Compare November 14, 2021 15:53
@erikgrinaker erikgrinaker changed the title cli: add debug send-kv-batch command to run BatchRequests cli: add debug send-kv-batch command to run arbitrary BatchRequest Nov 14, 2021
@erikgrinaker erikgrinaker force-pushed the cli-send-rpc branch 3 times, most recently from 7ca1320 to 66e0052 Compare November 15, 2021 11:12
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, LGTM and I'll defer to @knz about making sure this is documented and set up in the right way.

I did poke around a bit and ended up rewriting the test a bit, pushed this as a separate commit. Take it or leave it, fine either way.

Reviewed 15 of 15 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, @knz, and @tbg)


docs/generated/eventlog.md, line 168 at r1 (raw file):

| `Timestamp` | The timestamp of the event. Expressed as nanoseconds since the Unix epoch. | no |
| `EventType` | The type of the event. | no |
| `NodeID` | The node ID where the event was originated. | no |

where the event originated


pkg/cli/debug.go, line 1505 at r1 (raw file):

}

var debugSendKVBatchCmd = &cobra.Command{

mind putting this and the impl + tests in a new files debug_send_kv_batch{,_test}.go?


pkg/cli/debug.go, line 1511 at r1 (raw file):

	RunE:  runSendKVBatch,
	Long: `
	Submits a JSON-formatted roachpb.BatchRequest, given as a file or via stdin,

The spaces at the beginning of each line shouldn't be there IMO


pkg/cli/debug_test.go, line 522 at r1 (raw file):

	require.Contains(t, output, "ERROR: invalid JSON")

	// Unknown JSON field should error

.


pkg/server/admin.go, line 2627 at r1 (raw file):

	}
	nodeID := int32(s.server.NodeID())
	event := &eventpb.DebugSendKvBatch{

customname DebugSendKVBatch?


pkg/server/admin.go, line 2644 at r1 (raw file):

	})
	if err != nil {
		log.Warningf(ctx, "failed to log SendKVBatch to event log, emitting below: %v", err)

I think structured events are unlikely to hit the same log as a Warning, so the "below" may not make sense. Maybe "emitting as structured event"?


pkg/util/log/eventpb/debug_events.proto, line 32 at r1 (raw file):

// CommonDebugEventDetails contains the fields common to all debugging events.
message CommonDebugEventDetails {
  // The node ID where the event was originated.

-was

@tbg tbg self-requested a review November 15, 2021 13:15
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, your test hit what I expect will be a common mistake: Protobuf structs cannot be (un)marshalled using the standard Golang JSON marshaller. You must use the JSON marshaller that's shipped with the Protobuf library, e.g. https://pkg.go.dev/github.com/gogo/protobuf/jsonpb.

I'll update the test to use this, with a big fat warning to anyone following its example.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @knz, and @tbg)


pkg/cli/debug.go, line 1505 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

mind putting this and the impl + tests in a new files debug_send_kv_batch{,_test}.go?

Yup, done.


pkg/cli/debug.go, line 1511 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The spaces at the beginning of each line shouldn't be there IMO

Ah, my bad -- thought I'd seen other commands indent it.


pkg/server/admin.go, line 2627 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

customname DebugSendKVBatch?

I don't think customname can be applied to messages, only fields. Unfortunately, we need the Kv here to get the snake-cased event type generated correctly.


pkg/server/admin.go, line 2644 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I think structured events are unlikely to hit the same log as a Warning, so the "below" may not make sense. Maybe "emitting as structured event"?

Right, good catch.

@tbg tbg removed their request for review November 15, 2021 15:50
@erikgrinaker erikgrinaker force-pushed the cli-send-rpc branch 3 times, most recently from 98d0452 to 7769e60 Compare November 15, 2021 18:20
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 14 of 15 files at r1, 9 of 9 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, and @tbg)


pkg/server/admin.go, line 2637 at r1 (raw file):

		BatchRequest: string(baJSON),
	}
	err = contextutil.RunWithTimeout(ctx, "log-event", 10*time.Second, func(ctx context.Context) error {

Can you explain to me the wisdom of using system.eventlog here, while we're generally trying to get away from using this table (also, observing that if one finds themselves in need to reach out to this debug facility, chances are this write may be impossible).


pkg/server/admin.go, line 2645 at r1 (raw file):

	if err != nil {
		log.Warningf(ctx, "failed to log SendKVBatch to event log, emitting below: %v", err)
		log.StructuredEvent(ctx, event)

The structured event should be logged in any case.
I think you're relying on the idea that InsertEventRecord will also call log.StructuredEvent above? But then that needs to be clarified.
(I'd argue all this could be simplified by only calling log.StructuredEvent here.)

Also, the structured logging needs a test.

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, @knz, and @tbg)


pkg/server/admin.go, line 2637 at r1 (raw file):

Previously, knz (kena) wrote…

Can you explain to me the wisdom of using system.eventlog here, while we're generally trying to get away from using this table (also, observing that if one finds themselves in need to reach out to this debug facility, chances are this write may be impossible).

I was under the impression that system.eventlog is part of our event/audit logging facilities. You seem to be saying that it's unnecessary or deprecated here, so I'll remove it.

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @knz, and @tbg)


pkg/server/admin.go, line 2645 at r1 (raw file):

Previously, knz (kena) wrote…

The structured event should be logged in any case.
I think you're relying on the idea that InsertEventRecord will also call log.StructuredEvent above? But then that needs to be clarified.
(I'd argue all this could be simplified by only calling log.StructuredEvent here.)

Also, the structured logging needs a test.

Done.

@erikgrinaker
Copy link
Contributor Author

@knz Any further thoughts on this?

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the tardiness!

:lgtm_strong:

Reviewed 18 of 18 files at r4, 17 of 17 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt)

`jsonpb.Marshal` is used for JSON-marshaling of Protobuf structs in
accordance with the Protobuf spec.

Release note: None
This patch adds a CLI command `debug send-kv-batch` that takes a JSON
`BatchRequest` as input, proxies it into KV via the connected node's
admin server, and outputs a JSON `BatchResponse`. An admin user is
required, typically the root user, and the request is logged to the
system event log for auditability.

This is a useful backdoor to access arbitrary KV functionality for
debugging and surgery, e.g. during support escalations.

Release note: None
@erikgrinaker
Copy link
Contributor Author

TFTRs!

bors r=knz,tbg

@tbg
Copy link
Member

tbg commented Nov 25, 2021 via email

@erikgrinaker
Copy link
Contributor Author

Btw, demo about this on next weekly? This is cool stuff.

Sure thing. It's pretty trivial tbh, but extremely useful.

@craig
Copy link
Contributor

craig bot commented Nov 25, 2021

Build succeeded:

@craig craig bot merged commit a213807 into cockroachdb:master Nov 25, 2021
@erikgrinaker erikgrinaker deleted the cli-send-rpc branch November 25, 2021 20:40
tbg added a commit to tbg/cockroach that referenced this pull request Nov 26, 2021
This commit introduces a new KV request type `ProbeRequest`. This
request performs a mutationless round-trip through the replication
layer. The proximate reason to add this request is cockroachdb#33007, where we want
to fail-fast requests on a Replica until a `ProbeRequest` succeeds.

The alternative to introducing a new request type is using a a `Put` on
`keys.RangeProbeKey` (a key introduced for [kvprober]). However, when
the Replica circuit breakers perform a probe, they have to make sure
that this probe itself doesn't get caught up in the circuit breakers. In
particular, the circuit breaker's request needs special casing with
respect to lease requests (which would otherwise also bounce on the
circuit breaker). But even with that added (not too large a change),
the lease would be required, so we wouldn't be able to apply the
circuit breakers anywhere but the leaseholder (i.e. we would have
to heal the breaker when the lease is "not obtainable". What we are
after is really a per-Replica concept that is disjoint from the lease,
where we check that the replica can manage to get a command into the
log and observe it coming out of the replication layer.

`ProbeRequest` bypasses lease checks and can thus be processed by any
Replica. Upon application, they are guaranteed to result in a forced
error, and we suppress this error above the replication layer (i.e. at
the waiting request). This means that receiving a nil error from the
probe implies that the probe successfully made it through latching
(where it declares a read at its timestamp on the probed key),
evaluation, and application.

While no concrete plans exist for adoption of `ProbeRequest` elsewhere,
this might be of interest for [kvprober] and as a general building block
for quickly assessing overall KV health; for example we may extend cockroachdb#72732
to facilitate sending a probe to the leaseholders of all Ranges in the
system.

`ProbeRequest` is a point request. A ranged version is possible but
since `DistSender` will not return partial responses, it is not
introduced at this point. This can be done when a concrete need
arises.

`ProbeRequest` by default will be routed to the leaseholder, but it
can also be used with the `NEAREST` routing policy.

Some details helpful for reviewers follow.

- to enable testing of this new request type, `TestingApplyFilter` was
  augmented with a `TestingApplyForcedErrFilter` which gets to inspect
  and mutate forced errors below raft. The semantics of
  `TestingApplyFitler` remain unchainged; the newly added field
  `ForcedErr` is always nil when handed to such a filter.
- `ProbeRequest` is the first request type that skips the lease check
  and is not itself a lease check. This prompted a small refactor to
  `GetPrevLeaseForLeaseRequest`, which can now return false when the
  underlying request is a `Probe` and not `{Transfer,Request}Lease`.

[kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober
[routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Dec 2, 2021
This commit introduces a new KV request type `ProbeRequest`. This
request performs a mutationless round-trip through the replication
layer. The proximate reason to add this request is cockroachdb#33007, where we want
to fail-fast requests on a Replica until a `ProbeRequest` succeeds.

The alternative to introducing a new request type is using a `Put` on
`keys.RangeProbeKey` (a key introduced for [kvprober]). However, when
the Replica circuit breakers perform a probe, they have to make sure
that this probe itself doesn't get caught up in the circuit breakers. In
particular, the circuit breaker's request needs special casing with
respect to lease requests (which would otherwise also bounce on the
circuit breaker). But even with that added (not too large a change),
the lease would be required, so we wouldn't be able to apply the
circuit breakers anywhere but the leaseholder (i.e. we would have
to heal the breaker when the lease is "not obtainable". What we are
after is really a per-Replica concept that is disjoint from the lease,
where we check that the replica can manage to get a command into the
log and observe it coming out of the replication layer.

`ProbeRequest` bypasses lease checks and can thus be processed by any
Replica. Upon application, they are guaranteed to result in a forced
error, and we suppress this error above the replication layer (i.e. at
the waiting request). This means that receiving a nil error from the
probe implies that the probe successfully made it through latching
(where it declares a read at its timestamp on the probed key),
evaluation, and application.

While no concrete plans exist for adoption of `ProbeRequest` elsewhere,
this might be of interest for [kvprober] and as a general building block
for quickly assessing overall KV health; for example we may extend cockroachdb#72732
to facilitate sending a probe to the leaseholders of all Ranges in the
system.

`ProbeRequest` is a point request. A ranged version is possible but
since `DistSender` will not return partial responses, it is not
introduced at this point. This can be done when a concrete need
arises.

`ProbeRequest` by default will be routed to the leaseholder, but it
can also be used with the `NEAREST` routing policy.

Some details helpful for reviewers follow.

- to enable testing of this new request type, `TestingApplyFilter` was
  augmented with a `TestingApplyForcedErrFilter` which gets to inspect
  and mutate forced errors below raft. The semantics of
  `TestingApplyFilter` remain unchainged; the newly added field
  `ForcedErr` is always nil when handed to such a filter.
- `ProbeRequest` is the first request type that skips the lease check
  and is not itself a lease check. This prompted a small refactor to
  `GetPrevLeaseForLeaseRequest`, which can now return false when the
  underlying request is a `Probe` and not `{Transfer,Request}Lease`.

[kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober
[routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Dec 9, 2021
This commit introduces a new KV request type `ProbeRequest`. This
request performs a mutationless round-trip through the replication
layer. The proximate reason to add this request is cockroachdb#33007, where we want
to fail-fast requests on a Replica until a `ProbeRequest` succeeds.

The alternative to introducing a new request type is using a `Put` on
`keys.RangeProbeKey` (a key introduced for [kvprober]). However, when
the Replica circuit breakers perform a probe, they have to make sure
that this probe itself doesn't get caught up in the circuit breakers. In
particular, the circuit breaker's request needs special casing with
respect to lease requests (which would otherwise also bounce on the
circuit breaker). But even with that added (not too large a change),
the lease would be required, so we wouldn't be able to apply the
circuit breakers anywhere but the leaseholder (i.e. we would have
to heal the breaker when the lease is "not obtainable". What we are
after is really a per-Replica concept that is disjoint from the lease,
where we check that the replica can manage to get a command into the
log and observe it coming out of the replication layer.

`ProbeRequest` bypasses lease checks and can thus be processed by any
Replica. Upon application, they are guaranteed to result in a forced
error, and we suppress this error above the replication layer (i.e. at
the waiting request). This means that receiving a nil error from the
probe implies that the probe successfully made it through latching
(where it doesn't declare any key, i.e. should not conflict with long-
evaluating requests), evaluation, and application.

While no concrete plans exist for adoption of `ProbeRequest` elsewhere,
this might be of interest for [kvprober] and as a general building block
for quickly assessing overall KV health; for example we may extend cockroachdb#72732
to facilitate sending a probe to the leaseholders of all Ranges in the
system.

`ProbeRequest` is a point request. A ranged version is possible but
since `DistSender` will not return partial responses, it is not
introduced at this point. This can be done when a concrete need
arises.

`ProbeRequest` by default will be routed to the leaseholder, but it
can also be used with the `NEAREST` routing policy.

Some details helpful for reviewers follow.

- to enable testing of this new request type, `TestingApplyFilter` was
  augmented with a `TestingApplyForcedErrFilter` which gets to inspect
  and mutate forced errors below raft. The semantics of
  `TestingApplyFilter` remain unchainged; the newly added field
  `ForcedErr` is always nil when handed to such a filter.
- `ProbeRequest` is the first request type that skips the lease check
  and is not itself a lease check. This prompted a small refactor to
  `GetPrevLeaseForLeaseRequest`, which can now return false when the
  underlying request is a `Probe` and not `{Transfer,Request}Lease`.

[kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober
[routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60

Release note: None
craig bot pushed a commit that referenced this pull request Dec 9, 2021
72972: kvserver: introduce ProbeRequest r=erikgrinaker a=tbg

This commit introduces a new KV request type `ProbeRequest`. This
request performs a mutationless round-trip through the replication
layer. The proximate reason to add this request is #33007, where we want
to fail-fast requests on a Replica until a `ProbeRequest` succeeds.

The alternative to introducing a new request type is using a a `Put` on
`keys.RangeProbeKey` (a key introduced for [kvprober]). However, when
the Replica circuit breakers perform a probe, they have to make sure
that this probe doesn't get caught up in the circuit breakers
themselves. In particular, the circuit breaker's request needs special
casing with respect to lease requests (which would otherwise also bounce
on the circuit breaker), and in particular we would also like the probe
to be proposed (via the local raft instance) when the local replica is
not the leaseholder, as this allows the circuit breakers to cover more
ground and ascertain that the Replica has access to the replication
layer regardless of its status.

`ProbeRequest` bypasses lease checks and can thus be processed by any
Replica. Upon application, they are guaranteed to result in a forced
error, and we suppress this error above the replication layer (i.e. at
the waiting request). This means that receiving a nil error from the
probe implies that the probe successfully made it through latching,
evaluation, and application.

While no concrete plans exist for adoption of `ProbeRequest` elsewhere,
this might be of interest for [kvprober] and as a general building block
for quickly assessing overall KV health; for example we may extend #72732
to facilitate sending a probe to the leaseholders of all Ranges in the
system.

`ProbeRequest` is a point request. A ranged version is possible but
since `DistSender` will not return partial responses, it is not
introduced at this point. This can be done when a concrete need
arises.

`ProbeRequest` by default will be routed to the leaseholder, but it
can also be used with the `NEAREST` routing policy.

[kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober
[routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Dec 9, 2021
This commit introduces a new KV request type `ProbeRequest`. This
request performs a mutationless round-trip through the replication
layer. The proximate reason to add this request is cockroachdb#33007, where we want
to fail-fast requests on a Replica until a `ProbeRequest` succeeds.

The alternative to introducing a new request type is using a `Put` on
`keys.RangeProbeKey` (a key introduced for [kvprober]). However, when
the Replica circuit breakers perform a probe, they have to make sure
that this probe itself doesn't get caught up in the circuit breakers. In
particular, the circuit breaker's request needs special casing with
respect to lease requests (which would otherwise also bounce on the
circuit breaker). But even with that added (not too large a change),
the lease would be required, so we wouldn't be able to apply the
circuit breakers anywhere but the leaseholder (i.e. we would have
to heal the breaker when the lease is "not obtainable". What we are
after is really a per-Replica concept that is disjoint from the lease,
where we check that the replica can manage to get a command into the
log and observe it coming out of the replication layer.

`ProbeRequest` bypasses lease checks and can thus be processed by any
Replica. Upon application, they are guaranteed to result in a forced
error, and we suppress this error above the replication layer (i.e. at
the waiting request). This means that receiving a nil error from the
probe implies that the probe successfully made it through latching
(where it doesn't declare any key, i.e. should not conflict with long-
evaluating requests), evaluation, and application.

While no concrete plans exist for adoption of `ProbeRequest` elsewhere,
this might be of interest for [kvprober] and as a general building block
for quickly assessing overall KV health; for example we may extend cockroachdb#72732
to facilitate sending a probe to the leaseholders of all Ranges in the
system.

`ProbeRequest` is a point request. A ranged version is possible but
since `DistSender` will not return partial responses, it is not
introduced at this point. This can be done when a concrete need
arises.

`ProbeRequest` by default will be routed to the leaseholder, but it
can also be used with the `NEAREST` routing policy.

Some details helpful for reviewers follow.

- to enable testing of this new request type, `TestingApplyFilter` was
  augmented with a `TestingApplyForcedErrFilter` which gets to inspect
  and mutate forced errors below raft. The semantics of
  `TestingApplyFilter` remain unchainged; the newly added field
  `ForcedErr` is always nil when handed to such a filter.
- `ProbeRequest` is the first request type that skips the lease check
  and is not itself a lease check. This prompted a small refactor to
  `GetPrevLeaseForLeaseRequest`, which can now return false when the
  underlying request is a `Probe` and not `{Transfer,Request}Lease`.

[kvprober]: https://github.com/cockroachdb/cockroach/tree/2ad2bee257e78970ce2c457ddd6996099ed6727a/pkg/kv/kvprober
[routing policy]: https://github.com/cockroachdb/cockroach/blob/e1c7d6a55ed147aeb20d6d5c564d3d9bf73d2624/pkg/roachpb/api.proto#L48-L60

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants