-
Notifications
You must be signed in to change notification settings - Fork 219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(cosmic-swingset): Add memory stats to new commit block slog event #5637
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice and tidy! Please rerequest a review when you're ready for it.
f080255
to
547d722
Compare
.map(([key, value]) => [`agoric.${key}`, cleanValue(value, key)]) | ||
.filter(([_key, value]) => value !== undefined && value !== null), | ||
), | ||
...serializeInto(attrs, 'agoric'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michaelfig this does cause some data to be serialized deeply, such as slots
being written as .slots.0=ko123
etc instead of .slots=["ko123"]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's perfectly fine, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like it makes the number of attributes per (span?) become larger and variable.. if some crank uses a thousand slots, is that going to cause problems on the telemetry provider?
While it's nice to be able to recover the list of slots, I'm hard pressed to imagine a honeycomb-side search that could use it (slogfile-processing tools are probably more useful there). Unless it's possible to do a honeycomb search like "find me all spans WHERE any .slots.*
attributes have value ko123
", which would be kinda cool for causality tracing (currently I do this with a lot of jq
work and/or ad-hoc python tools).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, a base slots
which is a comma separated string of values may be more useful, and could be used with contains
in Honeycomb queries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explicitly handled slots
below to avoid the splitting in separate fields.
547d722
to
384024a
Compare
message.msg.methargs?.slots ?? message.msg.args.slots; | ||
const slots = | ||
message.msg.methargs?.slots ?? message.msg.args.slots ?? []; | ||
attrs['message.msg.args.slots'] = slots.join(','); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@warner is there a risk that ,
would ever be used in krefs ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not anytime soon, I have a bunch of analysis tools that also use ,
as a joiner. vrefs are more interesting, but krefs have no punctuation (purely k[opd]\d+
) and no plans to acquire any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the reason that this code is only ever seeing krefs is that it'd fed by crank-start
events, rather than type: 'deliver'
events? If it were the latter, something would need to extract the kernel-format delivery (.kd
), and I don't see that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan of re-thinking the crank vs delivery slog events.
if (spans.topName() === `intra-block`) { | ||
spans.pop(`intra-block`); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This introduces a new span to capture the duration between "commit-block" and the next "begin-block", and layers that as a child of the "previous" block, which causes the block's duration to now measure the block to block time.
Before we were not capturing this gap anywhere, causing blocks to appear shorter than they really were.
384024a
to
86179af
Compare
@michaelfig @warner, PTAL (recommend reviewing commit-by-commit) I've rebased and simplified the commit-block slog events. (sorry for nullifying previous reviews) I've also included some preliminary cleanup to the slog-to-otel in advance of a more thorough rewrite of spans. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think that's good. I can't visualize the collection of spans and how they line up, so I'm not confident that I understand how this changes the spans, but I know you've got another change in the works which will include docs and a list/diagram of spans, so I'll re-examine this code once those docs are available as a reference.
@@ -230,23 +234,27 @@ export function makeSnapStore( | |||
toDelete.add(hash); | |||
} | |||
|
|||
function commitDeletes(ignoreErrors = false) { | |||
async function commitDeletes(ignoreErrors = false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks sound, although I don't get why an async version is better, and why these changes are part of the PR (this doesn't change the swingstore API, right?, just some churn in the necessary IO endowments?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was needed in an earlier version to allow executing JS while the delete I/Os were taking place. This version is strictly superiors though since it parallelizes the deletes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And you're correct, at this point it became a drive by change.
@@ -1011,7 +1011,12 @@ export default function buildKernel( | |||
async function processDeliveryMessage(message) { | |||
kdebug(`processQ ${JSON.stringify(message)}`); | |||
kdebug(legibilizeMessage(message)); | |||
kernelSlog.write({ type: 'crank-start', message }); | |||
kernelSlog.write({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of using delivery-crank-start
and routing-crank-start
instead of a separate field (likewise delivery-crank-finish
/routing-crank-finish
). I figured that my jq
tools would just grow a select(.type=="delivery-crank-start")
instead of select(.type=="crank-start" and .crankType=="delivery")
, slightly easier to type, and it's rare that I'd want to pay attention to both types at the same time.
Also, I'm kinda wavering on the idea that the routing step should be called a "crank" (might be better to find a new name, maybe), and having separate type:
values would make that easier to change.
OTOH if they share a crankNum
numberspace then having separate type
values is a tiny bit weird.
Ok leave it with crankType
for now, but let's think about it for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the shared numbering space was one motivation for the decision. The other being that some "delivery" cranks still do not deliver anything. I was also concerned about breaking existing slog tooling which targets crank-start
.
}); | ||
|
||
controller.writeSlogObject({ | ||
type: 'cosmic-swingset-commit-block-finish', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the type:
name, be aware that the commit()
this wraps is only the swingset/swing-store commit: the rest of cosmos-sdk does not start it's IAVL commit until after this returns. The name is probably ok (we can pretend that the cosmic-swingset-
prefix implies that we're only talking about the swingset host app's commit of swingset state), but if we ever find a way to grab the golang/IAVL commit timing and write it to the slog, we might need to find a more precise name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I expect we'll need to overall the layering of spans if/when we ever manage to report data from the go side.
let heapSnapshotTime; | ||
|
||
const t0 = performance.now(); | ||
engineGC(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so this is forcing a Node.js GC every block.. normally I'd be concerned about interfering with the carefully-designed V8 GC algorithm (we shouldn't presume to know better), but given that this is only happening every 5-10 seconds, and we're measuring how much overhead we're introducing, I'm ok with it.
But let's pay attention to the results, and be willing to remove it if it appears to be overhead that could be handled better through the normal incremental GC. I see that afterCommitCallback
implies that this might happen in parallel with golang-side work, but I never believe parallelism actually happens without proof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, that's why I added the measurement compared to the previous iteration.
I couldn't 100% confirm that it happened in parallel with the go side, but from what I could tell, it did.
for await (const line of lines) { | ||
lineCount += 1; | ||
const obj = JSON.parse(line); | ||
const update = obj.time >= progress.lastSlogTime; | ||
update ||= obj.time >= progress.lastSlogTime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, what prompted this? Did slogfiles have non-monotonically-increasing timestamps?
I see that there's no way for update
to ever become false
again (within a single run of this program), but that's ok, as the feature is meant to allow the program to be restarted after a crash and pick up where it left off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some weird case where some slog lines were skipped, and this was the only explanation I found. I unfortunately didn't save the slog file.
message.msg.methargs?.slots ?? message.msg.args.slots; | ||
const slots = | ||
message.msg.methargs?.slots ?? message.msg.args.slots ?? []; | ||
attrs['message.msg.args.slots'] = slots.join(','); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the reason that this code is only ever seeing krefs is that it'd fed by crank-start
events, rather than type: 'deliver'
events? If it were the latter, something would need to extract the kernel-format delivery (.kd
), and I don't see that here.
Support missing types
feat(telemetry): create spans for commit-block and intra-block Causes block duration to measure block to block time
86179af
to
42e43bc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice work, Mathieu.
refs: #5507
Description
cosmic-swingset-commit-block-*
slog events, with their own timingcommit-block-start
andcommit-block-finish
measure the time taken by cosmic-swingset to save/flush its data.after-commit-block
is an event containing post block cleanup stats related to node/v8 memory.NODE_HEAP_SNAPSHOTS
environment option to control the generation of node heap snapshots after a block commit-1
(or any negative number) to disable,0
to only trigger on large block interval (hardcoded at 30 seconds), any positive number to generate a snapshot at an intervalSecurity Considerations
This adds an option to dump the node heap, which shouldn't contain any sensitive data as it's all based on the consensus execution.
Documentation Considerations
TBD
Testing Considerations
Manual local testing with the loadgen deployment test, and ingest slot to otel script.