-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VAULT-31409: trace postUnseal function #28895
Conversation
CI Results: |
didn't bother to check the error from traceFile.Close()
helper/trace/debug_trace.go
Outdated
path := fmt.Sprintf("%s/%s_%s", os.TempDir(), filePrefix, time.Now().Format(time.RFC3339)) | ||
traceFile, err := os.Create(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still want to make this more similar to VAULT_STACKTRACE_WRITE_TO_FILE
for the sake of consistency
after talking to Kuba it sounds like it's a good idea to try to move stuff out of core, so even if there's no immediate need for a generic debug trace function it's still fair to add it
also some usability improvements from manual testing
helper/trace/debug_trace.go
Outdated
} | ||
} | ||
|
||
traceFile, err := filepath.Abs(fmt.Sprintf("%s/%s-%s.trace", dir, filePrefix, time.Now().Format(time.RFC3339))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we switch this to using filepath.Join
to create the path, rather than Sprintf
?
command/server/config.go
Outdated
@@ -115,6 +115,9 @@ type Config struct { | |||
License string `hcl:"-"` | |||
LicensePath string `hcl:"license_path"` | |||
DisableSSCTokens bool `hcl:"-"` | |||
|
|||
EnablePostUnsealTrace bool `hcl:"enable_post_unseal_trace"` | |||
PostUnsealTraceDir string `hcl:"post_unseal_trace_dir"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also add these fields to the (*Config).Merge
function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh good catch, wouldn't have noticed this unless I'd tested with multiple files. I wonder why we didn't go for a more generic implementation here (maybe didn't want to use reflect?) but we could at least have some semgrep validation to check that all fields are merged (or explicitly ignored)
if stopTrace := c.tracePostUnsealIfEnabled(); stopTrace != nil { | ||
defer stopTrace() | ||
} | ||
|
||
defer metrics.MeasureSince([]string{"core", "post_unseal"}, time.Now()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what is the best order between this metric and the new trace, but I think it's fair to include the metric in the captured trace even tho that will make the time spend on the tracing setup invisible to the metric
command/server/config.go
Outdated
@@ -115,6 +115,9 @@ type Config struct { | |||
License string `hcl:"-"` | |||
LicensePath string `hcl:"license_path"` | |||
DisableSSCTokens bool `hcl:"-"` | |||
|
|||
EnablePostUnsealTrace bool `hcl:"enable_post_unseal_trace"` | |||
PostUnsealTraceDir string `hcl:"post_unseal_trace_dir"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh good catch, wouldn't have noticed this unless I'd tested with multiple files. I wonder why we didn't go for a more generic implementation here (maybe didn't want to use reflect?) but we could at least have some semgrep validation to check that all fields are merged (or explicitly ignored)
there were concerns about using the /tmp directory because of permissions, or having a default dir at all, so now it's required to set a dir in order to generate the traces.
sounds like it might be forbidden in Windows and possibly cause problems in some MacOS applications.
Build Results: |
return "", nil, fmt.Errorf("trace directory %q does not exist", dir) | ||
} | ||
|
||
if !os.IsNotExist(err) && !d.IsDir() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the !os.IsNotExist(err)
portion of this condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think we need it. If os.IsNotExist(err)
evaluates to true then we don't want to evaluate d.IsDir()
since d
would be nil
.
The alternative would be to use a nested if-else structure that interleaves error returns with additional operations (creation of default dir), which I don't like as much:
d, err := os.Stat(dir)
if err != nil {
if os.IsNotExist(err) {
if dirMustExist {
return "", nil, fmt.Errorf("trace directory %q does not exist", dir)
} else {
if err := os.Mkdir(dir, 0o700); err != nil {
return "", nil, fmt.Errorf("failed to create trace directory %q: %s", dir, err)
}
}
} else {
return "", nil, fmt.Errorf("failed to stat trace directory %q: %s", dir, err)
}
} else if !d.IsDir() {
return "", nil, fmt.Errorf("trace directory %q is not a directory", dir)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, it's good as is!
CI was complaining about missing comments on the new test function. It feels a bit silly to require this of tests but whatever XD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, though a config test is still failing. once that's passing i'm good to approve it!
return "", nil, fmt.Errorf("trace directory %q does not exist", dir) | ||
} | ||
|
||
if !os.IsNotExist(err) && !d.IsDir() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, it's good as is!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
* initial implementation of unseal trace * close file if we fail to start the trace didn't bother to check the error from traceFile.Close() * use reloadable config instead of env var * license * remove leftover * allow setting custom dir and remove new package * bring back StartDebugTrace after talking to Kuba it sounds like it's a good idea to try to move stuff out of core, so even if there's no immediate need for a generic debug trace function it's still fair to add it * track postUnseal instead of unsealInternal also some usability improvements from manual testing * address PR comments * address security review there were concerns about using the /tmp directory because of permissions, or having a default dir at all, so now it's required to set a dir in order to generate the traces. * add unit tests to StartDebugTrace * move back to default dir * document new parameters * add tiny integration test * avoid column in trace filename sounds like it might be forbidden in Windows and possibly cause problems in some MacOS applications. * address PR feedback * add go doc to test CI was complaining about missing comments on the new test function. It feels a bit silly to require this of tests but whatever XD * fix tests
Description
This PR adds 2 new settings to the top-level Vault server configuration:
enable_post_unseal_trace
that when enabled will cause Vault to generate a Go trace during the execution ofcore.postUnseal
and save it to disk under the directory configured inpost_unseal_trace_dir
, defaulting to/tmp/vault-traces/
instead if the field isn't set. This file can be inspected using thego tool trace
command and will help us debug cases where the leadership transfer operation takes too long.Most of the feature was split into a reusable
StartDebugTrace
function in a separate package. It could be used for similar debug purposes in the future but most importantly for now it avoids adding more code and dependencies to the already-bloatedcore
package.Manually tested the functionality to ensure that it was possible to generate the traces by updating the new fields in the config, then sending a
SIGHUP
signal to the running Vault process and triggering a leadership election viavault operator step-down
. Disabling the generation of the traces via the sameSIGHUP
process also works. Post-unseal only runs on the active node so restarting a follower won't generate the post-unseal traces.Jira: VAULT-31409
TODO only if you're a HashiCorp employee
to N, N-1, and N-2, using the
backport/ent/x.x.x+ent
labels. If this PR is in the CE repo, you should only backport to N, using thebackport/x.x.x
label, not the enterprise labels.of a public function, even if that change is in a CE file, double check that
applying the patch for this PR to the ENT repo and running tests doesn't
break any tests. Sometimes ENT only tests rely on public functions in CE
files.
in the PR description, commit message, or branch name.
description. Also, make sure the changelog is in this PR, not in your ENT PR.