Add support for docker testclusters #20247
Conversation
…dev or -dev-three-node modes.
…e go test case, take 2
…it, so instead skip each step. Yuck.
…er-node. All that's missing is for the "current" vault binary to be used.
…he current vault binary.
…of cleanup needed but the concept is sound.
When using the podman service runner (creating a socket equivalent to Docker's), tests fail with a nil pointer exception since the bridge network is named "podman", not "bridge". Allow single networked containers and use whatever name the container runner assigns to it. Signed-off-by: Alexander Scheel <[email protected]>
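For readers unfamiliar with the helper, here is a minimal sketch of that single-network fallback, assuming the inspect structure from the docker client (`types.ContainerJSON`) that the helper already uses; the function name and exact error handling are illustrative, not the PR's code:

```go
package docker

import (
	"fmt"

	"github.com/docker/docker/api/types"
)

// resolveContainerIP sketches the single-network fallback described in the
// commit above: when no network name was requested, accept a container that
// sits on exactly one network (named "bridge" under Docker, "podman" under
// Podman) and return whatever address the runtime assigned there.
func resolveContainerIP(inspect types.ContainerJSON, requestedNetwork string) (string, error) {
	nets := inspect.NetworkSettings.Networks
	if requestedNetwork != "" {
		n, ok := nets[requestedNetwork]
		if !ok {
			return "", fmt.Errorf("container is not attached to network %q", requestedNetwork)
		}
		return n.IPAddress, nil
	}
	if len(nets) != 1 {
		return "", fmt.Errorf("expected exactly one attached network, found %d; set a network name explicitly", len(nets))
	}
	for _, n := range nets { // single entry; its key may be "bridge", "podman", etc.
		return n.IPAddress, nil
	}
	return "", fmt.Errorf("no networks attached")
}
```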
} else if d.RunOptions.LogConsumer != nil {
consumeLogs = true
logStdout = &LogConsumerWriter{d.RunOptions.LogConsumer}
logStderr = &LogConsumerWriter{d.RunOptions.LogConsumer}
The logging changes will be great to have for the PKI tests, to avoid having to use the manual tests.OnError(...) logging :-)
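For context, a minimal sketch of how a test might use that option to stream container output through the test logger. NewServiceRunner and the image/name fields come from the existing sdk/helper/docker helpers, and the LogConsumer signature (func(string)) is assumed from the LogConsumerWriter usage above:

```go
package pkiext

import (
	"testing"

	"github.com/hashicorp/vault/sdk/helper/docker"
)

func TestContainerLogsToTestLog(t *testing.T) {
	// Sketch: forward container stdout/stderr into `go test -v` output
	// instead of dumping logs manually after a failure.
	runner, err := docker.NewServiceRunner(docker.RunOptions{
		ImageRepo:     "hashicorp/vault",
		ImageTag:      "latest",
		ContainerName: "vault-log-demo",
		LogConsumer: func(line string) {
			t.Log(line)
		},
	})
	if err != nil {
		t.Fatal(err)
	}
	_ = runner // starting the container proceeds as usual from here (omitted)
}
```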
func (dc *DockerCluster) addNode(ctx context.Context, opts *DockerClusterOptions) error {
i := len(dc.ClusterNodes)
nodeID := fmt.Sprintf("core-%d", i)
node := &dockerClusterNode{
When adding the first node, and similar to the comment above about saving the network, I wonder if it'd be valuable to re-use the first node's network here? In particular, we want to ensure that nodes always land on the same network so they can communicate with each other. I think this only happens if someone is actively playing around with default networks, though maybe not if we don't allow choosing the first container's network via DockerCluster?
I don't know that it matters much any more, but the reason we saved network and re-added it manually to subsequent containers in the pkiext/nginx_test.go suite was that CircleCI's remote docker setup would sometimes land us on different networks, which was the oddest thing ever given the other networks still seemed to exist.
What do you mean by re-adding a network? I looked in nginx_test but couldn't figure out what you're referring to.
Ah, sorry: in RunNginxRootTest we call ..., networkName, ... := buildNginxContainer, which creates a container on an arbitrary network. Subsequent calls that create containers meant to live on the same network take that name in, e.g. CheckWithClients(t, networkName, ...), to ensure they're created on the same network as the nginx instance.
Here, since we're exposing this in sdk as a generic test dependency, we can't know how the container will get created, so it'd be best to force subsequent nodes onto the same network as the first node.
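A rough sketch of that suggestion, with the caveat that the NetworkName fields used below are assumptions rather than the PR's actual field names:

```go
// Sketch only: force nodes after the first onto whatever network node 0
// actually landed on, so cluster members can always reach each other.
// The NetworkName fields on DockerClusterOptions and on the node struct
// are assumptions, not the PR's actual field names.
func (dc *DockerCluster) addNodeOnSharedNetwork(ctx context.Context, opts *DockerClusterOptions) error {
	if len(dc.ClusterNodes) > 0 && opts.NetworkName == "" {
		// Reuse node 0's network even when it is the runtime default
		// ("bridge" under Docker, "podman" under Podman).
		opts.NetworkName = dc.ClusterNodes[0].NetworkName
	}
	return dc.addNode(ctx, opts)
}
```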
"SKIP_SETCAP=true", | ||
"VAULT_LOG_FORMAT=json", | ||
}, | ||
Ports: []string{"8200/tcp", "8201/tcp"}, |
It might be cool in a future PR to add the ability to expose a non-TLS listener/port here. OCSP and CRL fetching should (theoretically) go over a non-TLS listener, and I've occasionally had issues with the cert auth method talking to localhost over TLS.
I don't need this immediately; I could add it in a later PR when I do. But I'm curious whether you have thoughts on how this would look: it doesn't seem like we get much influence over the listeners, though I might've missed something.
I haven't yet added any support for configuring listeners in HCL, but I don't recall any major obstacles to doing so, just didn't need it yet.
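For what it's worth, once HCL listener configuration exists, a plaintext listener would presumably just be an extra stanza appended to the generated node config, plus an extra entry in Ports. A sketch only, with an arbitrary port:

```go
// Hypothetical extra listener stanza a future option might append to the
// node's generated configuration; 8202 is an arbitrary choice here.
const plaintextListenerHCL = `
listener "tcp" {
  address     = "0.0.0.0:8202"
  tls_disable = "true"
}
`
```

The matching "8202/tcp" entry would also need to be added to the Ports slice shown earlier so the listener gets published.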
Sorry, didn't mean to request changes, meant to just submit review lol.
# Conflicts:
#   go.sum
#   sdk/go.mod
#   sdk/go.sum
break
}
} else {
realIP = inspect.NetworkSettings.Networks[d.RunOptions.NetworkName].IPAddress
Is there any chance of the network with that name not being found, and this panicking?
Apparently the answer is yes:
acme_test.go:95: [0] Cluster Node Test_ACMERSAPure-core-0 - podman / 10.88.0.2
acme_test.go:95: [1] Cluster Node Test_ACMERSAPure-core-1 - podman / 10.88.0.3
acme_test.go:95: [2] Cluster Node Test_ACMERSAPure-core-2 - podman / 10.88.0.4
acme_test.go:27: creating on network: podman
2023-04-24T11:49:41.586-0400 [DEBUG] Test_ACMERSAPure.core-2.core.cluster-listener: creating rpc dialer: address=10.88.0.2:8201 alpn=req_fw_sb-act_v1 host=fw-50bb4d28-5e69-6a80-b974-6d3562e805b0 timeout=19.999984975s
--- FAIL: Test_ACMERSAPure (36.18s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x68 pc=0xb7fcc8]
goroutine 14 [running]:
testing.tRunner.func1.2({0xd69d00, 0x156da90})
/usr/local/go/src/testing/testing.go:1526 +0x24e
testing.tRunner.func1()
/usr/local/go/src/testing/testing.go:1529 +0x39f
panic({0xd69d00, 0x156da90})
/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/hashicorp/vault/sdk/helper/docker.(*Runner).Start(0xc0222c8780, {0xfe59e8, 0xc00003a0d8}, 0x6?, 0x0)
/home/cipherboy/GitHub/cipherboy/vault/sdk/helper/docker/testhelpers.go:458 +0x10a8
github.com/hashicorp/vault/builtin/logical/pkiext.CheckCertBot(0xc000582ea0, {0xc000312770, 0x6}, {0xc02207b720, 0x1d})
/home/cipherboy/GitHub/cipherboy/vault/builtin/logical/pkiext/acme_test.go:39 +0x265
github.com/hashicorp/vault/builtin/logical/pkiext.RunACMERootTest(0xc000582ea0, {0xe7961b, 0x3}, 0x52f785?, 0x0, {0xe7961b, 0x3}, 0xc000083f60?, 0x0)
/home/cipherboy/GitHub/cipherboy/vault/builtin/logical/pkiext/acme_test.go:173 +0x19d3
github.com/hashicorp/vault/builtin/logical/pkiext.Test_ACMERSAPure(0x0?)
/home/cipherboy/GitHub/cipherboy/vault/builtin/logical/pkiext/acme_test.go:177 +0x38
testing.tRunner(0xc000582ea0, 0xef4488)
/usr/local/go/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1629 +0x3ea
FAIL github.com/hashicorp/vault/builtin/logical/pkiext 36.192s
FAIL
Ah, that's on me. I passed in networkID rather than NetworkName... hence why it was confused. 😆
After fixing that: no, it does indeed have that network correctly specified.
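Independent of the name/ID mix-up, a nil-safe lookup would turn that panic into a readable error. A minimal sketch using the same docker client types the runner already imports; the helper name is illustrative:

```go
package docker

import (
	"fmt"

	"github.com/docker/docker/api/types"
)

// networkIP is a nil-safe replacement for the direct map index at
// inspect.NetworkSettings.Networks[name].IPAddress: it reports which
// networks are actually attached instead of dereferencing a nil entry.
func networkIP(inspect types.ContainerJSON, name string) (string, error) {
	n, ok := inspect.NetworkSettings.Networks[name]
	if !ok {
		attached := make([]string, 0, len(inspect.NetworkSettings.Networks))
		for k := range inspect.NetworkSettings.Networks {
			attached = append(attached, k)
		}
		return "", fmt.Errorf("container not attached to network %q (attached: %v)", name, attached)
	}
	return n.IPAddress, nil
}
```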
}

func (dc *DockerCluster) GetBarrierOrRecoveryKeys() [][]byte {
return dc.GetBarrierKeys()
This comment isn't necessarily related to your changes since I know this implementation is the same as the current TestCluster, but should this function ever return recovery keys instead of barrier keys?
Only once we support non-shamir seals.
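Purely to illustrate what that future branch might look like (both fields below are hypothetical, invented for this sketch):

```go
// Hypothetical sketch only: with today's shamir-only docker clusters,
// returning barrier keys is correct. Once auto-unseal is supported, this
// would branch on the seal type.
func (dc *DockerCluster) GetBarrierOrRecoveryKeys() [][]byte {
	if dc.usingAutoUnseal { // hypothetical field
		return dc.recoveryKeys // hypothetical field, captured during init
	}
	return dc.GetBarrierKeys()
}
```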
@ncabatoff This thread reminds me: I think it'd be really cool if we could support other base layers for the Vault container; happy to open a PR for that.
I'm not quite sure how it interacts with AddNodes, tbh, but what I was thinking is that it'd be cool to drop in a container file like:
FROM hashicorp/vault:latest
RUN apk add softhsm # ... plus more configuration
and then use it for PKCS#11 auto-unseal/managed keys/... testing in Vault Ent with SoftHSM.
I think we'd need some hooks to let us specify custom callbacks at points in time (startup, cluster join, &c.) so that we can inject data from other containers: the initial node will set up auto-unseal keys which we'll need to replicate onto the other machines, unless we start a PKCS#11 network proxy client on each Vault node and forward all requests to a single, common PKCS#11 proxy server on another node.
This is definitely enough work to be a completely separate PR though.
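As a strawman for that follow-up, the hooks might amount to a couple of optional callbacks on the cluster options; every name and signature below is hypothetical:

```go
// Strawman only; none of these names exist in the PR. Optional lifecycle
// callbacks would give tests a place to copy SoftHSM token state or other
// seal material between containers.
type DockerClusterLifecycleHooks struct {
	// Runs after a node's container starts, before join/unseal, e.g. to copy
	// auto-unseal key material from node 0 into this node.
	PostNodeStart func(ctx context.Context, nodeIdx int, containerID string) error
	// Runs once the whole cluster is up, e.g. to configure managed keys.
	PostClusterSetup func(ctx context.Context, dc *DockerCluster) error
}
```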
This worked and was sufficient for my tests in #20320!
container, hostIPs, containerID, err := d.Start(context.Background(), addSuffix, forceLocalAddr)
if d.RunOptions.PreDelete {
name := d.RunOptions.ContainerName
matches, err := d.DockerAPI.ContainerList(ctx, types.ContainerListOptions{
Am I correct in understanding this is exclusive of t.Parallel()? If we specify addSuffix to allow tests to execute in parallel, would we kill all the uuid-suffixed variants here as well as the main one?
A docs follow-up should be sufficient if so.
I don't think so? It seems to me we're using an exact match rather than a prefix match in the filter.
The main use case for the pre-delete is local testing where we're not using addSuffix, to prevent a prior failed run from interfering. We should probably do that labels TODO so that in local testing with addSuffix, old containers get cleaned up too; they won't interfere, but they will take up memory.
Ah, I missed the exact match. I think this is good. Yeah, I'm curious what you would put in the label though... perhaps the test name?
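If the label ends up being the test name, the pre-delete sweep could then match on it instead of the exact container name. A sketch against the Docker API, where the "vault-test" label key is an arbitrary choice and containers are assumed to have been created with a matching Labels entry:

```go
package docker

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

// preDeleteByLabel removes leftover containers from earlier runs by label
// rather than by exact name, so addSuffix'd variants get cleaned up too.
func preDeleteByLabel(ctx context.Context, api *client.Client, testName string) error {
	matches, err := api.ContainerList(ctx, types.ContainerListOptions{
		All:     true,
		Filters: filters.NewArgs(filters.Arg("label", "vault-test="+testName)),
	})
	if err != nil {
		return err
	}
	for _, c := range matches {
		// Force-remove: these are leftovers from prior runs, not live tests.
		if err := api.ContainerRemove(ctx, c.ID, types.ContainerRemoveOptions{Force: true}); err != nil {
			return err
		}
	}
	return nil
}
```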
// behaviour for HUP is termination. So the PostStart that NewDockerCluster
// passes in (which does all that PKI cert stuff) waits to see output from
// Vault on stdout/stderr before it sends the signal, and we don't want to
// run the PostStart until we've hooked into the docker logs.
That's slick and really handy!
Haha I'm glad you think it's slick. I agree it's handy but it feels like a messy kludge... still, it works.
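For anyone reading along, the trick boils down to signaling on first output; a stripped-down sketch of the idea with illustrative names, not the PR's actual API:

```go
package docker

import (
	"context"
	"log"
	"sync"
)

// examplePostStartGate sketches the "wait for output before PostStart"
// kludge: the log consumer closes a channel the first time the container
// writes anything, and the PostStart callback blocks on that channel before
// sending SIGHUP, since HUP's default action would kill a Vault process that
// hasn't installed its signal handler yet.
func examplePostStartGate() (logConsumer func(string), postStart func(context.Context) error) {
	sawOutput := make(chan struct{})
	var once sync.Once

	logConsumer = func(line string) {
		once.Do(func() { close(sawOutput) })
		log.Println(line)
	}

	postStart = func(ctx context.Context) error {
		select {
		case <-sawOutput:
			// Vault is up; safe to install certs and signal HUP now.
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return logConsumer, postStart
}
```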