Failure in rpk:test-container
#2418
Comments
Again today: Looked at the code - I think this is an rpk bug. It would be more robust to generate a pool of available ports up-front before starting any redpanda nodes, and then allocate from that pool -- that way there wouldn't be a race between redpanda startup and GetFreePort. |
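To make the "pool up-front" idea concrete, here is a rough sketch, not rpk's actual code. `reservePorts` is a hypothetical helper: it keeps every probing listener open until the whole pool is collected, so the kernel cannot hand out the same port twice. It assumes probing on 127.0.0.1 is representative of what the containers will later bind.

```go
package main

import (
	"fmt"
	"net"
)

// reservePorts is a hypothetical helper (not part of rpk). It opens n
// listeners on port 0 at the same time, so the kernel has to pick n
// distinct ports, and returns the port numbers it was given.
func reservePorts(n int) ([]int, error) {
	listeners := make([]net.Listener, 0, n)
	// Release every listener on return, on both success and failure.
	defer func() {
		for _, l := range listeners {
			l.Close()
		}
	}()

	ports := make([]int, 0, n)
	for i := 0; i < n; i++ {
		l, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			return nil, fmt.Errorf("reserving port %d of %d: %w", i+1, n, err)
		}
		listeners = append(listeners, l)
		ports = append(ports, l.Addr().(*net.TCPAddr).Port)
	}
	return ports, nil
}

func main() {
	ports, err := reservePorts(5)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("reserved ports:", ports)
}
```

This only guarantees uniqueness within the pool. Another process can still grab one of these ports between the listeners closing and the container binding, so it narrows the race rather than removing it.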
Would also be nice to have the rpk info/debug log output during this test (does it just need a -v flag?) |
Failure of this test in a different way here: https://buildkite.com/vectorized/redpanda/builds/3537#29a68ee8-bc64-488d-b4e4-4c0a9020b9c0 On that one, we see two nodes start successfully, and the third node silently fails to start ('rpk start' runs, status check output appears, but no output from the redpanda binary appears). |
FWIW, my goal is to switch to using a static set of ports, but that'll likely conflict with these tests. |
Latest example on tip of dev |
This seems to be becoming more frequent for some reason https://buildkite.com/vectorized/redpanda/builds/3908#ad8d710f-2aea-4577-bdc4-bd256fd935bf |
Another one: @twmb how essential is the coverage provided by these tests? The failure is happening often enough that I'm inclined to switch them off until fixed. |
AFAICT the code the tests are testing is strictly isolated to |
Disabling here: https://github.com/vectorizedio/vtools/pull/314 |
As an update, this test is disabled and the code is frozen until aspects of it are rewritten, notably the port randomization should be fixed, and the test should tolerate that. AFAICT, this test can be completely rewritten at that point, and I'm not sure it's worth spending time to fix the current issue when in the end it'll be removed or rewritten. |
Totally agree, but if rpk:container is being used, then we should also keep in mind how long we think it might be until it is rewritten, so that we do have testing for an in-use feature. Do you have any guidance on either (1) where someone would start poking around to make the port randomization more robust / correct and (2) how long that might take? |
The port probing can be pretty easily traced through gopls's jump to definition, |
Reopening due to https://buildkite.com/redpanda/redpanda/builds/9845#c84f2c25-401e-47af-8b21-379197d2b1cc
We tested with the tip of dev: 1k nodes / 5k ports, and we allocate unique ports every time. The port collision might be caused by another service, so we will keep an eye out in case it happens again. |
@r-vasquez what changed between that failure and the tip of dev to make us think this won't happen again? I get that it might be interference from something else, but that interference needs to be hunted down, unless we've exhausted all avenues of investigation? |
Our next approach if this fails again is to just reallocate 5 new free ports and retry the container start -- we could either leave this issue open, or reopen the first time we see this same CI failure. We do intend to eventually switch away from the random port choices and instead make the ports static and driven by flags, but that's a bit longer term. |
If this had failed once a few weeks ago I could understand, but this failed Friday, the day after it was re-enabled, and it's only Monday now. This doesn't seem like a super rare failure -- if there's already a test change in mind to make this more robust, then let's make the change, instead of waiting for more failures. |
Failed overnight here: https://buildkite.com/redpanda/redpanda/builds/9991#7efb7950-1304-4027-8344-4e8644c8ab69
@r-vasquez please re-disable today if the fix isn't obvious right away. |
I just looked again at the patch -- I think the issue is that getFreePortPool can find that the same port is free more than once: after it checks a port, nothing is left listening on it, so future calls to getFreePort can find the same port again. When getFreePort is called, it should reject any port that is already allocated in the pool. |
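Roughly what that fix could look like, as a sketch rather than the actual patch. `freePort` and `freePortPool` are hypothetical stand-ins for the rpk functions named above, and the `probe` parameter and the attempt cap are assumptions added so the example is testable and cannot loop forever.

```go
package main

import (
	"fmt"
	"net"
)

// freePort asks the kernel for a currently free TCP port. The listener is
// closed before returning, so a later call may report the same port again.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// freePortPool collects n distinct ports from probe, skipping any port that
// is already in the pool. It gives up after n*10 probes (an arbitrary cap).
func freePortPool(n int, probe func() (int, error)) ([]int, error) {
	seen := make(map[int]struct{}, n)
	pool := make([]int, 0, n)
	maxAttempts := n * 10
	for attempt := 0; len(pool) < n; attempt++ {
		if attempt >= maxAttempts {
			return nil, fmt.Errorf("found only %d of %d unique ports after %d attempts",
				len(pool), n, maxAttempts)
		}
		p, err := probe()
		if err != nil {
			return nil, err
		}
		if _, dup := seen[p]; dup {
			continue // already allocated in the pool, probe again
		}
		seen[p] = struct{}{}
		pool = append(pool, p)
	}
	return pool, nil
}

func main() {
	pool, err := freePortPool(5, freePort)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("port pool:", pool)
}
```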
Reopening to track re-enabling the test -- I didn't see a vtools PR since #655 disabled it? |
Thanks, I just created: https://github.com/redpanda-data/vtools/pull/667 |
Test re-enabling merged, I think we can close this. |
This test is run after all the ducktape tests & seems to be failing every 10-20 runs.
https://buildkite.com/vectorized/redpanda/builds/2528#e0afe087-83be-4b09-9899-3cea1e8c966c
A couple of lines stand out:
Error: Error restarting the cluster: Error response from daemon: driver failed programming external connectivity on endpoint rp-node-1 (77e690e2d23268aed897727feaa2caa7837f7c60d531150ed7f87fe35234668d): Bind for 0.0.0.0:41369 failed: port is already allocated
Error: flag needs an argument: --brokers