Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the TestLeader_SecondaryCA_IntermediateRefresh test flakine… #6885

Merged
merged 1 commit into from
Dec 5, 2019

Conversation

mkeeler
Copy link
Member

@mkeeler mkeeler commented Dec 5, 2019

Before the fix: go test -count 10 ./agent/consul -run TestLeader_SecondaryCA_IntermediateRefresh would fail 9/10 times. After the fix, I ran it with -count 100 and it passed every time.

The real question here is whether needing to move these up is indicative of an actual issue.

cc: @banks

What I believe was happening was that the intermediate was being changed between when we got the updated intermediate from the secondary CA provider and when we actually signed the cert. Therefore, the leaf cert couldn't be verified because we were setting up the cert pool with the old intermediate.

I added some debug logging to the Consul CA provider so I could see when it was setting new intermediates and it did happen 3 times.

https://gist.github.com/mkeeler/c4a8a7f429f788641d1562a63bbe7251

@mkeeler mkeeler requested a review from a team December 5, 2019 00:22
@codecov
Copy link

codecov bot commented Dec 5, 2019

Codecov Report

Merging #6885 into master will increase coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6885      +/-   ##
==========================================
+ Coverage   65.83%   65.85%   +0.01%     
==========================================
  Files         435      435              
  Lines       52488    52488              
==========================================
+ Hits        34558    34564       +6     
+ Misses      13796    13788       -8     
- Partials     4134     4136       +2
Impacted Files Coverage Δ
agent/consul/flood.go 90.9% <0%> (-6.07%) ⬇️
agent/consul/acl_replication_legacy.go 83.05% <0%> (-3.39%) ⬇️
agent/consul/session_ttl.go 85.48% <0%> (-3.23%) ⬇️
agent/cache/watch.go 78.75% <0%> (-2.5%) ⬇️
api/kv.go 83.45% <0%> (-1.44%) ⬇️
api/watch/funcs.go 74.61% <0%> (-1.04%) ⬇️
agent/consul/connect_ca_endpoint.go 56.52% <0%> (-0.67%) ⬇️
command/exec/exec.go 66.56% <0%> (-0.6%) ⬇️
command/debug/debug.go 65.97% <0%> (-0.6%) ⬇️
agent/consul/rpc.go 78.57% <0%> (-0.38%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6817676...71b1c9c. Read the comment docs.

@mkeeler
Copy link
Member Author

mkeeler commented Dec 5, 2019

I think I know what the problem is. The flow goes something like this:

  1. initializeCA is called in the secondary which creates a CSR and gets it signed by the primary.
  2. The test which was waiting for the CA roots in the secondary sees it and will get the intermediate pem from the CA provider.
  3. secondaryCARootWatch is started which begins with a non-blocking request to get the CA roots and will then call initializeCA. This does another round of CSR and setting the intermediate.
  4. The CA config is set in the primary which causes root rotation.
  5. The test code is in a retry loop waiting for a new intermediate pem to be set on the secondary provider. It ends up picking up the one created by the root watch routine.
  6. The secondary blocking routine fires now that the root was rotated in the primary and a new round of CSR and intermediate setting happens.
  7. The test code signs a new leaf cert. This is done with the ca providers current intermediate which happens to be the third one.
  8. We attempt to verify the new leaf with the second cert from the initialize when starting the CA root watch routine. As the intermediate doesn't match the leaf the test fails.

The fixes in this PR work because we wait to get the "new" intermediate until after waiting for the roots to be updated in both the primary and secondary CAs. While I think it would be good to get this fix in (just so we can make CI happy and because that is how to make the test more robust), it did turn up a minor issue with the roots watcher which I have opened an issue for: #6886

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasonable quick fix, the issue(s) still need to be fixed but this helps CI some until they are and isn't "wrong".

@mkeeler mkeeler changed the title Fix the TestLeader_SecondaryCA_IntermediateRefresh test flakiness Fix the TestLeader_SecondaryCA_IntermediateRefresh test flakine… Dec 5, 2019
@mkeeler mkeeler merged commit c05ab15 into master Dec 5, 2019
@mkeeler mkeeler deleted the flake/ca-intermediate-refresh branch December 5, 2019 14:35
@ghost
Copy link

ghost commented Jan 25, 2020

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one 👍.

@ghost ghost locked and limited conversation to collaborators Jan 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants