
Root CA getting generated even if error is returned #18667

Open
nsimons opened this issue Jan 11, 2023 · 4 comments

Comments

@nsimons
Contributor

nsimons commented Jan 11, 2023

Describe the bug
We came across a scenario where generating a root CA failed (an error was returned) but the CA was actually generated successfully.

Upon inspecting the code, we found the culprit here. The CA is generated and persisted, but other actions follow. If any of those actions fail, an error is returned to the caller, which suggests that the CA generation failed. But actually it succeeded. For example, in our case the function errors out when trying to build the CRL: b.crlBuilder.rebuild().

This can lead to some nasty side effects:

  • the client does not know whether the CA was actually generated or not
  • the client saw the error and tries to generate a new CA even though one already exists, accidentally creating a second one or hitting a new error
  • the client cannot tell whether all necessary "post-actions" after CA generation succeeded

To Reproduce
The easiest way is to modify the code to always return an error from b.crlBuilder.rebuild() (see the sketch below).
Then simply try to generate a root CA by following the tutorial: Generate Root CA.
Even if an error is returned, the CA is still visible, e.g. through the /issuers endpoint.
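
A minimal sketch of such a modification, assuming the root-generation path contains the rebuild call quoted later in this thread; the injected error and its message are purely illustrative:

	// Force the CRL rebuild step to fail so the error path after the issuer
	// has been persisted is exercised (requires the "errors" import).
	err = b.crlBuilder.rebuild(sc, true)
	if err == nil {
		err = errors.New("injected failure to reproduce this issue")
	}
	if err != nil {
		return nil, err
	}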

Expected behavior
An error should not be returned if CA generation is actually successful.
Alternatively, the CA should not be persisted if an error is returned.

Additional context
Happened on 1.11.3, but also reproducible on current main branch.

@cipherboy
Contributor

@nsimons I think this is desired behavior, within the bounds of Vault today.

A warning is meant to indicate there is more to do, or something might not have been done quite as intended.

An error indicates something went wrong and, IMO, may not necessarily be fatal.

In our case, Vault today lacks storage transactions, making it nearly impossible to fully and cleanly revert a partial operation.

Here, an error occurred during CRL building. This isn't fatal -- the issuer was still generated and is discoverable under the APIs -- but it indicates that more up-to-date CRLs will be unavailable until this is addressed and rectified. This could lead to operational issues if left unaddressed: if clients attempt to fetch more up-to-date CRLs, none may be available.

@nsimons
Contributor Author

nsimons commented Jan 23, 2023

Right, yeah, it's a classic problem.

Would it then make sense for the client to always delete the issuer and start over if it encounters an error? The client cannot really know whether an error is something it can ignore (e.g. the CRL will be rebuilt later) or something it should act on (e.g. the CA was generated but the serial number was not stored, so revocation is not possible?). We can't really inspect error messages, since those can change between Vault versions.

@cipherboy
Contributor

cipherboy commented Jan 23, 2023

I don't think that's sufficient in all cases. To take your other issue (#18583 - sorry :-), if you were to chmod revocation entries, you could end up with a state wherein 1. all previous issuer creations would've succeeded, and 2. any subsequent issuer creation would fail because CRL building now fails. Here, the client cleaning up and retrying wouldn't necessarily help: the issue isn't with the new issuer, but with some other underlying bug in storage.

As much as we can, we do attempt to keep key substrings of error messages stable, but I agree that in general you can't rely on it...

What about a usage-based testing approach? You can list issuers and keys before/after, attempt to issue a new (short-lived) cert under the new issuer, attempt to revoke it, and rotate the CRL (and optionally run a tidy with a short safety buffer once the temporary leaf has expired).

If you don't see the issuer, you know it really is a fatal error somewhere (perhaps bad parameters?); if you're able to use it but not able to revoke, you could flag the mount to a human operator; or it may all work fine.
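
A minimal sketch of such a check using the official Go API client (github.com/hashicorp/vault/api); the mount path "pki", the role name "smoke-test", and the common name are assumptions, and error handling is reduced to logging:

	package main

	import (
		"log"

		vault "github.com/hashicorp/vault/api"
	)

	func main() {
		// Assumes VAULT_ADDR and VAULT_TOKEN are set in the environment.
		client, err := vault.NewClient(vault.DefaultConfig())
		if err != nil {
			log.Fatal(err)
		}

		// 1. Is the issuer actually listed after the (possibly failed) generate call?
		issuers, err := client.Logical().List("pki/issuers")
		if err != nil || issuers == nil {
			log.Fatalf("no issuers listed, treat root generation as fatal: %v", err)
		}

		// 2. Issue a short-lived leaf under an assumed test role.
		leaf, err := client.Logical().Write("pki/issue/smoke-test", map[string]interface{}{
			"common_name": "healthcheck.example.com",
			"ttl":         "1m",
		})
		if err != nil {
			log.Fatalf("issuance failed, flag the mount to an operator: %v", err)
		}

		// 3. Revoke the leaf and rotate the CRL; failures here point at CRL or
		//    storage problems rather than at the new issuer itself.
		if _, err := client.Logical().Write("pki/revoke", map[string]interface{}{
			"serial_number": leaf.Data["serial_number"],
		}); err != nil {
			log.Printf("revocation failed: %v", err)
		}
		if _, err := client.Logical().Read("pki/crl/rotate"); err != nil {
			log.Printf("CRL rotation failed: %v", err)
		}

		log.Println("usage-based check finished")
	}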

(The assumption here is that you need automated creation of Root CAs, which seems a little unusual as a use case but could be an interesting problem in general :-)

@nsimons
Contributor Author

nsimons commented Jan 24, 2023

No worries, the background info is the same across the two issues ;). Basically, we have hundreds of PKI backends and sometimes the storage has a hiccup. So for us, re-creating the issuer and trying again could help, I believe (we only have one issuer per PKI mount).

Usage-based testing is a good idea and that's what we are now trying to do. We check the issuers endpoint to determine whether the CA was created or not. So far it works, but I feel it's not that future-proof.

If CRL (re)building is not fatal, would it make sense to handle the error case differently?

	// Build a fresh CRL
	err = b.crlBuilder.rebuild(sc, true)
	if err != nil {
		return nil, err
	}

Issue a warning and continue (resp.AddWarning), return resp, err, or kick off the CRL rebuild as a goroutine and continue?
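
For what it's worth, a rough sketch of the first option (warn and continue), assuming resp is the *logical.Response being built by the handler and that fmt is imported; the warning text is illustrative:

	// Build a fresh CRL, but don't fail the whole request if it errors:
	// the issuer has already been generated and persisted at this point.
	if err := b.crlBuilder.rebuild(sc, true); err != nil {
		resp.AddWarning(fmt.Sprintf(
			"issuer generated, but building the CRL failed: %v; "+
				"the CRL will need to be rebuilt later", err))
	}
	// ...continue with the rest of the handler and eventually return resp, nil.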
