
[Secure Storage] Update vault request timeout and error handling. #7193

Merged (1 commit) on Jan 12, 2021

Conversation

@JoshLind (Contributor) commented Jan 11, 2021

Motivation

This PR makes three small improvements to the vault secure storage client:

  1. Set the timeout() and timeout_connection() values explicitly. Currently, timeout() is not set, so vault could hang on reads/writes and the client code could block indefinitely (see the sketch after this list).
  2. Ensure "ureq::Response::into_string()" is called on all response handling paths. This is required to clear the response buffers and relates to "Attempt to return dropped Body to pool" (algesten/ureq#94).
  3. Use "SyntheticError" and "SerializationError" where appropriate inside the client error handling code instead of "InternalError", which is a bit too generic.
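
As a rough illustration of points 1 and 2, a vault read under these changes might look like the sketch below. This is a minimal sketch only, assuming the ureq 1.x builder API (timeout_connect() taking milliseconds, timeout() taking a Duration); the vault_get() helper and its error handling are hypothetical and not the exact client code in this PR.

use std::time::Duration;

// Constant names follow the diff below; the review discussion later
// settles on 10_000 for both values.
const CONNECT_TIMEOUT_MILLISECS: u64 = 3000;
const TIMEOUT_MILLISECS: u64 = 5000;

// Hypothetical helper: both timeouts are set explicitly, and the
// response body is always consumed via into_string().
fn vault_get(url: &str, token: &str) -> Result<String, String> {
    let response = ureq::get(url)
        .set("X-Vault-Token", token)
        .timeout_connect(CONNECT_TIMEOUT_MILLISECS)
        .timeout(Duration::from_millis(TIMEOUT_MILLISECS))
        .call();

    // Consuming the body clears the response buffers so the connection
    // can be returned to the pool (algesten/ureq#94), even on error paths.
    response.into_string().map_err(|error| error.to_string())
}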

Have you read the Contributing Guidelines on pull requests?

Yes.

Test Plan

All tests pass locally.

Related PRs

None, but these changes come as a result of the ongoing investigation of https://github.com/diem/partners/issues/727

@JoshLind requested a review from davidiw January 11, 2021 19:26
@JoshLind changed the title from "[Draft] Draft." to "[Secure Storage] Update vault request timeout and error handling." January 11, 2021
@JoshLind marked this pull request as ready for review January 11, 2021 21:17
-const TIMEOUT: u64 = 10_000;
+/// Request timeouts for vault operations.
+const CONNECT_TIMEOUT_MILLISECS: u64 = 3000;
+const TIMEOUT_MILLISECS: u64 = 5000;
JoshLind (Contributor Author):
@mgorven, @davidiw, you may have to come to some sort of compromise regarding these values 😄

Reviewer:
How much work would it be to make these configurable?

JoshLind (Contributor Author):
Per component? e.g., using the config files? I believe a PR for this was started last year but never landed (#5229). Given the amount of code it requires, do we think configurable timeouts would be that useful?

Reviewer:
In the Vault backend config section (wherever that gets used). Yes, timeouts are almost always configurable and do get used :-) If they are configurable we can set fairly aggressive defaults, but if that causes issues we can tune them in deployments. If they're hardcoded they have to be much more conservative.

JoshLind (Contributor Author):
That's fair -- if you can see the benefit then I'm happy to do that. Will probably have to do it as a follow up though -- might be too far beyond the scope of this PR 😄

Contributor:
I think you can attach it as another commit to this PR or you can keep the values at 10_000.
I'm not comfortable changing values without having knobs at this point.

JoshLind (Contributor Author) Jan 11, 2021:
Agreed, I've changed them to 10_000 and will then follow up with a configurable timeout PR (I think this will be cleaner/easier to manage).
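
For context on the follow-up, configurable timeouts in the Vault backend config could look roughly like the sketch below. This is a hypothetical illustration under stated assumptions: the struct name, field names, and defaults are invented here and are not the actual Diem config code.

use serde::{Deserialize, Serialize};

// Hypothetical config section; field names and defaults are illustrative.
#[derive(Clone, Debug, Deserialize, Serialize)]
#[serde(default)]
struct VaultTimeoutConfig {
    connection_timeout_ms: u64,
    response_timeout_ms: u64,
}

impl Default for VaultTimeoutConfig {
    fn default() -> Self {
        Self {
            // Conservative defaults matching the 10_000 settled on in this
            // PR; deployments could override them if needed.
            connection_timeout_ms: 10_000,
            response_timeout_ms: 10_000,
        }
    }
}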

 let status = resp.status();
 let status_text = resp.status_text().to_string();
 match resp.into_string() {
     Ok(body) => Error::HttpError(status, status_text, body),
-    Err(e) => Error::InternalError(e.to_string()),
+    Err(error) => Error::SerializationError(error.to_string()),
Contributor:
Why is this a serialization error? I think this is an internal error. Same with the one above.

JoshLind (Contributor Author):
Sure, although I preferred SerializationError as internally "resp.into_string()" performs serialization of the response (e.g., unchunking of the response bytes and using from_utf8_lossy), so if the call fails, it's a SerializationError. But granted, this is internal to the "resp.into_string()". Happy to revert to "InternalError" or perhaps something else, e.g., "ResponseConversionError"?

Contributor:
I wouldn't say UTF8 from bytes is serialization in the traditional sense. I guess the question is when would this result in an error? Since it isn't well defined, I called it an internal error. A more accurate statement might be "ResponseParsingError" ? yuck :P

Contributor:
Actually I'm 100% certain that this should be an internal error, for example, this could be caused by a terminated connection. I'll accept with the assumption we'll revert these back.

JoshLind (Contributor Author):
Okay, cool, thanks @davidiw -- sounds good!
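
Putting the thread's conclusion together, the error-conversion path would end up roughly like the sketch below. The Error enum here is a hypothetical stand-in that only includes the variants discussed in this PR; the real client error type has more variants and may differ in shape.

// Hypothetical stand-in for the client's error enum.
#[derive(Debug)]
enum Error {
    HttpError(u16, String, String),
    InternalError(String),
}

// Convert a failed vault response into a client error, consuming the
// body via into_string() so the connection can be reclaimed (ureq#94).
fn process_error_response(resp: ureq::Response) -> Error {
    let status = resp.status();
    let status_text = resp.status_text().to_string();
    match resp.into_string() {
        Ok(body) => Error::HttpError(status, status_text, body),
        // Per the review discussion, a failure here (e.g. a terminated
        // connection) stays an InternalError rather than becoming a
        // SerializationError.
        Err(error) => Error::InternalError(error.to_string()),
    }
}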

@JoshLind force-pushed the vault_fix branch 2 times, most recently from 7affe2f to 3a2efd9, January 12, 2021 18:42
@JoshLind (Contributor Author):
Thanks @davidiw 😄
/land

@github-actions:
Cluster Test Result

Compatibility test results for land_b02a891e ==> land_e65cc44e (PR)
1. All instances running land_b02a891e, generating some traffic on network
2. First validator land_b02a891e ==> land_e65cc44e, to validate storage
3. First batch validators (14) land_b02a891e ==> land_e65cc44e, to test consensus
4. Second batch validators (15) land_b02a891e ==> land_e65cc44e, to upgrade rest of the validators
5. All full nodes (30) land_b02a891e ==> land_e65cc44e, to finish the network upgrade
all up : 917 TPS, 4927 ms latency, 6200 ms p99 latency, no expired txns
Logs: http://kibana.ct-0-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-01-12T19:38:41Z',to:'2021-01-12T19:59:59Z'))
Dashboard: http://grafana.ct-0-k8s-testnet.aws.hlw3truzy4ls.com/d/performance/performance?from=1610480321000&to=1610481599000
Validator 1 logs: http://kibana.ct-0-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-01-12T19:38:41Z',to:'2021-01-12T19:59:59Z'))&_a=(columns:!(log),query:(language:kuery,query:'kubernetes.pod_name:"val-1"'),sort:!(!('@timestamp',desc)))

Repro cmd:

./scripts/cti --tag land_b02a891e --cluster-test-tag land_e65cc44e -E RUST_LOG=debug -E BATCH_SIZE=15 -E UPDATE_TO_TAG=land_e65cc44e --report report.json --suite land_blocking_compat

🎉 Land-blocking cluster test passed! 👌

@bors-libra (Contributor):
💔 Test Failed - commit-workflow

@JoshLind (Contributor Author):
/land

@github-actions:
Cluster Test Result

Compatibility test results for land_99a69959 ==> land_7e094ec2 (PR)
1. All instances running land_99a69959, generating some traffic on network
2. First validator land_99a69959 ==> land_7e094ec2, to validate storage
3. First batch validators (14) land_99a69959 ==> land_7e094ec2, to test consensus
4. Second batch validators (15) land_99a69959 ==> land_7e094ec2, to upgrade rest of the validators
5. All full nodes (30) land_99a69959 ==> land_7e094ec2, to finish the network upgrade
all up : 914 TPS, 4954 ms latency, 6250 ms p99 latency, no expired txns
Logs: http://kibana.ct-1-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-01-12T21:28:42Z',to:'2021-01-12T21:49:32Z'))
Dashboard: http://grafana.ct-1-k8s-testnet.aws.hlw3truzy4ls.com/d/performance/performance?from=1610486922000&to=1610488172000
Validator 1 logs: http://kibana.ct-1-k8s-testnet.aws.hlw3truzy4ls.com/app/kibana#/discover?_g=(time:(from:'2021-01-12T21:28:42Z',to:'2021-01-12T21:49:32Z'))&_a=(columns:!(log),query:(language:kuery,query:'kubernetes.pod_name:"val-1"'),sort:!(!('@timestamp',desc)))

Repro cmd:

./scripts/cti --tag land_99a69959 --cluster-test-tag land_7e094ec2 -E RUST_LOG=debug -E BATCH_SIZE=15 -E UPDATE_TO_TAG=land_7e094ec2 --report report.json --suite land_blocking_compat

🎉 Land-blocking cluster test passed! 👌
