Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ClusterMap #785

Merged
merged 2 commits into from
Sep 20, 2023
Merged

Refactor ClusterMap #785

merged 2 commits into from
Sep 20, 2023

Conversation

XAMPPRocky
Copy link
Collaborator

@XAMPPRocky XAMPPRocky commented Sep 10, 2023

This PR refactors our internal storage of transmission of clusters over xDS to be more focused on endpoints rather than mimicking Envoys layout. We weren't using basically any of it, and having all of those extra fields was creating a lot of extra bytes and made the encoding logic more complicated.

This PR moves to using a simpler map of locality -> endpoints, as that was the only part we were using, and uses our own proto types for xDS discovery. We still use the core discovery request and response wrapper, it's just the types inside are now Quilkin types rather than envoy types.

We don't have a xDS benchmark, but I wrote a small micro benchmark of just the time and size of discovery responses, and there's a significant improvement to using this implementation, being roughly 30%-250% faster depending on the endpoint amounts (being faster for smaller amounts of endpoints where the message size overhead is more noticeable), and being 12% smaller.

Old

endpoints time (seconds) size (bytes)
0 0.001 53
100 0.005 5537
1000 0.028 55013
10000 0.117 550013
65535 0.721 3653593

New

endpoints time (seconds) size (bytes)
0 0.001 53
100 0.002 4910
1000 0.011 48984
10000 0.085 489984
65535 0.539 3260352

Diff

endpoints time (faster) size (smaller)
100 2.5x 12%
1000 2.5x 12%
10000 37% 12%
65535 33% 12%
#[test]
fn protobuf_test() {
    let measure = |count: u16| {
        let config = Config::default();
        if count != 0 {
            config.clusters.write().insert_default(crate::endpoint::LocalityEndpoints::from((0..count).map(|i| Endpoint::from((std::net::Ipv4Addr::UNSPECIFIED, i))).collect::<Vec<_>>()));
        }
        let response_creation = std::time::Instant::now();
        let response = config.discovery_request("", ResourceType::Cluster, &[]).unwrap();
        let time = response_creation.elapsed().as_secs_f64();
        let size = crate::prost::encode(&response).unwrap().len();
        println!("count: {count}, time: {time}, size: {size}");
    };

    (measure)(0);
    (measure)(100);
    (measure)(1000);
    (measure)(10000);
    (measure)(u16::MAX);
    panic!();
}

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@markmandel
Copy link
Member

Lemme know if you want me to take a look, but also reminding you that we have:

quilkin/agones/src/lib.rs

Lines 418 to 425 in d8e93b6

// Output the events and logs for each pod that matches this label selector.
// Useful for determining why something is failing in CI without having to run a cluster.
// Requires quilkin::test_utils::enable_log("agones=debug"); to enable debug logging within
// the test
pub async fn debug_pods(client: &Client, labels: String) {
debug!(labels, "🪓 Debug output for Selector");
let pods: Api<Pod> = client.namespaced_api();
let events: Api<Event> = client.namespaced_api();

In case you want to try debugging in here.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

1 similar comment
@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@XAMPPRocky
Copy link
Collaborator Author

In case you want to try debugging in here.

I added it but it doesn't seem to be working? 🤔

@@ -57,6 +57,7 @@ mod tests {
/// for this test, we should only have single Agones integration test in this suite, since they
/// could easily collide with each other.
async fn agones_token_router() {
quilkin::test_utils::enable_log("agones=debug");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on what aspect you want to debug, you will nee to also include a call to debug_pods(client: &Client, labels: String) to actually get logs and events about those pods.

For example, debug_pods(client, "role=proxy") will output debug information about the proxy instances. You can see the labels they have here:

let labels = BTreeMap::from([("role".to_string(), "proxy".to_string())]);

(This would be good to add to the doc on debug_pods)

quilkin/agones/src/lib.rs

Lines 418 to 425 in d8e93b6

// Output the events and logs for each pod that matches this label selector.
// Useful for determining why something is failing in CI without having to run a cluster.
// Requires quilkin::test_utils::enable_log("agones=debug"); to enable debug logging within
// the test
pub async fn debug_pods(client: &Client, labels: String) {
debug!(labels, "🪓 Debug output for Selector");
let pods: Api<Pod> = client.namespaced_api();
let events: Api<Event> = client.namespaced_api();
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looking at the build log, it's not failing because something failed to come up (as far as I can tell anyway), it's failing here:

assert!(failed, "Packet should have failed");

So the existing debug_pod statements aren't going to help in this moment - they have to be conditionally put in.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@@ -181,7 +182,7 @@ filters:
.as_mut()
.map(|annotations| annotations.remove(token_key).unwrap());
gameservers.replace(name.as_str(), &pp, &gs).await.unwrap();

debug_pods(client, "role=proxy");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion might be to change line 201 from

assert!(failed, "Packet should have failed");

to something like:

if failed {
  // proxies
  debug_pods(client, "role=proxy");
  // xds server
  debug_pods(client, "role=xds");

  panic!("Packet should have failed");
}

Then the condition can always stay in the test in case of future debugging requirements.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

1 similar comment
@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@markmandel
Copy link
Member

Ah, parse_ipv6_endpoint has localities in the config still

quilkin/src/config.rs

Lines 412 to 424 in de49889

fn parse_ipv6_endpoint() {
let config: Config = serde_json::from_value(json!({
"version": "v1alpha1",
"clusters":{
"default":{
"localities": [{
"endpoints": [{
"address": "[2345:0425:2CA1:0000:0000:0567:5673:24b5]:25999"
}],
}]
}
}
}))

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

6 similar comments
@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@XAMPPRocky XAMPPRocky enabled auto-merge (squash) September 20, 2023 14:49
@markmandel markmandel enabled auto-merge (squash) September 20, 2023 17:54
@github-actions
Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@markmandel markmandel merged commit 1d2bec6 into main Sep 20, 2023
5 checks passed
@quilkin-bot
Copy link
Collaborator

Build Succeeded 🥳

Build Id: a2976a9a-4d56-4064-9187-c4149131653b

The following development images have been built, and will exist for the next 30 days:

To build this version:

git fetch [email protected]:googleforgames/quilkin.git pull/785/head:pr_785 && git checkout pr_785
cargo build

@markmandel markmandel deleted the ep/refactor-cluster-map branch September 20, 2023 18:28
@markmandel markmandel added the kind/cleanup Refactoring code, fixing up documentation, etc label Oct 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/cleanup Refactoring code, fixing up documentation, etc size/xl
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants