Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add arguments to quilkin run #574

Merged
merged 21 commits into from
Sep 22, 2022
Merged

Add arguments to quilkin run #574

merged 21 commits into from
Sep 22, 2022

Conversation

XAMPPRocky
Copy link
Collaborator

@XAMPPRocky XAMPPRocky commented Aug 25, 2022

Closes #555, closes #572

This adds the basic options to run quilkin as a reverse proxy, allowing you to set the listen port and the destination address. Additionally I added some admin related options for setting admin port and disabling admin entirely.

@markmandel
Copy link
Member

Can we get some docs, probably here https://googleforgames.github.io/quilkin/main/book/using.html please?

Re-running CI to see if that's a flaky test error or not.

@XAMPPRocky
Copy link
Collaborator Author

It's not, I have the fix locally. The change is that the admin spawning is pulled out of the proxy to let the admin logic be shared amongst all commands, and these are tests that are testing the admin server.

image/Dockerfile Outdated Show resolved Hide resolved
entrypoint: bash
args:
- '-c'
- 'timeout --signal=INT --preserve-status 5s docker run --rm ${_REPOSITORY}quilkin:$(make version)'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than deleting, could make it (something like)

'timeout --signal=INT --preserve-status 5s docker run --rm ${_REPOSITORY}quilkin:$(make version)' run

To make it run in run mode.

Although one could argue the integration tests also test this (although this would provide a nicer error message if something basic was broken). I could go either way on this one.

@XAMPPRocky
Copy link
Collaborator Author

XAMPPRocky commented Sep 14, 2022

@markmandel Rather than duplicating documentation, I figured it was easiest and best to expose the CLI in the rust documentation and just link to that. That way we're maintaining a single source of truth as our documentation now matches what the CLI outputs in help.

I'm having trouble figuring out what these Agones tests want me to do, do you have any idea?

@markmandel
Copy link
Member

@markmandel Rather than duplicating documentation, I figured it was easiest and best to expose the CLI in the rust documentation and just link to that. That way we're maintaining a single source of truth as our documentation now matches what the CLI outputs in help.

Makes sense.

I'm having trouble figuring out what these Agones tests want me to do, do you have any idea?

I'm at an event at the moment, but will take a look as soon as I can 👍🏻

@markmandel
Copy link
Member

The other option, as arch issues can get tricky fast, is to force Docker to Linux amd64 for all things:

export DOCKER_DEFAULT_PLATFORM=linux/amd64

Might actually be the easiest option (we could also bake this into the Makefile in some capacity).

@XAMPPRocky
Copy link
Collaborator Author

TIL, I’ll try it with that tomorrow

@markmandel
Copy link
Member

Just tried to take it for a spin, but got stuck at the compilation issue I referenced above 😢

I was going to note as well, if you have a local kubectl config that has access to an Agones cluster (although if you have occasionally access issues, try running a kubectl command to refresh the token), you should be able to run the e2e test directly, of the IMAGE_TAG env var is set.

@XAMPPRocky
Copy link
Collaborator Author

Decided to just to revert removing the quilkin.yaml and changing the entrypoint for now, as I'd like to get these changes in and we can make those changes separately.

The other option, as arch issues can get tricky fast, is to force Docker to Linux amd64 for all things:

This doesn't work because the build-image is hardcoded to use x86_64 in multiple places, so the build-image needs to be reworked before it's usable on ARM machines.

@markmandel
Copy link
Member

Decided to just to revert removing the quilkin.yaml and changing the entrypoint for now, as I'd like to get these changes in and we can make those changes separately.

That makes sense 👍🏻

The other option, as arch issues can get tricky fast, is to force Docker to Linux amd64 for all things:

This doesn't work because the build-image is hardcoded to use x86_64 in multiple places, so the build-image needs to be reworked before it's usable on ARM machines.

Huh. That's strange, because if you set DOCKER_DEFAULT_PLATFORM is affects both build and run operations, so the image itself will be x86, so having it install x86 binaries is what should work in that environment. 🤔 Prune the local image cache maybe, if there are old versions lying around?

I almost wish I had a ARM machine so I could test this out. If you are keen, could also try a GVC sometime. It's entirely possible I'm missing something, since i'm shooting in the dark a bit.

In the meantime, I'll take on finding what has broken in the integration tests for this PR, and update you with my results (probably send a PR to your PR.)

Also I'll prioritise #593 such that you can use the same image from CI to locally test an integration test.

)]
quiet: bool,
/// The path to the configuration file for the Quilkin instance.
#[clap(short, long, env = "QUILKIN_CONFIG", default_value = "quilkin.yaml")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking of coming back around and fleshing out the help for the rest of the commands for a nice command line reference.

Curious if you didn't want a command line reference here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait - when I run the command, I see help everywhere,

Wait.... does clap take the help from the comment above it? If so, that's a super neat trick.

@markmandel
Copy link
Member

Digging into what's failing in the integration tests - not sure what's causing it, but looks like there is something trying to bind to 9091 more than once. I'm seeing this in a few of the tests.

This is in agones management mode:

─➤  kubectl logs -n 1663639419 quilkin-manage-agones-5fb9bf9df7-6rqm9
{"timestamp":"2022-09-20T02:09:39.317682Z","level":"INFO","fields":{"message":"Starting Quilkin","version":"0.4.0-dev","commit":"f45d971388ce356229cc922c36376f267d1e6d40"},"target":"quilkin::cli"}
{"timestamp":"2022-09-20T02:09:39.317882Z","level":"DEBUG","fields":{"message":"provided path not found","path":"quilkin.yaml"},"target":"quilkin::cli"}
{"timestamp":"2022-09-20T02:09:39.318230Z","level":"INFO","fields":{"message":"Starting admin endpoint","address":"0.0.0.0:9091"},"target":"quilkin::admin"}
{"timestamp":"2022-09-20T02:09:39.318298Z","level":"TRACE","fields":{"message":"registering event source with poller: token=Token(1), interests=READABLE | WRITABLE","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.4/src/poll.rs","log.line":521},"target":"mio::poll"}
{"timestamp":"2022-09-20T02:09:39.319482Z","level":"INFO","fields":{"message":"Starting admin endpoint","address":"0.0.0.0:9091"},"target":"quilkin::admin","span":{"name":"spawn"},"spans":[{"name":"spawn"}]}
thread 'main' panicked at 'error binding to 0.0.0.0:9091: error creating server listener: Address already in use (os error 98)', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/server/server.rs:77:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
{"timestamp":"2022-09-20T02:09:39.319629Z","level":"ERROR","fields":{"message":"Panic has occurred. Moving to Unhealthy","panic_info":"panicked at 'error binding to 0.0.0.0:9091: error creating server listener: Address already in use (os error 98)', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/server/server.rs:77:13"},"target":"quilkin::admin::health","span":{"name":"spawn"},"spans":[{"name":"spawn"}]}
{"timestamp":"2022-09-20T02:09:39.319646Z","level":"ERROR","fields":{"message":"Panic has occurred. Moving to Unhealthy","panic_info":"panicked at 'error binding to 0.0.0.0:9091: error creating server listener: Address already in use (os error 98)', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/server/server.rs:77:13"},"target":"quilkin::admin::health","span":{"name":"spawn"},"spans":[{"name":"spawn"}]}
{"timestamp":"2022-09-20T02:09:39.319634Z","level":"TRACE","fields":{"message":"no local config found, falling back to local in-cluster config","error":"failed to read kubeconfig from '\"/home/nonroot/.kube/config\"': No such file or directory (os error 2)"},"target":"kube_client::config"}
{"timestamp":"2022-09-20T02:09:39.320006Z","level":"TRACE","fields":{"message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.4/src/poll.rs","log.line":652},"target":"mio::poll"}
{"timestamp":"2022-09-20T02:09:39.320187Z","level":"DEBUG","fields":{"message":"buffer closing; waking pending tasks"},"target":"tower::buffer::worker"}
{"timestamp":"2022-09-20T02:09:39.320399Z","level":"TRACE","fields":{"message":"deregistering event source from poller","log.target":"mio::poll","log.module_path":"mio::poll","log.file":"/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/mio-0.8.4/src/poll.rs","log.line":652},"target":"mio::poll"}

(Side note - looks like the check in create_quilkin_pod shouldn't just check if it's running, because the pod is in crashloop backoff in the integration test).

Looks like all the tests are failing for this reason.

@markmandel
Copy link
Member

Aha! I can replicate. Running a local binary with a config file reproduces the error:

➜  agones-xonotic git:(pr/ep/run-args) ✗ ../../target/build-image/x86_64-unknown-linux-gnu/release/quilkin -c ./client-compress.yaml run
{"timestamp":"2022-09-20T02:26:50.508501Z","level":"INFO","fields":{"message":"Starting Quilkin","version":"0.4.0-dev","commit":"f45d971388ce356229cc922c36376f267d1e6d40"},"target":"quilkin::cli"}
{"timestamp":"2022-09-20T02:26:50.508851Z","level":"INFO","fields":{"message":"Starting admin endpoint","address":"0.0.0.0:9091"},"target":"quilkin::admin"}
{"timestamp":"2022-09-20T02:26:50.508956Z","level":"INFO","fields":{"message":"Starting","port":7000,"proxy_id":"markmandel"},"target":"quilkin::proxy"}
{"timestamp":"2022-09-20T02:26:50.508973Z","level":"INFO","fields":{"message":"Starting admin endpoint","address":"0.0.0.0:9091"},"target":"quilkin::admin"}
{"timestamp":"2022-09-20T02:26:50.509022Z","level":"ERROR","fields":{"message":"Panic has occurred. Moving to Unhealthy","panic_info":"panicked at 'error binding to 0.0.0.0:9091: error creating server listener: Address already in use (os error 98)', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/server/server.rs:77:13"},"target":"quilkin::admin::health"}
{"timestamp":"2022-09-20T02:26:50.509039Z","level":"ERROR","fields":{"message":"Panic has occurred. Moving to Unhealthy","panic_info":"panicked at 'error binding to 0.0.0.0:9091: error creating server listener: Address already in use (os error 98)', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/server/server.rs:77:13"},"target":"quilkin::admin::health"}
thread 'main' panicked at 'error binding to 0.0.0.0:9091: error creating server listener: Address already in use (os error 98)', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/server/server.rs:77:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtr

Looks like it's trying to recreate the admin server twice.

@@ -46,16 +46,6 @@ steps:
args:
- build
id: build
# Run the built images for 5 seconds to make sure that the entrypoint and default config works out of the box
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replicated the issue as well with running the container from this PR. I'd recommend leaving in this test, as it also will replicate the breaking issue with the Agones integration tests, but output the logs within cloud build, and make it much easier to debug if there is a problem with the base image.

We may want to consider running the container in a few different common configuration options might also be useful in a few tests like the above, just to prove it all still runs even without any traffic flowing through.

@markmandel
Copy link
Member

Re-running, so that the new CI failure messages can give us a build image to debug against.

src/cli.rs Outdated
.admin
.store(Arc::new(crate::config::Admin { address }));
}
Some(tokio::spawn(crate::admin::server(config.clone())))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this is the offending line causing the double creation of the admin server, since both Proxy::run and ControlPlane::from_arc both create admin instances and run them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing some debugging (code), it looks like the Config that comes from this line:

markmandel@09de12a#diff-b2812f19576dd53d0c35b107a322f58a00fce6977f4b5976e1961853982af3ccR93

let config = Arc::new(Self::read_config(self.config)?);

Has a None admin field, but looking at the code I'm not sure why it's None when it should have the default admin port 🤔

You can see it in my debug logs:

{"timestamp":"2022-09-22T04:41:50.829061Z","level":"DEBUG","fields":{"message":"Main Config","no_admin":false,"config":"Config { admin: None, clusters: None, filters: None, management_servers: None, proxy: None, version: None, metrics: Metrics { active_clusters: GenericGauge { v: Value { desc: Desc { fq_name: \"cluster_active\", help: \"Number of currently active clusters.\", const_label_pairs: [], variable_labels: [], id: 6550223101314893, dim_hash: 9568000379080501117 }, val: AtomicI64 { inner: 0 }, val_type: Gauge, label_pairs: [] } }, active_endpoints: GenericGauge { v: Value { desc: Desc { fq_name: \"cluster_active_endpoints\", help: \"Number of currently active endpoints.\", const_label_pairs: [], variable_labels: [], id: 9175583387471564798, dim_hash: 12641526325508385022 }, val: AtomicI64 { inner: 0 }, val_type: Gauge, label_pairs: [] } } } }"},"target":"quilkin::cli"}

I'll keep digging into this tomorrow unless you get to it first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've found the issue. I'll send a PR to your PR today.

@markmandel markmandel mentioned this pull request Sep 22, 2022
@markmandel markmandel linked an issue Sep 22, 2022 that may be closed by this pull request
This includes several fixes, tweaks and small improvements:

* Fix bug in `Slot impl Default` that would cause the default Admin
  config to be `None`.
* Fix double creation of Admin server
* Return cloudbuild timeout tests for easier CI debugging, and add one
  for command line config.
* Fixes and small additions to the "using" documentation.
@quilkin-bot
Copy link
Collaborator

Build Succeeded 🥳

Build Id: daaf20f4-73ad-4137-98c2-cbc32438f7f8

The following development images have been built, and will exist for the next 30 days:

To build this version:

git fetch [email protected]:googleforgames/quilkin.git pull/574/head:pr_574 && git checkout pr_574
cargo build

@markmandel markmandel merged commit bc8f780 into main Sep 22, 2022
@markmandel markmandel deleted the ep/run-args branch September 22, 2022 21:59
XAMPPRocky added a commit that referenced this pull request Oct 10, 2022
* Add CLI arguments to run
* Fix bug in `Slot impl Default` that would cause the default Admin
  config to be `None`.
* Return cloudbuild timeout tests for easier CI debugging, and add one
  for command line config.
* Fixes and small additions to the "using" documentation.

Co-authored-by: Mark Mandel <[email protected]>
XAMPPRocky added a commit that referenced this pull request Oct 10, 2022
* Add CLI arguments to run
* Fix bug in `Slot impl Default` that would cause the default Admin
  config to be `None`.
* Return cloudbuild timeout tests for easier CI debugging, and add one
  for command line config.
* Fixes and small additions to the "using" documentation.

Co-authored-by: Mark Mandel <[email protected]>
@markmandel markmandel added kind/feature New feature or request area/user-experience Pertaining to developers trying to use Quilkin, e.g. cli interface, configuration, etc labels Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/user-experience Pertaining to developers trying to use Quilkin, e.g. cli interface, configuration, etc kind/feature New feature or request size/m
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Configure basic configuration values with command flags Be able to configure tracing/log levels
3 participants