Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Admission webhook "mutations.stable.agones.dev" errors with Invalid FleetAutoscaler #406

Closed
Oleksii-Terekhov opened this issue Nov 7, 2018 · 18 comments
Assignees
Labels
kind/bug These are bugs. kind/design Proposal discussing new features / fixes and how they should be implemented
Milestone

Comments

@Oleksii-Terekhov
Copy link

Oleksii-Terekhov commented Nov 7, 2018

Bare-metal k8s v1.10.4, helm v2.9.1, Agones (installed by helm) v0.5.0,
Linux 4.4.159-1.el7.elrepo.x86_64, Centos 7.4

Custom helm chart for my deployment with fleet and fleetAutoscaler

  policy:
    type: Buffer
    buffer:
      bufferSize: 2
      minReplicas: 0
      maxReplicas: 10

At helm install almost always successful, but sometimes(1:10):

Release "xxx" does not exist. Installing it now.
Error: release xxx failed: Internal error occurred: admission webhook 
  "mutations.stable.agones.dev" denied the request: Invalid FleetAutoscaler

Same error in kube-apiserver log:

net/http.HandlerFunc.ServeHTTP(0xc42a10c2e0, 0x7f4ba00d9160, 0xc4216cf320, 0xc44a31e200)
        /usr/local/go/src/net/http/server.go:1918 +0x44
k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1(0xc42a10c380, 0xb1e9820, 0xc4216cf320, 0xc44a31e200, 0xc43dd541e0)
        /workspace/anago-v1.10.4-beta.0.68+5ca598b4ba5abb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:93 +0x8d
created by k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP
        /workspace/anago-v1.10.4-beta.0.68+5ca598b4ba5abb/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:92 +0x1ab

logging error output: "{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Internal error occurred: admission webhook \\\"mutations.stable.agones.dev\\\" denied the request: Invalid FleetAutoscaler\",\"reason\":\"InternalError\",\"details\":{\"causes\":[{\"message\":\"admission webhook \\\"mutations.stable.agones.dev\\\" denied the request: Invalid FleetAutoscaler\"}]},\"code\":500}\n"
 [[tiller/v0.0.0 (linux/amd64) kubernetes/$Format] 10.10.20.31:34754]

As background - 20-40 fleets with autoscalers in one namespace with frequent parallel redeploy (install/upgrade/delete via helm)

Which log can help resolve this critical issue?
Almost always sequential ./helm upgrade --force --install --wait --timeout 1200 ... resolve issue.

Maybe race condition between fleet and correspond autoscaler creation/update?

Update 1: I try apply autoscaler without fleet - got same error

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/agones/master/examples/simple-udp/fleetautoscaler.yaml

Error from server (InternalError): error when creating "https://raw.githubusercontent.com/GoogleCloudPlatform/agones/master/examples/simple-udp/fleetautoscaler.yaml": Internal error occurred: admission webhook "mutations.stable.agones.dev" denied the request: Invalid FleetAutoscaler

Update 2: helm/helm#2994 - common CRD issue...

@Oleksii-Terekhov
Copy link
Author

Kubernetes is a declarative system. Users provide manifests that define desired state and Kubernetes converges the current state accordingly. If the workloads in question don't support this model because they require specific ordering, that means the workloads do not support a declarative model. In this case, the focus should be on fixing the workloads to support orderless installation via logic inside the pods.
(c) https://github.com/gabrtv

@markmandel markmandel added the question I have a question! label Nov 7, 2018
@markmandel
Copy link
Member

On my phone atm, so can't go too deep, but check the Troubleshooting guide for a review of logs:

https://github.com/GoogleCloudPlatform/agones/blob/master/docs/troubleshooting.md

Also, what happens when you debug output your Helm chart - does it look correct?

@Oleksii-Terekhov
Copy link
Author

Helm chart work fine when create fleet before autoscaler 👍
When helm create autoscaler before fleet - install blow up 👎

@markmandel
Copy link
Member

This was by initial design, and intended behaviour. It means that Fleets and Fleet Autoscalers get garbage collected appropriately when the Fleet gets deleted.

@victor-prodan Anything you want to add here (since you wrote this)? Should we decouple them, and allow Fleet autoscalers to be orphaned? @Oleksii-Terekhov has some valid points about being able to define these declaratively. It puts more onus on the developer to cleanup these artifacts though.

The other issue here is that if we maintain the current functionality should have a validationHandler for the checking of the fleetName value, as that will return a much more readable validation error, with a reason - rather than "500", with no message (which is what happens with a mutation handler that rejects the call).

Also hunting down a workaround, at least for the moment.

@markmandel
Copy link
Member

I think you can use lifecycle hooks to solve this problem for the moment (would love to hear if it works for you) - but this does sound like an issue worth discussion - do we want to force developers down this path?

https://docs.helm.sh/developing_charts/#hooks

@Oleksii-Terekhov
Copy link
Author

Maybe autoscaler must be part of fleet definitions?

Also question about minReplicas - why you need this? I cannot realize when minReplicas <>0 in production environment. Value in bufferSize resolve all issues....

Helm hooks (https://github.com/helm/helm/blob/master/docs/charts_hooks.md)looks weird due:

Hook resources are not managed with corresponding releases

The resources that a hook creates are not tracked or managed as part of the release. Once Tiller verifies >that the hook has reached its ready state, it will leave the hook resource alone.

Practically speaking, this means that if you create resources in a hook, you cannot rely upon helm delete >to remove the resources. To destroy such resources, you need to either write code to perform this >operation in a pre-delete or post-delete hook or add "helm.sh/hook-delete-policy" annotation to the >hook template file.

@markmandel
Copy link
Member

Aah. that's a good point re: hooks not being managed. Okay, that's a deal breaker.

Original fleet autoscaling design ticket for context.

@markmandel
Copy link
Member

Quick thought on workaround: Do you have the fleet + fleetautoscaler definitions in the same file, or separate files in your helm chart?

Wondering if they are in the same template file, it will solve the issue for now.

@Oleksii-Terekhov
Copy link
Author

Oleksii-Terekhov commented Nov 8, 2018

In the separate files... I'll try join they now.
But original helm flow independent of file-level granularity, manifest kind+hooks may change order.

Update for @markmandel: "helm delete --purge" + "helm install" with joined template still fails with same error "Invalid FleetAutoscaler"

@victor-prodan
Copy link
Contributor

Also question about minReplicas - why you need this? I cannot realize when minReplicas <>0 in production environment. Value in bufferSize resolve all issues....

Here is an example: you need to restart the matchmaker servers so you know there will be a temporary down spike in demand, but the requests will come back as usual after a short while. In this case, you want to keep your dedicated server cluster warmed up to the current size and ready to handle the load.

To do this, just set minReplicas to your current level before restarting the matchmaker and remove it after the load has resumed.

@victor-prodan
Copy link
Contributor

@victor-prodan Anything you want to add here (since you wrote this)? Should we decouple them, and allow Fleet autoscalers to be orphaned?

I don't know, I can see it going both ways.
Maybe @EricFortin has a stronger opinion on this?

@Oleksii-Terekhov
Copy link
Author

In design ticket exist point:

  • autoscaling settings can be part of the Fleet specification, it would make sense; each fleet must have a different autoscaling settins

@markmandel
Copy link
Member

I'm digging into the HorizontalPodAutoscaler here - see what it does when a Deployment doesn't exist. I can't find any docs, and digging around for the code. I may just have to setup an example and try it! 😄

Figure following their guide here probably makes the most amount of sense.

@markmandel markmandel added kind/bug These are bugs. kind/design Proposal discussing new features / fixes and how they should be implemented labels Nov 9, 2018
@EricFortin
Copy link
Collaborator

I would definitely go with what the HPA does but my instinct says we should keep them locked together. Let's see if my instinct is right haha.

@markmandel markmandel self-assigned this Nov 13, 2018
@markmandel
Copy link
Member

Checking HPA - yes, you can create one without having a Deployment.

What happens is that you get a warning event on the HPA if it can't find the Deployment - so that the issue user visible.

I propose we should go down this path (allow a FleetAutoscaler be created without a Fleet) - it does allow for nicer decoupling, and still lets a user see why a FleetAutoscaler may not be working.

Small question - should this mean you can change which Fleet a FleetAutoscaler is pointed at? (I'm leaning towards yes, just because I can't think of a downside, and it may be a nice bit of flexibility).

Thoughts?

@Oleksii-Terekhov
Copy link
Author

Is there a chance for this in release 0.6.0?

@markmandel
Copy link
Member

Working on it right now! Probably have a PR today.

@markmandel markmandel removed the question I have a question! label Nov 14, 2018
markmandel added a commit to markmandel/agones that referenced this issue Nov 15, 2018
This removes some previous restrictions with the FleetAutoscaler,
as well as cleaning up a couple of items.

- FleetAutoscalers can now have a Fleet target that doesn't exit.
  - If the target doesn't exist, this will be recorded as a warning event
    of type `FailedGetFleet`, rather than `ScalingLimited` status item,
    as it allows a human readable message, as well as give us data on how
    often this occurs, and it is readable by stat/alerting packages.
- FleetAutoscalers can now have their Fleet target edited. This
  means that FleetAutoscalers can be transitioned from one Fleet
  to another. This may be useful in scenarios like red-green deployments, or
  if you wish to temporarily disable an Autoscaler, but not delete it.
- `FleetAutoscaler.Status.ScalingLimited` was also replaced by a `ScalingLimited`
  Event, as it allows a human readable message, as well as give us data on how
  often this occurs, and it is readable by stat/alerting packages.

Closes googleforgames#406
@markmandel
Copy link
Member

PR #416 in place -- as I said on the PR, I made some decisions that I think make sense, but also consider this a sacrificial draft, so definitely let me know if you see issues.

markmandel added a commit to markmandel/agones that referenced this issue Nov 15, 2018
This removes some previous restrictions with the FleetAutoscaler,
as well as cleaning up a couple of items.

- FleetAutoscalers can now have a Fleet target that doesn't exit.
  - If the target doesn't exist, this will be recorded as a warning event
    of type `FailedGetFleet`, rather than `ScalingLimited` status item,
    as it allows a human readable message, as well as give us data on how
    often this occurs, and it is readable by stat/alerting packages.
- FleetAutoscalers can now have their Fleet target edited. This
  means that FleetAutoscalers can be transitioned from one Fleet
  to another. This may be useful in scenarios like red-green deployments, or
  if you wish to temporarily disable an Autoscaler, but not delete it.
- `FleetAutoscaler.Status.ScalingLimited` was also replaced by a `ScalingLimited`
  Event, as it allows a human readable message, as well as give us data on how
  often this occurs, and it is readable by stat/alerting packages.

Closes googleforgames#406
markmandel added a commit to markmandel/agones that referenced this issue Nov 16, 2018
This removes some previous restrictions with the FleetAutoscaler,
as well as cleaning up a couple of items.

- FleetAutoscalers can now have a Fleet target that doesn't exit.
  - If the target doesn't exist, this will be recorded as a warning event
    of type `FailedGetFleet`, rather than `ScalingLimited` status item,
    as it allows a human readable message, as well as give us data on how
    often this occurs, and it is readable by stat/alerting packages.
- FleetAutoscalers can now have their Fleet target edited. This
  means that FleetAutoscalers can be transitioned from one Fleet
  to another. This may be useful in scenarios like red-green deployments, or
  if you wish to temporarily disable an Autoscaler, but not delete it.
- `FleetAutoscaler.Status.ScalingLimited` was also replaced by a `ScalingLimited`
  Event, as it allows a human readable message, as well as give us data on how
  often this occurs, and it is readable by stat/alerting packages.

Closes googleforgames#406
markmandel added a commit to markmandel/agones that referenced this issue Nov 17, 2018
This removes some previous restrictions with the FleetAutoscaler,
as well as cleaning up a couple of items.

- FleetAutoscalers can now have a Fleet target that doesn't exit.
  - If the target doesn't exist, this will be recorded as a warning event
    of type `FailedGetFleet`, as it allows a human readable message, as
    well as give us data on how often this occurs, and it is readable by
    stat/alerting packages.
- FleetAutoscalers can now have their Fleet target edited. This
  means that FleetAutoscalers can be transitioned from one Fleet
  to another. This may be useful in scenarios like red-green deployments, or
  if you wish to temporarily disable an Autoscaler, but not delete it.
- `FleetAutoscaler.Status.ScalingLimited` has an added
  Event, as it allows a human readable message, as well as give us data on how
  often this occurs, and it is readable by stat/alerting packages.

Closes googleforgames#406
markmandel added a commit to markmandel/agones that referenced this issue Nov 19, 2018
This removes some previous restrictions with the FleetAutoscaler,
as well as cleaning up a couple of items.

- FleetAutoscalers can now have a Fleet target that doesn't exit.
  - If the target doesn't exist, this will be recorded as a warning event
    of type `FailedGetFleet`, as it allows a human readable message, as
    well as give us data on how often this occurs, and it is readable by
    stat/alerting packages.
- FleetAutoscalers can now have their Fleet target edited. This
  means that FleetAutoscalers can be transitioned from one Fleet
  to another. This may be useful in scenarios like red-green deployments, or
  if you wish to temporarily disable an Autoscaler, but not delete it.
- `FleetAutoscaler.Status.ScalingLimited` has an added
  Event, as it allows a human readable message, as well as give us data on how
  often this occurs, and it is readable by stat/alerting packages.

Closes googleforgames#406
markmandel added a commit that referenced this issue Nov 19, 2018
This removes some previous restrictions with the FleetAutoscaler,
as well as cleaning up a couple of items.

- FleetAutoscalers can now have a Fleet target that doesn't exit.
  - If the target doesn't exist, this will be recorded as a warning event
    of type `FailedGetFleet`, as it allows a human readable message, as
    well as give us data on how often this occurs, and it is readable by
    stat/alerting packages.
- FleetAutoscalers can now have their Fleet target edited. This
  means that FleetAutoscalers can be transitioned from one Fleet
  to another. This may be useful in scenarios like red-green deployments, or
  if you wish to temporarily disable an Autoscaler, but not delete it.
- `FleetAutoscaler.Status.ScalingLimited` has an added
  Event, as it allows a human readable message, as well as give us data on how
  often this occurs, and it is readable by stat/alerting packages.

Closes #406
@markmandel markmandel added this to the 0.6.0 milestone Nov 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug These are bugs. kind/design Proposal discussing new features / fixes and how they should be implemented
Projects
None yet
Development

No branches or pull requests

4 participants