Upgrade to Kubernetes 1.12 #717

Closed
heartrobotninja opened this issue Apr 17, 2019 · 36 comments · Fixed by #967
Labels: kind/design, kind/feature
Milestone: 0.12.0

@heartrobotninja
Contributor

The version of client-go we vendor is 8.0, which is deprecated per the k8s client-go compatibility matrix. They use semver rules, which means that we may have breaking changes in 9.0+ (but we also might not). We should investigate what it would take to bring us up to current, or as close to current as possible.

@heartrobotninja
Contributor Author

heartrobotninja commented Apr 17, 2019

Another part of this is that if I vendor k8s.io/code-generator at its latest incarnation, I get a panic when running make gen-crd-client: the old client-go version uses glog, while code-generator uses klog (which is what should be used), and both register the same log_dir flag, which results in this:

Generating deepcopy funcs
/go/bin/deepcopy-gen flag redefined: log_dir
panic: /go/bin/deepcopy-gen flag redefined: log_dir

goroutine 1 [running]:
flag.(*FlagSet).Var(0xc00006c180, 0x78cc20, 0x9c8770, 0x703a10, 0x7, 0x73b82b, 0x2f)
        /usr/local/go/src/flag/flag.go:850 +0x4af
flag.(*FlagSet).StringVar(...)
        /usr/local/go/src/flag/flag.go:753
k8s.io/klog.InitFlags(0x0)
        /go/src/k8s.io/klog/klog.go:411 +0xa4
main.main()
        /go/src/k8s.io/code-generator/cmd/deepcopy-gen/main.go:59 +0x3f
Makefile:364: recipe for target 'gen-crd-client' failed
make: *** [gen-crd-client] Error 2
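
The root cause is that glog registers its flags (including log_dir) on the default flag.CommandLine FlagSet in its init(), and klog.InitFlags(nil) then registers the same names on the same FlagSet. A minimal standalone sketch (not Agones or code-generator source, just an illustration) that panics the same way:

package main

import "flag"

func main() {
	// What glog effectively does in its init(): register log_dir
	// on the default FlagSet, flag.CommandLine.
	flag.String("log_dir", "", "If non-empty, write log files in this directory")

	// What klog.InitFlags(nil) then does: register the same name on the
	// same default FlagSet. flag.(*FlagSet).Var panics on the duplicate.
	flag.String("log_dir", "", "If non-empty, write log files in this directory")
}

Run it and you get "flag redefined: log_dir", exactly as in the deepcopy-gen trace above.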

@markmandel
Member

So the unwritten rule I've been following (we should make some kind of written rule) is that we've been updating the client as the next version of K8s gains cloud penetration (which I've taken to mean available on EKS, GKE and AKS).

Usually there are breaking changes between each client-go version, but they tend to be relatively trivial to fix.

@markmandel added the kind/cleanup label Apr 18, 2019
@heartrobotninja
Contributor Author

Can you comment on what backwards compatibility is usually like with the major version increases?

@markmandel
Member

Usually we upgrade because we want the latest features - so backward compatibility has been less of a concern for us.

Moving forward, it's a tough question, as we only have https://github.com/kubernetes/client-go#compatibility-matrix to go by, which is not the most reliable thing in the world.

Realistically, we should run our own tests.

1.12 is meant to have some very significant scheduling improvements, so it would be good to upgrade, but it's a tricky question going forward.

There must be some prior art for this, for other controllers out there?

@heartrobotninja
Contributor Author

It looks like client-go still has no way to enumerate exactly what changed or broke, if I am reading kubernetes/client-go#234 correctly. I guess we need to answer the question: does the e2e test coverage sufficiently exercise the client, or do we need to write tests specifically for the go client?

@markmandel added the kind/design label May 18, 2019
@markmandel
Member

Coming back to this as I was thinking about it more: it looks like 1.13 is GA on GKE, although it has yet to be available on EKS or AKS.

I would say probably the safest bet is to assume that client-go is incompatible with previous versions of Kubernetes (although it usually isn't), and that it's possibly incompatible with future versions (but more likely to work) -- since we really have no guarantee. We have e2e tests, which should be reasonably definitive - but we also don't have a testing suite that runs on every version of Kubernetes (maybe we should?)

My initial thought here is that we have a single required version of Kubernetes that is guaranteed to work with Agones, because that is what we run for our e2e tests and development (and we can assume that the next version up also works, but we can be clear it hasn't been tested).

Now the question presents itself -- what version should that be?

I have two thoughts here:

  1. We want a really well-baked version.
  2. We don't want to fall behind the rapid progress of Kubernetes.

So my initial thought is: we update to n-1 of whatever version has cloud penetration (i.e. is offered on GKE, AKS and EKS). So once 1.13 is offered on all 3 platforms, we can move to a requirement of 1.12.

We know it's a solidly baked version, but it doesn't let us fall too far behind on k8s's progress.

How do people feel about that?

/cc @jkowalski @pooneh-m @ilkercelikyilmaz - I figure you might have some thoughts?
/cc @EricFortin any thoughts on your side?

Figure it's a pretty important topic, so pinging lots of people to weigh in.

@markmandel
Member

markmandel commented May 18, 2019

🤦‍♂️ @roberthbailey -- you likely have a lot of insight here. WDYT?

(the facepalm is because I was silly for not thinking of you earlier)

@roberthbailey
Member

Does our use of client-go use any esoteric k8s api features? I'm trying to read really carefully through the client-go versioning and it seems like for the majority of features it's compatible almost across the board due to the backwards compatibility guarantees of k8s. If we are mostly using well established APIs, then I think we can be less concerned about the specific version of k8s that client-go is aligned with and more concerned with the maintenance status of the client-go version we are vendoring. In particular, even if we go with a guaranteed k8s version that we support, we don't have to tightly tie the client-go version to that k8s version -- we can skew forwards (or backwards) a bit to stay within the client-go maintenance window.

I think the trend over time is that the cloud providers are picking up newer releases of k8s more slowly. So even if that has historically been a useful gauge of when to bump libraries it may not be as good of a bellwether going forward. In addition, there is a delay of 0-6 weeks from when we bump the vendored version and when it actually makes it into a release.

@markmandel
Member

Excellent points @roberthbailey -- I think the most "esoteric" features we use are subresources, which are pretty fully baked at this point in time.

I see this as two layers though:

  1. Which version of client-go do we use?
  2. What version of Kubernetes do we do most of our development against?

Regarding No. 1, it seems like it mostly doesn't matter, as long as we stick with stable API surfaces.

Regarding No. 2, my fear is that if we start developing against 1.12 while people are using 1.11, we don't actually know if things still work in 1.11.

Do we need multiple test clusters of different versions? Maybe we do?

@markmandel
Member

Discussion at the community meeting today concluded that for the near future:

  1. We maintain 1 version of Kubernetes that we develop against and provide support for.
     • Other versions may well work, but aren't actively tested against, so you may run into issues.
  2. We work against n-1 of what is currently provided on AKS/EKS/GKE, and that is our current baseline (so if each host supports 1.13, we run against 1.12).
     • We may revise this in the future to be version n, depending on the pace of cloud providers adopting Kubernetes releases.

If we decide at a later date that we need to support multiple K8s versions, we can expand the operations needed to provide support for this, but it's not something the community is asking for.

If I haven't captured the thoughts from the meeting, or if you disagree - please make a note here 👍

If we don't have any objections by the 30th of May, we can consider this approach locked in, document it, and start moving infrastructure to 1.12.

@markmandel
Member

Looks like EKS is going to start hosting 1.13 this month, which, once live, will mean we can move to 1.12.

markmandel added a commit to markmandel/agones that referenced this issue Jun 11, 2019
This rewrites the Overview page to give a concise summary of the project
and its capabilities.

For this reason, I moved the requirements section into the Installation
section, and updated the docs slightly to fulfill kubernetes version
support outlined and decided in googleforgames#717
markmandel added a commit that referenced this issue Jun 11, 2019
This rewrites the Overview page to give a concise summary of the project
and its capabilities.

For this reason, I moved the requirements section into the Installation
section, and updated the docs slightly to fulfill kubernetes version
support outlined and decided in #717
@markmandel
Member

markmandel commented Jun 27, 2019

List of items to do for upgrading to 1.12

  • Update e2e cluster to run against 1.12
    • Update deployment management scripts (new version; also match the current size of the cluster)
    • Recreate cluster with new scripts
  • Update the dev tooling to create 1.12 clusters
    • Update kubectl
    • Minikube
    • Kind
    • GKE
  • Update documentation for creating clusters to 1.12
    • Minikube
    • GKE (there are some default changes here to consider)
    • EKS
    • AKS
  • Update to client-go 9.0 (based on compatibility matrix)

Anything else I'm missing?

@markmandel added this to the 0.12.0 milestone Jun 27, 2019
@markmandel changed the title from "Investigate upgrading k8s.io/client-go to > 8.0" to "Upgrade to Kubernetes 1.12" Jun 27, 2019
@markmandel
Member

Hope you don't mind, I renamed this to "Upgrade to Kubernetes 1.12", and added it to the milestone, as discussed on the community meeting today.

@heartrobotninja
Contributor Author

Not at all. I think that is a better title at this point and gives me something more actionable to work on in the coming month.

@roberthbailey
Member

I don't see it anywhere in the GKE release notes, but I just got an automated email saying:

1.11 masters will soon be deprecated and automatic upgrade to 1.12 is scheduled for the week of July 8, 2019

which means that we should start working on this ASAP. If we don't switch the default kubernetes version on GKE, CI will soon break because we won't be able to create 1.11 clusters.

@roberthbailey
Member

More reason to do this soon: as per https://groups.google.com/d/msg/kubernetes-announce/AT5yB3FzDv8/gwfhI78RBgAJ 1.12.10 has just been released and it is

the FINAL release for v1.12 and further issues/PRs for the release-1.12 branch will not be accommodated

@markmandel
Member

I'm starting on updating the build tooling to create a 1.12 GKE cluster first - mainly so that I can do some local testing to make sure everything still works on 1.12

@roberthbailey
Member

I can also help with this, once I get a bit further along on #703.

markmandel added a commit to markmandel/agones that referenced this issue Jul 19, 2019
Also regenerated the CRD clients with the updated tooling.

Needed to build a udp-server:0.14

Work for googleforgames#717
markmandel added a commit that referenced this issue Jul 19, 2019
Also regenerated the CRD clients with the updated tooling.

Needed to build a udp-server:0.14

Work for #717
@roberthbailey
Member

@aLekSer - do you think you'd be able to update the AKS instructions to create clusters using Kubernetes 1.12?

@aLekSer
Collaborator

aLekSer commented Jul 23, 2019

Yes, I will upgrade Terraform configurations to a new version of Kubernetes.

@aLekSer
Collaborator

aLekSer commented Jul 23, 2019

Updated AKS terraform config to 1.12.8, the latest 1.12 version supported by AKS, in this PR: #899

@roberthbailey
Member

From what I can tell, all that is missing to close this out is documentation updates to use 1.12 on EKS. Does anyone have an Amazon account they can test with?

@markmandel
Member

May be easier to check in #users on Slack? 🤔

@aLekSer
Collaborator

aLekSer commented Jul 29, 2019

@roberthbailey I was able to create an EKS cluster with Kubernetes GitVersion "v1.12.6-eks-d69f1b", but there is an issue with creating a simple fleet. I use 10 t2.micro instances for now.
I took the installation steps from agones.dev - Installation:

  Normal  SuccessfulDelete  11m (x3 over 11m)   gameserverset-controller  (combined from similar events): Deleted gameserver in state Unhealthy: simple-udp-jwpd2-ww554                                                                                                 
  Normal  SuccessfulCreate  55s (x93 over 12m)  gameserverset-controller  (combined from similar events): Created gameserver: simple-udp-jwpd2-nrmmd   

@roberthbailey
Member

Thanks for looking @aLekSer!

Did you look at why the gameservers aren't becoming healthy?

@aLekSer
Collaborator

aLekSer commented Jul 29, 2019

@roberthbailey gameservers continuously exited, so I need to find out how to turn on some CloudWatch logging to see what happens prior to shutdown. My guess is that there is not enough room for new pods.

@markmandel
Member

I would expect with micro instances there would be barely any room for the k8s resources 😄

@aLekSer
Collaborator

aLekSer commented Jul 30, 2019

I tested with a t3.medium instance. The same errors. I created a GS with the healthcheck disabled and looked at the logs: there is an issue connecting to the sidecar (Ready is not sent), and conversely servers become Unhealthy if run as the simple-udp example fleet.

kubectl logs gds-example example-server
2019/07/30 13:04:02 Creating SDK instance
2019/07/30 13:04:02 Starting Health Ping
2019/07/30 13:04:02 Starting UDP server, listening on port 7654
2019/07/30 13:04:02 Marking this server as ready
2019/07/30 13:04:02 Could not send ready message

 kubectl logs gds-example agones-gameserver-sidecar                                                                                                                                                                                
{"ctlConf":{"Address":"localhost","IsLocal":false,"LocalFile":""},"grpcPort":59357,"httpPort":59358,"message":"Starting sdk sidecar","severity":"info","source":"main","time":"2019-07-30T13:04:01.480709261Z","version":"0.11.0"}                                      
{"gsKey":"default/gds-example","message":"created GameServer sidecar","severity":"info","source":"*sdkserver.SDKServer","time":"2019-07-30T13:04:01.548770273Z"}                                                                                                        
{"message":"Starting SDKServer grpc service...","severity":"info","source":"main","time":"2019-07-30T13:04:01.549160433Z"}
{"message":"Starting SDKServer grpc-gateway...","severity":"info","source":"main","time":"2019-07-30T13:04:01.556033789Z"}
{"gsKey":"default/gds-example","health":{"disabled":true,"periodSeconds":5,"failureThreshold":3,"initialDelaySeconds":5},"message":"setting health configuration","severity":"info","source":"*sdkserver.SDKServer","time":"2019-07-30T13:04:01.649257082Z"}            
{"gsKey":"default/gds-example","message":"Starting SDKServer http health check...","severity":"info","source":"*sdkserver.SDKServer","time":"2019-07-30T13:04:01.649414429Z"}                                                                                           
{"gsKey":"default/gds-example","message":"Starting workers...","queue":"stable.agones.dev.default.gds-example","severity":"info","source":"*sdkserver.SDKServer","time":"2019-07-30T13:04:01.649454734Z","workers":1}                                                   
{"gsKey":"default/gds-example","message":"Sending GameServer Event to connectedStreams","severity":"info","source":"*sdkserver.SDKServer","time":"2019-07-30T13:04:02.421443714Z"} 

@aLekSer
Collaborator

aLekSer commented Jul 30, 2019

Also there is an issue in EKS when using helm, which is fixed by the following command:

helm init --upgrade --service-account tiller

After this I was able to install a recent version with:

helm install --name my-release --namespace agones-system agones --set agones.image.tag=0.12.0-e481e7f

Checking functionality now

@aLekSer
Collaborator

aLekSer commented Jul 30, 2019

@roberthbailey Basic functionality on a 1.12 EKS cluster (namely v1.12.6-eks-d69f1b) is fully operational: creating a GS, fleet, and fleetautoscaler is fine. Tested on m3.medium instances; used helm to install the recent 0.12.0 version.
kubectl get gs

NAME                     STATE       ADDRESS         PORT   NODE                                          AGE                                                              
gds-example              Ready       34.220.203.18   7777   ip-192-168-88-61.us-west-2.compute.internal   4m                                                               
simple-udp-mdlx6-jfk5w   Allocated   34.220.203.18   7204   ip-192-168-88-61.us-west-2.compute.internal   3m                                                               
simple-udp-mdlx6-qdplw   Ready       34.220.203.18   7545   ip-192-168-88-61.us-west-2.compute.internal   17s                                                              
simple-udp-mdlx6-z5hwz   Ready       34.220.203.18   7025   ip-192-168-88-61.us-west-2.compute.internal   3m  

@aLekSer
Collaborator

aLekSer commented Jul 30, 2019

I think we need to have a Terraform module for EKS.
Currently I made a cluster with the following command:

eksctl create cluster --name prod5 --version 1.12 --nodegroup-name standard-workers --node-type t3.medium --nodes 3 --nodes-min 1 --nodes-max 6 --node-ami auto

@roberthbailey
Member

@aLekSer - let's split the terraform for eks into a separate issue (feature request).

I think all that remains here is to update https://github.com/googleforgames/agones/blob/master/site/content/en/docs/Installation/_index.md to change the required version to 1.12, starting with the 0.12.0 release.

@aLekSer
Collaborator

aLekSer commented Jul 30, 2019

Ok, I will add a ticket for Terraform now.

@markmandel
Member

Everything is ticked off - I assume we can close this now?

@markmandel added the kind/feature label and removed the kind/cleanup label Aug 1, 2019