Running windows game server #54

Open
cyriltovena opened this issue Jan 9, 2018 · 107 comments
Labels
awaiting-maintainer (Block issues from being stale/obsolete/closed), kind/design (Proposal discussing new features / fixes and how they should be implemented), kind/feature (New features for Agones)

Comments

@cyriltovena
Collaborator

cyriltovena commented Jan 9, 2018

Game production usually happens on Windows first, with a port to Linux usually coming at the end of development. Having Windows support would help the adoption rate.

What does it take to run a Windows game server?

  • We should find a way to detect that the game server resource is Windows. Maybe a resource parameter?
  • Does Windows support the sidecar concept? Apparently the linked page says, near the bottom, that Windows supports only one container per pod, but it should work on Windows Server 1709. We should try it.
  • If it does, create a sidecar image based on Windows Nano Server.
  • Add a Windows game server example.

Testing will be difficult, as Windows support has been in beta since Kubernetes 1.5, but it apparently improved greatly in 1.9.

Documentation:
https://kubernetes.io/docs/getting-started-guides/windows/
http://blog.kubernetes.io/2017/09/windows-networking-at-parity-with-linux.html
https://github.com/kubernetes/community/tree/master/sig-windows
https://docs.microsoft.com/en-us/windows-server/get-started/whats-new-in-windows-server-1709

@markmandel
Member

  • Short term, to determine if it's Windows, there is currently a requirement to add "beta.kubernetes.io/os": "windows" to the nodeSelector on the Pod template (see the sketch at the end of this comment). We could wrap that in some nicer syntactic sugar, but it's a start.
  • In the documentation you linked, I don't see any reference to multiple containers per pod still being a restriction - but we could ask in sig-windows to confirm.
    • Assuming multiple containers work, the sidecar is already cross-compiled into a Windows binary (for local development) - so it would need to be turned into a Windows Container and hosted somewhere, however that works.

It looks like 1.9 support for Windows is much better than previous versions - I would suggest only trying it on a 1.9 install.
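
For reference, a minimal sketch of what that could look like on the GameServer's Pod template (beta.kubernetes.io/os was the node label at the time; it has since been replaced by kubernetes.io/os; the container name and image are placeholders):

template:
  spec:
    nodeSelector:
      "beta.kubernetes.io/os": "windows"
    containers:
    - name: my-game-server            # placeholder name
      image: my-windows-server-image  # placeholder image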

@cyriltovena
Collaborator Author

http://blog.kubernetes.io/2017/09/windows-networking-at-parity-with-linux.html - this was shared by @alexandrem, and if you look at the bottom table you'll see that only 1709 supports multiple containers in a pod.

@markmandel
Member

Sorry, I wasn't clear - my thought was that the blog post is pre-1.9, so it may be out of date.

Looking at the Known Limitations section in the 1.9 docs, this is not listed as an issue - where I figure it should be. Hence I figured it makes sense to check, either because it's a bad documentation bug (and we could file a PR), or because work has been done to resolve the issue on Windows Server 2016.

@cyriltovena
Collaborator Author

You're right and we need to test, but from my understanding it's independent of Kubernetes development and more an issue of Windows Server networking, which has been fixed in the 1709 version - so as I see it, unless you use an updated version it won't work.

What will really be a problem is testing; so far I don't see any other way than having a real cluster, because minikube supports only Linux containers and the Docker for Windows team is apparently going down the minikube road.

@markmandel
Member

Before testing - it would be faster to just drop an email to https://groups.google.com/forum/#!forum/kubernetes-sig-windows 😄 and see if it's a documentation bug or not.

The other part I'm curious about is the state of hostPort support.

@markmandel
Member

Some more content released today:
http://blog.kubernetes.io/2018/01/kubernetes-v19-beta-windows-support.html

@markmandel added the kind/feature and kind/design labels Jan 9, 2018
@cyriltovena
Collaborator Author

https://blog.docker.com/2017/11/docker-for-windows-17-11/

Because there’s only one Docker daemon, and because that daemon now runs on Windows, it will soon be possible to run Windows and Linux Docker containers side-by-side, in the same networking namespace. This will unlock a lot of exciting development and production scenarios for Docker users on Windows.

Possibly no need to change anything in the code, as the sidecar would run on Linux and the game server on Windows - this would also be interesting to explore.

@markmandel
Member

@cyriltovena
Collaborator Author

Kind issue to look at: kubernetes-sigs/kind#410

@markmandel
Member

Big news!!!

https://cloud.google.com/kubernetes-engine/docs/release-notes?hl=en

The ability to create clusters with node pools running Microsoft Windows Server is now in Beta. This feature is currently only available in the Rapid release channel.

@EricFortin
Collaborator

EricFortin commented Jun 2, 2020

I have taken a stab at this in the past few days. I somewhat succeeded and thought I would recap my work in this issue so that we may be able to make it happen.

Disclaimer: I was mainly interested in seeing if this could even work. I hacked a few things and worked around others.

Building sidecar and simple udp images for Windows

SDK

Here's an example of my final Dockerfile:

# Go runtime methods link to some Windows DLLs that are not in nanoserver. See https://github.com/StefanScherer/dockerfiles-windows/pull/197 for the workaround implemented here.
# Using multi stage build to copy it and still keep a relatively small image.
FROM mcr.microsoft.com/windows/servercore:1909 as core
WORKDIR /app

FROM mcr.microsoft.com/windows/nanoserver:1909

COPY --from=core /windows/system32/netapi32.dll /windows/system32/netapi32.dll

WORKDIR /app
COPY ./sdk-server.windows.amd64.exe /app/sdk-server.exe
ENTRYPOINT ["/app/sdk-server.exe"]

Windows container images need to be built on the same OS version that they are targeting. When trying to build on my machine (1903) with FROM windows/nanoserver:1909, I got an error, so I ultimately built my images on a VM.
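
For reference, the build itself is just a regular docker build run on a host whose Windows build matches the base image - something like this (the Dockerfile name and image tag are placeholders; the tag matches what I use further down):

# Run on a Windows Server 1909 host/VM so the nanoserver:1909 base image matches the host build.
docker build -t agones-sdk-win:1909 -f Dockerfile.windows .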

simple-udp

I used simple-udp to test, as it is simple and can be cross-compiled. I used this command from an Agones clone:

docker run --rm -e "GOOS=windows" -e "GOARCH=amd64" -e "GO111MODULE=on" -w /go/src/agones.dev/agones -v /f/go/src/agones.dev/agones/build//.config/gcloud:/root/.config/gcloud -v ~/.kube/:/root/.kube -v ~/.helm:/root/.helm -v /f/go/src/agones.dev/agones:/go/src/agones.dev/agones -v /f/go/src/agones.dev/agones/build//.gomod:/go/pkg/mod -v /f/go/src/agones.dev/agones/build//.gocache:/root/.cache/go-build agones-build:c16b1f68c7 go build -mod=vendor \
        -o /go/src/agones.dev/agones/examples/simple-udp/simple-udp.windows.amd64.exe  -ldflags "-X agones.dev/agones/pkg.Version=1.6.0-7988111" agones.dev/agones/examples/simple-udp
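
Most of that command is just the Agones build-container plumbing; the essential part is the Go cross-compile, which (assuming a local Go toolchain) boils down to roughly:

# From the root of an Agones checkout: cross-compile simple-udp for Windows/amd64.
GOOS=windows GOARCH=amd64 GO111MODULE=on go build -mod=vendor \
    -o examples/simple-udp/simple-udp.windows.amd64.exe agones.dev/agones/examples/simple-udp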

I then wrote a Dockerfile similar to the previous one to build the image.

Cluster creation

I followed GKE's documentation to create a cluster with a Windows node to host Agones. It seems we should be able to run on 1.15, but when I tried that version I couldn't add the Windows node pool, so I reverted to 1.16, which at the time of writing gave me 1.16.8-gke15. I used these commands to create the cluster:

gcloud container clusters create special-cluster \
    --enable-ip-alias \
    --num-nodes=1 \
    --machine-type=n1-standard-4 \
    --cluster-version=1.16

gcloud container node-pools create win-node-pool \
    --cluster=special-cluster \
    --image-type=WINDOWS_SAC \
    --no-enable-autoupgrade \
    --metadata disable-legacy-endpoints=true \
    --machine-type=n1-standard-4 \
    --num-nodes=1

gcloud container node-pools create agones-system \
  --cluster=special-cluster \
  --no-enable-autoupgrade \
  --metadata disable-legacy-endpoints=true \
  --node-taints agones.dev/agones-system=true:NoExecute \
  --node-labels agones.dev/agones-system=true \
  --num-nodes=1 \
  --machine-type=n1-standard-4

Agones installation

We need to replace the SDK image that the default install uses, since it is built as a Linux-based container. Helm has settings to override it. Unfortunately, Agones uses a single repo URL and only the names and tags can be replaced. We can then either host all images in an external repo and replace the image repo setting, or we can cheat a little (which I did, but it'd be better not to).

Here's how I installed Agones: helm install --set agones.image.sdk.name=agones-sdk-win --set agones.image.sdk.tag=1909 --name my-release --namespace agones-system agones/agones

This means the controller will inject the sidecar with the image gcr.io/agones-images/agones-sdk-win:1909. On my side, since I built the image on the actual cluster's Windows node, I simply had to tag it like that. It worked in my favor since Kubernetes' default PullPolicy is IfNotPresent. This isn't great (it only works with a single-node node pool), and next time I will go with hosting them all in our repo.
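
If the images were hosted in an external registry instead, the same install would look roughly like this (assuming the chart's agones.image.registry value and a made-up gcr.io/my-project registry):

helm install --set agones.image.registry=gcr.io/my-project \
    --set agones.image.sdk.name=agones-sdk-win --set agones.image.sdk.tag=1909 \
    --name my-release --namespace agones-system agones/agones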

Deploying game servers

I used Agones' example and added a node selector to target Windows machines. Here's the content of my fleet.yaml:

apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-udp
spec:
  replicas: 2
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          nodeSelector:
            kubernetes.io/os: windows
          containers:
          - name: simple-udp
            image: <redacted>
            resources:
              requests:
                memory: "64Mi"
                cpu: "100m"
              limits:
                memory: "64Mi"
                cpu: "200m"

After that, I had 2 game servers running in my cluster. Allocation works too.

$ kubectl get gs
NAME                     STATE   ADDRESS         PORT   NODE                                              AGE
simple-udp-wk2rq-9r6b7   Ready   <redacted>      7081   gke-special-cluster-win-node-pool-568b60bb-dh14   13h
simple-udp-wk2rq-vqjvq   Ready   <redacted>      7145   gke-special-cluster-win-node-pool-568b60bb-dh14   21h

End of good news

I was not able to contact my game server, so I did some investigation to diagnose it. The firewall rules are set up correctly: when running the SDK in local mode and simple-udp directly on the Windows VM, I was able to contact the server.

To run in GKE with Windows nodes, we have to enable IP aliases, which means each pod has its own IP. Kubernetes then needs to set up routing when using host ports. So I tested from within the cluster whether I was able to reach the game servers. Here are some results:

  • From VM running pod, hitting externalIP:hostPort doesn't work (might be due to limitation of K8s Windows integration)
  • From VM running pod, hitting localhost:hostPort doesn't work (might be due to limitation of K8s Windows integration)
  • From VM running pod, hitting host internalIP:hostPort doesn't work
  • From VM running pod, hitting podIP:containerPort works
  • From pod in another VM(linux), hitting externalIP:hostPort doesn't work
  • From pod in another VM(linux), hitting internalIP:hostPort doesn't work
  • From pod in another VM(linux), hitting podIP:containerPort works

That means game server hosting in a Windows container is working, but there are issues with the networking setup. After snooping around on the Windows VM, I found out that GKE uses the L2bridge network driver type, and specifically the win-bridge CNI plugin. It looks like this plugin supports hostPort mapping, where it asks the Host Network Service (HNS) to configure some routing rules. At this point my Windows networking knowledge is coming up a bit short, and I can't really debug the CNI integration in GKE.

I tested a standard cluster (only Container-Optimized OS VMs) and it worked, so this points to a shortcoming in the Windows CNI implementation.

To be continued ...

@cyriltovena
Collaborator Author

cyriltovena commented Jun 2, 2020

Here's how I installed Agones: helm install --set agones.image.sdk.name=agones-sdk-win --set agones.image.sdk.tag=1909 --name my-release --namespace agones-system agones/agones

It would be great to be able to specify the SDK image via an annotation; this way you don't need to reinstall Agones and you can run different SDK versions.
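
For illustration, something like the sketch below is what I have in mind - the annotation key is made up and is not an existing Agones feature:

apiVersion: "agones.dev/v1"
kind: GameServer
metadata:
  generateName: simple-udp-win-
  annotations:
    # Hypothetical annotation, not currently supported by Agones.
    agones.dev/sdk-image: gcr.io/agones-images/agones-sdk-win:1909
spec:
  ports:
  - name: default
    containerPort: 7654
  template:
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: simple-udp
        image: <your-windows-game-server-image>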

Nice experiment.

@roberthbailey
Member

roberthbailey commented Jun 3, 2020

We need to replace the SDK image that the default install uses, since it is built as a Linux-based container. Helm has settings to override it. Unfortunately, Agones uses a single repo URL and only the names and tags can be replaced. We can then either host all images in an external repo and replace the image repo setting, or we can cheat a little (which I did, but it'd be better not to).

On GKE (I think this is also true for other providers) you have to have at least one node pool running Linux VMs for system containers, even if you are running your workloads on Windows. So you could just run the Linux Agones controllers next to the other Linux Kubernetes controllers.

edit: I just re-read this and realized you are talking about the sidecar.....

For that, I think we need to add some flexibility to Agones so that it can launch both Windows and Linux sidecars based on some sort of tag on the GameServer / Fleet.

@roberthbailey
Member

I had been wondering about hostPort support. I read through the GKE and Kubernetes docs and didn't see a clear answer as to whether (or how well) they would work. Maybe someone on the GKE windows or GKE networking team can help. We will reach out internally.

@WeetA34
Contributor

WeetA34 commented Jun 5, 2020

Hello,
I've set up a local cluster on my Windows workstation with 3 Hyper-V VMs.
I deployed Kubernetes with Rancher 2.4.4.
The first try was with Flannel VXLAN, but I hit a bug which prevents Windows containers from reaching the kube API (10.43.0.1). It's a known issue which should have been fixed in Kubernetes 1.18.1.
So I rebuilt the cluster with Flannel l2bridge.
I'm stuck at the same point as Eric. The game server is running and is healthy, but I'm not able to connect to the hostPort from another machine on the same Hyper-V network.
No issue with the Linux simple-tcp.
I've checked the firewall on the Windows Server Core 1903 node. It seems OK.
I'm still investigating.
Regards
Stéphane

@WeetA34
Contributor

WeetA34 commented Jun 5, 2020

I ran almost all the commands in the https://github.com/microsoft/SDN/tree/master/Kubernetes/windows/debug/ scripts.
I didn't find the expected port.

@WeetA34
Contributor

WeetA34 commented Jun 5, 2020

Same issue without Agones: I'm unable to connect to an IIS container launched with the following command:
kubectl run iis --namespace default --restart='Never' --port 80 --hostport 8000 --image=mcr.microsoft.com/windows/servercore/iis:windowsservercore-1903
No issue with a simple docker run:
docker run -d --rm --name iis -p 8000:80 mcr.microsoft.com/windows/servercore/iis:windowsservercore-1903

@markmandel
Member

Started a thread as well in #sig-windows on K8s slack:
https://kubernetes.slack.com/archives/C0SJ4AFB7/p1591381173181100

@markmandel
Member

markmandel commented Jun 5, 2020

Based on the conversation in the #sig-windows thread, hostPort is not currently supported.

I've filed/updated two tickets (see links above), and been told to check back in a week to see if they can find some resources to work on hostPort support:

daschott  14 minutes ago
Yes, that would be a good issue to upvote, and include the information. I would considering raising a Windows issue on CNI plugins as well. Let me try to see internally if anyone on our team has cycles to pick this up. Please follow up again in a week if you haven't heard back by then.

@markmandel
Member

Possible good news!
https://github.com/containernetworking/plugins/releases/tag/v0.8.6 supports portMappings! So it may be a case of an updated CNI, and ensuring:

    "capabilities": {
        "dns": true,
        "portMappings":  true
    }

has been added to the CNI config.
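
For context, that capabilities block sits at the top level of the Windows CNI conf file - a rough sketch of an l2bridge-style config, with illustrative values rather than a working GKE configuration:

{
    "cniVersion": "0.2.0",
    "name": "l2bridge",
    "type": "win-bridge",
    "capabilities": {
        "dns": true,
        "portMappings": true
    },
    "ipam": {
        "subnet": "10.244.2.0/24"
    }
}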

@markmandel
Member

Looks like we do have it enabled on GCE:
https://github.com/kubernetes/kubernetes/blob/release-1.15/cluster/gce/windows/k8s-node-setup.psm1#L807

But the CNI version there is not yet up to 0.8.6 (which came out 23 days ago, so no surprise).

@WeetA34
Contributor

WeetA34 commented Jun 6, 2020

Hello Mark,
I already created a flannel image with 0.8.6, but I was not aware of portMappings.
I can now access IIS when running the IIS Windows container with the following command:
kubectl run iis --namespace default --restart='Never' --port 80 --hostport 8000 --image=mcr.microsoft.com/windows/servercore/iis:windowsservercore-1903
Now I'm facing an issue creating the gs resource, due to Kubernetes 1.18.x server-side apply.

kubectl create -f gs-simple-tcp-win.yaml

The GameServer "simple-tcp-win-69qbp" is invalid: metadata.managedFields.fieldsType: Invalid value: "": must be `FieldsV1`

Thank you

@WeetA34
Contributor

WeetA34 commented Jun 6, 2020

Hello again,
I just rebuilt the entire cluster with Kubernetes 1.17.6.
It works fine now :)
flannel type: host-gw (l2bridge)
Windows node & image build: Windows Server 1903

Thank you again Mark for pointing to the new CNI plugin portMappings attribute.
It's documented here: https://docs.microsoft.com/en-us/virtualization/windowscontainers/kubernetes/common-problems#hostport-publishing-is-not-working

$ kubectl get nodes -o wide
NAME           STATUS   ROLES    AGE   VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION       CONTAINER-RUNTIME
kub2lin01      Ready    <none>   34m   v1.17.6   172.31.0.59     <none>        Ubuntu 18.04.4 LTS        4.15.0-101-generic   docker://19.3.11
kub2master01   Ready    master   38m   v1.17.6   172.31.10.248   <none>        Ubuntu 18.04.4 LTS        4.15.0-101-generic   docker://19.3.11
kub2win01      Ready    <none>   12m   v1.17.6   10.244.2.2      <none>        Windows Server Standard   10.0.18362.836       docker://19.3.5

$ kubectl get gameservers
NAME                   STATE   ADDRESS      PORT   NODE        AGE
simple-tcp-win-mpgsc   Ready   10.244.2.2   7612   kub2win01   3m56s

$ kubectl get pods -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP           NODE        NOMINATED NODE   READINESS GATES
simple-tcp-win-mpgsc   2/2     Running   5          4m24s   10.244.2.5   kub2win01   <none>           <none>

Note: The Windows node is registered with its flannel IP instead of its primary IP.
Kubernetes Windows Node:

C:\>hostname
KUB2WIN01

C:\>ipconfig

Windows IP Configuration


Ethernet adapter vEthernet (Ethernet) 2:

   Connection-specific DNS Suffix  . : mshome.net
   Link-local IPv6 Address . . . . . : fe80::8c2a:c459:b65b:5fa7%14
   IPv4 Address. . . . . . . . . . . : 172.31.10.83
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . : 172.31.0.1

Ethernet adapter vEthernet (cbr0_ep):

   Connection-specific DNS Suffix  . : mshome.net
   Link-local IPv6 Address . . . . . : fe80::9848:7de1:9f17:a440%15
   IPv4 Address. . . . . . . . . . . : 10.244.2.2
   Subnet Mask . . . . . . . . . . . : 255.255.255.0
   Default Gateway . . . . . . . . . : 10.244.2.1

Ethernet adapter vEthernet (nat):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::181c:d022:213c:c96%9
   IPv4 Address. . . . . . . . . . . : 192.168.160.1
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :

Ethernet adapter vEthernet (ae554c2617bbdc9):

   Connection-specific DNS Suffix  . :
   Link-local IPv6 Address . . . . . : fe80::d008:e1bc:7902:e1b7%16
   IPv4 Address. . . . . . . . . . . : 172.21.192.1
   Subnet Mask . . . . . . . . . . . : 255.255.240.0
   Default Gateway . . . . . . . . . :

Test from an Alpine VM on the same subnet:

$  hostname
alpine

$ ip addr show eth0 | grep 'inet '
    inet 172.31.8.45/20 scope global eth0

$ nc 172.31.10.83 7612
hello toto
ACK: hello toto

I built the following images on my Windows node:

REPOSITORY                             TAG                      IMAGE ID            CREATED             SIZE
sigwindowstools/flannel                0.12.0-1903              af8d76c4c777        33 minutes ago      5.21GB
golang                                 windowsservercore-1903   25eabeca5478        35 minutes ago      5.54GB
sigwindowstools/kube-proxy             1.17.6-1903              874e362d6bd9        40 minutes ago      5.19GB
mcr.microsoft.com/k8s/core/pause       1.2.0-1903               995bb8fbeb6d        4 hours ago         261MB
simple-tcp-win                         0.0.2                    b46163501dbd        44 hours ago        269MB
gcr.io/agones-images/agones-sdk        1.6.0                    5c95ad75fdc2        2 days ago          296MB

@TBBle
Contributor

TBBle commented Jan 7, 2021

If the Agones SDK sidecar container image is pushed to different tags for different platforms/architectures, then a Manifest List can be pushed that references them, and the container runtime will pull the right image for its platform.

Then agones.image.sdk.tag can reference that manifest list by default, and is platform-agnostic.
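
As a sketch of that flow with the plain docker CLI (the image names and tags are illustrative, mirroring the existing per-platform tags):

# Create a manifest list pointing at the already-pushed per-platform images.
docker manifest create gcr.io/agones-images/agones-sdk:1.6.0 \
    gcr.io/agones-images/agones-sdk:1.6.0-linux_amd64 \
    gcr.io/agones-images/agones-sdk:1.6.0-windows_amd64-ltsc2019

# Record the Windows build number so kubelets only pull it onto matching nodes.
docker manifest annotate --os windows --arch amd64 --os-version 10.0.17763.1577 \
    gcr.io/agones-images/agones-sdk:1.6.0 \
    gcr.io/agones-images/agones-sdk:1.6.0-windows_amd64-ltsc2019

docker manifest push gcr.io/agones-images/agones-sdk:1.6.0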

@jeremyje
Contributor

jeremyje commented Jan 7, 2021

#1894 does the manifest magic already; it just needs to be turned on in CI. At the time I cautioned against enabling it by default because it significantly increases the build time and I didn't want to complicate an Agones release. Since there has been a release cut after the PR went in, it's appropriate to turn on WITH_WINDOWS by default.

Beyond that, like @josephbmanley said, pretty much all you need is some changes to Terraform to provision a Windows node pool via a conditional, and the Helm templates need to be updated to have a parameter that basically inserts the nodeSelector for Windows.

/cc @markmandel

There's one big caveat: only Windows LTSC 2019 is supported (the default for Kubernetes). Using it on 2004 or 20H2 will break in weird ways (and that's likely going to be the version your dev box is on, unless you run Windows Server). Agones will need to update the CI image to install Docker Client 20.10, which comes with the os.version fix to docker manifest amend. This can be done today since the features accessed are client-side, and I've confirmed in a separate project that it works. It's possible to get this behavior by sed/grepping the manifest files locally, but it's awful and, to me, more risky than just upgrading the client at this point.

@markmandel
Member

@josephbmanley

Thanks for digging into this work!

Setup default node selectors in helm charts to deploy on Linux

I think we have this already? See this documentation. Will that suffice?

Either add windows option to helm or add to docs to set agones.image.sdk.tag to the windows tag

I don't think this is necessary (I'm fairly sure this is covered above), as with the way the manifest/registry operates, it can select the appropriate OS image as needed without having to be specific about it.

One thing to also add to the list - make sure the Windows images are built and pushed as part of the release:
https://github.com/googleforgames/agones/blob/master/build/includes/release.mk#L74-L80

We should probably also add a section for documentation as well - likely a "Windows containers" page would be useful - maybe under Installation > Create Cluster, or maybe within each provider? Maybe also a "Guide" of some kind?

@jeremyje

At the time I cautioned against enabling it by default because it significantly increases the build time

How significantly are we talking here? minutes, hours, days? 😄

Agones will need to update the CI image to install Docker Client 20.10

We use whatever version comes with Cloud Build, also whatever we have on our work machines - sounds like we'll need to wait on supporting 2004 and 20H2 until the newer Docker version propagates out to those systems (if they haven't already).

Does that all make sense?

@TBBle
Contributor

TBBle commented Jan 12, 2021

Setup default node selectors in helm charts to deploy on Linux

I think we have this already? See this documentation. Will that suffice?

I'm not sure what your link is referring to (the agones.dev/agones-system taints, I guess?), but I believe @josephbmanley meant hardcoding something like

kubernetes.io/os: linux
kubernetes.io/arch: amd64

in the nodeSelector for the controller, to match the generated container images' platform(s), so they are not scheduled on Windows nodes (or ARM64 Linux nodes).

Users can also do this themselves using Helm values, but if the container os/arch is known statically, it seems nicer to make it part of the chart than part of the config, where it might be forgotten or overwritten.
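
A sketch of what that could look like hardcoded into the controller Deployment's pod spec in the chart (not the actual Agones template):

    spec:
      nodeSelector:
        kubernetes.io/os: linux
        kubernetes.io/arch: amd64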

Another option, the "legacy" option from the Kubernetes docs, is to taint all your Windows nodes and include tolerations on the Fleet's PodTemplates.

@markmandel
Member

Yeah exactly, the section that reads:

gcloud container node-pools create agones-system \
  --cluster=[CLUSTER_NAME] \
  --no-enable-autoupgrade \
  --node-taints agones.dev/agones-system=true:NoExecute \ # < this bit
  --node-labels agones.dev/agones-system=true \ # < and this bit
  --num-nodes=1

But I didn't realise you could do:

kubernetes.io/os: linux
kubernetes.io/arch: amd64

Which makes total sense - and should be backward compatible 👍

Do we need to do something similar for the GameServer Windows images? I assume we do?

@TBBle
Contributor

TBBle commented Jan 12, 2021

Ideally, nodeSelectors should be used in all Deployments to match the os and architectures of the manifest (or manifest list, where used). Since Windows containers are tied to the specific version of the kernel needed, that's also something that should be in the manifest lists based on the targets being built.

Sadly, k8s has been practically mono-OS and mono-arch for long enough that most people and Helm charts aren't including appropriate nodeSelectors in their rollouts. Both Windows and ARM64 are starting to be serious k8s things, and there are other surprise platforms around that someone must care about. I doubt I'll ever see a GameServer on an s390x Kubernetes node.

For illustration, once WITH_WINDOWS is turned on, the current game server sidecar manifest list implies that the GameServer-owned Pod would need something like this to prevent being scheduled on a node that couldn't run the sdk-server container:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Option 1: Linux && AMD64
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values: [ linux ]
          - key: kubernetes.io/arch
            operator: In
            values: [ amd64 ]
        # Option 2: Windows && AMD64 && [ ltsc2019 ]
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values: [ windows ]
          - key: kubernetes.io/arch
            operator: In
            values: [ amd64 ]
          - key: node.kubernetes.io/windows-build
            operator: In
            values: [ "10.0.17763" ]

I expect that the above would never actually be used as-is. The main container in the GameServer's Pod is likely to have more specific architecture support (I'd be surprised if any Agones user is deploying manifest lists for their own game server container image, since differences in performance profile seem very likely), and also more specific user-custom nodeSelector or affinity terms, to select the node group with, e.g., appropriately sized nodes and publicly routable IP addresses so hostPort works.

I think in the end it'd be easier on everyone to simply build a wider variety of sdk-server image arch/os/version targets, so that GameServers are always constrained by the user, not the sdk-server container availability. ^_^

@yeslayla
Contributor

@markmandel

Either add windows option to helm or add to docs to set agones.image.sdk.tag to the windows tag

I don't think this is necessary (I'm fairly sure this is covered above), as with the way the manifest/registry operates, it can select the appropriate OS image as needed without having to be specific about it.

One thing to also add to the list - make sure the windows images are built and pushed as part of the release:
https://github.com/googleforgames/agones/blob/master/build/includes/release.mk#L74-L80

We should probably also add a section for documentation as well - likely a "Windows containers" page would be useful - maybe under Installation > Create Cluster, or maybe within each provider? Maybe also a "Guide" of some kind?

@jeremyje

If it's true that it already selects the proper image based on the manifest, then the default tags should be documented, as I had to specify the tag manually when using my own registry.

Agones will need to update the CI image to install Docker Client 20.10

We use whatever version comes with Cloud Build, also whatever we have on our work machines - sounds like we'll need to wait on supporting 2004 and 20H2 until the newer Docker version propagates out to those systems (if they haven't already).

Azure only supports the current version as well. It seems all cloud providers are a bit behind on this.

@markmandel
Member

I think in the end it'd be easier on everyone to simply build a wider variety of sdk-server image arch/os/version targets, so that GameServers are always constrained by the user, not the sdk-server container availability. ^_^

Sounds like what I think we're saying here is: put OS-specific constraints on the Linux bits, and leave how to restrict game servers to their appropriate OS (in this case Windows) as an exercise for the user - basically so they can just do what they want (although this sounds like something we should definitely document), and we don't get in their way.

Makes sense to me if that's the general consensus.

This multi-arch stuff make my head hurt 🤕

@jeremyje
Contributor

jeremyje commented Jan 13, 2021

The way make WITH_WINDOWS=1 build-images is set up, it creates multi-arch images spanning Linux and Windows, meaning at the end of the day none of the references change. There's definitely complexity in managing the different manifests, but it's 80% implemented already. The remaining (hard) 20% is managing different versions of Windows, which is much lower priority. Generally speaking, ltsc2019 is good enough, and when ltsc2021 comes out will be the time to work on the multi-arch support.

Docker Client 20.10 is compatible with Docker Server 19.03. I've verified I can build like this in a different project.

How significantly are we talking here? minutes, hours, days? 😄

+10-30 minutes depending on the machine type and HDD vs SSD. If you parallelize, you'll need to create a separate buildx context for each thread.

If there's only a sidecar, I don't think there are any specific changes necessary. It'll run on Windows and Linux because the manifest makes that work. The pod container spec simply needs the nodeSelector to tell it to run on Windows, and it should mostly just work. If there are containers that need to run on the Windows host outside of the sidecar, then you'd need to basically add tolerations for Windows, but I strongly suggest using the manifest-based tagging for cross-platform workloads instead of using YAML to manage this.

Lastly, Windows requires at least 1 Linux node, https://kubernetes.io/docs/setup/production-environment/windows/intro-windows-in-kubernetes/#windows-containers-in-kubernetes.

@TBBle @josephbmanley What issues are you seeing on Windows? Beyond opting into running a Windows container on Windows, you shouldn't need to use node selectors. Windows nodes have a taint applied to them, so default pod specs shouldn't schedule on those nodes. Has anyone tried to deploy the Xonotic example running on Windows in a cluster? The Xonotic example requires a Windows Server 2019 machine to build. It's possible to remove this dependency, but it's intentional to have it like so, since that's the more likely path Windows customers would use.

@TBBle
Contributor

TBBle commented Jan 14, 2021

The issues I'm referring to (pods being assigned to Windows nodes that don't have Windows container images available) are prevented by the example taint in the docs, but I consider tainting Windows nodes to be a short-term workaround until all the things running on the cluster have appropriate nodeSelectors to ensure their container image manifests match the node arch/os/build version.

Otherwise we'll go through this all again as ARM CPUs become more popular in k8s deployments.

I do kind of wish k8s supported looking at the containers in a Pod and extracting their OS requirements from the manifests/manifest lists, but I can't see how that would ever make sense given the k8s system architecture.

@markmandel
Member

Quick thought I had on e2e testing:

Figured we could add some Windows nodes to our e2e cluster, and then adjust some of our e2e tests to run on both Linux and Windows, by running the same test with the Windows node selector in the Pod template for the Windows variant.

@dzmitry-lahoda
Contributor

One last request - publish simple-game-server-windows to gcr :)

@roberthbailey
Member

If you pull gcr.io/agones-images/simple-game-server:0.3 it will pull the Windows version if you are on a Windows node. That image is a manifest that contains references to the Linux and Windows images to pull. The Windows image (if you wanted to pull it directly) is gcr.io/agones-images/simple-game-server:0.3-windows_amd64-ltsc2019.
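
You can confirm which platforms the manifest list carries with, for example:

docker manifest inspect gcr.io/agones-images/simple-game-server:0.3

which should list one entry per platform (linux/amd64, and windows/amd64 with its os.version).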

@dzmitry-lahoda
Contributor

dzmitry-lahoda commented Jul 4, 2021

@roberthbailey I see that this works for simple-gs, but not for Xonotic. I see Xonotic has Windows support, but pulling it directly or with the windows suffix fails on a Windows pod. I want Xonotic as a test tool for our game server programmers and me, so we know whether issues are in our game server or in the Agones setup I did.

@TBBle
Contributor

TBBle commented Jul 5, 2021

Looking at https://console.cloud.google.com/gcr/images/agones-images/GLOBAL/xonotic-example?gcrImageListsize=30, it doesn't look like a new Xonotic manifest has been pushed since early 2020, and the Windows support was added in December 2020 in #1894.

So I don't think a Windows build of Xonotic has been pushed to the Agones repo, you probably have to build it yourself.

The xonotic Makefile used by the cloudbuild scripts doesn't build a Windows image, so even if it had been built since Windows support was added, it wouldn't have been pushed. You can contrast this with simple-game-server's Makefile to see how much is missing: both building the Windows images, and pushing the manifest list that lets you use one repository name from multiple platforms.

#1894 (comment) suggests that the xonotic Windows image isn't cross-buildable from Linux, which might have been (or still be) a blocker for Agones's build pipeline. I'm not sure why else it didn't get the Windows support added to its Makefile when it was added to the Dockerfile.

So either an oversight, or a difficult problem to solve, I guess.

@TBBle
Contributor

TBBle commented Jul 5, 2021

Since I was looking at simple-game-server's Makefile: this TODO (and a couple of others in this file) are now completable, as Docker 20.10 has been available on Cloud Build since May 2021. I guess those TODOs might have been replicated in other Makefiles, but I haven't looked.

@roberthbailey
Member

@jeremyje - do you have any cycles to look into making our xonotic example run on windows servers?

@jeremyje
Contributor

jeremyje commented Jul 7, 2021

@TBBle Yes, GCB has supported Docker 20.10 for a while. I'd strongly recommend upgrading to it because 19.03 is going end-of-life later this month: https://endoflife.software/applications/virtualization/docker-daemon.

@roberthbailey I unfortunately do not have cycles to work on this. I looked at it briefly; here's the state. The Dockerfile.windows will only run on a Windows host because it's executing RUN commands. It can be adjusted to run on Linux, but the Makefile will need to coordinate downloading the Xonotic zip file and extracting the files locally before copying them into the image via COPY. After that you can execute it as a docker buildx build. There's another issue where the main.go stdout scrubber cannot execute Xonotic while in the container. I'm not sure why, but the first step would be to execute it locally again to see why it's breaking. If that doesn't reveal it, the next step I'd take is to attempt to use the full Windows Docker image, mcr.microsoft.com/windows:1809, to see if it repros there. Xonotic is very sensitive to file structure and paths, which makes it difficult to get running.

@github-actions

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.

github-actions bot added the stale label Apr 15, 2023
@markmandel
Member

@zmerlynn, @gongmax 🤔 I think the only thing left here is to set up some automated testing of a Windows cluster, probably on at least a single supported version of GKE?

Looks like there is some cleanup to do here as well with the build system, and hopefully there isn't much else? But I also say that as someone who isn't familiar with the Windows container ecosystem.

Marking as awaiting-maintainer, since this work is ongoing.

@markmandel added the awaiting-maintainer label and removed the stale label Apr 18, 2023