[TESTING][GCP][GKE] Chart on GCP #1

Open
taktakpeops opened this issue May 12, 2020 · 61 comments
@taktakpeops
Owner

Following a chat on the issue jitsi/docker-jitsi-meet#565 for Jitsi Meet in K8S using this chart, I am moving the discussion specifically related to GKE here.

@ChrisTomAlx - could you share your findings here?

I will also provide some support.

@taktakpeops
Owner Author

(architecture diagram: jitsi-meet-helm)

@ChrisTomAlx

Hey @taktakpeops
This looks fine to me, except for the Jibri part: each Jibri pod only supports one recording at a time, and the optimal CPU / RAM requirements for Jibri are much higher than for the other pods (4 vCPU / 8 GB RAM is what I have read on the community forums).
Also, Jibri requires some node-level changes. Refer to this PR.

I was thinking we should maybe continue this discussion on your Jitsi issue, because the Jitsi team is super helpful and they could help if we run into problems.

I would love for them to see the architecture you have created here; it looks impressive.

@taktakpeops
Owner Author

OK

@taktakpeops
Owner Author

One thing to add: this architecture works only if you have multiple nodes in your cluster and each JVB instance has its own node, because JVB binds to the public IP of the node. So some affinity rules are needed when deploying multiple JVBs.
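
For illustration, a minimal sketch of such a rule (the app: jitsi-meet-jvb label and names here are hypothetical; the chart's real labels may differ). Keying the anti-affinity on the node hostname forces at most one JVB pod per node:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: jitsi-meet-jvb
spec:
  replicas: 2
  selector:
    matchLabels:
      app: jitsi-meet-jvb
  template:
    metadata:
      labels:
        app: jitsi-meet-jvb
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: jitsi-meet-jvb
              topologyKey: kubernetes.io/hostname
      containers:
        - name: jvb
          image: jitsi/jvb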

@ChrisTomAlx

Sorry, I have been AWOL for a bit.
To make the node-level changes needed for Jibri to work properly (at least in GKE), I am planning to create a daemonset which will run on all nodes before they start any other pods.

https://stackoverflow.com/questions/57005971/how-to-use-customise-ubuntu-image-for-node-when-gke-cluster-create

Also, I am creating the Jibri YAML file with container post-start and pre-stop scripts to correctly change the /home/jibri/.asoundrc file in each of the containers. I will probably use a hostPath volume to read and write the same file from all the containers on a node, so they all know which sound cards are available and which aren't.

As I mentioned earlier, Jibri in this architecture can be a group like JVB is, correct?

@taktakpeops
Owner Author

Yes, indeed it can. Although I think it's strongly tied to Jicofo + Prosody, so I would put all of it in one pod.

About the multiple JVBs: it uses some pod anti-affinity rules to ensure the pods are spread across nodes. Since each JVB deployment deploys every instance to its own pod, with a port unique to the deployment, it works like having many Jitsi servers.

About Jibri, one issue remains: not all clusters can access the sound device of the VM.

I am starting to explore how to use a dummy sound driver as a loopback for recording sound from a pod without needing the physical device. I will most likely develop a K8S device plugin for that.

Let me know if you have more questions, and feel free to push a PR with your scripts for GKE if you are using the chart :)

@ChrisTomAlx

Sure, I will start looking into this again tomorrow. Will raise a PR if any changes are required.
Regarding Jibri, a dummy sound device would be amazing. Jibri uses ALSA; if there was a way to add that to the container directly, then most of the Jibri problems would be solved, but I am not entirely sure that is possible.

@ChrisTomAlx

ChrisTomAlx commented May 21, 2020

Hey @taktakpeops I noticed that you are using

  - name: DOCKER_HOST_ADDRESS
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP

status.hostIP is only available in Kubernetes version 1.17.0 and above, but it seems like GCP does not have a stable release for 1.17.0 yet. There is a rapid release available, but Google specifically mentions not to use it in production. Does AWS support 1.17.0 already?

@taktakpeops
Owner Author

Hi @ChrisTomAlx,

Looking into the documentation here, it's available since v1.7. Checking here, it seems that GCP currently supports 1.15.9, 1.16.5, 1.17 and higher for GKE.

In AWS, I am using the latest version supported by EKS which is v1.16.8 (you can find all supported versions of K8S for EKS here: https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html).

Have you tried deploying and noticed that the IP wasn't exported in the environment?

@ChrisTomAlx

Hi @taktakpeops
Oh, you are absolutely right. GKE does support 1.17, but not on a stable channel. It does not really matter, though, now that I know hostIP was actually implemented in 1.7.0+ and not 1.17.0+.

Thanks for your quick response. You just saved me a whole lot of cluster reconfiguration.

@ChrisTomAlx

Hey @taktakpeops ..

I have a couple of questions. If time permits please do get back to me.

  1. The status.hostIP uses the node's IP address, correct? But this IP address must be accessible from outside through an external IP. So my question is: did you have to manually set up the nodes' external IPs?

  2. Each Prosody / Jicofo combo uses different XMPP domains, right?

     - name: XMPP_DOMAIN
       value: jitsi.meet
     - name: XMPP_AUTH_DOMAIN
       value: auth.jitsi.meet
     - name: XMPP_INTERNAL_MUC_DOMAIN
       value: internal-muc.jitsi.meet
     - name: XMPP_MUC_DOMAIN
       value: muc.jitsi.meet
     - name: XMPP_GUEST_DOMAIN
       value: guest.jitsi.meet

    For example, all the domains above should change dynamically based on the Prosody / Jicofo combo, right? How did you manage to do that?

  3. I see some config files in your repo. Are those just to show the configs within the containers, or are they somehow used in the Helm chart?

@taktakpeops
Owner Author

Hello,

For question 1: yes, you need to be able to remotely access your nodes, which means they need a public IP. In AWS, I did so by enabling SSH access in my node group. The public IPs are detected using STUN servers.

Question 2: yes, you are right, but for that I am writing logic in Helm which will generate the domains for each Jicofo / Prosody couple (work in progress).

Question 3: the configs are there for testing. I want to start using Octo and enable some modules in Prosody.
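
A rough sketch of what that Helm logic could look like (the values keys here are hypothetical, not the chart's actual templates): each Jicofo / Prosody couple gets its own set of domains derived from the instance index.

{{- $base := $.Values.xmppDomain | default "jitsi.meet" }}
{{- range $i := until (int $.Values.prosodyInstanceCount) }}
- name: XMPP_DOMAIN
  value: "meet-{{ $i }}.{{ $base }}"
- name: XMPP_AUTH_DOMAIN
  value: "auth.meet-{{ $i }}.{{ $base }}"
- name: XMPP_MUC_DOMAIN
  value: "muc.meet-{{ $i }}.{{ $base }}"
{{- end }}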

@ChrisTomAlx

Awesome, thanks for the quick response. There seems to be some issue on my end when using NodePort: the JVB is not accessible. Tried with LoadBalancer and it works.

kubectl describe pod {jvb pod} gives me:
Name:           jitsi-meet-jvb-0-7c5cdc6f54-slfn5
Namespace:      default
Priority:       0
Node:           gke-neumeet-pool-1-4768d01b-xd50/10.128.0.26
Start Time:     Fri, 22 May 2020 17:15:34 +0530
Labels:         app.kubernetes.io/instance=neumeet
                app.kubernetes.io/name=jitsi-meet-jvb-0
                pod-template-hash=7c5cdc6f54
Annotations:    kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container jitsi-meet-jvb
Status:         Running
IP:             10.56.4.199
IPs:            <none>
Controlled By:  ReplicaSet/jitsi-meet-jvb-0-7c5cdc6f54
Containers:
  jitsi-meet-jvb:
    Container ID:   docker://7819c812939676fe72a614256ec57ce8f83a11ab0aa71c2121f9501878a90daf
    Image:          asia.gcr.io/webrtcvideo-274408/neutrinos/jvb:v20
    Image ID:       docker-pullable://asia.gcr.io/webrtcvideo-274408/neutrinos/jvb@sha256:a2e405eb74e33accc0e62ec1a798d918874b505fe58cd4f0e77bc0ac719fff43
    Port:           30300/UDP
    Host Port:      0/UDP
    State:          Running
      Started:      Fri, 22 May 2020 17:15:37 +0530
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:  100m
    Environment:
      XMPP_SERVER:                 jitsi-meet-prosody
      PROSODY_INSTANCE:            0
      JICOFO_AUTH_USER:            focus
      JICOFO_AUTH_PASSWORD:        <set to the key 'JICOFO_AUTH_PASSWORD' in secret 'jitsi-meet-jicofo-config'>  Optional: false
      JVB_AUTH_USER:               jvb
      JVB_AUTH_PASSWORD:           <set to the key 'JVB_AUTH_PASSWORD' in secret 'jitsi-meet-jvb-config'>           Optional: false
      JICOFO_COMPONENT_SECRET:     <set to the key 'JICOFO_COMPONENT_SECRET' in secret 'jitsi-meet-jicofo-config'>  Optional: false
      JVB_PORT:                    30300
      JVB_STUN_SERVERS:            stun.l.google.com:19302,stun1.l.google.com:19302,stun2.l.google.com:19302
      JVB_TCP_HARVESTER_DISABLED:  true
      DOCKER_HOST_ADDRESS:          (v1:status.hostIP)
      JVB_OPTS:                    --apis=xmpp,rest
      ENABLE_STATISTICS:           true
      XMPP_DOMAIN:                 testmeet.neutrinos.co
      XMPP_AUTH_DOMAIN:            auth.jitsi-meet-prosody.default.svc
      XMPP_INTERNAL_MUC_DOMAIN:    internal-muc.jitsi-meet-prosody.default.svc
      XMPP_MUC_DOMAIN:             muc.testmeet.neutrinos.co
      XMPP_GUEST_DOMAIN:           guest.jitsi-meet-prosody.default.svc
      JVB_BREWERY_MUC:             jvbbrewery
      TZ:                          Europe/Amsterdam
    Mounts:
      /defaults from config (rw)
      /var/run/docker.sock from dockersock (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from neumeet-jitsi-meet-token-tclkm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  dockersock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      jitsi-meet-jvb-config-cm
    Optional:  false
  neumeet-jitsi-meet-token-tclkm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  neumeet-jitsi-meet-token-tclkm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From                                       Message
  ----    ------     ----  ----                                       -------
  Normal  Scheduled  30m   default-scheduler                          Successfully assigned default/jitsi-meet-jvb-0-7c5cdc6f54-slfn5 to gke-neumeet-pool-1-4768d01b-xd50
  Normal  Pulling    30m   kubelet, gke-neumeet-pool-1-4768d01b-xd50  Pulling image "asia.gcr.io/webrtcvideo-274408/neutrinos/jvb:v20"
  Normal  Pulled     30m   kubelet, gke-neumeet-pool-1-4768d01b-xd50  Successfully pulled image "asia.gcr.io/webrtcvideo-274408/neutrinos/jvb:v20"
  Normal  Created    30m   kubelet, gke-neumeet-pool-1-4768d01b-xd50  Created container jitsi-meet-jvb
  Normal  Started    30m   kubelet, gke-neumeet-pool-1-4768d01b-xd50  Started container jitsi-meet-jvb

Anyhow, just dropped this here in case you find something off. I will look into STUN servers; I think that must be why I am having issues, because I am able to ping the external IP of the node from an external network but cannot reach it over UDP, which is strange. So at first glance, STUN servers should solve this issue, if I am not wrong.
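
For reference, the NodePort service for that JVB would look roughly like this (a sketch based on the describe output above; it assumes 30300 falls inside the cluster's NodePort range, which is 30000-32767 by default, so the node port can match JVB_PORT):

apiVersion: v1
kind: Service
metadata:
  name: jitsi-meet-jvb-0
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: jitsi-meet-jvb-0
  ports:
    - name: jvb-udp
      protocol: UDP
      port: 30300
      targetPort: 30300
      nodePort: 30300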

@taktakpeops
Owner Author

@ChrisTomAlx, for the stun + public IP detection, you should see that in the logs of JVB.

Did you make sure that the security group for your nodes accept traffic on port 30300?

@ChrisTomAlx

@taktakpeops Yea, will look through the logs for some hints.
It's not a private cluster, so I assumed the nodes would accept traffic without problems. Let me recheck that. Thanks for the suggestions!!

@taktakpeops
Owner Author

@ChrisTomAlx: it's not guaranteed that the nodes will have the port open. In EKS, for example, the public network for the nodes has its own security group, which has to be modified accordingly after deploying Jitsi to allow UDP traffic to the nodes.

@ChrisTomAlx

ChrisTomAlx commented May 24, 2020

Hey @taktakpeops, you were right. There was a firewall blocking UDP traffic; trying to get around it.
In the meantime, I see someone else is also trying to replace the ALSA dependency in Jibri with PulseAudio. Look at this comment.
Here is his GitHub repo. I am not sure if it works, but since you said you were looking at removing node-level changes from Jibri, I thought you would be interested.

@taktakpeops
Owner Author

taktakpeops commented May 25, 2020

Hey @ChrisTomAlx - thanks for the suggestion. After playing around with ALSA (used currently in Jitsi), I ended up realizing that it requires a modification of the node OS, and therefore specialized nodes for running Jibri.

I saw quite a few other people also modifying the config of Jibri to use PulseAudio. After going through issues / pull requests and the forum, I found out that PulseAudio was originally used in Jibri. The Jitsi team dropped it because of stability issues: while recording calls with a large number of participants, they had drops / packet loss. They replaced it with ALSA.

I want to keep the Helm chart with the official containers.

For now, I am looking into how the AMIs for EKS are configured and how to enable some specific kernel modules, such as the one currently missing: soundcore.

Once soundcore is enabled, we can run the node with access to an ALSA dummy soundcard (ideally supporting X recordings at a time).

The custom AMI will then be used for all the nodes. If X recordings at a time aren't possible, we can still run one Jibri per node.

I also thought about another (weird and terrible) thing: running Jibri inside a VM running inside a container running QEMU 🤣

@ChrisTomAlx

I also thought about another (weird and terrible) thing: running Jibri inside a VM running inside a container running QEMU 🤣

This is intriguing, but I am even scared to ask about it after reading up a bit on QEMU. Although it could work, right? It will be one heavy pod, but I think anything is better than making node-level changes.


I saw quite a few other people also modifying the config of Jibri to use PulseAudio. After going through issues / pull requests and the forum, I found out that PulseAudio was originally used in Jibri. The Jitsi team dropped it because of stability issues: while recording calls with a large number of participants, they had drops / packet loss. They replaced it with ALSA.

Nice catch; I would have spent a lot of time going down that rabbit hole otherwise.


I want to keep the Helm chart with the official containers.

Agreed. I would prefer the official ones as well. Then we can simply change container images and stay up to date with new features.


For now, I am looking into how the AMIs for EKS are configured and how to enable some specific kernel modules, such as the one currently missing: soundcore.

For GCP, I could change the node image to Ubuntu, then SSH into each of the VMs and change them manually as suggested by the Jitsi team and this PR. Although this won't work in the case of node auto-scaling; hence my plan to create a daemonset that runs on all nodes and adjusts them to run Jibri. Just a side note: remember one Jibri pod can only do one recording at a time. This video gave me a lot of insight into Jibri; leaving it here in case you want to watch it as well.

@ChrisTomAlx

Also, thanks a ton for the firewall tip. Got JVB to run as expected. Testing scaling next.

@taktakpeops
Owner Author

This is intriguing, but I am even scared to ask about it after reading up a bit on QEMU. Although it could work, right? It will be one heavy pod, but I think anything is better than making node-level changes.

Yes - it could work, but it would also be hell to debug - it's a high level of inception :D

Nice catch; I would have spent a lot of time going down that rabbit hole otherwise.

Agreed - it's pretty tricky to gather info about Jitsi, as it's all over the place.

For GCP, I could change the node image to Ubuntu, then SSH into each of the VMs and change them manually as suggested by the Jitsi team and this PR. Although this won't work in the case of node auto-scaling; hence my plan to create a daemonset that runs on all nodes and adjusts them to run Jibri. Just a side note: remember one Jibri pod can only do one recording at a time. This video gave me a lot of insight into Jibri; leaving it here in case you want to watch it as well.

For GCP, we might have the same issue as with AWS, which is that the VMs don't have a sound device. I am looking further into snd-dummy, but it remains a kernel module and is not included in Ubuntu on AWS EC2.

Will keep you posted once I find a workaround for that.

@ChrisTomAlx

ChrisTomAlx commented May 27, 2020

I just followed the steps below, as shown in the PR, for each node on GCP, and I was able to get Jibri to run.

For each node in the cluster, do the following:

Install generic kernel image

sudo apt-get update
sudo apt-get install linux-image-generic

Change grub to start node VM with generic kernel

# GKE Node
sudo vim /etc/default/grub.d/50-cloudimg-settings.cfg
# Other Setup
sudo vim /etc/default/grub

Replace #GRUB_DEFAULT=0 with:

# Make the latest generic kernel the default
release=$(linux-version list | grep -e '-generic$' | sort -V | tail -n1)
GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux $release"

Then update GRUB:

sudo update-grub

Reboot VM

sudo reboot

Setup virtual sound device in the node

install the module

sudo apt update && sudo apt install linux-image-extra-virtual

configure 5 capture/playback interfaces

sudo su
echo "options snd-aloop enable=1,1,1,1,1 index=0,1,2,3,4" > /etc/modprobe.d/alsa-loopback.conf

set up autoloading of the module

echo "snd-aloop" >> /etc/modules

Reboot

sudo reboot

check that the module is loaded

lsmod | grep snd_aloop

@taktakpeops
Owner Author

Hey @ChrisTomAlx, thank you for sharing.

I don’t see anything in your steps related to setting up the snd-dummy component. Is it present on your nodes?

@ChrisTomAlx

The 'Setup virtual sound device in the node' section handles that, I assume.

You can also refer to this. The official Jitsi documentation mentions how to set it up on AWS, although I guess they meant it for VMs. In GKE, the Kubernetes nodes are also VMs; I am not sure if it's the same in AWS.

For GCP, if node auto-scaling and node auto-repair are on, these changes won't stick, so I had to turn those off as well.

@ChrisTomAlx

Here is my Jibri YAML file. You might want to change it slightly according to your needs.
This is a one-time-only download link.

@taktakpeops
Owner Author

If you run « aplay -L » in your terminal, what does it print?

Also, master and nodes are VMs. Inside the node VM, kubelet must be running to register with the master. Then pod scheduling happens in this VM.

@ChrisTomAlx

If you run « aplay -L » in your terminal, what does it print?

It just says -bash: aplay: command not found. But I do have a Jibri pod running and working on this node, and all I did was follow the steps above and make sure the node image was Ubuntu instead of Google's own node image.


Also, master and nodes are VMs. Inside the node VM, kubelet must be running to register with the master. Then pod scheduling happens in this VM.

Yes, makes sense - similar to GKE.

@taktakpeops
Owner Author

Ah OK - I installed ALSA on the VM to test my OS setup.

Got it to work on AWS. Now, for Jibri, how do you mount the device for the pods?
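
As the full manifests later in this thread show, the approach that ended up working is to expose the node's ALSA device to the pod via a hostPath volume and a privileged security context. A minimal sketch, with a placeholder image:

apiVersion: v1
kind: Pod
metadata:
  name: jibri
spec:
  volumes:
    - name: dev-snd
      hostPath:
        path: /dev/snd
        type: Directory
  containers:
    - name: jibri
      image: jitsi/jibri
      securityContext:
        privileged: true
      volumeMounts:
        - mountPath: /dev/snd
          name: dev-snd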

@taktakpeops
Owner Author

For the EC2 instance, I had to use a different GRUB_DEFAULT. Instead of the menu title, I used gnulinux-advanced-6156ec80-9446-4eb1-95e0-9ae6b7a46187>gnulinux-4.15.0-101-generic-advanced-6156ec80-9446-4eb1-95e0-9ae6b7a46187.

Could you share the logs from Jibri, to see how ALSA gets initialized and so on?

@taktakpeops
Owner Author

Sorry - checked everything, looks good.

Regarding the logs, I was just wondering whether ALSA is being loaded correctly or not. Because there are no health checks for Jibri, it can have crashed inside the pod but still be considered healthy, as the process keeps running (in all the Jitsi Docker containers, the software runs as a daemon).

I am going to script the setup for the EKS nodes.

One issue remains: currently, only one Jibri per node can run. Two instances on one node would create a conflict when accessing the device.

@ChrisTomAlx

ChrisTomAlx commented May 27, 2020

no worries 😄


One issue remains: currently, only one Jibri per node can run. Two instances on one node would create a conflict when accessing the device.

If you look at the 'Set interface' section as I mentioned above (quoting the same below), you can actually have more than one Jibri pod on a node. But that is because you set echo "options snd-aloop enable=1,1,1,1,1 index=0,1,2,3,4" > /etc/modprobe.d/alsa-loopback.conf on your node. The index, as per my understanding, means you can have 5 Jibri recordings running simultaneously on that node.

Look at my Jibri YAML file. Also, /home/jibri/.asoundrc contains the loopback device being used. Refer to the 'Set interface in file /home/jibri/.asoundrc inside a docker container' section here.


I actually did run two Jibri pods on the same node, with different interfaces - one as Loopback and the other as Loopback_1. I think the best way to do this, since it has to happen dynamically per pod, is to have a file at a hostPath which keeps a list of interfaces already in use, and then adjust that file in the postStart and preStop lifecycle hooks of the pods, so all the pods can access the same file and edit it as required. What do you think about this approach?
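
A rough sketch of that idea as pod-spec fragments (the /cards ledger path and the hook logic are hypothetical and not atomic - two pods starting at exactly the same time could still race):

      volumes:
        - name: cards
          hostPath:
            path: /var/lib/jibri-cards
            type: DirectoryOrCreate
      containers:
        - name: jibri
          volumeMounts:
            - mountPath: /cards
              name: cards
          lifecycle:
            postStart:
              exec:
                command:
                  - "sh"
                  - "-c"
                  - |
                    # claim the first loopback interface not yet recorded in the shared ledger
                    touch /cards/used
                    for i in Loopback Loopback_1 Loopback_2 Loopback_3 Loopback_4; do
                      if ! grep -qx "$i" /cards/used; then
                        echo "$i" >> /cards/used
                        sed -i "s/Loopback/$i/g" /home/jibri/.asoundrc
                        break
                      fi
                    done
            preStop:
              exec:
                command:
                  - "sh"
                  - "-c"
                  - |
                    # release the interface claimed by this pod
                    iface=$(grep -om1 'Loopback[_0-9]*' /home/jibri/.asoundrc)
                    grep -vx "$iface" /cards/used > /cards/used.tmp; mv /cards/used.tmp /cards/used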


Regarding the logs, I was just wondering whether ALSA is being loaded correctly or not. Because there are no health checks for Jibri, it can have crashed inside the pod but still be considered healthy, as the process keeps running (in all the Jitsi Docker containers, the software runs as a daemon).

You are right - health checks are an issue as well, I guess...

@taktakpeops
Owner Author

I think the best approach remains having a device plugin for managing the sound card. Still work in progress on my side 😄

@VengefulAncient

Hi guys! Awesome work on this so far - thank you both so much. I'm just coming here to point out that on GKE, it seems you don't need all the trickery with GRUB and a non-GCP boot image. If you are running Ubuntu, you can simply apt-get install linux-modules-extra-gcp and reboot. You will then get an up-to-date GCP image with the snd-aloop module.

@ChrisTomAlx

ChrisTomAlx commented May 27, 2020

Hey @VengefulAncient
Thanks for chipping in - this is brilliant. It just reduced the work of my daemonset by a lot. There might be something similar for AWS as well.

Although I will still have to do the following, I assume:

configure 5 capture/playback interfaces

sudo su
echo "options snd-aloop enable=1,1,1,1,1 index=0,1,2,3,4" > /etc/modprobe.d/alsa-loopback.conf

setup autoload the module

echo "snd-aloop" >> /etc/modules

Reboot

sudo reboot

Maybe the autoload and reboot steps aren't required, but configuring the playback interfaces still is, correct?

@taktakpeops
Owner Author

@VengefulAncient: thank you for the tip! I assume that using the following for AWS does the same: https://packages.ubuntu.com/bionic/linux-modules-extra-aws

Testing it now and will update the documentation accordingly.

@VengefulAncient

@ChrisTomAlx Unfortunately, it looks like the other steps are still required. The module is only installed for the new GCP image, which is not in use until the node is rebooted. I haven't gotten around to the DaemonSet yet, but perhaps its script could check whether the snd-aloop module is loaded, and if not, install the package, perform the configuration, and reboot the node? I'm still fairly new to this, so I'm not entirely sure whether that's possible.

@ChrisTomAlx

@VengefulAncient Agreed. I am a bit new too, but daemonsets should be able to handle this, as per my understanding.

@taktakpeops
Owner Author

(linked a snippet that checks whether the snd-aloop module is loaded and exits if not)

@VengefulAncient

@taktakpeops Sadly, that only seems to exit if snd-aloop is not loaded - for true automatic scalability, we'd want the configuration performed on the node automatically (by a privileged DaemonSet?) whenever the module is missing. It could start with the same snippet you linked, but instead of exit, run apt-get update && apt-get install linux-modules-extra-gcp && echo "options snd-aloop enable=1,1,1,1,1 index=0,1,2,3,4" > /etc/modprobe.d/alsa-loopback.conf && echo "snd-aloop" >> /etc/modules && reboot (obviously replacing linux-modules-extra-gcp with linux-modules-extra-aws for AWS).

@taktakpeops
Owner Author

I see - but I don't think it needs to be a modification of the node by Kubernetes. Instead, wouldn't it be easier to customize the script starting the VM? In the end, we need the update of the VM + a reboot before launching the kubelet.

@ChrisTomAlx

ChrisTomAlx commented May 27, 2020

Here is the daemonset I am working with:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: daejibri
spec:
  selector:
      matchLabels:
        name: daejibri # Label selector that determines which Pods belong to the DaemonSet
  template:
    metadata:
      labels:
        name: daejibri # Pod template's label selector
    spec:
      hostPID: true
      # nodeSelector:
      #   type: jibri
      containers:
      - name: daejibri
        image: asia.gcr.io/google-containers/startup-script:v1
        securityContext:
          privileged: true
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            mkdir -p /chris/hey/
            # apt-get install linux-modules-extra-gcp
            # sudo reboot
            # echo done

The asia.gcr.io/google-containers/startup-script:v1 image already exists in GCP by default. I think something similar might exist for AWS.

I ran the daemonset and noticed node-level changes after it ran (meaning I did ls / on the node and saw the chris folder, which did not exist before). So the script I ran was mkdir -p /chris/hey/. I assume you can replace that with anything you want, and then we have truly scalable Jibri.

One thing I would add is node selectors for both the Jibri and daemonset pods, so those nodes will only have Jibri running on them.

@ChrisTomAlx

Instead, wouldn't it be easier to customize the script starting the VM? In the end, we need the update of the VM + a reboot before launching the kubelet.

@taktakpeops I am not sure that is possible on GKE, but I recall reading that it is possible on AWS. This is from memory, so I could be wrong.

@taktakpeops
Owner Author

It seems possible using a gcloud config file (https://cloud.google.com/container-optimized-os/docs/how-to/create-configure-instance#using_cloud-init_with_the_cloud_config_format).

All the setup of the node is managed through systemd (https://cloud.google.com/kubernetes-engine/docs/concepts/node-images#system_initialization) - so the script checking the setup and exiting in case of issues would then be a systemd service.

For installing the dependencies and doing the reboot (when needed), the --metadata-from-file flag allows you to upload a local file (a Bash script) which will be executed on startup (https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--metadata-from-file and https://cloud.google.com/compute/docs/startupscript#startupscriptlocalfile).

@taktakpeops
Owner Author

It's to be tested - I don't have a GKE cluster available currently to confirm.

@ChrisTomAlx

I will look into it, but I am not sure this will work, because GKE VMs and Compute Engine VMs don't always follow the same rules. I ran into a similar issue with the firewall: I could not add firewall rules to VMs within Kubernetes, so I had to go around it by editing an already existing rule.

In https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--metadata, it specifically mentions that startup-script is reserved for use, which is understandable, since GKE sets most of it up - the master/worker setup and all the Kubernetes components:

Additionally, the following keys are reserved for use by Kubernetes Engine:

cluster-location
cluster-name
cluster-uid
configure-sh
enable-os-login
gci-update-strategy
gci-ensure-gke-docker
instance-template
kube-env
startup-script
user-data

Still, thanks for the in-depth research and all the links - very helpful. Let me check if it's doable.

@taktakpeops
Owner Author

Let me know how it goes.

On another note, a custom AMI or base image seems the best pattern to follow: kubernetes/kops#387

@ChrisTomAlx

Yes, I agree. Unfortunately, I don't think GKE VMs support custom base images, while GCP's Compute Engine VMs do.

@VengefulAncient

@ChrisTomAlx I don't think it's necessarily such a bad thing. Frankly, I'd prefer it if the configuration was done by the DaemonSet - this approach lends itself much better to IaC (Infrastructure as Code). Doing additional configuration by fetching scripts that have to be stored somewhere and differ between providers is less maintainable. Ideally, all I'd want to do is add a node pool with a label/taint that prevents other pods from being scheduled there, deploy the chart with a few customized values, and have everything configure itself. The privileged DaemonSet configuring the nodes for us plays into that nicely - if it can do that.

@ChrisTomAlx

@VengefulAncient I am also leaning towards the daemonset currently, mostly because I know it can be done. There are only two problems I see.
Firstly, Kubernetes frowns upon privileged pods.
Secondly, the image I am currently using in the container seems to be created by Google themselves, so this method might only work on GKE unless we can find the same image on Docker Hub or some other public Docker repo.

@ChrisTomAlx

I have compiled all my work over the past month here. Hope this helps people who stumble on this. Some points to note:

  • I used this chart to provision volume claims that allow ReadWriteMany mode.
  • The daemonset currently is for GCP only, but if you can find a docker image that does the same as GCP's default startup-script image, it should work in other environments as well.
  • Make the appropriate changes that are mentioned before the code lines in each section.
  • Make sure ingress is set up correctly with a domain name, so you can reach the getrecording API.
  • Make sure you create a nodepool with the Kubernetes label type: jibri, with each node having around 2-4 vCPU and 1-4 GB RAM. Enable auto-scaling on this nodepool, and change the horizontal pod autoscaler in the Jibri YAML file below to match the scaling you want. Each node will only hold one Jibri pod. I tried multiple other ways; unfortunately, this is the most consistent and infinitely scalable approach I could find. You need to set the CPU utilization based on when you want the next node to start up. Node startup, along with running the startup scripts, can take roughly 7-10 minutes, so it is a good idea to have a couple of extra Jibri pods ready.

My finished deployment files - please change the deployment images (Jibri and getRecording) as you require.

# DAEMONSET FOR STARTUP SCRIPTS
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: daejibri
spec:
  selector:
      matchLabels:
        name: daejibri # Label selector that determines which Pods belong to the DaemonSet
  template:
    metadata:
      labels:
        name: daejibri # Pod template's label selector
    spec:
      hostPID: true 
      nodeSelector:
        type: jibri
      containers:
      - name: daejibri
        image: asia.gcr.io/google-containers/startup-script:v1
        securityContext: 
          privileged: true 
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        env: 
        - name: STARTUP_SCRIPT 
          value: | 
            #! /bin/bash
            mkdir -p /yourcompany;
            if [ -z "$(lsmod | grep -om1 snd_aloop)" ]; 
            then
              sudo apt update --yes && sudo apt-get install linux-modules-extra-gcp --yes && sudo echo "options snd-aloop enable=1 index=0" > /etc/modprobe.d/alsa-loopback.conf && sudo echo "snd-aloop" >> /etc/modules && sudo reboot 
            fi

---

# DEPLOYMENT FOR JIBRI MAIN

apiVersion: apps/v1
kind: Deployment
metadata:
  name: yourcompany-jitsi-meet-jibri
spec:
  replicas: 1
  selector:
    matchLabels:
      app: yourcompany-jitsi-meet-jibri
  template:
    metadata:
      labels:
        app: yourcompany-jitsi-meet-jibri
    spec:
      nodeSelector:
        type: jibri
      serviceAccountName: yourcompany-jitsi-meet
      securityContext:
        fsGroup: 999
      volumes:
        - name: dev-snd
          hostPath:
            path: "/dev/snd"
            type: Directory
        - name: dev-shm
          hostPath:
            path: "/dev/shm"
            type: Directory
        - name: test-volume-claim
          persistentVolumeClaim:
            claimName: test-volume-claim
      containers:
        - name: yourcompany-jitsi-meet-jibri
          image: 11111<===replace with your jibri image or the latest on dockerhub===>11111
          imagePullPolicy: Always
          resources:
            requests:
              memory: ".8Gi"
              cpu: "2.0"
            limits:
              memory: "1.0Gi"
              cpu: "2.1"
          volumeMounts:
            - mountPath: "/dev/snd"
              name: dev-snd
            - mountPath: "/dev/shm"
              name: dev-shm
            - mountPath: "/data/recordings"
              name: test-volume-claim
          securityContext:
            privileged: true
            capabilities:
              add:
                - SYS_ADMIN
          lifecycle:
            postStart:
              exec:
                command:
                  - "sh"
                  - "-c"
                  - >
                    mkdir -p "/config/recordings";
                    mkdir -p "/data/recordings";
                    echo "mv -f /config/recordings/* /data/recordings" > /config/finalize.sh;
          envFrom:
          - configMapRef:
              name: yourcompany-jitsi-meet-jicofo
          - configMapRef:
              name: yourcompany-prosody-common
          - configMapRef:
              name: yourcompany-jitsi-meet-web
          env:
          - name: DISPLAY
            value: ':0'
          - name: JIBRI_FINALIZE_RECORDING_SCRIPT_PATH
            value: /config/finalize.sh
          - name: JIBRI_STRIP_DOMAIN_JID
            value: muc
          - name: JIBRI_LOGS_DIR
            value: /config/logs
          - name: JIBRI_RECORDING_DIR
            value: /config/recordings

---

# DEPLOYMENT FOR JIBRI GET-RECORDING API

apiVersion: apps/v1
kind: Deployment
metadata:
  name: yourcompany-jitsi-meet-getrecording
spec:
  replicas: 3
  selector:
    matchLabels:
      app: yourcompany-jitsi-meet-getrecording
  template:
    metadata:
      labels:
        app: yourcompany-jitsi-meet-getrecording
    spec:
      volumes:
        - name: test-volume-claim
          persistentVolumeClaim:
            claimName: test-volume-claim
      containers:
        - name: yourcompany-jitsi-meet-getrecording
          image: 11111<===Your get recording docker image===>11111
          imagePullPolicy: Always
          volumeMounts:
            - mountPath: "/data/recordings"
              name: test-volume-claim
          lifecycle:
            postStart:
              exec:
                command:
                  - "sh"
                  - "-c"
                  - >
                    mkdir -p "/data/recordings";

---

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-volume-claim
spec:
  storageClassName: "nfs"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

---

# SERVICE TO GET RECORDING

apiVersion: v1
kind: Service
metadata:
  name: yourcompany-meet-get-recording
  labels:
    app: yourcompany-jitsi-meet-getrecording
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
    name: http
  selector:
    app: yourcompany-jitsi-meet-getrecording

---

# HORIZONTAL POD SCALER FOR JIBRI POD

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: yourcompany-jitsi-meet-jibri-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name:  yourcompany-jitsi-meet-jibri
  minReplicas: 3
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 3

---

# HORIZONTAL POD SCALER FOR JVB POD

# apiVersion: autoscaling/v2beta1
# kind: HorizontalPodAutoscaler
# metadata:
#   name: yourcompany-jitsi-meet-jibri-hpa
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name:  yourcompany-jitsi-meet-jibri
#   minReplicas: 3
#   maxReplicas: 6
#   metrics:
#   - type: Resource
#     resource:
#       name: cpu
#       targetAverageUtilization: 3
My get-recording source code - NodeJS

It's not perfect, but it mostly works. Make sure to change yourJitsiMeetFullDomainName. Use this to dockerize it.

const http = require('http');
const fs = require('fs');
const os = require('os');
var glob = require("glob");
const path = require("path");
// var contentDisposition = require('content-disposition')
const server = http.createServer((function (req, res) {
    if (req.url != '/favicon.ico') {
        try {
            let url = req.url;
            let filePattern = url.split("/")[url.split("/").length - 1].trim();
            filePattern = filePattern.toLowerCase();
            // let recordingPath= "D:" + path.sep + "Work" + path.sep + "yourcompany" + path.sep + "Experiments" + path.sep + "jitsi-helm" + path.sep + "data" + path.sep + "recordings";
            let recordingPath = path.sep + "data" + path.sep + "recordings";
            let pathSplitter = os.type() == "Windows_NT" ? '/' : path.sep;
            let files = glob.sync(recordingPath + "/**/*.mp4", {});
            console.log(files);
            let fileNames=[];
            let validFilePaths=[];
            
            // Get only valid file paths and file names for the corresponding conference 
            for (let index = 0; index < files.length; index++) {
                let fullFilePath = files[index];
                let fileName = fullFilePath.split(pathSplitter)[fullFilePath.split(pathSplitter).length - 1];
                if(fileName.includes(filePattern) && filePattern !== "") {
                    fileNames.push(fileName);
                    validFilePaths.push(fullFilePath);
                }
                console.log("fileName",fileName);
            }
            console.log("fileNames");
            console.log(fileNames);
            console.log("validFilePaths");
            console.log(validFilePaths);

            // Show no conference found message if not found else get the array of files and serve the latest 
            if(fileNames.length == 0) {
                res.writeHead(200, {'Content-Type': 'text/plain'});
                res.end('Sorry no such conference found! Please recheck the conference name and open yourJitsiMeetFullDomainName/api/getrecording/<conference_name> in your browser');
            } else {

                // Get the array of dates
                let dates = [];
                for (let index = 0; index < fileNames.length; index++) {
                    let fullFileName = fileNames[index];
                    let fileHalves = fullFileName.split("_");
                    let timeAndExtHalf = fileHalves[fileHalves.length - 1];
                    let timeHalf = timeAndExtHalf.split(".")[0];
                    let extHalf = timeAndExtHalf.split(".")[1];
                    let timeSplit = timeHalf.split("-");
                    dates.push(new Date(timeSplit[0]+'/'+timeSplit[1]+'/'+timeSplit[2]+' '+timeSplit[3]+':'+timeSplit[4]+':'+timeSplit[5]));
                    console.log(timeHalf);
                    console.log(extHalf);
                }
                console.log(dates);
                let max = dates.reduce(function (a, b) { return a > b ? a : b; });
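                // NB: toISOString() returns UTC, so this matching assumes the timestamps in the file names are UTC as well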
                let maxDateString = max.toISOString().replace(/T/, '-').replace(/\..+/, '').replace(/:/g, '-');
                console.log(maxDateString);
                let finalfileName = fileNames.reduce(function (a, b) { return a.includes(maxDateString) ? a : b; });
                let finalFilePath = validFilePaths.reduce(function (a, b) { return a.includes(finalfileName) ? a : b; });

                // Download the file
                var stream = fs.createReadStream(finalFilePath);
                res.writeHead(200, { 'Content-disposition': 'attachment; filename='+finalfileName+'.mp4', 'Content-Type' : 'video/mp4' });
                stream.pipe(res); // also you can set content-type
            }
        } catch (error) {
            console.log(error);
        }
    }
}));
server.listen(80, () => {
    console.log('server started');
});

@taktakpeops
Owner Author

Hi @ChrisTomAlx - sorry for the late reply; I was a bit busy on another project! Thank you for your feedback; I will look carefully into it today!

As it seems that the chart works, let's start preparing to push it to the central Helm repo? :D

@VengefulAncient

VengefulAncient commented Jun 19, 2020

Incidentally, same situation as @taktakpeops for me: we're just getting back to Jitsi after some other stuff that took priority away from it. I have a few questions for both of you that I hope you can help me with:

@ChrisTomAlx :

  1. Why are you using a deployment with an HPA instead of a daemonset, if you only plan to run one Jibri pod per node? The daemonset will also let you simply use an init container to bootstrap your nodes, which is especially handy if you dedicate a node pool only to Jibri pods - nodes containing other components will not need to be rebooted.
  2. I assume the PVC is only used for recordings? Our company isn't interested in them, only livestreams, so I'd prefer to skip that part if possible. (BTW, 👍 on the nfs-server-provisioner; we use it for a persistent NGINX cache shared between pods in RWX mode and it mostly works great.)
  3. We're configuring five ALSA loopback interfaces for each node (echo "options snd-aloop enable=1,1,1,1,1 index=0,1,2,3,4" > /etc/modprobe.d/alsa-loopback.conf). But each Jibri pod can only do one recording/livestream at a time, correct? Does that mean we'd have to run five Jibri pods per node to actually make use of these extra interfaces? And if so, why do you prefer to keep only one Jibri pod per node - is that because of the stability issues you mentioned?

@taktakpeops :

  1. Did I understand correctly that this chart doesn't actually do infinite scaling and only sets a minimum and maximum number of replicas?
  2. Are we actually clear on what needs to be scaled? I've been digging through a bunch of other Jitsi/Jibri threads, and it seems like we only really need to scale Jibri and JVB, not Prosody etc. I could be wrong, though.

@ everyone:

Each JVB pod needs its own UDP port. This brings up two problems.

  1. Do we have an idea of how to scale them infinitely (or at least to a few hundred replicas) without hardcoding the amount into values? We'd somehow need to store, read, and write the list of already-claimed ports. I don't have enough Kubernetes experience to know whether this is easily possible.
  2. Opening these ports requires a firewall rule - is it possible to create one as a Kubernetes object? So far, I've been doing it manually on GKE using gcloud compute firewall-rules for just one or two hardcoded JVB ports, and while it's theoretically possible to do it from a daemonset init script by installing the Google Cloud SDK (Google actually suggests doing just that in their example), it's not very maintainable and generally clumsy.

As always, thank you both for continuing to look into this, your efforts are highly appreciated.

@ChrisTomAlx

@taktakpeops Sure, although I am not entirely certain Helm is still accepting charts into their repo.

@VengefulAncient Here are my views on these questions :-

@ChrisTomAlx :

  1. Why are you using a deployment with an HPA instead of a daemonset, if you only plan to run one Jibri pod per node? The daemonset will also let you simply use an init container to bootstrap your nodes, which is especially handy if you dedicate a node pool only to Jibri pods - nodes containing other components will not need to be rebooted.
  2. I assume the PVC is only used for recordings? Our company isn't interested in them, only livestreams, so I'd prefer to skip that part if possible. (BTW, 👍 on the nfs-server-provisioner; we use it for a persistent NGINX cache shared between pods in RWX mode and it mostly works great.)
  3. We're configuring five ALSA loopback interfaces for each node (echo "options snd-aloop enable=1,1,1,1,1 index=0,1,2,3,4" > /etc/modprobe.d/alsa-loopback.conf). But each Jibri pod can only do one recording/livestream at a time, correct? Does that mean we'd have to run five Jibri pods per node to actually make use of these extra interfaces? And if so, why do you prefer to keep only one Jibri pod per node - is that because of the stability issues you mentioned?

  1. Daemonsets can't be scaled, as far as I can tell. HPAs allow you to scale the deployment as you wish, based on the CPU and memory utilization of the pods.
  2. The PVC is used only for recording, although I haven't tested the livestream interaction. If you see some problems there, let me know and I can look into it.
  3. This is a bit more complicated. Initially I had 5 Jibri deployments (and 5 Jibri HPAs) instead of one, with podAntiAffinity. That worked alright, but recordings were crashing at 5-minute intervals. It could be a CPU issue, but I could not find the cutoff required. Also, the Kubernetes docs mention not to use podAntiAffinity in clusters with several hundred nodes, so I decided this would be the only infinitely scalable option. You can alter the HPA based on your need: provide a min and a max, and set the CPU utilization at which you want to start up a new pod.

@ everyone:
Each JVB pod needs its own UDP port. This brings up two problems.

  1. Do we have an idea of how to scale them infinitely (or at least to a few hundred replicas) without hardcoding the amount into values? We'd somehow need to store, read, and write the list of already-claimed ports. I don't have enough Kubernetes experience to know whether this is easily possible.
  2. Opening these ports requires a firewall rule - is it possible to create one as a Kubernetes object? So far, I've been doing it manually on GKE using gcloud compute firewall-rules for just one or two hardcoded JVB ports, and while it's theoretically possible to do it from a daemonset init script by installing the Google Cloud SDK (Google actually suggests doing just that in their example), it's not very maintainable and generally clumsy.

  1. HPAs should allow you to do that. You can alter the HPA anytime during the life of the deployment, although if you do a manual edit, you might also have to do a manual delete when you want to delete the chart (not sure, though). Also, autoscaling of nodepools makes sure you are not spending money unnecessarily by having too many unused nodes present.
  2. You can create a firewall rule in GCP, and then when creating a nodepool in GKE it will ask you for a network label. That should work, but I haven't tested it. I just edited an existing firewall rule that applies to all the Compute Engine VMs and made it allow the correct UDP port. That was purely for testing; I would not encourage you to do this in prod. I would suggest going the network-label way; it should work IMHO. Give this nodepool a Kubernetes label that matches JVB's nodeSelector, give it a network label, and all your JVB pods will sit there with welcoming arms for any UDP requests.
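
For example, if the JVB pods carry a nodeSelector like the following (the type: jvb label is whatever you assign to the nodepool at creation time), they will only ever land on the nodes whose firewall rule opens the JVB port:

      nodeSelector:
        type: jvb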

Hope this helps!!

@taktakpeops
Owner Author

@ChrisTomAlx: you are right. It's not about pushing to their repo, but about making the chart available by following these guidelines: https://github.com/helm/hub/blob/master/Repositories.md

Regarding the HPA, it applies only to the web element and the JVB element, which can be scaled. Jibri + Jicofo could be moved from a statefulset to a daemonset; that would be more logical, I think.

The PVCs, such as the broadcaster one, are optional; if you don't want them, you can disable them in the values file used to deploy your chart.

For the recording part, I am not sure ALSA is the best solution, as we discussed earlier with @ChrisTomAlx - more investigation is ongoing on my side.

Regarding the HPA for JVB, I want to get it to work in sync with Octo.

About Jibri, an HPA doesn't make much sense, I think, as you would also require vertical scaling (more sound devices).

@VengefulAncient

@ChrisTomAlx

  1. Daemonsets automatically schedule one pod on each node that matches their labels/tolerations. If you are only scheduling one Jibri pod per node anyway, and then starting more nodes based on CPU usage, it should be more or less the same.
  2. Livestreaming without recording does work without PVCs; that's why I was wondering :) I got used to the idea of recordings being saved on the nodes and not persistent volumes, but your idea is better in case we decide to record, since I assume nodes can be killed at any time (I am using pre-emptible nodes on staging to cut costs).
  3. Sorry, I'm getting confused. Do you mean that you had 5 different Jibri deployments with their own replicasets, each depositing one pod per node, resulting in 5 pods per node? That's an interesting idea; I hadn't thought of that. I'm not interested in doing that, though, since we're only interested in having one participant per conference and livestreaming to YouTube, and for that purpose 2 vCPUs and 2 GB RAM do the job - just make sure to set disableThirdPartyRequests: true in your Jitsi web component config (see this issue) to avoid Jibri eating all the RAM. So for my purposes, I can just have one Jibri pod on each such node. Trying to have more will overload the CPUs. (BTW, I'm currently using custom N2D nodes on GKE to make sure I get AMD EPYC and not some random older Intel - somehow, with N1 nodes with the same amount of RAM and vCPUs, the stream was doing much poorer.)

@taktakpeops

  1. Do we actually need to scale Jicofo, though? I'm not trying to argue against it, BTW; I'm just really confused by all the components, and I'm not sure what Jicofo actually does.
  2. ALSA might not be the best solution, but I already have a script bootstrapping Ubuntu nodes for it, and it all works. If that part can be thrown out, that would be great, but I prefer it over having to modify any of the Jitsi component images - that's frankly a huge pain because of the poor documentation, confusing errors, and insane number of environment variables. Completely unmaintainable, IMO. Not sure a Helm chart that makes you modify an image before it works would be too popular, either. (Though the init script also currently requires Ubuntu nodes, which aren't standard on either GKE or AWS, so that might be a moot point.)
  3. Could you please link me somewhere I can read about Octo? I swear I must be really stupid, but somehow I was not able to find any documentation on what it is, besides passing mentions by Jitsi maintainers. Is it a Kubernetes thing? A Jitsi thing?
  4. Why would we need more sound devices, though? You need pretty powerful nodes to handle multiple Jibri pods. It makes more sense to just run one Jibri per node (with weaker nodes), using one audio device. The costs on any major cloud provider are per CPU/RAM, so having one node with, let's say, 8 vCPUs and 16 GB RAM that can handle 4 Jibri pods costs the same as having 4 nodes with 2 vCPUs and 4 GB RAM each - sure, there's some Kubernetes overhead you get rid of by moving to fewer larger nodes (fewer control-plane pods), but it's not worth the hassle IMO. Unless I misunderstand something?

@ everyone

  1. Of course I want HPAs and autoscaling; we definitely don't want to spend money on idle nodes :) But my point was that every JVB needs a unique UDP port, which then needs to be opened in the firewall, defined in the JVB service, and set in the JVB_PORT variable for each JVB pod. I don't see how HPAs would let me achieve that. Unless I'm again misunderstanding something? (Seems to be a running trend with Jitsi.)
  2. Do you mean opening all UDP ports just for the node pool that holds JVB pods? I guess that could work, though I'd probably go with a certain range instead.

Also, a new question came up: in my deployment, Jibri sometimes becomes ready before Prosody, so it (predictably) can't authenticate and just fails (which is where a k8s-native application would just kill the pod and restart, but alas, no such thing with Jibri). Normally I'd solve that by attaching an init container that polls the Prosody service, but since none of these components have proper healthcheck endpoints, this isn't going to work - Prosody only speaks XMPP, not the HTTP status codes I could use for Kubernetes health checks. Have either of you run into this issue? If so, how do you handle it? The official Jitsi docker-compose.yaml simply has depends_on directives for each component, which gives us an understanding of which components should be ready before others, but Kubernetes sadly still does not support container ordering.
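
One common workaround is an init container that blocks until the Prosody port accepts TCP connections - a sketch, assuming the Prosody service is named jitsi-meet-prosody (as in the describe output earlier) and listens on the standard XMPP client port 5222:

      initContainers:
        - name: wait-for-prosody
          image: busybox:1.31
          command:
            - sh
            - -c
            - |
              # block until the Prosody service accepts TCP connections on the XMPP port
              until nc -z jitsi-meet-prosody 5222; do
                echo "waiting for prosody..."
                sleep 5
              done

This only proves the port is open, not that Prosody is fully functional, and it doesn't help when Prosody restarts after Jibri is already running.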

TIA, and sorry for another wall of text 😢

@ChrisTomAlx

ChrisTomAlx commented Jun 20, 2020

Daemonsets automatically schedule one pod on each node that matches their labels/tolerations. If you are only scheduling one Jibri pod per node anyway, and then starting more nodes based on CPU usage, it should be more or less the same.

But how do you scale it? You need an HPA, right, to check and scale based on CPU utilisation? And HPAs don't work on daemonsets as far as I know - or at least they are not designed for that purpose.


Sorry, I'm getting confused. Do you mean that you had 5 different Jibri deployments with their own replicasets, each depositing one pod per node, resulting in 5 pods per node?

Exactly, but I had to use podAntiAffinity, and that isn't infinitely scalable, so I went with the one-pod-per-node concept. Also, I was experiencing random crashes, which could maybe be solved by adding more CPU, but I stopped going down that rabbit hole after the podAntiAffinity issue.


  1. Of course I want HPAs and autoscaling; we definitely don't want to spend money on idle nodes :) But my point was that every JVB needs a unique UDP port, which then needs to be opened in the firewall, defined in the JVB service, and set in the JVB_PORT variable for each JVB pod. I don't see how HPAs would let me achieve that. Unless I'm again misunderstanding something? (Seems to be a running trend with Jitsi.)
  2. Do you mean opening all UDP ports just for the node pool that holds JVB pods? I guess that could work, though I'd probably go with a certain range instead.

So what I was planning is to have an HPA for JVB, again with one JVB per node. The firewall rule will only open one particular UDP port on all the nodes of the JVB nodepool. This is again vertical scaling, so I would not suggest this setup for on-prem deployments.


Also, a new question came up: in my deployment, Jibri sometimes becomes ready before Prosody

I had that issue as well. For now, I am deploying Jibri only once Prosody is up, but there has to be a better way: if Prosody restarts for whatever reason, all the Jibri pods will go down with no restart. So that is an issue. If you do find a workaround, do post it here.

@marcoadasilvaa

+1

@taktakpeops
Owner Author

Hello @ChrisTomAlx,

Sorry (again) for the late reply.

But how do you scale it? You need an HPA, right, to check and scale based on CPU utilisation? And HPAs don't work on daemonsets as far as I know - or at least they are not designed for that purpose.

In this case, you aren't scaling horizontally but vertically: if you need a new daemonset pod, you spawn a new node.

However, now that I am running Jitsi at scale on an EC2 infrastructure, I have realized that for JVB, Octo can work in K8S with BRIDGE_SELECTION_STRATEGY set to IntraRegionBridgeSelectionStrategy. In that case, your JVBs can be managed by a deployment, and therefore you can apply an HPA. You just need to ensure that K8S doesn't kill an instance while it still has traffic.
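
A sketch of what that could look like, mirroring the HPA style used earlier in this thread (the names are placeholders and the thresholds are arbitrary):

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: yourcompany-jitsi-meet-jvb-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: yourcompany-jitsi-meet-jvb
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 60

Keeping conferences alive during scale-down would still need a long terminationGracePeriodSeconds (and ideally a preStop hook that waits for the bridge to drain), since the HPA itself has no notion of active conferences.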

Exactly, but I had to use podAntiAffinity, and that isn't infinitely scalable, so I went with the one-pod-per-node concept. Also, I was experiencing random crashes, which could maybe be solved by adding more CPU, but I stopped going down that rabbit hole after the podAntiAffinity issue.

If we are auto-scaling the nodes of the cluster for JVB, Jibri can benefit from the same logic (so 2 daemonsets in this case).


So what I was planning is to have an HPA for JVB, again with one JVB per node. The firewall rule will only open one particular UDP port on all the nodes of the JVB nodepool. This is again vertical scaling, so I would not suggest this setup for on-prem deployments.

Can do, but it needs a custom image for Jicofo + JVB (SIP properties) - see my first answer.


I had that issue as well. For now, I am deploying Jibri only once Prosody is up, but there has to be a better way: if Prosody restarts for whatever reason, all the Jibri pods will go down with no restart. So that is an issue. If you do find a workaround, do post it here.

I still think that one pod containing Jicofo + Prosody + Jibri is the way to go - basically some kind of main pod that you would scale vertically, as explained for the daemonset.

I will be available tomorrow; if you are interested, we can plan a call online to answer most questions!

@taktakpeops
Owner Author

@VengefulAncient: I will reply to your questions today. As suggested in my previous answer, we can have a call to go through all the questions once and for all :D
