Skip to content

Commit

Permalink
[stable/airflow] improve celery worker graceful shutdown (helm#23504)
Browse files Browse the repository at this point in the history
Signed-off-by: Mathew Wicks <[email protected]>

[stable/prometheus] update prometheus to 2.20.1 and cm reloader to 0.4.0 (helm#23506)

* updated prometheus to 2.20.1 and cm reloader to 0.4.0

Signed-off-by: André Bauer <[email protected]>

* fix xpp version

Signed-off-by: André Bauer <[email protected]>

Update for CPI v1.2.0

update versions

Update

Signed-off-by: dvonthenen <[email protected]>

Update readme

Signed-off-by: dvonthenen <[email protected]>

Updates

Signed-off-by: dvonthenen <[email protected]>

Allow for global chart values from charts inheriting this one

Make globals default

Bump version

Fix global

Signed-off-by: dvonthenen <[email protected]>
  • Loading branch information
thesuperzapper authored and davidvonthenen committed Sep 3, 2020
1 parent 7ce0e6a commit dd75d11
Show file tree
Hide file tree
Showing 23 changed files with 156 additions and 99 deletions.
2 changes: 1 addition & 1 deletion stable/airflow/Chart.yaml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
apiVersion: v1
description: Airflow is a platform to programmatically author, schedule and monitor workflows
name: airflow
version: 7.3.0
version: 7.4.0
appVersion: 1.10.10
icon: https://airflow.apache.org/_images/pin_large.png
home: https://airflow.apache.org/
Expand Down
10 changes: 6 additions & 4 deletions stable/airflow/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ kubectl exec \
> NOTE: for chart version numbers, see [Chart.yaml](Chart.yaml) or [helm hub](https://hub.helm.sh/charts/stable/airflow).
For steps you must take when upgrading this chart, please review:
* [v7.3.X → v7.4.0](UPGRADE.md#v73x--v740)
* [v7.2.X → v7.3.0](UPGRADE.md#v72x--v730)
* [v7.1.X → v7.2.0](UPGRADE.md#v71x--v720)
* [v7.0.X → v7.1.0](UPGRADE.md#v70x--v710)
Expand Down Expand Up @@ -236,11 +237,12 @@ workers:
celery:
instances: 10
# wait until all tasks are finished before SIGTERM of Pod
## wait at most 10min for running tasks to complete
gracefullTermination: true
gracefullTerminationPeriod: 600
# wait AT MOST 10min for tasks on a worker to finish before SIGKILL
terminationPeriod: 600
## how many seconds (after the 10min) to wait before SIGKILL
terminationPeriod: 60
dags:
git:
Expand Down Expand Up @@ -592,7 +594,7 @@ __Airflow Worker Values:__
| `workers.autoscaling.*` | configs for the HorizontalPodAutoscaler of the worker Pods | `<see values.yaml>` |
| `workers.initialStartupDelay` | the number of seconds to wait (in bash) before starting each worker container | `0` |
| `workers.celery.*` | configs for the celery worker Pods | `<see values.yaml>` |
| `workers.terminationPeriod` | how many seconds to wait for tasks on a worker to finish before SIGKILL | `60` |
| `workers.terminationPeriod` | how many seconds to wait after SIGTERM before SIGKILL of the celery worker | `60` |
| `workers.secretsDir` | directory in which to mount secrets on worker containers | `/var/airflow/secrets` |
| `workers.secrets` | secret names which will be mounted as a file at `{workers.secretsDir}/<secret_name>` | `[]` |
| `workers.secretsMap` | you can use secretsMap to specify a map and all the secrets will be stored within it secrets will be mounted as files at `{workers.secretsDir}/<secrets_in_map>`. If you use workers.secretsMap, then it overrides `workers.secrets`.| `""` |
Expand Down
18 changes: 18 additions & 0 deletions stable/airflow/UPGRADE.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,23 @@
# Upgrading Steps

## `v7.3.X``v7.4.0`

__The following IMPROVEMENTS have been made:__

* Reduced how likely it is for a celery worker to receive SIGKILL with graceful termination enabled.
* Celery worker graceful shutdown lifecycle:
1. prevent worker accepting new tasks
2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
3. send `SIGTERM` to worker
4. wait AT MOST `workers.terminationPeriod` for kill to finish
5. send `SIGKILL` to worker
* NOTE:
* if you currently use a high value of `workers.terminationPeriod`, consider lowering it to `60` and setting a high value for `workers.celery.gracefullTerminationPeriod`

__The following values have been ADDED:__

* `workers.celery.gracefullTerminationPeriod`

## `v7.2.X``v7.3.0`

__The following IMPROVEMENTS have been made:__
Expand Down
10 changes: 7 additions & 3 deletions stable/airflow/examples/google-gke/custom-values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -239,13 +239,17 @@ workers:
##
instances: 10

## if we should wait for tasks to finish on a celery worker before SIGTERM of Pod
## if we should wait for tasks to finish before SIGTERM of the celery worker
##
gracefullTermination: true

## how many seconds to wait for tasks on a worker to finish before SIGKILL
## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
##
gracefullTerminationPeriod: 600

## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
##
terminationPeriod: 600
terminationPeriod: 60

## directory in which to mount secrets on worker containers
##
Expand Down
2 changes: 1 addition & 1 deletion stable/airflow/templates/configmap-scripts.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ data:
# loop until all active task are finished
echo "*** waiting for active tasks to finish"
while (( celery inspect --broker $AIRFLOW__CELERY__BROKER_URL --destination celery@$HOSTNAME --json active | python3 -c "import json; active_tasks = json.loads(input())['celery@$HOSTNAME']; print(len(active_tasks))" > 0 )); do
sleep 30
sleep 10
done
preinit-db.sh: |
#!/bin/bash
Expand Down
9 changes: 8 additions & 1 deletion stable/airflow/templates/statefulsets-workers.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,11 @@ spec:
- name: {{ .Values.airflow.image.pullSecret }}
{{- end }}
restartPolicy: Always
{{- if .Values.workers.celery.gracefullTermination }}
terminationGracePeriodSeconds: {{ add .Values.workers.terminationPeriod .Values.workers.celery.gracefullTerminationPeriod }}
{{- else }}
terminationGracePeriodSeconds: {{ .Values.workers.terminationPeriod }}
{{- end }}
serviceAccountName: {{ include "airflow.serviceAccountName" . }}
{{- if .Values.workers.nodeSelector }}
nodeSelector:
Expand Down Expand Up @@ -148,7 +152,10 @@ spec:
lifecycle:
preStop:
exec:
command: ["/home/airflow/scripts/graceful-stop-celery-worker.sh"]
command:
- "timeout"
- "{{ .Values.workers.celery.gracefullTerminationPeriod }}s"
- "/home/airflow/scripts/graceful-stop-celery-worker.sh"
{{- end}}
envFrom:
- configMapRef:
Expand Down
22 changes: 17 additions & 5 deletions stable/airflow/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -530,14 +530,26 @@ workers:
##
instances: 1

## if we should wait for tasks to finish on a celery worker before SIGTERM of Pod
##
## NOTE:
## - `workers.terminationPeriod` is still the overall timeout before worker Pods are killed using SIGKILL
## if we should wait for tasks to finish before SIGTERM of the celery worker
##
gracefullTermination: false

## how many seconds to wait for tasks on a worker to finish before SIGKILL
## how many seconds to wait for tasks to finish before SIGTERM of the celery worker
##
## graceful shutdown lifecycle:
## 1. prevent worker accepting new tasks
## 2. wait AT MOST `workers.celery.gracefullTerminationPeriod` for tasks to finish
## 3. send SIGTERM to worker
## 4. wait AT MOST `workers.terminationPeriod` for kill to finish
## 5. send SIGKILL to worker
##
gracefullTerminationPeriod: 600

## how many seconds to wait after SIGTERM before SIGKILL of the celery worker
##
## WARNING:
## - tasks that are still running during SIGKILL will be orphaned, this is important
## to understand with KubernetesPodOperator(), as Pods may continue running
##
terminationPeriod: 60

Expand Down
4 changes: 2 additions & 2 deletions stable/prometheus/Chart.yaml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
apiVersion: v1
name: prometheus
version: 11.11.1
appVersion: 2.19.0
version: 11.12.0
appVersion: 2.20.1
description: Prometheus is a monitoring system and time series database.
home: https://prometheus.io/
icon: https://raw.githubusercontent.com/prometheus/prometheus.github.io/master/assets/prometheus_logo-cb55bb5c346.png
Expand Down
6 changes: 3 additions & 3 deletions stable/prometheus/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -185,7 +185,7 @@ Parameter | Description | Default
`configmapReload.prometheus.enabled` | If false, the configmap-reload container for Prometheus will not be deployed | `true`
`configmapReload.prometheus.name` | configmap-reload container name | `configmap-reload`
`configmapReload.prometheus.image.repository` | configmap-reload container image repository | `jimmidyson/configmap-reload`
`configmapReload.prometheus.image.tag` | configmap-reload container image tag | `v0.3.0`
`configmapReload.prometheus.image.tag` | configmap-reload container image tag | `v0.4.0`
`configmapReload.prometheus.image.pullPolicy` | configmap-reload container image pull policy | `IfNotPresent`
`configmapReload.prometheus.extraArgs` | Additional configmap-reload container arguments | `{}`
`configmapReload.prometheus.extraVolumeDirs` | Additional configmap-reload volume directories | `{}`
Expand All @@ -194,7 +194,7 @@ Parameter | Description | Default
`configmapReload.alertmanager.enabled` | If false, the configmap-reload container for AlertManager will not be deployed | `true`
`configmapReload.alertmanager.name` | configmap-reload container name | `configmap-reload`
`configmapReload.alertmanager.image.repository` | configmap-reload container image repository | `jimmidyson/configmap-reload`
`configmapReload.alertmanager.image.tag` | configmap-reload container image tag | `v0.3.0`
`configmapReload.alertmanager.image.tag` | configmap-reload container image tag | `v0.4.0`
`configmapReload.alertmanager.image.pullPolicy` | configmap-reload container image pull policy | `IfNotPresent`
`configmapReload.alertmanager.extraArgs` | Additional configmap-reload container arguments | `{}`
`configmapReload.alertmanager.extraVolumeDirs` | Additional configmap-reload volume directories | `{}`
Expand Down Expand Up @@ -282,7 +282,7 @@ Parameter | Description | Default
`server.enabled` | If false, Prometheus server will not be created | `true`
`server.name` | Prometheus server container name | `server`
`server.image.repository` | Prometheus server container image repository | `prom/prometheus`
`server.image.tag` | Prometheus server container image tag | `v2.19.0`
`server.image.tag` | Prometheus server container image tag | `v2.20.1`
`server.image.pullPolicy` | Prometheus server container image pull policy | `IfNotPresent`
`server.configPath` | Path to a prometheus server config file on the container FS | `/etc/config/prometheus.yml`
`server.global.scrape_interval` | How frequently to scrape targets by default | `1m`
Expand Down
6 changes: 3 additions & 3 deletions stable/prometheus/values.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -332,7 +332,7 @@ configmapReload:
##
image:
repository: jimmidyson/configmap-reload
tag: v0.3.0
tag: v0.4.0
pullPolicy: IfNotPresent

## Additional configmap-reload container arguments
Expand Down Expand Up @@ -370,7 +370,7 @@ configmapReload:
##
image:
repository: jimmidyson/configmap-reload
tag: v0.3.0
tag: v0.4.0
pullPolicy: IfNotPresent

## Additional configmap-reload container arguments
Expand Down Expand Up @@ -568,7 +568,7 @@ server:
##
image:
repository: prom/prometheus
tag: v2.19.2
tag: v2.20.1
pullPolicy: IfNotPresent

## prometheus server priorityClassName
Expand Down
4 changes: 2 additions & 2 deletions stable/vsphere-cpi/Chart.yaml
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
apiVersion: v1
appVersion: 1.1.0
appVersion: 1.2.1
description: A Helm chart for vSphere Cloud Provider Interface Manager (CPI)
name: vsphere-cpi
version: 0.1.3
version: 0.2.0
keywords:
- vsphere
- vmware
Expand Down
6 changes: 3 additions & 3 deletions stable/vsphere-cpi/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,12 +66,12 @@ The following table lists the configurable parameters of the vSphere CPI chart a

| Parameter | Description | Default |
|------------------------------------------|-------------------------------------|----------------------------------------|
| `podSecurityPolicy.enabled` | Enable pod sec policy (k8s > 1.17) | false |
| `podSecurityPolicy.enabled` | Enable pod sec policy (k8s > 1.17) | true |
| `podSecurityPolicy.annotations` | Annotations for pd sec policy | nil |
| `securityContext.enabled` | Enable sec context for container | false |
| `securityContext.runAsUser` | RunAsUser. Default is `nobody` in | 65534 |
| `securityContext.runAsUser` | RunAsUser. Default is `nobody` in | 1001 |
| | distroless image | |
| `securityContext.fsGroup` | FsGroup. Default is `nobody` in | 65534 |
| `securityContext.fsGroup` | FsGroup. Default is `nobody` in | 1001 |
| | distroless image | |
| `config.enabled` | Create a simple single VC config | false |
| `config.vcenter` | FQDN or IP of vCenter | vcenter.local |
Expand Down
2 changes: 1 addition & 1 deletion stable/vsphere-cpi/templates/common.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ metadata:
app: {{ template "cpi.name" . }}
vsphere-cpi-infra: common-configmap
chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
component: cloud-controller
component: cloud-controller-manager
heritage: {{ .Release.Service }}
release: {{ .Release.Name }}
namespace: {{ .Release.Namespace }}
Expand Down
26 changes: 15 additions & 11 deletions stable/vsphere-cpi/templates/configmap.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
{{- if .Values.config.enabled -}}
{{- if .Values.config.enabled | default .Values.global.config.enabled -}}
apiVersion: v1
kind: ConfigMap
metadata:
Expand All @@ -13,15 +13,19 @@ metadata:
namespace: {{ .Release.Namespace }}
data:
vsphere.conf: |
[Global]
# properties in this section will be used for all specified vCenters unless overriden in VirtualCenter section.
port = "443" #Optional
insecure-flag = "1" #set to 1 if the vCenter uses a self-signed cert
# settings for using k8s secret
secret-name = "vsphere-cpi"
secret-namespace = "{{ .Release.Namespace }}"
# Global properties in this section will be used for all specified vCenters unless overriden in VirtualCenter section.
global:
port: 443
# set insecure-flag to true if the vCenter uses a self-signed cert
insecureFlag: true
# settings for using k8s secret
secretName: vsphere-cpi
secretNamespace: {{ .Release.Namespace }}
[VirtualCenter "{{ .Values.config.vcenter }}"]
datacenters = "{{ .Values.config.datacenter }}"
# port, insecure-flag will be used from Global section.
# VirtualCenter section
vcenter:
{{ .Release.Name }}:
server: {{ .Values.config.vcenter | default .Values.global.config.vcenter }}
datacenters:
- {{ .Values.config.datacenter | default .Values.global.config.datacenter }}
{{- end -}}
50 changes: 26 additions & 24 deletions stable/vsphere-cpi/templates/daemonset.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,16 @@ metadata:
app: {{ template "cpi.name" . }}
vsphere-cpi-infra: daemonset
chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
component: cloud-controller
component: cloud-controller-manager
tier: control-plane
heritage: {{ .Release.Service }}
release: {{ .Release.Name }}
namespace: {{ .Release.Namespace }}
{{- if .Values.daemonset.annotations }}
annotations:
{{ toYaml .Values.daemonset.annotations | indent 4 }}
{{- end }}
scheduler.alpha.kubernetes.io/critical-pod: ""
{{- if .Values.daemonset.annotations }}
{{- toYaml .Values.daemonset.annotations | indent 4 }}
{{- end }}
spec:
selector:
matchLabels:
Expand All @@ -22,46 +24,44 @@ spec:
type: RollingUpdate
template:
metadata:
{{- if .Values.daemonset.podAnnotations }}
{{- if .Values.daemonset.podAnnotations }}
annotations:
{{ toYaml .Values.daemonset.podAnnotations | indent 8 }}
{{- end }}
{{- toYaml .Values.daemonset.podAnnotations | indent 8 }}
{{- end }}
labels:
app: {{ template "cpi.name" . }}
component: cloud-controller
component: cloud-controller-manager
tier: control-plane
release: {{ .Release.Name }}
vsphere-cpi-infra: daemonset
{{- if .Values.daemonset.podLabels }}
{{ toYaml .Values.daemonset.podLabels | indent 8 }}
{{- end }}
{{- if .Values.daemonset.podLabels }}
{{- toYaml .Values.daemonset.podLabels | indent 8 }}
{{- end }}
spec:
nodeSelector:
{{- if .Values.daemonset.nodeSelector }}
{{ toYaml .Values.daemonset.nodeSelector | indent 8 }}
{{- else }}
node-role.kubernetes.io/master: ""
{{- end }}
securityContext:
runAsUser: 0
{{- if .Values.daemonset.nodeSelector }}
{{- toYaml .Values.daemonset.nodeSelector | indent 8 }}
{{- end }}
tolerations:
{{- if .Values.daemonset.tolerations }}
{{ toYaml .Values.daemonset.tolerations | indent 8 }}
{{- else }}
- key: node.cloudprovider.kubernetes.io/uninitialized
value: "true"
effect: NoSchedule
- key: node-role.kubernetes.io/master
effect: NoSchedule
{{- end }}
- key: node.kubernetes.io/not-ready
effect: NoSchedule
operator: Exists
{{- if .Values.daemonset.tolerations }}
{{- toYaml .Values.daemonset.tolerations | indent 8 }}
{{- end }}
{{- if .Values.securityContext.enabled }}
securityContext:
fsGroup: {{ .Values.securityContext.fsGroup }}
runAsUser: {{ .Values.securityContext.runAsUser }}
{{- end }}
serviceAccountName: {{ .Values.serviceAccount.name }}
{{- if .Values.daemonset.useHostNetwork }}
hostNetwork: true
{{- end }}
dnsPolicy: {{ .Values.daemonset.dnsPolicy }}
containers:
- name: {{ template "cpi.name" . }}
Expand Down Expand Up @@ -91,8 +91,10 @@ spec:
- mountPath: {{ .Values.daemonset.cmdline.cloudConfig.dir }}
name: vsphere-config-volume
readOnly: true
{{- if .Values.daemonset.resources }}
resources:
{{ toYaml .Values.daemonset.resources | indent 10 }}
{{- toYaml .Values.daemonset.resources | indent 10 }}
{{- end }}
volumes:
- name: vsphere-config-volume
configMap:
Expand Down
Loading

0 comments on commit dd75d11

Please sign in to comment.