Learn how to configure and manage the Postgres Operator in your Kubernetes (K8s) environment.
Minor version upgrades for PostgreSQL are handled via updating the Spilo Docker image. The operator will carry out a rolling update of Pods which includes a switchover (planned failover) of the master to the Pod with new minor version. The switch should usually take less than 5 seconds, still clients have to reconnect.
Major version upgrades are supported via cloning.
The new cluster manifest must have a higher version
string than the source
cluster and will be created from a basebackup. Depending of the cluster size,
downtime in this case can be significant as writes to the database should be
stopped and all WAL files should be archived first before cloning is started.
Note, that simply changing the version string in the postgresql
manifest does
not work at present and leads to errors. Neither Patroni nor Postgres Operator
can do in place pg_upgrade
. Still, it can be executed manually in the Postgres
container, which is tricky (i.e. systems need to be stopped, replicas have to be
synced) but of course faster than cloning.
CustomResourceDefinitions
will be registered with schema validation by default when the operator is
deployed. The OperatorConfiguration
CRD will only get created if the
POSTGRES_OPERATOR_CONFIGURATION_OBJECT
environment variable
in the deployment yaml is set and not empty.
When submitting manifests of postgresql
or
OperatorConfiguration
custom
resources with kubectl, validation can be bypassed with --validate=false
. The
operator can also be configured to not register CRDs with validation on ADD
or
UPDATE
events. Running instances are not affected when enabling the validation
afterwards unless the manifests is not changed then. Note, that the provided CRD
manifests contain the validation for users to understand what schema is
enforced.
Once the validation is enabled it can only be disabled manually by editing or patching the CRD manifest:
kubectl patch crd postgresqls.acid.zalan.do -p '{"spec":{"validation": null}}'
If your cluster uses a DNS domain other than the default cluster.local
, this
needs to be set in the operator configuration (cluster_domain
variable). This
is used by the operator to connect to the clusters after creation.
The operator can run in a namespace other than default
. For example, to use
the test
namespace, run the following before deploying the operator's
manifests:
kubectl create namespace test
kubectl config set-context $(kubectl config current-context) --namespace=test
All subsequent kubectl
commands will work with the test
namespace. The
operator will run in this namespace and look up needed resources - such as its
ConfigMap - there. Please note that the namespace for service accounts and
cluster role bindings in operator RBAC rules
needs to be adjusted to the non-default value.
Watching a namespace for an operator means tracking requests to change Postgres clusters in the namespace such as "increase the number of Postgres replicas to 5" and reacting to the requests, in this example by actually scaling up.
By default, the operator watches the namespace it is deployed to. You can
change this by setting the WATCHED_NAMESPACE
var in the env
section of the
operator deployment manifest or by
altering the watched_namespace
field in the operator
configuration.
In the case both are set, the env var takes the precedence. To make the
operator listen to all namespaces, explicitly set the field/env var to "*
".
Note that for an operator to manage pods in the watched namespace, the
operator's service account (as specified in the operator deployment manifest)
has to have appropriate privileges to access the watched namespace. The
operator may not be able to function in the case it watches all namespaces but
lacks access rights to any of them (except K8s system namespaces like
kube-system
). The reason is that for multiple namespaces operations such as
'list pods' execute at the cluster scope and fail at the first violation of
access rights.
By default, multiple operators can only run together in one K8s cluster when isolated into their own namespaces. But, it is also possible to define ownership between operator instances and Postgres clusters running all in the same namespace or K8s cluster without interfering.
First, define the CONTROLLER_ID
environment variable in the operator deployment manifest. Then specify the ID
in every Postgres cluster manifest you want this operator to watch using the
"acid.zalan.do/controller"
annotation:
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
name: demo-cluster
annotations:
"acid.zalan.do/controller": "second-operator"
spec:
...
Every other Postgres cluster which lacks the annotation will be ignored by this
operator. Conversely, operators without a defined CONTROLLER_ID
will ignore
clusters with defined ownership of another operator.
To avoid accidental deletes of Postgres clusters the operator can check the manifest for two existing annotations containing the cluster name and/or the current date (in YYYY-MM-DD format). The name of the annotation keys can be defined in the configuration. By default, they are not set which disables the delete protection. Thus, one could choose to only go with one annotation.
postgres-operator ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-operator
data:
delete_annotation_date_key: "delete-date"
delete_annotation_name_key: "delete-clustername"
OperatorConfiguration
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
name: postgresql-operator-configuration
configuration:
kubernetes:
delete_annotation_date_key: "delete-date"
delete_annotation_name_key: "delete-clustername"
Now, every cluster manifest must contain the configured annotation keys to
trigger the delete process when running kubectl delete pg
. Note, that the
Postgresql
resource would still get deleted as K8s' API server does not
block it. Only the operator logs will tell, that the delete criteria wasn't
met.
cluster manifest
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
name: demo-cluster
annotations:
delete-date: "2020-08-31"
delete-clustername: "demo-cluster"
spec:
...
In case, the resource has been deleted accidentally or the annotations were
simply forgotten, it's safe to recreate the cluster with kubectl create
.
Existing Postgres cluster are not replaced by the operator. But, as the
original cluster still exists the status will show CreateFailed
at first.
On the next sync event it should change to Running
. However, as it is in
fact a new resource for K8s, the UID will differ which can trigger a rolling
update of the pods because the UID is used as part of backup path to S3.
The manifest operator-service-account-rbac.yaml
defines the service account, cluster roles and bindings needed for the operator
to function under access control restrictions. The file also includes a cluster
role postgres-pod
with privileges for Patroni to watch and manage pods and
endpoints. To deploy the operator with this RBAC policies use:
kubectl create -f manifests/configmap.yaml
kubectl create -f manifests/operator-service-account-rbac.yaml
kubectl create -f manifests/postgres-operator.yaml
kubectl create -f manifests/minimal-postgres-manifest.yaml
For each namespace the operator watches it creates (or reads) a service account
and role binding to be used by the Postgres Pods. The service account is bound
to the postgres-pod
cluster role. The name and definitions of these resources
can be configured.
Note, that the operator performs no further syncing of namespaced service
accounts and role bindings.
By default postgresql
custom resources can only be listed and changed by
cluster admins. To allow read and/or write access to other human users apply
the user-facing-clusterrole
manifest:
kubectl create -f manifests/user-facing-clusterroles.yaml
It creates zalando-postgres-operator:user:view, :edit and :admin clusterroles that are aggregated into the K8s default roles.
To ensure Postgres pods are running on nodes without any other application pods, you can use taints and tolerations and configure the required toleration in the operator configuration.
As an example you can set following node taint:
kubectl taint nodes <nodeName> postgres=:NoSchedule
And configure the toleration for the Postgres pods by adding following line to the ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-operator
data:
toleration: "key:postgres,operator:Exists,effect:NoSchedule"
For an OperatorConfiguration resource the toleration should be defined like this:
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
name: postgresql-configuration
configuration:
kubernetes:
toleration:
postgres: "key:postgres,operator:Exists,effect:NoSchedule"
Note that the K8s version 1.13 brings taint-based eviction
to the beta stage and enables it by default. Postgres pods by default receive
tolerations for unreachable
and noExecute
taints with the timeout of 5m
.
Depending on your setup, you may want to adjust these parameters to prevent
master pods from being evicted by the K8s runtime. To prevent eviction
completely, specify the toleration by leaving out the tolerationSeconds
value
(similar to how Kubernetes' own DaemonSets are configured)
To ensure Postgres pods are running on different topologies, you can use pod anti affinity and configure the required topology in the operator configuration.
Enable pod anti affinity by adding following line to the operator ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-operator
data:
enable_pod_antiaffinity: "true"
Likewise, when using an OperatorConfiguration resource add:
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
name: postgresql-configuration
configuration:
kubernetes:
enable_pod_antiaffinity: true
By default the topology key for the pod anti affinity is set to
kubernetes.io/hostname
, you can set another topology key e.g.
failure-domain.beta.kubernetes.io/zone
. See built-in node labels for available topology keys.
By default the operator uses a PodDisruptionBudget (PDB) to protect the cluster
from voluntarily disruptions and hence unwanted DB downtime. The MinAvailable
parameter of the PDB is set to 1
which prevents killing masters in single-node
clusters and/or the last remaining running instance in a multi-node cluster.
The PDB is only relaxed in two scenarios:
- If a cluster is scaled down to
0
instances (e.g. for draining nodes) - If the PDB is disabled in the configuration (
enable_pod_disruption_budget
)
The PDB is still in place having MinAvailable
set to 0
. If enabled it will
be automatically set to 1
on scale up. Disabling PDBs helps avoiding blocking
Kubernetes upgrades in managed K8s environments at the cost of prolonged DB
downtime. See PR #384
for the use case.
In some cases, you might want to add labels
that are specific to a given
Postgres cluster, in order to identify its child objects. The typical use case
is to add labels that identifies the Pods
created by the operator, in order
to implement fine-controlled NetworkPolicies
.
postgres-operator ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-operator
data:
inherited_labels: application,environment
OperatorConfiguration
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
name: postgresql-operator-configuration
configuration:
kubernetes:
inherited_labels:
- application
- environment
cluster manifest
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
name: demo-cluster
labels:
application: my-app
environment: demo
spec:
...
network policy
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
name: netpol-example
spec:
podSelector:
matchLabels:
application: my-app
environment: demo
It is possible to configure a ConfigMap as well as a Secret which are used by the Postgres pods as an additional provider for environment variables. One use case is to customize the Spilo image and configure it with environment variables. Another case could be to provide custom cloud provider or backup settings.
In general the Operator will give preference to the globally configured variables, to not have the custom ones interfere with core functionality. Variables with the 'WAL_' and 'LOG_' prefix can be overwritten though, to allow backup and logshipping to be specified differently.
The ConfigMap with the additional settings is referenced in the operator's main configuration. A namespace can be specified along with the name. If left out, the configured default namespace of your K8s client will be used and if the ConfigMap is not found there, the Postgres cluster's namespace is taken when different:
postgres-operator ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-operator
data:
# referencing config map with custom settings
pod_environment_configmap: default/postgres-pod-config
OperatorConfiguration
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
name: postgresql-operator-configuration
configuration:
kubernetes:
# referencing config map with custom settings
pod_environment_configmap: default/postgres-pod-config
referenced ConfigMap postgres-pod-config
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-pod-config
namespace: default
data:
MY_CUSTOM_VAR: value
The key-value pairs of the ConfigMap are then added as environment variables to the Postgres StatefulSet/pods.
The Secret with the additional variables is referenced in the operator's main configuration. To protect the values of the secret from being exposed in the pod spec they are each referenced as SecretKeyRef. This does not allow for the secret to be in a different namespace as the pods though
postgres-operator ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-operator
data:
# referencing secret with custom environment variables
pod_environment_secret: postgres-pod-secrets
OperatorConfiguration
apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
name: postgresql-operator-configuration
configuration:
kubernetes:
# referencing secret with custom environment variables
pod_environment_secret: postgres-pod-secrets
referenced Secret postgres-pod-secrets
apiVersion: v1
kind: Secret
metadata:
name: postgres-pod-secrets
namespace: default
data:
MY_CUSTOM_VAR: dmFsdWU=
The key-value pairs of the Secret are all accessible as environment variables to the Postgres StatefulSet/pods.
As a preventive measure, one can restrict the minimum and the maximum number of
instances permitted by each Postgres cluster managed by the operator. If either
min_instances
or max_instances
is set to a non-zero value, the operator may
adjust the number of instances specified in the cluster manifest to match
either the min or the max boundary. For instance, of a cluster manifest has 1
instance and the min_instances
is set to 3, the cluster will be created with 3
instances. By default, both parameters are set to -1
.
For any Postgres/Spilo cluster, the operator creates two separate K8s
services: one for the master pod and one for replica pods. To expose these
services to an outer network, one can attach load balancers to them by setting
enableMasterLoadBalancer
and/or enableReplicaLoadBalancer
to true
in the
cluster manifest. In the case any of these variables are omitted from the
manifest, the operator configuration settings enable_master_load_balancer
and
enable_replica_load_balancer
apply. Note that the operator settings affect
all Postgresql services running in all namespaces watched by the operator.
If load balancing is enabled two default annotations will be applied to its
services:
external-dns.alpha.kubernetes.io/hostname
with the value defined by the operator configsmaster_dns_name_format
andreplica_dns_name_format
. This value can't be overwritten. If any changing in its value is needed, it MUST be done changing the DNS format operator config parameters; andservice.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout
with a default value of "3600". This value can be overwritten with the operator config parametercustom_service_annotations
or the cluster parameterserviceAnnotations
.
To limit the range of IP addresses that can reach a load balancer, specify the
desired ranges in the allowedSourceRanges
field (applies to both master and
replica load balancers). To prevent exposing load balancers to the entire
Internet, this field is set at cluster creation time to 127.0.0.1/32
unless
overwritten explicitly. If you want to revoke all IP ranges from an existing
cluster, please set the allowedSourceRanges
field to 127.0.0.1/32
or to an
empty sequence []
. Setting the field to null
or omitting it entirely may
lead to K8s removing this field from the manifest due to its
handling of null fields.
Then the resultant manifest will not contain the necessary change, and the
operator will respectively do nothing with the existing source ranges.
The Postgres Operator periodically scans all K8s objects belonging to each cluster and repairs all discrepancies between them and the definitions generated from the current cluster manifest. There are two types of scans:
-
sync scan
, running everyresync_period
seconds for every cluster -
repair scan
, coming everyrepair_period
only for those clusters that didn't report success as a result of the last operation applied to them.
The operator is capable of maintaining roles of multiple kinds within a Postgres database cluster:
-
System roles are roles necessary for the proper work of Postgres itself such as a replication role or the initial superuser role. The operator delegates creating such roles to Patroni and only establishes relevant secrets.
-
Infrastructure roles are roles for processes originating from external systems, e.g. monitoring robots. The operator creates such roles in all Postgres clusters it manages, assuming that K8s secrets with the relevant credentials exist beforehand.
-
Per-cluster robot users are also roles for processes originating from external systems but defined for an individual Postgres cluster in its manifest. A typical example is a role for connections from an application that uses the database.
-
Human users originate from the Teams API that returns a list of the team members given a team id. The operator differentiates between (a) product teams that own a particular Postgres cluster and are granted admin rights to maintain it, and (b) Postgres superuser teams that get the superuser access to all Postgres databases running in a K8s cluster for the purposes of maintaining and troubleshooting.
The operator logs reasons for a rolling update with the info
level and a diff
between the old and new StatefulSet specs with the debug
level. To benefit
from numerous escape characters in the latter log entry, view it in CLI with
echo -e
. Note that the resultant message will contain some noise because the
PodTemplate
used by the operator is yet to be updated with the default values
used internally in K8s.
The operator also support lazy updates of the Spilo image. That means the pod
template of a PG cluster's stateful set is updated immediately with the new
image, but no rolling update follows. This feature saves you a switchover - and
hence downtime - when you know pods are re-started later anyway, for instance
due to the node rotation. To force a rolling update, disable this mode by
setting the enable_lazy_spilo_upgrade
to false
in the operator configuration
and restart the operator pod. With the standard eager rolling updates the
operator checks during Sync all pods run images specified in their respective
statefulsets. The operator triggers a rolling upgrade for PG clusters that
violate this condition.
The operator can manage K8s cron jobs to run logical backups of Postgres
clusters. The cron job periodically spawns a batch job that runs a single pod.
The backup script within this pod's container can connect to a DB for a logical
backup. The operator updates cron jobs during Sync if the job schedule changes;
the job name acts as the job identifier. These jobs are to be enabled for each
individual Postgres cluster by setting enableLogicalBackup: true
in its
manifest. Notes:
-
The example image implements the backup via
pg_dumpall
and upload of compressed and encrypted results to an S3 bucket; the default imageregistry.opensource.zalan.do/acid/logical-backup
is the same image built with the Zalando-internal CI pipeline.pg_dumpall
requires asuperuser
access to a DB and runs on the replica when possible. -
Due to the limitation of K8s cron jobs it is highly advisable to set up additional monitoring for this feature; such monitoring is outside of the scope of operator responsibilities.
-
The operator does not remove old backups.
-
You may use your own image by overwriting the relevant field in the operator configuration. Any such image must ensure the logical backup is able to finish in presence of pod restarts and simultaneous invocations of the backup cron job.
-
For that feature to work, your RBAC policy must enable operations on the
cronjobs
resource from thebatch
API group for the operator service account. See example RBAC
To access cloud resources like S3 from a cluster on bare metal you can use
additional_secret_mount
and additional_secret_mount_path
configuration
parameters. The cloud credentials will be provisioned in the Postgres containers
by mounting an additional volume from the given secret to database pods. They
can then be accessed over the configured mount path. Via
Custom Pod Environment Variables you can
point different cloud SDK's (AWS, GCP etc.) to this mounted secret, e.g. to
access cloud resources for uploading logs etc.
A secret can be pre-provisioned in different ways:
- Generic secret created via
kubectl create secret generic some-cloud-creds --from-file=some-cloud-credentials-file.json
- Automatically provisioned via a custom K8s controller like kube-aws-iam-controller
To configure the operator on GCP there are some prerequisites that are needed:
- A service account with the proper IAM setup to access the GCS bucket for the WAL-E logs
- The credentials file for the service account.
The configuration paramaters that we will be using are:
additional_secret_mount
additional_secret_mount_path
gcp_credentials
wal_gs_bucket
Generate the K8s secret resource that will contain your service account's credentials. It's highly recommended to use a service account and limit its scope to just the WAL-E bucket.
apiVersion: v1
kind: Secret
metadata:
name: psql-wale-creds
namespace: default
type: Opaque
stringData:
key.json: |-
<GCP .json credentials>
With the psql-wale-creds
resource applied to your cluster, ensure that
the operator's configuration is set up like the following:
...
aws_or_gcp:
additional_secret_mount: "pgsql-wale-creds"
additional_secret_mount_path: "/var/secrets/google" # or where ever you want to mount the file
# aws_region: eu-central-1
# kube_iam_role: ""
# log_s3_bucket: ""
# wal_s3_bucket: ""
wal_gs_bucket: "postgres-backups-bucket-28302F2" # name of bucket on where to save the WAL-E logs
gcp_credentials: "/var/secrets/google/key.json" # combination of the mount path & key in the K8s resource. (i.e. key.json)
...
A list of sidecars is added to each cluster created by the operator. The default is empty.
kind: OperatorConfiguration
configuration:
sidecars:
- image: image:123
name: global-sidecar
ports:
- containerPort: 80
volumeMounts:
- mountPath: /custom-pgdata-mountpoint
name: pgdata
- ...
In addition to any environment variables you specify, the following environment variables are always passed to sidecars:
POD_NAME
- field reference tometadata.name
POD_NAMESPACE
- field reference tometadata.namespace
POSTGRES_USER
- the superuser that can be used to connect to the databasePOSTGRES_PASSWORD
- the password for the superuser
Since the v1.2 release the Postgres Operator is shipped with a browser-based configuration user interface (UI) that simplifies managing Postgres clusters with the operator.
The UI runs with Node.js and comes with it's own Docker image. However, installing Node.js to build the operator UI is not required. It is handled via Docker containers when running:
make docker
The UI talks to the K8s API server as well as the Postgres Operator REST API.
K8s API server URLs are loaded from the machine's kubeconfig environment by
default. Alternatively, a list can also be passed when starting the Python
application with the --cluster
option.
The Operator API endpoint can be configured via the OPERATOR_API_URL
environment variables in the deployment manifest.
You can also expose the operator API through a service.
Some displayed options can be disabled from UI using simple flags under the
OPERATOR_UI_CONFIG
field in the deployment.
Now, apply all manifests from the ui/manifests
folder to deploy the Postgres
Operator UI on K8s. Replace the image tag in the deployment manifest if you
want to test the image you've built with make docker
. Make sure the pods for
the operator and the UI are both running.
sed -e "s/\(image\:.*\:\).*$/\1$TAG/" manifests/deployment.yaml | kubectl apply -f manifests/
kubectl get all -l application=postgres-operator-ui
For local testing you need to apply K8s proxying and operator pod port
forwarding so that the UI can talk to the K8s and Postgres Operator REST API.
The Ingress resource is not needed. You can use the provided run_local.sh
script for this. Make sure that:
- Python dependencies are installed on your machine
- the K8s API server URL is set for kubectl commands, e.g. for minikube it would usually be
https://192.168.99.100:8443
. - the pod label selectors for port forwarding are correct
When testing with minikube you have to build the image in its docker environment
(running make docker
doesn't do it for you). From the ui
directory execute:
# compile and build operator UI
make docker
# build in image in minikube docker env
eval $(minikube docker-env)
docker build -t registry.opensource.zalan.do/acid/postgres-operator-ui:v1.3.0 .
# apply UI manifests next to a running Postgres Operator
kubectl apply -f manifests/
# install python dependencies to run UI locally
pip3 install -r requirements
./run_local.sh