Deploy Production-ready Aidbox to Kubernetes

Production-ready infrastructure

Key infrastructure elements:

  • Cluster configuration — Node pool and tooling
  • Database — Cloud or self-managed database
  • Aidbox — Aidbox installation
  • Logging — Collect application and cluster logs
  • Monitoring — Collect, alert, and visualize cluster and application metrics
  • Security — Vulnerability scanning and policy management

Cluster configuration and tooling

Recommended Kubernetes cluster configuration:

  • Small and medium workloads — 3 nodes, each with 4 vCPU and 16 GB RAM
  • Large workloads — 3 nodes, each with 8 vCPU and 64 GB RAM

Toolkit required for development and deployment:

  • AWS, GCP, or Azure - cloud provider CLI and SDK, depending on your cloud provider
  • Kubectl - connection and cluster management
  • Helm - Kubernetes package manager
  • Lens - Kubernetes IDE

Optional - Development and Delivery tooling:

  • Terraform - Infrastructure automation tool
  • Grafana Tanka - configuration utility for Kubernetes
  • Argo CD - GitOps delivery and management
  • Flux - set of continuous and progressive delivery solutions for Kubernetes

Database

Managed solution

Aidbox supports all popular managed PostgreSQL databases, version 13 and higher. For details, see the article Run Aidbox on managed PostgreSQL.

Self-managed solution

For a self-managed solution, we recommend using the AidboxDB image. This image contains all required extensions, backup tools, and pre-built replication support. For more information, see the AidboxDB documentation.

{% hint style="info" %} To streamline the deployment process, our DevOps engineers have prepared Helm charts that you may find helpful. {% endhint %}

First step — create volume

{% code title="Persistent Volume" lineNumbers="true" %}

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-master-data
  namespace: prod
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 300Gi
  # Depends on your cloud provider. Use SSD volumes.
  storageClassName: managed-premium

{% endcode %}

Next, create the required configuration: postgresql.conf, container parameters, and credentials.

{% code title="postgresql.conf" lineNumbers="true" %}

apiVersion: v1
kind: ConfigMap
metadata:
  name: db-pg-config
  namespace: prod
data:
  postgres.conf: |-
    listen_addresses = '*'
    shared_buffers = '2GB'
    max_wal_size = '4GB'
    pg_stat_statements.max = 500
    pg_stat_statements.save = false
    pg_stat_statements.track = top
    pg_stat_statements.track_utility = true
    shared_preload_libraries = 'pg_stat_statements'
    track_io_timing = on
    wal_level = logical
    wal_log_hints = on
    archive_command = 'wal-g wal-push %p'
    restore_command = 'wal-g wal-fetch %f %p'

{% endcode %}

{% code title="db-config Configmap" lineNumbers="true" %}

apiVersion: v1
kind: ConfigMap
metadata:
  name: db-config
  namespace: prod
data:
  PGDATA: /data/pg
  POSTGRES_DB: postgres

{% endcode %}

{% code title="db-secret Secret" lineNumbers="true" %}

apiVersion: v1
kind: Secret
metadata:
  name: db-secret
  namespace: prod
type: Opaque
data:
  POSTGRES_PASSWORD: cG9zdGdyZXM=
  POSTGRES_USER: cG9zdGdyZXM=

{% endcode %}
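
Values under `data` in a Kubernetes Secret are base64-encoded, not encrypted. The following is a minimal sketch of how such values can be produced and checked in a shell. The `postgres` value mirrors the example Secret above and is only a placeholder; use your own credentials in a real deployment.

```shell
# Encode a value for the `data` section of a Secret.
# printf is used instead of echo so that no trailing newline gets encoded.
db_user_b64=$(printf '%s' 'postgres' | base64)
echo "POSTGRES_USER: $db_user_b64"

# Decode an existing value to check what it contains.
db_user_plain=$(printf '%s' "$db_user_b64" | base64 -d)
echo "decoded: $db_user_plain"
```

Because base64 is only an encoding, anyone who can read the Secret can recover the password, so access to Secrets in the `prod` namespace should be restricted.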

Now we can create a database StatefulSet

{% code title="Db Master StatefulSet" lineNumbers="true" %}

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prod-db-master
  namespace: prod
spec:
  replicas: 1
  serviceName: db
  selector:
    matchLabels:
      service: db
  template:
    metadata:
      labels:
        service: db
    spec:
      volumes:
        - name: db-pg-config
          configMap:
            name: db-pg-config
            defaultMode: 420
        - name: db-dshm
          emptyDir:
            medium: Memory
        - name: db-data
          persistentVolumeClaim:
            claimName: db-master-data
      containers:
        - name: main
          image: healthsamurai/aidboxdb:14.2
          ports:
            - containerPort: 5432
              protocol: TCP
          envFrom:
            - configMapRef:
                name: db-config
            - secretRef:
                name: db-secret
          volumeMounts:
            - name: db-pg-config
              mountPath: /etc/configs
            - name: db-dshm
              mountPath: /dev/shm
            - name: db-data
              mountPath: /data
              subPath: pg

{% endcode %}

Create a master database service

{% code title="Database Service" lineNumbers="true" %}

apiVersion: v1
kind: Service
metadata:
  name: db
  namespace: prod
spec:
  ports:
    - protocol: TCP
      port: 5432
      targetPort: 5432
  selector:
    service: db

{% endcode %}
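
Inside the cluster, the Service is reachable under a DNS name of the form `<service>.<namespace>.svc.<cluster-domain>`. The sketch below composes that name for the Service above. The `cluster.local` domain is the Kubernetes default and is an assumption about your cluster.

```shell
service_name="db"
namespace="prod"
cluster_domain="cluster.local"   # assumption: default cluster domain

# Fully qualified in-cluster DNS name of the database Service
pg_host="${service_name}.${namespace}.svc.${cluster_domain}"
echo "$pg_host"
```

The resulting name is the value used for `PGHOST` in the Aidbox ConfigMap later in this guide.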

A replica installation follows the same steps but requires additional configuration:

{% code title="Replica DB config" lineNumbers="true" %}

apiVersion: v1
kind: ConfigMap
metadata:
  name: db-replica
  namespace: prod
data:
  PG_ROLE: replica
  PG_MASTER_HOST: db-master   # must resolve to the master database; the Service created above is named "db"
  PG_REPLICA: streaming_replica_streaming
  PGDATA: /data/pg
  POSTGRES_DB: postgres

{% endcode %}

For backups and WAL archiving, we recommend WAL-G, a cloud-native solution. Full information about its configuration and usage is on this documentation page.

  • Configure storage access — WAL-G can store backups in S3, Google Cloud Storage, Azure, or a local file system.
  • Recommended backup policy — Full backup every week, incremental backup every day.
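
One possible way to apply the weekly full and daily incremental policy is a small wrapper that a scheduler (cron or a Kubernetes CronJob) runs once a day. The sketch below only prints the command it would run. The helper name `backup_command_for_day` and the choice of Sunday for the full backup are hypothetical, and `PGDATA` is taken from the `db-config` ConfigMap above. It assumes WAL-G's `backup-push --full` flag to force a full backup, and that `WALG_DELTA_MAX_STEPS` is set to at least 6 so that the other six days produce delta backups.

```shell
# Data directory, as set in the db-config ConfigMap
PGDATA="/data/pg"

# Hypothetical helper: print the WAL-G command for an ISO weekday
# (1 = Monday ... 7 = Sunday, as printed by `date +%u`).
backup_command_for_day() {
  day_of_week="$1"
  if [ "$day_of_week" -eq 7 ]; then
    # Weekly full backup
    echo "wal-g backup-push --full $PGDATA"
  else
    # Daily backup; incremental only if WALG_DELTA_MAX_STEPS allows it
    echo "wal-g backup-push $PGDATA"
  fi
}

# Dry run: show the command for today
backup_command_for_day "$(date +%u)"
```

To run the backup for real, execute the printed command inside the database container, where WAL-G and its storage credentials are configured.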

Alternative solutions

Tools for running highly available PostgreSQL with failover, switchover, and automated backups:

  • Patroni — A template for PostgreSQL HA with ZooKeeper, etcd, or Consul.
  • Postgres operator — The Postgres Operator delivers easy-to-run HA PostgreSQL clusters on Kubernetes.

Aidbox

First, you must get an Aidbox license on the Aidbox user portal.

{% hint style="info" %} You might want to use the Helm charts prepared by our DevOps engineers to make the deployment experience smoother. {% endhint %}

Create a ConfigMap with the required configuration and database connection settings

{% hint style="info" %} This ConfigMap example uses our default Aidbox Configuration Project Template. It's recommended to clone this template and bind your Aidbox installation with it. {% endhint %}

{% code title="Aidbox ConfigMap" lineNumbers="true" %}

apiVersion: v1
kind: ConfigMap
metadata:
  name: aidbox
  namespace: prod
data:
  AIDBOX_BASE_URL: https://my.box.url
  AIDBOX_BOX_ID: aidbox
  AIDBOX_FHIR_VERSION: 4.0.1
  AIDBOX_PORT: '8080'
  AIDBOX_STDOUT_PRETTY: all
  BOX_INSTANCE_NAME: aidbox
  BOX_METRICS_PORT: '8765'
  PGDATABASE: aidbox
  PGHOST: db.prod.svc.cluster.local   # database address
  PGPORT: '5432'                      # database port
  BOX_PROJECT_GIT_URL: "https://github.com/Aidbox/aidbox-project-template.git"
  BOX_PROJECT_GIT_PROTOCOL: "https"
  BOX_PROJECT_GIT_TARGET__PATH: "/tmp/aidbox-project"
  BOX_PROJECT_GIT_CHECKOUT: "main"
  AIDBOX_ZEN_ENTRYPOINT: main/box
  AIDBOX_DEV_MODE: "false"
  AIDBOX_ZEN_DEV_MODE: "false"

{% endcode %}

{% code title="Aidbox Secret" lineNumbers="true" %}

apiVersion: v1
kind: Secret
metadata:
  name: aidbox
  namespace: prod
stringData:   # plain-text values; values under "data" would have to be base64-encoded
  AIDBOX_ADMIN_PASSWORD: <admin_password>
  AIDBOX_CLIENT_SECRET: <root_client_password>
  AIDBOX_LICENSE: <JWT-LICENSE>    # JWT license from the Aidbox user portal
  PGUSER: <db_user>                # database username
  PGPASSWORD: <db_password>        # database password
  
  BOX_AUTH_KEYS_SECRET: <random_string_auth_secret>
  BOX_AUTH_KEYS_PRIVATE: <rsa_private_key> 
  BOX_AUTH_KEYS_PUBLIC: <rsa_public_key> 
  
  # or just use our samples for non-production installation
  # BOX_AUTH_KEYS_SECRET: "auth-key-secret"
  # BOX_AUTH_KEYS_PRIVATE: "-----BEGIN RSA PRIVATE KEY-----\nMIICXAIBAAKBgQCRLKv0n9HPsajw3wcDH1k5DUSPPdKjxqp8h4OZKiG3wGEFYXi9\nfxBbpkQXjxGEmORi8UR4aM41kX8dd4SdMRGS1VX2AMgLEAFq354MpGBPIeJyv00y\nqV6wW0HT58+Nh+xdridDFSHkkplJFjDuQbYjfQzbSNECA31ME/GI9rGomQIDAQAB\nAoGAEYGytFecCnjtC6wHiVK71JeTIZd12fJsj4MbhWpJYeJxCMAz+l0S7MxweGtU\nNFpoKz7XUBJqcJcMvlHSBA89ZDobp3HS0R8ZDcdxossNRio3Ix1bRG7Pxnhs3R/T\nsOxlrQSgnSbg1k6M5iVSZt1ptCwch+ZLG37tD3ZvdAN0LCECQQC0IFiPJJEPauUi\neKmW4oUgBvOUVA93EqnBiv9lzk7UxrPgusFqnY02qJouDNvXXso6+FM8u9DNxSvw\nHPIuqJvhAkEAzlNYaJzoInkCS5PYTGg2f1GqRih9WHj8NUukfgbO61xT9QscM6An\n+RF8dfshU2zuaQFLTBPWrS0Nk0ZOxLFjuQJAZ4gz/sqwyiDR5RdfuscmZ3s3ZClQ\n3ksO4ZzoIXcMnoY7e888PvCh6ynLvO5NKiRkrrJu/XiikrNjBtdMaH8nYQJADkCF\nl9xW0KLJPM0+oLCGKy9J8sSzO9xHl6rc9vOjcXCUQBX/YbWLbVH+5ett9uRMZ6Z2\nPBAWwSmeiXDO2hliyQJBAI/7Gtzf1Z2O5pDgNMLkKcyX4BqsHFKFSD5Btb/zReEq\nTsr6vTvzucjJcS8843vgyhIUDtW2cu7G9BGxSfsZNCw=\n-----END RSA PRIVATE KEY-----\n"
  # BOX_AUTH_KEYS_PUBLIC: "-----BEGIN PUBLIC KEY-----\nMIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCRLKv0n9HPsajw3wcDH1k5DUSP\nPdKjxqp8h4OZKiG3wGEFYXi9fxBbpkQXjxGEmORi8UR4aM41kX8dd4SdMRGS1VX2\nAMgLEAFq354MpGBPIeJyv00yqV6wW0HT58+Nh+xdridDFSHkkplJFjDuQbYjfQzb\nSNECA31ME/GI9rGomQIDAQAB\n-----END PUBLIC KEY-----\n"

{% endcode %}
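
For production, the `BOX_AUTH_KEYS_*` placeholders need values of your own. The following is a hedged sketch of generating them locally: a random string read from `/dev/urandom`, and a 2048-bit RSA key pair created with `openssl` when it is installed. The file names, the key size, and the temporary directory are illustrative assumptions. This guide does not state which PEM format Aidbox expects, so compare the generated headers with the sample keys above.

```shell
# Random 32-byte value, hex-encoded (64 characters), for BOX_AUTH_KEYS_SECRET
auth_secret=$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')
echo "BOX_AUTH_KEYS_SECRET length: ${#auth_secret}"

# RSA key pair for BOX_AUTH_KEYS_PRIVATE and BOX_AUTH_KEYS_PUBLIC
key_dir=$(mktemp -d)
if command -v openssl >/dev/null 2>&1; then
  openssl genrsa -out "$key_dir/private.pem" 2048 2>/dev/null
  openssl rsa -in "$key_dir/private.pem" -pubout -out "$key_dir/public.pem" 2>/dev/null
  echo "keys written to $key_dir"
else
  echo "openssl not found; skipping RSA key generation" >&2
fi
```

Keep the generated files out of version control and delete the temporary directory once the values are stored in the Secret.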

Aidbox Deployment

{% code title="Aidbox Deployment" lineNumbers="true" %}

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aidbox
  namespace: prod
spec:
  replicas: 2
  selector:
    matchLabels:
      service: aidbox
  template:
    metadata:
      labels:
        service: aidbox
    spec:
      containers:
        - name: main
          image: healthsamurai/aidboxone:latest
          ports:
            - containerPort: 8080
              protocol: TCP
            - containerPort: 8765
              protocol: TCP
          envFrom:
            - configMapRef:
                name: aidbox
            - secretRef:
                name: aidbox
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 20
            timeoutSeconds: 10
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 12
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 20
            timeoutSeconds: 10
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 6
          startupProbe:
            httpGet:
              path: /health
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 20
            timeoutSeconds: 5
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 4

{% endcode %}

When Aidbox starts for the first time, resolving all the dependencies takes longer. If the startupProbe fails, consider increasing initialDelaySeconds and failureThreshold under the startupProbe spec in the config above.
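
As a rough guide, a container has about `initialDelaySeconds + periodSeconds × failureThreshold` seconds to pass its startup probe before it is restarted (probe timeouts can add a little on top). The sketch below computes that budget for the values in the Deployment above and for a more relaxed set of values. The helper name `startup_budget` and the relaxed values are hypothetical.

```shell
# Approximate time, in seconds, a container has to pass its startup probe
startup_budget() {
  initial_delay="$1"
  period="$2"
  failure_threshold="$3"
  echo $((initial_delay + period * failure_threshold))
}

# Values from the Deployment above: 20 + 5 * 4
startup_budget 20 5 4     # prints 40

# Hypothetical relaxed values for a slow first start: 30 + 10 * 30
startup_budget 30 10 30   # prints 330
```

If the first start regularly takes longer than the computed budget, raise `failureThreshold` or `periodSeconds` until the budget covers it.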

For more information about HA Aidbox configuration, see the HA Aidbox article.

To verify that Aidbox started correctly you can check the logs:

kubectl logs -f <aidbox-pod-name> -n prod

Create the Aidbox k8s service

{% code title="Aidbox service" lineNumbers="true" %}

apiVersion: v1
kind: Service
metadata:
  name: aidbox
  namespace: prod
spec:
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  selector:
    service: aidbox

{% endcode %}

Ingress

The cluster must have an ingress controller installed.

We recommend the Kubernetes Ingress NGINX Controller. As an alternative, you can use Traefik.

More information about Ingress in Kubernetes can be found in the Kubernetes Service Networking documentation.

Ingress NGINX controller

Ingress-nginx is an Ingress controller for Kubernetes that uses NGINX as a reverse proxy and load balancer.

{% code title="Install Ingress NGINX" %}

helm upgrade \
  --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace

{% endcode %}

CertManager

To provide a secure HTTPS connection, you can use paid SSL certificates issued for your domain or certificates issued by Let's Encrypt. If you use Let's Encrypt, we recommend installing and configuring the cert-manager operator.

{% code title="Install Cert Manager" %}

helm repo add jetstack https://charts.jetstack.io
helm repo update
# v1.10.0 is an example; use the latest available version
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.10.0 \
  --set installCRDs=true

{% endcode %}

Configure Cluster Issuer:

{% code title="" lineNumbers="true" %}

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    email: <your-email>
    preferredChain: ''
    privateKeySecretRef:
      name: issuer-key
    server: https://acme-v02.api.letsencrypt.org/directory
    solvers:
      - http01:
          ingress:
            class: nginx  # Ingress class name

{% endcode %}

{% hint style="info" %} If you use the Multibox image and want to use cert-manager, configure the DNS-01 challenge so that wildcard certificates can be issued:

https://letsencrypt.org/docs/challenge-types/#dns-01-challenge {% endhint %}

Ingress resource

Now you can create a Kubernetes Ingress for the Aidbox deployment

{% code title="Ingress" lineNumbers="true" %}

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: aidbox
  namespace: prod
  annotations:
    acme.cert-manager.io/http01-ingress-class: nginx
    cert-manager.io/cluster-issuer: letsencrypt
    kubernetes.io/ingress.class: nginx
spec:
  tls:
    - hosts:
        - my.box.url
      secretName: aidbox-tls
  rules:
    - host: my.box.url
      http:
        paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: aidbox
                port:
                  number: 80

{% endcode %}

Now you can test ingress

curl https://my.box.url

Logging

General logging & audit information can be found in this article — Logging & Audit

Aidbox supports integration with the following systems:

ElasticSearch integration

You can install ECK using the official guide.

Configure Aidbox and ES integration

apiVersion: v1
kind: Secret
metadata:
  name: aidbox
  namespace: prod
stringData:
  ...
  AIDBOX_ES_URL: http://es-service.es-ns.svc.cluster.local
  AIDBOX_ES_AUTH: <user>:<password>
  ...

DataDog integration

apiVersion: v1
kind: Secret
metadata:
  name: aidbox
  namespace: prod
stringData:
  ...
  AIDBOX_DD_API_KEY: <Datadog API Key>
  ...

Monitoring

For monitoring, we recommend the Kube Prometheus stack.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack

Create Aidbox metrics service

{% code title="" lineNumbers="true" %}

apiVersion: v1
kind: Service
metadata:
  name: aidbox-metrics
  namespace: prod
  labels:
    operated: prometheus
spec:
  ports:
    - protocol: TCP
      port: 8765
      targetPort: 8765
  selector:
    service: aidbox

{% endcode %}

Create a ServiceMonitor config for scraping metrics data

{% code title="ServiceMonitor" lineNumbers="true" %}

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: metrics
    release: kube-prometheus   # must match the Helm release name of your Prometheus stack
    serviceMonitorSelector: aidbox
  name: aidbox
  namespace: kube-prometheus
spec:
  endpoints:
    - honorLabels: true
      interval: 10s
      path: /metrics
      targetPort: 8765
    - honorLabels: true
      interval: 60s
      path: /metrics/minutes
      targetPort: 8765
    - honorLabels: true
      interval: 10m
      path: /metrics/hours
      targetPort: 8765
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      operated: prometheus

{% endcode %}

Alternatively, you can specify the Prometheus scrape configuration directly

global:
  external_labels:
    monitor: 'aidbox'
scrape_configs:
  - job_name: aidbox
    scrape_interval: 5s
    metrics_path: /metrics
    static_configs:
      - targets: [ 'aidbox-metrics.prod.svc.cluster.local:8765' ]

  - job_name: aidbox-minutes
    scrape_interval: 30s
    metrics_path: /metrics/minutes
    static_configs:
      - targets: [ 'aidbox-metrics.prod.svc.cluster.local:8765' ]

  - job_name: aidbox-hours
    scrape_interval: 1m
    scrape_timeout: 30s
    metrics_path: /metrics/hours
    static_configs:
      - targets: [ 'aidbox-metrics.prod.svc.cluster.local:8765' ]

Alternative solutions

  • VictoriaMetrics — High-Performance Open Source Time Series Database.
  • Thanos — highly available Prometheus setup with long-term storage capabilities.
  • Grafana Mimir — highly available, multi-tenant, long-term storage for Prometheus.

Export the Aidbox Grafana dashboard

Aidbox can generate Grafana dashboards for its metrics and upload them to Grafana. For details, see Grafana Integration.

Additional monitoring

System monitoring:

  • node exporter — Prometheus exporter for hardware and OS metrics exposed by *NIX kernels
  • kube state metrics — a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects
  • cadvisor — container usage metrics

PostgreSQL monitoring:

  • pg_exporter — Prometheus exporter for PostgreSQL server metrics

Alerting

Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service.

Alert rules

Alert on slow HTTP requests: P99 latency above 5 s over a 5-minute window

{% code lineNumbers="true" %}

alert: SlowRequests
for: 5m
expr: histogram_quantile(0.99, sum (rate(aidbox_http_request_duration_seconds_bucket[5m])) by (le, route, instance)) > 5
labels: {severity: ticket}
annotations:
  title: Long HTTP query execution
  metric: '{{ $labels.route }}'
  value: '{{ $value | printf "%.2f" }}'

{% endcode %}

Alert delivery

Alert manager template for Telegram

{% code lineNumbers="true" %}

global:
  resolve_timeout: 5m
  telegram_api_url: 'https://api.telegram.org/'
route:
  group_by: [alertname, instance]
  # Default receiver
  receiver: <my-ops-chat>
  routes:
  # Mute watchdog alert
  - receiver: empty
    match: {alertname: Watchdog}
receivers:
- name: empty
- name: <my-ops-chat>
  telegram_configs:
  - chat_id: <chat-id>
    api_url: https://api.telegram.org
    parse_mode: HTML
    message: |-
      <b>[{{ .CommonLabels.instance }}] {{ .CommonLabels.alertname }}</b>
      {{ .CommonAnnotations.title }}
      {{ range .Alerts }}{{ .Annotations.metric }}: {{ .Annotations.value }}
      {{ end }}
    bot_token: <bot-token>

{% endcode %}

Other integrations are described on the Alertmanager documentation page.

Additional tools

  • Embedded Grafana alerts
  • Grafana OnCall

Security

Vulnerability and security scanners:

Kubernetes Policy Management:

Advanced:

  • Datree — k8s resources linter