[elasticsearch] Readiness probe is failing again with 8.0.0-SNAPSHOT and default config #1443

jmlrt · 2021-11-05T16:33:24Z

Chart version: 8.0.0-SNAPSHOT

Kubernetes version: all

Kubernetes provider: all

Helm Version: all

helm get release output:

Output of helm get release

NAME: es
LAST DEPLOYED: Fri Nov  5 17:27:14 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
USER-SUPPLIED VALUES:
null

COMPUTED VALUES:
antiAffinity: hard
antiAffinityTopologyKey: kubernetes.io/hostname
clusterHealthCheckParams: wait_for_status=green&timeout=1s
clusterName: elasticsearch
enableServiceLinks: true
envFrom: []
esConfig: {}
esJavaOpts: ""
esMajorVersion: ""
extraContainers: []
extraEnvs: []
extraInitContainers: []
extraVolumeMounts: []
extraVolumes: []
fsGroup: ""
fullnameOverride: ""
healthNameOverride: ""
hostAliases: []
httpPort: 9200
image: docker.elastic.co/elasticsearch/elasticsearch
imagePullPolicy: IfNotPresent
imagePullSecrets: []
imageTag: 8.0.0-SNAPSHOT
ingress:
  annotations: {}
  className: nginx
  enabled: false
  hosts:
  - host: chart-example.local
    paths:
    - path: /
  pathtype: ImplementationSpecific
  tls: []
initResources: {}
keystore: []
labels: {}
lifecycle: {}
masterService: ""
maxUnavailable: 1
minimumMasterNodes: 2
nameOverride: ""
networkHost: 0.0.0.0
networkPolicy:
  http:
    enabled: false
  transport:
    enabled: false
nodeAffinity: {}
nodeGroup: master
nodeSelector: {}
persistence:
  annotations: {}
  enabled: true
  labels:
    enabled: false
podAnnotations: {}
podManagementPolicy: Parallel
podSecurityContext:
  fsGroup: 1000
  runAsUser: 1000
podSecurityPolicy:
  create: false
  name: ""
  spec:
    fsGroup:
      rule: RunAsAny
    privileged: true
    runAsUser:
      rule: RunAsAny
    seLinux:
      rule: RunAsAny
    supplementalGroups:
      rule: RunAsAny
    volumes:
    - secret
    - configMap
    - persistentVolumeClaim
    - emptyDir
priorityClassName: ""
protocol: http
rbac:
  automountToken: true
  create: false
  serviceAccountAnnotations: {}
  serviceAccountName: ""
readinessProbe:
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 3
  timeoutSeconds: 5
replicas: 3
resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 1000m
    memory: 2Gi
roles:
- master
- data
- data_content
- data_hot
- data_warm
- data_cold
- ingest
- ml
- remote_cluster_client
- transform
schedulerName: ""
secret:
  enabled: true
  password: ""
secretMounts: []
securityContext:
  capabilities:
    drop:
    - ALL
  runAsNonRoot: true
  runAsUser: 1000
service:
  annotations: {}
  enabled: true
  externalTrafficPolicy: ""
  httpPortName: http
  labels: {}
  labelsHeadless: {}
  loadBalancerIP: ""
  loadBalancerSourceRanges: []
  nodePort: ""
  transportPortName: transport
  type: ClusterIP
sysctlInitContainer:
  enabled: true
sysctlVmMaxMapCount: 262144
terminationGracePeriod: 120
tests:
  enabled: true
tolerations: []
transportPort: 9300
updateStrategy: RollingUpdate
volumeClaimTemplate:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi

HOOKS:
---
# Source: elasticsearch/templates/test/test-elasticsearch-health.yaml
apiVersion: v1
kind: Pod
metadata:
  name: "es-qzanl-test"
  annotations:
    "helm.sh/hook": test
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  securityContext:
    fsGroup: 1000
    runAsUser: 1000
  containers:
  - name: "es-novrx-test"
    image: "docker.elastic.co/elasticsearch/elasticsearch:8.0.0-SNAPSHOT"
    imagePullPolicy: "IfNotPresent"
    command:
      - "sh"
      - "-c"
      - |
        #!/usr/bin/env bash -e
        curl -XGET --fail 'elasticsearch-master:9200/_cluster/health?wait_for_status=green&timeout=1s'
  restartPolicy: Never
MANIFEST:
---
# Source: elasticsearch/templates/poddisruptionbudget.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: "elasticsearch-master-pdb"
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: "elasticsearch-master"
---
# Source: elasticsearch/templates/secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-master-credentials
  labels:
    heritage: "Helm"
    release: "es"
    chart: "elasticsearch"
    app: "elasticsearch-master"
type: Opaque
data:
  username: ZWxhc3RpYw==
  password: "dDFIa1VCTkxIRTE0VkdyRQ=="
---
# Source: elasticsearch/templates/service.yaml
kind: Service
apiVersion: v1
metadata:
  name: elasticsearch-master
  labels:
    heritage: "Helm"
    release: "es"
    chart: "elasticsearch"
    app: "elasticsearch-master"
  annotations:
    {}
spec:
  type: ClusterIP
  selector:
    release: "es"
    chart: "elasticsearch"
    app: "elasticsearch-master"
  ports:
  - name: http
    protocol: TCP
    port: 9200
  - name: transport
    protocol: TCP
    port: 9300
---
# Source: elasticsearch/templates/service.yaml
kind: Service
apiVersion: v1
metadata:
  name: elasticsearch-master-headless
  labels:
    heritage: "Helm"
    release: "es"
    chart: "elasticsearch"
    app: "elasticsearch-master"
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None # This is needed for statefulset hostnames like elasticsearch-0 to resolve
  # Create endpoints also if the related pod isn't ready
  publishNotReadyAddresses: true
  selector:
    app: "elasticsearch-master"
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300
---
# Source: elasticsearch/templates/statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch-master
  labels:
    heritage: "Helm"
    release: "es"
    chart: "elasticsearch"
    app: "elasticsearch-master"
  annotations:
    esMajorVersion: "8"
spec:
  serviceName: elasticsearch-master-headless
  selector:
    matchLabels:
      app: "elasticsearch-master"
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: elasticsearch-master
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
  template:
    metadata:
      name: "elasticsearch-master"
      labels:
        release: "es"
        chart: "elasticsearch"
        app: "elasticsearch-master"
      annotations:
        
    spec:
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      automountServiceAccountToken: true
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - "elasticsearch-master"
            topologyKey: kubernetes.io/hostname
      terminationGracePeriodSeconds: 120
      volumes:
      enableServiceLinks: true
      initContainers:
      - name: configure-sysctl
        securityContext:
          runAsUser: 0
          privileged: true
        image: "docker.elastic.co/elasticsearch/elasticsearch:8.0.0-SNAPSHOT"
        imagePullPolicy: "IfNotPresent"
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        resources:
          {}

      containers:
      - name: "elasticsearch"
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        image: "docker.elastic.co/elasticsearch/elasticsearch:8.0.0-SNAPSHOT"
        imagePullPolicy: "IfNotPresent"
        readinessProbe:
          exec:
            command:
              - sh
              - -c
              - |
                #!/usr/bin/env bash -e

                # Exit if ELASTIC_PASSWORD in unset
                if [ -z "${ELASTIC_PASSWORD}" ]; then
                  echo "ELASTIC_PASSWORD variable is missing, exiting"
                  exit 1
                fi

                # If the node is starting up wait for the cluster to be ready (request params: "wait_for_status=green&timeout=1s" )
                # Once it has started only check that the node itself is responding
                START_FILE=/tmp/.es_start_file

                # Disable nss cache to avoid filling dentry cache when calling curl
                # This is required with Elasticsearch Docker using nss < 3.52
                export NSS_SDB_USE_CACHE=no

                http () {
                  local path="${1}"
                  local args="${2}"
                  set -- -XGET -s

                  if [ "$args" != "" ]; then
                    set -- "$@" $args
                  fi

                  set -- "$@" -u "elastic:${ELASTIC_PASSWORD}"

                  curl --output /dev/null -k "$@" "http://127.0.0.1:9200${path}"
                }

                if [ -f "${START_FILE}" ]; then
                  echo 'Elasticsearch is already running, lets check the node is healthy'
                  HTTP_CODE=$(http "/" "-w %{http_code}")
                  RC=$?
                  if [[ ${RC} -ne 0 ]]; then
                    echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with RC ${RC}"
                    exit ${RC}
                  fi
                  # ready if HTTP code 200, 503 is tolerable if ES version is 6.x
                  if [[ ${HTTP_CODE} == "200" ]]; then
                    exit 0
                  elif [[ ${HTTP_CODE} == "503" && "8" == "6" ]]; then
                    exit 0
                  else
                    echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} http://127.0.0.1:9200/ failed with HTTP code ${HTTP_CODE}"
                    exit 1
                  fi

                else
                  echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                  if http "/_cluster/health?wait_for_status=green&timeout=1s" "--fail" ; then
                    touch ${START_FILE}
                    exit 0
                  else
                    echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                    exit 1
                  fi
                fi
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
        ports:
        - name: http
          containerPort: 9200
        - name: transport
          containerPort: 9300
        resources:
          limits:
            cpu: 1000m
            memory: 2Gi
          requests:
            cpu: 1000m
            memory: 2Gi
        env:
          - name: node.name
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: cluster.initial_master_nodes
            value: "elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2,"
          - name: node.roles
            value: "master,data,data_content,data_hot,data_warm,data_cold,ingest,ml,remote_cluster_client,transform,"
          - name: discovery.seed_hosts
            value: "elasticsearch-master-headless"
          - name: cluster.name
            value: "elasticsearch"
          - name: network.host
            value: "0.0.0.0"
          - name: ELASTIC_PASSWORD
            valueFrom:
              secretKeyRef:
                name: elasticsearch-master-credentials
                key: password
        volumeMounts:
          - name: "elasticsearch-master"
            mountPath: /usr/share/elasticsearch/data

NOTES:
1. Watch all cluster members come up.
  $ kubectl get pods --namespace=default -l app=elasticsearch-master -w
2. Retrieve elastic user's password.
  $ kubectl get secrets --namespace=default elasticsearch-master-credentials -ojsonpath='{.data.password}' | base64 -d
3. Test cluster health using Helm test.
  $ helm --namespace=default test es

Describe the bug:

When using 8.0.0-SNAPSHOT with default values (default Elasticsearch config, security not enforced), the Elasticsearch chart fails to deploy. Pods never reach the ready state because the readiness probe fails:

$ kubectl describe pod elasticsearch-master-0
...
  Normal   Started                 116s (x3 over 3m37s)  kubelet                  Started container elasticsearch
  Warning  Unhealthy               76s (x11 over 3m26s)  kubelet                  Readiness probe failed: Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )
Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )
  Warning  BackOff  72s (x2 over 2m8s)  kubelet  Back-off restarting failed container

This seems to be related to the new behavior where Elasticsearch generates TLS certificates by default: elastic/elasticsearch#77231
cc @elastic/es-delivery @jkakavas @mgreau @nkammah

Steps to reproduce:

Deploy the Elasticsearch chart with default values from the master branch:

$ cd helm-charts
$ git checkout master
$ helm install es ./elasticsearch

That's all

Expected behavior: The readiness probe should succeed.

Provide logs and/or server output (if relevant):

Elasticsearch logs

ERROR: [1] bootstrap checks failed. You must address the points described in the following [1] lines before starting Elasticsearch.
bootstrap check failure [1] of [1]: Transport SSL must be enabled if security is enabled. Please set [xpack.security.transport.ssl.enabled] to [true] or disable security by setting [xpack.security.enabled] to [false]
ERROR: Elasticsearch did not exit normally - check the logs at /usr/share/elasticsearch/logs/elasticsearch.log
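
When the container is crash-looping as shown above, the relevant line can be buried in a long log. The following is a minimal sketch, assuming the log has been saved to a local file; `bootstrap_failures` is a hypothetical helper name, not something the chart provides, and the pattern is based only on the log lines quoted in this issue.

```shell
# Hypothetical helper (not part of the chart): print only the bootstrap-check
# failure lines from a saved Elasticsearch log file.
bootstrap_failures() {
  log_file="$1"
  # Matches lines such as: bootstrap check failure [1] of [1]: ...
  grep -E '^bootstrap check failure \[[0-9]+\] of \[[0-9]+\]:' "$log_file"
}

# Example usage (assumes kubectl access and the pod name from this issue):
#   kubectl logs elasticsearch-master-0 --previous > es.log
#   bootstrap_failures es.log
```

The `--previous` flag asks kubectl for the log of the last terminated container, which is usually the one that hit the bootstrap check.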

Any additional context:


jkakavas · 2021-11-08T15:12:55Z

elastic/elasticsearch#77231 was merged weeks ago, so it would be strange if we only just started seeing these failures.

From the logs I can see that Elasticsearch fails to start because security is enabled and TLS is not configured for the transport layer:

Transport SSL must be enabled if security is enabled. 
Please set [xpack.security.transport.ssl.enabled] to [true] or disable 
security by setting [xpack.security.enabled] to [false]

We only made this change (elastic/elasticsearch#79602) recently, but AFAIU the nodes deployed with the default example here start with a basic license, so this change shouldn't affect them. If anything, this should have been failing since we changed the default value of xpack.security.enabled to true back in August.

In short, we can't have multi-node clusters with security enabled and no TLS configuration for the transport layer when running in production mode (ES bound to 0.0.0.0).

Would making the security example the default one be a valid option?

Happy to work with you folks to determine why this started breaking just now, if that's valuable.

mark-vieira · 2021-11-08T21:50:42Z

Would making the security example the default one be a valid option?

+1 to using a secure configuration by default. If folks explicitly disable security, fine, but since the TLS example uses autogenerated certs, it's not like it's more "effort" for users. In a production scenario folks will obviously want this stuff enabled and will provide their own certs.
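
For the explicit opt-out case, a minimal sketch of a values override is shown here. It assumes the chart's `esConfig` value (listed in the computed values above) is used to inject an `elasticsearch.yml`, and that setting `xpack.security.enabled: false` there is acceptable for a throwaway test cluster. `write_insecure_values` and the file name are hypothetical, and this has not been verified against the 8.0.0-SNAPSHOT chart.

```shell
# Hypothetical helper: write a values override that disables security,
# which is one of the two options named in the bootstrap check message.
# Intended for disposable test clusters only.
write_insecure_values() {
  values_file="$1"
  cat > "$values_file" <<'EOF'
esConfig:
  elasticsearch.yml: |
    xpack.security.enabled: false
EOF
}

# Example usage (assumes a checkout of helm-charts, as in the steps to reproduce):
#   write_insecure_values insecure-values.yaml
#   helm install es ./elasticsearch -f insecure-values.yaml
```

Keeping the override in a separate file leaves the chart defaults untouched, so the secure configuration remains what users get unless they opt out.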

framsouza · 2021-11-22T20:39:32Z

👋 Do we have any update on this? @jkakavas shall we make the security example the default one?

jkakavas · 2021-11-23T22:38:05Z

We had a discussion with @jmlrt, @nkammah, and @framsouza, and with @framsouza's kind help I spent some time figuring out what happens.

In summary:

The cluster fails to start because security is enabled (by default) and transport TLS is not enabled/configured.
TLS is not auto-configured because of the environment this runs in. I didn't get to the bottom of which of our heuristics is triggered, but TLS auto-configuration and the newly introduced enrollment mode are not meant to cater to use cases where orchestration is already in place. The TLS configuration fits better within the orchestration.

Path forward:

I think we need to do something similar to what we do for Docker Compose, where we generate the TLS keys/certificates before starting the node. This would be very similar to what the security example already does, but we need to figure out a way to incorporate the make secrets part. @framsouza is looking into this.

gohmc · 2021-12-11T11:01:28Z

In my case, the readinessProbe reported the error "[[ unexpected operator". Further investigation showed that although the probe of Elasticsearch cluster health (with TLS and basic authentication enabled) returned successfully, the script still reported a failure. I generated a copy of the manifest and changed this:

...
if [[ ${RC} -ne 0 ]]; then
  echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} https://127.0.0.1:9200/ failed with RC ${RC}"
  exit ${RC}
fi
...
# ready if HTTP code 200, 503 is tolerable if ES version is 6.x
if [[ ${HTTP_CODE} == "200" ]]; then
  exit 0
elif [[ ${HTTP_CODE} == "503" && "{{ include "elasticsearch.esMajorVersion" . }}" == "6" ]]; then
  exit 0
else
  echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} {{ .Values.protocol }}://127.0.0.1:{{ .Values.httpPort }}/ failed with HTTP code ${HTTP_CODE}"
  exit 1
fi

To this (I suspect the failure was related to shell compatibility in the base image):

...
if [ ${RC} -ne 0 ]; then
  echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} https://127.0.0.1:9200/ failed with RC ${RC}"
  exit ${RC}
fi
...
if [ "${HTTP_CODE}" -eq 200 ]; then
  exit 0
elif [ "${HTTP_CODE}" -eq 503 ] && [ "7" = "6" ]; then
  exit 0
else
  echo "curl --output /dev/null -k -XGET -s -w '%{http_code}' \${BASIC_AUTH} https://127.0.0.1:9200/ failed with HTTP code ${HTTP_CODE}"
  exit 1
fi

From here on, the readinessProbe works as expected. Please note that I am using image version 7.16.0. Hopefully this helps with the issue here.

mloeschner · 2021-12-13T13:41:13Z

The container's default shell simply does not support double square brackets. Change it to bash, for example, and it will work.

readinessProbe:
  exec:
    command:
    - bash
jmlrt · 2021-12-13T15:29:51Z

Hi @gohmc @mloeschner, this is a different issue. We are working toward fixing it and should publish a new 7.16.1 release soon.

q-leobrack · 2021-12-13T17:38:56Z

Hi @gohmc @mloeschner, this is a different issue. We are working toward fixing it and should publish a new 7.16.1 release soon.

@jmlrt is there an issue link for this one? Thanks

EDIT: Couldn't spot one. Opened #1473

ngruson · 2021-12-15T06:33:21Z

I got the same error as @jmlrt when deploying to Docker Desktop on my laptop.
After I changed values.yaml as described in https://github.com/elastic/helm-charts/blob/main/elasticsearch/examples/docker-for-mac/values.yaml, the deployment got somewhat further.
The deployment is giving me another error now (java.nio.file.FileSystemException: /usr/share/elasticsearch/data/nodes/0: Not a directory) but that seems unrelated to this particular issue.

jmlrt added the bug and elasticsearch labels Nov 5, 2021

jkakavas self-assigned this Nov 8, 2021

jmlrt mentioned this issue Dec 9, 2021

[elasticsearch] use bash for readiness script #1458

Merged

jmlrt added the v8.5.1 label Dec 13, 2021

This was referenced Dec 16, 2021

[elasticsearch] #1495 Configure JVM options files #1496

Merged

6.8.22 release changelog #1512

Merged

7.16.2 release changelog #1510

Merged

framsouza mentioned this issue Dec 22, 2021

[elasticsearch] SSL by default #1519

Merged

framsouza self-assigned this Dec 28, 2021

jmlrt closed this as completed in #1519 Feb 8, 2022
