
7.5.0 release #391

Merged
merged 9 commits into from
Dec 6, 2019

Conversation

jmlrt
Member

@jmlrt jmlrt commented Dec 2, 2019

  • Chart version not bumped (the versions are all bumped and released at the same time)
  • README.md updated with any new values or changes
  • Updated template tests in ${CHART}/tests/*.py
  • Updated integration tests in ${CHART}/examples/*/test/goss.yaml

@jmlrt jmlrt added the enhancement (New feature or request) label Dec 2, 2019
@jmlrt jmlrt requested a review from a team December 2, 2019 16:07
@jmlrt jmlrt self-assigned this Dec 2, 2019
@jmlrt jmlrt marked this pull request as ready for review December 2, 2019 16:38
Conky5
Conky5 previously approved these changes Dec 2, 2019
Contributor

@Conky5 Conky5 left a comment


LGTM pending CI, noticed one super tiny formatting tweak

CHANGELOG.md (outdated suggestion, resolved)
Co-Authored-By: Chris Koehnke <[email protected]>
@jmlrt
Member Author

jmlrt commented Dec 2, 2019

jenkins test this please

@jmlrt
Member Author

jmlrt commented Dec 3, 2019

jenkins test this please

@jmlrt
Member Author

jmlrt commented Dec 4, 2019

jenkins test this please

@jmlrt
Member Author

jmlrt commented Dec 4, 2019

Filebeat tests are failing with the following error:

2019-12-04T06:36:06.571Z  INFO  instance/beat.go:610  Home path: [/usr/share/filebeat] Config path: [/usr/share/filebeat] Data path: [/usr/share/filebeat/data] Logs path: [/usr/share/filebeat/logs]
2019-12-04T06:36:06.571Z  INFO  instance/beat.go:618  Beat ID: cc882fa1-fa20-471d-8d09-b2bf87dc9622
2019-12-04T06:36:06.572Z  INFO  instance/beat.go:390  filebeat stopped.
2019-12-04T06:36:06.572Z  ERROR instance/beat.go:916  Exiting: could not start the HTTP server for the API: listen tcp 127.0.0.1:5066: bind: address already in use
Exiting: could not start the HTTP server for the API: listen tcp 127.0.0.1:5066: bind: address already in use

@jmlrt
Member Author

jmlrt commented Dec 4, 2019

OK, the issue is related to the Filebeat HTTP endpoint, which is enabled by default in our Helm templates.

Strangely, when running the default and oss tests on CI, the Filebeat pods fail because the HTTP endpoint port (5066) is already in use inside the pod.

I confirmed that this port is already in use even after removing -E http.enabled=true from the Filebeat container args in the daemonset template and deploying the default example:

  • Filebeat pods start and, although the HTTP endpoint is disabled, connecting to a pod shows that port 5066 is still in use:
$ sed -i '91,101d' templates/daemonset.yaml

$ kubectl config current-context
gke_elastic-ci-prod_us-central1-a_helm-112-cbd672a435

$ cd examples/default && make install
...

$ kubectl get daemonset helm-filebeat-default-filebeat
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
helm-filebeat-default-filebeat   4         4         4       4            4           <none>          25m

$ kubectl get pod | grep helm-filebeat-default-filebeat
helm-filebeat-default-filebeat-7mjph                           1/1     Running            0          26m
helm-filebeat-default-filebeat-cgslm                           1/1     Running            0          26m
helm-filebeat-default-filebeat-w9dcv                           1/1     Running            0          26m
helm-filebeat-default-filebeat-z587h                           1/1     Running            0          26m

$ kubectl exec -it helm-filebeat-default-filebeat-7mjph bash

[root@gke-helm-112-cbd672a435-default-pool-fcb20f71-x04c filebeat]# ps -ef|grep filebeat
root           1       0  0 12:27 ?        00:00:06 filebeat -e
root        3427    3370  0 12:54 ?        00:00:00 grep --color=auto filebeat

# HTTP Endpoint isn't enabled by default
[root@gke-helm-112-cbd672a435-default-pool-fcb20f71-x04c filebeat]# filebeat export config
filebeat:
  inputs:
  - containers:
      ids:
      - '*'
    processors:
    - add_kubernetes_metadata:
        in_cluster: true
    type: docker
output:
  elasticsearch:
    host: ${NODE_NAME}
    hosts: ${ELASTICSEARCH_HOSTS}
path:
  config: /usr/share/filebeat
  data: /usr/share/filebeat/data
  home: /usr/share/filebeat
  logs: /usr/share/filebeat/logs

[root@gke-helm-112-cbd672a435-default-pool-fcb20f71-x04c filebeat]# curl localhost:5066
{"beat":"filebeat","hostname":"gke-helm-112-cbd672a435-default-pool-fcb20f71-x04c","name":"gke-helm-112-cbd672a435-default-pool-fcb20f71-x04c","uuid":"bdb133a2-1453-4ff3-827a-bba73ff5848f","version":"7.5.0"}

[root@gke-helm-112-cbd672a435-default-pool-fcb20f71-x04c filebeat]# yum -yq install iproute
...

[root@gke-helm-112-cbd672a435-default-pool-fcb20f71-x04c filebeat]# ss -plant | grep 5066
LISTEN     0      128    127.0.0.1:5066                     *:*
TIME-WAIT  0      0      127.0.0.1:60908              127.0.0.1:5066
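
For reference, the HTTP endpoint is switched on through the Filebeat container args in templates/daemonset.yaml. A hypothetical excerpt (illustrative only, not the chart's verbatim template; image tag and surrounding fields are assumptions) looks roughly like this:

# Illustrative sketch of how the chart enables the HTTP endpoint via container args
containers:
- name: filebeat
  image: "docker.elastic.co/beats/filebeat:7.5.0"
  args:
  - "-e"
  - "-E"
  - "http.enabled=true"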

@jmlrt
Member Author

jmlrt commented Dec 4, 2019

A quick overview of the current state of our Helm CI tests:

  • When we run CI tests, we provision 3 different GKE clusters (1.12, 1.13 and 1.14)
  • On each GKE cluster we create a helm-charts-testing namespace
  • Inside this namespace we deploy all chart scenarios (see matrix.yml), which means we are deploying 7 Elasticsearch statefulsets (one for each scenario), 4 Filebeat daemonsets, ... in the same namespace.

We had no issues with Filebeat < 7.5.0: every Filebeat pod could coexist on the same node and in the same namespace with the HTTP endpoint enabled and no port conflict, thanks to pod port isolation (the HTTP port is not exposed outside of the pod in our configuration). This no longer works with Filebeat 7.5.0: one Filebeat pod can see port 5066 opened by another Filebeat pod on the same node and in the same namespace, so it fails with a "port already used" error.

I confirmed this by removing the helm-filebeat-security chart release, which was running fine on the GKE cluster. As soon as the pods from this chart were destroyed, the pods from the helm-filebeat-oss chart release, which had been failing with the "port already used" error, were able to start.

@jmlrt
Member Author

jmlrt commented Dec 4, 2019

The same issue can be reproduced with different namespaces:

  • Install the Filebeat chart with the 7.5.0 image and the HTTP endpoint enabled in a first namespace => OK
  • Install the Filebeat chart with the 7.5.0 image and the HTTP endpoint enabled in a second namespace => pods fail with the listen tcp 127.0.0.1:5066: bind: address already in use error
  • Remove the first chart from the first namespace => pods in the second namespace now start

@jmlrt
Member Author

jmlrt commented Dec 5, 2019

After digging with @odacremolbap's help, we found that the issue is related to #321, which set hostNetwork=true for the Filebeat chart.

With host networking enabled, a Filebeat pod directly sees the network interfaces of the host machine where it was started, and the pod is also reachable on all network interfaces of that host. The side effect is that two Filebeat pods requiring the same port cannot run on the same node, which leads to port conflicts. On top of that, creating a pod with hostNetwork: true on OpenShift is a privileged operation, which can cause issues for OpenShift users.
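
To illustrate, here is a minimal hand-written daemonset sketch using host networking (not the chart's actual template; names and image tag are assumptions). With hostNetwork: true every such pod on a node shares the node's network namespace, so two Filebeat pods binding 127.0.0.1:5066 on the same node will conflict:

# Minimal illustration only, not the chart's template
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat-hostnetwork-example
spec:
  selector:
    matchLabels:
      app: filebeat-hostnetwork-example
  template:
    metadata:
      labels:
        app: filebeat-hostnetwork-example
    spec:
      hostNetwork: true                    # pod shares the node's network namespace
      dnsPolicy: ClusterFirstWithHostNet   # usual companion setting for hostNetwork pods
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:7.5.0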

Why did our Filebeat tests work with hostNetwork=true for Filebeat 7.4.2 but not with Filebeat 7.5.0?

Quoting @odacremolbap's message on Slack:

https://www.elastic.co/guide/en/beats/libbeat/7.5/release-notes-7.5.0.html#_added
The last 2 points there are HTTP related. It looks like they refactored the server to add Unix sockets/Windows pipes, and in doing so they also changed the func signature and behaviour.

7.4.x does not return an error, it only logs it: https://github.com/elastic/beats/blob/7.4/libbeat/api/server.go#L33-L53

7.5.x returns the error upstream, which is probably used by the caller to fail instead of just logging: https://github.com/elastic/beats/blob/master/libbeat/api/server.go#L41-L58

Conclusion

Given this, I think we should disable host networking by default and only allow enabling it by overriding the Helm values.

In addition, the point of enabling hostNetwork for Filebeat (and Metricbeat) was to capture the proper hostname for the files/metrics being collected (#355). I'll ping the Beats team to find out their recommendation on that point.
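
A sketch of how the chart could expose this as an opt-in value (the value name hostNetworking and the template snippet are assumptions, not the final implementation):

# values.yaml (sketch): host networking disabled by default
hostNetworking: false

# templates/daemonset.yaml (sketch): only set hostNetwork when explicitly enabled
{{- if .Values.hostNetworking }}
hostNetwork: true
{{- end }}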

@jmlrt jmlrt requested review from Conky5 and a team December 5, 2019 21:28
filebeat/README.md (outdated suggestion, resolved)
Conky5
Conky5 previously approved these changes Dec 5, 2019
Contributor

@Conky5 Conky5 left a comment


LGTM

Just one comment on the HostNetwork section phrasing.

Conky5
Conky5 previously approved these changes Dec 5, 2019
Contributor

@Conky5 Conky5 left a comment


LGTM

@jmlrt
Member Author

jmlrt commented Dec 5, 2019

Need re-approval because I committed your suggestion 😃 (already done 🚀).
Thank you @Conky5 for always being meticulous about comments and docs.
In addition to your great technical reviews and your always good suggestions on shell script best practices, it's really nice to be able to have good docs with your help ❤️

@jmlrt
Member Author

jmlrt commented Dec 6, 2019

jenkins test this please

Logstash can take longer than 60s to fully start, and we can sometimes reach the point where the full Logstash start happens a few seconds after the liveness probe sends the kill signal.
Crazybus
Crazybus previously approved these changes Dec 6, 2019
Contributor

@Crazybus Crazybus left a comment


LGTM!

@@ -12,6 +12,7 @@ This helm chart is a lightweight way to configure and run our official [Filebeat
 ## Usage notes and getting started
 * The default Filebeat configuration file for this chart is configured to use an Elasticsearch endpoint. Without any additional changes, Filebeat will send documents to the service URL that the Elasticsearch helm chart sets up by default. You may either set the `ELASTICSEARCH_HOSTS` environment variable in `extraEnvs` to override this endpoint or modify the default `filebeatConfig` to change this behavior.
 * The default Filebeat configuration file is also configured to capture container logs and enrich them with Kubernetes metadata by default. This will capture all container logs in the cluster.
+* This chart disables the [HostNetwork](https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces) setting by default for compatibility reasons with the majority of kubernetes providers and scenarios. Some kubernetes providers may not allow enabling `hostNetwork` and deploying multiple Filebeat pods on the same node isn't possible with `hostNetwork`. However Filebeat does recommend activating it. If your kubernetes provider is compatible with `hostNetwork` and you don't need to run multiple Filebeat daemonsets, you can activate it [here](./values.yaml#L36).
Contributor


/values.yaml#L36

This line number won't be correct anymore if any values are added above it. I think it would be better to just leave the link out to avoid confusion.

@@ -150,7 +150,7 @@ livenessProbe:
   httpGet:
     path: /
     port: http
-  initialDelaySeconds: 60
+  initialDelaySeconds: 90
Contributor


I think this is still cutting it pretty close and could make it hard for Logstash processes that are a bit slower to start up. This should be high enough that we can be very sure it isn't going to prevent a container from starting up. Maybe around 5 minutes would be a nicer value.
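
For illustration, that suggestion would look roughly like this in the Logstash values (a sketch keeping only the probe fields shown in the diff above; any other probe settings are left untouched):

livenessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 300   # ~5 minutes instead of 90s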

Contributor

@Crazybus Crazybus left a comment


LGTM!
