Envoy does not release healthcheck log fd on SIGUSR1 #8249

perlun · 2019-09-16T09:14:07Z

Title: http_health_check log file descriptor is not being released on SIGUSR1

Description:
After rotating the logs and sending a SIGUSR1 to the envoy process (technically, to the hot restarter script), we are still seeing healthcheck logs being written to the old (rotated) file. This is validated by looking at the list of open files by the envoy process:

hibox@some-server:/var/log/envoy$ ps -fe | grep envoy
root      8679     1  0 Aug28 ?        00:00:01 python /usr/local/sbin/envoy-hot-restarter.py /usr/local/sbin/envoy-hot-restarter.sh
root      8743  8679  0 Aug28 ?        03:37:30 /usr/bin/envoy -c /etc/envoy/envoy.yaml --restart-epoch 1 --service-cluster hibox --service-node some-server --service-zone unknown
hibox    25054 27885  0 11:55 pts/0    00:00:00 grep --color=auto envoy
hibox@some-server:/var/log/envoy$ sudo ls -l /proc/8743/fd | grep envoy
lrwx------ 1 root root 64 Sep 16 11:10 12 -> /var/log/envoy/admin_access.log
lrwx------ 1 root root 64 Sep 16 11:10 7 -> /dev/shm/envoy_shared_memory_0
lrwx------ 1 root root 64 Sep 16 11:10 78 -> /var/log/envoy/healthchecks.log.1
lrwx------ 1 root root 64 Sep 16 11:10 80 -> /var/log/envoy/http_access.log
lrwx------ 1 root root 64 Sep 16 11:10 82 -> /var/log/envoy/metrics_access.log

Repro steps:
envoy --version is e349fb6/1.11.1/clean-getenvoy-930d4a5/RELEASE/BoringSSL

The relevant part of the configuration looks roughly like this:

  clusters:
    - name: cluster_tomcat
      connect_timeout: 0.25s
      type: STATIC
      dns_lookup_family: V4_ONLY
      lb_policy: ROUND_ROBIN
      common_lb_config: {
        healthy_panic_threshold: {
          value: 10
        }
      }

      health_checks: [
        {
          timeout: 0.5s,
          interval: 5s,
          unhealthy_threshold: 1,
          healthy_threshold: 2,

          http_health_check: {
            path: /some-path/api
          },

          event_log_path: /var/log/envoy/healthchecks.log
        }
      ]

My logrotate configuration file looks like this:

/var/log/envoy/*.log {
	rotate 15
	weekly
	compress
        delaycompress # Workaround for files being incorrectly removed
	missingok
	notifempty
	postrotate
		pkill -USR1 -f envoy-hot-restarter.py
	endscript
        sharedscripts
}

An interesting thing to note: I do get an error from logrotate when performing the log rotation:

error: error running shared postrotate script for '/var/log/envoy/*.log '

The syslog indicates that the SIGUSR1 was correctly received by Envoy, though:

Sep 16 11:43:40 some-server envoy-hot-restarter.py[8679]: [2019-09-16 11:43:40.005][8743][warning][main] [external/envoy/source/server/server.cc:473] caught SIGUSR1

The timestamps here also correlate well with the time when the other *.log.1 files were created:

hibox@some-server:/var/log/envoy$ ls -l
total 124452
-rw-r--r-- 1 root root     38986 Sep 16 12:08 admin_access.log
-rw-r--r-- 1 root root     12352 Sep 16 11:43 admin_access.log.1
-rw-r--r-- 1 root root         0 Sep 16 11:27 healthchecks.log
-rw-r--r-- 1 root root     77279 Sep 14 21:00 healthchecks.log.1
-rw-r--r-- 1 root root    620140 Sep 16 12:08 http_access.log
-rw-r--r-- 1 root root    198509 Sep 16 11:43 http_access.log.1
-rw-r--r-- 1 root root         0 Sep 16 11:27 https_access.log
-rw-r--r-- 1 root root     60766 Mar 29 08:48 https_access.log.1
-rw-r--r-- 1 root root     38784 Sep 16 12:08 metrics_access.log
-rw-r--r-- 1 root root     12288 Sep 16 11:43 metrics_access.log.1

Any ideas? It feels superficially similar to #4060 but I don't think we ever concluded that there was a problem with Envoy itself back then.

The text was updated successfully, but these errors were encountered:

zuercher · 2019-09-16T16:38:14Z

That looks like a bug.

zuercher · 2019-09-16T21:12:43Z

So the only thing I see is that we loop on the flush timer event. We'll break out of the loop if there's data to be written or if an exit was requested. Seems like it should also check for reopen-requested.

Checks for the reopen flag when the log flush timer fires and issues the reopen even if no data is pending. This prevents Envoy from holding a file descriptor on rotated but seldom written log files until the next write. Risk Level: low Testing: add unit test Docs Changes: n/a Release Notes: n/a Fixes: envoyproxy#8249 Signed-off-by: Stephan Zuercher <[email protected]>

Checks for the reopen flag when the log flush timer fires and issues the reopen even if no data is pending. This prevents Envoy from holding a file descriptor on rotated but seldom written log files until the next write. Risk Level: low Testing: add unit test Docs Changes: n/a Release Notes: n/a Fixes: #8249 Signed-off-by: Stephan Zuercher <[email protected]>

Checks for the reopen flag when the log flush timer fires and issues the reopen even if no data is pending. This prevents Envoy from holding a file descriptor on rotated but seldom written log files until the next write. Risk Level: low Testing: add unit test Docs Changes: n/a Release Notes: n/a Fixes: envoyproxy#8249 Signed-off-by: Stephan Zuercher <[email protected]>

zuercher added bug help wanted Needs help! labels Sep 16, 2019

zuercher mentioned this issue Sep 16, 2019

access_log: check for reopen flag on flush interval #8261

Merged

zuercher self-assigned this Sep 16, 2019

zuercher removed the help wanted Needs help! label Sep 16, 2019

mattklein123 closed this as completed in #8261 Sep 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Envoy does not release healthcheck log fd on SIGUSR1 #8249

Envoy does not release healthcheck log fd on SIGUSR1 #8249

perlun commented Sep 16, 2019

zuercher commented Sep 16, 2019

zuercher commented Sep 16, 2019

Envoy does not release healthcheck log fd on SIGUSR1 #8249

Envoy does not release healthcheck log fd on SIGUSR1 #8249

Comments

perlun commented Sep 16, 2019

zuercher commented Sep 16, 2019

zuercher commented Sep 16, 2019