Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Envoy does not release healthcheck log fd on SIGUSR1 #8249

Closed
perlun opened this issue Sep 16, 2019 · 2 comments · Fixed by #8261
Closed

Envoy does not release healthcheck log fd on SIGUSR1 #8249

perlun opened this issue Sep 16, 2019 · 2 comments · Fixed by #8261
Assignees
Labels

Comments

@perlun
Copy link
Contributor

perlun commented Sep 16, 2019

Title: http_health_check log file descriptor is not being released on SIGUSR1

Description:
After rotating the logs and sending a SIGUSR1 to the envoy process (technically, to the hot restarter script), we are still seeing healthcheck logs being written to the old (rotated) file. This is validated by looking at the list of open files by the envoy process:

hibox@some-server:/var/log/envoy$ ps -fe | grep envoy
root      8679     1  0 Aug28 ?        00:00:01 python /usr/local/sbin/envoy-hot-restarter.py /usr/local/sbin/envoy-hot-restarter.sh
root      8743  8679  0 Aug28 ?        03:37:30 /usr/bin/envoy -c /etc/envoy/envoy.yaml --restart-epoch 1 --service-cluster hibox --service-node some-server --service-zone unknown
hibox    25054 27885  0 11:55 pts/0    00:00:00 grep --color=auto envoy
hibox@some-server:/var/log/envoy$ sudo ls -l /proc/8743/fd | grep envoy
lrwx------ 1 root root 64 Sep 16 11:10 12 -> /var/log/envoy/admin_access.log
lrwx------ 1 root root 64 Sep 16 11:10 7 -> /dev/shm/envoy_shared_memory_0
lrwx------ 1 root root 64 Sep 16 11:10 78 -> /var/log/envoy/healthchecks.log.1
lrwx------ 1 root root 64 Sep 16 11:10 80 -> /var/log/envoy/http_access.log
lrwx------ 1 root root 64 Sep 16 11:10 82 -> /var/log/envoy/metrics_access.log 

Repro steps:
envoy --version is e349fb6/1.11.1/clean-getenvoy-930d4a5/RELEASE/BoringSSL

The relevant part of the configuration looks roughly like this:

  clusters:
    - name: cluster_tomcat
      connect_timeout: 0.25s
      type: STATIC
      dns_lookup_family: V4_ONLY
      lb_policy: ROUND_ROBIN
      common_lb_config: {
        healthy_panic_threshold: {
          value: 10
        }
      }

      health_checks: [
        {
          timeout: 0.5s,
          interval: 5s,
          unhealthy_threshold: 1,
          healthy_threshold: 2,

          http_health_check: {
            path: /some-path/api
          },

          event_log_path: /var/log/envoy/healthchecks.log
        }
      ]

My logrotate configuration file looks like this:

/var/log/envoy/*.log {
	rotate 15
	weekly
	compress
        delaycompress # Workaround for files being incorrectly removed
	missingok
	notifempty
	postrotate
		pkill -USR1 -f envoy-hot-restarter.py
	endscript
        sharedscripts
}

An interesting thing to note: I do get an error from logrotate when performing the log rotation:

error: error running shared postrotate script for '/var/log/envoy/*.log '

The syslog indicates that the SIGUSR1 was correctly received by Envoy, though:

Sep 16 11:43:40 some-server envoy-hot-restarter.py[8679]: [2019-09-16 11:43:40.005][8743][warning][main] [external/envoy/source/server/server.cc:473] caught SIGUSR1

The timestamps here also correlate well with the time when the other *.log.1 files were created:

hibox@some-server:/var/log/envoy$ ls -l
total 124452
-rw-r--r-- 1 root root     38986 Sep 16 12:08 admin_access.log
-rw-r--r-- 1 root root     12352 Sep 16 11:43 admin_access.log.1
-rw-r--r-- 1 root root         0 Sep 16 11:27 healthchecks.log
-rw-r--r-- 1 root root     77279 Sep 14 21:00 healthchecks.log.1
-rw-r--r-- 1 root root    620140 Sep 16 12:08 http_access.log
-rw-r--r-- 1 root root    198509 Sep 16 11:43 http_access.log.1
-rw-r--r-- 1 root root         0 Sep 16 11:27 https_access.log
-rw-r--r-- 1 root root     60766 Mar 29 08:48 https_access.log.1
-rw-r--r-- 1 root root     38784 Sep 16 12:08 metrics_access.log
-rw-r--r-- 1 root root     12288 Sep 16 11:43 metrics_access.log.1

Any ideas? It feels superficially similar to #4060 but I don't think we ever concluded that there was a problem with Envoy itself back then.

@zuercher zuercher added bug help wanted Needs help! labels Sep 16, 2019
@zuercher
Copy link
Member

That looks like a bug.

@zuercher
Copy link
Member

So the only thing I see is that we loop on the flush timer event. We'll break out of the loop if there's data to be written or if an exit was requested. Seems like it should also check for reopen-requested.

zuercher added a commit to zuercher/envoy that referenced this issue Sep 16, 2019
Checks for the reopen flag when the log flush timer fires
and issues the reopen even if no data is pending. This
prevents Envoy from holding a file descriptor on rotated
but seldom written log files until the next write.

Risk Level: low
Testing: add unit test
Docs Changes: n/a
Release Notes: n/a
Fixes: envoyproxy#8249

Signed-off-by: Stephan Zuercher <[email protected]>
@zuercher zuercher self-assigned this Sep 16, 2019
@zuercher zuercher removed the help wanted Needs help! label Sep 16, 2019
mattklein123 pushed a commit that referenced this issue Sep 19, 2019
Checks for the reopen flag when the log flush timer fires
and issues the reopen even if no data is pending. This
prevents Envoy from holding a file descriptor on rotated
but seldom written log files until the next write.

Risk Level: low
Testing: add unit test
Docs Changes: n/a
Release Notes: n/a
Fixes: #8249

Signed-off-by: Stephan Zuercher <[email protected]>
danzh2010 pushed a commit to danzh2010/envoy that referenced this issue Sep 24, 2019
Checks for the reopen flag when the log flush timer fires
and issues the reopen even if no data is pending. This
prevents Envoy from holding a file descriptor on rotated
but seldom written log files until the next write.

Risk Level: low
Testing: add unit test
Docs Changes: n/a
Release Notes: n/a
Fixes: envoyproxy#8249

Signed-off-by: Stephan Zuercher <[email protected]>
danzh2010 pushed a commit to danzh2010/envoy that referenced this issue Oct 4, 2019
Checks for the reopen flag when the log flush timer fires
and issues the reopen even if no data is pending. This
prevents Envoy from holding a file descriptor on rotated
but seldom written log files until the next write.

Risk Level: low
Testing: add unit test
Docs Changes: n/a
Release Notes: n/a
Fixes: envoyproxy#8249

Signed-off-by: Stephan Zuercher <[email protected]>
danzh2010 pushed a commit to danzh2010/envoy that referenced this issue Oct 4, 2019
Checks for the reopen flag when the log flush timer fires
and issues the reopen even if no data is pending. This
prevents Envoy from holding a file descriptor on rotated
but seldom written log files until the next write.

Risk Level: low
Testing: add unit test
Docs Changes: n/a
Release Notes: n/a
Fixes: envoyproxy#8249

Signed-off-by: Stephan Zuercher <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants