init.d can create duplicate processes if pidfile lookup fails #5730

Closed
fooshards opened this issue Apr 16, 2019 · 0 comments · Fixed by #5731

fooshards commented Apr 16, 2019

System info:

rhel6.10
init.d
telegraf-1.8.1-1.x86_64

Steps to reproduce:

Get telegraf into a working state and exercise the init.d script:

[root@testhost ~]# service telegraf restart
telegraf process was stopped [ OK ]
Starting the process telegraf [ OK ]
telegraf process was started [ OK ]
[root@testhost ~]# service telegraf start
telegraf process is running [ FAILED ]
[root@testhost ~]# ps -ef | grep teleg
telegraf  7765     1  0 10:42 ?        00:00:00 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf
root      7821 29851  0 10:42 pts/1    00:00:00 grep teleg

Alter the pidfile to put it in a bad state. (The scenario actually observed was likely caused by the /var mount running out of space, which left the pidfile unwritable; in any case, the pidfile was out of sync with the running process.)

[root@testhost ~]# vi /var/run/telegraf/telegraf.pid
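A pidfile is "out of sync" when it is missing, unreadable, empty, non-numeric, or names a PID that is no longer alive. The following is a minimal sketch of such a check, not code from telegraf's init script; the helper name `pidfile_is_live` is hypothetical. It assumes the caller is allowed to signal the process (an init script normally runs as root), and it does not protect against the PID having been reused by an unrelated process.

```shell
#!/bin/sh
# Hypothetical helper, not taken from telegraf's init script.
# Returns 0 only if the pidfile holds the numeric PID of a live process.
pidfile_is_live() {
    pidfile="$1"
    [ -r "$pidfile" ] || return 1        # missing or unreadable pidfile
    pid="$(cat "$pidfile" 2>/dev/null)"
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;         # empty or non-numeric content
    esac
    kill -0 "$pid" 2>/dev/null           # signal 0 only tests for existence
}
```

A caller could use it as `pidfile_is_live /var/run/telegraf/telegraf.pid || echo "pidfile is stale"`.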

Restart. Note that the original process is never stopped, but a new process is started:

[root@testhost ~]# service telegraf restart
Starting the process telegraf [ OK ]
telegraf process was started [ OK ]

At this point two telegraf processes are alive:

[root@testhost ~]# ps -ef | grep teleg
telegraf  7765     1  0 10:42 ?        00:00:00 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf
telegraf  7886     1  0 10:42 ?        00:00:00 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf
root      7926 29851  0 10:42 pts/1    00:00:00 grep teleg

Note that 'start' actions are properly guarded

[root@testhost ~]# service telegraf start
telegraf process is running [ FAILED ]

[root@testhost ~]# ps -ef | grep telegraf
telegraf  7765     1  0 10:42 ?        00:00:00 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf
telegraf  7886     1  0 10:42 ?        00:00:00 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf
root      8065 29851  0 10:43 pts/1    00:00:00 grep telegraf

Expected behavior:

'restart' should either refuse to create a duplicate process or kill the existing process first.
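One way to get the second behavior is to have the stop phase of 'restart' look for the daemon by name as well as by pidfile, so a stale pidfile cannot hide a running instance. This is only a sketch under stated assumptions and is not the change made in #5731: the function name `stop_daemon` is hypothetical, and it assumes `pgrep` is available and that matching the exact process name is specific enough on the host.

```shell
#!/bin/sh
# Hypothetical sketch; stop_daemon is not part of telegraf's init script.
# Stops the PID from the pidfile (if it is live) plus every process whose
# name matches exactly, so no instance survives a stale pidfile.
stop_daemon() {
    name="$1"
    pidfile="$2"
    pids=""
    if [ -r "$pidfile" ]; then
        pid="$(cat "$pidfile" 2>/dev/null)"
        case "$pid" in
            ''|*[!0-9]*) ;;                              # unusable content
            *) kill -0 "$pid" 2>/dev/null && pids="$pid" ;;
        esac
    fi
    # Also look the daemon up by exact process name (assumes pgrep exists).
    pids="$pids $(pgrep -x "$name" || true)"
    # Unquoted on purpose: split the list into separate PID arguments.
    set -- $pids
    [ "$#" -gt 0 ] || return 0                           # nothing to stop
    kill -TERM "$@" 2>/dev/null || true                  # a PID may already be gone
    rm -f "$pidfile"
}
```

For the scenario above this would be called as `stop_daemon telegraf /var/run/telegraf/telegraf.pid` before starting the new process. Matching by name is a blunt tool: it would also stop an unrelated process with the same name, which is why a real script might additionally compare the owning user or command line.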

Actual behavior:

Two identical processes end up running and metrics are duplicated. As these processes pile up, system resources start to become unavailable.

Additional info:

This surfaced mainly because another agent was restarting telegraf frequently; the problem was noticed downstream of those restarts.

@danielnelson danielnelson self-assigned this Apr 16, 2019
@danielnelson danielnelson added this to the 1.11.0 milestone Apr 16, 2019
@danielnelson danielnelson added area/packaging bug unexpected problem or unintended behavior labels Apr 16, 2019