Zombie process runc when use exec driver #5836

Closed
tantra35 opened this issue Jun 14, 2019 · 3 comments · Fixed by #5851

@tantra35
Contributor

Nomad version

Nomad v0.9.3 (c5e8b66)

Issue

On a test stand we are investigating the upgrade from 0.8.6 to 0.9.3 and found that when the exec driver is used, a zombie process appears under each executor:

root      3640  0.9  5.3 739948 54356 ?        Ssl  18:55   0:06 /opt/nomad/nomad agent -config=/etc/nomad
root      3826  0.0  3.4 296408 34892 ?        Sl   18:59   0:00  \_ /opt/nomad/nomad_0.9.3/nomad logmon
root      3827  0.0  3.2 378336 33480 ?        Sl   18:59   0:00  \_ /opt/nomad/nomad_0.9.3/nomad logmon
root      3847  0.1  3.7 452068 37604 ?        Ssl  18:59   0:00  \_ /opt/nomad/nomad_0.9.3/nomad executor {"LogFile":"/var/lib/nomad/alloc/5ddd8286-99b0-f16e-fd3e-cb3f62481dc6/diamondbcapacitycollector/executor.out"
root      3872  0.0  0.0      0     0 ?        Z    18:59   0:00  |   \_ [runc:[1:CHILD]] <defunct>
nobody    3873  0.0  0.0   6008   808 ?        Ss   18:59   0:00  |   \_ /bin/sleep 600
root      3901  0.1  3.7 525800 38044 ?        Ssl  18:59   0:00  \_ /opt/nomad/nomad_0.9.3/nomad executor {"LogFile":"/var/lib/nomad/alloc/5ddd8286-99b0-f16e-fd3e-cb3f62481dc6/fluend/executor.out","LogLevel":"debug"
root      3913  0.0  0.0      0     0 ?        Z    18:59   0:00      \_ [runc:[1:CHILD]] <defunct>
nobody    3914  0.0  0.6  29308  6104 ?        Ssl  18:59   0:00      \_ /bin/fluent-bit -c /local/td-agent-bit.conf

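In the output above, [runc:[1:CHILD]] is a zombie process (state Z, marked <defunct>). One appears under each nomad executor, next to the actual task process.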
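To check other hosts for the same symptom, the listing can be filtered on the STAT column. The following is a minimal sketch, not part of Nomad: the helper name `find_zombies` is hypothetical, it assumes the `ps aux` column layout shown above (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND), and the sample is an abridged copy of the listing from this report.

```python
def find_zombies(ps_text):
    """Return (pid, command) pairs for rows whose STAT column starts with 'Z'.

    Assumes `ps aux`-style rows:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    zombies = []
    for line in ps_text.splitlines():
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skip headers, blank lines, and truncated rows
        stat, command = fields[7], fields[10]
        if stat.startswith("Z"):
            # Drop the tree-drawing prefix that `ps f` adds ("|   \_ ").
            zombies.append((int(fields[1]), command.lstrip("|\\_ ")))
    return zombies


# Abridged sample taken from the listing in this report.
sample = r"""
root      3847  0.1  3.7 452068 37604 ?        Ssl  18:59   0:00  \_ /opt/nomad/nomad_0.9.3/nomad executor
root      3872  0.0  0.0      0     0 ?        Z    18:59   0:00  |   \_ [runc:[1:CHILD]] <defunct>
nobody    3873  0.0  0.0   6008   808 ?        Ss   18:59   0:00  |   \_ /bin/sleep 600
root      3913  0.0  0.0      0     0 ?        Z    18:59   0:00      \_ [runc:[1:CHILD]] <defunct>
"""

for pid, command in find_zombies(sample):
    print(pid, command)
```

On the sample this reports PIDs 3872 and 3913, the two runc children. On a live host the same function could be fed the text of `ps auxf`, assuming the column layout matches.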

Job file (if appropriate)

job test_fleunt
{
	datacenters = ["test"]
	type = "batch"

	constraint
	{
		attribute = "${attr.kernel.name}"
		value = "linux"
	}

	group test_fleunt
	{
		task diamondbcapacitycollector
		{
			leader = true
			driver = "exec"

			config
			{
				command = "sleep"
				args = ["600"]
			}

			logs
			{
				max_files = 3
				max_file_size = 10
			}

			resources
			{
				cpu = 100
				memory = 300
			}
		}

		task fluend
		{
			driver = "exec"

			artifact
			{
				source = "http://docker.service.consul/fluent-bit.0.14.9.tar.gz"
				destination = "/bin"
			}

			config
			{
				command = "/bin/fluent-bit"
				args = ["-c", "${NOMAD_TASK_DIR}/td-agent-bit.conf"]
			}

			template
			{
				data = <<EOH
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    info
    Parsers_File parsers.conf

[INPUT]
    Name tail
    Tag  app.smtprelay

    Path ${NOMAD_ALLOC_DIR}/logs/mail.log
    DB ${NOMAD_ALLOC_DIR}/logs/mail-log.pos

    Parser postfix

[FILTER]
    Name   grep
    Match  *
    Exclude message \blost\b\s\bconnection\b\s\bafter\b\s\bCONNECT\b\s\bfrom\b\s[^<]+?\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\W
    Exclude message \bconnect\b\s\bfrom\b\s[^<]+?\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\W
    Exclude message \bdisconnect\b\s\bfrom\b\s[^<]+?\[\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\W

[OUTPUT]
    Name forward
    Match *
    Time_as_Integer True
    Host td-agent-local.service.consul
EOH

				destination = "local/td-agent-bit.conf"
			}

			template
			{
				data = <<EOH
[PARSER]
    Name postfix
    Format regex
    Regex (?<time>[\w]+\s+[\d]+\s[\d:]+) (?<host>[^ ]+) (?<process>[^:]+): (?<message>((?<key>[^ :]+)[ :])? ?((to|from)=<(?<address>[^>]+)>)?.*)
    Time_Key time
    Time_Format %b %-d %H:%M:%S

EOH

				destination = "local/parsers.conf"
			}

			logs
			{
				max_files = 3
				max_file_size = 10
			}

			resources
			{
				memory = 50
				cpu = 150
			}
		}
	}
}
@tantra35
Contributor Author

We found that this is a regression in libcontainer (part of runc):

opencontainers/runc#2022

so the vendored copy in Nomad needs to be updated.

@preetapan
Contributor

@tantra35 Thanks for digging into this and finding the libcontainer ticket. We'll fix this in the next planned point release, Nomad 0.9.4.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022