DNS lookup occasionally fails immediately after container create #150

Open
sesmith177 opened this issue Dec 4, 2017 · 13 comments

@sesmith177
Contributor

We have found that DNS lookups occasionally fail when a process is executed in a newly created container. Here is a short Go program that can be used to reproduce the issue:

$env:ROOTFS_PATH=(docker inspect microsoft/windowsservercore:1709 | ConvertFrom-Json).GraphDriver.Data.Dir
$env:NETWORK_NAME="nat"

for ($i=0; $i -lt 20; $i++) {./main.exe }

The DNS lookup from inside the container will fail 0-6 times out of 20.

If we use a container image that has the DNSCache service turned off, the DNS lookup always succeeds. The Dockerfile we use for this is:

FROM microsoft/windowsservercore:1709

RUN powershell.exe -command "Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\dnscache' -Name Start -Value 4"

@darstahl
Contributor

darstahl commented Dec 4, 2017

@msabansal @JMesser81

@darstahl
Contributor

darstahl commented Dec 4, 2017

@sesmith177 can you post the output from docker info? That will tell us which OS build this is.

@sesmith177
Contributor Author

PS C:\users\admin\winc-release\src\code.cloudfoundry.org\winc> docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 2
Server Version: master-dockerproject-2017-12-03
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: ics l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd gelf json-file logentries splunk syslog
Swarm: inactive
Default Isolation: process
Kernel Version: 10.0 16299 (16299.15.amd64fre.rs3_release.170928-1534)
Operating System: Windows Server Datacenter
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 26GiB
Name: vm-065f0650-70a1-4283-4764-315dea9cd366
ID: 4UFH:MVXK:BXPD:V43K:I7B7:MG43:AMMZ:MLEJ:L2F7:RBFP:JG5W:KMHA
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

@sunjayBhatia
Contributor

@darrenstahlmsft @msabansal @JMesser81 any updates on this?

@msabansal
Contributor

@sunjayBhatia I can repro this as well. This feels like an issue in the DNS cache, which doesn't seem to work immediately after starting up. Investigating this.

@sunjayBhatia
Contributor

@msabansal should we expect the recent 1803 release to have a fix for this?

cc @natalieparellano @mhoran

@msabansal
Contributor

@JMesser81 PTAL.

@daschott

@sunjayBhatia thank you for bringing this to our attention. If these don't work, we do have a possible platform bug we're investigating.

@daschott

@sesmith177 How long does the issue persist for you? It looks like it goes away almost immediately when I use the provided Go program.

I can also confirm that this issue does not repro in the recent Windows Insider build 17692.

@sesmith177
Contributor Author

@daschott we are not sure how long the issue lasts -- it causes problems for us because our containers often perform a DNS lookup immediately on startup.

@daschott

@sesmith177 can you add a delay of a few seconds? The issue seems to go away immediately, which sadly makes it difficult to collect traces. The internal bug has been closed as it does not repro on recent builds. On older builds that are affected, it seems that containers created using "docker run" are not affected, even when using scripts to diagnose.

@sesmith177
Contributor Author

@daschott we have a workaround for now. Do the "recent builds" where it doesn't reproduce include the released 1803? Or do they include the most recent 2019 insider release (17709)?

@daschott

Sorry for the ambiguity; any 2019 insider release (onwards from build 17692) should do.

ajgokhale added a commit to cloudfoundry/garden-windows-ci that referenced this issue Aug 3, 2018
1803 still has flaky outbound TCP when the container starts. This should
be fixed for 2019.

microsoft/hcsshim#150

Signed-off-by: Amin Jamali <[email protected]>
ankeesler pushed a commit to cloudfoundry/windows2016fs that referenced this issue Aug 24, 2018
We removed the dnscache service because apparently this is fixed in
2019. See issue: microsoft/hcsshim#150.

Signed-off-by: Andrew Keesler <[email protected]>
dcantah pushed a commit to dcantah/hcsshim that referenced this issue Mar 17, 2021