[solo] Use unbound to proxy to skydns; fixes shit #900

davidxia · 2016-04-06T23:33:59Z

[solo] Use unbound to proxy to skydns; fixes shit

TL;DR When two DNS servers don't work, add one more!

When running some integration tests with HeliosSoloDeployment on Docker
hosts that use a local unbound instance as their DNS resolver (i.e.
specified in /etc/resolv.conf on the Docker host),
we saw test failures due to failed SRV queries to skydns. Skydns
runs in the solo container and forwards DNS queries it doesn't know
about to the nameservers specified in /etc/resolv.conf via logic in start.sh.

The skydns error output from the helios solo container spawned by
HeliosSoloDeployment looked like:

skydns: failure to forward request "dns: failed to unpack truncated
message"

Our guess is that large UDP responses from the upstream unbound
have the "Message Truncated" DNS flag set. When this type of response
reaches skydns, skydns blows up and doesn't tell the client about the
error. The client times out without retrying in TCP mode. The client
would've retried if it had received an error message from skydns.

Running dig against skydns works. We think this is because dig adds
an OPT record to its query that sets "udp payload size: 4096".
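
To make the guess above concrete, here is a minimal Python sketch of the two wire-format details involved: the TC ("Message Truncated") bit in the DNS header, and the EDNS0 OPT pseudo-record that advertises a larger UDP payload size. It is an illustration only, not code from skydns, unbound, or dig; the function names and the query name `_example._tcp.example.com` are made up for the example.

```python
import struct

SRV, OPT, IN = 33, 41, 1   # DNS record type / class codes
TC_BIT = 0x0200            # "Message Truncated" bit in the header flags word
RD_BIT = 0x0100            # "Recursion Desired" bit


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name, qtype=SRV, udp_payload_size=None, query_id=0x1234):
    """Build a DNS query; append an EDNS0 OPT record if a payload size is given."""
    arcount = 1 if udp_payload_size else 0
    # Header: id, flags, qdcount, ancount, nscount, arcount
    header = struct.pack("!HHHHHH", query_id, RD_BIT, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, IN)
    opt = b""
    if udp_payload_size:
        # OPT pseudo-record: root name, TYPE=41, CLASS carries the advertised
        # UDP payload size, TTL carries extended flags (all zero), RDLENGTH=0.
        opt = b"\x00" + struct.pack("!HHIH", OPT, udp_payload_size, 0, 0)
    return header + question + opt


def is_truncated(message: bytes) -> bool:
    """Return True if the TC bit is set in a DNS message header."""
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & TC_BIT)


plain = build_query("_example._tcp.example.com")
edns = build_query("_example._tcp.example.com", udp_payload_size=4096)
print(len(edns) - len(plain))  # size of the OPT record in bytes
```

A query without the OPT record implicitly limits the UDP response to 512 bytes, so a larger answer comes back with the TC bit set. If the guess above is right, that is the response skydns fails to unpack. With dig, the two cases should correspond roughly to `+noedns` versus `+bufsize=4096`.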

Here are outstanding issues in skydns that seem related:

* skynetservices/skydns#242
* skynetservices/skydns#45

Solution:

We start an unbound instance in the solo container and have it forward
DNS queries via UDP to the upstream skydns in the same container.
Unbound will add the OPT section that makes everything work.
Things are fixed. :)
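
For readers unfamiliar with unbound, a forwarding setup of this kind might look roughly like the fragment below. This is a sketch under assumptions, not the configuration added in this PR: the listen address and the port skydns is bound to (5353 here) are placeholders.

```
# Hypothetical unbound.conf sketch; addresses and ports are assumptions.
server:
    # Address unbound listens on for clients inside the container.
    interface: 127.0.0.1
    # Allow forwarding to a server on localhost (refused by default).
    do-not-query-localhost: no
    # EDNS buffer size advertised in the OPT record of upstream queries.
    edns-buffer-size: 4096

forward-zone:
    # Forward every query to skydns in the same container.
    name: "."
    forward-addr: 127.0.0.1@5353
```

Clients then query unbound instead of skydns directly, so every query skydns sees carries the OPT record.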

We admit this is super funky... And this might only work for UDP packets
up to 4096 bytes, the default set by unbound in OPT.

Much thanks to @gimaker for helping and suggesting unbound inside the
container.

Skydns now needs to be compiled with go 1.5+.

codecov-io · 2016-04-06T23:34:36Z

Current coverage is `48.92%`

Merging #900 into master will increase coverage by +0.01% as of 886f1dc

@@            master    #900   diff @@
======================================
  Files          271     271       
  Stmts        12706   12706       
  Branches      1634    1634       
  Methods          0       0       
======================================
+ Hit           6215    6216     +1
+ Partial        470     468     -2
- Missed        6021    6022     +1


davidxia · 2016-04-07T00:37:22Z

@negz @mattnworb @rohansingh @gimaker

mattnworb · 2016-04-07T11:47:54Z

solo/docker/start.sh

@@ -23,7 +23,8 @@ done
 curl -XPUT http://127.0.0.1:4001/v2/keys/skydns/${SKYDNS_PATH} \
    -d value="{\"host\":\"$HOST_ADDRESS\"}"

-skydns $SKYDNS_OPTS &
+skydns $SKYDNS_OPTS -verbose &


I'm not sure if -verbose should be put in the image. Seems like you could run the docker image with -e SKYDNS_OPTS=-verbose at runtime if you really wanted verbose logging.

Yeah, I'll remove it.

mattnworb · 2016-04-07T13:24:33Z

👍 overall

@gimaker


davidxia · 2016-04-07T13:45:18Z

@gimaker If it looks good to you, I'll merge and try to get a release out.

gimaker · 2016-04-07T14:20:57Z

@davidxia what happened to the nice commit message? :(

gimaker · 2016-04-07T14:21:20Z

Oh, never mind. I see that it's still there.

gimaker · 2016-04-07T14:24:13Z

👍

TL;DR: skydns does not handle TCP well. We already worked around this in #900. See that PR for more context. However, that change only fixed the problem to an extent, as we still have the same issue once responses are >4096 bytes. This change extends that workaround to allow us to survive responses up to 32768 bytes in size. This change does not fix the issue, but should make it rarer.


[solo] Fix solo base image build; bump its version

1c8b124

Skydns now needs to be compiled with go 1.5+.

davidxia mentioned this pull request Apr 6, 2016

[solo] Fix skydns - two wrongs make a right #899

Closed

davidxia force-pushed the dxia/fix-solo-skydns2 branch from 88248a8 to c3b4743 Compare April 6, 2016 23:40

davidxia changed the title ~~[solo] Fix skydns - two wrongs make a right~~ [solo] Fix skydns - unbound inside solo Apr 7, 2016

davidxia force-pushed the dxia/fix-solo-skydns2 branch from c3b4743 to 0d1483b Compare April 7, 2016 00:42

davidxia changed the title ~~[solo] Fix skydns - unbound inside solo~~ [solo] Use unbound to proxy to skydns; fixes shit Apr 7, 2016

davidxia added the bug label Apr 7, 2016

mattnworb reviewed Apr 7, 2016

davidxia force-pushed the dxia/fix-solo-skydns2 branch from 0d1483b to 8989006 Compare April 7, 2016 13:25

davidxia force-pushed the dxia/fix-solo-skydns2 branch from 8989006 to 886f1dc Compare April 7, 2016 13:31

davidxia merged commit 5440a84 into master Apr 7, 2016

davidxia deleted the dxia/fix-solo-skydns2 branch April 7, 2016 14:24

gimaker mentioned this pull request Jan 25, 2017

helios-solo: skydns workaround to allow DNS responses up to 32768 bytes #1081

Merged

vbhavsar pushed a commit that referenced this pull request Aug 28, 2018

Upgrade maven-shade-plugin from 2.4.1 to 3.1.0

bfc400c

and use ServicesResourceTransformer to relocate class names in META-INF/services/. fixes #900
