Skip to content
This repository has been archived by the owner on Sep 12, 2024. It is now read-only.

[solo] Use unbound to proxy to skydns; fixes shit #900

Merged
merged 2 commits into from
Apr 7, 2016

Conversation

davidxia
Copy link
Contributor

@davidxia davidxia commented Apr 6, 2016

[solo] Use unbound to proxy to skydns; fixes shit

TL;DR When two DNS servers don't work, add one more!

When running some integration tests with HeliosSoloDeployment on Docker
hosts that use a local unbound instance as its DNS resolver (i.e.
specified in /etc/resolv.conf on the Docker host),
we saw tests failures due to failed SRV queries to skydns. Skydns is
running in the solo container and forwards DNS queries it doesn't know
about to nameservers specified in /etc/resolv.conf via logic in start.sh.

The skydns error output from the helios solo container spawned by
HeliosSoloDeployment looked like:

skydns: failure to forward request "dns: failed to unpack truncated
message"

Our guess is that large UDP responses from the upstream unbound
have the "Message Truncated" DNS flag set. When this type of response
reaches skydns, skydns blows up and doesn't tell the client about the
error. The client times out without retrying in TCP mode. The client
would've retried if it had received an error message from skydns.

Running dig against skydns works. We think this is because dig adds
an OPT record to its query that sets "udp payload size: 4096".

Here are outstanding issues in skydns that seem related:

Solution:

We start an unbound instance in the solo container and have it forward
DNS queries via UDP to the upstream skydns in the same container.
Unbound will add the OPT section that makes everything work.
Things are fixed. :)

We admit this is super funky...And this only might work for UDP packets
up to 4096 bytes, the default set by unbound in OPT.

Much thanks to @gimaker for helping and suggesting unbound inside the
container.

Skydns now needs to be compiled with go 1.5+.
@codecov-io
Copy link

Current coverage is 48.92%

Merging #900 into master will increase coverage by +0.01% as of 886f1dc

@@            master    #900   diff @@
======================================
  Files          271     271       
  Stmts        12706   12706       
  Branches      1634    1634       
  Methods          0       0       
======================================
+ Hit           6215    6216     +1
+ Partial        470     468     -2
- Missed        6021    6022     +1

Review entire Coverage Diff as of 886f1dc

Powered by Codecov. Updated on successful CI builds.

@davidxia
Copy link
Contributor Author

davidxia commented Apr 7, 2016

@davidxia davidxia changed the title [solo] Fix skydns - two wrongs make a right [solo] Fix skydns - unbound inside solo Apr 7, 2016
@davidxia davidxia force-pushed the dxia/fix-solo-skydns2 branch from c3b4743 to 0d1483b Compare April 7, 2016 00:42
@davidxia davidxia changed the title [solo] Fix skydns - unbound inside solo [solo] Use unbound to proxy to skydns; fixes shit Apr 7, 2016
@davidxia davidxia added the bug label Apr 7, 2016
@@ -23,7 +23,8 @@ done
curl -XPUT http://127.0.0.1:4001/v2/keys/skydns/${SKYDNS_PATH} \
-d value="{\"host\":\"$HOST_ADDRESS\"}"

skydns $SKYDNS_OPTS &
skydns $SKYDNS_OPTS -verbose &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if -verbose should be put in the image. Seems like you could run the docker image with -e SKYDNS_OPTS=-verbose at runtime if you really wanted verbose logging.

Copy link
Contributor Author

@davidxia davidxia Apr 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea ill remove.

@mattnworb
Copy link
Member

👍 overall

@davidxia davidxia force-pushed the dxia/fix-solo-skydns2 branch from 0d1483b to 8989006 Compare April 7, 2016 13:25
TL;DR When two DNS servers don't work, add one more!

When running some integration tests with HeliosSoloDeployment on Docker
hosts that use a local unbound instance as its DNS resolver (i.e.
specified in `/etc/resolv.conf` on the Docker host),
we saw tests failures due to failed SRV queries to skydns. Skydns is
running in the solo container and forwards DNS queries it doesn't know
about to nameservers specified in `/etc/resolv.conf` via logic in `start.sh`.

The skydns error output from the helios solo container spawned by
HeliosSoloDeployment looked like:

```
skydns: failure to forward request "dns: failed to unpack truncated
message"
```

Our guess is that large UDP responses from the upstream unbound
have the "Message Truncated" DNS flag set. When this type of response
reaches skydns, skydns blows up and doesn't tell the client about the
error. The client times out without retrying in TCP mode. The client
would've retried if it had received an error message from skydns.

Running `dig` against skydns works. We think this is because `dig` adds
an OPT record to its query that sets "udp payload size: 4096".

Here are outstanding issues in skydns that seem related:

* skynetservices/skydns#242
* skynetservices/skydns#45

Solution:

We start an unbound instance in the solo container and have it forward
DNS queries via UDP to the upstream skydns in the same container.
Unbound will add the OPT section that makes everything work.
Things are fixed. :)

We admit this is super funky...And this only might work for UDP packets
up to 4096 bytes, the default set by unbound in OPT.

Much thanks to @gimaker for helping and suggesting unbound inside the
container.
@davidxia davidxia force-pushed the dxia/fix-solo-skydns2 branch from 8989006 to 886f1dc Compare April 7, 2016 13:31
@davidxia
Copy link
Contributor Author

davidxia commented Apr 7, 2016

@gimaker If it looks good to you, I'll merge and try to get a release out.

@gimaker
Copy link
Contributor

gimaker commented Apr 7, 2016

@davidxia what happened to the nice commit message? :(

@gimaker
Copy link
Contributor

gimaker commented Apr 7, 2016

Oh nevermind. I see that it's still there.

@gimaker
Copy link
Contributor

gimaker commented Apr 7, 2016

👍

@davidxia davidxia merged commit 5440a84 into master Apr 7, 2016
@davidxia davidxia deleted the dxia/fix-solo-skydns2 branch April 7, 2016 14:24
gimaker pushed a commit that referenced this pull request Jan 24, 2017
TL;DR; skydns does not handle TCP well. We already worked around this
in #900. See thar PR for more context.
However, that change only fixed the problem to an extent as we still have
the same issue once the responses are >4096 bytes.

This change extends that workaround to allow us to survive responses up to
32768 bytes in size. This change does not fix the issue, but should make it
more rare.
vbhavsar pushed a commit that referenced this pull request Aug 28, 2018
and use ServicesResourceTransformer to relocate class names in
META-INF/services/.

fixes #900
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants