Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dns servers respond with SRV records when queried for AAAA ones #4051

Closed
davepacheco opened this issue Sep 7, 2023 · 5 comments
Closed

dns servers respond with SRV records when queried for AAAA ones #4051

davepacheco opened this issue Sep 7, 2023 · 5 comments
Assignees
Labels
good first issue Issues that are good for learning the codebase

Comments

@davepacheco
Copy link
Collaborator

I am not sure this is wrong and I think it's not very important, but I want to at least have a record of this.

@jordanhendricks reported this output from dogfood today:

root@oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a:~# dig @fd00:1122:3344:1::1 -p 53 -t AAAA _cockroach._tcp.control-plane.oxide.internal.                                                                                                                                                                           

; <<>> DiG 9.18.14 <<>> @fd00:1122:3344:1::1 -p 53 -t AAAA _cockroach._tcp.control-plane.oxide.internal.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61672
;; flags: qr rd; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 5
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;_cockroach._tcp.control-plane.oxide.internal. IN AAAA

;; ANSWER SECTION:
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 3237a532-acaa-4ebe-bf11-dde794fea739.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 8bbea076-ff60-4330-8302-383e18140ef3.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 a3628a56-6f85-43b5-be50-71d8f0e04877.host.control-plane.oxide.internal.
_cockroach._tcp.control-plane.oxide.internal. 0 IN SRV 0 0 32221 e86845b5-eabd-49f5-9a10-6dfef9066209.host.control-plane.oxide.internal.

;; ADDITIONAL SECTION:
3237a532-acaa-4ebe-bf11-dde794fea739.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:109::3
4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:105::3
8bbea076-ff60-4330-8302-383e18140ef3.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:10b::3
a3628a56-6f85-43b5-be50-71d8f0e04877.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:107::3
e86845b5-eabd-49f5-9a10-6dfef9066209.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:108::3

;; Query time: 0 msec
;; SERVER: fd00:1122:3344:1::1#53(fd00:1122:3344:1::1) (UDP)
;; WHEN: Thu Sep 07 19:49:30 UTC 2023
;; MSG SIZE  rcvd: 652

Here, dig is making a query for AAAA records. The server is responding with SRV records (and AAAA records as additionals). I expected the server to report no records since there are no AAAA records with that name.

Interestingly, with +short, you get this:

root@oxz_nexus_20b100d0-84c3-4119-aa9b-0c632b0b6a3a:~# dig +short @fd00:1122:3344:1::1 -p 53 -t AAAA _cockroach._tcp.control-plane.oxide.internal.
0 0 32221 3237a532-acaa-4ebe-bf11-dde794fea739.host.control-plane.oxide.internal.
0 0 32221 4c3ef132-ec83-4b1b-9574-7c7d3035f9e9.host.control-plane.oxide.internal.
0 0 32221 8bbea076-ff60-4330-8302-383e18140ef3.host.control-plane.oxide.internal.
0 0 32221 a3628a56-6f85-43b5-be50-71d8f0e04877.host.control-plane.oxide.internal.
0 0 32221 e86845b5-eabd-49f5-9a10-6dfef9066209.host.control-plane.oxide.internal.

As a dig user I find that particularly surprising because I asked for AAAA records and got records on output with no IPv6 addresses. I suspect dig is just dumping whatever it finds in the answer section.

@iliana
Copy link
Contributor

iliana commented Oct 12, 2023

I think this behavior is correct and we should write a test to make sure it is retained. Here is the response for a SRV record from Knot DNS, whose target is an A record known to the resolver:

$ dig @ns1.7x6.net. -t SRV _minecraft._tcp.mc.7x6.net.

; <<>> DiG 9.10.6 <<>> @ns1.7x6.net. -t SRV _minecraft._tcp.mc.7x6.net.
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32640
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;_minecraft._tcp.mc.7x6.net.    IN      SRV

;; ANSWER SECTION:
_minecraft._tcp.mc.7x6.net. 900 IN      SRV     0 0 54321 mc.7x6.net.

;; ADDITIONAL SECTION:
mc.7x6.net.             900     IN      A       192.0.2.123

;; Query time: 49 msec
;; SERVER: 209.251.245.212#53(209.251.245.212)
;; WHEN: Thu Oct 12 09:45:23 PDT 2023
;; MSG SIZE  rcvd: 101

Consider also the similar implementation for glue records:

$ dig -t A @a.gtld-servers.net. ns1.7x6.net.

; <<>> DiG 9.10.6 <<>> -t A @a.gtld-servers.net. ns1.7x6.net.
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34811
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 3, ADDITIONAL: 7
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;ns1.7x6.net.                   IN      A

;; AUTHORITY SECTION:
7x6.net.                172800  IN      NS      ns1.7x6.net.
7x6.net.                172800  IN      NS      ns2.7x6.net.
7x6.net.                172800  IN      NS      ns3.7x6.net.

;; ADDITIONAL SECTION:
ns1.7x6.net.            172800  IN      A       209.251.245.212
ns1.7x6.net.            172800  IN      AAAA    2620:fc:c000::212
ns2.7x6.net.            172800  IN      AAAA    2a01:4ff:1f0:c2de::1
ns2.7x6.net.            172800  IN      A       5.78.69.207
ns3.7x6.net.            172800  IN      A       128.140.94.83
ns3.7x6.net.            172800  IN      AAAA    2a01:4f8:c17:f980::1

;; Query time: 47 msec
;; SERVER: 192.5.6.30#53(192.5.6.30)
;; WHEN: Thu Oct 12 09:36:05 PDT 2023
;; MSG SIZE  rcvd: 222

@iliana
Copy link
Contributor

iliana commented Oct 12, 2023

Wait I ran a different thing entirely. Here's Knot again:

$ dig @ns1.7x6.net. -t A _minecraft._tcp.mc.7x6.net.

; <<>> DiG 9.10.6 <<>> @ns1.7x6.net. -t A _minecraft._tcp.mc.7x6.net.
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54325
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;_minecraft._tcp.mc.7x6.net.    IN      A

;; AUTHORITY SECTION:
7x6.net.                900     IN      SOA     ns1.7x6.net. admin.7x6.net. 1312 10000 2400 604800 900

;; Query time: 54 msec
;; SERVER: 209.251.245.212#53(209.251.245.212)
;; WHEN: Thu Oct 12 09:50:47 PDT 2023
;; MSG SIZE  rcvd: 101

So, yes, our current behavior is likely incorrect.

@davepacheco
Copy link
Collaborator Author

We noticed this again under #6259. A quick look at the DNS server shows that it seems to ignore the requested record type. Here's where we handle the incoming message:

/// Handle a well-formed, decoded DNS query
async fn handle_dns_message(
request: &Request,
mr: &MessageRequest,
) -> Result<(), RequestError> {
let log = &request.log;
let store = &request.store;
debug!(&log, "message_request"; "mr" => #?mr);
let header = Header::response_from_request(mr.header());
let query = mr.query();
let name = query.original().name().clone();
let records = store.query(mr)?;

We grab the records from the store at L266 there. That function is documented to return all records associated with the name and says nothing about filtering by record type:

/// Returns a non-empty list of DNS records associated with the name in the
/// given DNS request.
///
/// If the returned set would have been empty, returns `QueryError::NoName`.
pub(crate) fn query(
&self,
mr: &trust_dns_server::authority::MessageRequest,
) -> Result<Vec<DnsRecord>, QueryError> {
let name = mr.query().name();
let orig_name = mr.query().original().name();
self.query_raw(name, orig_name)
}

I think store.query() should probably filter on record type, too.

Given this, you can see any kind of record when querying for any other kind of record. It's not just limited to getting SRV records when querying for AAAA -- the reverse can also happen.

@iliana is also right that we should continue to provide additional records when it makes sense. I think the fix I suggested won't change that because it's a separate hunk of code in handle_dns_message() that goes and resolves any SRV targets and includes them in the response.

@davepacheco davepacheco added the good first issue Issues that are good for learning the codebase label Aug 8, 2024
@iximeow iximeow self-assigned this Aug 9, 2024
iximeow added a commit that referenced this issue Aug 13, 2024
As noted in #4051,
queries to the internal DNS server would get any records for a name in
response, rather than only records matching the query incoming query type.
That behavior is confusing, but worse, wrong.

Subtly, this is not actually a misbehavior I believe we can observe
through `trust_dns_resolver`: the resolver from that crate includes its
own `CachingClient`. As a side effect of upstream answers going
through that caching client, incorrect Answers sections are cached and
only correct answers actually make it out to us as consumers of
`trust_dns_resolver`.

But as is plenty clear in #4051, `dig` and other DNS clients can get
incoherent answers!

Simple enough to fix: only return answers that are answers to the
question we were asked.
iximeow added a commit that referenced this issue Aug 13, 2024
`lookup_ipv6` is both buggy and easy to misuse:
* it sends an AAAA query for a domain which should have a SRV record
  - this works only because #4051
    means the SRV record is incorrectly returned, along with the desired
    AAAA in Additionals
* it looks up an IPv6 address from a SRV record *but ignores the port*.
  in places `lookup_ipv6` was used, it was paired consistently with the
  hardcoded port NEXUS_INTERNAL_PORT and matched what should be in the
  resolved SRV record. if we for example wanted to move Nexus' port (or
  start a test Nexus on an atypical port), the authoritative port number
  in the SRV response would be ignored for the hardcoded port.

lets just use the port that we told DNS we're at!

we may still want a bare IPv6 address for a service if we're testing
network reachability, for example, but we're not doing that today. and
if we need a service's IPv6 address to use with an alternate port to
access a different API, we *probably* should have a second SRV
record for that API to use instead?

(`lookup_ipv6` would be simple enough to fix and retain, but after
looking at its uses it seems it might be more trouble than it's worth
right now)
iximeow added a commit that referenced this issue Aug 13, 2024
`lookup_ipv6` is both buggy and easy to misuse:
* it sends an AAAA query for a domain which should have a SRV record
  - this works only because #4051
    means the SRV record is incorrectly returned, along with the desired
    AAAA in Additionals
* it looks up an IPv6 address from a SRV record *but ignores the port*.
  in places `lookup_ipv6` was used, it was paired consistently with the
  hardcoded port NEXUS_INTERNAL_PORT and matched what should be in the
  resolved SRV record. if we for example wanted to move Nexus' port (or
  start a test Nexus on an atypical port), the authoritative port number
  in the SRV response would be ignored for the hardcoded port.

lets just use the port that we told DNS we're at!

we may still want a bare IPv6 address for a service if we're testing
network reachability, for example, but we're not doing that today. and
if we need a service's IPv6 address to use with an alternate port to
access a different API, we *probably* should have a second SRV
record for that API to use instead?

(`lookup_ipv6` would be simple enough to fix and retain, but after
looking at its uses it seems it might be more trouble than it's worth
right now)
iximeow added a commit that referenced this issue Aug 16, 2024
`lookup_ipv6` is both buggy and easy to misuse:
* it sends an AAAA query for a domain which should have a SRV record
- this works only because
#4051 means the SRV
record is incorrectly returned, along with the actually-desired AAAA for
the SRV's target in Additionals
* it looks up an IPv6 address from a SRV record *but ignores the port*.
in places `lookup_ipv6` was used, it was paired consistently with the
hardcoded port NEXUS_INTERNAL_PORT and matched what should be in the
resolved SRV record. if we for example wanted to move Nexus' port (or
start a test Nexus on an atypical port), the authoritative port number
in the SRV response would be ignored for the hardcoded port.

lets just use the port that we told DNS we're at!

we may still want a bare IPv6 address for a service if we're going to
test network reachability, for example, but we're not doing that with
this function today. this all is distinct from helpers like
`lookup_all_ipv6`.

if we need a service's IPv6 address to use with an alternate port to
access a different API, we *probably* should have a distinct SRV record
for that lookup to use instead? i've found three instances of this:
* wicket assumes the techport proxy is on the same IP as Nexus' API, but that isn't necessarily true
* we assume the CRDB admin service listens on the same IP as CRDB itself, but that doesn't have to be true
* we look up addresses for MGS via `ServiceName::Dendrite`, but there's a
`ServiceName::ManagementGatewayService`, so either that's a typo or can be made to have its own SRV records

there are some uses of `lookup_all_ipv6` that make a lot of sense still,
where we're discovering the rack's network and _really_ do not care
about the port that Dendrite happens to be on.
@iximeow
Copy link
Member

iximeow commented Aug 16, 2024

FWIW, with #6320 landed i believe we no longer depend on this misbehavior..

iximeow added a commit that referenced this issue Oct 29, 2024
As noted in #4051,
queries to the internal DNS server would get any records for a name in
response, rather than only records matching the query incoming query type.
That behavior is confusing, but worse, wrong.

Subtly, this is not actually a misbehavior I believe we can observe
through `trust_dns_resolver`: the resolver from that crate includes its
own `CachingClient`. As a side effect of upstream answers going
through that caching client, incorrect Answers sections are cached and
only correct answers actually make it out to us as consumers of
`trust_dns_resolver`.

But as is plenty clear in #4051, `dig` and other DNS clients can get
incoherent answers!

Simple enough to fix: only return answers that are answers to the
question we were asked.
iximeow added a commit that referenced this issue Oct 29, 2024
As noted in #4051,
queries to the internal DNS server would get any records we have for a
name in response, rather than only records matching the query incoming
query type. That behavior is confusing, but worse, wrong.

Subtly, this is not actually a misbehavior I believe we can observe
through `trust_dns_resolver`: the resolver from that crate includes its
own `CachingClient`. As a side effect of upstream answers going through
that caching client, incorrect Answers records are cached and only
correct answers actually make it out to us as consumers of
`trust_dns_resolver`.

But as is plenty clear in #4051, `dig` and other DNS clients can get
incoherent answers!

Simple enough to fix: only return answers that are answers to the
question we were asked.
@iximeow
Copy link
Member

iximeow commented Oct 29, 2024

and with #6308 we no longer produce this misbehavior 👋

@iximeow iximeow closed this as completed Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Issues that are good for learning the codebase
Projects
None yet
Development

No branches or pull requests

3 participants