Lighthouse DNS unable to handle TCP retries after truncation #3218

t0lya · 2024-09-09T19:47:02Z

What happened:

We have >200 pods behind a headless service in our fleet of clusters.
We are making DNS requests for the headless service (<service>.<namespace>.svc.clusterset.local) which is supposed to return >200 A records (one for every pod).
dig <service>.<namespace>.svc.clusterset.local @<pod IP of lighthouse DNS> works as expected.
dig <service>.<namespace>.svc.clusterset.local @<clusterIP of lighthouse DNS> does not work, returns SERVFAIL.

What you expected to happen:

Per RFC 1035 and RFC 2181, DNS messages over UDP are limited to 512 bytes. Longer responses are truncated: the server sends back a response with the TC bit set, after which the client is supposed to retry over TCP.

But the Lighthouse DNS ClusterIP service exposes only a UDP port, not a TCP port:

submariner-operator/controllers/servicediscovery/servicediscovery_controller.go

Lines 388 to 396 in ee76b52

    
    Ports: []corev1.ServicePort{{
    	Name:     "udp",
    	Protocol: "UDP",
    	Port:     53,
    	TargetPort: intstr.IntOrString{
    		Type:   intstr.Int,
    		IntVal: 53,
    	},
    }},

We manually created a copy of the ClusterIP service with a TCP port added (updating the existing service was not possible because the controller would override our changes). After that, dig <service>.<namespace>.svc.clusterset.local @<copied clusterIP of lighthouse DNS> began working.

Can we change submariner-operator to include a TCP port in the submariner-lighthouse-coredns ClusterIP service?

How to reproduce it (as minimally and precisely as possible):
Create many pods (>200) behind a headless service.
dig <service>.<namespace>.svc.clusterset.local @<pod IP of lighthouse DNS> works as expected.
dig <service>.<namespace>.svc.clusterset.local @<clusterIP of lighthouse DNS> does not work, returns SERVFAIL.

Anything else we need to know?:

Environment:

Diagnose information (use subctl diagnose all):
Gather information (use subctl gather)
Cloud provider or hardware configuration:
Install tools:
Others:


tpantelis · 2024-09-10T00:22:04Z

> Can we change submariner-operator to include a TCP port in the submariner-lighthouse-coredns ClusterIP service?

We welcome contributions. Please submit a PR.

t0lya added the bug label Sep 9, 2024

t0lya changed the title ~~Lighthouse DNS unable to handle TCP retries during truncation~~ Lighthouse DNS unable to handle TCP retries after truncation Sep 9, 2024

dfarrell07 added this to Backlog Sep 10, 2024

dfarrell07 moved this to Backlog in Backlog Sep 10, 2024

dfarrell07 assigned t0lya Sep 10, 2024

dfarrell07 added the priority:medium label Sep 10, 2024

t0lya mentioned this issue Sep 11, 2024

Add TCP port to Lighthouse CoreDNS ClusterIP service #3220

Merged

tpantelis closed this as completed in #3220 Sep 16, 2024

tpantelis removed this from Backlog Oct 28, 2024
