
[BUG] NetworkPolicy allowing DNS egress causes cilium agent crash in ACNS-enabled AKS #4525

Closed
felfa01 opened this issue Sep 5, 2024 · 9 comments
felfa01 commented Sep 5, 2024

Describe the bug
When a NetworkPolicy configured to allow DNS egress is deployed to an AKS cluster running Advanced Container Networking Services (ACNS), the cilium agent pods go into a crashing state.

To Reproduce

  1. Create an AKS cluster configured with the following:
    networkProfile: {
      advancedNetworking: {
        observability: {
          enabled: true
        }
        security: {
          fqdnPolicy: {
            enabled: true
          }
        }
      }
      networkPlugin: 'azure'
      networkPluginMode: 'overlay'
      networkDataplane: 'cilium'
      networkPolicy: 'cilium'
    }
  2. Deploy a NetworkPolicy configured to allow egress to port 53 with protocol UDP:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: bad-netpol
spec:
  egress:
  - to:
    - podSelector: {}
  - ports:
    - port: 53
      protocol: UDP
  policyTypes:
  - Egress
  3. Run kubectl get pods -n kube-system and see that the cilium pods are in a crashing state:
cilium-2zchs                                           1/1     Running            0             47h
cilium-bzj85                                           0/1     CrashLoopBackOff   21 (3s ago)   47h
cilium-kp5qt                                           0/1     CrashLoopBackOff   23 (3s ago)   47h
cilium-operator-5db9c9657b-k6j64                       1/1     Running            0             47h
cilium-operator-5db9c9657b-sl2ss                       1/1     Running            0             47h
cilium-xrdfm                                           1/1     Running            0             47h

Environment (please complete the following information):

  • Kubernetes version 1.29.7

Additional context
Error log:

time="2024-09-05T12:29:14Z" level=info msg="NetworkPolicy successfully added" k8sApiVersion= k8sNetworkPolicyName=k6-enable-connection subsys=k8s-watcher
time="2024-09-05T12:29:14Z" level=info msg="Policy imported via API, recalculating..." policyAddRequest=c6eba7b9-6b84-486f-86a8-7ca94cc99486 policyRevision=23 subsys=daemon
time="2024-09-05T12:29:14Z" level=info msg="Sending Policy updates to sdp: endpoint_id:1500  port:53  rules:{selector_string:\"&LabelSelector{MatchLabels:map[string]string{any.k8s-app: kube-dns,k8s.io.kubernetes.pod.namespace: kube-system,},MatchExpressions:[]LabelSelectorRequirement{},}\"  port_rules:{match_pattern:\"*\"}  selections:39072}" subsys=fqdn/server
time="2024-09-05T12:29:14Z" level=info msg="Sending update to stream: &{0xc0029d81e0}" subsys=fqdn/server
time="2024-09-05T12:29:14Z" level=info msg="Updating the DNS rules for endpoint 1500" subsys=proxy
panic: runtime error: invalid memory address or nil pointer dereference
	panic: Trying to configure zero proxy port
[signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0x2b3e3ec]

goroutine 1301 [running]:
github.com/cilium/cilium/pkg/proxy.(*Proxy).CreateOrUpdateRedirect.func1()
	/go/src/github.com/cilium/cilium/pkg/proxy/proxy.go:469 +0x6d
panic({0x2ff3920?, 0x5bfe340?})
	/usr/local/go/src/runtime/panic.go:920 +0x270
github.com/cilium/cilium/pkg/fqdn/service.(*FQDNDataServer).UpdateSDPAllowed(0xc001bdbf80, 0x5dc, 0x1110035, 0xc003a65770)
	/go/src/github.com/cilium/cilium/pkg/fqdn/service/service.go:61 +0x1ec
github.com/cilium/cilium/pkg/proxy.(*dnsRedirect).setRules(0xc001d11580, 0xc004a3f378?, 0xc003a65770)
	/go/src/github.com/cilium/cilium/pkg/proxy/dns.go:61 +0x217
github.com/cilium/cilium/pkg/proxy.(*dnsRedirect).UpdateRules(0xc001d11580, 0x3c933d0?)
	/go/src/github.com/cilium/cilium/pkg/proxy/dns.go:77 +0x2c
github.com/cilium/cilium/pkg/proxy.(*Proxy).CreateOrUpdateRedirect(0xc0005c7500, {0x3c896f0?, 0xc00084e6e0}, {0x3c933d0, 0xc00116c580}, {0xc003bf3d88, 0x12}, {0x3ca27a0, 0xc000b0aa80}, 0xc003fcb400)
	/go/src/github.com/cilium/cilium/pkg/proxy/proxy.go:503 +0x4d8
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).addNewRedirectsFromDesiredPolicy.func1(0xc004165380)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:243 +0x166
github.com/cilium/cilium/pkg/policy.L4DirectionPolicy.updateRedirects({0xc003858de0?, 0x60?}, 0xc0042ca0c0, 0xc004a402f8, {0xc003a64cc0?, 0x0?, 0xc003a64cf0?})
	/go/src/github.com/cilium/cilium/pkg/policy/resolve.go:214 +0x196
github.com/cilium/cilium/pkg/policy.(*EndpointPolicy).UpdateRedirects(0x10?, 0xc0?, 0x4108c5?, {0xc003a64cc0?, 0x0?, 0xc003a64cf0?})
	/go/src/github.com/cilium/cilium/pkg/policy/resolve.go:199 +0x4d
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).addNewRedirectsFromDesiredPolicy(0xc000b0aa80, 0x0?, 0xc003a64a80, 0xc003fcb400)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:217 +0x16d
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).addNewRedirects(0xc000b0aa80, 0xc003a64a50?)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:419 +0x230
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).runPreCompilationSteps(0xc000b0aa80, 0xc00234a800, 0xc0038586c0)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:840 +0x6d6
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerateBPF(0xc000b0aa80, 0xc00234a800)
	/go/src/github.com/cilium/cilium/pkg/endpoint/bpf.go:544 +0x190
github.com/cilium/cilium/pkg/endpoint.(*Endpoint).regenerate(0xc000b0aa80, 0xc00234a800)
	/go/src/github.com/cilium/cilium/pkg/endpoint/policy.go:472 +0x7b1
github.com/cilium/cilium/pkg/endpoint.(*EndpointRegenerationEvent).Handle(0xc000a4e080, 0xc000cb51a0?)
	/go/src/github.com/cilium/cilium/pkg/endpoint/events.go:57 +0x1de
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run.func1()
	/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:245 +0x133
sync.(*Once).doSlow(0xc001104fd0?, 0x44591c?)
	/usr/local/go/src/sync/once.go:74 +0xbf
sync.(*Once).Do(...)
	/usr/local/go/src/sync/once.go:65
github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).run(0xc001104f38?)
	/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:233 +0x3c
created by github.com/cilium/cilium/pkg/eventqueue.(*EventQueue).Run in goroutine 1253
	/go/src/github.com/cilium/cilium/pkg/eventqueue/eventqueue.go:229 +0x69
felfa01 added the bug label Sep 5, 2024

felfa01 commented Sep 5, 2024

@chasewilson FYI, I have noticed this with the ACNS feature.

@tamilmani1989 (Member)

@felfa01 Thanks for reporting. We are looking into this.

@vipul-21

Thanks @felfa01. We were able to reproduce the issue on our end and are working on a fix for it.
The issue occurs when two network policies are applied to the same endpoint and one of them does not contain any DNS rules. When these two policies are applied (in this case k6-enable-connection, which has DNS rules, and bad-netpol, which does not), the cilium agent creates the DNS redirection for k6-enable-connection and tries to reuse the same redirection for the bad-netpol policy. During policy recalculation, because the ACNS feature currently only supports DNS-based policies, the cilium agent crashes since the DNS policy is nil for bad-netpol.
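
For illustration, here is a minimal sketch of what a policy carrying DNS rules (such as k6-enable-connection) could look like. The actual manifest is not shown in this issue, so the name, selectors, and match patterns below are assumptions reconstructed from the selector string in the log; only the shape matters: a CiliumNetworkPolicy whose egress rule to kube-dns on 53/UDP includes a dns rules section, which is what makes the agent create a DNS proxy redirect for the endpoint.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: k6-enable-connection    # assumed name, taken from the log above
spec:
  endpointSelector: {}          # assumed: selects the same pods as bad-netpol
  egress:
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: UDP
      rules:
        dns:
        - matchPattern: "*"     # DNS rule that triggers the DNS proxy redirect
  - toFQDNs:
    - matchPattern: "*"         # assumed FQDN allow rule; the real pattern is unknown

Applying a policy like this alongside bad-netpol on the same pods produces the combination described above.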

@avo-sepp

We came across this problem naturally running in one of our clusters. I can confirm this bug exists and the short-term fix is to remove the NetworkPolicy that has a DNS egress on it.

@vipul-21

Confirming that the short-term fix is to remove the NetworkPolicy that has DNS egress specified.

pettersolberg88 commented Sep 19, 2024

We are also observing the same issue and hope a fix will be available soon.

ondrejmo commented Oct 8, 2024

Sorry to create more spam, but is there any ETA for when we can expect this to be fixed? Unfortunately, this bug makes the entire ACNS feature basically unusable.

vipul-21 commented Oct 8, 2024

Hey @ondrejmo, the fix has rolled out to every region. Are you still seeing the issue?

ondrejmo commented Oct 9, 2024

> Hey @ondrejmo, the fix has rolled out to every region. Are you still seeing the issue?

No, the issue seems to be fixed, thank you.
