
[BUG] cannot access Kubernetes dashboard after upgrading to 1.17.7 #1394

Closed
przemyslavic opened this issue Jun 29, 2020 · 6 comments · Fixed by #1519

Comments

@przemyslavic (Collaborator)

Describe the bug
Occasionally, on some clusters, the dashboard cannot be reached by running kubectl proxy.
The issue was noticed after upgrading Kubernetes to v1.17.7 on an AWS RHEL environment.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy AWS RHEL cluster
  2. Run kubectl proxy on master node
  3. Try to run curl -I http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
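The proxy URL in step 3 follows the apiserver's service-proxy path scheme, /api/v1/namespaces/&lt;namespace&gt;/services/&lt;scheme&gt;:&lt;service&gt;:&lt;port&gt;/proxy/ (port empty here, so the service's default port is used). A small helper sketch that reconstructs it — the function name is made up for illustration:

```shell
# Build the kubectl-proxy URL for a namespaced HTTPS service
# (hypothetical helper, not part of the project).
dashboard_proxy_url() {
  local ns="$1" svc="$2"
  echo "http://localhost:8001/api/v1/namespaces/${ns}/services/https:${svc}:/proxy/"
}

# With kubectl proxy running on the master node, this reproduces step 3:
# curl -I "$(dashboard_proxy_url kubernetes-dashboard kubernetes-dashboard)"
dashboard_proxy_url kubernetes-dashboard kubernetes-dashboard
```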

Expected behavior
The dashboard UI is available.
HTTP status code is 200

HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: no-cache, private
Cache-Control: no-store
Content-Type: text/html; charset=utf-8
Date: Mon, 29 Jun 2020 13:18:29 GMT
Last-Modified: Fri, 06 Dec 2019 15:14:02 GMT

OS (please complete the following information):

  • OS: RHEL 7.8

Cloud Environment (please complete the following information):

  • Cloud Provider: AWS

Additional context
Curl command output:

HTTP/1.1 503 Service Unavailable
Cache-Control: no-cache, private
Content-Length: 71
Content-Type: text/plain; charset=utf-8
Date: Mon, 29 Jun 2020 13:47:29 GMT
X-Content-Type-Options: nosniff
curl "http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/"
Error trying to reach service: 'dial tcp 10.244.3.15:8443: i/o timeout'
[ec2-user@ec2 ~]$ kubectl logs -n=kubernetes-dashboard kubernetes-dashboard-5d996f7d46-6tthp
2020/06/29 10:00:20 Starting overwatch
2020/06/29 10:00:20 Using namespace: kubernetes-dashboard
2020/06/29 10:00:20 Using in-cluster config to connect to apiserver
2020/06/29 10:00:20 Using secret token for csrf signing
2020/06/29 10:00:20 Initializing csrf token from kubernetes-dashboard-csrf secret
2020/06/29 10:00:20 Successful initial request to the apiserver, version: v1.17.7
2020/06/29 10:00:20 Generating JWE encryption key
2020/06/29 10:00:20 New synchronizer has been registered: kubernetes-dashboard-key-holder-kubernetes-dashboard. Starting
2020/06/29 10:00:20 Starting secret synchronizer for kubernetes-dashboard-key-holder in namespace kubernetes-dashboard
2020/06/29 10:00:20 Initializing JWE encryption key from synchronized object
2020/06/29 10:00:20 Creating in-cluster Sidecar client
2020/06/29 10:00:20 Auto-generating certificates
2020/06/29 10:00:20 Successfully created certificates
2020/06/29 10:00:20 Serving securely on HTTPS port: 8443
2020/06/29 10:00:50 Metric client health check failed: the server is currently unable to handle the request (get services dashboard-metrics-scraper). Retrying in 30 seconds.
rafzei (Contributor) commented Jul 2, 2020

Looks like it is related to epicli versions < v0.6.0. I cannot reproduce the issue by upgrading from v0.6.0 to v0.7.0.
ENV:

  • RHEL 7.8
  • AWS

toszo (Contributor) commented Jul 3, 2020

This looks like a non-deterministic error. It usually appears on clusters built by the pipeline.

@rafzei rafzei self-assigned this Jul 3, 2020
rafzei (Contributor) commented Jul 9, 2020

Additional log from kubernetes-metrics-scraper:

I0707 14:37:30.561304    4967 round_trippers.go:420] GET https://10.1.2.197:6443/api/v1/namespaces/kubernetes-dashboard/pods/kubernetes-metrics-scraper-756dd959c8-xz2vt/log
I0707 14:37:30.561316    4967 round_trippers.go:427] Request Headers:
I0707 14:37:30.561321    4967 round_trippers.go:431]     User-Agent: kubectl/v1.17.7 (linux/amd64) kubernetes/b445510
I0707 14:37:30.561325    4967 round_trippers.go:431]     Accept: application/json, */*
I0707 14:37:30.564051    4967 round_trippers.go:446] Response Status: 200 OK in 2 milliseconds
{"level":"info","msg":"Kubernetes host: https://10.96.0.1:443","time":"2020-07-07T14:35:56Z"}
10.244.1.1 - - [07/Jul/2020:14:36:28 +0000] "GET / HTTP/1.1" 200 6 "" "kube-probe/1.17"
10.244.1.1 - - [07/Jul/2020:14:36:38 +0000] "GET / HTTP/1.1" 200 6 "" "kube-probe/1.17"
10.244.1.1 - - [07/Jul/2020:14:36:48 +0000] "GET / HTTP/1.1" 200 6 "" "kube-probe/1.17"
{"level":"error","msg":"Error scraping node metrics: the server could not find the requested resource (get nodes.metrics.k8s.io)","time":"2020-07-07T14:36:56Z"}

What I observed is that the issue was gone after a VM restart.

@mkyc mkyc modified the milestones: 0.7.1, S20200729 Jul 17, 2020
rafzei (Contributor) commented Jul 29, 2020

After upgrading K8s to 1.18.6, the issue is still present with both Flannel and Canal.

rafzei (Contributor) commented Jul 31, 2020

I've removed the dependency due to new findings on this topic. The Dashboard works fine from within the Pod. It looks like the problem is caused by NetworkManager managing the flannel.1 interface. The issue now seems to be reliably reproducible. I'm going to test a fix for that.
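A typical workaround when NetworkManager interferes with CNI-managed interfaces on RHEL is to mark them as unmanaged via a conf.d drop-in. A sketch only — the file name is made up, and the interface patterns (flannel.1 for Flannel's VXLAN device, cali* for Calico/Canal veths) are assumptions based on the CNI plugins mentioned in this thread, not the actual fix merged in #1519:

```ini
; /etc/NetworkManager/conf.d/cni.conf  (hypothetical file name)
; Tell NetworkManager to leave CNI-created interfaces alone.
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:flannel.1
```

After adding the drop-in, reload NetworkManager (e.g. systemctl reload NetworkManager) and verify with nmcli device status that the interfaces show as unmanaged.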

@przemyslavic (Collaborator, Author)
The fix seems to have resolved the issue as I am no longer able to reproduce it.

@mkyc mkyc closed this as completed Aug 6, 2020
rafzei added a commit that referenced this issue Aug 13, 2020
* Initialized test status table

* Added next sections of test status

Refactored status table a bit, added next lines, added next section with descriptions.

* Upgrade cluster section filled

* All sections filled

* Add missing tests

* Move CNS proposition design doc to GH.

* fixed formatting

* Etcd encryption feature refactor for deployment and upgrades (#1427)

* kubernetes_master: etcd encryption simplification and refactor

* upgrade: refactor of upgrade-kubeadm-config.yml (proper yaml parsing)

* upgrade: adding etcd encryption patching procedure

* upgrade-master.yml: small coding style improvement (highlight fix)

* upgrade: enabling patching of the kubeadm config

* fact naming improvements

Co-authored-by: to-bar <[email protected]>

* patch-kubeadm-config.yml: skipping unnecessary kubectl apply

Co-authored-by: to-bar <[email protected]>

* Bumping AzureCLI to fix SP secrets with special characters.

* Added Changelog entry.

* Change move to copy build dir during an upgrade (#1429)

* Change move to copy build dir during an upgrade
* Got rid of unused backup_temp_dir

* Update to logging

- log piping for stderr.
- custom colors for different log levels
- mapping some cases of log warnings and errors from Terraform and Ansible

* helm documentation #896

* Progress:

- simplified piping

* Fix K8s upgrade: 'kubeadm upgrade apply' hangs (#1431)

* Clean up and optimize K8s upgrades

* Patch only kubeadm-config ConfigMap

* Downgrade CoreDNS to K8s built-in version before 'kubeadm upgrade apply'

* Deploy customized CoreDNS after K8s is upgraded to the latest version

* Update changelog

* Wait for API resources to propagate

* Rename vendor in VSCode recommendations (#1438)

Vendor moved owner of mauve.terraform repository to HashiCorp (https://marketplace.visualstudio.com/items?itemName=HashiCorp.terraform)

* Fix issue with Vault and Kubernetes Calico/Canal communication (#1434)

* Add vault namespace and fixes related to connection issue

* Add default policy for default namespace

* Remove service endpoint, execute certificate part if enabled, setting protocol correctly in Vault Helm chart

* Add possibility to configure manually Vault endpoint

* Added changelog.

* add howto links for helm doc

* Update Changelog for #1438 (#1460)

* Update Changelog

* Update Changelog - add PR number

* bump rabbitmq version from 3.7.10 to 3.8.3 #1395

* Changes in documentation after creating fix for calico and canal (#1459)

* Changes after creating fix for calico and canal

* Update changelog

* Got rid of pipe and grep (#1472)

* Assert that current version is upgradeable #1474 (#1476)

* Assert that upgrade from current version is supported #1474

* Update core/src/epicli/data/common/ansible/playbooks/roles/upgrade/tasks/kubernetes.yml

Co-authored-by: to-bar <[email protected]>

* Add docker_version variable support (#1477)

* add docker_version variable support
* Docker installation - 2 tasks merged into 1 to speed up the deployment
* Remove two useless packages from docker installation

Co-authored-by: Grzegorz Dajuk <[email protected]>

* Kubernetes HA upgrades (#1456)

* epicli/upgrade: reusing existing shared-config + cleanups

* upgrade: k8s HA upgrades minimal implementation

* upgrade: kubernetes cleanup and refactor

* Apply suggestions from code review

Co-authored-by: to-bar <[email protected]>

* upgrade: removing unneeded kubeconfig from k8s nodes (security fix)

* upgrade: statefulset patching refactor

* upgrade: cleanups and refactor for logs

* Make deployment manifest tasks more generic

* Improve detecting CNI plugin

* AnsibleVarsGenerator.py: fixing regression issue introduced during upgrade refactor

* Apply suggestions from code review

Co-authored-by: to-bar <[email protected]>

* upgrade: statefulset patching refactor

- patching all containers (fix)
- patching init containers also (fix)
- removing include_tasks statements (speedup)

* Ensure settings for backward compatibility

* Revert "Ensure settings for backward compatibility"

This reverts commit 5c9cdb6.

* AnsibleInventoryUpgrade.py: merging shared-config with defaults

* Adding changelog entry

* Revert "AnsibleVarsGenerator.py: fixing regression issue introduced during upgrade refactor"

This reverts commit c38eb9d.

* Revert "epicli/upgrade: reusing existing shared-config + cleanups"

This reverts commit e5957c5.

* AnsibleVarsGenerator.py: adding nicer way to handle shared config

Co-authored-by: to-bar <[email protected]>

* Fix upgrade of flannel to v0.12.0 (#1484)

* Readme and changelog update (#1493)

Readme and changelog update

* Fixing broken offline CentOS 7.8 installation (#1498)

* repository: adding the missing centos-logos package

* updating 0.7.1 changelog

* repository/centos-7: restoring alphabetical order

* Add modularization-approaches.md design document

* Kibana config always points its elasticsearch.hosts to a "logging" VM (#1347) (#1483)

* Bump elliptic from 6.5.0 to 6.5.3 in /examples/keycloak/implicit/react

Bumps [elliptic](https://github.com/indutny/elliptic) from 6.5.0 to 6.5.3.
- [Release notes](https://github.com/indutny/elliptic/releases)
- [Commits](indutny/elliptic@v6.5.0...v6.5.3)

Signed-off-by: dependabot[bot] <[email protected]>

* Bump elliptic in /examples/keycloak/authorization/react

Bumps [elliptic](https://github.com/indutny/elliptic) from 6.5.0 to 6.5.3.
- [Release notes](https://github.com/indutny/elliptic/releases)
- [Commits](indutny/elliptic@v6.5.0...v6.5.3)

Signed-off-by: dependabot[bot] <[email protected]>

* Always setting hostname on all nodes of the cluster (on-prem fix) (#1509)

* common: always setting hostname on all nodes of the cluster (on-prem fix)

* updating 0.7.1 changelog

* Workaround: restart rabbitmq pods during patching #1395

* add missing changelog entry

* Upgrade Kubernetes to v1.18.6 (#1501)

* Upgrade k8s-dashboard to v2.0.3 (#1516)

* fix due to review

* Dashboard unavailability, network fix for Flannel and Canal #1394 (#1519)

* additional defaults for kafka config

* fixes after review, remove redundant code

* Named demo configuration the same as generated one

* Added deletion step description

* Added a note related to versions for upgrades

* Fixed syntax errors

* Added prerequisites section in upgrade doc

* Added key encoding troubleshooting info

* Test fixes for RabbitMQ 3.8.3 (#1533)

* fix missing variable image rabbitmq

* Add Kubernetes Dashboard to COMPONENTS.md (#1546)

* Update CHANGELOG-0.7.md

Minor changes to changelog before release.

* CHANGELOG-0.7.md update v0.7.1 release date (#1552)

* Increment version string to 0.7.1 (#1554)

Co-authored-by: Mateusz Kyc <[email protected]>
Co-authored-by: Mateusz Kyc <[email protected]>
Co-authored-by: Michał Opala <[email protected]>
Co-authored-by: to-bar <[email protected]>
Co-authored-by: Luuk van Venrooij <[email protected]>
Co-authored-by: Tomasz Arendt <[email protected]>
Co-authored-by: Marcin Pyrka <[email protected]>
Co-authored-by: erzetpe <[email protected]>
Co-authored-by: Luuk van Venrooij <[email protected]>
Co-authored-by: ar3ndt <[email protected]>
Co-authored-by: Grzegorz Dajuk <[email protected]>
Co-authored-by: Grzegorz Dajuk <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: TolikT <[email protected]>
Co-authored-by: przemyslavic <[email protected]>