
Cannot use terraform and gossip-based cluster at the same time #2990

Closed
simnalamburt opened this issue Jul 18, 2017 · 64 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@simnalamburt

simnalamburt commented Jul 18, 2017

If you create a cluster with both the terraform target and gossip-based DNS enabled, all kubectl commands will fail.


How to reproduce the error

My environment

$ uname -a
Darwin *****.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64

$ kops version
Version 1.6.2

$ terraform version
Terraform v0.9.11

$ aws --version
aws-cli/1.11.117 Python/2.7.10 Darwin/16.6.0 botocore/1.5.80

Setting up the cluster

# Create RSA key
ssh-keygen -f shared_rsa -N ""

# Create S3 bucket
aws s3api create-bucket \
  --bucket=kops-temp \
  --region=ap-northeast-1 \
  --create-bucket-configuration LocationConstraint=ap-northeast-1

# Generate terraform code; some resources,
# including *certificates*, will be stored in S3
kops create cluster \
  --name=kops-temp.k8s.local \
  --state=s3://kops-temp \
  --zones=ap-northeast-1a,ap-northeast-1c \
  --ssh-public-key=./shared_rsa.pub \
  --out=. \
  --target=terraform

# Create cluster
terraform init
terraform plan -out ./create-cluster.plan
terraform show ./create-cluster.plan | less -R # final review
terraform apply ./create-cluster.plan # fire

# Done

Spoiler alert: creating the self-signed certificate before creating the actual Kubernetes cluster is the root cause of this issue. Please read on to see why.

Scenario 1. Looking up non-existent domain

$ kubectl get nodes
Unable to connect to the server: dial tcp: lookup api.kops-temp.k8s.local on 8.8.8.8:53: no such host

This is basically because of an erroneous ~/.kube/config file. If you run kops create cluster with both the terraform target and gossip-based DNS enabled, you'll get a wrong ~/.kube/config file.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api.kops-temp.k8s.local
            # !!!! There's no such domain named "api.kops-temp.k8s.local"
  name: kops-temp.k8s.local
# ...

Let's manually correct that file. Alternatively, you'll get a correct config file if you explicitly export the configuration once again:

kops export kubecfg kops-temp.k8s.local --state s3://kops-temp

Then the non-existent domain will be replaced with the DNS name of the masters' ELB.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com
  name: kops-temp.k8s.local
# ...

And you'll end up in scenario 2 when you retry.

Scenario 2. Invalid certificate

$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for api.internal.kops-temp.k8s.local, api.kops-temp.k8s.local, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, not api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com

This is simply because the DNS name of the ELB is not included in the certificate. This scenario occurs only when you create the cluster with the terraform target enabled. If you create the cluster with only the gossip option, without the terraform target, the self-signed certificate will properly contain the DNS name of the ELB.

[Screenshot of the certificate details]

(Sorry for the Korean; this is the list of the certificate's DNS alternative names.)
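
For reference, one quick way to check which DNS names the served certificate actually contains is to query the API endpoint with openssl (a sketch; the ELB hostname below is the illustrative one from this report):

# print the Subject Alternative Names of the certificate served by the API ELB
openssl s_client -connect api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com:443 </dev/null 2>/dev/null \
    | openssl x509 -noout -text \
    | grep -A1 'Subject Alternative Name'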

The only way to work around this problem is to force "api.kops-temp.k8s.local" to resolve to the proper IP address by manually editing /etc/hosts, which is undesirable for many people.

# Recover ~/.kube/config
perl -i -pe \
    's|api-kops-temp-k8s-local-nrvnqsr-666666\.ap-northeast-1\.elb\.amazonaws\.com|api.kops-temp.k8s.local|g' \
    ~/.kube/config

# Hack /etc/hosts
host api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com |
    perl -pe 's|^.* address (.*)$|\1\tapi.kops-temp.k8s.local|g' |
    sudo tee -a /etc/hosts

# This will succeed
kubectl get nodes

I'm not very familiar with kops internals, but I expect a large change would be needed to properly fix this issue. Maybe using AWS Certificate Manager could be a solution (#834). Any ideas?

@simnalamburt simnalamburt changed the title Cannot use terraform and gossip-based cluster in the same time Cannot use terraform and gossip-based cluster at the same time Jul 18, 2017
@gregd72002

I can reproduce the problem using kops 1.7.5

@pastjean
Contributor

If you run kops update cluster $NAME --target=terraform after the terraform apply, it's actually going to generate a new certificate. Run kops export kubecfg $NAME after that and you've got a working setup. Although, I know, it's not very straightforward.
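
A minimal sketch of that sequence, using the cluster name and state bucket from the original report:

terraform apply                                                      # initial infrastructure creation
kops update cluster kops-temp.k8s.local --state=s3://kops-temp --target=terraform --out=.
kops export kubecfg kops-temp.k8s.local --state=s3://kops-temp
kubectl get nodes                                                    # should now work, per this workaround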

@thedonvaughn

thedonvaughn commented Oct 21, 2017

I also hit the same reported issue. I took @pastjean's advice and re-ran kops update cluster $NAME --target=terraform and then kops export kubecfg $NAME. While this updated my kube config with the proper DNS name of the API ELB, I still get an invalid cert error.

@thedonvaughn

Never mind. I have to create the cluster with --target=terraform first. After running terraform apply and then updating, I get a new master cert. I had been creating the cluster, then updating with --target=terraform, then applying, then re-running the update; that didn't generate a new cert. So my bad on the order. Issue is resolved. Thanks.
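
In other words, the order that worked here seems to be (a sketch; create-time flags such as --zones are omitted):

kops create cluster $NAME --target=terraform --out=.   # 1. create with the terraform target from the start
terraform apply                                        # 2. build the infrastructure, including the API ELB
kops update cluster $NAME --target=terraform --out=.   # 3. update afterwards, which issues the new master cert
kops export kubecfg $NAME                              # 4. refresh ~/.kube/config with the ELB endpoint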

@chrislovecnm
Contributor

Closing!

@sybeck2k

sybeck2k commented Nov 6, 2017

The bug is still valid for me, and @pastjean's solution is not working for me. I'm using an S3 remote store; here are my versions:

$ uname -a
Darwin xxxxx 17.0.0 Darwin Kernel Version 17.0.0: Thu Aug 24 21:48:19 PDT 2017; root:xnu-4570.1.46~2/RELEASE_X86_64 x86_64
$ kops version
Version 1.7.1
$ terraform version
Terraform v0.10.8
$ aws --version
aws-cli/1.11.137 Python/2.7.10 Darwin/17.0.0 botocore/1.6.4

To reproduce, I do the same steps as @simnalamburt reported. I then run kops update cluster $NAME --target=terraform --out=. and terraform apply, but I still have an invalid certificate (it does not get the alias of the AWS LB).

Checking the S3 store, in the folder <cluster-name>/pki/issued/master, I can see that a first certificate is created when creating the cluster with kops, and a second one is added after the kops update request. The second certificate does include the LB DNS name, but it is not deployed onto the master node(s).
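
One way to see the issued certificates in the state store is to list that prefix directly (a sketch, using the bucket and cluster names from the original report):

aws s3 ls s3://kops-temp/kops-temp.k8s.local/pki/issued/master/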

Here is the update command output:

kops update cluster $NAME --target=terraform --out=.
I1106 18:10:54.285184    8239 apply_cluster.go:420] Gossip DNS: skipping DNS validation
I1106 18:10:55.044907    8239 executor.go:91] Tasks: 0 done / 83 total; 38 can run
I1106 18:10:55.467860    8239 executor.go:91] Tasks: 38 done / 83 total; 15 can run
I1106 18:10:55.469345    8239 executor.go:91] Tasks: 53 done / 83 total; 22 can run
I1106 18:10:56.032321    8239 executor.go:91] Tasks: 75 done / 83 total; 5 can run
I1106 18:10:56.691785    8239 vfs_castore.go:422] Issuing new certificate: "master"
I1106 18:10:57.160535    8239 executor.go:91] Tasks: 80 done / 83 total; 3 can run
I1106 18:10:57.160867    8239 executor.go:91] Tasks: 83 done / 83 total; 0 can run
I1106 18:10:57.261829    8239 target.go:269] Terraform output is in .
I1106 18:10:57.529372    8239 update_cluster.go:247] Exporting kubecfg for cluster
Kops has set your kubectl context to ci5-test.k8s.local

Terraform output has been placed into .

Changes may require instances to restart: kops rolling-update cluster

As you can see, the log reports that the certificate is generated. I've tried doing a kops rolling-update cluster --cloudonly as recommended, but the output is No rolling-update required.

@jlaswell
Contributor

jlaswell commented Nov 6, 2017

@sybeck2k, we have also experienced this issue as of a few hours ago.

You will need to run kops rolling-update cluster --cloudonly --force --yes to force an update. This can take a while depending on the size of the cluster, but we have found that trying to manually set the --master-interval or --node-interval can prevent nodes from reaching a Ready state. I suggest just grabbing some ☕️ and letting the default interval do its thing.

It is still a workaround at the moment, but we have found it to be repeatably successful.
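
A short sketch of that final step, with the cluster name as a placeholder:

# force a full roll so the masters pick up the regenerated certificate, then verify
kops rolling-update cluster $NAME --cloudonly --force --yes
kops validate cluster     # or simply: kubectl get nodes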

@chrislovecnm
Contributor

This should be fixed in master, for anyone who wants to test master or wait for the 1.8 beta release.

@sybeck2k

sybeck2k commented Nov 7, 2017

@jlaswell thanks a lot! I can confirm your workaround works for kops 1.7.1.
Could anyone point me to the details of what exactly is pulled from the state store, and when? In the doc I found this information.

@jlaswell
Contributor

jlaswell commented Nov 7, 2017

Not sure about what is used when. I would bet that looking through some of the source code is best for that, but I do know that you can look in the S3 bucket used for the state store if you are using AWS. We've perused that a few times to get an understanding.
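
For example, a recursive listing shows the overall layout of the state store (a sketch, using the bucket and cluster names from the original report):

aws s3 ls --recursive s3://kops-temp/kops-temp.k8s.local/ | head -n 40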

@shashanktomar
Contributor

shashanktomar commented Nov 12, 2017

@chrislovecnm I can still reproduce this in 1.8.0-beta.1. Both steps are still required:

  • kops update cluster $NAME --target=terraform --out=.
  • kops rolling-update cluster --cloudonly --force --yes

@chrislovecnm chrislovecnm reopened this Nov 12, 2017
@chrislovecnm
Contributor

@shashanktomar I would assume the workflow is

  1. kops update cluster --target=terraform
  2. terraform apply (not sure the syntax is correct)
  3. kops rolling-update cluster

What does rolling update show?

It would be a bug if the update does not create the same hash in the tf code that we generate in the direct-target code path.

@andresguisado

andresguisado commented Nov 15, 2017

@chrislovecnm I can reproduce this in 1.8.0-beta.1 as well. As @shashanktomar said, both steps are still required:

  • kops update cluster $NAME --state s3://bucket --target=terraform --out=.
  • kops rolling-update cluster --cloudonly --force --yes

Here is the rolling update output:

Using cluster from kubectl context: dev.xxx.k8s.local

NAME			STATUS	NEEDUPDATE	READY	MIN	MAX
master-eu-west-2a	Ready	0		1	1	1
nodes			Ready	0		2	2	2
W1115 15:28:50.519884   16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:28:50.519898   16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "master-eu-west-2a.masters.dev.xxx.k8s.local".
 
 
W1115 15:33:50.723093   16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
W1115 15:33:50.723189   16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:33:50.723203   16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "nodes.dev.xxx.k8s.local".
W1115 15:35:50.930041   16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
W1115 15:35:50.930978   16811 instancegroups.go:264] Not draining cluster nodes as 'cloudonly' flag is set.
I1115 15:35:50.931003   16811 instancegroups.go:352] Stopping instance "i-xxx", in AWS ASG "nodes.dev.xxx.k8s.local".
W1115 15:37:51.117159   16811 instancegroups.go:293] Not validating cluster as cloudonly flag is set.
I1115 15:37:51.117407   16811 rollingupdate.go:174] Rolling update completed!

@tspacek
Contributor

tspacek commented Dec 20, 2017

I reproduced this in 1.8.0 after kops create cluster ... --target=terraform and terraform apply

I can confirm that running the following fixed it:
kops update cluster $NAME --target=terraform
kops rolling-update cluster $NAME --cloudonly --force --yes

@chrislovecnm
Contributor

More detail please

@bashims

bashims commented Feb 5, 2018

I am having the same problem here (see version info below). The workaround does indeed work, but it takes way too long to complete. It would be great if this could be resolved.

kops version

Version 1.8.0 (git-5099bc5)      

@mbolek

mbolek commented Mar 6, 2018

As above, this is still broken in:
Version 1.8.1 (git-94ef202)

Generally, as I understand it, the workaround flow is:
kops create cluster $NAME --target=terraform --out=.
terraform apply
kops rolling-update cluster $NAME --cloudonly --force --yes (around 20 minutes with 3 masters and 3 nodes)
and then it should work, but I had to re-export the kops config:
kops export kubecfg $NAME
and now it works for both kops and kubectl.
Are there any ideas on how to resolve this? I was also wondering whether, in general, the gossip-based approach is inferior to the DNS approach?

@srolel

srolel commented May 17, 2018

The fix using rolling-update did not work for me.

Version 1.9.0 (git-cccd71e67)

@mbolek

mbolek commented May 17, 2018

@mosho1 did you export the config?
Can you check if the server in the ~/.kube/config points to an external endpoint?

@srolel

srolel commented May 17, 2018

@mbolek yeah, it did, though I have already brought down that cluster and used kops directly instead.

@Hermain

Hermain commented Jun 1, 2018

Fyi: Still broken in 1.9.0

@1ambda

1ambda commented Jun 11, 2018

In 1.9.1 too. I am running a gossip-based cluster (.local) and was able to work around this issue by following the comments above.

# assume that you have already applied terraform once and the ELB for the kube API has been created on AWS

# make sure to export kubecfg before applying terraform so that the launch configuration (LC) is created with the exported cfg
kops export kubecfg --name $NAME
kops update cluster $NAME --target=terraform --out=.
terraform plan
terraform apply 

kops rolling-update cluster $NAME --cloudonly --force --yes

In case of continued failure you might add insecure-skip-tls-verify: true to the cluster entry in ~/.kube/config, but that's usually not recommended.

@gtmtech

gtmtech commented Jul 10, 2018

Who wants to do a rolling update straight after provisioning a cluster? kops should provision the correct server entries in the kubectl config file in the first place. Given that kops creates a DNS entry just fine with a sensible name, e.g. api.cluster.mydomain.net (as an alias record to the ELB/ALB), why isn't kops export kubecfg using the alias record as the server instead of the ELB? This alias record is already in the certificate, as the OP says, and if kops generates a kubectl config entry using server: https://[alias record], then it works just fine, and no rolling updates or post-shenanigans are needed.

This should work out of the box

@mbolek

mbolek commented Jul 25, 2018

#kops version
Version 1.9.2 (git-cb54c6a52)

OK... so I thought I had something, but it seems the issue persists. You need to export the config to fix the API server endpoint, and you need to roll the master to fix the SSL cert.

@drzero42

drzero42 commented Aug 30, 2018

Another workaround that does not require waiting to roll the master(s) is to create the ELB first, then update the cluster, and then do the rest of the terraform apply (a combined command sketch follows the list). The steps are:

  • Create cluster as usual
  • Create internet gateway, or ELB will fail to deploy: terraform apply -target aws_internet_gateway.CLUSTERNAME-k8s-local
  • Create ELB: terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local
  • Update cluster (which will catch the DNS name for the ELB and issue a new master cert, as well as export a new kubecfg): kops update cluster --out=. --target=terraform
  • Create everything else: terraform apply
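
A combined sketch of those steps (resource names follow the CLUSTERNAME-k8s-local pattern above and will differ per cluster):

terraform apply -target aws_internet_gateway.CLUSTERNAME-k8s-local   # IGW first, or the ELB fails to deploy
terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local            # create the API ELB
kops update cluster --out=. --target=terraform                       # picks up the ELB DNS name, reissues the master cert, exports kubecfg
terraform apply                                                       # create everything else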

@mshivanna

@mbolek the issue indeed persists
kops version
Version 1.10.0

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close


@esimonov

esimonov commented Nov 6, 2020

Still seeing this issue.

Kops - Version 1.18.2
Terraform - v0.13.4

@djha736

djha736 commented Dec 2, 2020

Hi team,

I am still getting the certificate issue.

I am using kops version v1.19
terraform version v0.13.5

@kmichailg

kmichailg commented Dec 4, 2020

Currently experiencing this
kops -> Version 1.18.2 (git-84495481e4)
terraform -> v0.13.5

Will try using terraform:0.14.0

@kmichailg

Tried terraform:0.14.0

Still capturing the .k8s.local instead of the correct ELB address.

/reopen

@k8s-ci-robot
Contributor

@kmichailg: You can't reopen an issue/PR unless you authored it or you are a collaborator.


@olemarkus
Member

/reopen

@k8s-ci-robot
Contributor

@olemarkus: Reopened this issue.


@k8s-ci-robot k8s-ci-robot reopened this Dec 4, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.


@kmichailg

Tested with terraform:0.14.3 and kops Version 1.18.2 (git-84495481e4)

Still capturing the .k8s.local address instead of the correct ELB address. The workarounds don't seem to work.

Validation failed: unexpected error during validation: error listing nodes: an error on the server ("") has prevented the request from succeeding (get nodes)

Tried re-exporting the ELB endpoint:

kops export kubecfg --name ${CLUSTER_NAME} && \
kops update cluster ${CLUSTER_NAME} \
  --out=. \
  --target=terraform && \
terraform apply -auto-approve && \
kops rolling-update cluster ${CLUSTER_NAME} --cloudonly --force --yes

Doing this makes the master node appear to be stuck in the initializing status on AWS on a few occasions; it eventually becomes okay.

Also tried creating the gateway and ELB first before running the full terraform apply, with the same result:

kops create ...
...

terraform apply -target=aws_internet_gateway.${CLUSTER_PREFIX}-k8s-local -auto-approve && \
terraform apply -target=aws_elb.api-${CLUSTER_PREFIX}-k8s-local -auto-approve

kops update cluster \
  --out=. \
  --target=terraform

terraform apply -auto-approve && \
kops rolling-update cluster --cloudonly --force --master-interval=1s --node-interval=1s --yes

I am using t3a.small for the nodes, t3a.medium for the master node.

@kmichailg

Still experiencing this with gossip-based clusters. Abandoning infrastructure-as-code (via terraform) for now; I will just deploy via kops only.

Hopefully you reopen this for tracking. Thank you!

@alen-z

alen-z commented Mar 23, 2021

Issue still persists. Great feature, but not usable at the moment.

@alen-z

alen-z commented Mar 23, 2021

/reopen

@k8s-ci-robot
Contributor

@alen-z: You can't reopen an issue/PR unless you authored it or you are a collaborator.


@rifelpet rifelpet reopened this Mar 23, 2021
@rifelpet
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 23, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 16, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

