Unable to update GitLab instance #3942

Closed
amarjandu opened this issue Mar 14, 2022 · 11 comments
Labels
  • bug [type] A defect preventing use of the system as specified
  • debt [type] A defect incurring continued engineering cost
  • demo [process] To be demonstrated at the end of the sprint
  • demoed [process] Successfully demonstrated to team
  • infra [subject] Project infrastructure like CI/CD, build and deployment scripts
  • orange [process] Done by the Azul team
  • spike:2 [process] Spike estimate of two points
  • workaround [type] An enhancement that works around a defect of an external dependency

Comments

@amarjandu
Contributor

Unable to update GitLab from 14.7.4 to 14.7.5. Terraform attempts to update the EC2 instance in place, but when the instance starts up again, the GitLab version has not been updated.

See Terraform output for more details.

The patch used to apply the update was:

From fd2fbc0f6b9ae67e3e8c3544d45c5d5cca4dc1c7 Mon Sep 17 00:00:00 2001
From: amar jandu <[email protected]>
Date: Mon, 14 Mar 2022 11:03:57 -0700
Subject: [PATCH] Update GitLab to 14.7.5

---
 terraform/gitlab/gitlab.tf.json.template.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/terraform/gitlab/gitlab.tf.json.template.py b/terraform/gitlab/gitlab.tf.json.template.py
index 7aa9261f..da6e3388 100644
--- a/terraform/gitlab/gitlab.tf.json.template.py
+++ b/terraform/gitlab/gitlab.tf.json.template.py
@@ -1240,7 +1240,7 @@ emit_tf({} if config.terraform_component != 'gitlab' else {
                                --volume /mnt/gitlab/config:/etc/gitlab \
                                --volume /mnt/gitlab/logs:/var/log/gitlab \
                                --volume /mnt/gitlab/data:/var/opt/gitlab \
-                               gitlab/gitlab-ce:14.7.4-ce.0
+                               gitlab/gitlab-ce:14.7.5-ce.0
                         docker run \
                                --detach \
                                --name gitlab-runner \
@@ -1248,7 +1248,7 @@ emit_tf({} if config.terraform_component != 'gitlab' else {
                                --volume /mnt/gitlab/runner/config:/etc/gitlab-runner \
                                --network gitlab-runner-net \
                                --env DOCKER_HOST=tcp://gitlab-dind:2375 \
-                               gitlab/gitlab-runner:v14.7.0
+                               gitlab/gitlab-runner:v14.7.1
                     '''[1:]),  # trim newline char at the beginning as dedent() only removes indent common to all lines
                 'tags': {
                     'Name': 'azul-gitlab',
-- 
2.24.3 (Apple Git-128)

@amarjandu amarjandu added the orange [process] Done by the Azul team label Mar 14, 2022
@amarjandu
Contributor Author

Prior to the VPN addition (#3605), updating GitLab caused the EC2 instance to be destroyed and then created again. This time, Terraform performed the changes in place. Perhaps this is why the modifications are not reflected on the instance.

@hannes-ucsc
Member

hannes-ucsc commented Mar 14, 2022

hashicorp/terraform-provider-aws#23315

The provider update from #3605 is probably the cause.

From https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance#user_data

  • user_data - (Optional) User data to provide when launching the instance. Do not pass gzip-compressed data via this argument; see user_data_base64 instead. Updates to this field will trigger a stop/start of the EC2 instance.

But it didn't even reboot the instance. I suspect that this is because we use the most recent version of the provider with a rather old version of TF.

@hannes-ucsc
Member

hannes-ucsc commented Mar 15, 2022

I could see the updated user data referring to version 14.7.5 on the instance. Rebooting the instance did not seem to have an effect: the instance came back with the old image version 14.7.4. Stopping and starting also does not help, which is really odd.

[image: screenshot of the instance user data]

@hannes-ucsc
Member

hannes-ucsc commented Mar 15, 2022

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-add-user-data.html

If you stop an instance, modify its user data, and start the instance, the updated user data is not run when you start the instance.

@hannes-ucsc
Member

Spike to run the following experiment.

  1. Get back to clean state by terminating the instance in the console or with terraform destroy and redeploying from a clean unmodified working copy with develop checked out. Post screenshot of user data (like the one above) to prove that user data refers to 14.7.4. Wait until instance is fully up and reachable via web UI.

  2. Stop the instance in the EC2 console

  3. Apply above patch and deploy

  4. Post screenshot of user data

  5. Start the instance via the EC2 console

  6. Wait until it comes back up

  7. Post screenshot of user data and of https://gitlab.dev.singlecell.gi.ucsc.edu/help
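The user data checks in steps 1, 4 and 7 could also be scripted instead of captured as screenshots. The following is a minimal sketch, assuming `boto3` is installed and AWS credentials are configured. EC2's `describe_instance_attribute` call with `Attribute='userData'` returns the user data base64-encoded. The function names are hypothetical and not part of the Azul code base.

```python
import base64
import re


def decode_user_data(response: dict) -> str:
    """Decode the base64 payload of an EC2 DescribeInstanceAttribute response."""
    encoded = response.get('UserData', {}).get('Value', '')
    return base64.b64decode(encoded).decode('utf-8')


def image_tags(user_data: str, repository: str = 'gitlab/gitlab-ce') -> list:
    """Return the image tags that the user data references for `repository`."""
    pattern = re.escape(repository) + r':([\w.\-]+)'
    return re.findall(pattern, user_data)


def fetch_user_data(instance_id: str) -> str:
    """Fetch and decode the user data of an instance (requires boto3)."""
    import boto3  # imported lazily so the helpers above work without it
    ec2 = boto3.client('ec2')
    response = ec2.describe_instance_attribute(InstanceId=instance_id,
                                               Attribute='userData')
    return decode_user_data(response)
```

Calling `image_tags(fetch_user_data(instance_id))` before and after the deployment would show which image tag the user data refers to. It says nothing about which image is actually running, so the `/help` page check in step 7 is still needed.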

@hannes-ucsc hannes-ucsc added bug [type] A defect preventing use of the system as specified debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts spike:2 [process] Spike estimate of two points labels Mar 15, 2022
@melainalegaspi melainalegaspi added the operator [process] To be addressed by whoever is operator label Mar 15, 2022
@achave11-ucsc
Member

achave11-ucsc commented Mar 24, 2022

[image: Screen Shot 2022-03-23 at 5 01 24 PM]

Steps four and seven are merged into one because step five is inapplicable: deploying the changes (step three) automatically starts the instance.

[image: Screen Shot 2022-03-24 at 9 21 32 AM]

@theathorn

theathorn commented Mar 24, 2022

@hannes-ucsc: "Based on the experiment, it appears that the only way for user data to take effect is to terminate and recreate the instance, something that the current version of the AWS provider does not appear to do automatically. We need to work around this by updating the operator manual to include terraform destroy on the instance."
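A rough sketch of what that workaround might look like when scripted. `terraform destroy -target=ADDRESS` limits the destroy to a single resource. The resource address `aws_instance.gitlab` is an assumption, and the project's actual operator procedure may wrap Terraform differently.

```python
import subprocess


def recreate_commands(address: str) -> list:
    """Build the Terraform invocations that destroy one resource and redeploy."""
    return [
        ['terraform', 'destroy', f'-target={address}'],
        ['terraform', 'apply'],
    ]


def recreate(address: str, dry_run: bool = True) -> None:
    """Print the commands and, unless dry_run is set, execute them in order."""
    for command in recreate_commands(address):
        print(' '.join(command))
        if not dry_run:
            # check=True aborts the sequence if Terraform exits non-zero
            subprocess.run(command, check=True)


if __name__ == '__main__':
    # 'aws_instance.gitlab' is a hypothetical resource address
    recreate('aws_instance.gitlab', dry_run=True)
```

The dry-run default only prints the commands, which seems the safer choice for something that terminates an instance.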

@theathorn theathorn assigned amarjandu and unassigned achave11-ucsc Mar 24, 2022
@theathorn theathorn added workaround [type] An enhancement that works around a defect of an external dependency and removed operator [process] To be addressed by whoever is operator labels Mar 24, 2022
achave11-ucsc pushed a commit that referenced this issue Apr 28, 2022
@theathorn theathorn added the no demo [process] Not to be demonstrated at the end of the sprint label May 2, 2022
@hannes-ucsc
Member

PR #3627 only added a workaround for this issue. If a PR is a partial fix for one issue and a complete fix for another, it should have the partial label (to alert us that something is special about the PR) but it should only be connected to the issue for which it is a complete fix.

FYI: @theathorn, @achave11

@hannes-ucsc
Member

hannes-ucsc commented Sep 12, 2022

It turns out that the root cause is a regression introduced by the fix for hashicorp/terraform-provider-aws#23. Before that fix, a modification of the user_data property of the aws_instance resource resulted in a recreation of the instance. After that fix, which was included in the 4.2.0 release of the AWS provider, the same modification resulted only in a stop/start cycle. Cloud-init runs write_files once per instance (not per boot), so restarting the instance does not cause any files to be rewritten.

hashicorp/terraform-provider-aws#23315 fixes the regression by adding an attribute with which users can revert to the pre-4.2.0 behavior. It's included in the 4.7.0 release.
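For illustration, a minimal sketch of how the template might opt back in to recreation. To my knowledge the attribute in question is `user_data_replace_on_change` on the `aws_instance` resource. The resource name `gitlab`, the helper function and the version constraint block are assumptions, and all other instance arguments are omitted.

```python
import json


def gitlab_config(user_data: str) -> dict:
    """Return a Terraform JSON fragment for an instance that is recreated
    whenever its user data changes (other required arguments are omitted)."""
    return {
        'terraform': {
            'required_providers': {
                # the attribute below needs AWS provider 4.7.0 or later
                'aws': {'source': 'hashicorp/aws', 'version': '>= 4.7.0'}
            }
        },
        'resource': {
            'aws_instance': {
                'gitlab': {  # hypothetical resource name
                    'user_data': user_data,
                    # restore the pre-4.2.0 behavior: destroy and recreate
                    # the instance instead of only stopping and starting it
                    'user_data_replace_on_change': True,
                }
            }
        },
    }


if __name__ == '__main__':
    print(json.dumps(gitlab_config('#cloud-config\n'), indent=4))
```

With the attribute set, changing the image tag in the user data should lead Terraform to plan a replacement of the instance, so cloud-init runs its per-instance modules again on the new instance.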

@hannes-ucsc
Member

Think about how to best demo this.

@dsotirho-ucsc dsotirho-ucsc removed the no demo [process] Not to be demonstrated at the end of the sprint label Sep 21, 2022
@hannes-ucsc
Member

For demo, update the GitLab version on dev. It should not require any manual action.

@hannes-ucsc hannes-ucsc added the demo [process] To be demonstrated at the end of the sprint label Oct 11, 2022
@melainalegaspi melainalegaspi added the demoed [process] Successfully demonstrated to team label Oct 18, 2022

6 participants