(mirrors.jenkins.io / mirrors.jenkins-ci.org) Sunset the legacy "mirrorbrain" service in favor of get.jenkins.io #2888
Ping @daniel-beck @MarkEWaite @olblak @lemeurherve @timja @halkeye @jnord @jglick for info, review and advice (in case I forgot anyone).
The evergreen plugin should be archived; the rest of the usages are pretty much documentation anyway.
Are they still able to? Or will we be killing that access path?
What do you think about just deprecating this DNS record? Officially it's not used anymore. Or use it as the k8s cluster fallback: you would cleanly deploy mirrorbits on that machine (pkg.jenkins.io), so if something goes wrong with the k8s cluster, you still have it working. By the way, you may have noticed that we have a mirrorbits binary in the /opt directory that we used multiple times in the past to mitigate cluster downtime.
Thanks for the tip! It confirms that what we did in #2040 was correct. For information, https://github.com/jenkins-infra/evergreen is marked as an "archived" repository.
They are still able to, and we'll kill this access path as it implies forcing a redirect to HTTPS. If mirrors.jenkins.io or mirrors.jenkins-ci.org is used to download any file (war, plugin, or package), then it is HTTP only (there is no vhost for these domains at all, no certificates, and HTTPS defaults to https://pkg.origin.jenkins.io/ - with an expected TLS security alert for the domain mismatch).
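To illustrate what that looks like from a client, a quick check along these lines could be used (a sketch, not the exact commands run on the infrastructure):

```console
# Plain HTTP still answers (no dedicated vhost or certificate for these legacy domains)
$ curl --head --location http://mirrors.jenkins-ci.org/

# HTTPS falls through to the default vhost (pkg.origin.jenkins.io), so the presented
# certificate does not match the requested host and curl reports a TLS error
$ curl --head https://mirrors.jenkins-ci.org/
```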
Thanks for the tip! You know that I like deleting things ;) But it might be a bit too harsh to kill this domain. Using a CNAME to get.jenkins.io would allow a smooth transition. Once we have tracked as many usages as we can (such as code in the jenkinsci GH org) and switched them to get.jenkins.io, then we can track access for 2-3 months to see what usage remains and decide on killing it maybe at that time.
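A hedged sketch of how that access tracking could be done on the VM itself (the log path is an assumption, and the legacy hostname must appear in the log lines for this to work):

```console
# Rough per-file count of requests still referencing the legacy mirror domains
$ zgrep -c -E 'mirrors\.jenkins(\.io|-ci\.org)' /var/log/apache2/access.log*
```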
Good reminder! That will be the next subject. The current get.jenkins.io, which is Kubernetes cluster-wide, is still more available than the …
Opened the PR jenkins-infra/pipeline-library#374 in the shared library, and notified the dev mailing list by email: https://groups.google.com/g/jenkinsci-dev/c/anTCx9Q6mLI
Thanks @MarkEWaite and @timja for jenkinsci/jep#386 in this area!
Ref. jenkins-infra/helpdesk#2888 This change also uses long flags for `curl` and shows if an error happens during the download (easier to diagnose)
Another PR on the PCT: jenkinsci/plugin-compat-tester#363
Other references found in the github.com/jenkinsci organization are not worth the changes (READMEs or deprecated projects such as evergreen).
As per @MarkEWaite's messages in the #jenkins-infra IRC channel:
Opening a maintenance window on status.jenkins.io: jenkins-infra/status#157
Resized the root volume from 1000 to 1200 GB:
The file system was automatically resized:

```console
$ df -hT / # Right after reboot
Filesystem     Type  Size  Used  Avail Use% Mounted on
/dev/xvda1     ext4  1.2T  811G  323G  72%  /
```
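For the record, this kind of resize is usually done along these lines (a sketch only; the volume ID is a placeholder, and the filesystem grow step was not needed here since it happened automatically):

```console
# Grow the EBS volume to 1200 GB (volume ID is a placeholder)
$ aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 1200 --region us-east-1

# Then grow the partition and the ext4 filesystem if it does not happen on its own
$ growpart /dev/xvda 1
$ resize2fs /dev/xvda1
```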
Failed to change the instance size. Today, we are using an instance with the following CPU configuration:

```console
$ cat /proc/cpuinfo | grep Xeon | sort | uniq
model name	: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
$ grep -c processor /proc/cpuinfo
8
```

The idea was to try to migrate to a new instance size that would benefit from:
Check the following table to compare instance types, with the following rules:
Alas, each attempt to change the instance type ended up with an error message "configuration not documented" when starting the instance. Enabling the "Enhanced Networking Adapter" did not change anything (but it is enabled now):

```console
$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[]
$ aws ec2 modify-instance-attribute --instance-id i-e0968e19 --ena-support --region us-east-1
$ aws ec2 describe-instances --instance-id i-e0968e19 --query "Reservations[].Instances[].EnaSupport" --region us-east-1 | jq -r '.'
[
  true
]
```

Let's keep this instance size for now: the AMI snapshot could be used to try creating a new instance, but better to put our effort into #2649.
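For reference, the attempts were roughly of this shape (a sketch; the target instance type below is an assumption, and it is the start step that failed with the error above):

```console
# Stop the instance, change its type, then start it again
$ aws ec2 stop-instances --instance-ids i-e0968e19 --region us-east-1
$ aws ec2 modify-instance-attribute --instance-id i-e0968e19 \
    --instance-type "{\"Value\": \"m5.2xlarge\"}" --region us-east-1
$ aws ec2 start-instances --instance-ids i-e0968e19 --region us-east-1
```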
While trying a short-term workaround for the high CPU usage on this machine, I stumbled across the following error message in the Apache error logs:

```
AH00632: failed to prepare SQL statements: ERROR: relation "pfx2asn" does not exist
LINE 1: ...EPARE asn_dbd_1 (varchar) AS SELECT pfx, asn FROM pfx2asn WH...
```

This error is related to the `mod_asn` Apache module.
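A hedged way to confirm what Apache is complaining about, using the mirrorbrain database name that appears later in this issue:

```console
# Check whether the pfx2asn relation exists in the mirrorbrain database
$ sudo -u postgres psql --dbname=jenkins_mirrorbrain_db --command='\dt pfx2asn'
```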
But this machine is a mess: there were 3 different PostgreSQL server installations, each one on a different port:
Since this VM has not been managed by Puppet for some time, the following operations were done manually:
```console
# Ensure postgresql 10 is installed properly
$ apt-get -y install postgresql-10
$ dpkg --get-selections | grep postgresql # Sanity check

# Migrate the actual 9.3 cluster named `main` to version 10 with the same name
$ pg_lsclusters
$ pg_renamecluster 10 main main_ver10
$ pg_lsclusters # Sanity check
$ systemctl stop postgresql@9.3-main
$ pg_upgradecluster 9.3 main # Restarts the instance once done
$ pg_lsclusters # Sanity check

## Cleanup
$ pg_dropcluster --stop 9.3 main
$ pg_dropcluster --stop 10 main_ver10
$ pg_dropcluster --stop 9.5 main
$ apt-get remove --purge postgresql-9.3 postgresql-client-9.3 postgresql-9.5 postgresql-client-9.5
$ dpkg --get-selections | grep postgresql # Sanity check

# Ensure ip4r is installed properly
$ apt-get -y install postgresql-contrib postgresql-10-ip4r

# Create the extension in the pgsql instance, as the Pg superuser
$ su - postgres
$ psql # Top-level
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q
$ psql --dbname=jenkins_mirrorbrain_db # On the mirrorbrain database
# \dx
# CREATE EXTENSION ip4r ;
# \dx
# \q

# Load the ASN script, now that the primitive type `iprange` is provided by the ip4r extension
$ psql --host=localhost --username=jenkins_mirrorbrain --password --dbname=jenkins_mirrorbrain_db --file=/usr/share/doc/libapache2-mod-asn/asn.sql
password: <redacted>

# Ensure everything is loaded and available
$ apt update && apt-get dist-upgrade && apt-get autoremove --purge && update-grub && reboot
$ tail -f /var/log/apache2/*log
```
Another error in the Apache log, but no solution for now:
Sounds related to https://www.claudiokuenzler.com/blog/948/apache-2.4-mpm-event-bug-freezing-up-scoreboard-full-after-reload (yes, we are using MPM event, and the …). In order to help in this area, installed sysstat to provide finer-grained metrics.
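A sketch of how sysstat is typically enabled and queried on a Debian/Ubuntu VM (packaging details are assumptions; the sampling interval visible below was clearly tightened compared to the default cron schedule):

```console
# Install and enable the sysstat collector
$ apt-get -y install sysstat
$ sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
$ systemctl restart sysstat

# Then look at the collected CPU utilization history
$ sar -u
```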
It appears that there are peaks of CPU usage (sysstat/`sar` output, values in %):

```
              CPU   %user  %nice  %system  %iowait  %steal  %idle
10:00:01 AM   all    8.87   0.00     3.10     0.06    0.12  87.86
10:02:01 AM   all   28.04   0.00     4.75     0.27    0.14  66.79
10:04:01 AM   all   26.21   0.00     4.87     0.15    0.21  68.55
10:06:01 AM   all   33.32   0.00    12.11     0.13    1.68  52.77
10:08:01 AM   all   30.51   0.00    11.64     0.08    1.68  56.08
10:10:01 AM   all   27.46   0.00    13.96     0.05    1.72  56.81
10:12:01 AM   all   30.69   0.00    13.89     0.11    1.66  53.66
10:14:01 AM   all   30.90   0.00    11.48     0.11    1.69  55.82
10:16:01 AM   all   27.94   0.00    13.86     0.08    1.71  56.41
10:18:01 AM   all   29.40   0.00    14.48     0.07    1.66  54.39
10:20:01 AM   all   27.84   0.00    13.03     0.06    1.72  57.35
10:22:01 AM   all   23.33   0.00     4.35     0.14    0.23  71.96
10:24:01 AM   all   21.31   0.00     3.50     0.06    0.11  75.01
```

We might check the configuration history:
Starting maintenance on the VM:
In parallel, jenkins-infra/status#166 was opened to prepare Puppet so we can put this machine under automatic Puppet management again.
Ran the following command on the VM (after snapshotting and backing up the postgres data): … followed by the autoremove. Also manually removed all the Apache vhost configurations (after backing them up) related to the domains mirrors.jenkins* or get.jenkins.io. Still some Apache config to clean up.
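For reference, a hedged sketch of how those vhosts can be disabled with the standard Debian Apache tooling (the site names below are assumptions, not the actual configuration file names on the VM):

```console
# Disable the legacy vhosts, validate the remaining configuration, then reload Apache
$ a2dissite mirrors.jenkins.io.conf mirrors.jenkins-ci.org.conf
$ apachectl configtest
$ systemctl reload apache2
```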
Just in case: backups of the Apache2 etc and var directories are in /root if anything breaks, and there is a snapshot of the VM root volume in AWS.
Summary of the past days:
A lot of people helped, and I'm really glad for it! Next step:
Yet another incident due to this issue: #2960
Closing as the incidents seem to be gone (all of them).
(jenkins-infra#374)
* feat(infra) switch to the new mirror system in HTTPS - jenkins-infra/helpdesk#2888
* cleanup(runAth) remove unused mirror variable
* chore(README) typos
Signed-off-by: Damien Duportal <[email protected]>
Service(s)

Update center, Other

Summary

What Happened

For the past 4 weeks, the infra team has received the following PagerDuty alert multiple times a day: "Weird Response time https://updates.jenkins-ci.org".

The alerts are triggered by a threshold in the Datadog metrics collection for this service: https://github.com/jenkins-infra/docker-datadog/blob/main/conf.d/http_check.d/jenkins.yaml#L137-L148.
As shown in the screenshots, it means that the average HTTP response time rises past 10s most of the time when the alert is triggered.
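For context, the same symptom can be reproduced by hand with curl (a sketch; the exact URL probed by the Datadog check is an assumption):

```console
# Measure the total HTTP response time of the update center, as the Datadog check does
$ curl --silent --output /dev/null --write-out 'HTTP %{http_code} in %{time_total}s\n' \
    https://updates.jenkins-ci.org/update-center.json
```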
Most of the time, the alert acknowledges itself as the response time decreases. Sometimes, the person on duty (@MarkEWaite or I) has to SSH to the machine pkg.origin.jenkins.io and restart the Apache server (rebooting the machine would be the last option).

Root cause

The (legacy) service referenced as `mirrorbrain` (hosting the services mirrors.jenkins.io and mirrors.jenkins-ci.org), also hosted on this VM, causes peaks of CPU usage which slow down the other service, updates.jenkins.io.

Details on the configuration as code:
Puppet configuration audit trail:
- VM definition: https://github.com/jenkins-infra/jenkins-infra/blob/production/manifests/site.pp#L119-L122
- This VM has the role `mirrorbrain`: https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/role/manifests/mirrorbrain.pp#L4-L7
- This role is composed of 4 profiles:
  - `base` (as all VMs managed by Puppet): https://github.com/jenkins-infra/jenkins-infra/blob/7c6d6609b650f1ef209cd590dd4568bcc676514c/dist/profile/manifests/base.pp
  - `mirrorbrain` (which defines the mirrors.jenkins* services, that we want to sunset): mirrorbrain
  - `updatesite` (which defines the update center site, causing the alerts because it is slowed down): https://github.com/jenkins-infra/jenkins-infra/blob/production/dist/profile/manifests/updatesite.pp
  - `pkgrepo` (used to build and host the Jenkins packages for debian/centos/etc.; to be replaced later but not part of this issue: keep it for now)
- https://github.com/jenkins-infra/jenkins-infra/blob/production/manifests/site.pp#L119-L122
Proposal

Let's sunset the legacy service `mirrorbrain` in favor of the current get.jenkins.io, the modern mirror service based on mirrorbits!

Rationale:
- `mirrorbits` defaults to HTTPS, while `mirrorbrain` only supports plain old HTTP
- `mirrorbits` can scale horizontally and efficiently (Redis database, hosted in Kubernetes) and is updated regularly and automatically

Details about the mirrorbits service:
In order to NOT break end-user installations, the domains mirrors.jenkins.io and mirrors.jenkins-ci.org should be CNAMEs to the new `mirrorbits` system.
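Once such CNAMEs are in place, a quick check of the DNS side could look like this (a sketch; certificate coverage for the legacy names would still need to be handled on the mirrorbits side):

```console
# Verify that the legacy names now point at the mirrorbits-backed service
$ dig +short CNAME mirrors.jenkins.io
$ dig +short CNAME mirrors.jenkins-ci.org
```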
Known usages of the legacy mirror system
- `jenkinsci` organization (pipeline, scripts, docs): https://github.com/search?q=org%3Ajenkinsci+mirrors.jenkins.io&type=code

To Do List
- Add mirrors.jenkins.io and mirrors.jenkins-ci.org in the mirrorbits configuration (to ensure that it will always work, whatever DNS configuration we use)