
Migrate *-joyent machines to new Equinix system. OUTAGE required on 8th December #3108

Closed
sxa opened this issue Dec 7, 2022 · 27 comments

@sxa
Member

sxa commented Dec 7, 2022

Spinning this out of #3104 since the existing machines are currently back online temporarily.

In February 2021 some of our systems were migrated from Joyent's data centers to Equinix using an account managed by Joyent team members, which is separate from our existing Equinix account.

Recently it became apparent that some of those were hosted in Equinix data centers which were due to be shut down at the end of November. After a call today between myself, @richardlau and @bahamat we now have a good understanding of where we are and how to move forward.

To summarise where we currently are: there are two systems hosted on the account managed by Joyent. Both are SmartOS hosts with virtual images inside them. One of these hosts is called nc-backup-01 and contains only one VM - the backup server, which is SmartOS 15.4 and is in the DFW2 data center.

The second host is called nc-compute-01 and it contains all the other systems referenced in #3104 - some are KVM instances, some are SmartOS zones and one is an lx-branded zone. The details and breakdown are as follows:

[root@nc-compute-01 ~]# vmadm list
UUID                                  TYPE  RAM      STATE             ALIAS
0f85685d-0150-4f8f-e211-9ecee63e8b61  KVM   3840     running           test-joyent-ubuntu1604_arm_cross-x64-1
1cf77dcc-8a17-6c35-9132-83f55a8e058f  KVM   3840     running           test-joyent-ubuntu1804-x64-1
49f0a164-4e86-4fda-de73-abcf257587a0  KVM   3840     running           release-joyent-ubuntu1604_arm_cross-x64-1
356655a2-12e6-e1d7-ac7b-b5188ad37cb0  OS    4096     running           test-joyent-smartos20-x64-3
49089cfe-915f-c226-c697-a9faca6041f2  OS    4096     running           release-joyent-smartos20-x64-2
94f76b46-6d20-612c-84e1-92c0dc3bae69  OS    4096     running           release-joyent-smartos18-x64-2
c6e3d47a-1421-ee11-c52d-c3c80c198e95  OS    4096     running           test-joyent-smartos20-x64-4
d894f3c6-d09a-c9df-d7ae-b6f613d9b413  OS    4096     running           test-joyent-smartos18-x64-3
db3664d7-dd31-c233-cafb-df79efb9d069  OS    4096     running           test-joyent-smartos18-x64-4
d357fd3c-a929-cd9c-da35-ad53b53e2875  KVM   7936     running           release-joyent-ubuntu1804_docker-x64-1
feb21098-8101-66f6-f410-bd092952f84e  KVM   16128    running           infra-joyent-debian10-x64-1
12fa9eea-ba7a-4d55-abd9-d32c64ae1965  LX    32768    running           infra-joyent-ubuntu1604-x64-1-new

The infrastructure team's public ssh key has now been put onto both SmartOS hosts so that those team members can access the systems. @richardlau and @sxa have also been invited to co-administer the Equinix instance hosting these two in case any recovery of the hosts is required, and hopefully to be set up to receive notifications.

We explored a few potential options:

  1. Provision a new machine for the SmartOS systems and migrate the others to our existing Equinix account
  2. Provision a new machine and reprovision all the servers from scratch
  3. Provision a new machine and migrate the existing instances across

Given that option 3 was feasible and solves the immediate problem of Equinix shutting down their data centers, we have chosen that one and intend to start migrating the systems tomorrow (evening UTC). @bahamat will handle provisioning the replacement SmartOS host on Equinix and migrating the images across. This will result in an outage on these systems while the migration takes place. The new server will need to be Intel rather than AMD to support SmartOS's KVM implementation.
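For anyone following along, moving an individual VM between SmartOS hosts is expected to look roughly like the sketch below. This is only an illustration: the target hostname is a placeholder, the UUID is the test-joyent-ubuntu1804-x64-1 instance from the list above, and @bahamat may well use a different mechanism for the actual migration.

# On nc-compute-01: stream the VM's dataset and configuration to the new host
# over ssh. The VM is stopped for the duration of the transfer, hence the outage.
vmadm send 1cf77dcc-8a17-6c35-9132-83f55a8e058f | ssh root@new-equinix-host vmadm receive

# Once the transfer completes, start the VM on the receiving host.
ssh root@new-equinix-host vmadm start 1cf77dcc-8a17-6c35-9132-83f55a8e058f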

We will also aim to rename these so they do not have joyent in the name since they are now hosted at Equinix (likely using a new equinix_mnx provider name to indicate that they are hosted separately from our other Equinix systems).

FYI @nodejs/build @nodejs/build-infra @vielmetti

@sxa sxa assigned richardlau and sxa Dec 7, 2022
@sxa sxa changed the title Migrate *-joyent machines to new Equinix system. OUTAGE required on 8the December Migrate *-joyent machines to new Equinix system. OUTAGE required on 8th December Dec 7, 2022
@vielmetti

Thanks @sxa. The machine roster is here: https://deploy.equinix.com/product/servers/ and you'll pick an Intel version.

If you're picking data centers, this capacity dashboard https://deploy.equinix.com/developers/capacity-dashboard/ is helpful.

@sxa
Member Author

sxa commented Dec 7, 2022

Thank you - that question about which DC to use did come up in the call and I figured you'd have some pointers on that. The capacity dashboard should be useful for @bahamat.

@vielmetti

@sxa @bahamat If you avoid the "medium" or "low" capacity data centers you'll be fine.

@sxa
Member Author

sxa commented Dec 9, 2022

Update: most of the migration is complete and the new IPs of the test SmartOS machines have been added so that they can get through the firewall to the CI server, which is now running through the (fairly small) backlog. I've adjusted the comments next to the machines in the firewall configuration to have equinix_mnx instead of joyent in the name to indicate which ones I have changed. I have not done anything for the release machines, which are currently unused. One of the infra- machines is quite large (3.2 TB) so it will take a while to transfer between the data centers.
NOTE: while adding the firewall rules I spotted some entries for what are presumably old machines that no longer exist, as they are not in our CI. I have created #3109 to clean those entries up later.
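For reference, the entries being adjusted follow roughly the pattern below, assuming an iptables-based firewall on the CI server. The address and port here are placeholders, not the real values.

# Allow a migrated test machine through to the CI server, labelled with the new
# equinix_mnx provider name so the changed entries are easy to spot.
# 203.0.113.10 and port 443 are placeholders for illustration only.
iptables -A INPUT -s 203.0.113.10 -p tcp --dport 443 \
  -m comment --comment "test-equinix_mnx-smartos20-x64-3" -j ACCEPT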

@richardlau
Member

I have not done anything for the release machines, which are currently unused.

That's only true for the SmartOS release machines. We're using the ubuntu1804 docker container to cross-compile for armv7l -- I've updated the firewall on ci-release but that machine is one of those yet to be migrated.

@richardlau
Member

It looks like the smartos builds are currently broken 😞
e.g. https://ci.nodejs.org/job/node-test-commit-smartos/nodes=smartos20-64/46914/console

10:23:41 ../deps/v8/src/base/platform/platform-posix.cc:80:16: error: conflicting declaration of C function 'int madvise(caddr_t, std::size_t, int)'
10:23:42    80 | extern "C" int madvise(caddr_t, size_t, int);
10:23:42       |                ^~~~~~~
10:23:42 In file included from ../deps/v8/src/base/platform/platform-posix.cc:18:
10:23:42 /usr/include/sys/mman.h:268:12: note: previous declaration 'int madvise(void*, std::size_t, int)'
10:23:42   268 | extern int madvise(void *, size_t, int);
10:23:42       |            ^~~~~~~
10:23:42 make[2]: *** [tools/v8_gypfiles/v8_libbase.target.mk:177: /home/iojs/build/workspace/node-test-commit-smartos/nodes/smartos20-64/out/Release/obj.target/v8_libbase/deps/v8/src/base/platform/platform-posix.o] Error 1
10:23:42 make[2]: *** Waiting for unfinished jobs....

The smartos18 machines are also failing in a similar fashion, for the Node.js 14 builds as well as the builds for the main branch and pull requests.

@sxa
Member Author

sxa commented Dec 9, 2022

Looks like this is a result of the SmartOS upgrade on the host (global zone) and the fact that all of the SmartOS local zones inherit /usr, which now has a modified /usr/include/sys/mman.h.
The old one had:

extern int madvise(caddr_t, size_t, int);

and then later on:

#if !defined(__XOPEN_OR_POSIX) || defined(_XPG6) || defined(__EXTENSIONS__)
extern int posix_madvise(void *, size_t, int);
#endif

But the new one has :

#if !defined(_STRICT_POSIX) || defined(_XPG6)
extern int posix_madvise(void *, size_t, int);
#endif

and later:

#if !defined(_STRICT_POSIX)
extern int mincore(caddr_t, size_t, char *);
extern int memcntl(void *, size_t, int, void *, int, int);
extern int madvise(void *, size_t, int);
[...]

So we've lost the madvise declaration with caddr_t as the first parameter, which is almost certainly causing this compile failure. This may need a V8 patch unless we tweak the header files on the host to get around the immediate problem. I shall leave it in this state pending advice from @bahamat.
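For anyone who wants to see the clash in isolation rather than in a full V8 build, something like the following should reproduce it on a zone with the updated /usr (the file path and compiler invocation are illustrative):

# Write a tiny C++ file mimicking V8's extern "C" declaration of madvise, then
# compile it. Against the new sys/mman.h (which declares madvise taking void *)
# this fails with the same "conflicting declaration" error seen in the CI build.
cat > /tmp/madvise-clash.cc <<'EOF'
#include <sys/types.h>   // caddr_t
#include <sys/mman.h>    // new header: extern int madvise(void *, size_t, int);
extern "C" int madvise(caddr_t, size_t, int);  // what platform-posix.cc declares
int main() { return 0; }
EOF
g++ -c /tmp/madvise-clash.cc -o /tmp/madvise-clash.o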

@sxa
Member Author

sxa commented Dec 9, 2022

FYI full diff of mman.h between the old and new systems: mmem.h.diff.txt.gz

@richardlau
Member

richardlau commented Dec 9, 2022

So we've lost the madvise declaration with caddr_t as the first parameter, which is almost certainly causing this compile failure. This may need a V8 patch unless we tweak the header files on the host to get around the immediate problem. I shall leave it in this state pending advice from @bahamat.

It looks like this is a V8 issue where madvise is being declared specifically for V8_OS_SOLARIS:
https://github.com/v8/v8/blob/458cda96fe5db5bded922caa80ed304ad8be2a72/src/base/platform/platform-posix.cc#L78-L84

#if defined(V8_OS_SOLARIS)
#if (defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE > 2) || defined(__EXTENSIONS__)
extern "C" int madvise(caddr_t, size_t, int);
#else
extern int madvise(caddr_t, size_t, int);
#endif
#endif

There was a PR opened on V8's GitHub mirror for this but it was closed and it looks like it was not upstreamed: v8/v8#37

@sxa
Member Author

sxa commented Dec 9, 2022

The SmartOS host has been put back to the old version in order to get the 'known good' headers in the global zone, and therefore in the inherited /usr in the local zones. Seeing whether the patch can be backported to all relevant V8 release lines can be done separately.

@richardlau
Member

I've updated the DNS entries for grafana and unencrypted in Cloudflare with the new IP addresses (unencrypted is still being migrated, so it is currently down, but DNS now points to the new address).
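Once the records have propagated, a quick sanity check is something like the following (hostnames assumed to sit under nodejs.org for illustration; adjust to the real record names):

# Confirm the updated records resolve to the new addresses.
dig +short grafana.nodejs.org
dig +short unencrypted.nodejs.org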

@bahamat

bahamat commented Dec 15, 2022

This migration is complete. I don't have access to Jenkins (or at least don't know where it is) so I can't confirm all the nodes are connected.

If someone can verify this for me, then I think we can resolve this issue.

richardlau added a commit to richardlau/build that referenced this issue Dec 16, 2022
richardlau added a commit to richardlau/build that referenced this issue Dec 16, 2022
Rename the machines to more accurately reflect where those machines
are hosted (the MNX owned Equinix Metal account).

Refs: nodejs#3108
richardlau added a commit to richardlau/build that referenced this issue Dec 16, 2022
Rename the machines to more accurately reflect where those machines
are hosted (the MNX owned Equinix Metal account).

Refs: nodejs#3108
@richardlau
Member

I believe all the Jenkins agents have reconnected. I've opened a PR to update the IP addresses in the Ansible inventory. I've also renamed the test machines from "joyent" to "equinix_mnx". We probably want to rename the release and infra machines as well, but I'm not going to have time to do that this year (today is my last working day of 2022).

Have we confirmed with Equinix whether the nc-backup-01 host is okay where it is or if it also needs to migrate? According to the web console it's in DA - DFW2 and I thought all of the three-letter facilities were supposed to be closed at the end of November.

@bahamat

bahamat commented Dec 16, 2022

I’ll see what I can find out.

@bahamat

bahamat commented Dec 17, 2022

Confirmed, dfw2 is also shutting down. I'll get nc-backup-01 migrated to da11.

@mhdawson
Member

@bahamat what is the old IP of nc-backup-01? I think it would be 139.178.83.227. I want to confirm which machine it corresponds to. If it is 139.178.83.227 then I think that machine mostly pulls from other machines to do backups.

In that case we may not need to configure IPs other than updating the inventory, but it might affect known_hosts on the machines it connects to. We'd want to validate that it can still connect to the other machines after the move.
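If a host key mismatch does show up anywhere after the move, clearing the stale entry and reconnecting once is usually enough. A sketch, with a placeholder hostname:

# Remove the stale known_hosts entry for the affected host (placeholder name),
# then reconnect once to verify and accept the current host key.
ssh-keygen -R backup-target.example.org
ssh backup-target.example.org true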

@richardlau
Member

@mhdawson I believe nc-backup-01 is backup (139.178.83.227) based on earlier discussions with @bahamat and @sxa and I also believe you are correct that the machine is pulling from other machines.

In that case we may not need to configure IPs other than updating the inventory, but it might affect known_hosts on the machines it connects to. We'd want to validate that it can still connect to the other machines after the move.

We will also need to update the firewall on the download server (www/direct.nodejs.org) so that backup can connect to it.

@bahamat

bahamat commented Dec 22, 2022

Yeah, that's the right IP.

In that case, I'll do the final sync and start up the new one. I'll let you know when it's up.

@bahamat

bahamat commented Dec 22, 2022

OK, all finished. The new IPs are:

  • 147.28.183.83
  • 2604:1380:4502:3500::3

The old one is stopped but not destroyed yet. Once you can confirm that the new one is working as intended, I'll destroy the old one.
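A quick way to confirm the new machine is reachable on both addresses would be something like the following (the ssh user is a placeholder, use whatever is normal for this host):

# Basic reachability checks against the new backup server addresses.
ping -c 3 147.28.183.83
ping6 -c 3 2604:1380:4502:3500::3
# Confirm ssh answers on the new IPv4 address.
ssh -o ConnectTimeout=10 root@147.28.183.83 uptime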

@mhdawson
Member

mhdawson commented Jan 3, 2023

@richardlau I think you mentioned you were going to look at this?

@richardlau
Member

I've added 147.28.183.83 to the firewall on the www server (so the backup machine can rsync to it). However, I don't seem to be able to ssh into 147.28.183.83 to verify whether the backups are running.

@richardlau
Member

Thanks to @bahamat for fixing the network interfaces on the new backup machine. I've been able to log into it and AFAICT the backups ran, so I think we're good with the replacement.

@vielmetti

@richardlau @bahamat - following up here, is this work completed to the point where this issue can be closed?

@bahamat

bahamat commented Jan 17, 2023

Yes, I think so.

@vielmetti

@bahamat I see that the last old s1 storage system was "powered off" last week - if you are completely ready then you (or I) can "delete" that system and we'll really be done.

@bahamat

bahamat commented Jan 24, 2023

OK, all set. f88f2ae4-52dd-4613-b123-262d47bf5d2c | nc-backup-01 has been deleted.

@vielmetti

Confirmed! Thanks @bahamat - I think this issue can be closed out.

richardlau added a commit to richardlau/build that referenced this issue Jan 30, 2023
richardlau added a commit to richardlau/build that referenced this issue Jan 30, 2023
Rename the machines to more accurately reflect where those machines
are hosted (the MNX owned Equinix Metal account).

Refs: nodejs#3108
richardlau added a commit that referenced this issue Jan 30, 2023
Rename the machines to more accurately reflect where those machines
are hosted (the MNX owned Equinix Metal account).

Refs: #3108