Migrate *-joyent machines to new Equinix system. OUTAGE required on 8th December #3108
Thanks @sxa. The machine roster is here https://deploy.equinix.com/product/servers/ and you'll pick an Intel version. If you're picking data centers, this capacity dashboard https://deploy.equinix.com/developers/capacity-dashboard/ is helpful.
Thank you - that question about which DC to use did come up in the call and I figured you'd potentially have some pointers on that - the capacity dashboard should be useful for @bahamat
Update: Most of the migration is complete and the test SmartOS machines' new IPs have been added so that they can get through the firewall to the CI server, which is now running through the (fairly small) backlog. I've adjusted the comments next to the machines in the firewall configuration to have
That's only true for the smartos release machines. We're using the ubuntu1804 docker container to cross-compile for armv7l -- I've updated the firewall on ci-release but the machine is one of those yet to be migrated.
It looks like the smartos builds are currently broken 😞
Also failing on the smartos18 machines in similar fashion for Node.js 14 builds as well as the builds for the main branch and pull requests.
Looks like this is a result of the SmartOS upgrade on the host (global zone) and the fact that all of the SmartOS local zones inherit
and then later on:
But the new one has :
and later:
So we've lost the madvise definition with caddr_t as the first parameter, which is almost certainly causing this compile failure. This may need a V8 patch unless we tweak the header files on the host to get around the immediate problem. I shall leave it in this state pending advice from @bahamat
FYI full
It looks like this is a V8 issue where:

```cpp
#if defined(V8_OS_SOLARIS)
#if (defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE > 2) || defined(__EXTENSIONS__)
extern "C" int madvise(caddr_t, size_t, int);
#else
extern int madvise(caddr_t, size_t, int);
#endif
#endif
```

There was a PR opened on V8's GitHub mirror for this but it was closed and it looks like it was not upstreamed: v8/v8#37
SmartOS host has been put back to the old version in order to get the 'known good' headers in the global zone and therefore the inherited
I've updated the DNS entries for grafana and unencrypted in CloudFlare with the new IP addresses (although unencrypted is still being migrated so currently is down but DNS now points to the new address).
This migration is complete. I don't have access to Jenkins (or at least don't know where it is) so I can't confirm all the nodes are connected. If someone can verify this for me, then I think we can resolve this issue.
Rename the machines to more accurately reflect where those machines are hosted (the MNX owned Equinix Metal account). Refs: nodejs#3108
I believe all the Jenkins agents have reconnected. I've opened a PR to update the IP addresses in the Ansible inventory. I've also renamed the test machines from "joyent" to "equinix_mnx". We probably want to rename the release and infra machines as well but I'm not going to have time to do that this year (today is my last working day of 2022). Have we confirmed with Equinix whether the nc-backup-01 host is okay where it is or if it also needs to migrate? According to the web console it's in DA - DFW2 and I thought all three-letter data facilities were supposed to be closed at the end of November.
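Since the PR just swaps addresses in the inventory, the update can be sketched as a plain substitution. This is illustrative only: the temporary file and the `backup` host alias are placeholders, though the two addresses are the backup machine's old and new IPs discussed in this thread.

```shell
# Hypothetical sketch: update one host's address in an Ansible-style inventory.
# The filename and host alias are made up; only the IPs come from this thread.
INVENTORY=$(mktemp)
echo 'backup ansible_host=139.178.83.227' > "$INVENTORY"
# swap the old Equinix address for the new one
sed -i 's/139\.178\.83\.227/147.28.183.83/' "$INVENTORY"
cat "$INVENTORY"
```

The same one-liner applies to any other migrated host once its new address is known.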
I’ll see what I can find out.
Confirmed, dfw2 is also shutting down. I'll get nc-backup-01 migrated to da11.
@bahamat what is the old IP of nc-backup-01? I think it would be 139.178.83.227. Want to confirm which machine it corresponds to. If it is 139.178.83.227 then that machine I think mostly pulls from other machines to do backups. In that case we may not need to configure IPs other than updating in the inventory, but it might affect known_hosts on the machines it connects to. We'd want to validate that it can still connect to the other machines after the move.
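When a host's address changes, stale `known_hosts` entries on its peers can be cleared with `ssh-keygen -R`. A minimal sketch, using a throwaway file and a dummy host key rather than a real `~/.ssh/known_hosts`:

```shell
# Sketch only: drop a stale known_hosts entry for the old backup IP.
# The key below is a structurally valid but dummy placeholder, not a real key.
KNOWN_HOSTS=$(mktemp)
echo '139.178.83.227 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' > "$KNOWN_HOSTS"
# remove the entry for the old address; a real run would then pick up the
# new host key on the first connection to the new IP (or via ssh-keyscan)
ssh-keygen -R 139.178.83.227 -f "$KNOWN_HOSTS"
```

Note that `ssh-keygen -R` keeps a backup of the original file as `known_hosts.old`.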
@mhdawson I believe nc-backup-01 is backup (139.178.83.227) based on earlier discussions with @bahamat and @sxa and I also believe you are correct that the machine is pulling from other machines.
We will also need to update the firewall on the download server (www/direct.nodejs.org) so that backup can connect to it.
Yeah, that's the right IP. In that case, I'll do the final sync and start up the new one. I'll let you know when it's up.
OK, all finished. The new IPs are:
The old one is stopped but not destroyed yet. Once you can confirm that the new one is working as intended, I'll destroy the old one.
@richardlau I think you mentioned you were going to look at this?
I've added 147.28.183.83 to the firewall on the www server (so the backup machine can rsync to it). However I don't seem to be able to ssh into 147.28.183.83 to verify if the backups are running.
Thanks to @bahamat for fixing the network interfaces on the new backup machine. I've been able to log into it and AFAICT the backups ran, so I think we're good with the replacement.
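A quick way to spot-check that backups ran after a migration like this is to look for files written within the last day. A sketch, using a temporary directory as a stand-in for the real backup root (which isn't named in this thread):

```shell
# Placeholder backup tree; on the real machine this would be the backup root.
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/snapshot-2022-12.tar"
# list anything modified in the last 24 hours; empty output would be a red flag
find "$BACKUP_DIR" -type f -mtime -1
```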
@richardlau @bahamat - following up here, is this work completed to the point where this issue can be closed?
Yes, I think so.
@bahamat I see that the last old s1 storage system was "powered off" last week - if you are completely ready then you (or I) can "delete" that system and we'll really be done.
OK, all set.
Confirmed! Thanks @bahamat - I think this issue can be closed out.
Spinning this out of #3104 since the existing machines are currently back online temporarily.
In February 2021 some of our systems were migrated from Joyent's data centers to Equinix using an account managed by Joyent team members, which was separate from our existing one at Equinix.
Recently it became apparent that some of those were hosted in the Equinix data centers which were due to be shut down at the end of November. After a call today between myself, @richardlau and @bahamat we now have a good understanding of where we are and how to move forward.
To summarise where we currently are: there are two systems hosted on the account managed by Joyent. Both are SmartOS hosts with virtual images inside them. One of these hosts is called `nc-backup-01` and only contains one VM - the backup server, which is SmartOS 15.4 and is in the DFW2 data center. The second host is called `nc-compute-01` and it contains all the other systems referenced in #3104 - some are KVM instances, some are SmartOS zones and one is an lx-branded zone. The details and breakdown are as follows:

The infrastructure team's public ssh key has now been put onto both SmartOS hosts so that those team members can access the systems. @richardlau and @sxa have also been invited to co-administer the Equinix instance hosting these two in case any recovery of the hosts is required, and to hopefully set up to receive notifications.
We explored a few potential options:
Given that option 3 was feasible and solved the immediate problem with Equinix attempting to shut down their data centers, we have chosen that one and we intend to start migrating the systems tomorrow (evening UTC). @bahamat will handle provisioning the replacement SmartOS host on Equinix and migrating the images across. This will result in an outage on these systems while the migration takes place. The new server will need to be Intel rather than AMD to support SmartOS' KVM implementation.
We will also aim to rename these so they do not have `joyent` in the name since they are now hosted at Equinix (likely using a new `equinix_mnx` provider name to indicate that it's hosted separately from our other Equinix systems).

FYI @nodejs/build @nodejs/build-infra @vielmetti