Migrate *-joyent machines to new Equinix system. OUTAGE required on 8th December #3108
Thanks @sxa. The machine roster is here https://deploy.equinix.com/product/servers/ and you'll pick an Intel version. If you're picking data centers, this capacity dashboard https://deploy.equinix.com/developers/capacity-dashboard/ is helpful.
Thank you - that question about which DC to use did come up in the call and I figured you'd potentially have some pointers on that - the capacity dashboard should be useful for @bahamat
Update: Most of the migration is complete and the test SmartOS machines' new IPs have been added so that they can get through the firewall to the CI server, which is now running through the (fairly small) backlog. I've adjusted the comments next to the machines in the firewall configuration to have
That's only true for the smartos release machines. We're using the ubuntu1804 docker container to cross-compile for armv7l -- I've updated the firewall on ci-release but the machine is one of those yet to be migrated.
It looks like the smartos builds are currently broken 😞
Also failing on the smartos18 machines in similar fashion for Node.js 14 builds as well as the builds for the main branch and pull requests.
Looks like this is a result of the SmartOS upgrade on the host (global zone) and the fact that all of the SmartOS local zones inherit
and then later on:
But the new one has :
and later:
So we've lost the madvise definition with caddr_t as the first parameter, which is almost certainly causing this compile failure. This may need a V8 patch unless we tweak the header files on the host to get around the immediate problem. I shall leave it in this state pending advice from @bahamat
FYI full
It looks like this is a V8 issue where:

```cpp
#if defined(V8_OS_SOLARIS)
#if (defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE > 2) || defined(__EXTENSIONS__)
extern "C" int madvise(caddr_t, size_t, int);
#else
extern int madvise(caddr_t, size_t, int);
#endif
#endif
```

There was a PR opened on V8's GitHub mirror for this but it was closed and it looks like it was not upstreamed: v8/v8#37
SmartOS host has been put back to the old version in order to get the 'known good' headers in the global zone and therefore the inherited
I've updated the DNS entries for grafana and unencrypted in CloudFlare with the new IP addresses (although unencrypted is still being migrated so currently is down but DNS now points to the new address).
This migration is complete. I don't have access to Jenkins (or at least don't know where it is) so I can't confirm all the nodes are connected. If someone can verify this for me, then I think we can resolve this issue.
Rename the machines to more accurately reflect where those machines are hosted (the MNX owned Equinix Metal account). Refs: nodejs#3108
I believe all the Jenkins agents have reconnected. I've opened a PR to update the IP addresses in the Ansible inventory. I've also renamed the test machines from "joyent" to "equinix_mnx". We probably want to rename the release and infra machines as well but I'm not going to have time to do that this year (today is my last working day of 2022). Have we confirmed with Equinix whether the nc-backup-01 host is okay where it is or if it also needs to migrate? According to the web console it's in DA - DFW2 and I thought all three-letter data facilities were supposed to be closed at the end of November.
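Since the PR just swaps addresses in the inventory, the update can be sketched as a plain substitution. This is illustrative only: the temporary file and the `backup` host alias are placeholders, though the two addresses are the backup machine's old and new IPs discussed in this thread.

```shell
# Hypothetical sketch: update one host's address in an Ansible-style inventory.
# The filename and host alias are made up; only the IPs come from this thread.
INVENTORY=$(mktemp)
echo 'backup ansible_host=139.178.83.227' > "$INVENTORY"
# swap the old Equinix address for the new one
sed -i 's/139\.178\.83\.227/147.28.183.83/' "$INVENTORY"
cat "$INVENTORY"
```

The same one-liner applies to any other migrated host once its new address is known.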
I’ll see what I can find out.
Confirmed, dfw2 is also shutting down. I'll get nc-backup-01 migrated to da11.
@bahamat what is the old IP of nc-backup-01? I think it would be 139.178.83.227. Want to confirm which machine it corresponds to. If it is 139.178.83.227 then that machine I think mostly pulls from other machines to do backups. In that case we may not need to configure IPs other than updating in the inventory, but it might affect known_hosts on the machines it connects to. We'd want to validate that it can still connect to the other machines after the move.
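When a host's address changes, stale `known_hosts` entries on its peers can be cleared with `ssh-keygen -R`. A minimal sketch, using a throwaway file and a dummy host key rather than a real `~/.ssh/known_hosts`:

```shell
# Sketch only: drop a stale known_hosts entry for the old backup IP.
# The key below is a structurally valid but dummy placeholder, not a real key.
KNOWN_HOSTS=$(mktemp)
echo '139.178.83.227 ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' > "$KNOWN_HOSTS"
# remove the entry for the old address; a real run would then pick up the
# new host key on the first connection to the new IP (or via ssh-keyscan)
ssh-keygen -R 139.178.83.227 -f "$KNOWN_HOSTS"
```

Note that `ssh-keygen -R` keeps a backup of the original file as `known_hosts.old`.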
@mhdawson I believe nc-backup-01 is backup (139.178.83.227) based on earlier discussions with @bahamat and @sxa and I also believe you are correct that the machine is pulling from other machines.
We will also need to update the firewall on the download server (www/direct.nodejs.org) so that backup can connect to it.
Yeah, that's the right IP. In that case, I'll do the final sync and start up the new one. I'll let you know when it's up.
OK, all finished. The new IPs are:
The old one is stopped but not destroyed yet. Once you can confirm that the new one is working as intended, I'll destroy the old one.
@richardlau I think you mentioned you were going to look at this?
I've added 147.28.183.83 to the firewall on the www server (so the backup machine can rsync to it). However I don't seem to be able to ssh into 147.28.183.83 to verify if the backups are running.
Thanks to @bahamat for fixing the network interfaces on the new backup machine. I've been able to log into it and AFAICT the backups ran, so I think we're good with the replacement.
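A quick way to spot-check that backups ran after a migration like this is to look for files written within the last day. A sketch, using a temporary directory as a stand-in for the real backup root (which isn't named in this thread):

```shell
# Placeholder backup tree; on the real machine this would be the backup root.
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/snapshot-2022-12.tar"
# list anything modified in the last 24 hours; empty output would be a red flag
find "$BACKUP_DIR" -type f -mtime -1
```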
@richardlau @bahamat - following up here, is this work completed to the point where this issue can be closed?
Yes, I think so.
@bahamat I see that the last old s1 storage system was "powered off" last week - if you are completely ready then you (or I) can "delete" that system and we'll really be done.
OK, all set.
Confirmed! Thanks @bahamat - I think this issue can be closed out.
Spinning this out of #3104 since the existing machines are currently back online temporarily.
In February 2021 some of our systems were migrated from Joyent's data centers to Equinix using an account managed by Joyent team members, which was separate from our existing one at Equinix.
Recently it became apparent that some of those were hosted in the Equinix data centers which were due to be shut down at the end of November. After a call today between myself, @richardlau and @bahamat we now have a good understanding of where we are and how to move forward.
To summarise where we currently are: there are two systems hosted on the account managed by Joyent. Both are SmartOS hosts with virtual images inside them. One of these hosts is called `nc-backup-01` and only contains one VM - the backup server, which is SmartOS 15.4 and is in the DFW2 data center. The second host is called `nc-compute-01` and it contains all the other systems referenced in #3104 - some are KVM instances, some are SmartOS zones and one is an lx-branded zone. The details and breakdown are as follows:

The infrastructure team's public ssh key has now been put onto both SmartOS hosts so that those team members can access the systems. @richardlau and @sxa have also been invited to co-administer the Equinix instance hosting these two in case any recovery of the hosts is required, and to hopefully set up to receive notifications.
We explored a few potential options:
Given that option 3 was feasible and solved the immediate problem with Equinix attempting to shut down their data centers, we have chosen that one and we intend to start migrating the systems tomorrow (evening UTC). @bahamat will handle provisioning the replacement SmartOS host on Equinix and migrating the images across. This will result in an outage on these systems while the migration takes place. The new server will need to be Intel rather than AMD to support SmartOS' KVM implementation.
We will also aim to rename these so they do not have `joyent` in the name since they are now hosted at Equinix (likely using a new `equinix_mnx` provider name to indicate that it's hosted separately from our other Equinix systems).

FYI @nodejs/build @nodejs/build-infra @vielmetti