
Reconfigurator: Add support for boundary NTP zones #6259

Merged: 8 commits into main on Aug 8, 2024

Conversation

@jgallagher (Contributor) commented Aug 7, 2024

This is almost entirely planner work; no executor work was required, because the executor already supports "zones with external networking", and from an execution point of view there's nothing special about boundary NTP beyond that.

I put this through a fairly thorough test on london; I'll put my notes from that in comments on the PR momentarily.

This builds on #6050 and should be merged after it.

@jgallagher (Contributor, Author):

Test setup on london:

  • clean-slate, install omicron main, go through rack setup
  • park the rack
  • update to this commit

After the rack comes back, we see the new chrony config line from #6050 in an internal NTP zone, but the new boundary-ntp DNS name doesn't resolve (as expected - we ran rack setup before that DNS name existed, and we haven't run the blueprint execution task yet):

root@oxz_ntp_b34d0c6f:~# tail -n 3 /etc/inet/chrony.conf
server 1f092e4c-eac5-408e-b421-a6b3ba8d2ce7.host.control-plane.oxide.internal iburst minpoll 0 maxpoll 4
server 6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal iburst minpoll 0 maxpoll 4
pool boundary-ntp.control-plane.oxide.internal iburst maxdelay 0.1 maxsources 16
root@oxz_ntp_b34d0c6f:~# host boundary-ntp.control-plane.oxide.internal
Host boundary-ntp.control-plane.oxide.internal not found: 3(NXDOMAIN)
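
(A gloss on those directives, per the chrony documentation: the two server lines are the static per-zone names RSS wrote at rack setup, with iburst requesting a burst of initial packets for fast first sync and minpoll 0 / maxpoll 4 bounding the polling interval to 1-16 seconds. The pool line is the new config from #6050: chrony resolves the name to multiple addresses, uses up to maxsources 16 of them, and re-resolves over time as the name's contents change; maxdelay 0.1 discards samples whose round-trip delay exceeds 0.1 seconds.)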

We can confirm the DNS contents via omdb:

root@oxz_switch1:~# omdb db dns names internal 1 2>/dev/null | grep -A 2 ntp
  _boundary-ntp._tcp                                 (records: 2)
      SRV  port   123 1f092e4c-eac5-408e-b421-a6b3ba8d2ce7.host.control-plane.oxide.internal
      SRV  port   123 6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal
--
  _internal-ntp._tcp                                 (records: 2)
      SRV  port   123 378be9b6-f88b-4a3e-9c1c-a601e2d1551f.host.control-plane.oxide.internal
      SRV  port   123 b34d0c6f-bc89-4e89-866e-5dd781dd596e.host.control-plane.oxide.internal

To get the new boundary-ntp DNS name, we need to get blueprint realization to run; we don't need a new blueprint, so we can just enable the current one:

root@oxz_switch1:~# omdb -w nexus blueprints target enable current
set target blueprint f4dd65b1-0df6-4723-ba40-9942051a5363 to enabled

We can see the new DNS config was pushed out:

root@oxz_switch1:~# omdb db dns show 2>/dev/null
GROUP    ZONE                         ver UPDATED              REASON
internal control-plane.oxide.internal 2   2024-08-07T18:48:52Z blueprint f4dd65b1-0df6-4723-ba40-9942051a5363 (initial blueprint from rack setup)
external london.eng.oxide.computer    2   2024-08-07T17:33:25Z create silo: "recovery"

root@oxz_switch1:~# omdb db dns diff internal 2 2>/dev/null
DNS zone:                   control-plane.oxide.internal (Internal)
requested version:          2 (created at 2024-08-07T18:48:52Z)
version created by Nexus:   c4d99d7a-d614-4e1a-b30b-7a81388d8654
version created because:    blueprint f4dd65b1-0df6-4723-ba40-9942051a5363 (initial blueprint from rack setup)
changes:                    names added: 1, names removed: 0

+  boundary-ntp                                       (records: 2)
+      AAAA fd00:1122:3344:101::10
+      AAAA fd00:1122:3344:102::f

And back in the internal NTP zone, the name resolves:

root@oxz_ntp_b34d0c6f:~# host boundary-ntp.control-plane.oxide.internal
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:101::10
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:101::10
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:101::10
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f

(We'll need to do the above as part of the R10 upgrade, so we can assume moving forward that this name exists.)

@jgallagher (Contributor, Author):

Continuing the test on london, I ran omdb db reconfigurator-save, manually extracted the current blueprint from the resulting JSON blob, and manually edited it to produce a new child blueprint with one boundary NTP zone expunged (it would be better to do this via reconfigurator-cli but I am lazy):

root@oxz_switch1:~# diff bp1.json bp2.json
2c2
<       "id": "f4dd65b1-0df6-4723-ba40-9942051a5363",
---
>       "id": "f4dd65b1-0df6-4723-ba40-9942051a5365",
11c11
<           "generation": 5,
---
>           "generation": 6,
14c14
<               "disposition": "in_service",
---
>               "disposition": "expunged",
1311c1311
<       "parent_blueprint_id": null,
---
>       "parent_blueprint_id": "f4dd65b1-0df6-4723-ba40-9942051a5363",

Importing that to Nexus and diffing shows the one zone expunged as well:

root@oxz_switch1:~# omdb -w nexus blueprints import bp2.json 2>/dev/null
uploaded new blueprint f4dd65b1-0df6-4723-ba40-9942051a5365
root@oxz_switch1:~# omdb -w nexus blueprints diff current f4dd65b1-0df6-4723-ba40-9942051a5365
from: blueprint f4dd65b1-0df6-4723-ba40-9942051a5363
to:   blueprint f4dd65b1-0df6-4723-ba40-9942051a5365

 UNCHANGED SLEDS:
 
 ... snip lots of output ...
  
 MODIFIED SLEDS:

  sled 81cf464e-146c-4f37-8a20-743d47da4735:

    physical disks at generation 1:
    -----------------------------------
    vendor   model             serial
    -----------------------------------
    1b96     WUS4C6432DSP3X3   A079DDA7
    1b96     WUS4C6432DSP3X3   A079E0E5
    1b96     WUS4C6432DSP3X3   A079E2D1
    1b96     WUS4C6432DSP3X3   A079E32C
    1b96     WUS4C6432DSP3X3   A079E34A
    1b96     WUS4C6432DSP3X3   A079E382
    1b96     WUS4C6432DSP3X3   A079E39F
    1b96     WUS4C6432DSP3X3   A079E432
    1b96     WUS4C6432DSP3X3   A084A5FD


    omicron zones at generation 5:
    ----------------------------------------------------------------------------------------------
    zone type         zone id                                disposition    underlay IP
    ----------------------------------------------------------------------------------------------
    cockroach_db      c3fc99af-492e-4ca2-b268-4a6fbb870fda   in service     fd00:1122:3344:101::3
    crucible          3be71281-2218-42ea-8810-df25d44d56ce   in service     fd00:1122:3344:101::f
    crucible          478d5ad4-fde3-443b-8f62-7d62d48ddae5   in service     fd00:1122:3344:101::b
    crucible          70d87df1-da3d-4105-806d-0e5b090dfbe0   in service     fd00:1122:3344:101::9
    crucible          7143583e-9c87-4b2b-a316-83854098e08b   in service     fd00:1122:3344:101::a
    crucible          9ca90502-bb1e-4771-89e8-256d013f2157   in service     fd00:1122:3344:101::7
    crucible          d5d44dc6-a9cb-4c4d-be3c-2547011b9187   in service     fd00:1122:3344:101::c
    crucible          ea15cba3-1f4f-4680-be5a-f856a4562d2d   in service     fd00:1122:3344:101::e
    crucible          f4a4b344-26f3-4555-8e6f-34565e7f757a   in service     fd00:1122:3344:101::d
    crucible          fd074915-2484-4c08-ad40-d4d42b3f9392   in service     fd00:1122:3344:101::8
    crucible_pantry   df08f7dc-8573-4102-9937-78b8208d8dfe   in service     fd00:1122:3344:101::6
    external_dns      fc4d7fdc-82cb-4817-af5c-cf520c9e5927   in service     fd00:1122:3344:101::4
    internal_dns      c60cf900-18e2-40dc-9388-45ec8b4664db   in service     fd00:1122:3344:1::1
    oximeter          7c038180-57ae-4554-9f05-f5bf8a00645c   in service     fd00:1122:3344:101::5
*   boundary_ntp      1f092e4c-eac5-408e-b421-a6b3ba8d2ce7   - in service   fd00:1122:3344:101::10
     └─                                                      + expunged


 COCKROACHDB SETTINGS:
    state fingerprint:::::::::::::::::   d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
    cluster.preserve_downgrade_option:   "22.1" (unchanged)

 METADATA:
    internal DNS version:   1 (unchanged)
    external DNS version:   2 (unchanged)

I then made this new blueprint the target:

root@oxz_switch1:~# omdb -w nexus blueprints target set f4dd65b1-0df6-4723-ba40-9942051a5365 enabled
set target blueprint to f4dd65b1-0df6-4723-ba40-9942051a5365

On the relevant sled, we see that sled-agent removed the zone:

BRM42220036 # grep halt_and_remove $(svcs -L sled-agent) | looker
19:03:27.407Z INFO SledAgent (ServiceManager): halt_and_remove_logged: Previous zone state: Running
    file = illumos-utils/src/zone.rs:270
    zone = oxz_ntp_1f092e4c-eac5-408e-b421-a6b3ba8d2ce7
BRM42220036 # zoneadm list | grep ntp
BRM42220036 #

Regenerating a new blueprint adds an internal NTP zone to the sled that now has no NTP zone at all, and promotes one of the preexisting internal NTP zones (on a different sled) to a boundary NTP zone; a rough sketch of this planning pass follows the diff below:

root@oxz_switch1:~# omdb -w nexus blueprints regenerate
generated new blueprint d3b6b80f-d82f-412a-9a35-e7faee553e2a
root@oxz_switch1:~# omdb -w nexus blueprints diff current d3b6b80f-d82f-412a-9a35-e7faee553e2a
from: blueprint f4dd65b1-0df6-4723-ba40-9942051a5365
to:   blueprint d3b6b80f-d82f-412a-9a35-e7faee553e2a

 UNCHANGED SLEDS:

   ... snip 2 sleds ...

 MODIFIED SLEDS:

  sled 81cf464e-146c-4f37-8a20-743d47da4735:

    physical disks at generation 1:
    -----------------------------------
    vendor   model             serial
    -----------------------------------
    1b96     WUS4C6432DSP3X3   A079DDA7
    1b96     WUS4C6432DSP3X3   A079E0E5
    1b96     WUS4C6432DSP3X3   A079E2D1
    1b96     WUS4C6432DSP3X3   A079E32C
    1b96     WUS4C6432DSP3X3   A079E34A
    1b96     WUS4C6432DSP3X3   A079E382
    1b96     WUS4C6432DSP3X3   A079E39F
    1b96     WUS4C6432DSP3X3   A079E432
    1b96     WUS4C6432DSP3X3   A084A5FD


    omicron zones generation 6 -> 7:
    ---------------------------------------------------------------------------------------------
    zone type         zone id                                disposition   underlay IP
    ---------------------------------------------------------------------------------------------
    boundary_ntp      1f092e4c-eac5-408e-b421-a6b3ba8d2ce7   expunged      fd00:1122:3344:101::10
    cockroach_db      c3fc99af-492e-4ca2-b268-4a6fbb870fda   in service    fd00:1122:3344:101::3
    crucible          3be71281-2218-42ea-8810-df25d44d56ce   in service    fd00:1122:3344:101::f
    crucible          478d5ad4-fde3-443b-8f62-7d62d48ddae5   in service    fd00:1122:3344:101::b
    crucible          70d87df1-da3d-4105-806d-0e5b090dfbe0   in service    fd00:1122:3344:101::9
    crucible          7143583e-9c87-4b2b-a316-83854098e08b   in service    fd00:1122:3344:101::a
    crucible          9ca90502-bb1e-4771-89e8-256d013f2157   in service    fd00:1122:3344:101::7
    crucible          d5d44dc6-a9cb-4c4d-be3c-2547011b9187   in service    fd00:1122:3344:101::c
    crucible          ea15cba3-1f4f-4680-be5a-f856a4562d2d   in service    fd00:1122:3344:101::e
    crucible          f4a4b344-26f3-4555-8e6f-34565e7f757a   in service    fd00:1122:3344:101::d
    crucible          fd074915-2484-4c08-ad40-d4d42b3f9392   in service    fd00:1122:3344:101::8
    crucible_pantry   df08f7dc-8573-4102-9937-78b8208d8dfe   in service    fd00:1122:3344:101::6
    external_dns      fc4d7fdc-82cb-4817-af5c-cf520c9e5927   in service    fd00:1122:3344:101::4
    internal_dns      c60cf900-18e2-40dc-9388-45ec8b4664db   in service    fd00:1122:3344:1::1
    oximeter          7c038180-57ae-4554-9f05-f5bf8a00645c   in service    fd00:1122:3344:101::5
+   internal_ntp      3c6c27e0-da82-4c43-a67d-d4e122a8bde6   in service    fd00:1122:3344:101::21


  sled a285a319-3733-47d9-ba89-25e4305b47e4:

    physical disks at generation 1:
    -----------------------------------
    vendor   model             serial
    -----------------------------------
    1b96     WUS4C6432DSP3X3   A079DDF2
    1b96     WUS4C6432DSP3X3   A079DF88
    1b96     WUS4C6432DSP3X3   A079DF89
    1b96     WUS4C6432DSP3X3   A079DFA9
    1b96     WUS4C6432DSP3X3   A079DFAA
    1b96     WUS4C6432DSP3X3   A079DFDC
    1b96     WUS4C6432DSP3X3   A079E026
    1b96     WUS4C6432DSP3X3   A079E047
    1b96     WUS4C6432DSP3X3   A079E08D
    1b96     WUS4C6432DSP3X3   A079E08E


    omicron zones generation 5 -> 6:
    ----------------------------------------------------------------------------------------------
    zone type         zone id                                disposition    underlay IP
    ----------------------------------------------------------------------------------------------
    cockroach_db      d04966e0-3e10-4a93-9d92-4e728a18e416   in service     fd00:1122:3344:103::3
    crucible          24fdefee-4c34-4f6e-ab5e-39d492f0b6d6   in service     fd00:1122:3344:103::8
    crucible          74dfd4fa-d62d-4071-b5a5-de9728c82d22   in service     fd00:1122:3344:103::f
    crucible          864c681d-6dec-4f87-95a6-e6c8e2df3ee2   in service     fd00:1122:3344:103::e
    crucible          956d0a22-8fc0-4554-b745-24d2c05c9002   in service     fd00:1122:3344:103::9
    crucible          b79c0d13-4d0f-4897-9d0b-8543b7468d12   in service     fd00:1122:3344:103::d
    crucible          bc52eab4-d406-4bee-84be-30a58817a434   in service     fd00:1122:3344:103::6
    crucible          c83fc730-1743-4b7c-9085-3e46b99f6a2c   in service     fd00:1122:3344:103::b
    crucible          cb318546-8c39-4f75-9f58-999be87c339d   in service     fd00:1122:3344:103::c
    crucible          d8ca057d-d420-4b4f-a8ec-06b43b5c1688   in service     fd00:1122:3344:103::a
    crucible          dd7b3465-9288-48d6-a586-8fe17646a7b8   in service     fd00:1122:3344:103::7
    crucible_pantry   d66c11a4-5f69-4cdc-826f-478f9f2d80fa   in service     fd00:1122:3344:103::5
    internal_dns      a83bc3fb-f06e-43a9-8212-f520424fb488   in service     fd00:1122:3344:3::1
    nexus             c4d99d7a-d614-4e1a-b30b-7a81388d8654   in service     fd00:1122:3344:103::4
*   internal_ntp      b34d0c6f-bc89-4e89-866e-5dd781dd596e   - in service   fd00:1122:3344:103::10
     └─                                                      + expunged
+   boundary_ntp      bcf1d8b9-eb28-4404-be89-dfa860fca90d   in service     fd00:1122:3344:103::21


 COCKROACHDB SETTINGS:
    state fingerprint:::::::::::::::::   d4d87aa2ad877a4cc2fddd0573952362739110de (unchanged)
    cluster.preserve_downgrade_option:   "22.1" (unchanged)

 METADATA:
*   internal DNS version:   1 -> 3
    external DNS version:   2 (unchanged)
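
A rough sketch of the planning pass exercised by this diff, as promised above. This is a toy model: the types, helper names, and ordering are illustrative, not omicron's actual planner API.

use std::collections::BTreeMap;

// Toy model of the NTP planning pass: promote internal NTP zones while the
// blueprint is below the boundary NTP policy target, then make sure every
// sled runs some NTP zone. `ZoneKind` and the map-of-sleds representation
// are illustrative stand-ins, not omicron's planner types.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ZoneKind {
    InternalNtp,
    BoundaryNtp,
}

fn plan_ntp(
    sleds: &mut BTreeMap<&'static str, Option<ZoneKind>>,
    target_boundary: usize,
) {
    // Promote preexisting internal NTP zones until we hit the policy
    // target. (In the real planner this is an expunge of the internal zone
    // plus an add of a boundary zone with external networking.)
    let mut have = sleds
        .values()
        .filter(|z| **z == Some(ZoneKind::BoundaryNtp))
        .count();
    for zone in sleds.values_mut() {
        if have >= target_boundary {
            break;
        }
        if *zone == Some(ZoneKind::InternalNtp) {
            *zone = Some(ZoneKind::BoundaryNtp);
            have += 1;
        }
    }
    // Every sled must run an NTP zone; a sled whose zone was expunged
    // (`None` here) gets a fresh internal NTP zone.
    for zone in sleds.values_mut() {
        if zone.is_none() {
            *zone = Some(ZoneKind::InternalNtp);
        }
    }
}

fn main() {
    // Rough analog of the test state above: one sled just had its boundary
    // NTP zone expunged, one still runs a boundary zone, two run internal.
    let mut sleds = BTreeMap::from([
        ("sled-a", None),
        ("sled-b", Some(ZoneKind::BoundaryNtp)),
        ("sled-c", Some(ZoneKind::InternalNtp)),
        ("sled-d", Some(ZoneKind::InternalNtp)),
    ]);
    plan_ntp(&mut sleds, 2);
    // One internal zone gets promoted to boundary, and the emptied sled
    // gets a new internal NTP zone.
    println!("{sleds:?}");
}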

After making this new blueprint the target, we can confirm that the new boundary NTP zone has connectivity:

BRM42220062 # zlogin $(zoneadm list | grep ntp)
[Connected to zone 'oxz_ntp_bcf1d8b9-eb28-4404-be89-dfa860fca90d' pts/5]
The illumos Project     helios-2.0.22800        August 2024
root@oxz_ntp_bcf1d8b9:~# chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 172.20.0.5                    2   2   377     3    -18us[  -20us] +/-   37ms
root@oxz_ntp_bcf1d8b9:~# chronyc tracking
Reference ID    : AC140005 (172.20.0.5)
Stratum         : 3
Ref time (UTC)  : Wed Aug 07 19:11:30 2024
System time     : 0.000000982 seconds slow of NTP time
Last offset     : -0.000003033 seconds
RMS offset      : 0.000031792 seconds
Frequency       : 51.595 ppm slow
Residual freq   : -0.011 ppm
Skew            : 1.107 ppm
Root delay      : 0.071110584 seconds
Root dispersion : 0.001103470 seconds
Update interval : 8.0 seconds
Leap status     : Normal
root@oxz_ntp_bcf1d8b9:~# ipadm | grep omicron
oxControlService15/omicron6 static ok   fd00:1122:3344:103::21/64

The new internal NTP zone also has connectivity, including to the new boundary NTP zone (...:103::21):

BRM42220036 # zlogin $(zoneadm list | grep ntp)
[Connected to zone 'oxz_ntp_3c6c27e0-da82-4c43-a67d-d4e122a8bde6' pts/3]
The illumos Project     helios-2.0.22800        August 2024
root@oxz_ntp_3c6c27e0:~# chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* fd00:1122:3344:102::f         3   4   377     1  +3445ns[+5299ns] +/-   37ms
^- fd00:1122:3344:103::21        3   6    17    25   +188ns[+2904ns] +/-   37ms
root@oxz_ntp_3c6c27e0:~# chronyc tracking
Reference ID    : 9F7807A6 (fd00:1122:3344:102::f)
Stratum         : 4
Ref time (UTC)  : Wed Aug 07 19:11:53 2024
System time     : 0.000000010 seconds fast of NTP time
Last offset     : +0.000001618 seconds
RMS offset      : 0.000007027 seconds
Frequency       : 55.473 ppm slow
Residual freq   : +0.005 ppm
Skew            : 0.216 ppm
Root delay      : 0.071277142 seconds
Root dispersion : 0.001013304 seconds
Update interval : 8.1 seconds
Leap status     : Normal

Checking one of the existing, untouched internal NTP zones, we see that chrony has picked up the new boundary NTP zone, but it also still has a record of the now-expunged one. I'm not sure whether this is bad (i.e., indicative of something missing in the chrony config) or something that will clear up if I wait long enough. (The high LastRx is a strong indicator, and that value continued to grow as I watched it; I assume eventually chrony will give up on that source.)

BRM44220007 # zlogin $(zoneadm list | grep ntp)
[Connected to zone 'oxz_ntp_378be9b6-f88b-4a3e-9c1c-a601e2d1551f' pts/3]
The illumos Project     helios-2.0.22800        August 2024
root@oxz_ntp_378be9b6:~# chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? fd00:1122:3344:101::10        3   4     0   627  +4059ns[-7768ns] +/-   36ms
^* fd00:1122:3344:102::f         3   4   377     7  -7372ns[-9450ns] +/-   37ms
^- fd00:1122:3344:103::21        3   6    37    17    -10us[  -12us] +/-   37ms
root@oxz_ntp_378be9b6:~# chronyc tracking
Reference ID    : 9F7807A6 (fd00:1122:3344:102::f)
Stratum         : 4
Ref time (UTC)  : Wed Aug 07 19:13:35 2024
System time     : 0.000000671 seconds slow of NTP time
Last offset     : -0.000002078 seconds
RMS offset      : 0.000003432 seconds
Frequency       : 50.671 ppm slow
Residual freq   : -0.007 ppm
Skew            : 0.274 ppm
Root delay      : 0.071322128 seconds
Root dispersion : 0.000981925 seconds
Update interval : 16.2 seconds
Leap status     : Normal

Checking the config and DNS, it's as we expect: the config still lists the two original RSS boundary NTP zones by name, the expunged one no longer resolves, and the new server is present in the boundary-ntp DNS name:

root@oxz_ntp_378be9b6:~# tail -n 3 /etc/inet/chrony.conf
server 1f092e4c-eac5-408e-b421-a6b3ba8d2ce7.host.control-plane.oxide.internal iburst minpoll 0 maxpoll 4
server 6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal iburst minpoll 0 maxpoll 4
pool boundary-ntp.control-plane.oxide.internal iburst maxdelay 0.1 maxsources 16
root@oxz_ntp_378be9b6:~# host 1f092e4c-eac5-408e-b421-a6b3ba8d2ce7.host.control-plane.oxide.internal
Host 1f092e4c-eac5-408e-b421-a6b3ba8d2ce7.host.control-plane.oxide.internal not found: 3(NXDOMAIN)
root@oxz_ntp_378be9b6:~# host 6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal
6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
root@oxz_ntp_378be9b6:~# host boundary-ntp.control-plane.oxide.internal
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:103::21
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:103::21
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:103::21

I then repeated all of the above to expunge the other boundary NTP zone that RSS had created. By chance, one of the original internal NTP zones remained, and we can see that it now knows about all four boundary NTP zones (the two original and the two replacements), but can only resolve the new ones (and only has recent LastRx values from them, as expected):

root@oxz_ntp_378be9b6:~# tail -n 3 /etc/inet/chrony.conf
server 1f092e4c-eac5-408e-b421-a6b3ba8d2ce7.host.control-plane.oxide.internal iburst minpoll 0 maxpoll 4
server 6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal iburst minpoll 0 maxpoll 4
pool boundary-ntp.control-plane.oxide.internal iburst maxdelay 0.1 maxsources 16
root@oxz_ntp_378be9b6:~# host 1f092e4c-eac5-408e-b421-a6b3ba8d2ce7.host.control-plane.oxide.internal
Host 1f092e4c-eac5-408e-b421-a6b3ba8d2ce7.host.control-plane.oxide.internal not found: 3(NXDOMAIN)
root@oxz_ntp_378be9b6:~# host 6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal
Host 6effe156-cc36-486f-82ac-2333a2fe9d87.host.control-plane.oxide.internal not found: 3(NXDOMAIN)
root@oxz_ntp_378be9b6:~# host boundary-ntp.control-plane.oxide.internal
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:101::22
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:103::21
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:101::22
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:103::21
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:101::22
boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:103::21
root@oxz_ntp_378be9b6:~# chronyc sources
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^? fd00:1122:3344:101::10        3   4     0   23m    +37us[-7768ns] +/-   36ms
^* fd00:1122:3344:102::f         3   4     0   359  -1757ns[-4030ns] +/-   36ms
^- fd00:1122:3344:103::21        3   6   377    15    -31us[  -31us] +/-   36ms
^- fd00:1122:3344:101::22        3   6   177     8  -5147ns[-5147ns] +/-   36ms

We can also confirm that the external IP and vnic accounting in CRDB looks correct; we only have records for the two new boundary NTP zones (note that they share the single SNAT IP 172.20.27.5, split into disjoint port ranges):

root@oxz_switch1:~# omdb db network list-eips 2>/dev/null
 IP              PORTS        KIND      STATE     OWNER_KIND  OWNER_ID                              OWNER_NAME   OWNER_DISPOSITION
 172.20.27.1/32  0/65535      floating  Attached  service     fc4d7fdc-82cb-4817-af5c-cf520c9e5927  ExternalDns  in service
 172.20.27.2/32  0/65535      floating  Attached  service     c4b80178-8ed7-4c01-afe7-24e152472813  Nexus        in service
 172.20.27.3/32  0/65535      floating  Attached  service     c4d99d7a-d614-4e1a-b30b-7a81388d8654  Nexus        in service
 172.20.27.4/32  0/65535      floating  Attached  service     674546c6-1734-4ca6-ab8e-0028feb391e7  Nexus        in service
 172.20.27.5/32  32768/49151  SNAT      Attached  service     bcf1d8b9-eb28-4404-be89-dfa860fca90d  Ntp          in service
 172.20.27.5/32  49152/65535  SNAT      Attached  service     a5a77de8-2b95-408d-a1be-c51a8790fea8  Ntp          in service
root@oxz_switch1:~# omdb db network list-vnics 2>/dev/null
 IP             MAC                SLOT  PRIMARY  KIND     SUBNET         PARENT_ID                             PARENT_NAME
 172.30.1.5/32  A8:40:25:FF:A0:5E  0     true     service  172.30.1.0/24  fc4d7fdc-82cb-4817-af5c-cf520c9e5927  external-dns-fc4d7fdc-82cb-4817-af5c-cf520c9e5927
 172.30.2.5/32  A8:40:25:FF:F7:D2  0     true     service  172.30.2.0/24  c4b80178-8ed7-4c01-afe7-24e152472813  nexus-c4b80178-8ed7-4c01-afe7-24e152472813
 172.30.2.6/32  A8:40:25:FF:E0:E2  0     true     service  172.30.2.0/24  c4d99d7a-d614-4e1a-b30b-7a81388d8654  nexus-c4d99d7a-d614-4e1a-b30b-7a81388d8654
 172.30.2.7/32  A8:40:25:FF:9D:CB  0     true     service  172.30.2.0/24  674546c6-1734-4ca6-ab8e-0028feb391e7  nexus-674546c6-1734-4ca6-ab8e-0028feb391e7
 172.30.3.7/32  A8:40:25:FF:80:00  0     true     service  172.30.3.0/24  bcf1d8b9-eb28-4404-be89-dfa860fca90d  ntp-bcf1d8b9-eb28-4404-be89-dfa860fca90d
 172.30.3.8/32  A8:40:25:FF:80:01  0     true     service  172.30.3.0/24  a5a77de8-2b95-408d-a1be-c51a8790fea8  ntp-a5a77de8-2b95-408d-a1be-c51a8790fea8

As a final test, I rebooted all of london's sleds concurrently to ensure we could come back up with our new boundary NTP zones. This appeared to work fine: the control plane came back up, and all four NTP zones have the expected sources (including the original internal NTP zone that still has config references to now-nonexistent zones; it only sees the two new sources now):

root@oxz_switch1:~# pilot host exec -c 'zoneadm list | grep ntp' 14-17
14  BRM42220036        ok: oxz_ntp_a5a77de8-2b95-408d-a1be-c51a8790fea8
15  BRM42220062        ok: oxz_ntp_bcf1d8b9-eb28-4404-be89-dfa860fca90d
16  BRM42220030        ok: oxz_ntp_51632f45-9605-40ca-9092-925e8c38871b
17  BRM44220007        ok: oxz_ntp_378be9b6-f88b-4a3e-9c1c-a601e2d1551f
root@oxz_switch1:~# pilot host exec -c 'zlogin $(zoneadm list | grep ntp) chronyc sources' 14-17
14  BRM42220036        ok: MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 172.20.0.5                    2   3   377     1  +3606ns[+7103ns] +/-   36ms
15  BRM42220062        ok: MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 172.20.0.5                    2   3   377     1    -32us[-7437ns] +/-   36ms
16  BRM42220030        ok: MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* fd00:1122:3344:103::21        3   4   377    16  +8852ns[+8605ns] +/-   36ms
^+ fd00:1122:3344:101::22        3   6   377    18    -18us[  -18us] +/-   36ms
17  BRM44220007        ok: MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ fd00:1122:3344:101::22        3   6   377    19    +16us[  +36us] +/-   36ms
^* fd00:1122:3344:103::21        3   6   377    18    +29us[  +49us] +/-   36ms

@jgallagher self-assigned this Aug 7, 2024
@davepacheco (Collaborator):

Those are great tests -- especially the cold boot.

> And back in the internal NTP zone, the name resolves:
>
> root@oxz_ntp_b34d0c6f:~# host boundary-ntp.control-plane.oxide.internal
> boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:101::10
> boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
> boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:101::10
> boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f
> boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:101::10
> boundary-ntp.control-plane.oxide.internal has IPv6 address fd00:1122:3344:102::f

Do you know why we have these dups? Is it that we have three DNS servers in /etc/resolv.conf and host(1) asks all of them and prints all the results? I'm basically just trying to figure out if we have dups in the data coming from a single DNS server.

@jgallagher (Contributor, Author):

> Do you know why we have these dups? Is it that we have three DNS servers in /etc/resolv.conf and host(1) asks all of them and prints all the results? I'm basically just trying to figure out if we have dups in the data coming from a single DNS server.

I don't think it's duplicate data, but it might be a different problem? I'm seeing triplicate answers even on dogfood with one of the specific-zone names:

root@oxz_ntp_3ccea933:~# host 20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal
20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal has IPv6 address fd00:1122:3344:104::3
20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal has IPv6 address fd00:1122:3344:104::3
20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal has IPv6 address fd00:1122:3344:104::3

Bumping up the verbosity, it looks like host is querying for A, AAAA, and MX, and getting back the AAAA record each time?

root@oxz_ntp_3ccea933:~# host -v 20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal
Trying "20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57320
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal.        IN A

;; ANSWER SECTION:
20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:104::3

Received 116 bytes from fd00:1122:3344:3::1#53 in 0 ms
Trying "20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64707
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal.        IN AAAA

;; ANSWER SECTION:
20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:104::3

Received 116 bytes from fd00:1122:3344:3::1#53 in 0 ms
Trying "20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21423
;; flags: qr rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal.        IN MX

;; ANSWER SECTION:
20b100d0-84c3-4119-aa9b-0c632b0b6a3a.host.control-plane.oxide.internal. 0 IN AAAA fd00:1122:3344:104::3

Received 116 bytes from fd00:1122:3344:3::1#53 in 0 ms

@davepacheco (Collaborator) left a comment:

If I'm reading this correctly, we don't ever convert a boundary NTP zone to an internal NTP one (e.g., if we found ourselves with too many boundary NTP zones). I was surprised at first, but I can't think of a reason we'd need to do that, so I'm glad to avoid the complexity if we can.

@@ -435,6 +436,8 @@ enum BlueprintEditCommands {
},
/// add a CockroachDB instance to a particular sled
AddCockroach { sled_id: SledUuid },
/// expunge a particular zone from a particular sled
ExpungeZone { sled_id: SledUuid, zone_id: OmicronZoneUuid },
Collaborator:
Ooh, nice!
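
(Assuming clap's default kebab-case naming for these derived subcommands, the new edit would be invoked from reconfigurator-cli as something like blueprint-edit <blueprint-id> expunge-zone <sled-id> <zone-id>; the exact argument shape here is inferred from the enum, not a verified transcript.)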

Comment on lines 786 to 792
// We want to preserve the CockroachDB cluster settings from the parent
// blueprint.
new_blueprint
.cockroachdb_fingerprint
.clone_from(&blueprint.cockroachdb_fingerprint);
new_blueprint.cockroachdb_setting_preserve_downgrade =
blueprint.cockroachdb_setting_preserve_downgrade;
Collaborator:
Should this be handled in BlueprintBuilder::new_based_on() (called at L739 above)?

Contributor Author:
cockroachdb_setting_preserve_downgrade is handled in new_based_on(), yeah, but cockroachdb_fingerprint isn't. In 6cf3ab5 I removed the cockroachdb_setting_preserve_downgrade line and expanded the comment to add detail on why we're manually setting the fingerprint.
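
Presumably the surviving code is just the fingerprint copy from the snippet quoted above, now with an expanded comment; roughly:

// We want to preserve the CockroachDB cluster settings from the parent
// blueprint. cockroachdb_setting_preserve_downgrade is already carried
// over by BlueprintBuilder::new_based_on(), so only the fingerprint
// needs an explicit copy here. (Comment wording is a guess; the actual
// change is in 6cf3ab5.)
new_blueprint
    .cockroachdb_fingerprint
    .clone_from(&blueprint.cockroachdb_fingerprint);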

@@ -341,8 +345,9 @@ impl<'a> BlueprintBuilder<'a> {
pub fn current_sled_zones(
&self,
sled_id: SledUuid,
filter: BlueprintZoneFilter,
Collaborator:
Nice

Comment on lines 972 to 974
//
// TODO-cleanup Is there ever a case where we might want to do some kind
// of graceful shutdown of an internal NTP zone? Seems unlikely...
Collaborator:
I can't see why. Is there something you have in mind?

Contributor Author:
Not really. Maybe if we were doing some kind of graceful cleanup of all the zones on a sled? But even then I don't think there's anything to actually do. I'll remove this comment.

Comment on lines +84 to +86
let mut unmatched = BTreeSet::new();
unmatched.insert(zone_id);
BuilderZonesConfigError::ExpungeUnmatchedZones { unmatched }
Collaborator:
rustfmt seems asleep at the switch here.

Contributor Author:
I don't think so? If I change the formatting on this line, rustfmt puts it back to the way it is.

Collaborator:
... but it's definitely wrong, right? I would expect L84-86 to be indented because they're inside a block that ends at line 87. I'd also expect L87 to be indented because it closes the block and parens opened at L83.

Contributor Author:
It's definitely weird and probably wrong, and possibly related to our rustfmt.toml settings:

max_width = 80
use_small_heuristics = "max"

If I comment out use_small_heuristics, it doesn't change. If I comment out both lines, I get what you want:

        let zone = self
            .zones
            .iter_mut()
            .find(|zone| zone.zone.id == zone_id)
            .ok_or_else(|| {
                let mut unmatched = BTreeSet::new();
                unmatched.insert(zone_id);
                BuilderZonesConfigError::ExpungeUnmatchedZones { unmatched }
            })?;

If I comment out just max_width and keep use_small_heuristics, I get this (which also seems reasonable and more correct than what we're getting now):

        let zone = self.zones.iter_mut().find(|zone| zone.zone.id == zone_id).ok_or_else(|| {
            let mut unmatched = BTreeSet::new();
            unmatched.insert(zone_id);
            BuilderZonesConfigError::ExpungeUnmatchedZones { unmatched }
        })?;

I think I will just leave this alone - I'm not sure how to isolate it, or whether it's a rustfmt bug or just some interaction we don't understand.

@davepacheco (Collaborator):

> > Do you know why we have these dups? Is it that we have three DNS servers in /etc/resolv.conf and host(1) asks all of them and prints all the results? [...]
>
> I don't think it's duplicate data, but it might be a different problem? I'm seeing triplicate answers even on dogfood with one of the specific-zone names. [...] Bumping up the verbosity, it looks like host is querying for A, AAAA, and MX, and getting back the AAAA record each time?

Hmm. This looks similar to #4051 and #4258. It seems like maybe the DNS server is just ignoring the requested record type and reporting everything for the given name? If that's the behavior, that does seem like a bug, but I don't think it's a problem for this PR since this is the only kind of record that we have for this name. I guess the other possible problem would be if chrony also queries separately for different record types and gets confused by the dups. But it seems unlikely that dups would make it do the wrong thing.
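
A minimal sketch of the suspected server-side behavior, for concreteness. This is illustrative Rust: the record types and the answer function are hypothetical stand-ins, not the actual code in omicron's dns-server.

#[derive(Clone, Copy, PartialEq, Debug)]
enum RecordType {
    A,
    Aaaa,
    Mx,
}

#[derive(Clone, Debug)]
struct Record {
    rtype: RecordType,
    rdata: String,
}

// Expected behavior: answer only with records matching the query type.
// The observed behavior is as if the `filter` step below were skipped, so
// A, AAAA, and MX queries for the name all come back with the AAAA record.
fn answer(qtype: RecordType, stored: &[Record]) -> Vec<Record> {
    stored.iter().filter(|r| r.rtype == qtype).cloned().collect()
}

fn main() {
    let stored = vec![Record {
        rtype: RecordType::Aaaa,
        rdata: "fd00:1122:3344:104::3".to_string(),
    }];
    // With qtype filtering, an MX query returns nothing rather than
    // echoing the AAAA record back.
    assert!(answer(RecordType::Mx, &stored).is_empty());
    assert_eq!(answer(RecordType::Aaaa, &stored).len(), 1);
}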

@jgallagher (Contributor, Author):

> If I'm reading this correctly, we don't ever convert a boundary NTP zone to an internal NTP one (e.g., if we found ourselves with too many boundary NTP zones). I was surprised at first, but I can't think of a reason we'd need to do that, so I'm glad to avoid the complexity if we can.

Today we don't handle any case of "I have too many services of kind X"; we only ever add services if we're below the policy count. That presumably needs to change at some point, and then we might need to convert boundary NTP zones back to internal ones. I think that would be pretty similar to this, though: expunge the boundary NTP zone and add a new internal NTP zone.

Base automatically changed from john/resolve-boundary-ntp-from-internal-dns to main on August 8, 2024 21:24
@jgallagher force-pushed the john/reconfigurator-planning-boundary-ntp branch from 6cf3ab5 to eefaaf3 on August 8, 2024 21:25
@jgallagher enabled auto-merge (squash) on August 8, 2024 22:49
@jgallagher merged commit 5fb9029 into main on Aug 8, 2024 (22 checks passed)
@jgallagher deleted the john/reconfigurator-planning-boundary-ntp branch on August 8, 2024 23:10