Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sled-agent] Separate CockroachDB "start" from CockroachDB "init" #2954

Merged
merged 62 commits into from
Jun 29, 2023
Merged
Show file tree
Hide file tree
Changes from 60 commits
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
7b54128
[sled-agent] Make service_manager responsible for storage services too
smklein Apr 28, 2023
f410325
Merge branch 'main' into storage-manager-cleanup
smklein Apr 28, 2023
0b4b040
CRDB auto-format on boot
smklein Apr 28, 2023
ef9517c
better use of 'unique_name' (for storage zones), auto-launch storage …
smklein Apr 30, 2023
f1fd1f5
Merge branch 'main' into storage-manager-cleanup
smklein Apr 30, 2023
3a9ad87
Merge branch 'main' into storage-manager-cleanup
smklein Apr 30, 2023
5752a83
[RSS] Explicit set of Bootstrap Agents
smklein Apr 30, 2023
31b52a7
Refuse to enact a sled plan unless it's on the explicit set
smklein Apr 30, 2023
f77f235
wip
smklein Apr 28, 2023
2810f93
RSS explicitly calling to initialize CRDB
smklein Apr 30, 2023
5297950
Read the CRDB address from the running zone
smklein Apr 30, 2023
c21be48
Send requests to the right address
smklein Apr 30, 2023
f11c153
fmt
smklein Apr 30, 2023
4377f1d
Stop deleting chelsio addresses during uninstall (#2953)
smklein Apr 30, 2023
ec3b1e4
[RSS] Explicit set of Bootstrap Agents
smklein Apr 30, 2023
c812609
Merge branch 'storage-manager-cleanup' into rss-explicit
smklein Apr 30, 2023
cbfb8c8
Merge branch 'rss-explicit' into cockroach-init
smklein Apr 30, 2023
9d00c93
Fix tests
smklein Apr 30, 2023
8a08090
Merge branch 'storage-manager-cleanup' into rss-explicit
smklein Apr 30, 2023
7155257
Merge branch 'rss-explicit' into cockroach-init
smklein Apr 30, 2023
cfb7cbc
make serialization happier
smklein Apr 30, 2023
6346ca7
Merge branch 'rss-explicit' into cockroach-init
smklein Apr 30, 2023
c37e57e
Improve parsing for toml, openapi
smklein May 1, 2023
541f68d
Remove the comments about the ledger, we do that in #2972
smklein May 1, 2023
5d59951
configs -> ledgers
smklein May 1, 2023
ed20fff
review feedback
smklein May 1, 2023
7fe9f06
Merge branch 'rss-explicit' into cockroach-init
smklein May 1, 2023
ba2ba2c
Merge branch 'main' into storage-manager-cleanup
smklein May 1, 2023
afc03bd
Merge branch 'storage-manager-cleanup' into rss-explicit
smklein May 1, 2023
c970c4f
Merge branch 'main' into storage-manager-cleanup
smklein May 2, 2023
c00e2db
Merge branch 'storage-manager-cleanup' into rss-explicit
smklein May 2, 2023
599f669
Merge branch 'main' into storage-manager-cleanup
smklein May 2, 2023
e2b1a5b
Merge branch 'storage-manager-cleanup' into rss-explicit
smklein May 2, 2023
00633c1
Merge branch 'rss-explicit' into cockroach-init
smklein May 2, 2023
2487972
Merge branch 'main' into storage-manager-cleanup
smklein May 2, 2023
56a73e3
Merge branch 'storage-manager-cleanup' into rss-explicit
smklein May 2, 2023
6aef2e7
Merge branch 'main' into storage-manager-cleanup
smklein May 2, 2023
b84515f
Merge branch 'storage-manager-cleanup' into rss-explicit
smklein May 2, 2023
016f32d
Merge branch 'rss-explicit' into cockroach-init
smklein May 2, 2023
6718f5f
Merge branch 'main' into storage-manager-cleanup
smklein May 3, 2023
cbc90af
Merge branch 'storage-manager-cleanup' into rss-explicit
smklein May 3, 2023
9abc1f2
Merge branch 'main' into storage-manager-cleanup
smklein May 3, 2023
18337e4
Merge branch 'storage-manager-cleanup' into rss-explicit
smklein May 3, 2023
fc1d4a7
Merge branch 'rss-explicit' into cockroach-init
smklein May 3, 2023
74e8d66
Merge remote-tracking branch 'origin/main' into cockroach-init
davepacheco Jun 22, 2023
e3febbe
typos / rendering issues
davepacheco Jun 22, 2023
6291141
WIP (untested): use DNS to find the other Cockroach nodes
davepacheco Jun 22, 2023
c942e53
fix property builder for multi-valued properties
davepacheco Jun 26, 2023
d283479
trust-dns docs sent me down the wrong path
davepacheco Jun 27, 2023
ba8820c
move the point where we do DNS resolution
davepacheco Jun 27, 2023
05fef51
fix up comments
davepacheco Jun 27, 2023
3c54223
WIP (untested): use Rust program to resolve DNS
davepacheco Jun 27, 2023
4a8fe18
clippy nits
davepacheco Jun 27, 2023
9141db7
dnswait needs to move to its own package
davepacheco Jun 27, 2023
1406d7f
fix path
davepacheco Jun 27, 2023
967f03d
dnswait should report hostnames
davepacheco Jun 27, 2023
3c0e157
should be comma-separated
davepacheco Jun 28, 2023
3546cde
Merge branch 'main' into dap/cockroach-init
davepacheco Jun 28, 2023
8bcdbba
review feedback
davepacheco Jun 29, 2023
cab937e
Merge branch 'main' into dap/cockroach-init
davepacheco Jun 29, 2023
b99b1a5
fix Cargo.lock
davepacheco Jun 29, 2023
97ab92f
Merge branch 'main' into dap/cockroach-init
davepacheco Jun 29, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ members = [
"installinator-artifactd",
"installinator-common",
"internal-dns",
"internal-dns-cli",
"ipcc-key-value",
"key-manager",
"nexus",
Expand Down Expand Up @@ -83,6 +84,7 @@ default-members = [
"installinator-artifactd",
"installinator-common",
"internal-dns",
"internal-dns-cli",
"ipcc-key-value",
"key-manager",
"nexus",
Expand Down
11 changes: 8 additions & 3 deletions dns-server/src/dns_server.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,7 @@ use anyhow::anyhow;
use anyhow::Context;
use pretty_hex::*;
use serde::Deserialize;
use slog::info;
use slog::{debug, error, o, Logger};
use slog::{debug, error, info, o, trace, Logger};
use std::net::SocketAddr;
use std::str::FromStr;
use std::sync::Arc;
Expand Down Expand Up @@ -150,7 +149,7 @@ async fn handle_dns_packet(request: Request) {
let log = &request.log;
let buf = &request.packet;

debug!(&log, "buffer"; "buffer" => ?buf.hex_dump());
trace!(&log, "buffer"; "buffer" => ?buf.hex_dump());

// Decode the message.
let mut dec = BinDecoder::new(&buf);
Expand Down Expand Up @@ -270,6 +269,12 @@ async fn handle_dns_message(
}
})
.collect::<Result<Vec<_>, RequestError>>()?;
debug!(
&log,
"dns response";
"query" => ?query,
"records" => ?&response_records
);
respond_records(request, rb, header, &response_records).await
}

Expand Down
4 changes: 4 additions & 0 deletions illumos-utils/src/running_zone.rs
Original file line number Diff line number Diff line change
Expand Up @@ -260,6 +260,10 @@ impl RunningZone {
self.inner.zonepath.join("root")
}

pub fn control_interface(&self) -> AddrObject {
AddrObject::new(self.inner.get_control_vnic_name(), "omicron6").unwrap()
}

/// Runs a command within the Zone, return the output.
//
// NOTE: It's important that this function is synchronous.
Expand Down
15 changes: 15 additions & 0 deletions internal-dns-cli/Cargo.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
[package]
name = "internal-dns-cli"
version = "0.1.0"
edition = "2021"
license = "MPL-2.0"

[dependencies]
anyhow.workspace = true
clap.workspace = true
dropshot.workspace = true
internal-dns.workspace = true
omicron-common.workspace = true
slog.workspace = true
tokio.workspace = true
trust-dns-resolver.workspace = true
98 changes: 98 additions & 0 deletions internal-dns-cli/src/bin/dnswait.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
// This Source Code Form is subject to the terms of the Mozilla Public
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I'm clear - the main reason for using this instead of dig +short <name> is the timeout?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had dig at first, but I realized it will only try a handful of times and only wait a few seconds, and you can't tell from its exit status that it's failed. I needed something that's going to retry an arbitrary number of times, waits until it finds some results, too, and if it's going to fail, fails clearly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but maybe it makes sense to move this utility into the omicron-brand zone alongside some of the other network setup that might be more general? Seems a bit like a pain-in-the-butt to manually package it into each zone that could need it

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Resolves DNS names within the Oxide control plane

use anyhow::Context;
use anyhow::Result;
use clap::Parser;
use clap::ValueEnum;
use internal_dns::resolver::ResolveError;
use internal_dns::resolver::Resolver;
use slog::{info, warn};
use std::net::SocketAddr;

#[derive(Debug, Parser)]
#[clap(name = "dnswait", about = "Resolves DNS names in the control plane")]
struct Opt {
/// Nameserver(s) to query
///
/// If unspecified, uses the system configuration (usually the nameservers
/// configured in /etc/resolv.conf).
#[clap(long, action)]
nameserver_addresses: Vec<SocketAddr>,

/// service name to be resolved (should be the target of a DNS name)
#[arg(value_enum)]
srv_name: ServiceName,
}

#[derive(Debug, Clone, Copy, ValueEnum)]
#[value(rename_all = "kebab-case")]
enum ServiceName {
Cockroach,
}

impl From<ServiceName> for internal_dns::ServiceName {
fn from(value: ServiceName) -> Self {
match value {
ServiceName::Cockroach => internal_dns::ServiceName::Cockroach,
}
}
}

#[tokio::main]
async fn main() -> Result<()> {
let opt = Opt::parse();
let log = dropshot::ConfigLogging::File {
path: "/dev/stderr".into(),
level: dropshot::ConfigLoggingLevel::Info,
if_exists: dropshot::ConfigLoggingIfExists::Append,
}
.to_logger("dnswait")
.context("creating log")?;

let resolver = if opt.nameserver_addresses.is_empty() {
info!(&log, "using system configuration");
let async_resolver =
trust_dns_resolver::AsyncResolver::tokio_from_system_conf()
.context("initializing resolver from system configuration")?;
Resolver::new_with_resolver(log.clone(), async_resolver)
} else {
let addrs = opt.nameserver_addresses;
info!(&log, "using explicit nameservers"; "nameservers" => ?addrs);
Resolver::new_from_addrs(log.clone(), addrs)
.context("creating resolver with explicit nameserver addresses")?
};

let result = omicron_common::backoff::retry_notify(
omicron_common::backoff::retry_policy_internal_service(),
|| async {
let dns_name = internal_dns::ServiceName::from(opt.srv_name);
resolver.lookup_srv(dns_name).await.map_err(|error| match error {
ResolveError::Resolve(_)
| ResolveError::NotFound(_)
| ResolveError::NotFoundByString(_) => {
omicron_common::backoff::BackoffError::transient(error)
}
})
},
|error, delay| {
warn!(
&log,
"DNS query failed; will try again";
"error" => format!("{:#}", error),
"delay" => ?delay,
);
},
)
.await
.context("unexpectedly gave up")?;

for ip in result {
println!("{}", ip)
}

Ok(())
}
137 changes: 122 additions & 15 deletions internal-dns/src/resolver.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ use crate::DNS_ZONE;
use omicron_common::address::{
Ipv6Subnet, ReservedRackSubnet, AZ_PREFIX, DNS_PORT,
};
use slog::{debug, info};
use slog::{debug, info, trace};
use std::net::{IpAddr, Ipv6Addr, SocketAddr, SocketAddrV6};
use trust_dns_proto::rr::record_type::RecordType;
use trust_dns_resolver::config::{
Expand Down Expand Up @@ -70,6 +70,15 @@ impl Resolver {
Self::new_from_subnet(log, subnet)
}

/// Return a resolver that uses the system configuration (usually
/// /etc/resolv.conf) for the underlying nameservers.
pub fn new_with_resolver(
log: slog::Logger,
tokio_resolver: TokioAsyncResolver,
) -> Self {
Resolver { log, inner: Box::new(tokio_resolver) }
}

// TODO-correctness This function and its callers make assumptions about how
// many internal DNS servers there are on the subnet and where they are. Is
// that okay? It would seem more flexible not to assume this. Instead, we
Expand Down Expand Up @@ -121,6 +130,107 @@ impl Resolver {
Ok(*address)
}

/// Returns the targets of the SRV records for a DNS name
///
/// The returned values are generally other DNS names that themselves would
/// need to be looked up to find A/AAAA records.
pub async fn lookup_srv(
&self,
srv: crate::ServiceName,
) -> Result<Vec<String>, ResolveError> {
let name = format!("{}.{}", srv.dns_name(), DNS_ZONE);
trace!(self.log, "lookup_srv"; "dns_name" => &name);
let response = self.inner.srv_lookup(&name).await?;
debug!(
self.log,
"lookup_srv";
"dns_name" => &name,
"response" => ?response
);

Ok(response.into_iter().map(|srv| srv.target().to_string()).collect())
}

pub async fn lookup_all_ipv6(
&self,
srv: crate::ServiceName,
) -> Result<Vec<Ipv6Addr>, ResolveError> {
let name = format!("{}.{}", srv.dns_name(), DNS_ZONE);
trace!(self.log, "lookup_all_ipv6 srv"; "dns_name" => &name);
let response = self.inner.srv_lookup(&name).await?;
debug!(
self.log,
"lookup_ipv6 srv";
"dns_name" => &name,
"response" => ?response
);

// SRV records have a target, which is itself another DNS name that
// needs to be looked up in order to get to the actual IP addresses.
// Many DNS servers return these IP addresses directly in the response
// to the SRV query as Additional records. Ours does not. See
// omicron#3434. So we need to do another round of lookups separately.
//
// According to the docs` for
// `trust_dns_resolver::lookup::SrvLookup::ip_iter()`, it sounds like
// trust-dns would have done this for us. It doesn't. See
// bluejekyll/trust-dns#1980.
//
// So if we have gotten any IPs, then we assume that one of the above
// issues has been addressed and so we have all the IPs and we're done.
// Otherwise, explicitly do the extra lookups.
let addresses: Vec<Ipv6Addr> = response
.ip_iter()
.filter_map(|addr| match addr {
IpAddr::V4(_) => None,
IpAddr::V6(addr) => Some(addr),
})
.collect();
if !addresses.is_empty() {
return Ok(addresses);
}

// What do we do if some of these queries succeed while others fail? We
// may have some addresses, but the list might be incomplete. That
// might be okay for some use cases but not others. For now, we do the
// simple thing. In the future, we'll want a more cueball-like resolver
// interface that better deals with these cases.
let log = &self.log;
let futures = response.iter().map(|srv| async {
let target = srv.target();
trace!(
log,
"lookup_all_ipv6: looking up SRV target";
"name" => ?target,
);
self.inner.ipv6_lookup(target.clone()).await
});
let results = futures::future::try_join_all(futures).await?;
let results = results
.into_iter()
.flat_map(|ipv6| ipv6.into_iter())
.collect::<Vec<_>>();
if results.is_empty() {
Err(ResolveError::NotFound(srv))
} else {
Ok(results)
}
}

pub async fn lookup_ip(
&self,
srv: crate::ServiceName,
) -> Result<IpAddr, ResolveError> {
let name = format!("{}.{}", srv.dns_name(), DNS_ZONE);
debug!(self.log, "lookup srv"; "dns_name" => &name);
let response = self.inner.lookup_ip(&name).await?;
let address = response
.iter()
.next()
.ok_or_else(|| ResolveError::NotFound(srv))?;
Ok(address)
}

/// Looks up a single [`SocketAddrV6`] based on the SRV name
/// Returns an error if the record does not exist.
pub async fn lookup_socket_v6(
Expand Down Expand Up @@ -156,20 +266,6 @@ impl Resolver {
}
})
}

pub async fn lookup_ip(
&self,
srv: crate::ServiceName,
) -> Result<IpAddr, ResolveError> {
let name = format!("{}.{}", srv.dns_name(), DNS_ZONE);
debug!(self.log, "lookup srv"; "dns_name" => &name);
let response = self.inner.lookup_ip(&name).await?;
let address = response
.iter()
.next()
.ok_or_else(|| ResolveError::NotFound(srv))?;
Ok(address)
}
}

#[cfg(test)]
Expand Down Expand Up @@ -427,6 +523,17 @@ mod test {
.expect("Should have been able to look up IP address");
assert!(cockroach_addrs.iter().any(|addr| addr.ip() == &ip));

// Look up all the Cockroach addresses.
let mut ips =
resolver.lookup_all_ipv6(ServiceName::Cockroach).await.expect(
"Should have been able to look up all CockroachDB addresses",
);
ips.sort();
assert_eq!(
ips,
cockroach_addrs.iter().map(|s| *s.ip()).collect::<Vec<_>>()
);

// Look up Clickhouse
let ip = resolver
.lookup_ipv6(ServiceName::Clickhouse)
Expand Down
17 changes: 17 additions & 0 deletions openapi/sled-agent.json
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,23 @@
"version": "0.0.1"
},
"paths": {
"/cockroachdb": {
"post": {
"summary": "Initializes a CockroachDB cluster",
"operationId": "cockroachdb_init",
"responses": {
"204": {
"description": "resource updated"
},
"4XX": {
"$ref": "#/components/responses/Error"
},
"5XX": {
"$ref": "#/components/responses/Error"
}
}
}
},
"/disks/{disk_id}": {
"put": {
"operationId": "disk_put",
Expand Down
Loading