Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish between Scrimlets and Gimlets #1510

Merged
merged 3 commits into from
Jul 28, 2022
Merged

Distinguish between Scrimlets and Gimlets #1510

merged 3 commits into from
Jul 28, 2022

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Jul 28, 2022

Part of #823

Fixes #1511

Sled-Agent

  • Uses the presence/absence of RSS to decide "am I a Scrimlet"
  • Sends that info upward to Nexus

Nexus

  • Parses the "is this Sled a scrimlet" info out of the sled registration request
    • While I was there, renamed the function, made it PUT not POST
  • Stores this info in the DB
    • I plan on using this info in follow-up PRs to decide "should Dendrite be running here".

@smklein smklein requested a review from bnaecker July 28, 2022 14:31
Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I don't think it matters much now, but we'll definitely need to handle the sled agent coming up without knowing whether it's running on a Scrimlet, and asking Nexus to update the database once that can be determined.

@smklein smklein force-pushed the scrimlet-report-in branch from 1b3365f to d172c9c Compare July 28, 2022 16:46
@smklein
Copy link
Collaborator Author

smklein commented Jul 28, 2022

Looks good to me! I don't think it matters much now, but we'll definitely need to handle the sled agent coming up without knowing whether it's running on a Scrimlet, and asking Nexus to update the database once that can be determined.

Sounds good. Updated the upsert function to update this value accordingly.

@smklein smklein enabled auto-merge (squash) July 28, 2022 17:19
// TODO(https://github.com/oxidecomputer/omicron/issues/823):
// Currently, the prescence or abscence of RSS is our signal
// for "is this a scrimlet or not".
// Longer-term, we should make this call based on the underlying

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that this changes at runtime with respect to hardware. It's not a static thing and the underlying hardware presence will come and go as the Sidecar is power cycled.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That matches my understanding. I do think there is more work that'll need to smooth out the case of "a sled has been removed from one location in the rack, and plugged in elsewhere" - perhaps a call to Nexus, to figure out what services should change - but otherwise, this makes sense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, while the movement bit seemed captured, I just wanted to make sure it was clear that you might choose to start a bootstrap agent before this is known and wanted to call that out as part of this given what appeared the initial start the agent piece was encoding this knowledge right now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks Robert. I believe Sean added functionality to update the database's record including the is_scrimlet value in d172c9c.

One question I have though is about when that value is updated. At RSS-time, it's definitely possible that the sled agent comes up before Sidecar, so we'd need to support updating it from false to true. But it seems kind of "sticky" to me. That is, I'm not sure it makes sense to set this to false, at least not right away, when Sidecar is cycled. If that's done as part of an administrative action or by Nexus, it seems like we'd want to keep the knowledge that this is or was a Scrimlet, and report 503s from dpd or the sled agent. If Sidecar cycles unexpectedly, we probably still want to do that, along with generating an alert of some kind. I don't know if there's any action here, but spelling that out seems useful to me.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect an initial policy of stickiness across a given boot will be a pretty useful way to go here. That is, when it's rebooted, it'll be something that is determined fresh and resets stickiness.

@smklein smklein merged commit 28ffeef into main Jul 28, 2022
@smklein smklein deleted the scrimlet-report-in branch July 28, 2022 17:47
leftwo pushed a commit that referenced this pull request Oct 25, 2024
Crucible changes
Add test and fix for replay race condition (#1519)
Fix clippy warnings (#1517)
Add edition to `crucible-workspace-hack` (#1516)
Split out Downstairs-specific stats (#1511)
Move remaining `GuestWork` functionality into `Downstairs` (#1510)
Track both jobs and bytes in each IO state (#1507)
Fix new `rustc` and `clippy` warnings (#1509)
Remove IOP/BW limits (for now) (#1506)
Move `GuestBlockRes` into `DownstairsIO` (#1502)
Update actions/checkout digest to eef6144 (#1499)
Update Rust crate hyper-staticfile to 0.10.1 (#1411)
Turn off test-up-2region-encrypted.sh (#1504)
Add `IOop::Barrier` (#1494)
Fix IPv6 addresses in `crutest` (#1503)
Add region set options to more tests. (#1496)
Simplify `CompleteJobs` (#1493)
Removed ignored CI jobs (#1497)
Minor cleanups to `print_last_completed` (#1501)
Remove remaining `Arc<Volume>` instances (#1500)
Add `VolumeBuilder` type (#1492)
remove old unused scripts (#1495)
More multiple region support. (#1484)
Simplify matches (#1490)
Move complete job tracker to a helper object (#1489)
Expand summary and add documentation references to the README. (#1486)
Remove `GuestWorkId` (2/2) (#1482)
Remove `JobId` from `DownstairsIO` (1/2) (#1481)
Remove unused `#[derive(..)]` (#1483)
Update more tests to use dsc (#1480)
Crutest now Volume only (#1479)

Propolis changes
manually impl Deserialize for PciPath for validation purposes (#801)
phd: gate OS-specific tests, make others more OS-agnostic (#799)
lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795)
add marker trait to help check safety of guest memory reads (#794)
clippy fixes for 1.82 (#796)
lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782)
PHD: write efivars in one go (#786)
PHD: support guest-initiated reboot (#785)
server: accept CPUID values in instance specs and plumb them to bhyve (#780)
PHD: allow patched Crucible dependencies (#778)
server: add a first-class error type to machine init (#777)
PciPath to Bdf conversion is infallible; prove it and refactor (#774)
instance spec rework: flatten InstanceSpecV0 (#767)
Make PUT /instance/state 503 when waiting to init
Less anxiety-inducing `Vm::{get, state_watcher}`
leftwo added a commit that referenced this pull request Oct 30, 2024
Crucible changes
Add test and fix for replay race condition (#1519) Fix clippy warnings
(#1517)
Add edition to `crucible-workspace-hack` (#1516)
Split out Downstairs-specific stats (#1511)
Move remaining `GuestWork` functionality into `Downstairs` (#1510) Track
both jobs and bytes in each IO state (#1507) Fix new `rustc` and
`clippy` warnings (#1509)
Remove IOP/BW limits (for now) (#1506)
Move `GuestBlockRes` into `DownstairsIO` (#1502)
Update actions/checkout digest to eef6144 (#1499)
Update Rust crate hyper-staticfile to 0.10.1 (#1411) Turn off
test-up-2region-encrypted.sh (#1504)
Add `IOop::Barrier` (#1494)
Fix IPv6 addresses in `crutest` (#1503)
Add region set options to more tests. (#1496)
Simplify `CompleteJobs` (#1493)
Removed ignored CI jobs (#1497)
Minor cleanups to `print_last_completed` (#1501)
Remove remaining `Arc<Volume>` instances (#1500)
Add `VolumeBuilder` type (#1492)
remove old unused scripts (#1495)
More multiple region support. (#1484)
Simplify matches (#1490)
Move complete job tracker to a helper object (#1489) Expand summary and
add documentation references to the README. (#1486) Remove `GuestWorkId`
(2/2) (#1482)
Remove `JobId` from `DownstairsIO` (1/2) (#1481)
Remove unused `#[derive(..)]` (#1483)
Update more tests to use dsc (#1480)
Crutest now Volume only (#1479)

Propolis changes
manually impl Deserialize for PciPath for validation purposes (#801)
phd: gate OS-specific tests, make others more OS-agnostic (#799) lib:
log vCPU diagnostics on triple fault and for some unhandled exit types
(#795) add marker trait to help check safety of guest memory reads
(#794) clippy fixes for 1.82 (#796)
lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts
(#782) PHD: write efivars in one go (#786)
PHD: support guest-initiated reboot (#785)
server: accept CPUID values in instance specs and plumb them to bhyve
(#780) PHD: allow patched Crucible dependencies (#778)
server: add a first-class error type to machine init (#777) PciPath to
Bdf conversion is infallible; prove it and refactor (#774) instance spec
rework: flatten InstanceSpecV0 (#767) Make PUT /instance/state 503 when
waiting to init
Less anxiety-inducing `Vm::{get, state_watcher}`

---------

Co-authored-by: Alan Hanson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Report the "is gimlet/scrimlet" info upward to Nexus
3 participants