Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

numa: enable numa topology detection #18146

Merged
merged 9 commits into from
Aug 10, 2023
Merged

numa: enable numa topology detection #18146

merged 9 commits into from
Aug 10, 2023

Conversation

shoenig
Copy link
Member

@shoenig shoenig commented Aug 3, 2023

Reviewers: see NMD-183 (internal).

This change looks larger than it is, just due to the amount of deletions and fallout in plumbing things through. It's crudely broken up into 9 commits (by filepath):

b17d418 - refactor how the client handles cgroups (moving responsibility from the client cpuset manager + task drivers into mostly just task hooks).

2121b44 - fingerprint numa topology (and eliminate aws cpu fingerprinter)

199dcb2 - plumbing numa and cgroups changes into drivers

1cd5264 - cleaning up task resource usage accounting

200a275 - plumbing for client config

33b958c - a small stack implementation (used by the refactoring around pid tracking / process resource utilization)

6d7401e - remove the ec2info tool

44c33ee - fixup testing packages in plugins/drivers

1ea6efc - build step changes

Documentation and probably some e2e testing will be in a followup PR.

Not sure what's going on with the test-e2e workflow (vault compat matrix) but I'll look into it.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @shoenig. I really like how this API turned out. It feels like there are a lot of layers but when you start digging you can see exactly how each one has its own responsibilities (ex. splitting out setting up the cgroups vs writing the resources into them)


package lang

// A Stack is a simple LIFO datastructure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe note that it's not concurrency safe?

@@ -72,6 +72,7 @@ func (tr *TaskRunner) initHooks() {
newStatsHook(tr, tr.clientConfig.StatsCollectionInterval, hookLogger),
newDeviceHook(tr.devicemanager, hookLogger),
newAPIHook(tr.shutdownCtx, tr.clientConfig.APIListenerRegistrar, hookLogger),
newWranglerHook(tr.wranglers, task.Name, alloc.ID, hookLogger),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hooks run in the order listed here. If we moved this earlier in the list (say right after the task dir hook or identity hook), would we be able to take advantage of having the cgroup setup for other child processes associated with the task (like the artifact hook or logmon hook or some future hypothetical sandboxed-template hook)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea! Created #18208 to track it.

)

const (
wranglerHookName = "wrangler"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nitpick, feel free to ignore: I like the whimsy of calling this wrangler but it will show up in logs for this hook. Does this word carry unintended meaning outside of specific cultural context? (Dictionary suggests it's "argumentative", a quick search brings up mostly American brand names, etc.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point - I can rename this to something like "procisolation" for now, unless someone has something better

@@ -795,7 +794,7 @@ func DefaultConfig() *Config {
CNIConfigDir: "/opt/cni/config",
CNIInterfacePrefix: "eth",
HostNetworks: map[string]*structs.ClientHostNetworkConfig{},
CgroupParent: cgutil.GetCgroupParent(""),
CgroupParent: "nomad.slice", // SETH todo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another TODO associated with this: the systemd service unit we package doesn't delegate management of our cgroup slice (ref https://systemd.io/CGROUP_DELEGATION/). We should probably add that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great callout - filed #18211 to cover that

Comment on lines +450 to +451
// restore process wrangler for task
ar.wranglers.Setup(proclib.Task{AllocID: tr.Alloc().ID, Task: tr.Task().Name})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the task handle restore fails, we'll never hit this block, so it doesn't seem like this is here to handle that case. Should this belong in the tr.Restore method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the ar.wranglers field belongs to the alloc runner which the individual task runner wouldn't have access to. Which, I think is technically unnecessary for the current implementation but in the long term it could also manage child processes associated with the alloc (and not any task).

@@ -1,688 +0,0 @@
// Code generated from hashicorp/nomad/tools/ec2info; DO NOT EDIT.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Comment on lines +57 to +63
func (hz KHz) MHz() MHz {
return MHz(hz / 1000)
}

func (hz KHz) String() string {
return strconv.FormatUint(uint64(hz.MHz()), 10)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this pointer receiver hz feels a little gross 😀

Suggested change
func (hz KHz) MHz() MHz {
return MHz(hz / 1000)
}
func (hz KHz) String() string {
return strconv.FormatUint(uint64(hz.MHz()), 10)
}
func (khz KHz) MHz() MHz {
return MHz(khz / 1000)
}
func (khz KHz) String() string {
return strconv.FormatUint(uint64(khz.MHz()), 10)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops! 😱

}

func (f *CPUFingerprint) setReservableCores(request *FingerprintRequest, response *FingerprintResponse) {
reservable := request.Config.ReservableCores
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we've dropped handling of user-provided configuration here, but we specifically refer to it in the new numa package. Should this still be part of the fingerprint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #18213 - will think about this more

func (m *MacOS) scanLegacyX86(top *Topology) {
coreCount, _ := unix.SysctlUint32("machdep.cpu.core_count")
hz, _ := unix.SysctlUint64("hw.cpufrequency")
coreSpeed := KHz(hz / 1_000_000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This smells funny... what's the unit of hw.cpufrequency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #18212 for myself to followup

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants