Polaris app_run changes #357
Conversation
```python
cpu_ids = self._node_spec.cpu_ids[0]
cpus_per_node = len(cpu_ids)
if not cpu_ids:
    compute_node = ComputeNode(self._node_spec.node_ids[0], self._node_spec.hostnames[0])
```
One thing to watch out for here is that ComputeNode will always have an empty list of cpu_ids, since it's the generic base class. Only at runtime do the launchers load the specific compute node class from the site configuration.

I think it should be safe for the NodeManager to include the full set of CPU IDs in the NodeSpec of a multinode job. This can be done by updating the method NodeManager._assign_multi_node. That would keep the abstraction intact and avoid the need to pass an extra piece of information (which subclass of ComputeNode) from the launcher to the AppRun.

https://github.com/argonne-lcf/balsam/blob/main/balsam/site/launcher/node_manager.py#L68
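For illustration, here is a minimal sketch of that suggestion, with stand-in definitions of ComputeNode and NodeSpec; the real classes in balsam/site/launcher/node_manager.py differ, so the field names and signature below are assumptions, not Balsam's API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComputeNode:
    # Stand-in for Balsam's ComputeNode: the generic base class carries empty
    # cpu_ids/gpu_ids; at runtime the launcher loads a site-specific subclass
    # whose lists are populated.
    node_id: int
    hostname: str
    cpu_ids: List[int] = field(default_factory=list)
    gpu_ids: List[int] = field(default_factory=list)
    occupancy: float = 0.0

@dataclass
class NodeSpec:
    # Stand-in for Balsam's NodeSpec (field names inferred from this thread).
    node_ids: List[int]
    hostnames: List[str]
    cpu_ids: List[List[int]]
    gpu_ids: List[List[int]]

def assign_multi_node(nodes: List[ComputeNode], num_nodes: int) -> NodeSpec:
    # Sketch of the proposed _assign_multi_node update: copy each assigned
    # node's full CPU/GPU ID lists into the NodeSpec instead of leaving them
    # empty, so AppRun never needs to know which ComputeNode subclass is in use.
    assigned = nodes[:num_nodes]
    for node in assigned:
        node.occupancy = 1.0  # multi-node jobs occupy whole nodes
    return NodeSpec(
        node_ids=[n.node_id for n in assigned],
        hostnames=[n.hostname for n in assigned],
        cpu_ids=[list(n.cpu_ids) for n in assigned],
        gpu_ids=[list(n.gpu_ids) for n in assigned],
    )
```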
I thought about doing that (modifying _assign_multi_node), but the way the NodeSpec object is structured, cpu_ids and gpu_ids will then be lists of a bunch of identical lists stored in memory. That seemed a bit silly. I did wonder why cpu_ids and gpu_ids have to be lists of lists? The only time they contain non-empty lists is in the single-node case. Having a list for every node does not seem to be used in any functional way.
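To make the "lists of lists" shape concrete (all values here are invented for illustration, not taken from Balsam):

```python
# Single-node job: cpu_ids/gpu_ids hold one non-empty per-node list.
single = {"node_ids": [7], "cpu_ids": [[0, 1, 2, 3]], "gpu_ids": [[0]]}

# Multi-node job today: one empty list per node.
multi_now = {"node_ids": [7, 8], "cpu_ids": [[], []], "gpu_ids": [[], []]}

# Multi-node job under the suggested change: identical full lists repeated
# once per node, which is the redundancy being questioned here.
multi_new = {"node_ids": [7, 8], "cpu_ids": [[0, 1, 2, 3], [0, 1, 2, 3]], "gpu_ids": [[0], [0]]}
```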
You're totally right that none of the implementations so far need cpu_ids or gpu_ids to be a list of lists! This was just an attempt to provide a generic interface, in case some future job launch mechanism provided more fine-grained control over resources (like using 2 GPUs on node X and 1 GPU on node Y).

The current decision that you see in NodeManager for multi-node jobs was "let's assume that if a job needs more than one node, it uses a *whole number of nodes*." That's why _assign_multi_node requests full node occupancy (setting each node's occupancy to 1.0) and doesn't bother with setting CPUs.

Honestly, I don't know if there's ever a situation where it makes sense to have a multi-node job that only uses partial resources of each node.

The list of lists does seem silly in light of the fact that it's not used; it's a great call-out! Still, I would say the impact of sending that list of lists in each node spec would be negligible as far as memory usage or code maintainability go. Alternatively, including the ComputeNode class as another attribute in the NodeSpec seems like a good option too.
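A rough sketch of that alternative (the compute_node_cls attribute is hypothetical, not something Balsam has today):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Type

class ComputeNode:
    # Stand-in generic base class; site configurations supply subclasses
    # with real cpu_ids/gpu_ids.
    cpu_ids: List[int] = []
    gpu_ids: List[int] = []

@dataclass
class NodeSpec:
    node_ids: List[int]
    hostnames: List[str]
    cpu_ids: List[List[int]] = field(default_factory=list)
    gpu_ids: List[List[int]] = field(default_factory=list)
    # Hypothetical extra attribute: the site's ComputeNode subclass, so that
    # AppRun could recover a node's full CPU list when cpu_ids is empty.
    compute_node_cls: Optional[Type[ComputeNode]] = None
```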
For reference, the diff under discussion in balsam/platform/app_run/app_run.py (#357 (comment)):
```diff
@@ -67,10 +68,23 @@ def get_num_ranks(self) -> int:
         return self._ranks_per_node * len(self._node_spec.node_ids)

     def get_cpus_per_rank(self) -> int:
-        cpu_per_rank = len(self._node_spec.cpu_ids[0]) // self._ranks_per_node
-        if not cpu_per_rank:
-            cpu_per_rank = max(1, int(self._threads_per_rank // self._threads_per_core))
-        return cpu_per_rank
+        # Get the list of cpus assigned to the job. If it is a single node job, that is stored in
+        # the NodeSpec object. If it is a multinode job, the cpu_ids assigned to NodeSpec is empty,
+        # so we will assume all cpus on a compute node are available to the job. The list of cpus is
+        # just the list of cpus on the node in that case.
+        cpu_ids = self._node_spec.cpu_ids[0]
+        cpus_per_node = len(cpu_ids)
+        if not cpu_ids:
+            compute_node = ComputeNode(self._node_spec.node_ids[0], self._node_spec.hostnames[0])
```
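The quoted diff cuts off mid-hunk. A hedged reconstruction of how the rest of the new logic likely behaves, based only on this discussion (the standalone function form and fallback details are assumptions, not the PR's actual code):

```python
from typing import List

def cpus_per_rank(
    cpu_ids: List[int],          # node_spec.cpu_ids[0]
    node_cpu_ids: List[int],     # full CPU list from the site's ComputeNode
    ranks_per_node: int,
    threads_per_rank: int,
    threads_per_core: int,
) -> int:
    # Prefer the CPUs assigned in the NodeSpec; for multi-node jobs (empty
    # cpu_ids) fall back to the whole node's CPU list, as the PR describes.
    cpus_per_node = len(cpu_ids) if cpu_ids else len(node_cpu_ids)
    result = cpus_per_node // ranks_per_node
    if not result:
        # Mirror the old fallback when no whole CPU fits per rank.
        result = max(1, threads_per_rank // threads_per_core)
    return result
```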
Codecov Report
```
@@            Coverage Diff             @@
##             main     #357      +/-   ##
==========================================
- Coverage   60.91%   60.69%   -0.23%
==========================================
  Files         157      157
  Lines        9627     9677      +50
  Branches     1259     1271      +12
==========================================
+ Hits         5864     5873       +9
- Misses       3502     3544      +42
+ Partials      261      260       -1
```

☔ View full report in Codecov by Sentry.
This PR modifies the way cpu_bind is set to make it consistent with the cpus assigned to the job by the node manager. A few items/questions that should be noted:

- -d depends on the value of cpu-bind. If cpu-bind=cores, -d is the number of physical cores per rank. If it is depth or numa, it is the number of hardware threads. We have a check for this now in _build_cmdline (see the sketch after this list).
- get_cpus_per_rank(): it now uses the ComputeNode cpu_ids in the multinode case, and it has been reorganized to be more readable.
- Should we set OMP_PLACES and OMP_PROC_BIND for the user?
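For the -d item above, a hedged sketch of the kind of check being described (function and parameter names are invented here; the actual _build_cmdline logic may differ):

```python
def mpiexec_depth(cpu_bind: str, cores_per_rank: int, threads_per_core: int) -> int:
    # Sketch of the -d rule described above for mpiexec on Polaris:
    # with cpu-bind=cores, -d counts physical cores per rank; with
    # cpu-bind=depth or cpu-bind=numa, it counts hardware threads.
    if cpu_bind.startswith("core"):
        return cores_per_rank
    if cpu_bind in ("depth", "numa"):
        return cores_per_rank * threads_per_core
    return cores_per_rank  # conservative default for other bind modes

# Example: 8 physical cores per rank, 2 hardware threads per core.
assert mpiexec_depth("cores", 8, 2) == 8
assert mpiexec_depth("depth", 8, 2) == 16
```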