Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tuned-libnbc algorithm configuration upgrade #8435

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

goncalvt
Copy link

@goncalvt goncalvt commented Feb 2, 2021

  • A new node level is proposed for tuned configuration file.
  • The configuration file feature is now available for libnbc component.

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@ggouaillardet
Copy link
Contributor

@goncalvt thanks for the PR.

could you please give some more context on what this PR is doing and why?

In order to be accepted, all commits have to be signed off.

The usual way to have contributions upstreamed is to issue a PR vs the master branch, have it reviewed and merged.
Then the Release Manager of a given branch can be contacted to determine whether the changes should be back ported into this given branch, or will be part of the next major release (e.g. Open MPI 5)

@goncalvt
Copy link
Author

goncalvt commented Feb 2, 2021

@goncalvt thanks for the PR.

could you please give some more context on what this PR is doing and why?

In order to be accepted, all commits have to be signed off.

The usual way to have contributions upstreamed is to issue a PR vs the master branch, have it reviewed and merged.
Then the Release Manager of a given branch can be contacted to determine whether the changes should be back ported into this given branch, or will be part of the next major release (e.g. Open MPI 5)

Collective algorithm selection provided by tuned component allow to improve performance. Nevertheless, an issue happens when we want to choose the best algorithm for a given number of MPI ranks: The file format does not allow to differentiate between the case x ranks = N nodes x P processes per node and the case x ranks = N' nodes x P' processes per node. This leads to poor performance in some configurations. Thus, we propose to add a node level to configuration file. Also, this feature can be used in libnbc component.

@goncalvt
Copy link
Author

goncalvt commented Feb 2, 2021

The usual way to have contributions upstreamed is to issue a PR vs the master branch, have it reviewed and merged.
Then the Release Manager of a given branch can be contacted to determine whether the changes should be back ported into this given branch, or will be part of the next major release (e.g. Open MPI 5)

If i well understood i should move this PR on master branch ?

@ggouaillardet
Copy link
Contributor

yes, master is a better fit.

but let's wait for @bosilca take on that. The new coll/adapt might replace coll/tuned in the long run.

Also, @rhc54, can PMIx help more efficiently (time and memory overhead) implement ompi_coll_base_get_nnodes() ?

@@ -0,0 +1,302 @@
/*
* Copyright (c) 2020 Bull SAS. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should your copyrights be 2021?

ompi/mca/coll/base/coll_base_dynamic_file.c Show resolved Hide resolved
opal_output_verbose(1,ompi_coll_base_framework.framework_output,"Fix errors as listed above and try again.\n");

/* deallocate memory if allocated */
if (alg_rules) ompi_coll_base_free_all_rules (alg_rules, n_collectives);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style nit: please put all blocks in {}, even 1-line blocks.

We have very few style restrictions in Open MPI, but this is one of them. Thanks!

(If you haven't already, you might want to check out https://github.com/open-mpi/ompi/wiki/CodingStyle)

*
*/

int ompi_coll_base_read_rules_config_file (char *fname, int format_version, ompi_coll_base_alg_rule_t** rules, int n_collectives)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be major new functionality.

Can you provide some documentation for what it is, how it works, how users use it, why users should use it, ...etc.?

Can you also provide some test cases in https://github.com/open-mpi/ompi-tests-public?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a renaming (and moving) of the tuned capability (aka. ompi_coll_tuned_read_rules_config_file).


END_C_DECLS
#endif /* MCA_COLL_BASE_DYNAMIC_RULES_H_HAS_BEEN_INCLUDED */

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please remove blank lines at the ends of files.

@@ -558,3 +559,69 @@ const char* mca_coll_base_colltype_to_str(int collid)
}
return colltype_translation_table[collid];
}

OBJ_CLASS_INSTANCE(ompi_coll_base_hostname_item_t, opal_list_item_t, NULL, NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need a class here? I think you could probably use a char** here, and the opal_argv functions. It may not be a huge difference, but the opal_argv functions might be slightly lighter-weight...?

@gpaulsen gpaulsen added this to the v4.1.1 milestone Feb 2, 2021
@jsquyres jsquyres removed this from the v4.1.1 milestone Feb 2, 2021
@bosilca
Copy link
Member

bosilca commented Feb 2, 2021

Most of the features introduced here are interesting and definitively good to have. Basically, I would remove the extension to the dynamic configuration file, and pull everything else in master (not in the 4.1 branch).

As @ggouaillardet stated, tuned is not meant to support hierarchical collectives, and as such does not have the flexibility required to correctly address this issue. A combination of ADAPT and HAN will be the best way forward, if you have any questions I'll be happy to talk with you about.

@goncalvt
Copy link
Author

goncalvt commented Feb 3, 2021

Most of the features introduced here are interesting and definitively good to have. Basically, I would remove the extension to the dynamic configuration file, and pull everything else in master (not in the 4.1 branch).

As @ggouaillardet stated, tuned is not meant to support hierarchical collectives, and as such does not have the flexibility required to correctly address this issue. A combination of ADAPT and HAN will be the best way forward, if you have any questions I'll be happy to talk with you about.

I agree with the principle of the hierarchical approach of adapt/han but in my opinion this can't be sufficient in all cases. Depending on the network capabilities, the collective type, the communicator, etc. the hierarchical implementation may lead to performance bottleneck. The aim here is to provide an alternative solution to deals with the performance issue of collective implementations.

@jjhursey
Copy link
Member

jjhursey commented Feb 3, 2021

For ompi_coll_base_get_nnodes PMIx can help.

  • If you just need the count of the number of nodes in the job you can do a PMIx_Get on PMIX_NUM_NODES. It's a lookup from local shared memory so should be pretty fast.
  • If you need the full list of nodes (which I don't think you need here) then you can use either PMIx_Get on PMIX_NODE_LIST or you can use PMIx_Resolve_nodes.

@goncalvt
Copy link
Author

goncalvt commented Feb 3, 2021

For ompi_coll_base_get_nnodes PMIx can help.

  • If you just need the count of the number of nodes in the job you can do a PMIx_Get on PMIX_NUM_NODES. It's a lookup from local shared memory so should be pretty fast.
  • If you need the full list of nodes (which I don't think you need here) then you can use either PMIx_Get on PMIX_NODE_LIST or you can use PMIx_Resolve_nodes.

The purpose of the function ompi_coll_base_get_nnodes is to get the number of nodes of a specific communicator. It seems that the feature of PMIx you are talking about deals with the whole MPI job. The second solution (PMIX_NODE_lIST or PMIX_Resolve_nodes) may do the job but we must apply it on an OMPI communicator and i'm not sure that this can work. I will take a look at the pmix API anyway.

@jjhursey
Copy link
Member

jjhursey commented Feb 3, 2021

That is correct. The PMIx data will be for the whole job in this case (MPI_COMM_WORLD). Since PMIx doesn't have knowledge of the MPI communicator it might not be of much help here.

@bosilca
Copy link
Member

bosilca commented Feb 3, 2021

Depending on the network capabilities, the collective type, the communicator, etc. the hierarchical implementation may lead to performance bottleneck. The aim here is to provide an alternative solution to deals with the performance issue of collective implementations.

Do you have any proof of such a bold statement ? The literature has plenty of works highlighting the benefits of a hierarchical approach, and based of my understanding of the existing collective algorithms provided via tuned, not a single provide support for any reasonable, location based overlay communication tree.

@rhc54
Copy link
Contributor

rhc54 commented Feb 3, 2021

PMIx does know where each proc is located, so one could obtain the desired information easily enough. I don't know the relative speed of the two approaches. As @jjhursey said, the PMIx call is shmem-local, but you'd have to call it on each proc (or get the entire proc map for the job and then parse it). Can't say if that is faster than doing some MPI collective op and then parsing the result. Might be worth investigating.

@goncalvt
Copy link
Author

goncalvt commented Feb 4, 2021

Depending on the network capabilities, the collective type, the communicator, etc. the hierarchical implementation may lead to performance bottleneck. The aim here is to provide an alternative solution to deals with the performance issue of collective implementations.

Do you have any proof of such a bold statement ? The literature has plenty of works highlighting the benefits of a hierarchical approach, and based of my understanding of the existing collective algorithms provided via tuned, not a single provide support for any reasonable, location based overlay communication tree.

Actually, we made some tests and we observe that sometimes han obtained worst result than tuned. Han have an overhead due to the hierarchical initialization cost, it may explain that kind of results. It may be also explained by an algorithm configuration issue. I didn't made these tests personally so i haven't a lot of details to give you. Then, does Han implement all collectives ?

@goncalvt
Copy link
Author

goncalvt commented Feb 4, 2021

It seems that tuned config upgrade is not wished. I propose to remove that support on this Pull request. Even if tuned upgrade would be cancelled Libnbc configuration feature may still be submitted. So, i propose you to remove this PR and resubmit with only libnbc features. In this case, does libnbc file parser should support node level configuration ? Only ranks number level (as previous tuned) ? Don't hesitate to give your opinion.

/* For each rank */
for (i=0 ; i<group_size ; i++) {
proc = ompi_group_get_proc_ptr (group, i, true);
hostname = opal_get_proc_hostname(&proc->super);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhc54 @jjhursey ompi_coll_base_get_nnodes() returns the number of nodes used by a given communicator.
I do not expect PMIx to be MPI communicators aware.

That being said, the current implementation calls opal_get_proc_hostname() for each rank of the communicator. If I remember correctly, we tried to avoid that (because of both performance and memory overhead). A more efficient implementation would be to use a bitmap, use the nodeid (e.g. an int) of each proc to set the corresponding bit, and then count the number of bits in the bitmap. Can PMIx be used to retrieve the nodeid of each proc? (bonus point if PMIx can return the nodeids of a list or proc).

Note both current and proposed implementation do not use any collective operations.

I am pretty sure ROMIO and possibly ompio implement a similar subroutine, and would benefit from a single and optimized implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not expect PMIx to be MPI communicators aware.

Yeah, it definitely won't be unless PMIx_Connect was called when those communicators are formed. We do that for inter-communicators, so PMIx does know that membership. I'm pretty sure we don't do it for intra-communicators since those procs are already "connected" by definition, so we wouldn't know about the membership there.

Can PMIx be used to retrieve the nodeid of each proc?

Sure - just call PMIx_Get with the name of the proc and the PMIX_NODEID key

(bonus point if PMIx can return the nodeids of a list or proc).

You can, but you would need to use a different interface. You would PMIx_Query_info with the PMIX_NODEID key and an array of proc names. I'd have to add the backend support for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for all this fanciness, MPI provides support for doing all this in MPI. Similar capability is already available in HAN in the subcomm file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HAN uses MPI_Allreduce() and MPI_Allgather() to get this kind of information.

Wouldn't it be faster to get node ids from PMIx and then have each task independently process them to achieve the very same outcome?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not rely on yet another component for this kind of information. The MPI version is portable, and will remain so whatever runtime we are using. Plus, HAN's code does more than just what is required here, you only need a comm_split and a allgather. Moreover, if we are taking this route and will almost always need the leader and local communicators, we can embed their creation into the communicator creation, bringing the cost to a minimum.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid you misinterpreted a large part of the discussion here, nobody (and certainly not HAN) is ignoring the RM information in order to reinvent the wheel, we already have it in a format that is OMPI-specific and that we can use in a more optimized way than linearly going through the PMIx database.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss this on Tuesday's webex -- I don't have the larger context here, but looking at ompi_coll_base_get_nnodes(), I'm not quite grokking how a comm_split+allgather+string compares to test for "unknown" hostnames (which could lead to errors) and further nested string lookup loops would be faster than the put-the-node-id-into-a-bit-field approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this on the webex today, and things were much more clear to me:

  1. To be clear: @bosilca doesn't think that this functionality belongs here in tuned, anyway (please correct me if I got that sentiment incorrect).
  2. But if it is to be used here in tuned, we should probably extract the functionality out of HAN's init-time startup where this information (and some additional stuff) is already obtained, and put that in coll base. That way, other coll components can use it. From my understanding, the HAN implementation is already good/scalable/happy; it isn't an issue of PMIx vs. MPI -- it was more: we already have the information locally; there's no need to calculate that information again.
  3. We all recognize that this PR probably needs to be re-filed on master.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion topologic configuration should be avalaible in tuned. Tha main reason is that there is collective which does not have han implementation. Then, it's not guarantee that all collectives can be easily implemented without risk of bugs using han approach (alltoall with disordered ranks,etc.). In that cases fallback will be used and tuned implementations would be used and it's can be optimized using topologic configuration. In the worst case, that kind of configuration might be used in tuned fallback situations. This is quite true for libnbc.

Then, i'm not against the idea of merging topological init of Han and tuned. It implies that we consider that Han is the most of the time enabled and used and thus subcomms init is essential anyway. It's not a problem to me even if our customer does not use han right now (it may change soon). In such approach comm rules init in tuned must be delayed (same in libnbc of course). It implies a lot of change but it can facilitate the acceptability of this PR it's does not matter.

I will resubmit this PR on master branch anyway but before doing this i want to be sure that we agree about the global scope of what this PR can address. I'm at your disposal to find an agreement.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will evaluate Coll/han vs coll/tuned and try to illustrate on which case (at the moment) node level configuration on tuned can be useful. I imagine that there is an interest when tuned is used for both intra node level and inter node level with the same number of processes. In this situation the current file format can't differentiate both configuration resulting on non optimal performances. Then, depending on that results i will decide to abandon node level format or not.

@bosilca
Copy link
Member

bosilca commented Feb 4, 2021

Actually, we made some tests and we observe that sometimes han obtained worst result than tuned. Han have an overhead due to the hierarchical initialization cost, it may explain that kind of results. It may be also explained by an algorithm configuration issue. I didn't made these tests personally so i haven't a lot of details to give you. Then, does Han implement all collectives ?

HAN is dependent on the algorithm selection on the 2 levels of the hierarchy. HAN does not implement all collectives, but just yesterday @EmmanuelBRELLE added support for barrier, so the number of supported algorithms is increasing. I don't know in what cases tuned was able to provide better support than HAN, but I would definitively love to hear more details about, because tuned is expected to be retired and replaced with HAN. Basically, everything tuned can do should already be supported by HAN, including single level collective description.

@bosilca
Copy link
Member

bosilca commented Feb 4, 2021

It seems that tuned config upgrade is not wished. I propose to remove that support on this Pull request. Even if tuned upgrade would be cancelled Libnbc configuration feature may still be submitted. So, i propose you to remove this PR and resubmit with only libnbc features. In this case, does libnbc file parser should support node level configuration ? Only ranks number level (as previous tuned) ? Don't hesitate to give your opinion.

Right, dont update the configuration, but move it in base, and use it in libnbc. Take a look in HAN configuration file, it might more more suitable as a starting point for libnbc.

@EmmanuelBRELLE
Copy link
Contributor

EmmanuelBRELLE commented Feb 5, 2021

Actually, we made some tests and we observe that sometimes han obtained worst result than tuned. Han have an overhead due to the hierarchical initialization cost, it may explain that kind of results. It may be also explained by an algorithm configuration issue. I didn't made these tests personally so i haven't a lot of details to give you. Then, does Han implement all collectives ?

HAN is dependent on the algorithm selection on the 2 levels of the hierarchy. HAN does not implement all collectives, but just yesterday @EmmanuelBRELLE added support for barrier, so the number of supported algorithms is increasing. I don't know in what cases tuned was able to provide better support than HAN, but I would definitively love to hear more details about, because tuned is expected to be retired and replaced with HAN. Basically, everything tuned can do should already be supported by HAN, including single level collective description.

From my point of view, Han is complementary to tuned but does not replace it. Han says how to split the collectives (=Han algorithms) into smaller collectives (and how to order these collectives), whereas tuned selects the (best?) point to point communication pattern (from base) to acheive this collective given size of the (sub-)communicator and the data. For now tuned is still a nice-to-have in between Han and base

@goncalvt
Copy link
Author

goncalvt commented Feb 5, 2021

It seems that tuned config upgrade is not wished. I propose to remove that support on this Pull request. Even if tuned upgrade would be cancelled Libnbc configuration feature may still be submitted. So, i propose you to remove this PR and resubmit with only libnbc features. In this case, does libnbc file parser should support node level configuration ? Only ranks number level (as previous tuned) ? Don't hesitate to give your opinion.

Right, dont update the configuration, but move it in base, and use it in libnbc. Take a look in [HAN configuration file], it might more more suitable as a starting point for libnbc.

The Han parser handle string for collective and component. Is that kind of features you expect to see forwarded in libnbc-tuned
parser ?

@bosilca
Copy link
Member

bosilca commented Feb 5, 2021

From my point of view, Han is complementary to tuned but does not replace it. Han says how to split the collectives (=Han algorithms) into smaller collectives (and how to order these collectives), whereas tuned selects the (best?) point to point communication pattern (from base) to achieve this collective given size of the (sub-)communicator and the data. For now tuned is still a nice-to-have in between Han and base

It really depends how you want to see it. You are right, in the current form HAN calls into tuned for some of the collectives, but not because it really needs yet another level of decision, it is simply because we assumed that tuned will provide the best decision for a single layer of our hierarchy (intra or inter node). Moving away from this, is a simple matter of allowing algorithm naming instead of component naming in the configuration file, et voila.

@bosilca
Copy link
Member

bosilca commented Feb 5, 2021

The Han parser handle string for collective and component. Is that kind of features you expect to see forwarded in libnbc-tuned parser ?

That would be great, as it drastically simplifies the writing of these decision files. And does not change the internal structure of the storage, we convert back to a internal indexed naming scheme.

@jsquyres jsquyres changed the title Tuned-libnbc algorithm configuration upgrade v4.1.x: Tuned-libnbc algorithm configuration upgrade Feb 5, 2021
@jsquyres jsquyres added this to the v4.1.1 milestone Feb 5, 2021
@jsquyres jsquyres marked this pull request as draft February 6, 2021 21:21
@jsquyres
Copy link
Member

jsquyres commented Feb 6, 2021

I took the liberty of converting this PR to Draft status so that we don't accidentally merge it before it has been merged to master and cherry-picked to v4.1.x.

@jsquyres
Copy link
Member

jsquyres commented Feb 9, 2021

A larger question: does this functionality belong in v4.1.x? Technically, it's a whole new feature, and since it wasn't included in 4.1.0, that's not really what we do in subreleases like this.

Are there bug fixes that need to be separated out of this PR that should be applied to v4.1.x?

@goncalvt
Copy link
Author

A larger question: does this functionality belong in v4.1.x? Technically, it's a whole new feature, and since it wasn't included in 4.1.0, that's not really what we do in subreleases like this.

Are there bug fixes that need to be separated out of this PR that should be applied to v4.1.x?

There is no specific need for 4.1 branch.

@rajachan rajachan removed this from the v4.1.1 milestone Feb 12, 2021
@gpaulsen gpaulsen added this to the v4.1.2 milestone Apr 26, 2021
@jsquyres
Copy link
Member

@goncalvt Can you re-target this against master instead of the v4.1.x branch?

@jsquyres jsquyres removed this from the v4.1.2 milestone Apr 26, 2021
@jsquyres jsquyres changed the title v4.1.x: Tuned-libnbc algorithm configuration upgrade Tuned-libnbc algorithm configuration upgrade Apr 26, 2021
@goncalvt goncalvt changed the base branch from v4.1.x to master April 27, 2021 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants