
Python/concurrency defaults #1979

Merged

Conversation

@brenthuisman (Contributor) commented Sep 15, 2022

  • Rename Python's create_context to make_context_shim, to align with arb::make_context.
  • A special no-arg make_context_shim initializes a context with no MPI and an arbenv::default_allocation(). This is a proc_allocation with arbenv::thread_concurrency() threads and an arbenv::default_gpu(). This no-arg constructor is not exposed and is used only by the no-arg arbor.context() constructor and the single-arg (= recipe) arbor.simulation() constructor (see the sketch after this list).
    • This removes any setup effort for the default (local) use case, while keeping proper defaults when users request MPI.
  • Bind a few more helper functions from arbenv in arbor.env, e.g. arbor.env.thread_concurrency() binds arbenv::thread_concurrency().
  • Update tests and tutorial texts.
  • [x] Align the pybind11 minimum version in python/CMakeLists.txt with the one used elsewhere.
  • Docs updated and markup corrected to reflect the changes in Feature: External Connectivity #2001:
    • doc/cpp/hardware.rst
    • doc/concept/interconnectivity.rst
  • Update the proc_allocation printer for 📊 Pin threads #2094.
  • Expand arbor.config() with timestamp and python_lib_path.
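A rough sketch of the resulting user-facing defaults (illustration only, not code from this PR; it assumes arbor.env is importable as shown and uses the thread_concurrency name given above):

    import arbor

    # No arguments: use all locally available threads and the default GPU (if any), no MPI.
    ctx = arbor.context()

    # Helper bound from arbenv; this is the thread count the no-arg context picks up.
    print(arbor.env.thread_concurrency())

    # A recipe-only simulation builds the same default context internally, e.g.
    # sim = arbor.simulation(recipe)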

@brenthuisman brenthuisman marked this pull request as draft September 15, 2022 14:40
@brenthuisman brenthuisman marked this pull request as ready for review September 15, 2022 14:55
@thorstenhater (Contributor)

Hi @brenthuisman,

thanks, that might save a few people a question on the help board. However, the tests run into a timeout.
I'll restart the job for now, but if it persists, please check what's going on.

@brenthuisman (Contributor, Author) commented Sep 21, 2022

I restarted the job a few times now; it's the Python examples running on the pip source install that do not terminate. The logs are not very helpful (we need to print script filenames to see which one is actually hanging). I'll try to take a look today or tomorrow.

Update: the offending script is plasticity.py. It doesn't do much but eat cycles if you give it more than one thread.

@thorstenhater (Contributor)

Interesting ... can you figure out where exactly it spins?

@thorstenhater (Contributor) commented Sep 22, 2022

Ok, this appears in exactly this constellation:

  • Python
  • threads > 1
  • calling sim.update after sim.__init__

and results in an uninterruptible hang in update. My knee-jerk reaction is to blame the threading implementation, but it's a weird one.

Further: spike recording and the number of connections are not relevant.

Yes, threading seems to be the cause, in particular communicator.cpp:78

    for (int ix = 0; ix < gids.size(); ++ix) {
        auto gid = gids[ix];
        gid_infos[ix] = gid_info(gid, ix, rec.connections_on(gid));
    }

instead of

    threading::parallel_for::apply(0, gids.size(), thread_pool_.get(),
        [&](cell_size_type i) {
            auto gid = gids[i];
            gid_infos[i] = gid_info(gid, i, rec.connections_on(gid));
        });

Other threaded blocks here are ok.

@thorstenhater (Contributor)

Ok, found the perpetrator. See #1984.

@thorstenhater (Contributor)

This should be unblocked now.

@boeschf (Contributor) commented Oct 5, 2022

Seems to hang in the Python examples again.

Merge remote-tracking branch 'upstream/master' into python/concurrency_defaults
@brenthuisman (Contributor, Author)

Runs OK now locally!

@thorstenhater (Contributor) left a review:

Nice! A few changes I think we should have before merging:

  • Please pull out the pybind version bump.
  • Make C++ behave like we'd like Python to.
  • Have 'avail_threads' even in the presence of a GPU.

If you like, you can also make this even fancier by making it play nice with MPI, but that's not necessary to merge.

Resolved review threads on:
  • doc/python/hardware.rst
  • doc/tutorial/network_ring.rst
  • doc/tutorial/network_ring_mpi.rst
  • python/CMakeLists.txt
  • python/context.cpp (2)
@@ -210,7 +212,7 @@ void register_simulation(pybind11::module& m, pyarb_global_ptr global_ptr) {
             const std::optional<arb::domain_decomposition>& decomp,
             std::uint64_t seed) {
             try {
-                auto ctx = ctx_ ? ctx_ : std::make_shared<context_shim>(arb::make_context());
+                auto ctx = ctx_ ? ctx_ : std::make_shared<context_shim>(arb::make_context({arbenv::thread_concurrency(),-1}));
Contributor:

So, under the idea that Python and C++ APIs should co-evolve, I am against this. Keep it at the default here and change the default in C++. But still keep the option to ask for 'avail_threads' explicitly.

Contributor Author:

So, you mean also make avail_threads an option in C++?

Contributor:

Well, currently it is, right, via thread_concurrency?

Contributor Author:

Ah, sure, but it isn't a default (yet). I thought that was your request?

I was thinking that the best place to set this default (once) is in proc_allocation, but that won't work, because it does not know about MPI. So, the default context constructors should set num_threads to arbenv::thread_concurrency() if MPI is not enabled.

In the interest of keeping things similar, should the variable be called thread_concurrency in Python too, and be a member of arbor? E.g. arbor.context(arbor.thread_concurrency(), None)? I like it!

@thorstenhater (Contributor) commented Oct 24, 2022:

Should be in arbor.env, yes, but that's the vision.
The C++ c'tor could be this:

context(std::optional<int> threads={},
        std::optional<int> gpu_id={},
        std::optional<MPI_Comm> comm={}) {
  if (!threads && !comm) {
  }
  else if (!threads && comm) {
  }
  else if ...

I know that gpu_id=-1 is used instead of an optional, and that we cannot mention the type MPI_Comm without MPI=ON, but this is the type we should have in mind.

Contributor Author:

Unlike in Python, there is no explicit constructor in C++, it all goes through proc_allocation: execution_context(const proc_allocation& resources = proc_allocation{}). Also, having this in C++ would make (all of) Arbor dependent on arbenv.

Since there is currently no way to do something like auto ctx = arbor::context{gpu_id=1}, we could set proc_allocation.num_threads by changing the non-MPI constructor to execution_context(const proc_allocation& resources = proc_allocation{arbenv::thread_concurrency(),-1});. To the extent that execution_context construction maps to Python, it would set num_threads under the same circumstances.

Contributor:

> Unlike in Python, there is no explicit constructor in C++, it all goes through proc_allocation: execution_context(const proc_allocation& resources = proc_allocation{}).

You could call proc_allocation a set of kwargs. Once we have achieved C++20 in its full glory, this becomes a thing:

    execution_context({.threads=4, .comm=MPI_COMM_WORLD})

> Also, having this in C++ would make (all of) Arbor dependent on arbenv.

That's fine to a certain extent; arbenv is a library made for Arbor to interact with the environment, as long as we do not pull it into the deepest layers.

Also, there is nothing stopping us from making new, purpose-built c'tors, e.g.

context from_mpi_one_thread_per_logical_core(MPI_Comm& comm, int gpu_id=-1);
context from_mpi_one_thread_per_physical_core(MPI_Comm& comm, int gpu_id=-1);
// ...

Also, if proc_allocation is really too much of a hassle to have around, feel free to remove it.

Contributor Author:

I've updated the code as per my earlier suggestion, in the interest of merging this now.

  • Removal of proc_allocation is fine by me, but I'm not a user. The C++ tests do use it, which is why I have not suggested removing it.
  • MPI cleverness: to me that can be seen as a separate issue (the PR was initially written to get a better default for local use). I'll ask in the channel whether users are missing something.

Resolved review thread on python/example/network_ring_gpu.py
-The number of threads available locally for execution. Must be set to 1 at minimum. 1 by default.
-Passing ``"avail_threads"`` (as string) will query and use the maximum number of threads the system makes available.
+The number of threads available locally for execution. Defaults to the maximum number of threads the system makes available if gpu_id and mpi are not set, else defaults to 1.
Contributor:

Considering the rest of the change set, how about digging in a bit deeper and coming up with sensible defaults for all cases? It's not a requirement to merge, but if we are going to put it in eventually, why not in 0.8?
I don't know whether arbenv has the proper calls, but:

  • MPI: figure out how many threads a task is allowed to use (using hwloc, maybe) and set that when handed 'avail_threads'. Other than that there are OS-specific calls (which hwloc uses); we would need to consider macOS and Linux.
  • GPU: can just use 'avail_threads', it speeds up construction a bit. Independent of the change above. If 'avail_threads' receives super-charged capabilities w.r.t. MPI, it'll benefit; if not, then not.
  • MPI+GPU: do whatever MPI does. Likely we are using one task per GPU.

Contributor Author:

My initial idea was to only touch the default constructor, to provide the best experience for novice users.

If you'd like to touch more than the default constructor: we could initialize threads to thread_concurrency. It's straightforward for the user to understand, but, in particular when MPI is used, it may result in surprising behavior.

Contributor:

Well, the experienced folks appreciate less typing, too. So, yes, thread_concurrency is fine, but its interaction with MPI must be understood and documented. A sensible approach could be to try and use thread_concurrency/MPI_size. BUT if we are on an HPC cluster, which should be 90% of cases when MPI is in play, someone might have set up a thread mask accordingly already. Some time ago we looked at OMP_NUM_THREADS, which is used and exported by SLURM and its ilk.
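A hedged sketch of the thread_concurrency/MPI_size idea discussed here (assumptions: an MPI-enabled build that accepts an mpi4py communicator, and the arbor.env.thread_concurrency() binding from this PR; the naive division by the global rank count ignores node placement, cpusets, and OMP_NUM_THREADS):

    import arbor
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    # Naive split: divide the locally detected concurrency by the number of ranks.
    threads = max(1, arbor.env.thread_concurrency() // comm.Get_size())
    ctx = arbor.context(threads=threads, mpi=comm)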

@brenthuisman (Contributor, Author) commented Oct 24, 2022:

MPI config is its own world of expertise, and I wager normal users stick to 1 thread per rank (I have never seen or done differently, unless it was you trying to squeeze out every last drop).

So, my preference is to default threads to thread_concurrency unless mpi is set, in which case it stays 1. If cleverer people than I come up with a thought-out proposal for MPI, I'm happy to consider it. Agree?

Contributor:

As I said, I'd be fine with merging this part as is, but I consider the extra effort of making it extra nice for the user interesting. I think the common patterns are:

  1. No MPI, 1 thread per (logical) core
  • with GPU
  • without GPU
  2. MPI, still 1 thread per (logical) core, but now each task gets handed 1/Nth, where N is the number of tasks according to
  • 1 task per socket
  • 1 task per GPU

So, the approach is to compute

  • the (logical) core count: arbenv::cores(logical=true|false)
  • the number of tasks: handed to you via Comm_size, defaults to 1

and context or thread_concurrency could be instructed whether or not to distribute threads over tasks.

Contributor Author:

See the note on MPI above.

Contributor:

As I said, this would be nice to have rather than critical. I am personally fine with typing these out, but as we were sanding off some rougher edges, it offered an easy win.

Contributor Author:

I'd want to test this out, but I don't think I'll have the time this week. Making the release will almost certainly eat up the rest of my hours, and I prefer not to postpone it further. Maybe a small discussion in the team meeting about removing proc_alloc and adding these new constructors?

Contributor:

As I said: optional, and only if proc_alloc blocks you from achieving more.

@brenthuisman (Contributor, Author)

C++ was left out, since I assumed (some people in the past certainly argued so) that there should be no magic. I'm OK with keeping them aligned; I just want to make sure this has changed.

@thorstenhater (Contributor)

For me, alignment beats the no-magic paradigm, so if there must be magic, it should always work the same. One might argue that under this requirement it is not magic anymore.

@brenthuisman (Contributor, Author)

Hmm, I see a CI build error pop up that I didn't see locally:

CMake Error: The inter-target dependency graph contains the following strongly connected component (cycle):
  "arbor" of type SHARED_LIBRARY
    depends on "arborenv" (weak)
  "arborenv" of type SHARED_LIBRARY
    depends on "arbor" (weak)
At least one of these targets is not a STATIC_LIBRARY.  Cyclic dependencies are allowed only among static libraries.

So, how do we handle this for the shared variant? Do we need that variant for anything (I don't recall noticing it before)?
The Python build test does not show the error.

@brenthuisman brenthuisman marked this pull request as ready for review April 20, 2023 15:08
@brenthuisman (Contributor, Author)

Updated and changed; the PR message reflects the major changes.

@brenthuisman brenthuisman requested review from thorstenhater and boeschf and removed request for thorstenhater and boeschf April 20, 2023 15:10
@thorstenhater (Contributor) left a review:

Shaping up to be a nice addition; some comments.

Contributor:

These are largely unrelated formatting changes? Fine, but confusing to review.

Contributor Author:

True-ish, I had to touch hardware.rst and discovered some things neglected/incorrect in #2001.

@@ -214,6 +214,7 @@ TEST(SPIKES_TEST_CLASS, threshold_watcher_interpolation) {
     dict.set("mid", arb::ls::on_branches(0.5));

     arb::proc_allocation resources;
+    resources.num_threads = 1;
Contributor:

Why one?

@brenthuisman (Contributor, Author) commented May 4, 2023:

Unless we are testing the defaults, I like tests to be explicit. default_concurrency() would also be fine.

@@ -152,6 +152,7 @@ namespace {
 TEST(domain_decomposition, homogenous_population)
 {
     proc_allocation resources;
+    resources.num_threads = 1;
Contributor:

Same as below.

@@ -219,6 +220,7 @@ TEST(domain_decomposition, homogenous_population)
 TEST(domain_decomposition, heterogenous_population)
 {
     proc_allocation resources;
+    resources.num_threads = 1;
Contributor:

Same as below.

-# (12) Create an execution context using all locally available threads and simulation
-ctx = arbor.context(threads="avail_threads")
-sim = arbor.simulation(recipe, ctx)
+# (12) Create a simulation using all locally available threads
Contributor:

Yeah, it's short, yet less clear what's going on.

Suggested change:
-# (12) Create a simulation using all locally available threads
+# (12) Create a simulation using the default settings:
+# - Use all threads available
+# - Use round-robin distribution of cells across groups with one cell per group
+# - No GPU
+# - No MPI
+# - Empty label dictionary
+# Other constructors of simulation can be used to change all of these.

Contributor Author:

Good point. Slightly modified in the upcoming commit. The label dict is left out; it's set elsewhere.

@@ -66,6 +66,17 @@ std::optional<T> py2optional(pybind11::object o, const char* msg) {
     return o.is_none()? std::nullopt: std::optional<T>(std::move(value));
 }

+template <typename T, typename F>
Contributor:

template <typename T, typename F>
std::optional<std::decay_t<T>> optional_when(T&& o, F&& pred) {
    if (pred(o)) {
        return {std::forward<T>(o)};
    } else {
        return {};
    }
}
  • A slightly more descriptive name
  • Do not construct the optional if not needed
  • Play nicely with rvalue refs, lvalue ref, and values.

@@ -136,8 +124,10 @@ void register_contexts(pybind11::module& m) {
"bind_procs"_a=false,
"bind_threads"_a=false,
"Construct an allocation with arguments:\n"
" threads: The number of threads available locally for execution. Must be set to 1 at minimum. 1 by default.\n"
" gpu_id: The identifier of the GPU to use, None by default.\n")
" threads: The number of threads available locally for execution. Must be set to 1 at minimum. 1 by default.\n"
Contributor:

Might this be a symbolic constant or string? Such as 'avail_threads'?

Contributor Author:

Available as arbor.env.default_concurrency().
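For example, to request the same value explicitly (a sketch using the binding named above):

    import arbor

    ctx = arbor.context(threads=arbor.env.default_concurrency())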

arb::proc_allocation allocation() const {
return arb::proc_allocation(num_threads, gpu_id.value_or(-1), bind_procs, bind_threads);
void proc_allocation_shim::set_num_threads(unsigned threads) {
if (0==threads) {
Contributor:

Don't we throw this from C++?

@brenthuisman (Contributor, Author) commented May 4, 2023:

arb::proc_allocation does not do any checking. Nor does anything else, for that matter. Only the pyobject ever did.

set_property(SOURCE config.cpp PROPERTY COMPILE_DEFINITIONS ARB_DATA="${CMAKE_INSTALL_DATAROOTDIR}" APPEND)
set_property(SOURCE config.cpp PROPERTY COMPILE_DEFINITIONS ARB_CXX_COMPILER="${CMAKE_CXX_COMPILER}" APPEND)
set_property(SOURCE config.cpp PROPERTY COMPILE_DEFINITIONS ARB_PREFIX="${CMAKE_INSTALL_PREFIX}" APPEND)
set_property(SOURCE config.cpp PROPERTY COMPILE_DEFINITIONS ARB_BINARY="${CMAKE_INSTALL_BINDIR}" APPEND)
Contributor:

What changed here?

@brenthuisman (Contributor, Author) commented May 4, 2023:

I forgot to list it in the PR message (added now): arbor.config() now returns timestamp and python_lib_path. The prefix need not be where CMake installed the package (and for pip installs it isn't; it will typically be a venv).

@@ -25,7 +25,7 @@ struct ARB_ARBOR_API execution_context {
     task_system_handle thread_pool;
     gpu_context_handle gpu;

-    execution_context(const proc_allocation& resources = proc_allocation{});
+    execution_context(const proc_allocation& resources = proc_allocation{1,-1});
Contributor:

I'd like a symbolic constant here, i.e. adding

constexpr static int gpu_nil_id = -1;

to context.hpp

@thorstenhater (Contributor) left a review:

One weirdness ... Other than that good.

# (12) Create a simulation using the default settings:
# - Use all threads available
# - Use round-robin distribution of cells across groups with one cell per group
# - Use GPU if present
Contributor:

Do we? Boy we are smart ;)

@brenthuisman brenthuisman merged commit 38453f3 into arbor-sim:master May 10, 2023
@brenthuisman brenthuisman deleted the python/concurrency_defaults branch May 10, 2023 07:39