Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RADICAL-Pilot Integration with Parsl #2923

Merged
merged 111 commits into from
Nov 14, 2023
Merged

RADICAL-Pilot Integration with Parsl #2923

merged 111 commits into from
Nov 14, 2023

Conversation

AymenFJA
Copy link
Contributor

Description

Summary: This PR integrates RADICAL-Pilot workload management and runtime system with Parsl as an executor.

The executor offers the following capabilities:

  1. MPI execution of heterogeneous tasks:

    • Single/Multi-node MPI functions that require a private MPI communicator on the runtime.
    • Single/Multi-node (non)/MPI executables.
  2. Heterogeneous task execution of all the above on GPUs and CPUs.

  3. Improve the submission mechanism by offering bulk_mode submission alongside stream submission of tasks from Parsl
    to the RADICAL-Pilot runtime system.

Fixes: None

Type of change: Additional features.

Can you please guide me on where to put examples that show how the executor can be used?

Thanks all.

@AymenFJA
Copy link
Contributor Author

AymenFJA commented Nov 11, 2023

I had a brief mess around with the timeout test on my laptop, including changing some of the timings in that test for larger values, and I couldn't see anything immediately wrong on the parsl side.

I don't know enough about the radical pilot logs to see if the executable for that test really takes more than 0.2 seconds to run.

I will dig deeper into the bash_app timeout. The reason the tests are not working (test_apptimeout) is that we do not execute the bash_apps and wrap them as Parsl does. We execute them as a black-box executable. I am investigating an approach like this:

  1. Mark the bash_apps as rp.PROC, and in that way, we can use Parsl bash_app execution approach to execute a single-line or multi-line bash_apps.
  2. I will introduce another decorator only for executables I.e. executing for example: python example.py or ./c_app. This will allow not to abandon RP's capabilities on executing multi-node MPI executables if the user does not want to use a python_app or bash_app

@benclifford
Copy link
Collaborator

From the way this PR sets a timeout on the task description based on the parsl app kwargs, I was imagining that that would cause radical pilot to enforce a timeout though... what's happening there?

@AymenFJA
Copy link
Contributor Author

From the way this PR sets a timeout on the task description based on the parsl app kwargs, I was imagining that that would cause radical pilot to enforce a timeout though... what's happening there?

If I understand your question correctly. Yes, on the executables level, RP sets a timeout, and I quote, "Any timeout larger than 0 will result in the task process being killed after the specified number of seconds. The task will then end up in CANCELED state."

But it does not raise an exception, which is what Parsl tests expect for example, AppTimeout

@benclifford
Copy link
Collaborator

ah ok. So what state does a cancelled task end up back in the parsl side of things?

@AymenFJA
Copy link
Contributor Author

ah ok. So what state does a cancelled task end up back in the parsl side of things?

We set it as future.cancel() check this line:

https://github.com/AymenFJA/parsl/blob/65ff8f4ac2c1afeeaf24f74dc40a2ac8df8e56db/parsl/executors/radical/executor.py#L184

We can move forward with this PR and I can open another feature PR with timeout workaround, what do you think?

@benclifford
Copy link
Collaborator

yeah I think this is fine to merge now - I'm going to pull out the CI changes I made because that has potential to affect other things so I would like it to be a PR; once that is merged, I'll merge this #2923.

benclifford added a commit that referenced this pull request Nov 13, 2023
This is motivated by upcoming PR #2923 which adds a RADICAL-Pilot
executor: that executor is able to re-use a submit side virtualenv,
but not able to re-use `--user` level installs.

This should not change any other behaviour of the CI: packages will
be installed at different paths, but anything that was reliant on
those paths being a specific values is suspicious.
BASH = 'bash'
PYTHON = 'python'

os.environ["RADICAL_REPORT"] = "False"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit uncomfortable with screwing with the environment just because someone did "import parsl" and then didn't use radical pilot at all: this is something that affects non-radical parsl users.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should have been addressed by us. A radical.utils PR now changes the default behavior.


return True

@property
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the start of development of this executor, there have been some changes to the scaling API: I think it's safe to remove all three of scaling_enabled, scale_in and scale_out: scaling_enabled was removed in PR #2545 because scaling behaviour now comes from subclassing BlockProviderExecutor and scale_in, scale_out are now only needed for BlockProviderExecutor subclasses - this work was intended to remove the need for all these stubs on executors which will manage their own workers (such as the radical pilot executor)

@benclifford
Copy link
Collaborator

The most recent change I made to how radical is installed (which only installs radical.pilot quite late in the CI, right before radical tests) suggests that there's a problem for non-radical users if radical is not installed - I think this will break things for all non-RADICAL parsl users.

This was masked by installing radical.pilot right at the start of the test run (and that's part of the argument for installing things later rather than earlier)

pytest parsl/tests/ -k "not cleannet" --config parsl/tests/configs/local_threads.py --random-order --durations 10
============================= test session starts ==============================
platform linux -- Python 3.9.18, pytest-7.4.3, pluggy-1.3.0
Using --random-order-bucket=module
Using --random-order-seed=818270

rootdir: /home/runner/work/parsl/parsl/parsl/tests
configfile: pytest.ini
plugins: typeguard-2.13.3, cov-4.1.0, random-order-1.1.0
collected 239 items / 1 error / 1 skipped

==================================== ERRORS ====================================
_______________ ERROR collecting test_radical/test_mpi_funcs.py ________________
ImportError while importing test module '/home/runner/work/parsl/parsl/parsl/tests/test_radical/test_mpi_funcs.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
parsl/tests/test_radical/test_mpi_funcs.py:4: in <module>
    from parsl.tests.configs.local_radical_mpi import fresh_config as local_config
parsl/tests/configs/local_radical_mpi.py:4: in <module>
    from parsl.executors.radical import RadicalPilotExecutor
parsl/executors/radical/__init__.py:1: in <module>
    from parsl.executors.radical.executor import RadicalPilotExecutor
parsl/executors/radical/executor.py:14: in <module>
    import radical.pilot as rp
E   ModuleNotFoundError: No module named 'radical'
=========================== short test summary info ============================
SKIPPED [1] parsl/tests/configs/ec2_single_node.py:30: 'public_ip' not configured in user_opts.py
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
========================= 1 skipped, 1 error in 0.50s ==========================
make: *** [Makefile:52: local_thread_test] Error 1

benclifford added a commit that referenced this pull request Nov 13, 2023
This is motivated by upcoming PR #2923 which adds a RADICAL-Pilot
executor: that executor is able to re-use a submit side virtualenv,
but not able to re-use `--user` level installs.

This should not change any other behaviour of the CI: packages will
be installed at different paths, but anything that was reliant on
those paths being a specific values is suspicious.
@AymenFJA
Copy link
Contributor Author

The most recent change I made to how radical is installed (which only installs radical.pilot quite late in the CI, right before radical tests) suggests that there's a problem for non-radical users if radical is not installed - I think this will break things for all non-RADICAL parsl users.

This was masked by installing radical.pilot right at the start of the test run (and that's part of the argument for installing things later rather than earlier)

pytest parsl/tests/ -k "not cleannet" --config parsl/tests/configs/local_threads.py --random-order --durations 10
============================= test session starts ==============================
platform linux -- Python 3.9.18, pytest-7.4.3, pluggy-1.3.0
Using --random-order-bucket=module
Using --random-order-seed=818270

rootdir: /home/runner/work/parsl/parsl/parsl/tests
configfile: pytest.ini
plugins: typeguard-2.13.3, cov-4.1.0, random-order-1.1.0
collected 239 items / 1 error / 1 skipped

==================================== ERRORS ====================================
_______________ ERROR collecting test_radical/test_mpi_funcs.py ________________
ImportError while importing test module '/home/runner/work/parsl/parsl/parsl/tests/test_radical/test_mpi_funcs.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
parsl/tests/test_radical/test_mpi_funcs.py:4: in <module>
    from parsl.tests.configs.local_radical_mpi import fresh_config as local_config
parsl/tests/configs/local_radical_mpi.py:4: in <module>
    from parsl.executors.radical import RadicalPilotExecutor
parsl/executors/radical/__init__.py:1: in <module>
    from parsl.executors.radical.executor import RadicalPilotExecutor
parsl/executors/radical/executor.py:14: in <module>
    import radical.pilot as rp
E   ModuleNotFoundError: No module named 'radical'
=========================== short test summary info ============================
SKIPPED [1] parsl/tests/configs/ec2_single_node.py:30: 'public_ip' not configured in user_opts.py
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
========================= 1 skipped, 1 error in 0.50s ==========================
make: *** [Makefile:52: local_thread_test] Error 1

I have the same behavior on my machine now. How should we approach this? I honestly have no clue how to fix this. I assumed that isolating RP from the Parsl import (explicit import) would prevent this issue, but apparently not on the test level, at least for tests that are only for the RADICAL executor.

@benclifford
Copy link
Collaborator

@AymenFJA I'll have a look at what's happening with imports.

@AymenFJA
Copy link
Contributor Author

@AymenFJA I'll have a look at what's happening with imports.

Sorry for the noise. I tried this, it does allows other executors to pass but it actually ignores the test:

@pytest.mark.local
def test_radical_mpi(n=7):
    from parsl.tests.configs.local_radical_mpi import fresh_config as local_config
    # rank size should be > 1 for the
    # radical runtime system to run this function in MPI env
    for i in range(2, n):
        t = test_mpi_func(msg='mpi.func.%06d' % i, sleep=1, ranks=i, comm=None)
        apps.append(t)
    assert [len(app.result()) for app in apps] == list(range(2, n))

@benclifford
Copy link
Collaborator

@AymenFJA this is passing in CI again now, and the documentation is building and I think that parsl is again importable when radical.pilot is not installed.

The last few commits I added to this branch are a very awkward mechanism we've used elsewhere in parsl to let classes be importable enough to discover their docstrings (which is needed for documentation generation) - I really don't like how that works but I don't have a better solution at the moment.

If you're happy with those changes that I made, I'll merge this PR.

@AymenFJA
Copy link
Contributor Author

@benclifford this seems reasonable to me at least at the moment. Yes please feel free to merge if nothing major is blocking. Soon I will prepare another PR towards fixing the timeout for the bashapps test. Thanks for your efforts I appreciate it.

@benclifford benclifford merged commit 3bcbc5d into Parsl:master Nov 14, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants