Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use an agent-id rather than the process PID #24968

Merged
merged 3 commits into from
May 27, 2022

Conversation

mattip
Copy link
Contributor

@mattip mattip commented May 19, 2022

Why are these changes needed?

When using ray inside a virtualenv on windows, python.exe as reported by sys.executable is a PEP397 launcher to the actual python as reported by os.getpid():

>>> import sys, os, psutil
>>> >>> print(sys.executable)
C:\temp\issue24361\Scripts\python.exe
>>> os.getpid()
2208
>>> child = psutil.Process(2208)
>>> child.cmdline()
['C:\\oss\\CPython38\\python.exe']
>>> child.parent().cmdline()
['C:\\temp\\issue24361\\Scripts\\python.exe']
>>> child.parent().pid
6424

When the agent_manager launches the agent process via Process::Process(), it gets the PID of the launcher process (6424), which is what is expected as an ID when registering the agent in the gRPC callback. But inside agent.py, the child process reports the PID via os.getpid(), which is 2208, and this is the wrong PID to register the agent.

The solution proposed here is another version of #24905 that creates a int agent_id = rand(); before starting the python process, and passes the agent_id to the process.

Related issue number

Closes #24361, #23945

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@mattip
Copy link
Contributor Author

mattip commented May 19, 2022

@rkooo567 this replaces the other PR, which I closed.

@@ -55,27 +55,34 @@ void AgentManager::StartAgent() {
RAY_LOG(INFO) << "Not starting agent, the agent command is empty.";
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the stanza below to after adding the new arguments so it prints out everything

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a lot better! I think we should do 2 additional things.

(1) Remove agent_pid from the gRPC request of RegisterAgentRequest since it is not pid anymore
(2)

agent_pid_ = request.agent_pid();
At this code, can you change the code to agent_id_ = request.agent_id();?

<< " has not registered. ip " << agent_ip_address_
<< ", pid " << agent_pid_;
[this, child, agent_id]() mutable {
if (agent_pid_ != agent_id) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of agent_pid_, can we rename it to agent_id_? Since it is not pid anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and improved the error message

@@ -54,6 +54,7 @@ def __init__(
raylet_name=None,
logging_params=None,
disable_metrics_collection: bool = False,
agent_id: int = os.getpid(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need the default id as pid in this case. Can you remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to use a required keyword-only * signature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I might be missing something here, but it still seems to receive os.getpid() as default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, check now :)


// Launch the process to create the agent.
std::error_code ec;
// Create a random agent_id to pass to the child process
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Create a random agent_id to pass to the child process
// Create a random agent_id to pass to the child process. We cannot use pid an id because os.getpid() from the python process is not reliable in Windows. See [Github issue link]

@mattip
Copy link
Contributor Author

mattip commented May 20, 2022

I refactored all the naming and also tried to improve the error reporting.

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!!

@@ -54,6 +54,7 @@ def __init__(
raylet_name=None,
logging_params=None,
disable_metrics_collection: bool = False,
agent_id: int = os.getpid(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I might be missing something here, but it still seems to receive os.getpid() as default?

@rkooo567
Copy link
Contributor

Failed tests seem unrelated. @mattip can you verify test_args failure from Windows is unrelated to your PR?

@jjyao
Copy link
Collaborator

jjyao commented May 20, 2022

Looks like StartupToken for worker process :)

@mattip
Copy link
Contributor Author

mattip commented May 22, 2022

Can you link to the test_args failure? It looks to me like windows is passing...

@rkooo567
Copy link
Contributor

Oh I re-ran the test. It is okay to merge I think

@rkooo567 rkooo567 merged commit 02f220b into ray-project:master May 27, 2022
@rkooo567
Copy link
Contributor

Thanks for addressing the comment :). Pretty happy for the final version of this fix!!

rkooo567 added a commit to rkooo567/ray that referenced this pull request Jun 1, 2022
rkooo567 added a commit that referenced this pull request Jun 1, 2022
This reverts commit 02f220b.

<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. -->

## Why are these changes needed?

Looks like this commit makes `test_ray_shutdown` way more flaky.  cc @mattip for further investigation after revert
<img width="760" alt="Screen Shot 2022-05-31 at 11 14 48 PM" src="https://user-images.githubusercontent.com/18510752/171339737-f48e6e90-391a-4235-bfac-a0aa0e563eb7.png">


## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
rkooo567 added a commit to rkooo567/ray that referenced this pull request Jun 1, 2022
simon-mo added a commit that referenced this pull request Jun 10, 2022
simon-mo added a commit that referenced this pull request Jun 13, 2022
pcmoritz pushed a commit that referenced this pull request Jul 20, 2022
Redo the agent-id changes from #24968. The original PR is in the first commit, the second commit fixes a fatal flaw when using RAY_BACKEND_LOG_LEVEL=debug, which caused the "Ray C++, Java" tests to fail on macOS.
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Redo the agent-id changes from ray-project#24968. The original PR is in the first commit, the second commit fixes a fatal flaw when using RAY_BACKEND_LOG_LEVEL=debug, which caused the "Ray C++, Java" tests to fail on macOS.

Signed-off-by: Rohan138 <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Redo the agent-id changes from ray-project#24968. The original PR is in the first commit, the second commit fixes a fatal flaw when using RAY_BACKEND_LOG_LEVEL=debug, which caused the "Ray C++, Java" tests to fail on macOS.

Signed-off-by: Stefan van der Kleij <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants