-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use an agent-id rather than the process PID #24968
Conversation
@rkooo567 this replaces the other PR, which I closed. |
@@ -55,27 +55,34 @@ void AgentManager::StartAgent() { | |||
RAY_LOG(INFO) << "Not starting agent, the agent command is empty."; | |||
return; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the stanza below to after adding the new arguments so it prints out everything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a lot better! I think we should do 2 additional things.
(1) Remove agent_pid from the gRPC request of RegisterAgentRequest
since it is not pid anymore
(2)
ray/src/ray/raylet/agent_manager.cc
Line 34 in 0212f5e
agent_pid_ = request.agent_pid(); |
agent_id_ = request.agent_id();
?
src/ray/raylet/agent_manager.cc
Outdated
<< " has not registered. ip " << agent_ip_address_ | ||
<< ", pid " << agent_pid_; | ||
[this, child, agent_id]() mutable { | ||
if (agent_pid_ != agent_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of agent_pid_, can we rename it to agent_id_? Since it is not pid anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and improved the error message
dashboard/agent.py
Outdated
@@ -54,6 +54,7 @@ def __init__( | |||
raylet_name=None, | |||
logging_params=None, | |||
disable_metrics_collection: bool = False, | |||
agent_id: int = os.getpid(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need the default id as pid in this case. Can you remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to use a required keyword-only *
signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I might be missing something here, but it still seems to receive os.getpid() as default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, check now :)
src/ray/raylet/agent_manager.cc
Outdated
|
||
// Launch the process to create the agent. | ||
std::error_code ec; | ||
// Create a random agent_id to pass to the child process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Create a random agent_id to pass to the child process | |
// Create a random agent_id to pass to the child process. We cannot use pid an id because os.getpid() from the python process is not reliable in Windows. See [Github issue link] |
I refactored all the naming and also tried to improve the error reporting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!!
dashboard/agent.py
Outdated
@@ -54,6 +54,7 @@ def __init__( | |||
raylet_name=None, | |||
logging_params=None, | |||
disable_metrics_collection: bool = False, | |||
agent_id: int = os.getpid(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I might be missing something here, but it still seems to receive os.getpid() as default?
Failed tests seem unrelated. @mattip can you verify |
Looks like StartupToken for worker process :) |
Can you link to the |
Oh I re-ran the test. It is okay to merge I think |
Thanks for addressing the comment :). Pretty happy for the final version of this fix!! |
This reverts commit 02f220b.
This reverts commit 02f220b. <!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. --> ## Why are these changes needed? Looks like this commit makes `test_ray_shutdown` way more flaky. cc @mattip for further investigation after revert <img width="760" alt="Screen Shot 2022-05-31 at 11 14 48 PM" src="https://user-images.githubusercontent.com/18510752/171339737-f48e6e90-391a-4235-bfac-a0aa0e563eb7.png"> ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :(
…ect#24968)" (ray-project#25342)" This reverts commit 3385d19.
Redo the agent-id changes from #24968. The original PR is in the first commit, the second commit fixes a fatal flaw when using RAY_BACKEND_LOG_LEVEL=debug, which caused the "Ray C++, Java" tests to fail on macOS.
Redo the agent-id changes from ray-project#24968. The original PR is in the first commit, the second commit fixes a fatal flaw when using RAY_BACKEND_LOG_LEVEL=debug, which caused the "Ray C++, Java" tests to fail on macOS. Signed-off-by: Rohan138 <[email protected]>
Redo the agent-id changes from ray-project#24968. The original PR is in the first commit, the second commit fixes a fatal flaw when using RAY_BACKEND_LOG_LEVEL=debug, which caused the "Ray C++, Java" tests to fail on macOS. Signed-off-by: Stefan van der Kleij <[email protected]>
Why are these changes needed?
When using ray inside a virtualenv on windows,
python.exe
as reported bysys.executable
is a PEP397 launcher to the actual python as reported byos.getpid()
:When the
agent_manager
launches the agent process viaProcess::Process()
, it gets the PID of the launcher process (6424), which is what is expected as an ID when registering the agent in the gRPC callback. But insideagent.py
, the child process reports the PID viaos.getpid()
, which is 2208, and this is the wrong PID to register the agent.The solution proposed here is another version of #24905 that creates a
int agent_id = rand();
before starting the python process, and passes theagent_id
to the process.Related issue number
Closes #24361, #23945
Checks
scripts/format.sh
to lint the changes in this PR.