[Core][run_function_on_all_workers] deflake run_function_on_all_workers and reenable test #31838
Conversation
# Start the import thread
# Setup import thread, but defer the start up of
# import thread until job_config is initialized.
# (python/ray/_raylet.pyx maybe_initialize_job_config)
if mode not in (RESTORE_WORKER_MODE, SPILL_WORKER_MODE):
    worker.import_thread = import_thread.ImportThread(
Are there any cases where other parts of Ray will call things like `import_thread.join()` during this window where the import thread isn't started?

I also notice that `self.threads_stopped.is_set()` is checked after `self._do_importing()` is called, so in the case where `threads_stopped` is set before the import thread is started, the import thread will call `_do_importing` once before shutting down. That could cause a crash; I'm not sure how this is handled or even whether it's likely.
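A minimal sketch of the structure under discussion (simplified, with approximate names; the real class lives in python/ray/_private/import_thread.py): the run loop performs an import pass before it checks the stop event, and joining is only safe because it first checks whether the thread was ever started.

```python
import threading


class ImportThread:
    """Simplified sketch of the import thread discussed above; names approximate."""

    def __init__(self, threads_stopped: threading.Event):
        self.threads_stopped = threads_stopped
        self.t = None  # created lazily so the start-up can be deferred

    def start(self):
        self.t = threading.Thread(target=self._run, name="ray_import_thread", daemon=True)
        self.t.start()

    def _run(self):
        while True:
            self._do_importing()            # runs once even if stop was requested first
            if self.threads_stopped.is_set():
                return                      # the stop event is checked only *after* importing
            self.threads_stopped.wait(0.1)  # stand-in for waiting on new exports

    def _do_importing(self):
        pass  # fetch and exec exported functions/actors (elided)

    def join_import_thread(self):
        if self.t:                          # harmless if the thread was never started
            self.t.join()
```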
> Are there any cases where other parts of Ray will call things like `import_thread.join()` during this window where the import thread isn't started?

this should be fine since `import_thread.join()` will check if `import_thread` has already started.
> I also notice that `self.threads_stopped.is_set()` is checked after `self._do_importing()` is called, so in the case where `threads_stopped` is set before the import thread is started, the import thread will call `_do_importing` once before shutting down. That could cause a crash; I'm not sure how this is handled or even whether it's likely.

Good catch.
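Continuing the sketch above, one possible adjustment (hypothetical; not necessarily what the PR ended up doing) is to check the stop event before each import pass, so a thread that is stopped before it is ever started never runs `_do_importing`:

```python
    def _run(self):
        # Checking the stop event first avoids the spurious _do_importing()
        # pass when threads_stopped is set before the thread is started.
        while not self.threads_stopped.is_set():
            self._do_importing()
            self.threads_stopped.wait(0.1)  # stand-in for waiting on new exports
```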
> Are there any cases where other parts of Ray will call things like `import_thread.join()` during this window where the import thread isn't started?
>
> this should be fine since `import_thread.join()` will check if `import_thread` has already started.

Yep. I was thinking more of ordering issues: any thread that calls `import_thread.join` expects to continue only after the import thread has finished, but that assumption may no longer hold. I'm not sure how impactful this is; hence the question.
Hmm, isn't there a possibility of a regression? E.g., previously the import thread started as soon as the worker started and imported the task/actor definitions (iirc). Now it is deferred until the first task is sent. That means we will have a longer delay on the first execution (because the import will only happen after the task/actor is submitted for the first time).
@rkooo567 good point. I suspect it will probably be fine since we do import once on thread creation: https://github.com/ray-project/ray/pull/31838/files#diff-ae2296559ecbe71776d337ff68fff30419b7acb67177065bcc6e49ad631e8e70L47

So one thing we can do is, immediately after the import_thread is created, opportunistically start it.
…#31846) Why are these changes needed? Previously the import thread started as soon as the worker started and imported the task/actor definitions. After #31838, it is deferred until the first task is sent, which means a longer delay on the first execution. To address the problem, we can opportunistically start the import thread right after it is created, if the job_id already exists.
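A rough sketch of the opportunistic start described in #31846 (the helper name and exact condition here are assumptions, not the actual diff): start the thread right after creating it when the worker already knows its job ID, and otherwise leave the start to job-config initialization.

```python
def maybe_start_import_thread(worker):
    # Hypothetical helper; in the real code this logic lives where the import
    # thread is created and where the job config is applied.
    if worker.import_thread is None:
        return
    if not worker.current_job_id.is_nil():
        # The job ID is already known (e.g. for a driver), so there is no
        # reason to defer: start importing exported definitions right away.
        worker.import_thread.start()
    # Otherwise maybe_initialize_job_config() in python/ray/_raylet.pyx
    # starts the thread once the job config arrives.
```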
@rkooo567 @scv119 Is it possible that this could cause a regression with imports happening concurrently with argument deserialization or even task execution? It looks like a Datasets test became flaky at around the time of this PR, failing with an import error.

For context, importing Pandas is not reentrant, and if the import thread imports Pandas at the same time as argument deserialization or task execution imports Pandas on the execution thread, one of the threads will crash with this error.
Why are these changes needed?

run_function_on_all_workers importing requires job_id to run properly. After #30883, the worker might not have a job_id at startup, which led to run_function_on_all_workers failing to be executed on start-up. To fix this, we defer the import_thread start-up until job_config is initialized.
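As a self-contained illustration of the deferred-start pattern (toy code, not Ray's actual classes or signatures): the import thread is created at worker start-up but only started once the job config, and therefore the job_id, is available.

```python
import threading


class ToyWorker:
    """Toy illustration of deferring the import thread; not Ray's worker class."""

    def __init__(self):
        # Created at start-up but deliberately not started: importing needs the
        # job_id, which is only known once the job config arrives.
        self.threads_stopped = threading.Event()
        self.import_thread = threading.Thread(target=self._do_importing_loop, daemon=True)
        self._import_thread_started = False

    def _do_importing_loop(self):
        while not self.threads_stopped.is_set():
            # Stand-in for _do_importing(): fetch and exec exported functions.
            self.threads_stopped.wait(0.1)

    def maybe_initialize_job_config(self):
        # Called once the job config (and therefore the job_id) is available.
        if not self._import_thread_started:
            self._import_thread_started = True
            self.import_thread.start()


worker = ToyWorker()                  # worker starts; import thread exists but is idle
worker.maybe_initialize_job_config()  # job config arrives; importing begins
worker.threads_stopped.set()          # request shutdown
worker.import_thread.join()
```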
Related issue number
Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.