[SPARK-20540][CORE] Fix unstable executor requests. #17813
Conversation
@vanzin, can you take a look at this? It is a dynamic allocation bug.
Test build #76332 has finished for PR 17813 at commit
|numExistingExecutors = $numExistingExecutors
|numPendingExecutors = $numPendingExecutors
|executorsPendingToRemove = ${executorsPendingToRemove.size}
""".stripMargin)
nit: indentation
Thanks, I'll fix it.
doRequestTotalExecutors(
  numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size)
requestedTotalExecutors = math.max(requestedTotalExecutors - executorsToKill.size, 0)
if (requestedTotalExecutors !=
Won't this cause the message to be logged in the situation you describe in the PR description? Isn't that an "expected" situation? If so I'd demote this message, since users tend to get scared when messages like this one show up.
Yes, it would. I can change it to debug or remove it. This was mainly for us to see how often it happened. With the fix to the request timing, this doesn't tend to happen at all. You only see the behavior constantly if the method is called every 100ms, because there isn't enough time for kills and requests to complete before the total is recomputed.
Hey @rdblue, I have seen the message while testing dynamic allocation on Mesos:
17/10/25 13:58:44 INFO MesosCoarseGrainedSchedulerBackend: Actual list of executor(s) to be killed is 1
17/10/25 13:58:44 DEBUG MesosCoarseGrainedSchedulerBackend: killExecutors(ArrayBuffer(1), false, false): Executor counts do not match:
requestedTotalExecutors = 0
numExistingExecutors = 2
numPendingExecutors = 0
executorsPendingToRemove = 1
The executors are removed at some point after that message.
Test is here.
What should I expect here? I am a bit confused.
This is just informational. The problem is that the state of the allocation manager isn't synced with the scheduler. Instead, the allocator sends messages to try to control the scheduler backend to get the same state. For example, instead of telling the scheduler backend that the desired number of executors is 10, the allocator sends a message to add 2 executors. When this gets out of sync because of failures or network delay, you end up with these messages.
When you see these, make sure you're just out of sync (and will eventually get back in sync), and not in a state where the scheduler and allocator can't reconcile the required number of executors. That's what this PR tried to fix.
The long-term solution is to update the communication so that the allocator requests its ideal state, always telling the scheduler backend how many executors it currently needs, instead of killing or requesting more.
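To illustrate the difference, here is a minimal sketch in Scala; the trait and method names below are hypothetical, not Spark's actual API.
// Delta-style: the allocator asks for "N more" or "kill these"; if a message is
// lost or delayed, the allocator and the backend can drift apart.
trait DeltaStyleBackend {
  def requestExecutors(numAdditional: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
}
// Desired-state style: the allocator always sends the total it currently wants,
// so the next message corrects any earlier loss or delay.
trait DesiredStateBackend {
  def requestTotalExecutors(total: Int): Boolean
}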
There are two problems fixed in this commit. First, the ExecutorAllocationManager sets a timeout to avoid requesting executors too often. However, the next request deadline is always advanced from its previous value by the timeout interval, not reset from the current time. If the call is delayed by locking for longer than that interval, the manager will request more executors on every run.
The second problem is that the total number of requested executors is not tracked by the CoarseGrainedSchedulerBackend. Instead, it calculates the value from the current state of three variables: the number of known executors, the number of executors that have been killed, and the number of pending executors. But the number of pending executors is never less than 0, even though there may be more executors known than requested. When executors are killed and not replaced, this can cause the request sent to YARN to ask for too many executors because the scheduler's state is slightly out of date.
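A minimal sketch of the timing bug described above; the names are illustrative, not the actual ExecutorAllocationManager fields.
object TimingSketch {
  var addTime = 0L // earliest time a new executor request is allowed

  // Buggy: the deadline advances from its previous value, so once a call is
  // delayed past the interval, every later call is already past the deadline.
  def maybeRequestBuggy(now: Long, intervalMs: Long): Unit = {
    if (now >= addTime) {
      // ... request more executors ...
      addTime += intervalMs
    }
  }

  // Fixed: rebase the deadline on the current clock.
  def maybeRequestFixed(now: Long, intervalMs: Long): Unit = {
    if (now >= addTime) {
      // ... request more executors ...
      addTime = now + intervalMs
    }
  }
}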
Test build #76356 has finished for PR 17813 at commit
@vanzin, I fixed your review comments and tests are passing.
LGTM. Merging to master / 2.2 / 2.1.
There are two problems fixed in this commit. First, the ExecutorAllocationManager sets a timeout to avoid requesting executors too often. However, the next request deadline is always advanced from its previous value by the timeout interval, not reset from the current time. If the call is delayed by locking for longer than that interval, the manager will request more executors on every run. This seems to be the main cause of SPARK-20540.
The second problem is that the total number of requested executors is not tracked by the CoarseGrainedSchedulerBackend. Instead, it calculates the value from the current state of three variables: the number of known executors, the number of executors that have been killed, and the number of pending executors. But the number of pending executors is never less than 0, even though there may be more executors known than requested. When executors are killed and not replaced, this can cause the request sent to YARN to ask for too many executors because the scheduler's state is slightly out of date. This is fixed by tracking the currently requested size explicitly.
How was this patch tested?
Existing tests.
Author: Ryan Blue <[email protected]>
Closes #17813 from rdblue/SPARK-20540-fix-dynamic-allocation.
(cherry picked from commit 2b2dd08)
Signed-off-by: Marcelo Vanzin <[email protected]>
Thanks!
There are two problems fixed in this commit. First, the ExecutorAllocationManager sets a timeout to avoid requesting executors too often. However, the next request deadline is always advanced from its previous value by the timeout interval, not reset from the current time. If the call is delayed by locking for longer than that interval, the manager will request more executors on every run. This seems to be the main cause of SPARK-20540.
The second problem is that the total number of requested executors is not tracked by the CoarseGrainedSchedulerBackend. Instead, it calculates the value from the current state of three variables: the number of known executors, the number of executors that have been killed, and the number of pending executors. But the number of pending executors is never less than 0, even though there may be more executors known than requested. When executors are killed and not replaced, this can cause the request sent to YARN to ask for too many executors because the scheduler's state is slightly out of date. This is fixed by tracking the currently requested size explicitly.
How was this patch tested?
Existing tests.
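For illustration, here is a minimal sketch of tracking the requested total explicitly; the class and method names are hypothetical stand-ins, not the actual CoarseGrainedSchedulerBackend code.
class RequestTrackingSketch {
  // Single source of truth for how many executors have been requested.
  private var requestedTotalExecutors = 0

  // Stand-in for the RPC that tells the cluster manager the desired total.
  private def doRequestTotalExecutors(total: Int): Boolean = {
    println(s"requesting $total executors")
    true
  }

  def requestTotalExecutors(total: Int): Boolean = synchronized {
    requestedTotalExecutors = total
    doRequestTotalExecutors(requestedTotalExecutors)
  }

  def killExecutors(executorsToKill: Seq[String], replace: Boolean): Boolean = synchronized {
    if (!replace) {
      // Shrink the tracked total instead of re-deriving it from the
      // known / pending / pending-to-remove counters.
      requestedTotalExecutors = math.max(requestedTotalExecutors - executorsToKill.size, 0)
    }
    doRequestTotalExecutors(requestedTotalExecutors)
  }
}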