[SPARK-20843][Core]Add a config to set driver terminate timeout #18126

Closed
wants to merge 1 commit into from

zsxwing
Member

@zsxwing zsxwing commented May 26, 2017

What changes were proposed in this pull request?

Add a worker configuration, spark.worker.driverTerminateTimeout (default 10s), to set how long to wait before forcibly killing the driver.

How was this patch tested?

Jenkins

@zsxwing
Member Author

zsxwing commented May 26, 2017

cc @vanzin @BryanCutler

@vanzin
Contributor

vanzin commented May 26, 2017

Looks ok.

I wonder if keeping the old behavior by default wouldn't be better, to avoid surprising users who upgrade and run into this.

@zsxwing
Member Author

zsxwing commented May 26, 2017

This is the behavior in 2.1.0. If we change the default value to Long.MaxValue, it would surprise users again :(.

I'm inclined to keep it as it is in 2.1.0.

@vanzin
Contributor

vanzin commented May 26, 2017

Ah, I thought the original change was in 2.2 only. Looks good then.

Member

@BryanCutler BryanCutler left a comment

LGTM, just one thought on the property name, but it is fine as is. In hindsight, 10s is pretty short for a driver timeout... thanks for putting up a quick fix.

@@ -57,7 +57,8 @@ private[deploy] class DriverRunner(
   @volatile private[worker] var finalException: Option[Exception] = None

   // Timeout to wait for when trying to terminate a driver.
-  private val DRIVER_TERMINATE_TIMEOUT_MS = 10 * 1000
+  private val DRIVER_TERMINATE_TIMEOUT_MS =
+    conf.getTimeAsMs("spark.worker.driverTerminateTimeout", "10s")
Member

Just wondering whether to add something to the property name to make it clear that this applies only to a driver in cluster deploy mode. Although it is prefixed with worker, so maybe that is good enough.

Member Author

spark.worker means this is only for Spark workers, so I think it should be obvious. Do you have a better config name?
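
As a rough illustration of how a time-valued setting with a "10s" default resolves, here is a stand-alone Scala sketch. It does not use Spark: `TimeoutLookup` and `lookupTimeMs` are hypothetical names, the settings are a plain `Map`, and parsing is delegated to `scala.concurrent.duration.Duration`, whose accepted formats are an assumption here and may differ from what `conf.getTimeAsMs` accepts.

```scala
import scala.concurrent.duration.Duration

// Hypothetical stand-in for a config lookup; this is NOT Spark's SparkConf.
object TimeoutLookup {
  // Returns the value stored under `key`, or `default` if the key is unset,
  // converted to milliseconds. Parsing uses Scala's Duration (assumption:
  // Spark's own time-string parser may accept a different set of suffixes).
  def lookupTimeMs(settings: Map[String, String], key: String, default: String): Long =
    Duration(settings.getOrElse(key, default)).toMillis
}

object TimeoutLookupDemo extends App {
  val key = "spark.worker.driverTerminateTimeout"

  // Key unset: the default applies, matching the previously hard-coded 10 seconds.
  println(TimeoutLookup.lookupTimeMs(Map.empty, key, "10s"))

  // Key set: the explicit value overrides the default.
  println(TimeoutLookup.lookupTimeMs(Map(key -> "60s"), key, "10s"))
}
```

With no entry for the key, the lookup falls back to the default and so keeps the earlier 10-second behavior; an explicit entry overrides it.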

@zsxwing
Member Author

zsxwing commented May 26, 2017

> 10s is pretty short for a driver timeout

This is usually not a problem. If a worker is trying to kill a driver, it often means the driver is unhealthy or is being killed intentionally by the user. 10 seconds is usually enough for shutdown hooks to clean up resources, such as deleting temp files. Given that relying on shutdown hooks to persist data is not common, and that we have a configuration for special cases, I think it's fine.
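
The wait-then-kill behavior discussed above can be sketched with the JDK's `Process` API alone. This is a minimal illustration, not the code in `DriverRunner`: `TerminateWithTimeout` and `terminate` are hypothetical names, and the assumption that `destroy()` sends a polite termination request (so that shutdown hooks get a chance to run) holds on typical Unix-like JVMs but is implementation dependent.

```scala
import java.util.concurrent.TimeUnit

// Hypothetical helper illustrating "ask politely, wait, then force".
object TerminateWithTimeout {
  // Requests termination, waits up to `timeoutMs` for the process to exit,
  // and kills it forcibly if it is still alive after that.
  // Returns the exit code if the process ended, or None if it is still
  // running even after the forcible kill and a second wait.
  def terminate(process: Process, timeoutMs: Long): Option[Int] = {
    // Polite request; on Unix-like systems this is typically SIGTERM,
    // which gives the target's shutdown hooks a chance to run (assumption).
    process.destroy()
    if (process.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
      Some(process.exitValue())
    } else {
      // Timeout expired: kill without waiting for any further cleanup.
      process.destroyForcibly()
      if (process.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) Some(process.exitValue())
      else None
    }
  }
}
```

A process that does its cleanup within the timeout exits on the first wait; one whose shutdown hooks need longer is cut off, which is the case a larger timeout value is meant to cover.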

@BryanCutler
Member

BryanCutler commented May 26, 2017 via email

@SparkQA

SparkQA commented May 27, 2017

Test build #77438 has finished for PR 18126 at commit ca2c9c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing
Member Author

zsxwing commented May 27, 2017

Thanks! Merging to master, 2.2 and 2.1.

asfgit pushed a commit that referenced this pull request May 27, 2017
## What changes were proposed in this pull request?

Add a `worker` configuration to set how long to wait before forcibly killing driver.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <[email protected]>

Closes #18126 from zsxwing/SPARK-20843.

(cherry picked from commit 6c1dbd6)
Signed-off-by: Shixiong Zhu <[email protected]>
asfgit pushed a commit that referenced this pull request May 27, 2017
@asfgit asfgit closed this in 6c1dbd6 May 27, 2017
@zsxwing zsxwing deleted the SPARK-20843 branch May 27, 2017 05:34