Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19688] [STREAMING] Not to read spark.yarn.credentials.file from checkpoint. #18230

Closed
wants to merge 2 commits into from

Conversation

saturday-shi
Copy link

What changes were proposed in this pull request?

Reload the spark.yarn.credentials.file property when restarting a streaming application from checkpoint.

How was this patch tested?

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

Notice

This should be merged into maintenance branches too.

jira: SPARK-21008

@vanzin
Copy link
Contributor

vanzin commented Jun 8, 2017

ok to test

@SparkQA
Copy link

SparkQA commented Jun 8, 2017

Test build #77804 has finished for PR 18230 at commit 13d40d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jerryshao
Copy link
Contributor

@saturday-shi would you please update the title to track SPARK-19688?

@jerryshao
Copy link
Contributor

I guess "spark.yarn.credentials.renewalTime" and "spark.yarn.credentials.updateTime" should also be excluded.

@saturday-shi saturday-shi changed the title [SPARK-21008] [STREAMING] Not to read spark.yarn.credentials.file from checkpoint. [SPARK-19688] [STREAMING] Not to read spark.yarn.credentials.file from checkpoint. Jun 8, 2017
@saturday-shi
Copy link
Author

I guess "spark.yarn.credentials.renewalTime" and "spark.yarn.credentials.updateTime" should also be excluded.

Thank you for pointing out that. I'll check & fix them.

@saturday-shi
Copy link
Author

@jerryshao I've taken a look at spark.yarn.credentials.renewalTime and spark.yarn.credentials.updateTime, but I don't think there is a necessity for excluding them. Changing on these properties means that either delegation token's configuration or Spark itself has been changed. In both case, restarting from checkpoint will not work.

Could you describe some elaborate scenes about the changing of spark.yarn.credentials.renewalTime and spark.yarn.credentials.updateTime? Or just mean to make it more robust?

@jerryshao
Copy link
Contributor

From my understanding this two configurations will also point to expired timestamp, since each started application will reset it, so no need to checkpoint them.

I think all the Spark on YARN internal configurations used for internal state tracking and passing needn't to checkpoint, they will always be updated in yarn client's launching.

@saturday-shi
Copy link
Author

@jerryshao Sorry for the delay.
Currently, Spark checkpoint ALL the configurations no matter it is "internal" or not. So we have to reload ones that should be updated at launching. Probably not to checkpoint the "internal" ones may be a good idea, but it is much like a new feature than a bug fix.

This PR attempt to fix the bug at least changing of codes, so it'll be easily merged into any maintenance branches. I don't care adding more options into the exclude-list, but you'll have to do an extra work to cherry-pick a subset for branches before 2.1, as spark.yarn.credentials.renewalTime and spark.yarn.credentials.updateTime don't exist in old branches.

@jerryshao
Copy link
Contributor

I don't agree with your point of view, there's already some potential issues regarding to internal configurations, either they will potentially lead to unexpected state, either they're meaningless to checkpoint, I will reserve my opinion if you insist on a point fix.

@saturday-shi
Copy link
Author

No, I don't mean to insist on my opinion. I'm just curious to know the reason for the changing (as it looks like another point fix).

@SparkQA
Copy link

SparkQA commented Jun 12, 2017

Test build #77910 has finished for PR 18230 at commit 2c50858.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Copy link
Contributor

vanzin commented Jun 16, 2017

The wording in this code always confuses me... I never know what "reload" means (propertiesToReload).

Anyway, I think I understand why this is broken. Because of this in ApplicationMaster:

      if (sparkConf.contains(CREDENTIALS_FILE_PATH.key)) {
        // If a principal and keytab have been set, use that to create new credentials for executors
        // periodically
        credentialRenewer =
          new ConfigurableCredentialManager(sparkConf, yarnConf).credentialRenewer()
        credentialRenewer.scheduleLoginFromKeytab()
      }

So if you start the second streaming application without providing principal / keytab, Client.scala will not overwrite the credential file path, but still the AM will start the credential updater, because the file location is in the configuration read from the checkpoint.

So the workaround is to make sure the restarted application also has the principal / keytab arguments.

As for the fix, it seems only adding the credential file does fix the problem, since the AM code only looks at it to decide whether to start the credential updater thread. But perhaps a better fix would be to fix the AM to look at the correct config (e.g. PRINCIPAL instead of CREDENTIALS_FILE_PATH) when starting the credential updater.

@jerryshao
Copy link
Contributor

@vanzin "reload" here meanings retrieving back SparkConf from checkpoint file and using this retrieved SparkConf to create SparkContext when restarting streaming application.

This is a streaming specific operation, the original purpose is to keep restarted streaming application be the same state (as well as configurations) as the previous stopped app. But in Spark we have some configurations which are only meaningful in the current application, so recovering these for restarted app is meaningless and will lead to some issues. And for this purpose Spark Streaming maintains a propertiesToReload list which means these configurations will be not be reloaded, instead they will be set during current application-time.

@saturday-shi
Copy link
Author

@jerryshao

"reload" here meanings retrieving back SparkConf from checkpoint file and using this retrieved SparkConf to create SparkContext when restarting streaming application.

That explanation is extremely wrong. But your opinion of what the propertiesToReload list does is right.

After restarting from checkpoint, properties in SparkConf will be the same as the previous application. But something like spark.yarn.app.id will be stale an useless in a restarted app. So after retrieving back the SparkConf from checkpoint, we want to "reload" the fresh values from system properties, instead of using old ones in the checkpoint.

@vanzin

So if you start the second streaming application without providing principal / keytab, Client.scala will not overwrite the credential file path, but still the AM will start the credential updater, because the file location is in the configuration read from the checkpoint.

That's probably right, but not the case. I do submit the principal & keytab at restarting and the AM do renew the token using the principal successfully.

I noticed that the SparkConf used by AMCredentialRenewer and CredentialUpdater seems NOT THE SAME. The credential renewer thread launched by the AM will work correctly, but the credential updater in executor backend - which uses configs provided by the diver - will confuse and fail in its job. So just fixing the AM code doesn't make much sense.

FYI, the log of AMCredentialRenewer looks like this:

17/06/07 15:11:14 INFO security.AMCredentialRenewer: Scheduling login from keytab in 96952 millis.
...
17/06/07 15:12:51 INFO security.AMCredentialRenewer: Attempting to login to KDC using principal: [email protected]
17/06/07 15:12:51 INFO security.AMCredentialRenewer: Successfully logged into KDC.
...
17/06/07 15:12:53 INFO security.AMCredentialRenewer: Writing out delegation tokens to hdfs://nameservice1/user/xxx/.sparkStaging/application_1496384469444_0036/credentials-044b83ea-b46b-4bd4-8e98-0e38928fd58c-1496816091985-1.tmp
17/06/07 15:12:53 INFO security.AMCredentialRenewer: Delegation Tokens written out successfully. Renaming file to hdfs://nameservice1/user/xxx/.sparkStaging/application_1496384469444_0036/credentials-044b83ea-b46b-4bd4-8e98-0e38928fd58c-1496816091985-1
17/06/07 15:12:53 INFO security.AMCredentialRenewer: Delegation token file rename complete.
17/06/07 15:12:53 INFO security.AMCredentialRenewer: Scheduling login from keytab in 110925 millis.
...

It renews the token successfully and saves it to application_1496384469444_0036's dir.
But the CredentialUpdater (started by YarnSparkHadoopUtil) complains about this:

17/06/07 15:11:24 INFO executor.CoarseGrainedExecutorBackend: Will periodically update credentials from: hdfs://nameservice1/user/xxx/.sparkStaging/application_1496384469444_0035/credentials-19a7c11e-8c93-478c-ab0a-cdbfae5b2ae5
...
17/06/07 15:12:24 WARN yarn.YarnSparkHadoopUtil: Error while attempting to list files from application staging dir
java.io.FileNotFoundException: File hdfs://nameservice1/user/xxx/.sparkStaging/application_1496384469444_0035 does not exist.
...

... which says that the credentials file doesn't exist in application_1496384469444_0035's dir.

@vanzin
Copy link
Contributor

vanzin commented Jun 19, 2017

Ok, I think I follow the code path now. The AM uses the conf from the user, while SparkContext (which provides the conf to the executors) loads it from the checkpoint and needs to overwrite the properties listed in the "reload" list.

Anyway, current patch looks good. Merging to master / 2.2 / 2.1 / 2.0 / 1.6 (although I wouldn't hold my breath for a new 1.6 release).

asfgit pushed a commit that referenced this pull request Jun 19, 2017
…om checkpoint.

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <[email protected]>

Closes #18230 from saturday-shi/SPARK-21008.

(cherry picked from commit e92ffe6)
Signed-off-by: Marcelo Vanzin <[email protected]>
asfgit pushed a commit that referenced this pull request Jun 19, 2017
…om checkpoint.

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <[email protected]>

Closes #18230 from saturday-shi/SPARK-21008.

(cherry picked from commit e92ffe6)
Signed-off-by: Marcelo Vanzin <[email protected]>
asfgit pushed a commit that referenced this pull request Jun 19, 2017
…om checkpoint.

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <[email protected]>

Closes #18230 from saturday-shi/SPARK-21008.

(cherry picked from commit e92ffe6)
Signed-off-by: Marcelo Vanzin <[email protected]>
asfgit pushed a commit that referenced this pull request Jun 19, 2017
…om checkpoint.

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <[email protected]>

Closes #18230 from saturday-shi/SPARK-21008.

(cherry picked from commit e92ffe6)
Signed-off-by: Marcelo Vanzin <[email protected]>
@vanzin
Copy link
Contributor

vanzin commented Jun 19, 2017

@saturday-shi I don't know your jira handle, let me know it if you want the bug to be assigned to you.

@asfgit asfgit closed this in e92ffe6 Jun 19, 2017
@saturday-shi
Copy link
Author

@vanzin Xing Shi (saturday_s), thanks.

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Jun 20, 2017
…om checkpoint.

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <[email protected]>

Closes apache#18230 from saturday-shi/SPARK-21008.

(cherry picked from commit e92ffe6)
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit a233fac)
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
…om checkpoint.

## What changes were proposed in this pull request?

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

## How was this patch tested?

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

## Notice

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <[email protected]>

Closes apache#18230 from saturday-shi/SPARK-21008.
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
…om checkpoint.

Reload the `spark.yarn.credentials.file` property when restarting a streaming application from checkpoint.

Manual tested with 1.6.3 and 2.1.1.
I didn't test this with master because of some compile problems, but I think it will be the same result.

This should be merged into maintenance branches too.

jira: [SPARK-21008](https://issues.apache.org/jira/browse/SPARK-21008)

Author: saturday_s <[email protected]>

Closes apache#18230 from saturday-shi/SPARK-21008.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants