Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5158] Access kerberized HDFS from Spark standalone #17530

Closed
wants to merge 18 commits into from

Conversation

themodernlife
Copy link
Contributor

What changes were proposed in this pull request?

  • Refactor ConfigurableCredentialManager and related CredentialProviders so that they are no longer tied to YARN
  • Setup credential renewal/updating from within the StandaloneSchedulerBackend
  • Ensure executors/drivers are able to find initial tokens for contacting HDFS and renew them at regular intervals

The implementation does basically the same thing as the YARN backend. The keytab is copied to driver/executors through an environment variable in the ApplicationDescription.

How was this patch tested?

https://github.com/themodernlife/spark-standalone-kerberos contains a docker-compose environment with a KDC and Kerberized HDFS mini-cluster. The README contains instructions for running the integration test script to see credential refresh/updating occur. Credentials are set to update very 2 minutes or so.

Ian Hummel added 18 commits February 24, 2017 16:29
* master: (164 commits)
  [SPARK-20198][SQL] Remove the inconsistency in table/function name conventions in SparkSession.Catalog APIs
  [SPARK-20190][APP-ID] applications//jobs' in rest api,status should be [running|s…
  [SPARK-19825][R][ML] spark.ml R API for FPGrowth
  [SPARK-20067][SQL] Unify and Clean Up Desc Commands Using Catalog Interface
  [SPARK-10364][SQL] Support Parquet logical type TIMESTAMP_MILLIS
  [SPARK-19408][SQL] filter estimation on two columns of same table
  [SPARK-20145] Fix range case insensitive bug in SQL
  [SPARK-20194] Add support for partition pruning to in-memory catalog
  [SPARK-19641][SQL] JSON schema inference in DROPMALFORMED mode produces incorrect schema for non-array/object JSONs
  [SPARK-19969][ML] Imputer doc and example
  [SPARK-9002][CORE] KryoSerializer initialization does not include 'Array[Int]'
  [MINOR][DOCS] Replace non-breaking space to normal spaces that breaks rendering markdown
  [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options
  [SPARK-19985][ML] Fixed copy method for some ML Models
  [SPARK-20159][SPARKR][SQL] Support all catalog API in R
  [SPARK-20173][SQL][HIVE-THRIFTSERVER] Throw NullPointerException when HiveThriftServer2 is shutdown
  [SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it(e.g. $SPARK…
  [SPARK-20143][SQL] DataType.fromJson should throw an exception with better message
  [SPARK-20186][SQL] BroadcastHint should use child's stats
  [SPARK-19148][SQL][FOLLOW-UP] do not expose the external table concept in Catalog
  ...
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@vanzin
Copy link
Contributor

vanzin commented Apr 4, 2017

How does this patch handle the security issues raised in other attempts at such feature, such as #4106? Specifically the following comment, if you don't want to go through all the discussion: #4106 (comment)

Basically, how do you prevent user A from reading user B's keytab here?

@themodernlife
Copy link
Contributor Author

Hi @vanzin, spark standalone isn't really multi user in any sense since the executors for all jobs run as whatever user the worker daemon was started as. That shouldn't preclude standalone clusters from communicating with secured resources.

Happy to add some additional documentation on this very point to the PR.

Any other other thoughts?

Thanks,

@vanzin
Copy link
Contributor

vanzin commented Apr 5, 2017

That shouldn't preclude standalone clusters from communicating with secured resources.

Of course it should. You're inserting a service into your cluster that allows people to steal each others' credentials, basically making security in all the other services pointless. Explain to me how does that make sense?

If all you want is to allow Spark to access secure HDFS, create a user for the "Spark Standalone" service and login on every Worker node as that user. All Spark applications will use that user, and then administrators have control of how much damage that user can do to other services.

But as is, you're basically breaking security of the whole datacenter by introducing an insecure service.

@themodernlife
Copy link
Contributor Author

In our setup each user gets their own standalone cluster. Users cannot submit jobs to each other's clusters. By providing a keytab on cluster creation and having Spark manage renewal on behalf of the user, we can support long running jobs with less headache.

@vanzin
Copy link
Contributor

vanzin commented Apr 5, 2017

Then in your setup you can configure things so that the cluster already has the user's keytab; Spark doesn't need to distribute it for you.

@themodernlife
Copy link
Contributor Author

That's right, but you still need a separate out of band process refreshing with the KDC. My thinking is why not have spark do that on your behalf?

@vanzin
Copy link
Contributor

vanzin commented Apr 5, 2017

That is not what you change does, though.

If you want to change the master / worker scripts to refresh kerberos credentials, that would be a lot more acceptable. This change is just not acceptable, because outside of your user case (which I don't know all the details of), it's a big security hole.

@themodernlife
Copy link
Contributor Author

To me it's basically the same as users including S3 credentials when submitting to spark standalone. Kerberos just requires more machinery. It might be a little harder to get at the spark conf entries of another user's job, but still possible since everything runs as the same unix user and shares the cluster secret.

@themodernlife
Copy link
Contributor Author

Said another way people need another layer to use spark standalone in secured environments anyway.

@vanzin
Copy link
Contributor

vanzin commented Apr 5, 2017

I'm sorry but you won't convince me that it's a useful feature to have Spark be a big security hole when inserted into a kerberos environment.

As I said, if you want to change your approach to have the Master / Worker daemons manage a kerberos login that is shared among all users, that would be more acceptable, and also cover your use case as far as I can see. But your current approach is just not going to happen, at least from my point of view. If you can find someone else to shepherd your PR, I'll let them figure it out, but I'm not for adding this particular implementation to Spark.

And people can use YARN if they really care about security. Standalone was never meant to be secure, nor used in secure environments. (There's also ongoing work to get Kerberos to work with Mesos, which does have the necessary security features as far as I know.)

@vanzin
Copy link
Contributor

vanzin commented Apr 5, 2017

And BTW, if you really want to pursue this, please write a detailed spec explaining everything that is being done, and describe all the security issues people need to be aware of. It might even be prudent to make it an SPIP.

@themodernlife
Copy link
Contributor Author

That would work for cluster mode but in client mode the driver on the submitting nodes still needs the keytab unfortunately.

Standalone clusters are best viewed as distributed single-user programs, so I think the real mistake is not bringing them into a secure environment, but bringing them into a secure environment and trying to use them in a multi-tenant/multi-user fashion.

I can see the concern that this feature might give someone who brings standalone clusters into a kerberized environment a false sense of security. What about disabling unless something like spark.standalone.single-user is set to true?

@themodernlife
Copy link
Contributor Author

BTW not trying to give you the hard sell and appreciate the help rounding out the requirements from the core committers' POV.

@vanzin
Copy link
Contributor

vanzin commented Apr 5, 2017

I'm just not sold on the idea that this is necessary in the first place. Personally I don't use standalone nor do I play with it at all, so my concerns are purely from a security standpoint. As in, if I were managing a secure cluster, why would I ever allow this code to run on it?

If you can write a spec and convince people who maintain the standalone manager that it's a good idea, then go for it. I just want to make sure you're not setting up people to completely break security on their clusters.

@vanzin
Copy link
Contributor

vanzin commented Apr 5, 2017

That would work for cluster mode but in client mode the driver on the submitting nodes still needs the keytab unfortunately.

You're setting up a special cluster for a single user. I'm pretty sure that user can login to the KDC on the driver host.

@mgummelt
Copy link
Contributor

@themodernlife I'm trying to add Kerberos support for Mesos, and creating HadoopRDDs fail for me because YARN isn't configured: https://issues.apache.org/jira/browse/SPARK-20328

Did you run into this?

@themodernlife
Copy link
Contributor Author

There is a spot in HadoopFSCredentialProvider where it looks for a Hadoop config key related to yarn to set the token renewer.

In getTokenRenewer it calls Master.getMasterPrincipal(conf) which will need some yarn configuration set for things to succeed.

Right now the PR doesn't set that, so it needs to be set under the user's HADOOP_CONF even though it had no real effect. That probably should be changed.

Didn't have a chance to dig into the ticket you linked to but will try to have a look and compare notes. If anything comes to mind will comment there.

@mgummelt
Copy link
Contributor

Right now the PR doesn't set that, so it needs to be set under the user's HADOOP_CONF even though it had no real effect. That probably should be changed.

Yep, same problem I'm seeing. Thanks.

@jiangxb1987
Copy link
Contributor

Is this addressing similar issue with #17387 ?

@HyukjinKwon
Copy link
Member

gentle ping @themodernlife on ^.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants