[SPARK-5158] Access kerberized HDFS from Spark standalone #17530
Conversation
… renewer = yarn.resourcemanager.principal
* master: (164 commits)
  [SPARK-20198][SQL] Remove the inconsistency in table/function name conventions in SparkSession.Catalog APIs
  [SPARK-20190][APP-ID] applications//jobs' in rest api,status should be [running|s…
  [SPARK-19825][R][ML] spark.ml R API for FPGrowth
  [SPARK-20067][SQL] Unify and Clean Up Desc Commands Using Catalog Interface
  [SPARK-10364][SQL] Support Parquet logical type TIMESTAMP_MILLIS
  [SPARK-19408][SQL] filter estimation on two columns of same table
  [SPARK-20145] Fix range case insensitive bug in SQL
  [SPARK-20194] Add support for partition pruning to in-memory catalog
  [SPARK-19641][SQL] JSON schema inference in DROPMALFORMED mode produces incorrect schema for non-array/object JSONs
  [SPARK-19969][ML] Imputer doc and example
  [SPARK-9002][CORE] KryoSerializer initialization does not include 'Array[Int]'
  [MINOR][DOCS] Replace non-breaking space to normal spaces that breaks rendering markdown
  [SPARK-20166][SQL] Use XXX for ISO 8601 timezone instead of ZZ (FastDateFormat specific) in CSV/JSON timeformat options
  [SPARK-19985][ML] Fixed copy method for some ML Models
  [SPARK-20159][SPARKR][SQL] Support all catalog API in R
  [SPARK-20173][SQL][HIVE-THRIFTSERVER] Throw NullPointerException when HiveThriftServer2 is shutdown
  [SPARK-20123][BUILD] SPARK_HOME variable might have spaces in it(e.g. $SPARK…
  [SPARK-20143][SQL] DataType.fromJson should throw an exception with better message
  [SPARK-20186][SQL] BroadcastHint should use child's stats
  [SPARK-19148][SQL][FOLLOW-UP] do not expose the external table concept in Catalog
  ...
Can one of the admins verify this patch?
How does this patch handle the security issues raised in other attempts at such a feature, such as #4106? Specifically, see the following comment if you don't want to go through all the discussion: #4106 (comment). Basically, how do you prevent user A from reading user B's keytab here?
Hi @vanzin, Spark standalone isn't really multi-user in any sense, since the executors for all jobs run as whatever user the Worker daemon was started as. That shouldn't preclude standalone clusters from communicating with secured resources. Happy to add some additional documentation on this very point to the PR. Any other thoughts? Thanks,
Of course it should. You're inserting a service into your cluster that allows people to steal each other's credentials, basically making security in all the other services pointless. Explain to me how that makes sense. If all you want is to allow Spark to access secure HDFS, create a user for the "Spark Standalone" service and log in on every Worker node as that user. All Spark applications will use that user, and administrators then have control over how much damage that user can do to other services. But as is, you're basically breaking the security of the whole datacenter by introducing an insecure service.
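For concreteness, a minimal sketch (not part of this PR) of what that shared service login could look like on a Worker node, using Hadoop's `UserGroupInformation` API; the principal and keytab path are hypothetical and deployment-specific:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

object StandaloneServiceLogin {
  def main(args: Array[String]): Unit = {
    // Hypothetical principal and keytab for a dedicated "Spark Standalone"
    // service account; real names depend on the deployment.
    val principal = "spark/worker01.example.com@EXAMPLE.COM"
    val keytab    = "/etc/security/keytabs/spark.service.keytab"

    val conf = new Configuration()
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)

    // Log the whole Worker process in as the shared service user. Every
    // application launched by this Worker then runs with these credentials,
    // and HDFS access is bounded by what this one principal may do.
    UserGroupInformation.loginUserFromKeytab(principal, keytab)
    println(s"Logged in as ${UserGroupInformation.getLoginUser}")
  }
}
```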
In our setup each user gets their own standalone cluster. Users cannot submit jobs to each other's clusters. By providing a keytab on cluster creation and having Spark manage renewal on behalf of the user, we can support long-running jobs with less headache.
Then in your setup you can configure things so that the cluster already has the user's keytab; Spark doesn't need to distribute it for you.
That's right, but you still need a separate out-of-band process refreshing credentials with the KDC. My thinking is: why not have Spark do that on your behalf?
That is not what your change does, though. If you want to change the master/worker scripts to refresh kerberos credentials, that would be a lot more acceptable. This change is just not acceptable, because outside of your use case (which I don't know all the details of), it's a big security hole.
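A possible shape for such a daemon-level refresher, assuming the Master/Worker was logged in from a keytab as sketched above; `checkTGTAndReloginFromKeytab` is the standard Hadoop UGI call for this and only re-logins when the TGT is near expiry, so a frequent periodic call is cheap:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.hadoop.security.UserGroupInformation

object TgtRenewer {
  def start(): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit =
        try {
          // Re-login from the keytab if (and only if) the TGT needs it.
          UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
        } catch {
          case e: Exception =>
            // Log and retry on the next tick rather than killing the daemon.
            System.err.println(s"TGT renewal failed: ${e.getMessage}")
        }
    }, 0, 1, TimeUnit.MINUTES)
  }
}
```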
To me it's basically the same as users including S3 credentials when submitting to Spark standalone. Kerberos just requires more machinery. It might be a little harder to get at the Spark conf entries of another user's job, but it's still possible, since everything runs as the same unix user and shares the cluster secret.
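The analogy, as a sketch: long-lived S3 credentials handed to a standalone cluster through ordinary Spark configuration (the standard `fs.s3a` keys via the `spark.hadoop.` prefix), which end up just as visible to anyone sharing the cluster as a distributed keytab would be; the master URL and app name are placeholders:

```scala
import org.apache.spark.SparkConf

// Credentials set via SparkConf travel with the application and are
// readable by anything running as the cluster's shared unix user.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("s3-analogy")
  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
```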
Said another way, people need another layer to use Spark standalone in secured environments anyway.
I'm sorry but you won't convince me that it's a useful feature to have Spark be a big security hole when inserted into a kerberos environment. As I said, if you want to change your approach to have the Master / Worker daemons manage a kerberos login that is shared among all users, that would be more acceptable, and also cover your use case as far as I can see. But your current approach is just not going to happen, at least from my point of view. If you can find someone else to shepherd your PR, I'll let them figure it out, but I'm not for adding this particular implementation to Spark. And people can use YARN if they really care about security. Standalone was never meant to be secure, nor used in secure environments. (There's also ongoing work to get Kerberos to work with Mesos, which does have the necessary security features as far as I know.)
And BTW, if you really want to pursue this, please write a detailed spec explaining everything that is being done, and describe all the security issues people need to be aware of. It might even be prudent to make it an SPIP.
That would work for cluster mode, but in client mode the driver on the submitting node still needs the keytab, unfortunately. Standalone clusters are best viewed as distributed single-user programs, so I think the real mistake is not bringing them into a secure environment, but bringing them into a secure environment and trying to use them in a multi-tenant/multi-user fashion. I can see the concern that this feature might give someone who brings standalone clusters into a kerberized environment a false sense of security. What about disabling it unless something like … is set?
BTW, not trying to give you the hard sell, and I appreciate the help rounding out the requirements from the core committers' POV.
I'm just not sold on the idea that this is necessary in the first place. Personally I don't use standalone, nor do I play with it at all, so my concerns are purely from a security standpoint. As in: if I were managing a secure cluster, why would I ever allow this code to run on it? If you can write a spec and convince the people who maintain the standalone manager that it's a good idea, then go for it. I just want to make sure you're not setting up people to completely break security on their clusters.
You're setting up a special cluster for a single user. I'm pretty sure that user can login to the KDC on the driver host.
@themodernlife I'm trying to add Kerberos support for Mesos, and creating HadoopRDDs fails for me because YARN isn't configured: https://issues.apache.org/jira/browse/SPARK-20328 Did you run into this?
There is a spot in HadoopFSCredentialProvider where it looks for a Hadoop config key related to YARN to set the token renewer. In getTokenRenewer it calls Master.getMasterPrincipal(conf), which needs some YARN configuration set in order to succeed. Right now the PR doesn't set that, so it has to be set under the user's HADOOP_CONF even though it has no real effect there. That should probably be changed. I didn't have a chance to dig into the ticket you linked, but I'll try to have a look and compare notes. If anything comes to mind I'll comment there.
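A sketch of the workaround under discussion, assuming the lookup bottoms out in Hadoop's `org.apache.hadoop.mapred.Master.getMasterPrincipal`; the principal value here is a placeholder with no functional role on a standalone cluster beyond satisfying the lookup:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.Master

// getTokenRenewer resolves the renewer from YARN configuration even when no
// YARN cluster exists, so on standalone the key must be set explicitly
// (placeholder value, deployment-specific) for the call to succeed.
val conf = new Configuration()
conf.set("yarn.resourcemanager.principal", "rm/resourcemanager.example.com@EXAMPLE.COM")
val renewer = Master.getMasterPrincipal(conf) // returns the configured principal
```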
Yep, same problem I'm seeing. Thanks.
Is this addressing a similar issue to #17387?
gentle ping @themodernlife on ^. |
What changes were proposed in this pull request?
- Refactored ConfigurableCredentialManager and related CredentialProviders so that they are no longer tied to YARN.
- StandaloneSchedulerBackend: the implementation does basically the same thing as the YARN backend; the keytab is copied to the driver/executors through an environment variable in the ApplicationDescription (sketched below).
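A hedged sketch of that propagation mechanism; the environment variable name is hypothetical, and the PR's actual field and variable names may differ:

```scala
import java.nio.file.{Files, Paths}
import java.util.Base64

// Master/driver side: encode the keytab into the application's environment
// (SPARK_KERBEROS_KEYTAB_B64 is a made-up name for illustration).
val keytabBytes = Files.readAllBytes(Paths.get("/path/to/user.keytab"))
val appEnv = Map("SPARK_KERBEROS_KEYTAB_B64" ->
  Base64.getEncoder.encodeToString(keytabBytes))

// Executor side: decode the variable back into a local keytab file before
// logging in to Kerberos.
val decoded = Base64.getDecoder.decode(sys.env("SPARK_KERBEROS_KEYTAB_B64"))
Files.write(Paths.get("/tmp/user.keytab"), decoded)
```

Note that environment variables are readable by any process running as the same unix user, which is exactly the exposure debated earlier in this thread.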
How was this patch tested?
https://github.com/themodernlife/spark-standalone-kerberos contains a docker-compose environment with a KDC and a Kerberized HDFS mini-cluster. The README contains instructions for running the integration test script to see credential refresh/updating occur. Credentials are set to update every 2 minutes or so.