The following are all true stories. We welcome more submissions of these stories, especially covering the steps taken to determine what was wrong.
A client program could not work with ZooKeeper: the connections were being broken. Yet it was working for everything else.
The cluster was one year old that day.
It turns out that ZooKeeper reacts to an auth failure by logging something in its own logs and breaking the client connection, without any notification to the client. Rather than a network problem (the initial hypothesis), this was a Kerberos problem. How was that worked out? By examining the ZooKeeper logs: there was nothing client-side except reports of connections being closed and the ZK client attempting to retry.
When a Kerberos keytab is created, the entries in it have a lifespan; the default value is one year. This was the cluster's first birthday, hence ZK wouldn't trust the client.
Fix: create new keytabs, valid for another year, and distribute them.
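One way to catch this failure mode before the anniversary is to check the age of the keytab entries. A minimal sketch, assuming klist -kt style output with two-digit-year timestamps; the principals, timestamps, and one-year threshold below are illustrative, not from the incident:

```python
from datetime import datetime, timedelta

# Sample output in the style of `klist -kt <keytab>`; the principals and
# timestamps are made up for illustration.
SAMPLE = """\
KVNO Timestamp         Principal
---- ----------------- ----------------------------------------------------
   2 01/16/15 11:07:23 zookeeper/host1.example.com@REALM
   2 01/16/15 11:07:23 zookeeper/host2.example.com@REALM
"""

def stale_entries(klist_output, now, max_age_days=365):
    """Return the principals whose keytab entry is older than max_age_days."""
    stale = []
    for line in klist_output.splitlines():
        parts = line.split()
        # Data lines start with a numeric KVNO; skip the header and rule lines.
        if len(parts) < 4 or not parts[0].isdigit():
            continue
        created = datetime.strptime(f"{parts[1]} {parts[2]}", "%m/%d/%y %H:%M:%S")
        if now - created > timedelta(days=max_age_days):
            stale.append(parts[3])
    return stale

print(stale_entries(SAMPLE, datetime(2016, 1, 17)))
```

Against the sample data, a check dated 2016-01-17 flags both entries, while one dated mid-2015 flags none. The actual fix remains regenerating and redistributing the keytabs.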
This one showed up during release testing; credit to Andras Bokor for tracking it all down.
A stack trace:

```
16/01/16 01:42:39 WARN ipc.Client: Exception encountered while connecting to the server :
javax.security.sasl.SaslException: GSS initiate failed
[Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
java.io.IOException: Failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed
[Caused by GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos tgt)]; Host Details :
local host is: "os-u14-2-2.novalocal/172.22.73.243"; destination host is: "os-u14-2-3.novalocal":8020;
  at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:773)
  at org.apache.hadoop.ipc.Client.call(Client.java:1431)
  at org.apache.hadoop.ipc.Client.call(Client.java:1358)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
  at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
  at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
  at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
  at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1315)
  at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1311)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1311)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
```
This looks like a normal "not logged in" problem, except for some little facts:
- The user was logged in.
- The failure was replicable.
- It only surfaced on OpenJDK, not Oracle JDK.
- Everything worked on OpenJDK 7u51, but not on OpenJDK 7u91.
Something had changed in the JDK to reject the login on this system (Ubuntu, virtual test cluster). KDiag didn't throw up anything obvious. What did show some warning signs was klist:
```
Ticket cache: FILE:/tmp/krb5cc_2529
Default principal: qe@REALM

Valid starting       Expires              Service principal
01/16/2016 11:07:23  01/16/2016 21:07:23  krbtgt/REALM@REALM
        renew until 01/23/2016 11:07:23
01/16/2016 13:13:11  01/16/2016 21:07:23  HTTP/hdfs-3-5@
        renew until 01/23/2016 11:07:23
01/16/2016 13:13:11  01/16/2016 21:07:23  HTTP/hdfs-3-5@REALM
        renew until 01/23/2016 11:07:23
```
See that? There's a principal which doesn't have a stated realm. Does that matter?
In Oracle JDK, and in OpenJDK 7u51, apparently not. In OpenJDK 7u91: yes.
There's some new code in sun.security.krb5.PrincipalName (Oracle OpenJDK copyright):

```java
// Validate a nameStrings argument
private static void validateNameStrings(String[] ns) {
    if (ns == null) {
        throw new IllegalArgumentException("Null nameStrings not allowed");
    }
    if (ns.length == 0) {
        throw new IllegalArgumentException("Empty nameStrings not allowed");
    }
    for (String s : ns) {
        if (s == null) {
            throw new IllegalArgumentException("Null nameString not allowed");
        }
        if (s.isEmpty()) {
            throw new IllegalArgumentException("Empty nameString not allowed");
        }
    }
}
```
This validates every element of a principal's name, rejecting the principal if any component is null or empty. Now, how does something invalid get in?
Setting HADOOP_JAAS_DEBUG=true and logging at debug level turned up the difference.

With 7u51:

```
16/01/20 15:13:20 DEBUG security.UserGroupInformation: using kerberos user:qe@REALM
```

With 7u91:

```
16/01/20 15:10:44 DEBUG security.UserGroupInformation: using kerberos user:null
```
Which means that the default principal wasn't being picked up; instead, some JVM-specific introspection had kicked in, and it was finding the principal without a realm rather than the one with it.
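A rough Python rendering of why the realm-less entry trips those new guards; the principal splitting here is a simplification for illustration, not the actual JDK parsing logic:

```python
def name_components(principal):
    # Crude split of "service/host@REALM" into its components; the real
    # sun.security.krb5 parser is more involved, this is just illustrative.
    name, _, realm = principal.rpartition("@")
    return name.split("/") + [realm]

def validate(components):
    # Mirrors the checks in validateNameStrings above.
    if components is None:
        raise ValueError("Null nameStrings not allowed")
    if not components:
        raise ValueError("Empty nameStrings not allowed")
    for s in components:
        if s is None:
            raise ValueError("Null nameString not allowed")
        if not s:
            raise ValueError("Empty nameString not allowed")
    return True

validate(name_components("HTTP/hdfs-3-5@REALM"))  # fine
try:
    validate(name_components("HTTP/hdfs-3-5@"))   # empty realm component
except ValueError as e:
    print(e)  # prints: Empty nameString not allowed
```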
Fix: add a domain_realm mapping in /etc/krb5.conf, mapping hostnames to realms:

```
[domain_realm]
hdfs-3-5.novalocal = REALM
```
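On a larger cluster, listing every host individually gets tedious; krb5.conf also accepts a leading-dot entry that maps a whole DNS domain to a realm. The domain and realm names below are placeholders:

```ini
[domain_realm]
  # exact-host mapping, as in the fix above
  hdfs-3-5.novalocal = REALM
  # leading dot: every host under .novalocal maps to REALM
  .novalocal = REALM
```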
klist then returns a list of credentials without the realm-less entry:

```
Valid starting       Expires              Service principal
01/17/2016 14:49:08  01/18/2016 00:49:08  krbtgt/REALM@REALM
        renew until 01/24/2016 14:49:08
01/17/2016 14:49:16  01/18/2016 00:49:08  HTTP/hdfs-3-5@REALM
        renew until 01/24/2016 14:49:08
```
Because this was a virtual cluster, DNS/reverse DNS probably wasn't working, so presumably Kerberos didn't know which realm the host was in, and things went downhill from there. It just didn't show up in any validation operations, merely as the classic "no TGT" error.
Real-life example:
- Company ACME has one Active Directory domain per continent.
- Domains have mutual trust enabled.
- AD is also used for Kerberos authentication.
- Kerberos trust is handled by a few AD servers in the "root" domain.
- Hadoop cluster is running in Europe.
When a South American user opens an SSH session on the edge node, authentication is done by LDAP with no issue; the dirty work is done by the AD servers.
But when a S.A. user tries to connect to HDFS or HiveServer2 with principal [email protected], the Kerberos client must make several hops:
- AD server for @SAM.ACME.COM says "no, can't create ticket for svc/[email protected]"
- AD server for @SAM.ACME.COM says "OK, I can get you a credential for @ACME.COM; see what they can do there." Alas,
- there's no AD server defined in the conf file for @ACME.COM,
- which leads to the all-too-familiar message:

```
Fail to create credential. (63) - No service creds
```
Of course the only thing displayed in logs was the final error message.
Even after enabling the "secret" debug flags, it was not clear what the client was trying to do with all these attempts. But the tell-tale was the "following capath" comment on each hop, because capaths is actually an optional section in krb5.conf. Fix: add the information about cross-realm authentication to the krb5.conf file.
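What that cross-realm information can look like is sketched below; every KDC hostname here is a placeholder, since the real topology wasn't published. The [capaths] section tells the client which intermediate realm sits on the path, and [realms] tells it where to find a KDC for each realm it must talk to:

```ini
[realms]
  # A reachable KDC is needed for every realm on the path,
  # including the intermediate "root" realm that was missing here.
  ACME.COM = {
    kdc = ad-root.acme.com
  }
  EU.ACME.COM = {
    kdc = ad-eu.acme.com
  }

[capaths]
  # From a SAM.ACME.COM client, reach EU.ACME.COM via ACME.COM.
  SAM.ACME.COM = {
    EU.ACME.COM = ACME.COM
  }
```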