[Bug] Pulsar 4.0.1 image crashed when loading the native SSL library #23717

Closed
2 of 3 tasks
BewareMyPower opened this issue Dec 12, 2024 · 17 comments · Fixed by #23762
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@BewareMyPower
Contributor

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

apachepulsar/pulsar                                                latest      93b5506ac8eb   12 days ago     481MB

Minimal reproduce step

git clone https://github.com/apache/pulsar-client-python.git
cd pulsar-client-python
./build-support/pulsar-test-service-start.sh

The script gets stuck waiting for port 8080.

What did you expect to see?

The script should succeed

What did you see instead?

The Pulsar process in the container crashed, and the log in /pulsar/logs inside the container stopped at:

2024-12-12T02:49:54,604+0000 [main] INFO org.apache.pulsar.broker.service.BrokerService - Started Pulsar Broker service

log.info("Started Pulsar Broker service on {}, TLS: {}, listener: {}",
        ch.localAddress(),
        isTls ? SslContext.defaultServerProvider().toString() : "(none)",
        StringUtils.defaultString(a.getListenerName(), "(none)"));
} catch (Exception e) {

As you can see, it didn't log the listening port.

Anything else?

hs_err_pid296.log

It crashed when loading the native SSL library:

Current thread (0x0000ffffaa98c800):  JavaThread "main"             [_thread_in_native, id=349, stack(0x0000ffffadd2d000,0x0000ffffadf2ba80) (2042K)]

Stack: [0x0000ffffadd2d000,0x0000ffffadf2ba80],  sp=0x0000ffffadf287f0,  free space=2029k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libnetty_tcnative_linux_aarch_6415896617181118050458.so+0x240ec]  init_have_lse_atomics+0xc
C  [ld-musl-aarch64.so.1+0x6b2fc]
C  [ld-musl-aarch64.so.1+0x6d058]  dlopen+0xc4
V  [libjvm.so+0xbc2078]  os::Linux::dlopen_helper(char const*, char*, int)+0x28
V  [libjvm.so+0xbc2364]  os::dll_load(char const*, char*, int)+0x74
V  [libjvm.so+0x90704c]  JVM_LoadLibrary+0x88
C  [libjava.so+0xf3cc]  Java_jdk_internal_loader_NativeLibraries_load+0x21c
j  jdk.internal.loader.NativeLibraries.load(Ljdk/internal/loader/NativeLibraries$NativeLibraryImpl;Ljava/lang/String;ZZ)Z+0 [email protected]
j  jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open()Z+57 [email protected]
j  jdk.internal.loader.NativeLibraries.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)Ljdk/internal/loader/NativeLibrary;+254 [email protected]
j  jdk.internal.loader.NativeLibraries.loadLibrary(Ljava/lang/Class;Ljava/io/File;)Ljdk/internal/loader/NativeLibrary;+51 [email protected]
j  java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/io/File;)Ljdk/internal/loader/NativeLibrary;+31 [email protected]
j  java.lang.Runtime.load0(Ljava/lang/Class;Ljava/lang/String;)V+61 [email protected]
j  java.lang.System.load(Ljava/lang/String;)V+7 [email protected]
j  io.netty.util.internal.NativeLibraryUtil.loadLibrary(Ljava/lang/String;Z)V+5
j  java.lang.invoke.LambdaForm$DMH+0x0000008800430000.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;I)V+11 [email protected]
j  java.lang.invoke.LambdaForm$MH+0x0000008800431000.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+54 [email protected]
j  java.lang.invoke.LambdaForm$MH+0x0000008800128400.invokeExact_MT(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+22 [email protected]
j  jdk.internal.reflect.DirectMethodHandleAccessor.invokeImpl(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+72 [email protected]
j  jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;+23 [email protected]
J 3874 c1 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; [email protected] (108 bytes) @ 0x0000ffff9bf827e4 [0x0000ffff9bf82240+0x00000000000005a4]
j  io.netty.util.internal.NativeLibraryLoader$1.run()Ljava/lang/Object;+53
J 1613 c1 java.security.AccessController.doPrivileged(Ljava/security/PrivilegedAction;)Ljava/lang/Object; [email protected] (9 bytes) @ 0x0000ffff9bb92fd0 [0x0000ffff9bb92e00+0x00000000000001d0]
j  io.netty.util.internal.NativeLibraryLoader.loadLibraryByHelper(Ljava/lang/Class;Ljava/lang/String;Z)V+10
j  io.netty.util.internal.NativeLibraryLoader.loadLibrary(Ljava/lang/ClassLoader;Ljava/lang/String;Z)V+15
j  io.netty.util.internal.NativeLibraryLoader.load(Ljava/lang/String;Ljava/lang/ClassLoader;)V+352
j  io.netty.util.internal.NativeLibraryLoader.loadFirstAvailable(Ljava/lang/ClassLoader;[Ljava/lang/String;)V+33
j  io.netty.handler.ssl.OpenSsl.loadTcNative()V+297
j  io.netty.handler.ssl.OpenSsl.<clinit>()V+170
v  ~StubRoutines::call_stub 0x0000ffffa2d99130
V  [libjvm.so+0x83f4e8]  JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x218
V  [libjvm.so+0x819560]  InstanceKlass::call_class_initializer(JavaThread*)+0x280
V  [libjvm.so+0x81b214]  InstanceKlass::initialize_impl(JavaThread*)+0x5a4
V  [libjvm.so+0xa71540]  LinkResolver::resolve_static_call(CallInfo&, LinkInfo const&, bool, JavaThread*)+0x1a0
V  [libjvm.so+0xa71cf0]  LinkResolver::resolve_invoke(CallInfo&, Handle, constantPoolHandle const&, int, Bytecodes::Code, JavaThread*)+0x1e0
V  [libjvm.so+0x837358]  InterpreterRuntime::resolve_invoke(JavaThread*, Bytecodes::Code)+0x228
V  [libjvm.so+0x8377b8]  InterpreterRuntime::resolve_from_cache(JavaThread*, Bytecodes::Code)+0x108
j  io.netty.handler.ssl.SslContext.defaultProvider()Lio/netty/handler/ssl/SslProvider;+0
j  io.netty.handler.ssl.SslContext.defaultServerProvider()Lio/netty/handler/ssl/SslProvider;+0
j  io.netty.handler.ssl.SslContext.newServerContextInternal(Lio/netty/handler/ssl/SslProvider;Ljava/security/Provider;[Ljava/security/cert/X509Certificate;Ljavax/net/ssl/TrustManagerFactory;[Ljava/security/cert/X509Certificate;Ljava/security/PrivateKey;Ljava/lang/String;Ljavax/net/ssl/KeyManagerFactory;Ljava/lang/Iterable;Lio/netty/handler/ssl/CipherSuiteFilter;Lio/netty/handler/ssl/ApplicationProtocolConfig;JJLio/netty/handler/ssl/ClientAuth;[Ljava/lang/String;ZZLjava/security/SecureRandom;Ljava/lang/String;[Ljava/util/Map$Entry;)Lio/netty/handler/ssl/SslContext;+4
j  io.netty.handler.ssl.SslContextBuilder.build()Lio/netty/handler/ssl/SslContext;+101
j  org.apache.pulsar.common.util.SecurityUtility.createNettySslContextForServer(Lio/netty/handler/ssl/SslProvider;ZLjava/lang/String;Ljava/lang/String;Ljava/lang/String;Ljava/util/Set;Ljava/util/Set;Z)Lio/netty/handler/ssl/SslContext;+123
j  org.apache.pulsar.common.util.DefaultPulsarSslFactory.buildNettySslContext()Lio/netty/handler/ssl/SslContext;+279
j  org.apache.pulsar.common.util.DefaultPulsarSslFactory.createInternalSslContext()V+60
j  org.apache.pulsar.broker.service.PulsarChannelInitializer.<init>(Lorg/apache/pulsar/broker/PulsarService;Lorg/apache/pulsar/broker/service/PulsarChannelInitializer$PulsarChannelOptions;)V+87
j  org.apache.pulsar.broker.service.PulsarChannelInitializer$$Lambda+0x00000088004fa480.newPulsarChannelInitializer(Lorg/apache/pulsar/broker/PulsarService;Lorg/apache/pulsar/broker/service/PulsarChannelInitializer$PulsarChannelOptions;)Lorg/apache/pulsar/broker/service/PulsarChannelInitializer;+6
j  org.apache.pulsar.broker.service.BrokerService.start()V+201

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@BewareMyPower BewareMyPower added the type/bug The PR fixed a bug or issue reported a bug label Dec 12, 2024
@BewareMyPower
Contributor Author

The Netty JNI library is incorrectly linked:

cf43f09c95f1:/tmp$ ldd libnetty_tcnative_linux_aarch_6412030080574647118807.so 
	/lib/ld-musl-aarch64.so.1 (0xffff9501c000)
	librt.so.1 => /lib/ld-musl-aarch64.so.1 (0xffff9501c000)
	libdl.so.2 => /lib/ld-musl-aarch64.so.1 (0xffff9501c000)
	libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0xffff94dbc000)
	libc.so.6 => /lib/ld-musl-aarch64.so.1 (0xffff9501c000)
Error relocating libnetty_tcnative_linux_aarch_6412030080574647118807.so: __getauxval: symbol not found

From here,

musl refuses to tell you "symbol can't be found" and instead just crashes. With glibc these issues are caught and handled.

@mattisonchao
Member

That might be a netty bug. netty/netty#14479

@BewareMyPower
Contributor Author

Yes, it's a Netty bug, but I'm not sure it's the same issue. It appears to be an issue with the JNI library in netty-tcnative-boringssl-static.

It can be verified by executing the following command after you start a container:

cd /tmp
unzip -q /pulsar/lib/io.netty-netty-tcnative-boringssl-static-*-aarch_64.jar 
ldd META-INF/native/libnetty_tcnative_linux_aarch_64.so 

With 4.0.0, the JAR version is 2.0.66 and all symbols resolve:

	/lib/ld-musl-aarch64.so.1 (0xffff8670a000)
	librt.so.1 => /lib/ld-musl-aarch64.so.1 (0xffff8670a000)
	libcrypt.so.1 => /lib/libcrypt.so.1 (0xffff864da000)
	libdl.so.2 => /lib/ld-musl-aarch64.so.1 (0xffff8670a000)
	libc.so.6 => /lib/ld-musl-aarch64.so.1 (0xffff8670a000)
	libucontext.so.1 => /lib/libucontext.so.1 (0xffff864b9000)
	libobstack.so.1 => /usr/lib/libobstack.so.1 (0xffff86498000)

However, with 4.0.1, the JAR version is 2.0.69 and the __getauxval symbol cannot be resolved, so it crashes on musl Linux:

	/lib/ld-musl-aarch64.so.1 (0xffffb161c000)
	librt.so.1 => /lib/ld-musl-aarch64.so.1 (0xffffb161c000)
	libdl.so.2 => /lib/ld-musl-aarch64.so.1 (0xffffb161c000)
	libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0xffffb13bc000)
	libc.so.6 => /lib/ld-musl-aarch64.so.1 (0xffffb161c000)
Error relocating META-INF/native/libnetty_tcnative_linux_aarch_64.so: __getauxval: symbol not found
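
The check above can be scripted so that the two JAR versions are compared by their unresolved symbols only. This is a minimal sketch, assuming the musl ldd error format shown in the output above; the function name unresolved_symbols is hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print the symbols that musl's ldd reports as unresolved.
# Reads ldd output on stdin and prints one symbol name per line.
# Assumes the "Error relocating <file>: <symbol>: symbol not found" format shown above.
unresolved_symbols() {
  sed -n 's/^Error relocating .*: \([^: ]*\): symbol not found.*$/\1/p'
}

# Usage inside the container (stderr is merged in case ldd reports errors there):
#   ldd META-INF/native/libnetty_tcnative_linux_aarch_64.so 2>&1 | unresolved_symbols
```

An empty result means every symbol resolved; for the 2.0.69 output above the filter would print __getauxval.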

@BewareMyPower
Contributor Author

I tried to reproduce this issue inside an alpine:3.20 container. It seems that netty-tcnative-boringssl-static 2.0.66 can resolve the __getauxval symbol after the gcompat package is installed, while 2.0.69 cannot.

~ # unzip -q netty-tcnative-boringssl-static-2.0.66.Final-linux-aarch_64.jar 
~ # ldd META-INF/native/libnetty_tcnative_linux_aarch_64.so 
	/lib/ld-musl-aarch64.so.1 (0xffffa4a08000)
	librt.so.1 => /lib/ld-musl-aarch64.so.1 (0xffffa4a08000)
Error loading shared library libcrypt.so.1: No such file or directory (needed by META-INF/native/libnetty_tcnative_linux_aarch_64.so)
	libdl.so.2 => /lib/ld-musl-aarch64.so.1 (0xffffa4a08000)
	libc.so.6 => /lib/ld-musl-aarch64.so.1 (0xffffa4a08000)
Error relocating META-INF/native/libnetty_tcnative_linux_aarch_64.so: __getauxval: symbol not found
~ # apk add gcompat
(1/3) Installing musl-obstack (1.2.3-r2)
(2/3) Installing libucontext (1.2-r3)
(3/3) Installing gcompat (1.1.0-r4)
OK: 138 MiB in 29 packages
~ # ldd META-INF/native/libnetty_tcnative_linux_aarch_64.so 
	/lib/ld-musl-aarch64.so.1 (0xffffbac74000)
	librt.so.1 => /lib/ld-musl-aarch64.so.1 (0xffffbac74000)
	libcrypt.so.1 => /lib/libcrypt.so.1 (0xffffbaa44000)
	libdl.so.2 => /lib/ld-musl-aarch64.so.1 (0xffffbac74000)
	libc.so.6 => /lib/ld-musl-aarch64.so.1 (0xffffbac74000)
	libucontext.so.1 => /lib/libucontext.so.1 (0xffffbaa23000)
	libobstack.so.1 => /usr/lib/libobstack.so.1 (0xffffbaa02000)

@lhotari
Member

lhotari commented Dec 12, 2024

@BewareMyPower We don't have a pure musl base image. glibc in https://github.com/apache/pulsar/tree/master/docker/glibc-package gets added to the Alpine base image, making it a mixed musl and glibc environment, which is generally not recommended.

https://gitlab.alpinelinux.org/alpine/aports/-/merge_requests/24647#note_176723 states it clearly:
"Combining glibc and musl runtimes is basically all but guaranteed to create an unstable environment, unless the system is appropriately configured (glibc side uses glibc binaries only, and vice versa)."

sgerrand/alpine-pkg-glibc#194 (comment) mentions:
"You really need everything pure musl, or everything musl (but works with gcompat or libc6-compat shim-like tools...) or everything glibc using this package)"

@lhotari
Member

lhotari commented Dec 12, 2024

This gist contains a Dockerfile for building a minideb-based Pulsar 4.0.1 image by copying the content from the Pulsar 4.0.1 image: https://gist.github.com/lhotari/3ffef8117743f7044e6bbdc3933bc029 . That could be useful for validating whether the problem reproduces without the mixed musl+glibc environment.

@BewareMyPower
Contributor Author

BewareMyPower commented Dec 12, 2024

I opened an issue upstream: netty/netty-tcnative#907, which includes minimal reproduction steps on alpine:3.20.

It seems that the missing __getauxval symbol can be resolved through the gcompat dependency with 2.0.66 but not with 2.0.69.

@lhotari
Member

lhotari commented Dec 12, 2024

I opened an issue upstream: netty/netty-tcnative#907, which includes minimal reproduction steps on alpine:3.20.

It seems that the missing __getauxval symbol can be resolved through the gcompat dependency with 2.0.66 but not with 2.0.69.

For Alpine base images, I think that it's necessary to have gcompat, libc6-compat, libuuid and possibly also libgcc packages installed to load netty-tcnative. @BewareMyPower Does it reproduce with all those packages installed?

@BewareMyPower
Contributor Author

~ # apk add libc6-compat libuuid libgcc
(1/1) Installing libuuid (2.40.1-r1)
OK: 138 MiB in 30 packages
~ # ldd META-INF/native/libnetty_tcnative_linux_aarch_64.so 
	/lib/ld-musl-aarch64.so.1 (0xffff9c053000)
	librt.so.1 => /lib/ld-musl-aarch64.so.1 (0xffff9c053000)
	libdl.so.2 => /lib/ld-musl-aarch64.so.1 (0xffff9c053000)
	libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0xffff9bdf3000)
	libc.so.6 => /lib/ld-musl-aarch64.so.1 (0xffff9c053000)
Error relocating META-INF/native/libnetty_tcnative_linux_aarch_64.so: __getauxval: symbol not found

These extra dependencies don't help. Actually, for 2.0.66, I only need to install gcc and gcompat to make it work.

@BewareMyPower
Contributor Author

 => WARN: InvalidDefaultArgInFrom: Default value for ARG ${PULSAR_IMAGE} results in empty or invalid base image name (line 10)  0.0s
 => CANCELED [internal] load metadata for docker.io/bitnami/minideb:bookworm  0.0s
Dockerfile:10
--------------------
   8 |     
   9 |     # First create a stage with the original Pulsar image
  10 | >>> FROM ${PULSAR_IMAGE} AS pulsar
  11 |     
  12 |     ###  Create one stage to include JVM distribution
--------------------
ERROR: failed to solve: failed to parse stage name "apachepulsar/pulsar-all=4.0.1": invalid reference format

I'm trying to build this Dockerfile but it failed: https://gist.github.com/lhotari/3ffef8117743f7044e6bbdc3933bc029

@lhotari
Member

lhotari commented Dec 12, 2024

~ # apk add libc6-compat libuuid libgcc
(1/1) Installing libuuid (2.40.1-r1)
OK: 138 MiB in 30 packages
~ # ldd META-INF/native/libnetty_tcnative_linux_aarch_64.so 
	/lib/ld-musl-aarch64.so.1 (0xffff9c053000)
	librt.so.1 => /lib/ld-musl-aarch64.so.1 (0xffff9c053000)
	libdl.so.2 => /lib/ld-musl-aarch64.so.1 (0xffff9c053000)
	libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0xffff9bdf3000)
	libc.so.6 => /lib/ld-musl-aarch64.so.1 (0xffff9c053000)
Error relocating META-INF/native/libnetty_tcnative_linux_aarch_64.so: __getauxval: symbol not found

These extra dependencies don't help. Actually, for 2.0.66, I only need to install gcc and gcompat to make it work.

Yes, the problem reproduces. I agree that this is a netty-tcnative issue with musl compatibility.

By the way, I'd guess that libgcc is preferable to gcc for 2.0.66, so that the image doesn't need a compiler installed.

2.0.66 can be loaded with gcompat and libgcc:

apk update
apk add gcompat libgcc
wget https://repo1.maven.org/maven2/io/netty/netty-tcnative-boringssl-static/2.0.66.Final/netty-tcnative-boringssl-static-2.0.66.Final-linux-aarch_64.jar
unzip netty-tcnative-boringssl-static-2.0.66.Final-linux-aarch_64.jar
ldd META-INF/native/libnetty_tcnative_linux_aarch_64.so

@lhotari
Member

lhotari commented Dec 12, 2024

I'm trying to build this Dockerfile but it failed: https://gist.github.com/lhotari/3ffef8117743f7044e6bbdc3933bc029

I accidentally introduced a typo while updating the version to 4.0.1. It's fixed now.

@BewareMyPower
Contributor Author

@lhotari Your Dockerfile works so I think we can apply your fix for now.

@BewareMyPower
Contributor Author

test                                                               latest      b8ab7278d588   2 hours ago     3.38GB
apachepulsar/pulsar                                                4.0.1       93b5506ac8eb   12 days ago     481MB

Though the image size is much bigger now.

@lhotari
Member

lhotari commented Dec 12, 2024

test                                                               latest      b8ab7278d588   2 hours ago     3.38GB
apachepulsar/pulsar                                                4.0.1       93b5506ac8eb   12 days ago     481MB

Though the image size is much bigger now.

@BewareMyPower The example Dockerfile is for the pulsar-all image. There's a comment explaining how to pass the Pulsar image as a build arg. Passing --build-arg PULSAR_IMAGE=apachepulsar/pulsar:4.0.1 would solve this.

@lhotari
Member

lhotari commented Dec 17, 2024

@BewareMyPower It looks like it's necessary to preload gcompat on Alpine (for example with ENV LD_PRELOAD=/lib/libgcompat.so.0 in the Dockerfile). More details are in netty/netty-tcnative#907 (comment).
The first source I found that mentions this is grpc/grpc-java#8751 (comment).

I doubt that this would play nicely with the glibc solution we have in Pulsar.

@lhotari
Member

lhotari commented Dec 20, 2024

@BewareMyPower I found out that adding -e LD_PRELOAD=/lib/libgcompat.so.0 to the docker command line in pulsar-client-python's build-support/pulsar-test-service-start.sh script works around the issue. This is addressed in #23762 for future releases of the docker image. For existing docker images, it's a suitable workaround.
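
As a sketch of that workaround, the extra docker run arguments could be produced by a small helper. The function name gcompat_docker_args is hypothetical, and the port mapping, image tag and container command in the usage comment are assumptions rather than the actual contents of pulsar-test-service-start.sh; only the LD_PRELOAD value comes from the comment above.

```shell
#!/bin/sh
# Hypothetical helper: print the `docker run` arguments that preload gcompat.
# The default path is the one quoted above; pass another path as $1 to override.
gcompat_docker_args() {
  preload_lib="${1:-/lib/libgcompat.so.0}"
  printf '%s %s\n' "-e" "LD_PRELOAD=${preload_lib}"
}

# Usage (port, image tag and command are assumptions; adjust to your setup):
#   docker run -d -p 8080:8080 $(gcompat_docker_args) apachepulsar/pulsar:4.0.1 bin/pulsar standalone
```

Keeping the path as a parameter makes it easy to drop the argument again once an image that includes the fix from #23762 is in use.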
