Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGSEGV calling EventListener's onCompactionBegin() callback method with rocksdbjni 7.4.5 #10686

Open
dparrella opened this issue Sep 15, 2022 · 5 comments · May be fixed by #12844
Open

SIGSEGV calling EventListener's onCompactionBegin() callback method with rocksdbjni 7.4.5 #10686

dparrella opened this issue Sep 15, 2022 · 5 comments · May be fixed by #12844
Labels

Comments

@dparrella
Copy link

Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

Hello RocksDB Team!

My team has found a serious issue in production where RocksDB crashed and took the JVM down with it. This started when we introduced an EventListener that we were hoping would give us more targeted logging of events as they occur. In fact we added this to help with investigating #10258. The issue occurs specifically when background compaction starts and RocksDB attempts to call our EventListener's onCompactionBegin() method.

Any insights as to what we may be doing wrong or help to resolve the issue would be much appreciated!

Expected behavior

Background compaction event callbacks should not crash the JVM.

Actual behavior

JVM exits with SIGSEGV while calling onCompactionBegin() callback method on EventListener.

Partial dump from crash:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007eff76e93c51, pid=1, tid=146
#
# JRE version: OpenJDK Runtime Environment Corretto-11.0.16.9.1 (11.0.16.1+9) (build 11.0.16.1+9-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-11.0.16.9.1 (11.0.16.1+9-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x5a9c51]  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<1097844ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 1097844ul>::oop_access_barrier(void*)+0x1
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-11/issues/
#
---------------  T H R E A D  ---------------

Current thread (0x00007efe78004000):  JavaThread "Thread-3" [_thread_in_vm, id=146, stack(0x00007efe857fe000,0x00007efe861fd000)]

Stack: [0x00007efe857fe000,0x00007efe861fd000],  sp=0x00007efe861fa870,  free space=10226k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5a9c51]  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<1097844ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 1097844ul>::oop_access_barrier(void*)+0x1
V  [libjvm.so+0x91bba0]  jni_GetMethodID+0x70
C  [librocksdbjni1005257159522923149.so+0x2799fe]  rocksdb::EventListenerJniCallback::OnCompactionBegin(rocksdb::DB*, rocksdb::CompactionJobInfo const&)+0x8e
C  [librocksdbjni1005257159522923149.so+0x3c84f4]  rocksdb::DBImpl::NotifyOnCompactionBegin(rocksdb::ColumnFamilyData*, rocksdb::Compaction*, rocksdb::Status const&, rocksdb::CompactionJobStats const&, int)+0x204
C  [librocksdbjni1005257159522923149.so+0x3cda0e]  rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x119e
C  [librocksdbjni1005257159522923149.so+0x3ce9e7]  rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xd7
C  [librocksdbjni1005257159522923149.so+0x3cef2e]  rocksdb::DBImpl::BGWorkCompaction(void*)+0x4e
C  [librocksdbjni1005257159522923149.so+0x6e9248]  rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x228
C  [librocksdbjni1005257159522923149.so+0x6e9412]  rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x62


siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000

Register to memory mapping:

RAX=0x0 is NULL
RBX=0x00007eff411a0438 points into unknown readable memory: 0x693c1495ffff0006 | 06 00 ff ff 95 14 3c 69
RCX=0x0000000000000004 is an unknown value
RDX=0x0 is NULL
RSP=0x00007efe861fa870 is pointing into the stack for thread: 0x00007efe78004000
RBP=0x00007efe861fa8d0 is pointing into the stack for thread: 0x00007efe78004000
RSI=0x000000000000000b is an unknown value
RDI=0x0 is NULL
R8 =0x00000000001349ef is an unknown value
R9 =0x0000000000000028 is an unknown value
R10=0x0 is NULL
R11=0x0 is NULL
R12=0x00007efe861fa890 is pointing into the stack for thread: 0x00007efe78004000
R13=0x00007efed217a40d: <offset 0x00000000008ed40d> in /tmp/librocksdbjni1005257159522923149.so at 0x00007efed188d000
R14=0x00007eff411a48a0 points into unknown readable memory: 0x4a28c588ffff0004 | 04 00 ff ff 88 c5 28 4a
R15=0x00007efe78004000 is a thread


Registers:
RAX=0x0000000000000000, RBX=0x00007eff411a0438, RCX=0x0000000000000004, RDX=0x0000000000000000
RSP=0x00007efe861fa870, RBP=0x00007efe861fa8d0, RSI=0x000000000000000b, RDI=0x0000000000000000
R8 =0x00000000001349ef, R9 =0x0000000000000028, R10=0x0000000000000000, R11=0x0000000000000000
R12=0x00007efe861fa890, R13=0x00007efed217a40d, R14=0x00007eff411a48a0, R15=0x00007efe78004000
RIP=0x00007eff76e93c51, EFLAGS=0x0000000000010246, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00007efe861fa870)
0x00007efe861fa870:   00007efe861fa8d0 00007eff77203cd8
0x00007efe861fa880:   0000000000000000 0000000000000000
0x00007efe861fa890:   00007efe001349ef 0000000000000000
0x00007efe861fa8a0:   00007efe861fa8d0 00007efe78004000

Instructions: (pc=0x00007eff76e93c51)
0x00007eff76e93b51:   03 b8 05 00 00 00 e9 2a fd ff ff 88 4a 02 b8 03
0x00007eff76e93b61:   00 00 00 e9 1d fd ff ff 88 42 02 b8 03 00 00 00
0x00007eff76e93b71:   e9 e0 fe ff ff 41 88 46 02 b8 03 00 00 00 e9 e5
0x00007eff76e93b81:   fd ff ff 41 88 46 03 b8 04 00 00 00 e9 d7 fd ff
0x00007eff76e93b91:   ff 88 42 03 b8 04 00 00 00 e9 b7 fe ff ff 41 88
0x00007eff76e93ba1:   46 04 b8 05 00 00 00 e9 bc fd ff ff 88 42 04 b8
0x00007eff76e93bb1:   05 00 00 00 e9 9c fe ff ff 41 88 46 05 b8 06 00
0x00007eff76e93bc1:   00 00 e9 a1 fd ff ff 88 42 05 b8 06 00 00 00 e9
0x00007eff76e93bd1:   81 fe ff ff 83 c8 80 49 c1 ed 38 41 88 46 07 45
0x00007eff76e93be1:   88 6e 08 b8 09 00 00 00 e9 7b fd ff ff 83 c8 80
0x00007eff76e93bf1:   49 c1 ed 38 88 42 07 44 88 6a 08 b8 09 00 00 00
0x00007eff76e93c01:   e9 50 fe ff ff 41 88 46 06 b8 07 00 00 00 e9 55
0x00007eff76e93c11:   fd ff ff 88 42 06 b8 07 00 00 00 e9 35 fe ff ff
0x00007eff76e93c21:   66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
0x00007eff76e93c31:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c41:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c51:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c61:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c71:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c81:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93c91:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93ca1:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93cb1:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93cc1:   48 8b 07 48 89 e5 5d c3 0f 1f 80 00 00 00 00 55
0x00007eff76e93cd1:   48 8b 37 48 89 e5 48 85 f6 74 10 48 8d 05 cd 38
0x00007eff76e93ce1:   f5 00 5d 48 8b 38 e9 14 5d 76 00 31 c0 5d c3 55
0x00007eff76e93cf1:   48 8b 37 48 89 e5 48 85 f6 74 10 48 8d 05 ad 38
0x00007eff76e93d01:   f5 00 5d 48 8b 38 e9 f4 5c 76 00 31 c0 5d c3 55
0x00007eff76e93d11:   48 89 e5 53 48 83 ec 08 48 8b 1f 48 85 db 74 08
0x00007eff76e93d21:   48 89 df e8 97 b9 1c 00 48 83 c4 08 48 89 d8 5b
0x00007eff76e93d31:   5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55
0x00007eff76e93d41:   48 89 e5 53 48 83 ec 08 48 8b 1f 48 85 db 74 08


Stack slot to memory mapping:
stack at sp + 0 slots: 0x00007efe861fa8d0 is pointing into the stack for thread: 0x00007efe78004000
stack at sp + 1 slots: 0x00007eff77203cd8: <offset 0x0000000000919cd8> in /usr/lib/jvm/java-11-amazon-corretto.x86_64/lib/server/libjvm.so at 0x00007eff768ea000
stack at sp + 2 slots: 0x0 is NULL
stack at sp + 3 slots: 0x0 is NULL
stack at sp + 4 slots: 0x00007efe001349ef is an unknown value
stack at sp + 5 slots: 0x0 is NULL
stack at sp + 6 slots: 0x00007efe861fa8d0 is pointing into the stack for thread: 0x00007efe78004000
stack at sp + 7 slots: 0x00007efe78004000 is a thread

Steps to reproduce the behavior

We defined a simple EventListener implementation that subscribes to the following events:

  • EnabledEventCallback.ON_BACKGROUND_ERROR
  • EnabledEventCallback.ON_COMPACTION_BEGIN
  • EnabledEventCallback.ON_COMPACTION_COMPLETED
  • EnabledEventCallback.ON_ERROR_RECOVERY_BEGIN
  • EnabledEventCallback.ON_ERROR_RECOVERY_COMPLETED
  • EnabledEventCallback.ON_EXTERNAL_FILE_INGESTED
  • EnabledEventCallback.ON_TABLE_FILE_DELETED

Then in our onCompactionBegin() implementation we log (perhaps a bit too verbosely) using SLF4J:

    private void logCompactionInfo(final String message, final CompactionJobInfo compactionJobInfo) {
        log.info("{}: Status code: [{}], reason: [{}], inputFiles: [{}], outputFiles: [{}], jobId: [{}],"
                        + "baseInputLevel: [{}], columnFamilyName: [{}], compression: [{}], threadId: [{}].",
                message,
                compactionJobInfo.status().getCodeString(),
                compactionJobInfo.compactionReason(),
                compactionJobInfo.inputFiles(),
                compactionJobInfo.outputFiles(),
                compactionJobInfo.jobId(),
                compactionJobInfo.baseInputLevel(),
                new String(compactionJobInfo.columnFamilyName(), StandardCharsets.UTF_8),
                compactionJobInfo.compression(),
                compactionJobInfo.threadId());
        compactionJobInfo.tableProperties().entrySet().stream().forEach(
                entry -> log.info("RocksDB table properties for key [{}], ColumnFamilyName: [{}], CompressionName: [{}], CreationTime: [{}],"
                                + "DataSize: [{}], FilterPolicyName: [{}], FilterSize: [{}], FixedKeyLen: [{}], MergeOperatorName: [{}],"
                                + "NumDataBlocks: [{}], NumDeletions: [{}], NumRangeDeletions: [{}], ReadableProperties: [{}], "
                                + "TopLevelIndexSize: [{}], UserCollectedProperties:[{}]",
                        entry.getKey(),
                        new String(entry.getValue().getColumnFamilyName(), StandardCharsets.UTF_8),
                        entry.getValue().getCompressionName(),
                        entry.getValue().getCreationTime(),
                        entry.getValue().getDataSize(),
                        entry.getValue().getFilterPolicyName(),
                        entry.getValue().getFilterSize(),
                        entry.getValue().getFixedKeyLen(),
                        entry.getValue().getMergeOperatorName(),
                        entry.getValue().getNumDataBlocks(),
                        entry.getValue().getNumDeletions(),
                        entry.getValue().getNumRangeDeletions(),
                        entry.getValue().getReadableProperties(),
                        entry.getValue().getTopLevelIndexSize(),
                        entry.getValue().getUserCollectedProperties())
        );
    }

What's strange is that we saw this for the first time in one of our production environments. The same code made its way through several stages of test environments before going to production.

Also worth noting is that the data being loaded into RocksDB is identical to the previously deployed version of our application. The RocksDB version is also the same. The only difference is that we added the EventListener.

One thing that is different with this production system is the size of the database. We store a "snapshot" of the database in a tar.gz file that is downloaded when the application bootstraps. The compressed archive is about 2.5 GB - much larger than what we have in our test systems.

The listener is added like this:

final Options options = new Options()
                .setCompactionStyle(getRocksDBCompactionStyle())
                .setCompressionType(getCompressionType())
                .setCreateIfMissing(true)
                .setIncreaseParallelism(getIncreaseParallelismThreadCount())
                .setKeepLogFileNum(keepLogFileNum)
                .setListeners(List.of(new RocksDBEventListener()))
                .setLogger(new RocksDBLogger().getLogger())
                .setMaxOpenFiles(maxOpenFiles)
                .setMaxSubcompactions(maxSubcompactions)
                .setAllowMmapReads(allowMmapReads)
                .setAllowMmapWrites(allowMmapWrites)
                .setTableFormatConfig(tableConfig)
                .optimizeForPointLookup(blockCacheSizeMB)
                .setRateLimiter(getRateLimiter());
@PuGuanyu
Copy link

I met the same problem, besides, I found NoClassDefFoundError in the dump output.
I guess the error my be related to spring boot classloader?

Event: 810.052 Thread 0x00007f2307048000 Exception <a 'java/lang/NoClassDefFoundError'{0x00000007fcf45240}: org/rocksdb/FlushJobInfo> (0x00000007fcf45240) thrown at [./src/hotspot/share/classfile/systemDictionary.cpp, line 235]
Event: 810.053 Thread 0x00007f223b358000 Exception <a 'java/lang/NoClassDefFoundError'{0x00000007fcf48f40}: org/rocksdb/CompactionJobInfo> (0x00000007fcf48f40) thrown at [./src/hotspot/share/classfile/systemDictionary.cpp, line 235]

@zz-x404
Copy link

zz-x404 commented Apr 23, 2024

Hey there, looks like this has been out standing for quite a while.

But I recently met the same issue - Adding EventListener caused the application to exit with SIGSEGV. And I can also find NoClassDefFoundError in the error report. I'm using rocksdbjni 8.6.7, running a Java Spring Boot 3.2.3 project.

Wondering if there's any ongoing work addressing this issue? Thanks!

@zz-x404
Copy link

zz-x404 commented Apr 23, 2024

As a workaround, I tried implementing the constructor of my subclass of AbstractEventListener

/**
* Creates an Event Listener that will
* receive all callbacks from C++.
* <p>
* If you don't need all callbacks, it is much more efficient to
* just register for the ones you need by calling
* {@link #AbstractEventListener(EnabledEventCallback...)} instead.
*/
protected AbstractEventListener() {
this(ON_FLUSH_COMPLETED, ON_FLUSH_BEGIN, ON_TABLE_FILE_DELETED, ON_COMPACTION_BEGIN,
ON_COMPACTION_COMPLETED, ON_TABLE_FILE_CREATED, ON_TABLE_FILE_CREATION_STARTED,
ON_MEMTABLE_SEALED, ON_COLUMN_FAMILY_HANDLE_DELETION_STARTED, ON_EXTERNAL_FILE_INGESTED,
ON_BACKGROUND_ERROR, ON_STALL_CONDITIONS_CHANGED, ON_FILE_READ_FINISH, ON_FILE_WRITE_FINISH,
ON_FILE_FLUSH_FINISH, ON_FILE_SYNC_FINISH, ON_FILE_RANGE_SYNC_FINISH,
ON_FILE_TRUNCATE_FINISH, ON_FILE_CLOSE_FINISH, SHOULD_BE_NOTIFIED_ON_FILE_IO,
ON_ERROR_RECOVERY_BEGIN, ON_ERROR_RECOVERY_COMPLETED);
}

only calling super(ON_BACKGROUND_ERROR, ON_ERROR_RECOVERY_BEGIN, ON_ERROR_RECOVERY_COMPLETED) for what I need.

At least now I'm not seeing the application exiting nor getting the NoClassDefFoundError errors.

Hope that narrows the problem down a little, perhaps.

@rhubner
Copy link
Contributor

rhubner commented Jun 18, 2024

Hello @dparrella @PuGuanyu @zz-x404

Thank you for reporting this issue and sorry it took us so long to pick it up.
I did some internal tests and I'm not able to reproduce this issue. I Even modified tests that compaction is triggered from RocksDB code and not from Java. Unfortunately no luck.

Do you think you will be able to recompile RocksDB in debug mode with DEBUG_LEVEL=2 PORTABLE=1 and run it with your code. This will provide a better stack trace. Can you please also resolve the line number as described here : https://github.com/facebook/rocksdb/wiki/JNI-Debugging#stack This information with exach SHA commit hash you used to compile will help us a lot.

If you are not able to compile source code by yourself, can you please provide Maven project with JUnit test where you are able to reproduce this error.

Radek

@rhubner
Copy link
Contributor

rhubner commented Jul 1, 2024

Hello Hello @dparrella @PuGuanyu @zz-x404,

I'm able to replicate your issue. It's really problem with class loader. Looks simmilar like #691 . I will submit fix soon.

Radek

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants