[DocDB] RocksDB corruption on tablegroups backup creation #22926

Closed
1 task done
pilshchikov opened this issue Jun 19, 2024 · 1 comment
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/high High Priority qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures qa_stress Bugs identified via Stress automation

Comments

@pilshchikov
Contributor

pilshchikov commented Jun 19, 2024

Jira Link: DB-11844

Description

Steps:

  1. 3 nodes, RF=3, m7g.large, 8GB RAM, 2 CPU, ARM-based image
  2. Create 4 tablegroups with 200 tables in each + 10 regular tables, all in one namespace.
  3. Run in a cycle:
    3.1. Load data for 30 min into all tables (4 workloads over 200 tables each + 1 workload over the 10 tables)
    3.2. Run nemesis in the background (packet loss for one node, network partition, clock skew, network partition)
    3.3. Create a backup
    3.4. Stop nemesis
    3.5. Restore into another keyspace
    3.6. Check that the data is the same
    3.7. Drop old namespace
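
For readers who want the ordering of step 3 spelled out, here is a minimal sketch of one cycle. It is only one reading of the list above: `CycleHooks`, `run_cycle`, and every callback name are hypothetical and do not come from the actual test framework, and starting nemesis only after the load finishes is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CycleHooks:
    """Hypothetical callbacks, one per sub-step; not part of any real framework."""
    load_data: Callable[[int], None]
    start_nemesis: Callable[[], None]
    create_backup: Callable[[], None]
    stop_nemesis: Callable[[], None]
    restore_to_new_keyspace: Callable[[], None]
    verify_data_matches: Callable[[], None]
    drop_old_namespace: Callable[[], None]


def run_cycle(hooks: CycleHooks, load_minutes: int = 30) -> None:
    """Run sub-steps 3.1 to 3.7 once, in the order listed in the issue."""
    hooks.load_data(load_minutes)        # 3.1
    hooks.start_nemesis()                # 3.2 (assumed to start after the load)
    try:
        hooks.create_backup()            # 3.3 (the step that failed on cycle 2)
    finally:
        hooks.stop_nemesis()             # 3.4, also runs if the backup raises
    hooks.restore_to_new_keyspace()      # 3.5
    hooks.verify_data_matches()          # 3.6
    hooks.drop_old_namespace()           # 3.7
```

The `try`/`finally` is a design choice of this sketch, so that a failed backup does not leave the fault injection running. The issue does not say how the real harness handles that case.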

On the second cycle, step 3.3 failed with the error:

YW 2024-06-19T06:59:04.564Z [ERROR] d6471a48-9be8-40c8-9c55-0a30ee7a44de from TaskExecutor in TaskPool-6 - Error occurred in subtask taskType : BackupTableYbc, taskState: Failure
com.yugabyte.yw.common.PlatformServiceException: Task id 2bbaba9e-2737-45f1-87dd-f578b3600f63_PGSQL_TABLE_TYPE_tg_9eecda_bc8b6d4e-dd8c-4281-8077-b0768e65d5da status: Failed with error TIMEOUT
	at com.yugabyte.yw.commissioner.YbcTaskBase.handleTaskCompleteStage(YbcTaskBase.java:95)
	at com.yugabyte.yw.commissioner.YbcTaskBase.pollTaskProgress(YbcTaskBase.java:66)
	at com.yugabyte.yw.commissioner.tasks.subtasks.BackupTableYbc.run(BackupTableYbc.java:202)
	at com.yugabyte.yw.commissioner.TaskExecutor$AbstractRunnableTask.run(TaskExecutor.java:908)
	at com.yugabyte.yw.commissioner.TaskExecutor$RunnableSubTask.run(TaskExecutor.java:1332)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at com.yugabyte.yw.common.logging.MDCAwareRunnable.run(MDCAwareRunnable.java:46)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
YW 2024-06-19T06:59:04.564Z [INFO] d6471a48-9be8-40c8-9c55-0a30ee7a44de from DefaultTaskExecutionListener in TaskPool-6 - Task taskType : BackupTableYbc, taskState: Failure is completed
YW 2024-06-19T06:59:04.564Z [DEBUG] d6471a48-9be8-40c8-9c55-0a30ee7a44de from TaskExecutor in TaskPool-6 - SubTaskGroup BackupTableYbc of type CreatingTableBackup at position 1: wait completed in 7237136ms
YW 2024-06-19T06:59:04.565Z [ERROR] d6471a48-9be8-40c8-9c55-0a30ee7a44de from CreateBackup in TaskPool-6 - Error executing task CreateBackup(2342fa49-8374-4ae6-af0b-da73b3b6b259) with error='SubTaskGroup BackupTableYbc of type CreatingTableBackup at position 1: completed 1 out of 1 tasks failed.'.
java.lang.RuntimeException: SubTaskGroup BackupTableYbc of type CreatingTableBackup at position 1: completed 1 out of 1 tasks failed.
	at com.yugabyte.yw.commissioner.TaskExecutor$RunnableTask.runSubTasks(TaskExecutor.java:1255)
	at com.yugabyte.yw.commissioner.tasks.CreateBackup.run(CreateBackup.java:163)
	at com.yugabyte.yw.commissioner.TaskExecutor$AbstractRunnableTask.run(TaskExecutor.java:908)
	at com.yugabyte.yw.commissioner.TaskExecutor$RunnableTask.run(TaskExecutor.java:1121)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at com.yugabyte.yw.common.logging.MDCAwareRunnable.run(MDCAwareRunnable.java:46)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.yugabyte.yw.common.PlatformServiceException: Task id 2bbaba9e-2737-45f1-87dd-f578b3600f63_PGSQL_TABLE_TYPE_tg_9eecda_bc8b6d4e-dd8c-4281-8077-b0768e65d5da status: Failed with error TIMEOUT
	at com.yugabyte.yw.commissioner.YbcTaskBase.handleTaskCompleteStage(YbcTaskBase.java:95)
	at com.yugabyte.yw.commissioner.YbcTaskBase.pollTaskProgress(YbcTaskBase.java:66)
	at com.yugabyte.yw.commissioner.tasks.subtasks.BackupTableYbc.run(BackupTableYbc.java:202)
	at com.yugabyte.yw.commissioner.TaskExecutor$AbstractRunnableTask.run(TaskExecutor.java:908)
	at com.yugabyte.yw.commissioner.TaskExecutor$RunnableSubTask.run(TaskExecutor.java:1332)
	... 6 common frames omitted
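
As a reading aid for the timestamps, the `wait completed in 7237136ms` line lets one back out roughly when the `BackupTableYbc` subtask group started waiting. The sketch below does only that arithmetic; the helper name `wait_started_at` is made up for illustration, and it assumes the logged duration ends at the timestamp of that same log line.

```python
from datetime import datetime, timedelta, timezone


def wait_started_at(completed_at: datetime, wait_ms: int) -> datetime:
    """Subtract the logged wait duration from the completion timestamp."""
    return completed_at - timedelta(milliseconds=wait_ms)


# Values copied from the YW log line above.
completed = datetime(2024, 6, 19, 6, 59, 4, 564000, tzinfo=timezone.utc)
started = wait_started_at(completed, 7237136)

print(started.isoformat(timespec="milliseconds"))
# 2024-06-19T04:58:27.428+00:00
```

So the backup subtask ran for about two hours (120 min 37 s) before reporting TIMEOUT. If the tserver log timestamps are also in UTC, which is an assumption, the 05:34:46 FATAL in this issue falls inside that window.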

But before the backup, a FATAL was thrown:

F20240619 05:34:46 ../../src/yb/tablet/tablet.cc:1660] T 6528e39f7bca409daeb7172776908ed8 P c8f072144c0a4d5dbc64d7eff44f42d1: Failed to write a batch with 0 operations into RocksDB: Corruption (yb/tablet/tablet_metadata.cc:383): Cannot find packing with version 0 for table data_load_tg_71 (table_id=00004100000030008000000000004453 schema version=1 cotable_id=53440000-0000-0080-0030-000000410000 colocation_id=74479484): Not found (yb/dockv/schema_packing.cc:634): Schema packing not found: 0, available_versions: [1]
    @     0xaaaadbe596dc  google::LogMessage::SendToLog()
    @     0xaaaadbe5a580  google::LogMessage::Flush()
    @     0xaaaadbe5ac1c  google::LogMessageFatal::~LogMessageFatal()
    @     0xaaaadd239cb0  yb::tablet::Tablet::WriteToRocksDB()
    @     0xaaaadd24fa34  yb::tablet::Tablet::ApplyOperation()
    @     0xaaaadd24f2dc  yb::tablet::Tablet::ApplyRowOperations()
    @     0xaaaadd21826c  yb::tablet::WriteOperation::DoReplicated()
    @     0xaaaadd209444  yb::tablet::Operation::Replicated()
    @     0xaaaadd20ba50  yb::tablet::OperationDriver::ReplicationFinished()
    @     0xaaaadc36243c  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0xaaaadc3ad7f4  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0xaaaadc3acb70  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0xaaaadc3898c8  yb::consensus::RaftConsensus::UpdateMajorityReplicated()
    @     0xaaaadc356bbc  yb::rpc::StrandTaskWithErrorFunc<>::Run()
    @     0xaaaadd16c990  yb::rpc::Strand::Done()
    @     0xaaaadd175ee4  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0xaaaadd99be98  yb::Thread::SuperviseThread()
    @     0xffffb19878b8  start_thread
    @     0xffffb19e3afc  thread_start
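
To make the FATAL message easier to read: it reports a lookup for packing version 0 on a table whose only available packing version is 1. The toy model below reproduces just that failure shape. It is not YugabyteDB code; `PackingNotFound`, `find_packing`, and the column names are invented for illustration.

```python
from typing import Dict, List


class PackingNotFound(LookupError):
    """Raised when a requested packing version is not registered for a table."""


def find_packing(packings: Dict[int, List[str]], version: int) -> List[str]:
    """Return the column layout stored for `version`, or raise PackingNotFound."""
    if version not in packings:
        raise PackingNotFound(
            f"Schema packing not found: {version}, "
            f"available_versions: {sorted(packings)}"
        )
    return packings[version]


# State implied by the log line: only version 1 exists for data_load_tg_71.
# The column names are hypothetical placeholders.
packings = {1: ["k", "v"]}

try:
    find_packing(packings, 0)
except PackingNotFound as err:
    print(err)  # Schema packing not found: 0, available_versions: [1]
```

Why a write referenced version 0 when only version 1 was registered is the actual bug, and this sketch does not try to explain it.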

Logs are attached in the first comment of the Jira task.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@pilshchikov pilshchikov added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures qa_stress Bugs identified via Stress automation labels Jun 19, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jun 19, 2024
@pilshchikov pilshchikov changed the title [DocDB] [DocDB] RocksDB corruption on tablegroups backup creation Jun 19, 2024
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed status/awaiting-triage Issue awaiting triage priority/medium Medium priority issue labels Jun 20, 2024
@rthallamko3
Contributor

@pilshchikov, this looks like a duplicate of #23047. Can you reopen the issue if it reproduces on the latest build? cc @Huqicheng
