Multiple writers for the same sst caused by closing a shard
#990
Comments
After #998, updates following the closing of a shard will be forbidden. However, some ssts may still be being written when the shard is closed, and these ssts may share the same ids as the new ssts created by the new node, leading to multiple writers on the same sst. Let's fix this problem in another PR. @baojinri
## Rationale
Part of #990. Some background jobs are still allowed to execute, and they can corrupt data when a table is migrated between different nodes because of multiple writers for the same table.

## Detailed Changes
Introduce a flag called `invalid` in the table data to denote whether the serial executor is valid. The flag is protected together with the `TableOpSerialExecutor` in the table data, and the `TableOpSerialExecutor` won't be acquired if the flag is set. In other words, no table operation (updating the manifest, altering the table, and so on) can execute after the flag is set, because all of these operations require the `TableOpSerialExecutor`. Finally, the flag is set when the table is closed.
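A minimal sketch of how such a guard could look. The names `TableData`, `SerialExecState`, `acquire_serial_exec`, and `invalidate_serial_exec` are hypothetical and only illustrate the idea of keeping the `invalid` flag and the `TableOpSerialExecutor` under one lock, so nothing can acquire the executor after the table is closed:

```rust
use tokio::sync::{Mutex, MutexGuard};

/// Placeholder for the real serial executor that every table operation
/// (flush, compaction, manifest update, alter table, ...) must hold.
pub struct TableOpSerialExecutor;

/// The flag and the executor live under the same lock, so checking the
/// flag and handing out the executor is a single atomic step.
pub struct SerialExecState {
    invalid: bool,
    pub executor: TableOpSerialExecutor,
}

pub struct TableData {
    serial_exec: Mutex<SerialExecState>,
}

impl TableData {
    pub fn new() -> Self {
        Self {
            serial_exec: Mutex::new(SerialExecState {
                invalid: false,
                executor: TableOpSerialExecutor,
            }),
        }
    }

    /// Returns the serial executor only while the table is still valid;
    /// returns `None` once the table has been closed.
    pub async fn acquire_serial_exec(&self) -> Option<MutexGuard<'_, SerialExecState>> {
        let state = self.serial_exec.lock().await;
        if state.invalid {
            None
        } else {
            Some(state)
        }
    }

    /// Called when the table is closed; every later acquisition fails,
    /// so no background job can keep writing the table's ssts afterwards.
    pub async fn invalidate_serial_exec(&self) {
        self.serial_exec.lock().await.invalid = true;
    }
}
```

Because every operation funnels through `acquire_serial_exec`, setting the flag once at close time is enough to shut out all later writers, including background jobs already queued.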
## Rationale
Part of #990.

## Detailed Changes
- Abstract an `IdAllocator` in the `common_util` library
- Use `IdAllocator` to allocate sst ids
- Persist `IdAllocator`'s `max id` to the manifest

## Test Plan
- Add a new unit test for `IdAllocator`
- Manual compatibility test: generate ssts with an old ceresdb-server, then deploy a new ceresdb-server with this changeset, write some new data into it to generate new ssts, and check whether the sst ids are correct.
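For illustration, a minimal sketch of an allocator along these lines. All names are hypothetical and the real `IdAllocator` in `common_util` may differ; the key point is that a batch of ids is persisted to the manifest before any id from that batch is handed out, so a restart can never reuse an id:

```rust
use std::sync::Mutex;

struct Inner {
    /// Next id to hand out.
    next_id: u64,
    /// Largest id already persisted to the manifest (the high-water mark).
    persisted_max: u64,
}

pub struct IdAllocator {
    inner: Mutex<Inner>,
    /// How many ids to reserve per manifest write.
    alloc_step: u64,
}

impl IdAllocator {
    /// `persisted_max` is the max id recovered from the manifest; allocation
    /// resumes after it so ids used before a restart are never reused.
    pub fn new(persisted_max: u64, alloc_step: u64) -> Self {
        Self {
            inner: Mutex::new(Inner {
                next_id: persisted_max + 1,
                persisted_max,
            }),
            alloc_step,
        }
    }

    /// Allocate one id. `persist` is invoked only when the reserved range is
    /// exhausted, e.g. a closure that writes the new max id to the manifest.
    pub fn alloc_id<F>(&self, mut persist: F) -> u64
    where
        F: FnMut(u64),
    {
        let mut inner = self.inner.lock().unwrap();
        if inner.next_id > inner.persisted_max {
            let new_max = inner.next_id + self.alloc_step - 1;
            // Persist before handing out any id from the new range.
            persist(new_max);
            inner.persisted_max = new_max;
        }
        let id = inner.next_id;
        inner.next_id += 1;
        id
    }
}
```

Persisting ahead of allocation trades a few skipped ids after a crash for the guarantee that no sst id is ever handed out twice.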
#1009 has fixed the problem. However, #998 actually didn't achieve its goal of preventing updates after the table is closed, and it has been reverted. I guess I'll submit another change set to make everything work.
…pache#1034) This reverts commit 85eb0b7.

## Rationale
The changes introduced by apache#998 are not reasonable. Another fix will address apache#990.

## Detailed Changes
Revert apache#998.
Describe this problem
A shard will be moved off a node when the process panics or for any other reason. All operations related to such a shard should be stopped before the move, especially the write operations. However, background work (flush and compaction, all of which are writes) was not being stopped properly before the move. That caused a serious bug: multiple writers for one sst.
Server version
CeresDB Server
Version: 1.2.2
Git commit: 2e20665
Git branch: main
Opt level: 3
Rustc version: 1.69.0-nightly
Target: aarch64-apple-darwin
Build date: 2023-06-12T13:01:03.592984000Z
Steps to reproduce
Hard to reproduce. If it must be reproduced, the following steps may work:
Expected behavior
No response
Additional Information
No response