Support online compaction for paimon #2194

shidayang · 2023-10-26T11:30:33Z

shidayang
Oct 26, 2023
Collaborator

Currently, Paimon has two main forms of compaction. One is inline compaction, which is embedded in the write task. The other is dedicated compaction, which starts a separate Flink job to compact a specific table. If a table has multiple writers, as with partial updates, then dedicated compaction is required.

Dedicated compaction has several drawbacks:
It is cumbersome to start a Flink job every time a table is created.
As the number of compaction jobs increases, the management cost becomes high. Which job compacts which table? What is the status of each job? Can compaction keep up with writes?
If compaction runs continuously, it wastes resources on tables with high peak traffic but low normal traffic. How can compaction run only when a table actually needs it?

I would therefore like to start a discussion on online compaction: AMS discovers the tables that need compaction and schedules them onto shared resources to be compacted. Paimon users are welcome to give feedback on whether this is necessary.
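To make the "discover, then schedule onto shared resources" idea concrete, here is a minimal sketch. It is not AMS code: `TableStatus`, the `small_files` metric, the threshold, and the slot count are all hypothetical names and values chosen for illustration. It only shows that tables below the threshold consume no compaction resources and that the busiest tables are served first.

```python
from dataclasses import dataclass


@dataclass
class TableStatus:
    name: str
    small_files: int  # hypothetical metric: small files not yet compacted


def pick_tables_to_compact(tables, threshold, free_slots):
    """Return the names of tables that need compaction, most urgent first,
    capped at the number of free slots in the shared resource pool."""
    pending = [t for t in tables if t.small_files >= threshold]
    pending.sort(key=lambda t: t.small_files, reverse=True)
    return [t.name for t in pending[:free_slots]]


# Illustrative data only; a real service would read these metrics from table metadata.
tables = [
    TableStatus("orders", 120),
    TableStatus("users", 3),
    TableStatus("clicks", 45),
    TableStatus("logs", 80),
]
selected = pick_tables_to_compact(tables, threshold=40, free_slots=2)
print(selected)
```

With these sample values, `clicks` also exceeds the threshold but has to wait for a free slot, and `users` is never scheduled.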

ChenShuai1981 · 2023-10-28T03:25:45Z

ChenShuai1981
Oct 28, 2023

For sparse tables, inline compaction hurts write performance, so we prefer dedicated compaction. That means we need a companion job for each sparse Paimon table, and the effort of managing those companion jobs grows. We want to know the running status of each dedicated compaction job, including its performance: how many small files were compacted into large files, how long it took, and whether it can keep up with the write rate. We hope Amoro can support online compaction for Paimon, just as it supports Iceberg now.

HuangFru · 2023-11-13T02:49:34Z

HuangFru
Nov 13, 2023
Collaborator

If AMS supports online compaction of Paimon tables, we should also consider the following issues:

Should AMS itself decide when to trigger compaction, or should it use a listener-like mechanism to receive a trigger signal and then start compaction?
Is there a rate-limiting mechanism for peak periods, when multiple tables need compaction at the same time? Shared resources are generally limited; otherwise sharing them would not save anything.
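One simple form such a limit could take is a cap on concurrent compaction tasks. The sketch below assumes a hypothetical `CompactionLimiter` backed by a semaphore and a stand-in `fake_compact` function; none of these names come from AMS or Paimon. Requests beyond the cap simply wait for a slot.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class CompactionLimiter:
    """Caps how many compaction tasks may run at once (hypothetical helper)."""

    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0  # highest concurrency observed, for inspection

    def run(self, table, compact):
        with self._slots:  # blocks while all slots are taken
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return compact(table)
            finally:
                with self._lock:
                    self.running -= 1


def fake_compact(table):
    time.sleep(0.05)  # stand-in for real compaction work
    return table


limiter = CompactionLimiter(max_concurrent=2)
tables = ["t1", "t2", "t3", "t4", "t5"]

# Five tables ask for compaction at once; at most two run at any moment.
with ThreadPoolExecutor(max_workers=5) as pool:
    done = list(pool.map(lambda t: limiter.run(t, fake_compact), tables))

print(done, limiter.peak)
```

A plain cap treats all tables equally. A real scheduler would probably also need priorities or per-table quotas so that one busy table cannot starve the others.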

ChenShuai1981 · 2023-11-14T01:42:36Z

ChenShuai1981
Nov 14, 2023

I think AMS should manage the compaction of all Paimon tables.
However, not all Paimon tables are suited to compaction on shared resources. It is like session mode versus application mode in Apache Flink: application mode suits large jobs, each with its own resources, while session mode suits the remaining jobs, which share resources.
