Support Auto Compaction #1156
Conversation
I didn't write a design doc or an issue since it's straightforward.
Hi @sezruby - thanks for this PR! It will take some time for us to review and verify it. We will get back to you.
Hi @sezruby - just updating you with the status on our end. We are very busy with planned features for the next release of Delta Lake, as well as with preparation for the upcoming Data and AI summit in June. So, it will take us some time to get back to you on this.
Force-pushed from 32ce3ab to 4434db7
Force-pushed from 63ef599 to e347d28
@vkorukanti Could you review the PR when you have the time? TIA!
Just very tiny nits 😎
@@ -160,23 +160,80 @@ class OptimizeExecutor(

   private val isMultiDimClustering = zOrderByColumns.nonEmpty

-  def optimize(): Seq[Row] = {
+  def optimize(isAutoCompact: Boolean = false, targetFiles: Seq[AddFile] = Nil): Seq[Row] = {
isAutoCompact caught my attention (after auto above in Optimize), but I don't really know how to make these two names the same. I feel they should really be the same, but no idea how. Sorry.
That Optimize is for deltaLog. The current name looks fine to me, but let me know if you have a better one.
-    val maxFileSize = sparkSession.sessionState.conf.getConf(
-      DeltaSQLConf.DELTA_OPTIMIZE_MAX_FILE_SIZE)
+    val maxFileSize = if (isAutoCompact) {
+      sparkSession.sessionState.conf.getConf(DeltaSQLConf.AUTO_COMPACT_MAX_FILE_SIZE)
I think it'd be handy to have a table property too.
We can add it with another PR. In Databricks, there's a targetFileSize table property: https://docs.databricks.com/delta/optimizations/file-mgmt.html#set-a-target-size
Force-pushed from 7479f46 to ab18631
      1
    } else {
      // compaction
      2
Let's start with a couple of vals to give some meaning to these numbers first, e.g. val MULTI_DIM_CLUSTERING = 1 and val COMPACTION_MORE_FILES = 2 (not sure these names are correct, but that's the idea).
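A minimal sketch of the suggestion above, assumed to live inside OptimizeExecutor (the constant names are the reviewer's placeholders, the variable name minNumFilesPerBin is hypothetical, and the meaning of 1 and 2 is inferred from the surrounding diff rather than confirmed by the PR):

```scala
// Sketch only, not the PR's final code: give the bare 1 / 2 descriptive names.
private val MULTI_DIM_CLUSTERING = 1   // presumably: Z-ordering can rewrite even a single file
private val COMPACTION_MORE_FILES = 2  // presumably: plain compaction needs at least two files to merge

val minNumFilesPerBin = if (isMultiDimClustering) {
  MULTI_DIM_CLUSTERING
} else {
  // compaction
  COMPACTION_MORE_FILES
}
```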
@vkorukanti @scottsand-db A gentle reminder. This one is simpler than Optimize Write, so I would like to merge this PR first.
     require(maxFileSize > 0, "maxFileSize must be > 0")

-    val candidateFiles = txn.filterFiles(partitionPredicate)
+    val minNumFilesInDir = optimizeType.minNumFiles
+    val (candidateFiles, filesToProcess) = optimizeType.targetFiles
candidateFiles is kept for statistics; not sure it's still required for debugging.
      )
      .stringConf
      .transform(_.toLowerCase(Locale.ROOT))
      .createWithDefault("table")
Actually, I would prefer to set "partition" as the default, though I set "table" because it's the Databricks default. I don't expect any meaningful gain from "table".
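For reference, a hedged sketch of what the full config definition could look like if the default were flipped to "partition" as discussed; it assumes the usual DeltaSQLConf.buildConf helper and java.util.Locale being in scope, and is not the PR's exact code:

```scala
// Hypothetical sketch, not the PR's exact definition.
val AUTO_COMPACT_TARGET =
  buildConf("autoCompact.target")
    .internal()
    .doc(
      "Target files for auto compaction: 'table' (all files in the table), " +
      "'commit' (files added/updated by the triggering commit), or " +
      "'partition' (files in partitions touched by the triggering commit).")
    .stringConf
    .transform(_.toLowerCase(Locale.ROOT))
    .createWithDefault("partition")
```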
Can you please fix the conflicts?
Force-pushed from d7a00bb to 0019c22
@scottsand-db @zsxwing Could you review the PR?
We are also having this issue; we can't define disjoint conditions for merge and optimize if they are run concurrently.
@pedrosalgadowork which issue do you mean? Is it related to auto compaction?
@scottsand-db @zsxwing @tdas Could you review the PR?
@scottsand-db @zsxwing @tdas - can you help review this PR? It's been open for several months now with no recent updates or comments.
Would be great to have this in Delta 2.3. Is the plan to merge it soon?
Looks like there are some conflicts with the new DV stuff; I had to update a few things while rebasing onto the 2.3 release in my fork. Would be great to get more eyes on this and get it merged in, as this is a highly valuable and missing feature.
Signed-off-by: Eunjin Song <[email protected]> Co-authored-by: Sandip Raiyani <[email protected]>
@dennyglee @scottsand-db @zsxwing @tdas Could you review the PR?
@dennyglee @scottsand-db @zsxwing @tdas Could you review the PR? I'll resolve the conflicts once you start actively reviewing.
@dennyglee @scottsand-db @zsxwing @tdas @allisonport-db Could you review the PR?
Is there any obstacle to the review of this PR?
@sezruby In spark/src/main/scala/org/apache/spark/sql/delta/OptimisticTransaction.scala, method "groupFilesIntoBins": if the total file size is greater than autoCompact.maxFileSize and the total number of files is greater than minNumFiles, but after segregating them into bins by size each individual bin ends up with fewer files than minNumFiles, then the files will not be auto-compacted. Is there a particular reason for doing it this way? I understand it might cause compaction of some small files, but isn't that better than no compaction?
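To illustrate the concern, here is a rough sketch of size-based binning followed by a per-bin file-count check, written under the questioner's reading; it is not necessarily the PR's actual groupFilesIntoBins logic, and all names here are illustrative:

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch only: greedily pack file sizes into bins capped at maxFileSize,
// then keep only bins that individually contain at least minNumFiles files.
// If candidate files are large relative to maxFileSize, every bin may hold
// fewer than minNumFiles files, so nothing gets compacted even though the
// table as a whole has more than minNumFiles small files.
def groupFilesIntoBinsSketch(
    fileSizes: Seq[Long],
    maxFileSize: Long,
    minNumFiles: Int): Seq[Seq[Long]] = {
  val bins = ArrayBuffer(ArrayBuffer.empty[Long])
  fileSizes.foreach { size =>
    if (bins.last.nonEmpty && bins.last.sum + size > maxFileSize) {
      bins += ArrayBuffer.empty[Long]
    }
    bins.last += size
  }
  bins.map(_.toSeq).filter(_.size >= minNumFiles).toSeq
}
```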
I think autoCompact should be available now, since it's documented here.
Yes, since Delta 3.1. This PR can be closed now.
Thanks! Do you know whether autoCompact works when using Spark Structured Streaming with a Delta table as the sink?
Description
Support Auto Compaction as described in:
https://docs.databricks.com/delta/optimizations/auto-optimize.html#how-auto-compaction-works
We can support Auto Compaction via a new post-commit hook and OptimizeCommand with a smaller file size threshold. The feature is controlled by the following configs (a short usage sketch follows the list):
- spark.databricks.delta.autoCompact.enabled (default: false)
- spark.databricks.delta.autoCompact.maxFileSize (default: 128MB)
- spark.databricks.delta.autoCompact.minNumFiles (default: 50)
The configs above are the same as Databricks Auto Compaction.
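A small usage sketch, assuming an active SparkSession named `spark` and an illustrative table path; only the config keys come from this PR, the values shown are the documented defaults except `enabled`:

```scala
// Enable auto compaction for the session.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.maxFileSize", 128L * 1024 * 1024)
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", 50L)

// Any commit to a Delta table can now trigger the post-commit hook, which
// compacts small files using the reduced file size threshold.
spark.range(0, 1000).write.format("delta").mode("append").save("/tmp/delta/auto_compact_demo")
```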
New config1 - autoCompact.maxCompactBytes
As auto compaction is triggered after every table update, I introduced another config to control the total amount of data to be optimized in a single auto compaction operation (an illustrative sketch follows):
- spark.databricks.delta.autoCompact.maxCompactBytes (default: 20GB)
In Databricks, this amount is adjusted based on available cluster resources; the config is a quick and easy workaround for that.
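An illustrative sketch, not necessarily the PR's code, of how such a cap can bound a single pass:

```scala
// Accumulate candidate file sizes until the running total would exceed the
// cap; files beyond that point are left for a later auto compaction pass.
def capByTotalBytes(fileSizes: Seq[Long], maxCompactBytes: Long): Seq[Long] = {
  var total = 0L
  fileSizes.takeWhile { size =>
    total += size
    total <= maxCompactBytes
  }
}
```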
New config2 - autoCompact.target
The PR adds another new config, autoCompact.target, to change the target files for auto compaction:
- spark.databricks.delta.autoCompact.target (default: "partition")
  - table: target all files in the table
  - commit: target only the added/updated files of the commit that triggers auto compaction
  - partition: target only the partitions containing any of the added/updated files of the commit that triggers auto compaction
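A rough sketch, illustrative only and not the PR's actual API, of how the three modes could translate into file selection (the function and parameter names are hypothetical):

```scala
import org.apache.spark.sql.delta.actions.AddFile

// `allFiles` is every live file in the table; `committedFiles` is the set of
// files added/updated by the commit that triggered auto compaction.
def selectTargetFiles(
    allFiles: Seq[AddFile],
    committedFiles: Seq[AddFile],
    target: String): Seq[AddFile] = target match {
  case "table"  => allFiles
  case "commit" => committedFiles
  case "partition" =>
    val touchedPartitions = committedFiles.map(_.partitionValues).toSet
    allFiles.filter(f => touchedPartitions.contains(f.partitionValues))
  case other =>
    throw new IllegalArgumentException(s"Unknown autoCompact.target: $other")
}
```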
Users usually write/update data only for a few partitions and don't expect changes in other partitions. If the table is not optimized, the table setting might unexpectedly cause conflicts with other partitions, and the added/updated files in the triggering commit might not be optimized if there are many small files in other partitions.
Fixes #815
How was this patch tested?
Unit tests
Does this PR introduce any user-facing changes?
Yes, it adds the Auto Compaction feature.