
Stateful auto compaction #8573

Merged (14 commits) on Oct 16, 2019

Conversation

@jihoonson (Contributor) commented Sep 23, 2019

Fixes #8489.

Description

In addition to #8489, targetCompactionSizeBytes is dropped for both the compaction task and auto compaction. targetCompactionSizeBytes was added for ease of configuration, but it could mislead users into thinking that segment optimization should be done in terms of size rather than number of rows. Dropping targetCompactionSizeBytes also simplifies things: all tasks can now share the same partitionsSpec, since targetCompactionSizeBytes made sense only for the compaction task.
maxRowsPerSegment is now a mandatory configuration for auto compaction. The compaction task itself can use any partitionsSpec.

Also fixed a bug where auto compaction couldn't compact an interval if it contains only one segment. Note that compaction can split a segment into smaller ones.
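Since auto compaction now keys on row count, its tuning resolves to a row-based dynamic partitions spec. A minimal sketch of that shape (the class below is a hypothetical stand-in, not Druid's actual DynamicPartitionsSpec; the default of 5,000,000 rows matches the value asserted in the tests in this PR):

```java
// Hypothetical stand-in for a row-based dynamic partitions spec:
// segments are cut by row count rather than by byte size.
class DynamicPartitionsSpecSketch {
    static final int DEFAULT_MAX_ROWS_PER_SEGMENT = 5_000_000;

    final int maxRowsPerSegment;
    final long maxTotalRows;

    DynamicPartitionsSpecSketch(Integer maxRowsPerSegment, Long maxTotalRows) {
        // Fall back to defaults when a value is not configured.
        this.maxRowsPerSegment =
            maxRowsPerSegment == null ? DEFAULT_MAX_ROWS_PER_SEGMENT : maxRowsPerSegment;
        this.maxTotalRows = maxTotalRows == null ? Long.MAX_VALUE : maxTotalRows;
    }
}
```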


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths.
  • been tested in a test Druid cluster.


@@ -178,12 +178,11 @@ private CoordinatorStats doRun(
final List<DataSegment> segmentsToCompact = iterator.next();
final String dataSourceName = segmentsToCompact.get(0).getDataSource();

if (segmentsToCompact.size() > 1) {
if (!segmentsToCompact.isEmpty()) {
Contributor:

Line #179 should move into this if block.

Contributor Author:

Thank you for finding this! Will fix it.

Contributor Author:

Fixed.
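The fix under discussion can be sketched as follows (a simplified illustration with stand-in types, not the actual coordinator code): the dataSource lookup must happen only inside the non-empty check, otherwise an empty candidate list would throw on get(0), and the old `size() > 1` guard skipped single-segment intervals entirely.

```java
import java.util.List;

class CompactionGuardSketch {
    // Returns the dataSource to compact, or null when there is nothing to do.
    // get(0) is safe only inside the isEmpty() guard; with the old
    // `size() > 1` condition, single-segment intervals were never compacted.
    static String dataSourceToCompact(List<String> segmentDataSources) {
        if (!segmentDataSources.isEmpty()) {
            return segmentDataSources.get(0);
        }
        return null;
    }
}
```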

@himanshug (Contributor):

For compaction task, any partitionsSpec can be used.

i.e., a user can explicitly run CompactionTask with any partitionsSpec. Auto compaction does not take partitionsSpec as config... right?

@jihoonson (Contributor Author):

i.e., a user can explicitly run CompactionTask with any partitionsSpec. Auto compaction does not take partitionsSpec as config... right?

For now, yes. And I'm planning to add it in the near future.


@Inject(optional = true) @PruneLoadSpec boolean pruneLoadSpec = false;
@Inject(optional = true) @PrunePartitionsSpec boolean prunePartitionsSpec = false;
Contributor:

Do you want to add this to the comment on line 65?

Contributor Author:

Added

@@ -68,17 +66,19 @@
* github.com/google/guice/wiki/FrequentlyAskedQuestions#how-can-i-inject-optional-parameters-into-a-constructor
*/
@VisibleForTesting
public static class PruneLoadSpecHolder
public static class PruneSpecs
Contributor:

Perhaps keep the Holder suffix as it seems to be the naming convention for the optional constructor parameter injection pattern?

Contributor Author:

Added.
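The holder pattern under discussion can be sketched like this (hypothetical names; the Guice annotations are omitted so the sketch stays dependency-free). Guice cannot inject optional constructor parameters directly, so the optional flags live on a holder class that Guice can field-inject; the constructor takes the holder and sees the defaults when nothing is bound:

```java
// Hypothetical sketch of the optional-injection holder pattern.
// In the real code the fields carry @Inject(optional = true) plus binding
// annotations such as @PruneLoadSpec / @PrunePartitionsSpec.
class PruneSpecsHolder {
    static final PruneSpecsHolder DEFAULT = new PruneSpecsHolder();

    boolean pruneLoadSpec = false;       // drop loadSpec when serializing
    boolean prunePartitionsSpec = false; // drop compaction spec when serializing
}

class SegmentSketch {
    final boolean pruneLoadSpec;
    final boolean prunePartitionsSpec;

    // The holder is the single injected parameter; callers outside Guice
    // pass PruneSpecsHolder.DEFAULT.
    SegmentSketch(PruneSpecsHolder holder) {
        this.pruneLoadSpec = holder.pruneLoadSpec;
        this.prunePartitionsSpec = holder.prunePartitionsSpec;
    }
}
```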

@@ -457,6 +488,7 @@ public Builder(DataSegment segment)
this.dimensions = segment.getDimensions();
this.metrics = segment.getMetrics();
this.shardSpec = segment.getShardSpec();
this.compactionPartitionsSpec = segment.compactionPartitionsSpec;
Contributor:

Why not use the getter?

Contributor Author:

Fixed.

Comment on lines -232 to -260
@Test
public void testBucketMonthComparator()
{
DataSegment[] sortedOrder = {
makeDataSegment("test1", "2011-01-01/2011-01-02", "a"),
makeDataSegment("test1", "2011-01-02/2011-01-03", "a"),
makeDataSegment("test1", "2011-01-02/2011-01-03", "b"),
makeDataSegment("test2", "2011-01-01/2011-01-02", "a"),
makeDataSegment("test2", "2011-01-02/2011-01-03", "a"),
makeDataSegment("test1", "2011-02-01/2011-02-02", "a"),
makeDataSegment("test1", "2011-02-02/2011-02-03", "a"),
makeDataSegment("test1", "2011-02-02/2011-02-03", "b"),
makeDataSegment("test2", "2011-02-01/2011-02-02", "a"),
makeDataSegment("test2", "2011-02-02/2011-02-03", "a"),
};

List<DataSegment> shuffled = new ArrayList<>(Arrays.asList(sortedOrder));
Collections.shuffle(shuffled);

Set<DataSegment> theSet = new TreeSet<>(DataSegment.bucketMonthComparator());
theSet.addAll(shuffled);

int index = 0;
for (DataSegment dataSegment : theSet) {
Assert.assertEquals(sortedOrder[index], dataSegment);
++index;
}
}

Contributor:

Why is this test no longer needed?

Contributor Author:

bucketMonthComparator() is not used anywhere.

DataSegmentPusher segmentPusher
)
{
return appenderatorsManager.createOfflineAppenderatorForTask(
taskId,
dataSchema,
appenderatorConfig.withBasePersistDirectory(toolbox.getPersistDir()),
firehoseFactory instanceof IngestSegmentFirehoseFactory,
Contributor:

An alternative to using instanceof is to add another method to FirehoseFactory (i.e., polymorphism)

Contributor Author:

I know this is ugly, but I don't have a better idea. What makes most sense to me is adding a new method getTaskType(), but it's a pretty big refactoring which is not necessary in this PR.
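The polymorphic alternative the reviewer suggests would look roughly like this (hypothetical method name; the actual FirehoseFactory interface has no such method): each factory reports whether it re-reads existing segments, so call sites no longer need an instanceof check.

```java
// Hypothetical sketch: a default method replaces the instanceof check at
// the call site; only the re-ingesting factory overrides it.
interface FirehoseFactorySketch {
    default boolean readsExistingSegments() {
        return false;
    }
}

class IngestSegmentFirehoseFactorySketch implements FirehoseFactorySketch {
    @Override
    public boolean readsExistingSegments() {
        return true; // this factory re-ingests segments already in deep storage
    }
}
```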

return newTuningConfig.withPartitionsSpec(
new DynamicPartitionsSpec(
dynamicPartitionsSpec.getMaxRowsPerSegment(),
dynamicPartitionsSpec.getMaxTotalRowsOr(Long.MAX_VALUE)
Contributor:

Alternative that allocates one fewer TuningConfig for the DynamicPartitionsSpec case:

PartitionsSpec partitionsSpec = newTuningConfig.getGivenOrDefaultPartitionsSpec();
if (partitionsSpec instanceof DynamicPartitionsSpec) {
    final DynamicPartitionsSpec dynamicPartitionsSpec = (DynamicPartitionsSpec) partitionsSpec;
    partitionsSpec = new DynamicPartitionsSpec(
        dynamicPartitionsSpec.getMaxRowsPerSegment(),
        dynamicPartitionsSpec.getMaxTotalRowsOr(Long.MAX_VALUE)
    );
}
return newTuningConfig.withPartitionsSpec(partitionsSpec);

Contributor Author:

Changed.

@@ -191,6 +192,10 @@ public void testRun() throws Exception
Intervals.of("2014-01-01T0%d:00:00/2014-01-01T0%d:00:00", i, i + 1),
segments.get(i).getInterval()
);
Assert.assertEquals(
new DynamicPartitionsSpec(5000000, Long.MAX_VALUE),
Contributor:

Perhaps save this as a named constant since it's used a lot?

Contributor Author:

Added.

@@ -201,7 +208,7 @@ private void updateQueue(String dataSourceName, DataSourceCompactionConfig confi
.filter(holder -> {
final List<PartitionChunk<DataSegment>> chunks = Lists.newArrayList(holder.getObject().iterator());
final long partitionBytes = chunks.stream().mapToLong(chunk -> chunk.getObject().getSize()).sum();
return chunks.size() > 1
return chunks.size() > 0
Contributor:

Prefer !chunks.isEmpty() (similar to the change you made on line 185)

Contributor Author:

Changed.

candidates.segments.get(0).getDataSource(),
candidates.segments.get(0).getInterval()
);
if (candidates.getNumSegments() > 0) {
Contributor:

Prefer !candidates.isEmpty()

Contributor Author:

Changed.

@@ -229,15 +236,58 @@ public boolean hasNext()
}
}

private static boolean needsCompaction(DataSourceCompactionConfig config, SegmentsToCompact candidates)
Contributor:

Which tests cover this new logic?

Contributor Author:

DruidCoordinatorSegmentCompactorTest verifies the behavior of DruidCoordinatorSegmentCompactor.
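The stateful check can be sketched as follows (hypothetical types; the real method also inspects the index spec and other fields): a time chunk needs compaction if any of its segments has no recorded compaction state, or was last compacted with a spec that differs from the current auto compaction config.

```java
import java.util.List;
import java.util.Objects;

class NeedsCompactionSketch {
    // lastCompactionSpecs holds, per segment, the spec it was last compacted
    // with, or null if it has never been auto compacted.
    static boolean needsCompaction(String configuredSpec, List<String> lastCompactionSpecs) {
        for (String lastSpec : lastCompactionSpecs) {
            if (lastSpec == null || !Objects.equals(configuredSpec, lastSpec)) {
                return true; // at least one segment is out of date
            }
        }
        return false; // every segment already matches the current config
    }
}
```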

@jihoonson (Contributor Author):

I also changed DataSegment to remember lastCompactionState, including partitionsSpec and indexSpec, per the discussion at #8489 (comment).


lgtm-com bot commented Oct 11, 2019

This pull request introduces 1 alert when merging 47dca61 into 6c60929 - view on LGTM.com

new alerts:

  • 1 for Self assignment

import java.util.Map;
import java.util.Objects;

public class CompactionState
@himanshug (Contributor) commented Oct 11, 2019:

Can you please add javadoc for this class describing what it is, why it has the fields it does (I know there is some discussion in the proposal, but it would be very non-obvious for someone reading the code), and what guarantees it provides? E.g., something like: if a CompactionTask is run with parameters matching those recorded here, then the row distribution in the segments it creates would be exactly the same.

Contributor Author:

Added javadoc. Please take a look and see if it's enough.

Contributor:

LGTM, thanks.

public class CompactionState
{
private final PartitionsSpec partitionsSpec;
// org.apache.druid.segment.IndexSpec cannot be used here to avoid the dependency cycle
Contributor:

Couldn't understand what the cycle is... do you mean IndexSpec can contain a CompactionState, so JSON serde would fail?

Contributor Author:

The thing is, IndexSpec is in the processing module while this class is in the core module. Since processing already depends on core, I cannot add a core -> processing dependency because it would introduce a cycle. I updated the comment to make it more understandable.

@himanshug (Contributor) commented Oct 15, 2019:

got it, thanks.

maybe in next round of module merge: merge core into processing if there is no use case of anyone depending on druid-core directly :)
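Given that module constraint, the class stores the index spec in a generic form. A minimal sketch (field types simplified and hypothetical; the real class keeps a PartitionsSpec, and keeps the index spec as a plain map because the concrete IndexSpec class lives in the processing module):

```java
import java.util.Map;
import java.util.Objects;

// Simplified sketch of a CompactionState-like class. The index spec is kept
// as Map<String, Object> because core cannot depend on processing (where
// IndexSpec lives) without creating a dependency cycle.
class CompactionStateSketch {
    private final String partitionsSpec; // stand-in for PartitionsSpec
    private final Map<String, Object> indexSpec;

    CompactionStateSketch(String partitionsSpec, Map<String, Object> indexSpec) {
        this.partitionsSpec = partitionsSpec;
        this.indexSpec = indexSpec;
    }

    // Equality is what makes the state useful: if a segment's recorded state
    // equals the state the current config would produce, it can be skipped.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof CompactionStateSketch)) {
            return false;
        }
        CompactionStateSketch that = (CompactionStateSketch) o;
        return Objects.equals(partitionsSpec, that.partitionsSpec)
            && Objects.equals(indexSpec, that.indexSpec);
    }

    @Override
    public int hashCode() {
        return Objects.hash(partitionsSpec, indexSpec);
    }
}
```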

@@ -786,8 +786,7 @@ A description of the compaction config is:
|`dataSource`|dataSource name to be compacted.|yes|
|`taskPriority`|[Priority](../ingestion/tasks.html#priority) of compaction task.|no (default = 25)|
|`inputSegmentSizeBytes`|Maximum number of total segment bytes processed per compaction task. Since a time chunk must be processed in its entirety, if the segments for a particular time chunk have a total size in bytes greater than this parameter, compaction will not run for that time chunk. Because each compaction task runs with a single thread, setting this value too far above 1–2GB will result in compaction tasks taking an excessive amount of time.|no (default = 419430400)|
|`targetCompactionSizeBytes`|The target segment size, for each segment, after compaction. The actual sizes of compacted segments might be slightly larger or smaller than this value. Each compaction task may generate more than one output segment, and it will try to keep each output segment close to this configured size. This configuration cannot be used together with `maxRowsPerSegment`.|no (default = 419430400)|
Contributor:

We should probably add a blurb in the release notes for this, just in case some people set this property and expect something to happen.

isReingest = dataSchema.getDataSource().equals(((IngestSegmentFirehoseFactory) firehoseFactory).getDataSource());
} else {
isReingest = false;
}
Contributor:

Is it possible to drive this from the auto compaction code in the Druid coordinator instead? IngestSegmentFirehoseFactory could be used outside of auto compaction as well. For example, as a user who knows my data flow, I could set up a re-index task to run every day for the previous day's data, sort of a manual compaction. But in that case, the CompactionState doesn't need to be preserved.

Contributor Author:

Hmm, I thought lastCompactionState could be useful for manual compaction as well. How about adding a parameter to taskContext to store lastCompactionState, so that other users can use it if they want?

Contributor:

Yeah, that would work, and it would let the user (in this case, the auto compaction code) explicitly say whether CompactionState should be saved or not.

Contributor Author:

Thanks, added a new task context configuration. I'm still not sure whether this should be documented though.

Contributor:

For now, I think this config is best left undocumented and for auto compaction internal usage only.
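The resulting usage can be sketched like this (the context key name below is an assumption, not confirmed by this thread): the coordinator sets a boolean flag in the task context when it submits an auto compaction task, and only then does the task persist the compaction state.

```java
import java.util.Map;

class TaskContextSketch {
    // Hypothetical key name; treat it as an internal, undocumented flag.
    static final String STORE_COMPACTION_STATE_KEY = "storeCompactionState";

    static boolean shouldStoreCompactionState(Map<String, Object> context) {
        // Default to false so manual re-index tasks do not record state.
        return Boolean.TRUE.equals(context.get(STORE_COMPACTION_STATE_KEY));
    }
}
```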

@jihoonson (Contributor Author):

Some docs should be further updated, but I will do it once this PR and #8570 are merged.

@jihoonson jihoonson merged commit 4046c86 into apache:master Oct 16, 2019
@jihoonson (Contributor Author):

Thanks for the review @ccaominh and @himanshug!

@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019