Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support filtering data in Auto Compaction #11922

Merged
merged 14 commits into from
Nov 24, 2021
Merged

Conversation

maytasm
Copy link
Contributor

@maytasm maytasm commented Nov 16, 2021

Support filtering data in Auto Compaction

Description

This PR adds filter field via the transformSpec to manual compaction and auto compaction.
By setting filter, the user will be able to drop data matching the filter from the datasource.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the PR going to add more changes? One thing I found interesting is that we are re-compacting the data if the filter changes. But the data we are compacting has already passed through a filter. Those records are gone and are not going to come back. Say if I first do a filter on dim = a and then change the filter to dim = b. It might be nice to add a warning in the UI for this when we eventually add those changes.

}

@Nullable
public DimFilter getFilter()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it missing JsonProperty annotation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Fixed

Comment on lines 369 to 376
ClientCompactionTaskTransformSpec transformSpec;
if (config.getTransformSpec() != null) {
transformSpec = new ClientCompactionTaskTransformSpec(
config.getTransformSpec().getFilter()
);
} else {
transformSpec = null;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ClientCompactionTaskTransformSpec transformSpec;
if (config.getTransformSpec() != null) {
transformSpec = new ClientCompactionTaskTransformSpec(
config.getTransformSpec().getFilter()
);
} else {
transformSpec = null;
}
ClientCompactionTaskTransformSpec transformSpec = null;
if (config.getTransformSpec() != null) {
transformSpec = new ClientCompactionTaskTransformSpec(
config.getTransformSpec().getFilter()
);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - could be simplified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@maytasm
Copy link
Contributor Author

maytasm commented Nov 17, 2021

is the PR going to add more changes? One thing I found interesting is that we are re-compacting the data if the filter changes. But the data we are compacting has already passed through a filter. Those records are gone and are not going to come back. Say if I first do a filter on dim = a and then change the filter to dim = b. It might be nice to add a warning in the UI for this when we eventually add those changes.

I have finish adding all changes to this PR (implementation, unit tests, docs, integration tests). Your understanding is correct. That is now true for other auto compaction config that can remove data. For example, if you remove one of your dimension using auto compaction then add that dimension back later, the removed dimension will not come back (although that dimension in future data will not be removed). I'll work on clarifying this in the docs to make sure that user are well aware of this new behavior in a separate PR.

Copy link
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

2 high level comments since you are making a lot of changes in this area:

  1. It would be good to have some sort of integration test with segments in different compaction states so that we can simulate an upgrade where users have segments with different compaction states and validate that we don't re-compact segments each time we add a new field here.
  2. It would be good to add a warning in the docs that filtering data that matches an entire segment may not work as expected since Druid doesn't handle "empty segments" too well today.

All my comments do not need to block this PR, and can be addressed in future changes.

@Test
public void testIteratorReturnsSegmentsAsSegmentsWasCompactedAndHaveDifferentFilter() throws Exception
{
NullHandling.initializeForTests();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this should be done at the test class level.

@maytasm maytasm merged commit bb3d2a4 into apache:master Nov 24, 2021
@maytasm maytasm deleted the IMPLY-11780 branch November 24, 2021 18:56
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants