Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable ZK-based segment loading, remove usage of skife from DruidCoordinatorConfig #15705

Merged
merged 25 commits into from
Apr 29, 2024

Conversation

kfaraz
Copy link
Contributor

@kfaraz kfaraz commented Jan 17, 2024

Description

This is a follow up to #14695 to move configs from using skife to regular JsonConfigurator.

Advantages:

  • More structured configs that adhere to rest of the Druid code.
  • Classes that need only a small subset of configs need not be passed the entire DruidCoordinatorConfig object
  • Easier validation of configs done on service startup rather than when running duties.

Change summary

  • Add new config classes CoordinatorRunConfig, CoordinatorKillConfigs, CoordinatorPeriodConfig
  • Bind new config classes to property paths instead of DruidCoordinatorConfig
  • Make DruidCoordinatorConfig act as a container for the above configs
  • Add MetadataCleanupConfig to serve as a common config container for all metadata cleanups
  • Move validations from MetadataCleanupDuty to DruidCoordinatorConfig for cleaner code and to allow coordinator to fail fast on startup in case of invalid config values.
  • Disable zk-based segment loading option
  • All configs (except zk-based loading) retain their current paths and default values

Added configs

1. CoordinatorRunConfig

Path: druid.coordinator
Fields: period, startDelay
Description: Configs related to the running of the coordinator

2. CoordinatorPeriodConfig

Path: druid.coordinator.period
Fields: indexingPeriod, metadataStoreManagementPeriod
Description: Additional periods related to the operation of the coordinator

3. CoordinatorKillConfigs

Path: druid.coordinator.kill
Fields 1: audit, compaction, datasource, rules, supervisors, pendingSegments
Description: Each of the above fields is deserialized as a MetadataCleanupConfig which in-turn contains the following fields:

  • on: Whether cleanup is enabled
  • period: Cleanup period
  • durationToRetain: Duration of metadata to retain

Fields 2: on, period, durationToRetain, bufferPeriod, ignoreDurationToRetain, maxSegments
Description: These fields are related to cleanup of unused segments.

ZK-based segment loading disabled

  • ZK-based segment loading has been deprecated for several Druid releases
  • This PR is getting rid of the config druid.coordinator.loadqueuepeon.type, thus always using HttpLoadQueuePeon
  • The code for ZK-based segment loading is not being removed yet as it might be needed for testing purposes.

Release note

Zookeeper-based segment loading is being removed as it is known to have issues and has been deprecated for several releases. The recent improvements made to the Druid coordinator are also known to work much better with HTTP-based segment loading.

The following configs are being removed as they are not needed anymore:

  • druid.coordinator.load.timeout: Not needed as the default value of this parameter (15 minutes) is known to work well for all clusters
  • druid.coordinator.loadqueuepeon.type: Not needed as this value will always be http
  • druid.coordinator.curator.loadqueuepeon.numCallbackThreads: Not needed as zookeeper(curator)-based segment loading is not an option anymore

Auto-cleanup of compaction configs of inactive datasources is now enabled by default.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz kfaraz added the WIP label Jan 17, 2024
@kfaraz kfaraz added Release Notes and removed WIP labels Jan 18, 2024
Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Checkpointing an initial review with some questions and suggestions.

.withCoordinatorKillMaxSegments(10)
.withCoordinatorKillIgnoreDurationToRetain(false)
.build()
// 100ms is a great price to pay if it removes the flakeyness,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

;)

Suggested change
// 100ms is a great price to pay if it removes the flakeyness,
// 100ms is a great price to pay if it removes the flakiness,

);
this.metadataSupervisorManager = metadataSupervisorManager;
log.warn("This is only an example implementation of a custom duty and"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose if the delegate is KillSupervisors, this log is not accurate anymore?
Also, curious, if this is meant to be an example implementation of a custom duty, should this code just reside in a separate tree in the repo?

Copy link
Contributor Author

@kfaraz kfaraz Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I actually wanted to get rid of this duty altogether and have a simpler DummyCustomCoordinatorDuty which just logged something in every run rather than doing something meaningful. It doesn't make sense to have a full fledged implementation and then advise users against using it.

What do you think?

@kfaraz
Copy link
Contributor Author

kfaraz commented Apr 24, 2024

Thanks a lot for the review, @abhishekrb19 ! I have replied to your comments.

Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the refactor @kfaraz! The configuration is so much easier to follow now. I've left a few non-blocking comments.


public class MetadataCleanupConfig
{
public static final MetadataCleanupConfig DEFAULT = new MetadataCleanupConfig(null, null, null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I look into this again, it looks like this DEFAULT conflcits with the actual defaults applied in the constructor. Do we need this static null config -- won't the json creator below actually handle the default values when it's invoked from the caller?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't actually conflict with the constructor. Since it is using the same constructor, it would have the same default values applied, i.e. true, 1 day, 90 days.

This is just used as short-hand in tests and in the class CoordinatorKillConfigs.

@abhishekrb19 abhishekrb19 merged commit aa46314 into apache:master Apr 29, 2024
88 checks passed
@kfaraz kfaraz deleted the remove_coord_skife_config branch April 30, 2024 02:26
kfaraz added a commit that referenced this pull request May 1, 2024
Follow up to #15705

Changes:
- Remove references to ZK-based segment loading in the docs
- Fix doc for existing config `druid.coordinator.loadqueuepeon.http.repeatDelay`
findingrish pushed a commit to findingrish/druid that referenced this pull request May 3, 2024
* Remove usage of skife from DruidCoordinatorConfig

* Remove old config class

* Address static checks

* Fix tests

* Remove unnecessary mocks

* Fix config typos

* Fix config condition

* Fix test, spotbug check

* Move validation to DruidCoordinatorConfig

* Move DruidCoordinatorConfig to different package

* Fix validation of killunusedconfig

* Simplify and fix KillSupervisorsCustomDuty

* Address review comments

* Fix new tests

* Add KillUnusedSchemasConfig

* Remove KillUnusedSchemasConfig

* Minor renames
kfaraz pushed a commit that referenced this pull request May 6, 2024
* Remove usage of skife from DruidCoordinatorConfig (#15705)

* Followup changes to 15817 (Segment schema publishing and polling) (#16368)
@kfaraz kfaraz changed the title Remove usage of skife from DruidCoordinatorConfig Disable ZK-based segment loading, remove usage of skife from DruidCoordinatorConfig May 10, 2024
@kfaraz kfaraz mentioned this pull request Jul 30, 2024
10 tasks
@kfaraz kfaraz added this to the 30.0.0 milestone Jul 30, 2024
kfaraz added a commit that referenced this pull request Aug 6, 2024
Background:
ZK-based segment loading has been completely disabled in #15705 .
ZK `servedSegmentsPath` has been deprecated since Druid 0.7.1, #1182 .
This legacy path has been replaced by the `liveSegmentsPath` and is not used in the code anymore.

Changes:
- Never create ZK loadQueuePath as it is never used.
- Never create ZK servedSegmentsPath as it is never used.
- Do not create ZK liveSegmentsPath if announcement on ZK is disabled
- Fix up tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants