optionally disable all of hardcoded zookeeper use #9507
Conversation
@clintropolis can you please review/approve this one?
Sorry, yes, I have been meaning to take a look, just haven't had much time to spare lately 😅. I will try to have a look sometime today/this weekend.
@clintropolis thanks, I totally understand that :)
Some thoughts as I've been reviewing this (sorry I haven't finished yet): Do you view this as an interim configuration, to allow your work to proceed on an alternative discovery mechanism, until we can decouple zookeeper specific code from all of the places that need to check this setting? Or is the plan to leave it like this? So far I find it kind of ugly to have a setting like this due to all of the if/else branches it causes, but maybe there is some obvious reason I haven't got to yet on why we aren't adding some sort of interface for this instead. I'll keep reviewing, and try to finish up later tonight.
@clintropolis Your concern is legit. I am not sure if you have seen https://groups.google.com/forum/#!msg/druid-development/tWnwPyL0Vk4/2uLwqgQiAAAJ and https://groups.google.com/forum/#!msg/druid-development/eIWDPfhpM_U/AzMRxSQGAgAJ , but please take a look at those. They explain that the extensible "discovery" to be implemented in extensions covers "node/service discovery" and "leader election". Other zookeeper usage, for segment/task management, should be totally replaced by HTTP based segment/task management.
The next PR in this chain would have a kubernetes based impl for all of those, i.e. node/service discovery and leader election.
Btw, re-read my last comment and the tone is sort of ambiguous on whether I'm being negative - I'm totally not trying to be a hater and I think this approach is fine as is, just trying to get a better idea on what finished #9053 looks like 😅.
```diff
 }

 @GET
 @Path("/readiness")
 public Response getReadiness()
 {
-  if (coordinator.isStarted()) {
+  if (segmentLoadDropHandler.isStarted()) {
```
Hmm, is this actually true if still using zk segment loading? It was wrong before for the same reasons, I think, for http segment loading. Maybe this resource should accept either segment loader, like SegmentListerResource does, or we should have another http resource that we bind instead, depending on which mode is enabled?
The primary thing checked by this endpoint is that the historical startup sequence has finished loading/announcing the segments it already had on disk, which is often a time consuming activity. That is still ensured by segmentLoadDropHandler.isStarted().
```diff
   JsonConfigProvider.bind(binder, CURATOR_CONFIG_PREFIX, CuratorConfig.class);
   JsonConfigProvider.bind(binder, EXHIBITOR_CONFIG_PREFIX, ExhibitorConfig.class);
 }

 @Provides
 @LazySingleton
 @SuppressForbidden(reason = "System#err")
-public CuratorFramework makeCurator(CuratorConfig config, EnsembleProvider ensembleProvider, Lifecycle lifecycle)
+public CuratorFramework makeCurator(ZkEnablementConfig zkEnablementConfig, CuratorConfig config, EnsembleProvider ensembleProvider, Lifecycle lifecycle)
```
I haven't looked too closely yet, but I wonder if this would be better if this was marked as @Nullable and returned null instead of throwing the runtime exception, and shift the burden of validating that curator is available to the things that do require it, such as inventory, segment loading, and task management? The other stuff might be able to be simplified a bit and not have to care about having the setting, and could probably avoid some of the signature changes to use providers.
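A minimal sketch of the alternative being floated here, with a hypothetical consumer class (the names below are illustrative, not code from this PR):

```java
import javax.annotation.Nullable;

import org.apache.curator.framework.CuratorFramework;

// Illustrative only: the provider hands out null when zk is disabled, and each
// component that genuinely needs zookeeper validates availability itself.
public class NullableCuratorConsumerSketch
{
  @Nullable
  private final CuratorFramework curator;

  public NullableCuratorConsumerSketch(@Nullable CuratorFramework curator)
  {
    this.curator = curator;
  }

  // hypothetical zk-only operation: the availability check lives here, not in the provider
  public void doSomethingThatNeedsZk(String path) throws Exception
  {
    if (curator == null) {
      throw new IllegalStateException(
          "zookeeper is disabled (druid.zk.service.enabled=false) but this code path requires it"
      );
    }
    curator.checkExists().forPath(path); // example zk call, similar to the moveSegment diff below
  }
}
```

With this shape, components that never touch zk would not need to know the setting exists at all.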
The main reason for failing loudly here is that if I forgot to disable zk in some code path, this method would immediately fail on node start with guice injection errors, leading to quick discovery of exactly what was missed.
This helped me catch quite a few places that I missed.
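For illustration, a self-contained sketch of that fail-loudly-at-injection-time behavior using plain Guice; the class names are stand-ins, and the actual exception/message in the PR may differ:

```java
import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;
import com.google.inject.Provides;

public class FailLoudlyProviderSketch
{
  static class FakeZkClient {}            // stand-in for CuratorFramework

  static class ForgottenZkUser
  {
    @Inject
    ForgottenZkUser(FakeZkClient client)  // a code path that still injects the zk client
    {
    }
  }

  public static void main(String[] args)
  {
    final boolean zkEnabled = false;      // i.e. druid.zk.service.enabled=false

    Injector injector = Guice.createInjector(new AbstractModule()
    {
      @Provides
      FakeZkClient makeZkClient()
      {
        if (!zkEnabled) {
          // fail loudly: any leftover zk dependency surfaces as a provision
          // error on node start, pointing at the class that asked for it
          throw new RuntimeException("zookeeper is disabled, yet the zk client was requested");
        }
        return new FakeZkClient();
      }
    });

    injector.getInstance(ForgottenZkUser.class); // throws a ProvisionException naming ForgottenZkUser
  }
}
```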
```diff
@@ -443,7 +455,7 @@ public void moveSegment(
       () -> {
         try {
           if (serverInventoryView.isSegmentLoadedByServer(toServer.getName(), segment) &&
-              curator.checkExists().forPath(toLoadQueueSegPath) == null &&
+              (curator == null || curator.checkExists().forPath(toLoadQueueSegPath) == null) &&
```
This doesn't necessarily need to change in this PR, but it seems kind of leaky that this thing has a CuratorFramework at all; it seems like the load peon should provide this check so it can just be a no-op for non-zk, and then DruidCoordinator no longer needs a curator or zk paths, I think?
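A sketch of the direction being suggested here, with a hypothetical method name (the existing LoadQueuePeon does not necessarily expose this):

```java
// Hypothetical addition to the load-queue peon abstraction: the coordinator
// asks the peon whether a load is already queued for a segment, instead of
// checking zookeeper paths itself.
interface LoadQueuePeonSketch
{
  // a zk-backed implementation would check the load queue znode via curator;
  // an http-backed implementation would check its in-memory queue, making the
  // zookeeper part effectively a no-op
  boolean isSegmentQueued(String segmentId);
}
```

With something like this, the moveSegment check shown above could drop the curator reference entirely.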
Yeah, I hope zk based segment management will just go away, and this will go away as well.
```diff
@@ -67,7 +68,7 @@ public LoadQueuePeon giveMePeon(ImmutableDruidServer server)
       return new HttpLoadQueuePeon(server.getURL(), jsonMapper, httpClient, config, peonExec, callbackExec);
     } else {
       return new CuratorLoadQueuePeon(
-          curator,
+          curatorFrameworkProvider.get(),
```
Just thinking out loud, this doesn't need to be addressed in this PR, but it seems like LoadQueueTaskMaster maybe needs some sort of peon factory that is set by config, so that it doesn't have to care about individual implementations or curators and the like.
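Roughly the shape of that factory (an illustrative interface, not code from this PR; it mirrors the giveMePeon signature visible in the diff above and reuses those Druid types, imports omitted):

```java
// Hypothetical: bind exactly one implementation of this per configured mode
// (curator-backed or http-backed), so LoadQueueTaskMaster only ever deals with
// the abstraction and never constructs peons or holds a CuratorFramework.
interface LoadQueuePeonFactory
{
  LoadQueuePeon giveMePeon(ImmutableDruidServer server);
}
```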
Could be, but you know my preference by now :)
I wish people would migrate to using HTTP segment management so that it would remain the only way and this could be deleted.
```diff
@@ -99,13 +107,28 @@ public BatchDataSegmentAnnouncer(
       return rv;
     };

-    if (this.config.isSkipSegmentAnnouncementOnZk()) {
+    isSkipSegmentAnnouncementOnZk = !zkEnablementConfig.isEnabled() || config.isSkipSegmentAnnouncementOnZk();
```
It seems like this class exists almost entirely to handle zk stuff; does it need to be bound and exist at all if zk is disabled?
This is also calling methods on
private final ChangeRequestHistory<DataSegmentChangeRequest> changes = new ChangeRequestHistory<>();
which is used by the http endpoint for segment sync.
However, yeah, it takes more work to announce stuff in zk, so it looks like this class is primarily doing that, whereas it is actually supporting both.
If/when we are able to delete zk based segment management, this class would shrink significantly.
Also, unrelated, it could possibly be refactored to separate the two things it is doing so that this is more clear.
However, you know my preference: just delete the zk stuff :)
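To make that separation concrete, one possible split (hypothetical interfaces, not actual Druid classes) could look like:

```java
// 1) always bound: records announcements into the ChangeRequestHistory that
//    backs the http endpoint used for segment sync
interface SegmentChangeTracker
{
  void recordAnnouncement(String segmentId);
  void recordUnannouncement(String segmentId);
}

// 2) only bound when druid.zk.service.enabled=true: mirrors the same
//    announcements into zookeeper for curator based segment discovery
interface ZkSegmentAnnouncer
{
  void announceToZk(String segmentId);
  void unannounceFromZk(String segmentId);
}
```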
@clintropolis thanks for the clarification, but I took the comments as coming from a curious reviewer. In fact, with written communication, I always try to imagine a happy person saying things.
👍 😅 I will try to get back to this today and finish up.
Oh man, this comment didn't age well; sorry, I will try to get back to this PR this week.
@clintropolis please take a look when you get a chance. The build appears to be failing due to newly introduced code coverage requirements and I am not entirely sure how to satisfy them, but I will take a look (maybe we need to tune down the coverage requirement, I don't know); the PR is reviewable/complete otherwise.
sorry, totally forgot about this PR 😅, I will try to have another look soon
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
Sorry, I totally keep forgetting about this PR. tl;dr if you fix up the conflicts, I'm +1.

I think a lot of my hesitation on this one is that I'm not fully in the 'let's get rid of Zookeeper' camp myself, since I haven't yet seen a cluster using only HTTP based stuff at the same scale that I have seen Zookeeper based clusters, and I have some lingering worries about how chill things are when there is a large pool of brokers on a very large cluster, since it seems like a significant increase in direct HTTP traffic between individual nodes. This is why I was wondering if, instead of the setting, this could be hidden behind some sort of interface.

However, that is not really a good reason to hold up this PR, which in the best case makes it easy to remove zookeeper stuffs in the future if we go all in on that, or, at minimum, has at least already done the work of finding and marking all of the places that do directly depend on zk stuffs, so that interfaces like I was describing could be added in the future if we decide to keep zk around as an option for operators.
Thanks, I will work on fixing the conflicts sometime in the next couple of days, then it can be merged. Also, actually most of the stuff is hidden behind interfaces, so zk will continue to work as long as we want it to work. I am sure things could always be improved and they will continue to evolve in the future. The most important thing is that this patch is essential to making progress on #9053, and that extension is almost ready.
It would be nice if druid supported k8s in the mainline; will it be addressed in the next release?
@clintropolis sorry, it has been a while; I am planning to fix the conflicts here and get this back to a working state... would you be able to review/merge?
will do 👍
looks like a compilation failure on:
and a test failure:
otherwise, lgtm
yep, working on fixing the build...
@clintropolis at this point the build is fine except for the code coverage checks on some of the trivial changes, which either can only execute when zk is disabled for real and a druid process is started, or are already exercised when the integration tests run (but the coverage tool probably only counts coverage from unit tests). I have looked at all of the failed code coverage red flags in the above builds, but I am not sure how to improve them or whether doing something just to bend the coverage numbers would actually achieve anything. So, would you consider ignoring the coverage check?
Yeah, I am not sure how meaningful the missing coverage could be in this case, so it seems reasonable to ignore it.
@clintropolis thanks!
* optionally disable all of hardcoded zookeeper use
* fix DruidCoordinatorTest compilation
* fix test in DruidCoordinatorTest
* fix strict compilation

Co-authored-by: Himanshu Gupta <fill email>
Paving Path Towards #9053
Description
This patch adds a new configuration property
druid.zk.service.enabled=true/false, default = true
on all nodes to disable all of the zookeeper activities that get set up even if the user chooses to use HTTP based segment and task management. Some of those are ...
(the above set of things is required so that curator based task/segment management continues to work in a rolling deployment scenario and can also be interchanged with http based task/segment management at any time)
ServiceAnnouncer to keep tranquility working
This property is undocumented for now, till the k8s based discovery extension PR shows up; that PR will have all the necessary documentation, including setting druid.zk.service.enabled=false.
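For illustration, a cluster opting out of zookeeper would presumably pair the new flag with the existing HTTP based management options, roughly like the following (only druid.zk.service.enabled comes from this PR; the other property names are the standard Druid settings for HTTP based segment/task management and are shown here as an assumed combination):

```properties
# new flag from this PR
druid.zk.service.enabled=false

# http based server view and segment loading
druid.serverview.type=http
druid.coordinator.loadqueuepeon.type=http

# http based task management on the overlord
druid.indexer.runner.type=httpRemote
```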
This PR has:
Key changed/added classes in this PR
ZkEnablementConfig
CliXXX
XXXModule
CuratorFramework