
optionally disable all of hardcoded zookeeper use #9507

Merged: 6 commits into apache:master on Oct 27, 2020

Conversation

@himanshug (Contributor) commented Mar 12, 2020

Paving Path Towards #9053

Description

This patch adds a new configuration property, druid.zk.service.enabled=true/false (default = true), on all nodes, to disable all of the ZooKeeper activity that gets set up even if the user chooses HTTP-based segment and task management. Some of those activities are:

  • historicals announcing themselves as data servers in ZK
  • historicals watching ZK for segment load/drop requests
  • historicals announcing segments in ZK
  • middle managers watching ZK for task assignment requests
  • middle managers doing task status updates in ZK
    (the above are required so that Curator-based task/segment management continues to work in rolling-deployment scenarios and can be interchanged with HTTP-based task/segment management at any time)
  • HttpRemoteTaskRunner periodically cleaning up tasks from ZK, because the MiddleManager continues to update task status in ZK
  • external discovery announcements done via ServiceAnnouncer, to keep Tranquility working

This property is undocumented for now, until the k8s-based discovery extension PR shows up; that PR will have all the necessary documentation, including setting druid.zk.service.enabled=false.
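(For illustration, a node with ZK fully disabled would pair this new property with the existing HTTP-based management options. A minimal runtime.properties sketch; the HTTP-related property names below are the pre-existing documented options, not something added by this PR:)

```properties
# Sketch of a ZK-free node configuration.
# druid.zk.service.enabled is the property added by this PR.
druid.zk.service.enabled=false

# Pre-existing HTTP-based segment management (instead of ZK watchers):
druid.serverview.type=http
druid.coordinator.loadqueuepeon.type=http

# Pre-existing HTTP-based task management (on the Overlord, instead of ZK task paths):
druid.indexer.runner.type=httpRemote
```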


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • ZkEnablementConfig
  • CliXXX
  • XXXModule
  • And a few others that directly or indirectly depended on CuratorFramework

@himanshug (Contributor Author)

@clintropolis can you please review/approve this one?

@clintropolis (Member)

> @clintropolis can you please review/approve this one?

Sorry, yes, I have been meaning to take a look, I just haven't had much time to spare lately 😅. I will try to have a look sometime today/this weekend.

@himanshug (Contributor Author)

@clintropolis thanks, I totally understand that :)

@clintropolis (Member) commented Mar 24, 2020

Some thoughts as I've been reviewing this (sorry I haven't finished yet):

Do you view this as an interim configuration, to allow your work on an alternative discovery mechanism to proceed until we can decouple ZooKeeper-specific code from all of the places that need to check this setting? Or is the plan to leave it like this? So far I find it kind of ugly to have a setting like this, due to all of the if/else branches it causes, but maybe there is some obvious reason I haven't gotten to yet for why we aren't adding some sort of druid.discovery.type=zk|none instead of this enable/disable setting. I know some of the current HTTP modes are sort of leaky, in that they still do ZK stuff to support rolling-update transitions between settings; I would be in favor of breaking the current versions that support that, and adding some sort of composite or special mode that runs both versions just for transition scenarios, if that is the main driver for having the setting be this way.

I'll keep reviewing, and try to finish up later tonight.

@himanshug (Contributor Author)

@clintropolis Your concern is legit. I am not sure if you have seen https://groups.google.com/forum/#!msg/druid-development/tWnwPyL0Vk4/2uLwqgQiAAAJ and https://groups.google.com/forum/#!msg/druid-development/eIWDPfhpM_U/AzMRxSQGAgAJ, but please take a look at those. They explain that the extensible "discovery" to be implemented in extensions covers "node/service discovery" and "leader election". Other ZooKeeper usage, for segment/task management, should be replaced entirely by the HTTP counterparts, which I have been using for a long time. All of the disabling of non-extensible ZooKeeper use introduced in this PR can be removed if/when we can delete ZooKeeper-based segment/task management.
I hope that with the Kubernetes discovery extension, HTTP-based segment/task management will get more adoption (since the ZooKeeper-based counterparts wouldn't work in that setting) and will eventually be the only option, at which point all of the optionality introduced in this PR can be deleted too.

druid.discovery.type=zk does exist in effect: it is druid.discovery.type=curator, the default, which leads to the Curator-based implementations of the discovery abstractions: CuratorDruidLeaderSelector, CuratorDruidNodeAnnouncer and CuratorDruidNodeDiscoveryProvider.

The next PR in this chain will have Kubernetes-based implementations of all of those, i.e. K8sDruidLeaderSelector, K8sDruidNodeAnnouncer and K8sDruidNodeDiscoveryProvider.
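(For illustration, a hedged sketch of how such an extension might register its provider under druid.discovery.type; the PolyBind wiring below mirrors how Druid binds similar options, but it is an assumption, not code from this PR or the follow-up:)

```java
import com.google.inject.Binder;
import com.google.inject.Key;
import com.google.inject.Module;
import org.apache.druid.discovery.DruidNodeDiscoveryProvider;
import org.apache.druid.guice.LazySingleton;
import org.apache.druid.guice.PolyBind;

// Hypothetical extension module: registers "k8s" as a choice for
// druid.discovery.type alongside the default "curator" implementation.
public class K8sDiscoveryModule implements Module
{
  @Override
  public void configure(Binder binder)
  {
    PolyBind.optionBinder(binder, Key.get(DruidNodeDiscoveryProvider.class))
            .addBinding("k8s")
            .to(K8sDruidNodeDiscoveryProvider.class)
            .in(LazySingleton.class);
    // K8sDruidNodeAnnouncer and K8sDruidLeaderSelector would be bound the same way.
  }
}
```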

@clintropolis (Member) left a comment

Btw, re-reading my last comment, the tone is sort of ambiguous as to whether I'm being negative. I'm totally not trying to be a hater, and I think this approach is fine as is; I'm just trying to get a better idea of what finished #9053 looks like 😅.

```diff
 }

 @GET
 @Path("/readiness")
 public Response getReadiness()
 {
-  if (coordinator.isStarted()) {
+  if (segmentLoadDropHandler.isStarted()) {
```
@clintropolis (Member)

Hmm, is this actually true if still using ZK segment loading? I think it was wrong before, for the same reasons, for HTTP segment loading. Maybe this resource should accept either segment loader, like SegmentListerResource does, or we should have another HTTP resource that we bind instead, depending on which mode is enabled?

@himanshug (Contributor Author)

The primary thing checked by this endpoint is that the historical startup sequence has finished loading/announcing the segments it already had on disk, which is often a time-consuming activity. That is still ensured by segmentLoadDropHandler.isStarted().

```diff
   JsonConfigProvider.bind(binder, CURATOR_CONFIG_PREFIX, CuratorConfig.class);
   JsonConfigProvider.bind(binder, EXHIBITOR_CONFIG_PREFIX, ExhibitorConfig.class);
 }

 @Provides
 @LazySingleton
 @SuppressForbidden(reason = "System#err")
-public CuratorFramework makeCurator(CuratorConfig config, EnsembleProvider ensembleProvider, Lifecycle lifecycle)
+public CuratorFramework makeCurator(ZkEnablementConfig zkEnablementConfig, CuratorConfig config, EnsembleProvider ensembleProvider, Lifecycle lifecycle)
```
@clintropolis (Member)

I haven't looked too closely yet, but I wonder if this would be better marked @Nullable, returning null instead of throwing the runtime exception, and shifting the burden of validating that Curator is available onto the settings that actually require it, such as inventory, segment loading, and task management? The other stuff might then be simplified a bit to not care about the setting, and could probably avoid some of the signature changes to use providers.

@himanshug (Contributor Author)

The main reason for failing loudly here is that if I forgot to disable ZK in some code path, this method would immediately fail on node start with Guice injection errors, leading to quick discovery of exactly what was missed.
This helped me catch quite a few places I had missed.
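(In other words, the pattern is roughly the following sketch; the actual method body in the PR may differ:)

```java
@Provides
@LazySingleton
public CuratorFramework makeCurator(
    ZkEnablementConfig zkEnablementConfig,
    CuratorConfig config,
    EnsembleProvider ensembleProvider,
    Lifecycle lifecycle
)
{
  if (!zkEnablementConfig.isEnabled()) {
    // Fail loudly at injection time: any code path that still asks Guice for a
    // CuratorFramework while ZK is disabled surfaces as an injection error on
    // node start, pinpointing exactly which binding was missed.
    throw new RuntimeException("Zookeeper is disabled, CuratorFramework must not be injected");
  }
  // ... original CuratorFramework construction, unchanged ...
}
```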

```diff
@@ -443,7 +455,7 @@ public void moveSegment(
       () -> {
         try {
           if (serverInventoryView.isSegmentLoadedByServer(toServer.getName(), segment) &&
-              curator.checkExists().forPath(toLoadQueueSegPath) == null &&
+              (curator == null || curator.checkExists().forPath(toLoadQueueSegPath) == null) &&
```
@clintropolis (Member)

This doesn't necessarily need to change in this PR, but it seems kind of leaky that this thing has a CuratorFramework at all. It seems like the load peon should provide this check, so it can just be a no-op for non-ZK; then DruidCoordinator no longer needs a curator or ZK paths, I think?

@himanshug (Contributor Author)

Yeah, I hope ZK-based segment management will just go away, and this will go with it.
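(For illustration, the suggested shape might look like this hypothetical sketch; isSegmentQueuedToLoad is an invented name, not part of this PR:)

```java
// Hypothetical: the peon owns the "already queued?" check, so DruidCoordinator
// would need no CuratorFramework or ZK paths of its own.
public abstract class LoadQueuePeon
{
  // CuratorLoadQueuePeon would check its ZK load-queue path;
  // HttpLoadQueuePeon could answer from its in-memory queue, with no ZK involved.
  public abstract boolean isSegmentQueuedToLoad(DataSegment segment);
}
```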

```diff
@@ -67,7 +68,7 @@ public LoadQueuePeon giveMePeon(ImmutableDruidServer server)
     return new HttpLoadQueuePeon(server.getURL(), jsonMapper, httpClient, config, peonExec, callbackExec);
   } else {
     return new CuratorLoadQueuePeon(
-        curator,
+        curatorFrameworkProvider.get(),
```
@clintropolis (Member)

Just thinking out loud, and this doesn't need to be addressed in this PR: it seems like LoadQueueTaskMaster maybe needs some sort of peon factory that is set by config, so that it doesn't have to care about individual implementations, or curators and the like.

@himanshug (Contributor Author)

Could be, but you know my preference by now :)
I wish people would migrate to HTTP segment management, so that it would remain the only way and this could be deleted.
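(For illustration, the factory idea might look like this hypothetical sketch; LoadQueuePeonFactory is an invented name:)

```java
// Hypothetical: LoadQueueTaskMaster would be handed one of these, chosen by
// config, and would never touch Curator or concrete peon classes directly.
public interface LoadQueuePeonFactory
{
  LoadQueuePeon giveMePeon(ImmutableDruidServer server);
}
```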

```diff
@@ -99,13 +107,28 @@ public BatchDataSegmentAnnouncer(
       return rv;
     };

-    if (this.config.isSkipSegmentAnnouncementOnZk()) {
+    isSkipSegmentAnnouncementOnZk = !zkEnablementConfig.isEnabled() || config.isSkipSegmentAnnouncementOnZk();
```
@clintropolis (Member)

It seems like this class exists almost entirely to handle ZK stuff; does it need to be bound and exist at all if ZK is disabled?

@himanshug (Contributor Author)

This is also calling methods on

```java
private final ChangeRequestHistory<DataSegmentChangeRequest> changes = new ChangeRequestHistory<>();
```

which is used by the HTTP endpoint for segment sync.

That said, yeah, it takes more work to announce stuff in ZK, so it looks like this class is primarily doing that, whereas it is actually supporting both.
If/when we are able to delete ZK-based segment management, this class will shrink significantly.
Also, unrelated: it could possibly be refactored to separate the two things it is doing, so that this is clearer.
However, you know my preference: just delete the ZK stuff :)

@himanshug (Contributor Author) commented Mar 24, 2020

@clintropolis thanks for the clarification, but I took the comments as coming from a curious reviewer. In fact, with written communication, I always try to imagine a happy person speaking the words.

@clintropolis (Member)

> @clintropolis thanks for the clarification, but I took the comments as coming from a curious reviewer. In fact, with written communication, I always try to imagine a happy person speaking the words.

👍 😅

I will try to get back to this today and finish up.

@clintropolis (Member)

> I will try to get back to this today and finish up.

Oh man, this comment didn't age well, sorry. I will try to get back to this PR this week.

@himanshug (Contributor Author)

@clintropolis please take a look when you get a chance.

The build appears to be failing due to newly introduced code coverage requirements, and I am not entirely sure how to fulfill them, but I will take a look (maybe we need to tune down the coverage requirement, I don't know). The PR is reviewable/complete otherwise.

@clintropolis (Member)

> @clintropolis please take a look when you get a chance.

Sorry, totally forgot about this PR 😅. I will try to have another look soon.

stale bot commented Aug 3, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

@stale stale bot added the stale label Aug 3, 2020
@clintropolis clintropolis removed the stale label Aug 4, 2020
@clintropolis (Member)

Sorry, I totally keep forgetting about this PR. tl;dr: if you fix up the conflicts, I'm +1.

I think a lot of my hesitation on this one is that I'm not fully in the "let's get rid of ZooKeeper" camp myself, since I haven't yet seen a cluster using only the HTTP-based stuff at the same scale as the ZooKeeper-based clusters I have seen, and I have some lingering worries about how chill things are when there is a large pool of brokers on a very large cluster, since it seems like a significant increase in direct HTTP traffic between individual nodes. This is why I was wondering whether, instead of the if statements, we could just hide the Curator stuff behind interfaces that can be a no-op for non-ZK.

However, that is not really a good reason to hold up this PR, which in the best case makes it easy to remove the ZooKeeper stuff in the future if we go all in on that, or, at minimum, has already done the work of finding and marking all of the places that directly depend on ZK, so that interfaces like I was describing could be added in the future if we decide to keep ZK around as an option for operators.

@himanshug (Contributor Author)

Thanks, I will work on fixing the conflicts sometime in the next couple of days; then it can be merged.

Also, most of the stuff actually is hidden behind interfaces, so ZK will continue to work for as long as we want it to. I am sure things could always be improved, and this will continue to evolve. The most important thing is that this patch is essential to making progress on #9053, and that extension is almost ready.
Regarding the worries about HTTP: I have run it for a long time, and the only way to gain confidence is to really try it out and have it get more adoption. My hope is that with the k8s-based discovery extension it will get more adoption, and consequently more testing in many different clusters.

@pan3793 (Member) commented Sep 14, 2020

It would be nice if Druid supported k8s in the mainline; will this be addressed in the next release?

@himanshug (Contributor Author)

@clintropolis sorry, it's been a while; I am planning to fix the conflicts here and get this back to a working state... would you be able to review/merge?

@clintropolis (Member)

> @clintropolis sorry, it's been a while; I am planning to fix the conflicts here and get this back to a working state... would you be able to review/merge?

will do 👍

@clintropolis (Member) commented Oct 26, 2020

Looks like a compilation failure:

```
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /home/travis/build/apache/druid/integration-tests/src/main/java/org/apache/druid/cli/CliHistoricalForQueryRetryTest.java:[46,15] [MissingOverride] configure overrides method in CliHistorical; expected @Override
    (see https://errorprone.info/bugpattern/MissingOverride)
  Did you mean '@Override @Inject'?
```
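(Per the Error Prone hint, the fix is presumably just the missing annotation on the overriding method in the test class; the signature below is an assumption inferred from the error message:)

```java
// Assumed fix in CliHistoricalForQueryRetryTest; the parameter list may differ.
@Override
@Inject
public void configure(Properties properties)
{
  // ... existing body unchanged ...
}
```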

and a test failure:

```
[ERROR] Errors:
[ERROR] org.apache.druid.server.coordinator.DruidCoordinatorTest.testBalancerThreadNumber(org.apache.druid.server.coordinator.DruidCoordinatorTest)
[ERROR]   Run 1: DruidCoordinatorTest.testBalancerThreadNumber:690 » NullPointer
[ERROR]   Run 2: DruidCoordinatorTest.testBalancerThreadNumber:690 » NullPointer
[ERROR]   Run 3: DruidCoordinatorTest.testBalancerThreadNumber:690 » NullPointer
[ERROR]   Run 4: DruidCoordinatorTest.testBalancerThreadNumber:690 » NullPointer
```

otherwise, lgtm

@himanshug (Contributor Author)

Yep, working on fixing the build....

@himanshug (Contributor Author)

@clintropolis at this point the build is fine, except for the code coverage checks on some of the trivial changes, which either only execute when ZK is disabled for real and a Druid process is started, or are already exercised when the integration tests run (but the coverage tool probably only counts coverage from unit tests). I have looked at all of the code coverage red flags in the builds above, but I am not sure how to improve them, or whether doing something just to bend the code coverage numbers would actually achieve anything. So, would you consider ignoring the coverage check?

@clintropolis (Member)

> @clintropolis at this point the build is fine, except for the code coverage checks on some of the trivial changes, which either only execute when ZK is disabled for real and a Druid process is started, or are already exercised when the integration tests run (but the coverage tool probably only counts coverage from unit tests). I have looked at all of the code coverage red flags in the builds above, but I am not sure how to improve them, or whether doing something just to bend the code coverage numbers would actually achieve anything. So, would you consider ignoring the coverage check?

Yeah, I am not sure how meaningful the missing coverage could be in this case, so it seems reasonable to ignore it.

@himanshug (Contributor Author)

@clintropolis thanks!

@himanshug himanshug merged commit ee13630 into apache:master Oct 27, 2020
@himanshug himanshug deleted the optional_disable_zk branch October 27, 2020 18:21
@jihoonson jihoonson added this to the 0.21.0 milestone Jan 4, 2021
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
* optionally disable all of hardcoded zookeeper use

* fix DruidCoordinatorTest compilation

* fix test in DruidCoordinatorTest

* fix strict compilation

Co-authored-by: Himanshu Gupta <fill email>