optionally disable all of hardcoded zookeeper use #9507
Conversation
@clintropolis can you please review/approve this one?
Sorry, yes, I have been meaning to take a look, just haven't had much time to spare lately 😅. I will try to have a look sometime today/this weekend.
@clintropolis thanks, I totally understand that :)
Some thoughts as I've been reviewing this (sorry I haven't finished yet): Do you view this as an interim configuration, to allow your work to proceed on an alternative discovery mechanism, until we can decouple zookeeper specific code from all of the places that need to check this setting? Or is the plan to leave it like this? So far I find it kind of ugly to have a setting like this due to all of the if/else branches it causes, but maybe there is some obvious reason I haven't got to yet on why we aren't adding some sort of interface for this instead. I'll keep reviewing, and try to finish up later tonight.
@clintropolis Your concern is legit. I am not sure if you have seen https://groups.google.com/forum/#!msg/druid-development/tWnwPyL0Vk4/2uLwqgQiAAAJ and https://groups.google.com/forum/#!msg/druid-development/eIWDPfhpM_U/AzMRxSQGAgAJ , but please take a look at those. They explain that the extensible "discovery" to be implemented in extensions covers "node/service discovery" and "leader election". Other zookeeper usage, for segment/task management, should be totally replaced by HTTP based segment/task management.
The next PR in this chain would have a kubernetes based impl for all of those, i.e. node/service discovery and leader election.
Btw, re-read my last comment and the tone is sort of ambiguous on whether I'm being negative - I'm totally not trying to be a hater and I think this approach is fine as is, just trying to get a better idea on what finished #9053 looks like 😅.
```diff
 }

 @GET
 @Path("/readiness")
 public Response getReadiness()
 {
-  if (coordinator.isStarted()) {
+  if (segmentLoadDropHandler.isStarted()) {
```
Hmm, is this actually true if still using zk segment loading? It was wrong before for the same reasons, I think, for http segment loading. Maybe this resource should accept either segment loader, like SegmentListerResource does, or we should have another http resource that we bind instead, depending on which mode is enabled?
The primary thing checked by this endpoint is that the historical startup sequence has finished loading/announcing the segments it already had on disk, which is often a time consuming activity. That is still ensured by segmentLoadDropHandler.isStarted().
```diff
   JsonConfigProvider.bind(binder, CURATOR_CONFIG_PREFIX, CuratorConfig.class);
   JsonConfigProvider.bind(binder, EXHIBITOR_CONFIG_PREFIX, ExhibitorConfig.class);
 }

 @Provides
 @LazySingleton
 @SuppressForbidden(reason = "System#err")
-public CuratorFramework makeCurator(CuratorConfig config, EnsembleProvider ensembleProvider, Lifecycle lifecycle)
+public CuratorFramework makeCurator(ZkEnablementConfig zkEnablementConfig, CuratorConfig config, EnsembleProvider ensembleProvider, Lifecycle lifecycle)
```
I haven't looked too closely yet, but I wonder if this would be better if this was marked as @Nullable and returned null instead of throwing the runtime exception, and shift the burden of validating that curator is available to the things that do require it, such as inventory, segment loading, and task management? The other stuff might be able to be simplified a bit and not have to care about having the setting, and could probably avoid some of the signature changes to use providers.
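A minimal sketch of the alternative being floated here, with a hypothetical consumer class (the names below are illustrative, not code from this PR):

```java
import javax.annotation.Nullable;

import org.apache.curator.framework.CuratorFramework;

// Illustrative only: the provider hands out null when zk is disabled, and each
// component that genuinely needs zookeeper validates availability itself.
public class NullableCuratorConsumerSketch
{
  @Nullable
  private final CuratorFramework curator;

  public NullableCuratorConsumerSketch(@Nullable CuratorFramework curator)
  {
    this.curator = curator;
  }

  // hypothetical zk-only operation: the availability check lives here, not in the provider
  public void doSomethingThatNeedsZk(String path) throws Exception
  {
    if (curator == null) {
      throw new IllegalStateException(
          "zookeeper is disabled (druid.zk.service.enabled=false) but this code path requires it"
      );
    }
    curator.checkExists().forPath(path); // example zk call, similar to the moveSegment diff below
  }
}
```

With this shape, components that never touch zk would not need to know the setting exists at all.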
The main reason for failing loudly here is that if I forgot to disable zk in some code path, this method would immediately fail on node start with guice injection errors, leading to quick discovery of exactly what was missed.
This helped me catch quite a few places that I missed.
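For illustration, a self-contained sketch of that fail-loudly-at-injection-time behavior using plain Guice; the class names are stand-ins, and the actual exception/message in the PR may differ:

```java
import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;
import com.google.inject.Provides;

public class FailLoudlyProviderSketch
{
  static class FakeZkClient {}            // stand-in for CuratorFramework

  static class ForgottenZkUser
  {
    @Inject
    ForgottenZkUser(FakeZkClient client)  // a code path that still injects the zk client
    {
    }
  }

  public static void main(String[] args)
  {
    final boolean zkEnabled = false;      // i.e. druid.zk.service.enabled=false

    Injector injector = Guice.createInjector(new AbstractModule()
    {
      @Provides
      FakeZkClient makeZkClient()
      {
        if (!zkEnabled) {
          // fail loudly: any leftover zk dependency surfaces as a provision
          // error on node start, pointing at the class that asked for it
          throw new RuntimeException("zookeeper is disabled, yet the zk client was requested");
        }
        return new FakeZkClient();
      }
    });

    injector.getInstance(ForgottenZkUser.class); // throws a ProvisionException naming ForgottenZkUser
  }
}
```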
```diff
@@ -443,7 +455,7 @@ public void moveSegment(
       () -> {
         try {
           if (serverInventoryView.isSegmentLoadedByServer(toServer.getName(), segment) &&
-              curator.checkExists().forPath(toLoadQueueSegPath) == null &&
+              (curator == null || curator.checkExists().forPath(toLoadQueueSegPath) == null) &&
```
This doesn't necessarily need to change in this PR, but it seems kind of leaky that this thing has a CuratorFramework at all; it seems like the load peon should provide this check so it can just be a no-op for non-zk, and then DruidCoordinator no longer needs a curator or zk paths, I think?
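A sketch of the direction being suggested here, with a hypothetical method name (the existing LoadQueuePeon does not necessarily expose this):

```java
// Hypothetical addition to the load-queue peon abstraction: the coordinator
// asks the peon whether a load is already queued for a segment, instead of
// checking zookeeper paths itself.
interface LoadQueuePeonSketch
{
  // a zk-backed implementation would check the load queue znode via curator;
  // an http-backed implementation would check its in-memory queue, making the
  // zookeeper part effectively a no-op
  boolean isSegmentQueued(String segmentId);
}
```

With something like this, the moveSegment check shown above could drop the curator reference entirely.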
Yeah, I hope zk based segment management will just go away, and this will go away as well.
```diff
@@ -67,7 +68,7 @@ public LoadQueuePeon giveMePeon(ImmutableDruidServer server)
       return new HttpLoadQueuePeon(server.getURL(), jsonMapper, httpClient, config, peonExec, callbackExec);
     } else {
       return new CuratorLoadQueuePeon(
-          curator,
+          curatorFrameworkProvider.get(),
```
Just thinking out loud, this doesn't need to be addressed in this PR, but it seems like LoadQueueTaskMaster maybe needs some sort of peon factory that is set by config, so that it doesn't have to care about individual implementations or curators and the like.
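Roughly the shape of that factory (an illustrative interface, not code from this PR; it mirrors the giveMePeon signature visible in the diff above and reuses those Druid types, imports omitted):

```java
// Hypothetical: bind exactly one implementation of this per configured mode
// (curator-backed or http-backed), so LoadQueueTaskMaster only ever deals with
// the abstraction and never constructs peons or holds a CuratorFramework.
interface LoadQueuePeonFactory
{
  LoadQueuePeon giveMePeon(ImmutableDruidServer server);
}
```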
Could be, but you know my preference by now :)
I wish people would migrate to using HTTP segment management so that it would remain the only way and this could be deleted.
```diff
@@ -99,13 +107,28 @@ public BatchDataSegmentAnnouncer(
       return rv;
     };

-    if (this.config.isSkipSegmentAnnouncementOnZk()) {
+    isSkipSegmentAnnouncementOnZk = !zkEnablementConfig.isEnabled() || config.isSkipSegmentAnnouncementOnZk();
```
It seems like this class exists almost entirely to handle zk stuff; does it need to be bound and exist at all if zk is disabled?
This is also calling methods on
private final ChangeRequestHistory<DataSegmentChangeRequest> changes = new ChangeRequestHistory<>();
which is used by the http endpoint for segment sync.
However, yeah, it takes more work to announce stuff in zk, so it looks like this class is primarily doing that, whereas it is actually supporting both.
If/when we are able to delete zk based segment management, this class would shrink significantly.
Also, unrelated, it could possibly be refactored to separate the two things it is doing so that this is more clear.
However, you know my preference: just delete the zk stuff :)
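To make that separation concrete, one possible split (hypothetical interfaces, not actual Druid classes) could look like:

```java
// 1) always bound: records announcements into the ChangeRequestHistory that
//    backs the http endpoint used for segment sync
interface SegmentChangeTracker
{
  void recordAnnouncement(String segmentId);
  void recordUnannouncement(String segmentId);
}

// 2) only bound when druid.zk.service.enabled=true: mirrors the same
//    announcements into zookeeper for curator based segment discovery
interface ZkSegmentAnnouncer
{
  void announceToZk(String segmentId);
  void unannounceFromZk(String segmentId);
}
```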
@clintropolis thanks for the clarification, but I took the comments as coming from a curious reviewer. In fact, with written communication, I always try to imagine a happy person saying things.
👍 😅 I will try to get back to this today and finish up.
Oh man, this comment didn't age well; sorry, I will try to get back to this PR this week.
@clintropolis please take a look when you get a chance. The build appears to be failing due to newly introduced code coverage requirements and I am not entirely sure how to satisfy them, but I will take a look (maybe we need to tune down the coverage requirement, I don't know); the PR is reviewable/complete otherwise.
sorry, totally forgot about this PR 😅, I will try to have another look soon
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
Sorry, I totally keep forgetting about this PR. tl;dr if you fix up the conflicts, I'm +1.

I think a lot of my hesitation on this one is that I'm not fully in the 'let's get rid of Zookeeper' camp myself, since I haven't yet seen a cluster using only HTTP based stuff at the same scale that I have seen Zookeeper based clusters, and I have some lingering worries about how chill things are when there is a large pool of brokers on a very large cluster, since it seems like a significant increase in direct HTTP traffic between individual nodes. This is why I was wondering if, instead of the setting, this could be hidden behind some sort of interface.

However, that is not really a good reason to hold up this PR, which in the best case makes it easy to remove zookeeper stuffs in the future if we go all in on that, or, at minimum, has at least already done the work of finding and marking all of the places that do directly depend on zk stuffs, so that interfaces like I was describing could be added in the future if we decide to keep zk around as an option for operators.
Thanks, I will work on fixing the conflicts sometime in the next couple of days, then it can be merged. Also, actually most of the stuff is hidden behind interfaces, so zk will continue to work as long as we want it to work. I am sure things could always be improved and they will continue to evolve in the future. The most important thing is that this patch is essential to making progress on #9053, and that extension is almost ready.
It would be nice if druid supported k8s in the mainline; will it be addressed in the next release?
@clintropolis sorry, it has been a while; I am planning to fix the conflicts here and get this back to a working state... would you be able to review/merge?
will do 👍
looks like a compilation failure on:
and a test failure:
otherwise, lgtm
yep, working on fixing the build...
@clintropolis at this point the build is fine except for the code coverage checks on some of the trivial changes, which either can only execute when zk is disabled for real and a druid process is started, or are already exercised when the integration tests run (but the coverage tool probably only counts coverage from unit tests). I have looked at all of the failed code coverage red flags in the above builds, but I am not sure how to improve them or whether doing something just to bend the coverage numbers would actually achieve anything. So, would you consider ignoring the coverage check?
Yeah, I am not sure how meaningful the missing coverage could be in this case, so it seems reasonable to ignore it.
@clintropolis thanks!
* optionally disable all of hardcoded zookeeper use
* fix DruidCoordinatorTest compilation
* fix test in DruidCoordinatorTest
* fix strict compilation

Co-authored-by: Himanshu Gupta <fill email>
Paving Path Towards #9053
Description
This patch adds a new configuration property
druid.zk.service.enabled=true/false, default = true
on all nodes to disable all of the zookeeper activities that get set up even if the user chooses to use HTTP based segment and task management. Some of those are ...
(the above set of things is required so that curator based task/segment management continues to work in a rolling deployment scenario and can also be interchanged with http based task/segment management at any time)
ServiceAnnouncer to keep tranquility working
This property is undocumented for now, till the k8s based discovery extension PR shows up; that PR will have all the necessary documentation, including setting druid.zk.service.enabled=false.
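For illustration, a cluster opting out of zookeeper would presumably pair the new flag with the existing HTTP based management options, roughly like the following (only druid.zk.service.enabled comes from this PR; the other property names are the standard Druid settings for HTTP based segment/task management and are shown here as an assumed combination):

```properties
# new flag from this PR
druid.zk.service.enabled=false

# http based server view and segment loading
druid.serverview.type=http
druid.coordinator.loadqueuepeon.type=http

# http based task management on the overlord
druid.indexer.runner.type=httpRemote
```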
This PR has:
Key changed/added classes in this PR
ZkEnablementConfig
CliXXX
XXXModule
CuratorFramework