Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DruidLeaderClient.goAsync() is broken #9701

Closed
jihoonson opened this issue Apr 15, 2020 · 11 comments · Fixed by #9702
Closed

DruidLeaderClient.goAsync() is broken #9701

jihoonson opened this issue Apr 15, 2020 · 11 comments · Fixed by #9702
Labels
Milestone

Comments

@jihoonson
Copy link
Contributor

jihoonson commented Apr 15, 2020

Affected Version

0.18.0

Description

This bug seems to be introduced after #9481. Before this PR, DruidLeaderClient used to use ServerDiscoverySelector to find the leader. Since only the leader Overlord and Coordinator announce themselves, ServerDiscoverySelector could return the current leader to DruidLeaderClient. But now, DruidLeaderClient picks up the first node returned from DruidNodeDiscovery which knows all announced nodes. Since DruidLeaderClient.goAsync() doesn't follow redirection (#8716), some methods of the system table which use goAsync() API returns an error with the Temporary Redirect status.

@jihoonson jihoonson added the Bug label Apr 15, 2020
@jihoonson jihoonson added this to the 0.18.0 milestone Apr 15, 2020
@jihoonson
Copy link
Contributor Author

This is a release blocker for 0.18.0 since Druid doesn't work with multi-masters.

@jihoonson
Copy link
Contributor Author

I think we can revert #9481 for both master and 0.18.0 branches since it doesn't have any changes other than removing ServerDiscoverySelector from DruidLeaderClient. @himanshug does it make sense?

@himanshug
Copy link
Contributor

@jihoonson DruidLeaderClient is designed to follow redirects so as to land requests to leader. can you please describe, what functionality is broken by this change ?

@jihoonson
Copy link
Contributor Author

jihoonson commented Apr 15, 2020

When it tries to follow redirects, it checks the current known leader which calls pickOneHost(). Before #9481, this function used to use ServerDiscoverySelector which returns the known leader and fall back to DruidNodeDiscovery when there is no known leader, but now it always returns the first node from DruidNodeDiscovery.

private String pickOneHost()
{
Iterator<DiscoveryDruidNode> iter = druidNodeDiscovery.getAllNodes().iterator();
if (iter.hasNext()) {
DiscoveryDruidNode node = iter.next();
return StringUtils.format(
"%s://%s",
node.getDruidNode().getServiceScheme(),
node.getDruidNode().getHostAndPortToUse()
);
}

@himanshug
Copy link
Contributor

himanshug commented Apr 15, 2020

ignoring name of that method, when we make request to wrong leader, we should get a HTTP 302 back and that redirect is then followed to get the request to right node (see

String redirectUrlStr = fullResponseHolder.getResponse().headers().get("Location");
) and newly discovered leader is then cached ( see
currentKnownLeader.set(StringUtils.format(
)

@gianm
Copy link
Contributor

gianm commented Apr 15, 2020

The problem happens with DruidLeaderClient.goAsync, which lets callers provide their own HttpResponseHandler, but doesn't contain the leader-following logic. It is used by the Druid SQL system tables so they can avoid materializing the entire response for the APIs they are hitting (which could be large). Perhaps the best fix is that it should learn how to follow leaders.

@jihoonson
Copy link
Contributor Author

Yep, sorry for confusing. @himanshug you're right about DruidLeaderClient.g(), but it happens with goAsync() as @gianm mentioned which is reported in #8716.

I agree that the best approach to fix this would fix #8716, but we are running out of time for 0.18.0, I'm still inclined to reverting #9481 for now.

@gianm
Copy link
Contributor

gianm commented Apr 15, 2020

Reverting the patch makes sense to me, then for the next release, adding it back along with implementing redirects for goAsync.

@himanshug
Copy link
Contributor

ah, I see , wasn't aware of this change. I agree, right fix is to make it follow leader.

for now, I think right way to fix it to not revert #9481 but rather use DruidLeaderSelector interface to get the leader.. so as to not add hardcoded dependency back to zk. Similar to what router is doing DruidLeaderSelector.getCurrentLeader() for example https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/server/AsyncManagementForwardingServlet.java#L95

that will revert the behavior equivalent to using zk specific ServerDiscoverySelector

let me do a PR unless someone sees problems with that .. @gianm @jihoonson ?

@jihoonson jihoonson changed the title DruidLeaderClient always thinks the first node returned from DruidNodeDiscovery is the leader DruidLeaderClient.goAsync() is broken Apr 15, 2020
@himanshug
Copy link
Contributor

or rather, to expedite things, let us revert #9481 for now.

I will revive that along with bringing in DruidLeaderSelector.getCurrentLeader() or follow redirect in goAsync later.

@jihoonson
Copy link
Contributor Author

@gianm @himanshug thanks for understanding! The PR is ready at #9702.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants