Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster topology lookup should not replaces self-node details with host and port from RedisURI when RedisURI is load balancer #712

Closed
warrenzhu25 opened this issue Feb 26, 2018 · 11 comments
Labels
type: feature A new feature

Comments

@warrenzhu25
Copy link

Today it's common that managed Redis Cluster expose a load balancer as main connection point that randomly forward requests to real Redis node. Popular ones are Azure Redis Cache and AWS ElasticCache. When Lettuce load topology view by calling CLUSTER NODES, the node with tag SELF will be replaced with load balancer URL.

Lettuce Version: 4.4.0.Final

This leads to 2 problems:

  1. Load Balancer is considered as a normal node with slots assigned, but it forward requests randomly. So we will get randomly 'MOVED' response when request is forwarded into certain node.

  2. When we try to connect the node being replaced and cluster node membership verification enabled(default), RedisException with message This connection point is not known in the cluster view will be thrown.

  3. The symptom cannot be stably reproduced since we rely on PartitionsConsensus.HEALTHY_MAJORITY to randomly return topology view. Only the view containing replaced self node the issue could be reproduced.

I believe this is same as #312, but the fix is not working. The root cause is inside NodeTopologyView# Constructor()

    NodeTopologyView(RedisURI redisURI, String clusterNodes, String clientList, long latency) {

        this.available = true;
        this.redisURI = redisURI;

        this.partitions = ClusterPartitionParser.parse(clusterNodes);
        this.connectedClients = getClients(clientList);

        this.clusterNodes = clusterNodes;
        this.clientList = clientList;
        this.latency = latency;

        getOwnPartition().setUri(redisURI); // Updated self mode to RedisURL
    }

@mp911de I don't think simply removing this line is a good fix. I want to understand the motivation behind replacing ip and port with RedisURL. After that, I'm happy to provide a pull request to fix this.

@mp911de
Copy link
Collaborator

mp911de commented Feb 26, 2018

Setting the own (MYSELF) partition to the URI it was obtained from is done to preserve the user-provided endpoint. This behavior was primarily required when Redis Cluster did not provide a way to set the external port/IP address under which it was reachable (Docker/NATted use).

I'd need to investigate more in-depth whether we still require this behavior. Did you try removing the line and run the tests?

@mp911de mp911de added the status: waiting-for-feedback We need additional information before we can continue label Feb 26, 2018
warrenzhu25 pushed a commit to warrenzhu25/lettuce-core that referenced this issue Mar 1, 2018
@warrenzhu25
Copy link
Author

You can check the CI results after remove that line. If removing this breaks other features, what do you think introducing a field to indicate one node is actually load balancer? Then we can skip replacing URI.

@mp911de
Copy link
Collaborator

mp911de commented Mar 5, 2018

Sorry, that's not how open source contributions work. If you change/break something, it's your responsibility to find out that something broke and report it rather than employ the project maintainer to monitor PR/CI builds.

However, maybe we can introduce node aliasing for a smoother resolution instead of having a single URI.

@warrenzhu25
Copy link
Author

Sorry that I haven't made myself clear. I put the broken test results here to help you understand the effects of just removing this since you said that you would investigate whether we need this feature. That doesn't mean that's the final pull request I want to make. Before I draft the final fix, I want to hear from you about investigation result. Then I can make sure that I go in the right direction. Thanks.

@mp911de
Copy link
Collaborator

mp911de commented Mar 7, 2018

Alright, I see. I think introducing an optional node alias (or aliases) might be the way to go. The tests which broke perform a lookup with the RedisURI they used to connect to the node. Redis reports within the tests 127.0.0.1 instead of localhost which causes that the self-node cannot be looked up.

A lookup by host/port would consider the alias as well and we share the view of the Redis Cluster as well as from the seed-node perspective.

@mp911de mp911de added type: feature A new feature and removed status: waiting-for-feedback We need additional information before we can continue labels Mar 7, 2018
@warrenzhu25
Copy link
Author

warrenzhu25 commented Mar 8, 2018

I suppose you mean something like this:

class RedisClusterNode{
    private RedisURI uri; // for ip and port got from redis
    private Set<RedisURI> aliases; // for public domain host and port
}

Then in node topology views of type Map<RedisURI, Partitions>, we can find partition views by using either uri or aliases.

But I don't think this could distinguish a normal node from load balance node, i suggest to add one bool to indicate whether a RedisURI is load balancer, then we skip replacing self. What do you think?

warrenzhu25 pushed a commit to warrenzhu25/lettuce-core that referenced this issue Mar 9, 2018
@warrenzhu25
Copy link
Author

Please take a look at my PR. Automatic triggered CI failed some tests, but when I tried to rebuild it, it passed. I'm confused with build difference between pr and push.

@mp911de
Copy link
Collaborator

mp911de commented Mar 9, 2018

TravisCI sometimes runs into issues. The changes you introduced are not related to adding an alias/aliases within the nodes of a partition but rather you're changing the URI.

@warrenzhu25
Copy link
Author

Yes, please check my below comment. I think introducing alias won't solve this issue. But I could help do this if this is necessary for Docker/NATted use

mp911de added a commit that referenced this issue Mar 19, 2018
Lettuce now provides a cluster node alias to better cope with load balancers or alternate cluster names. We are using the Redis reported hostname as URI (was previously the seed node URI) and add the seed node URI as alias. This way we make sure to connect the appropriate host directly without connecting the balancer and receiving redirects.

Redirects can happen in an environment where the load balancer forwards connections to random cluster master nodes.
mp911de added a commit that referenced this issue Mar 19, 2018
Lettuce now provides a cluster node alias to better cope with load balancers or alternate cluster names. We are using the Redis reported hostname as URI (was previously the seed node URI) and add the seed node URI as alias. This way we make sure to connect the appropriate host directly without connecting the balancer and receiving redirects.

Redirects can happen in an environment where the load balancer forwards connections to random cluster master nodes.
@mp911de
Copy link
Collaborator

mp911de commented Mar 19, 2018

That's in place now, snapshot builds are available. Care to give it a spin?

@mp911de mp911de added this to the Lettuce 4.5.0 milestone Mar 19, 2018
@mp911de mp911de closed this as completed Mar 21, 2018
@warrenzhu25
Copy link
Author

Worked perfectly. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: feature A new feature
Projects
None yet
Development

No branches or pull requests

2 participants