Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non recoverable state in withMultipleHostAndPort when consuls are blacklisted #388

Open
bt-wil opened this issue Dec 9, 2024 · 3 comments · May be fixed by #390
Open

non recoverable state in withMultipleHostAndPort when consuls are blacklisted #388

bt-wil opened this issue Dec 9, 2024 · 3 comments · May be fixed by #390
Assignees
Labels
bug Something isn't working
Milestone

Comments

@bt-wil
Copy link

bt-wil commented Dec 9, 2024

When using the withMultipleHostAndPort, there is a blacklistTime duration. I would expect that the bad host that was blacklisted would be retried after the duration is over, but it is not the case.

import com.google.common.net.HostAndPort;

import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;

public class CheckConsul {
    public static void main(String[] args) {
        var consul = Consul.builder().withPing(false).withMultipleHostAndPort(
                        Arrays.asList(
                                HostAndPort.fromString("a.foo.bar.baz:8500"),
                                HostAndPort.fromString("b.foo.bar.baz:8500")
                        ),
                        Duration.ofMillis(2000)
                )
                .build();

        var client = consul.keyValueClient();

        for (int i = 0; i < 10; i++) {
            try {
                var res = client.getValueAsString("foo").orElse("");
                System.out.println(Instant.now() + " Result: " + res);
            } catch (Exception e) {
                System.out.println(Instant.now() + " Error: " + e);
            }
            System.out.println("Sleeping");
            try {
                Thread.sleep(1000);
            } catch (InterruptedException ignored) {
            }
        }
    }
}


2024-12-09T19:23:43.837941Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-09T19:23:44.846135Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
2024-12-09T19:23:45.851187Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
2024-12-09T19:23:46.855286Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
2024-12-09T19:23:47.861092Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
2024-12-09T19:23:48.866362Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
2024-12-09T19:23:49.871131Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
2024-12-09T19:23:50.878096Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
2024-12-09T19:23:51.884974Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
2024-12-09T19:23:52.893066Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping

The issue is in https://github.com/kiwiproject/consul-client/blob/v1.4.2/src/main/java/org/kiwiproject/consul/util/failover/ConsulFailoverInterceptor.java#L94 where the viability is checked, but not take into account of the blacklist timeout. And the resetting of the timeout only happens when inside isRequestViable:

@sleberknight sleberknight added the bug Something isn't working label Dec 12, 2024
@sleberknight sleberknight self-assigned this Dec 12, 2024
@sleberknight
Copy link
Member

I did some initial investigation into this and it definitely appears to be a 🐛 .

The BlacklistingConsulFailoverStrategy#findTargetNotInBlacklist method is the only place where the target HostAndPort is removed from the blacklist if it has expired. And findTargetNotInBlacklist is only called by BlacklistingConsulFailoverStrategy#computeNextStage.

In ConsulFailoverInterceptor#intercept, computeNextStage is only called within the while loop.

So, if all targets have been blacklisted, the conditional on line 94 in #intercept ( strategy.isRequestViable(originalRequest) ) will always return false and none of the targets can ever get out of the blacklist.

After some hacking around, adding lots of logging, moving some code around, I added code in isRequestViable to check if a target should be removed from the blacklist. It is basically the same code as in findTargetNotInBlacklist, though I extracted it to a method and call that in both of the methods.

After making that change the logging shows targets getting blacklisted, and then coming out of the blacklist after the expiration period. Then they go back into the blacklist, and come out, etc. (Since a.foo.bar.baz and b.foo.bar.baz are obviously never going to work.)

With the changes to check for expiration in both places, and adding a print statement when a target is removed from the blacklist, running the sample application provided by @bt-wil results in:

2024-12-12T22:16:37.737459630Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:16:38.738873267Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
Removing a.foo.bar.baz:8500 from blacklist
Removing b.foo.bar.baz:8500 from blacklist
2024-12-12T22:16:39.741584664Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:16:40.742279470Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
Removing a.foo.bar.baz:8500 from blacklist
Removing b.foo.bar.baz:8500 from blacklist
2024-12-12T22:16:41.745005947Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:16:42.745712284Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
Removing a.foo.bar.baz:8500 from blacklist
Removing b.foo.bar.baz:8500 from blacklist
2024-12-12T22:16:43.748126911Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:16:44.748854388Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
Removing a.foo.bar.baz:8500 from blacklist
Removing b.foo.bar.baz:8500 from blacklist
2024-12-12T22:16:45.751210854Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:16:46.752042971Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping

You can see the errors now alternate between "Unable to successfully determine a viable host for communication" and "Consul failover strategy has determined that there are no viable hosts remaining" as the targets are added and removed from the blacklist. If I add a Logback configuration to log at WARN and above, we can also see the errors resulting from failed calls:

22:19:09.862 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:a.foo.bar.baz: Name or service not known' when connecting to http://a.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
22:19:09.875 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:b.foo.bar.baz: Name or service not known' when connecting to http://b.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
2024-12-12T22:19:09.876303777Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:19:10.877378795Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
Removing a.foo.bar.baz:8500 from blacklist
22:19:11.878 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:a.foo.bar.baz' when connecting to http://a.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
Removing b.foo.bar.baz:8500 from blacklist
22:19:11.879 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:b.foo.bar.baz' when connecting to http://b.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
2024-12-12T22:19:11.879693001Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:19:12.880436228Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
Removing a.foo.bar.baz:8500 from blacklist
22:19:13.881 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:a.foo.bar.baz' when connecting to http://a.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
Removing b.foo.bar.baz:8500 from blacklist
22:19:13.882 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:b.foo.bar.baz' when connecting to http://b.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
2024-12-12T22:19:13.882655915Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:19:14.883330722Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
Removing a.foo.bar.baz:8500 from blacklist
22:19:15.884 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:a.foo.bar.baz' when connecting to http://a.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
Removing b.foo.bar.baz:8500 from blacklist
22:19:15.885 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:b.foo.bar.baz' when connecting to http://b.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
2024-12-12T22:19:15.885776779Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:19:16.886599245Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping
Removing a.foo.bar.baz:8500 from blacklist
22:19:17.888 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:a.foo.bar.baz' when connecting to http://a.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
Removing b.foo.bar.baz:8500 from blacklist
22:19:17.888 [main] WARN  o.k.c.u.f.ConsulFailoverInterceptor - Got 'java.net.UnknownHostException:b.foo.bar.baz' when connecting to http://b.foo.bar.baz:8500/v1/kv/foo (enable DEBUG level to see stack trace)
2024-12-12T22:19:17.888953272Z Error: org.kiwiproject.consul.ConsulException: Unable to successfully determine a viable host for communication.
Sleeping
2024-12-12T22:19:18.889646999Z Error: org.kiwiproject.consul.ConsulException: Consul failover strategy has determined that there are no viable hosts remaining.
Sleeping

So I think this fixes the problem and may be a decent solution without completely changing the code.

When I get some additional time, I will clean things up, remove the absurd amount of logging I added, and create a PR. It will probably not before Christmas 🎄 though.

@bt-wil
Copy link
Author

bt-wil commented Dec 12, 2024

I added code in isRequestViable to check if a target should be removed from the blacklist.

It is similar to what I had did to fix the issue, but didn't spend too much on checking whether there could be some side effect.

diff --git a/src/main/java/org/kiwiproject/consul/util/failover/strategy/BlacklistingConsulFailoverStrategy.java b/src/main/java/org/kiwiproject/consul/util/failover/strategy/BlacklistingConsulFailoverStrategy.java
index 9928a77..ec14137 100644
--- a/src/main/java/org/kiwiproject/consul/util/failover/strategy/BlacklistingConsulFailoverStrategy.java
+++ b/src/main/java/org/kiwiproject/consul/util/failover/strategy/BlacklistingConsulFailoverStrategy.java
@@ -47,6 +47,7 @@ public class BlacklistingConsulFailoverStrategy implements ConsulFailoverStrateg

         // Create a host and port
         final HostAndPort initialTarget = fromRequest(previousRequest);
+        HostAndPort next = initialTarget;

         // If the previous response failed, disallow this request from going through.
         // A 404 does NOT indicate a failure in this case, so it should never blacklist the previous target.
@@ -56,31 +57,22 @@ public class BlacklistingConsulFailoverStrategy implements ConsulFailoverStrateg

         // Handle the case when the blacklist contains the target we care about
         if (blacklist.containsKey(initialTarget)) {
-
             // Find the first entity that doesn't exist in the blacklist
             Optional<HostAndPort> optionalNext = findTargetNotInBlacklist();

             if (optionalNext.isEmpty()) {
                 return Optional.empty();
             }
-            HostAndPort next = optionalNext.get();
-
-            // Construct the next URL using the old parameters (ensures we don't have to do
-            // a copy-on-write
-            final HttpUrl nextURL = previousRequest.url().newBuilder().host(next.getHost()).port(next.getPort()).build();
-
-            // Return the result
-            return Optional.of(previousRequest.newBuilder().url(nextURL).build());
-        } else {
-
-            // Construct the next URL using the old parameters (ensures we don't have to do
-            // a copy-on-write
-            final HttpUrl nextURL = previousRequest.url().newBuilder().host(initialTarget.getHost()).port(initialTarget.getPort()).build();

-            // Return the result
-            return Optional.of(previousRequest.newBuilder().url(nextURL).build());
+            next = optionalNext.get();
         }

+        // Construct the next URL using the old parameters (ensures we don't have to do
+        // a copy-on-write
+        final HttpUrl nextURL = previousRequest.url().newBuilder().host(next.getHost()).port(next.getPort()).build();
+
+        // Return the result
+        return Optional.of(previousRequest.newBuilder().url(nextURL).build());
     }

     /**
@@ -91,34 +83,40 @@ public class BlacklistingConsulFailoverStrategy implements ConsulFailoverStrateg
      */
     private Optional<HostAndPort> findTargetNotInBlacklist() {
         return targets.stream().filter(target -> {
+            Instant blacklistedAt = blacklist.get(target);

-            // If we have blacklisted this key
-            if (blacklist.containsKey(target)) {
-
-                // Get when we blacklisted this key
-                final Instant blacklistedAt = blacklist.get(target);
-
-                // If the duration between the blacklist time ("then") and "now" is greater than the
-                // timeout duration (Duration(then, now) - timeout < 0), then remove the blacklist entry
-                if (Duration.between(blacklistedAt, Instant.now()).minusMillis(timeout).isNegative()) {
-                    return false;
-                } else {
-                    blacklist.remove(target);
-                    return true;
-                }
-            } else {
+            if (blacklistedAt == null) {
                 return true;
+            } else if (isPastBlacklistDuration(target)) {
+                // Remove the target from blacklist if duration passed.
+                blacklist.remove(target);
+                return true;
+            } else {
+                return false;
             }
         }).findAny();
     }

+    private boolean isPastBlacklistDuration(HostAndPort hostAndPort) {
+        Instant blacklistedAt = blacklist.get(hostAndPort);
+
+        if (blacklistedAt == null) {
+            return true;
+        }
+
+        // If the duration between the blacklist time ("then") and "now" is greater than the
+        // timeout duration (Duration(then, now) - timeout < 0)
+        return !Duration.between(blacklistedAt, Instant.now()).minusMillis(timeout).isNegative();
+    }
+
     private static boolean previousResponseFailedAndWasNot404(Response previousResponse) {
         return nonNull(previousResponse) && !previousResponse.isSuccessful() && previousResponse.code() != 404;
     }

     @Override
     public boolean isRequestViable(@NonNull Request current) {
-        return (targets.size() > blacklist.size()) || !blacklist.containsKey(fromRequest(current));
+        HostAndPort hostAndPort = fromRequest(current);
+        return (targets.size() > blacklist.size()) || !blacklist.containsKey(hostAndPort) || isPastBlacklistDuration(hostAndPort);
     }

     @Override

updated: not have to create hostAndPort when not necessary.

     @Override
     public boolean isRequestViable(@NonNull Request current) {
-        return (targets.size() > blacklist.size()) || !blacklist.containsKey(fromRequest(current));
+        if (targets.size() > blacklist.size()) {
+            return true;
+        }
+        HostAndPort hostAndPort = fromRequest(current);
+        return !blacklist.containsKey(hostAndPort) || isPastBlacklistDuration(hostAndPort);

@sleberknight
Copy link
Member

Ok sounds like we're on the same page. I also don't know if there might be some side-effect since I wasn't the original author of this library. But since we both did similar things, that makes me feel better about it.

I will try to find some time "soon" and push up a PR.

@sleberknight sleberknight added this to the 1.4.3 milestone Dec 17, 2024
sleberknight added a commit that referenced this issue Dec 20, 2024
* Factor duplicate logic from if and else branches in computeNextStage.
* Extract blacklist expiration duration logic from findTargetNotInBlacklist
  to new method isPastBlacklistDuration.
* Change isRequestViable to call findTargetNotInBlacklist. Because
  findTargetNotInBlacklist will remove a target from the blacklist if
  past the expiration, this now ensures a correct result. If the target
  is not blacklisted, then the request is viable. If it or any other target
  is blacklisted but past the expiration, that target is considered viable
  and removed from the blacklist. Only if all targets are currently
  in blacklist and none have expired is the request considered not viable.
* Create addToBlacklist method to replace duplicate blacklist.put calls.
* Create isBlacklisted method to check if a target is in blacklist.
* Create isNotBlacklisted method to check if a target is not in blacklist.
* Rename fromRequest to hostAndPortFromRequest for clarity.
* Renamed the blacklistingConsulFailoverStrategy field in the test
  to just "strategy" since it is much shorter, and because the test
  is only related to the BlacklistingConsulFailoverStrategy.
* Add more tests. Move the original tests inside a nested class
  ComputeNextStage. The new tests are nested inside their
  own nested classes for the various scenarios. I kept the tests for
  blacklsit expiration for computeNextStage in a separate nested
  class, ComputeNextStageExpiration, mainly b/c they are new and
  are specifically for testing the expiration logic. Also added dedicated
  tests for isRequest and the (internal) isPastBlacklistDuration method.

Fixes #388
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants