
fix: endless matching loop #199

Merged: 1 commit merged into master on Jun 28, 2024

Conversation

@ydylla (Collaborator) commented Jun 6, 2024

@mholt Sadly it is not over yet. I have now also seen a runaway caddy with max CPU load under normal traffic. Timeline-wise it cannot be the same issue as #193, though, since I could trace this one back to a bug in the fix for #194.

It happens, for example, with the proxy_protocol matcher and handler. If we receive a proxy protocol header from a disallowed source, the matcher matches, but the handler does not consume any of the prefetched bytes. Since the handler is not terminal, we jump back to the beginning of the matching loop and match the same matcher/handler combo again. It never stops, because we never call cx.prefetch again: the isFirstPrefetch = false statement is unreachable in this situation (it is in the wrong place). It must always be set to false after the first iteration.
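
For illustration, here is a minimal Go sketch of the loop shape described above. It is a hedged reconstruction, not the actual routes.go code: only the names cx.prefetch and isFirstPrefetch come from this PR; the types and the exact position of the flag reset are assumptions.

// Hypothetical stand-ins for the layer4 types, for illustration only.
type conn interface{ prefetch() error }

type route interface {
	Match(cx conn) (bool, error)
	Handle(cx conn) error
	Terminal() bool
}

// matchLoop sketches the buggy (pre-fix) loop shape: prefetch is skipped on
// the first pass because the server already prefetched before routing.
func matchLoop(cx conn, routes []route) error {
	isFirstPrefetch := true
loop:
	for {
		if !isFirstPrefetch {
			if err := cx.prefetch(); err != nil {
				return err
			}
		}
		for _, r := range routes {
			matched, err := r.Match(cx)
			if err != nil {
				return err
			}
			if !matched {
				continue
			}
			if err := r.Handle(cx); err != nil {
				return err
			}
			if r.Terminal() {
				return nil
			}
			// Matched and handled, but not terminal: jump back to the start
			// of the matching loop. The flag reset below is skipped, so
			// cx.prefetch is never called again and the loop spins forever
			// when the handler consumed none of the prefetched bytes.
			continue loop
		}
		// Only reached when no route matched; in the scenario above this
		// line is unreachable, which is the misplacement this PR fixes.
		isFirstPrefetch = false
	}
}

Moving the flag reset so it runs unconditionally after the first pass (or dropping the flag entirely, as this patch does) breaks the cycle.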

You can reproduce it with this config:

{
  "admin": {
    "disabled": true
  },
  "logging": {
    "logs": {
      "default": {"level":"DEBUG", "encoder": {"format":"console"}}
    }
  },
  "apps": {
    "layer4": {
      "servers": {
        "loop": {
          "listen": ["0.0.0.0:8080"],
          "routes": [
            {
              "match": [
                {"proxy_protocol": {}}
              ],
              "handle": [
                {
                  "handler": "proxy_protocol",
                  "allow": ["1.1.1.1/32"]
                }
              ]
            }
          ]
        }
      }
    }
  }
}

and echo "PROXY TCP4 255.255.255.255 255.255.255.255 65535 65535" | nc 127.0.0.1 8080.
The connections opened by this command will never finish and will fully utilize one core.

I also switched from an endless loop to a limited for loop, in the hope that this prevents future bugs similar to this one. The 10k iteration limit was chosen by gut feeling; I have no data on how many prefetch calls are realistically needed in practice. If you don't like it, we can also keep the endless for loop.
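
The bounded variant might look roughly like the sketch below (again hedged: it reuses the hypothetical types from the sketch above and assumes the standard errors import; only the 10000 limit and the exhaustion error message come from this PR and its diff, the constant name is made up).

// maxPrefetchCalls is a hypothetical name for the 10k safety limit.
const maxPrefetchCalls = 10000

func matchLoopBounded(cx conn, routes []route) error {
loop:
	for i := 0; i < maxPrefetchCalls; i++ {
		// Every iteration after the first reads more data, which makes the
		// easy-to-misplace isFirstPrefetch bool unnecessary.
		if i > 0 {
			if err := cx.prefetch(); err != nil {
				return err
			}
		}
		for _, r := range routes {
			matched, err := r.Match(cx)
			if err != nil {
				return err
			}
			if !matched {
				continue
			}
			if err := r.Handle(cx); err != nil {
				return err
			}
			if r.Terminal() {
				return nil
			}
			continue loop
		}
	}
	// A future bug that keeps matching without consuming or reading data now
	// ends here instead of spinning at full CPU forever.
	return errors.New("number of prefetch calls exhausted")
}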

@ydylla requested a review from @mholt on June 6, 2024 at 22:13

@mholt (Owner) left a comment

I also switched from an endless loop to a limited for loop, in the hope that this prevents future bugs similar to this one.

Excellent. I was thinking as I read your post that this would be a good idea. 👍

The 10k iteration limit was chosen by gut feeling; I have no data on how many prefetch calls are realistically needed in practice.

I'm guessing that's needlessly high, but maybe let's leave it for now, and if things are working well, we can try reducing it.

Thanks so much for the patch. It's quite elegant!

Review thread on layer4/routes.go (resolved):
@@ -184,5 +183,8 @@ func (routes RouteList) Compile(logger *zap.Logger, matchingTimeout time.Duration
}
}
}

logger.Error("matching connection", zap.String("remote", cx.RemoteAddr().String()), zap.Error(errors.New("number of prefetch calls exhausted")))
return nil
@mholt (Owner):

I wonder if we should return the error here. But this is also probably fine. Any thoughts?

@ydylla (Collaborator, Author):

It's that way because I wanted to differentiate errors during matching from errors from the handlers: "matching connection" vs. "handling connection". But I agree it looks bad/wrong.
We could swap the positions: log "matching connection" in server.go#L168 and "handling connection" in routes.go#L173. Then the HandlerFunc in Compile could return errors directly everywhere.
But the logic for logging ErrDeadlineExceeded only at warning level should probably be kept in the compiled function. Otherwise it would have to be duplicated in all other places that call Handle on the compiled route, like the listener wrapper or the subroute handler.
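
As a side note, here is a hedged sketch of that warning-level special case kept in one place (errors.Is, os.ErrDeadlineExceeded, and the zap calls are real APIs; the helper name, messages, and placement are assumptions for illustration).

import (
	"errors"
	"os"

	"go.uber.org/zap"
)

// logMatchingError keeps the deadline special case next to the compiled
// route, so the server, the listener wrapper, and the subroute handler do
// not each have to duplicate it.
func logMatchingError(logger *zap.Logger, remote string, err error) {
	if errors.Is(err, os.ErrDeadlineExceeded) {
		// Hitting the matching timeout usually just means a slow or idle
		// client, so it is only logged as a warning.
		logger.Warn("matching connection", zap.String("remote", remote), zap.Error(err))
		return
	}
	logger.Error("matching connection", zap.String("remote", remote), zap.Error(err))
}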

@mholt (Owner):

Yeah, so it's in a function that acts as a handler representing the entire route, which is both matching and handling. That's a good question, I guess, whether we should treat their errors separately or differently. A matcher should only return an error if a decision (true/false) couldn't be reached; that should be rare, and notable, since it indicates a problem with the connection. Maybe we should treat their errors equally?

@ydylla (Collaborator, Author):

Sorry for the delayed response. For now, I would leave it as it is. I think the different messages are useful to differentiate errors during the routing/matching phase from caddy actually knowing what to do with the connection. Errors during matching are likely caused by clients sending garbage, disconnecting early, or configuration problems, whereas errors during handling likely indicate real problems for real/valid users.

@WeidiDeng (Contributor):

The arbitrary loop limit is the wrong approach to this bug, see here.

The reason your example loops is the weird handler chain that gets set up. When provisioning, you use the nopNextHandler, which just hands the connection to the next provided handler, which is this thing here, so it just loops and loops with no way to progress. Unlike nopHandler, this type handles the connection in a hacky way. Ideally we should use the correct fallback handler in this case.

The reason this if is not executed is that the buffer is not fully consumed in the first place: the proxy_protocol matcher only needs a few bytes to determine the type, and in the handler this data is not used at all.

@ydylla (Collaborator, Author) commented Jun 27, 2024

The arbitrary loop limit is the wrong approach to this bug

The limit is not the fix for this bug. The isFirstPrefetch = false statement was in the wrong place, so a fix would also have been possible by just moving that line. The limited for loop is only there to prevent future bugs similar to this one (and it looks nicer without the bool flag).

@WeidiDeng (Contributor) commented Jun 28, 2024

In your example, the connection is handled by the handler; the handler just didn't do anything with it (a no-op). What I'm trying to say is that your approach doesn't handle this type of case. Did the matcher return false, nil because it requires more data, or had the match already failed by that point? The second pass is unnecessary because the connection was already handled by the handler, but your code can't distinguish that case.

@ydylla (Collaborator, Author) commented Jun 28, 2024

In your example, the connection is handled by the handler; the handler just didn't do anything with it (a no-op)

The config in the initial comment does not make much sense in general, since the proxy_protocol handler is not a terminal handler. It was just a minimal reproducer.

@mholt Can we please merge this PR? Like I said in the initial comment, this is just a fix for a rather critical bug I accidentally introduced while fixing #194. This bug has been in master far too long already, and it has nothing to do with the changed behavior of the listener wrapper (let's discuss that in #207).

@mholt (Owner) commented Jun 28, 2024

@ydylla Absolutely, sorry I lost track of things with summertime... will merge :) And we can iterate on this to improve it if needed, no problem. Thanks for your dedication to this!! It's super awesome! (Also I appreciate your feedback @WeidiDeng )

@mholt merged commit ca3e2f3 into master on Jun 28, 2024
6 checks passed
@mholt deleted the fix-endless-loop branch on June 28, 2024 at 16:36
@WeidiDeng (Contributor) commented Jun 28, 2024

In your example, the connection is handled by the handler; the handler just didn't do anything with it (a no-op)

The config in the initial comment does not make much sense in general, since the proxy_protocol handler is not a terminal handler. It was just a minimal reproducer.

But you didn't fix the underlying problem; adding an iteration counter is a hack. The proxy protocol handler is called in the example, but you don't acknowledge that fact and still retry the match loop. That's the underlying problem.

@ydylla (Collaborator, Author) commented Jun 28, 2024

But you didn't fix the underlying problem; adding an iteration counter is a hack

Sorry, I can't follow you. Can you please provide a config/example that reproduces the problem you still see with this change?
