Consul does not close TCP connections properly when blocking queries are aborted (connections hanging in FIN_WAIT-2) #8524

Open
fho opened this issue Aug 17, 2020 · 0 comments
Labels
theme/api Relating to the HTTP API interface type/bug Feature does not function as expected

fho commented Aug 17, 2020

Overview of the Issue

When a Consul client cancels a blocking HTTP query, the TCP connection is not closed correctly by the server.
The TCP connection stays in the FIN_WAIT-2 state until its tcp_fin_timeout expires.
FIN_WAIT-2 means that the client's FIN has been acknowledged and the client is still waiting for the server to send its own FIN.

When a lot of blocking queries are cancelled and retried at the same time, the server's http_max_conns_per_client limit can be hit, and further client queries will then fail.

I expect that the server closes the HTTP connection completely when a blocking query is aborted, and that no TCP connections are left when the client program terminates.

We ran into this issue in our grpc-consul-resolver (https://github.com/simplesurance/grpcconsulresolver).
A testcase opened and closed a lot of gRPC client connections in a short timeframe.
When a gRPC connection is closed, the blocking query of the grpcconsulresolver is cancelled by cancelling its context.
The TCP connections piled up until the Consul server did not accept further queries.
This issue could also be triggered in a production environment, for example when a lot of applications are redeployed in parallel and they all cancel Consul blocking queries.
It would also work as a DoS attack.

Reproduction Steps

With a Go client using the consul api package

  1. Run a consul agent (consul agent -dev -enable-script-checks -node=web -ui)
  2. Run the following Go program. It creates a new service entry, queries it once to get the waitIndex, and then loops, issuing a blocking query with that waitIndex and aborting it via the context.
package main

import (
	"context"
	"errors"
	"log"
	"time"

	consul "github.com/hashicorp/consul/api"
)

const consulAddr = "localhost:8500"
const consulServiceName = "testentry"

func main() {
	clt, err := consul.NewClient(&consul.Config{
		Address: consulAddr,
	})
	if err != nil {
		log.Fatal(err)
	}

	_, err = clt.Catalog().Register(&consul.CatalogRegistration{
		Node:    "localhost",
		Address: "127.0.0.1:1010",
		Service: &consul.AgentService{
			ID:      consulServiceName,
			Service: consulServiceName,
		},
	},
		nil,
	)
	if err != nil {
		log.Fatal(err)
	}

	log.Printf("registered consul service %s\n", consulServiceName)

	_, qm, err := clt.Health().Checks(consulServiceName, nil)
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; ; i++ {
		log.Printf("loop: %d\n", i)

		// Abort the blocking query after 2ms by letting the context expire.
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Millisecond)

		opts := (&consul.QueryOptions{}).WithContext(ctx)
		opts.WaitIndex = qm.LastIndex
		_, _, err = clt.Health().Checks(consulServiceName, opts)
		cancel() // release the context's timer once the query has returned
		if err != nil && !errors.Is(err, context.DeadlineExceeded) {
			log.Fatal(err)
		}

		time.Sleep(25 * time.Millisecond)
	}
}

After some time the consul query fails with an EOF error and the program terminates, because the consul server has reached its http_max_conns_per_client limit.

netstat or ss -atupn '( dport = :8500 )' will show a lot of TCP connections in FIN-WAIT-2 state.
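As an alternative to netstat or ss, the count can be tracked programmatically. The following is a rough sketch, assuming a Linux host where /proc/net/tcp lists IPv4 sockets with the state in the fourth column and 05 denoting FIN_WAIT2; countFinWait2 is a hypothetical helper. It ignores IPv6 sockets, which live in /proc/net/tcp6.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// finWait2 is the state code Linux uses for FIN_WAIT2 in /proc/net/tcp.
const finWait2 = "05"

// countFinWait2 counts IPv4 connections in FIN_WAIT2 whose remote port is
// port, given the contents of /proc/net/tcp.
func countFinWait2(procNetTCP string, port uint16) int {
	count := 0
	for _, line := range strings.Split(procNetTCP, "\n") {
		// Columns: "sl:", local_address, rem_address, st, ...
		fields := strings.Fields(line)
		if len(fields) < 4 || fields[3] != finWait2 {
			continue // header line, short line, or a different state
		}
		idx := strings.LastIndexByte(fields[2], ':')
		if idx < 0 {
			continue
		}
		// The port is the hexadecimal part after the colon.
		p, err := strconv.ParseUint(fields[2][idx+1:], 16, 16)
		if err != nil || uint16(p) != port {
			continue
		}
		count++
	}
	return count
}

func main() {
	data, err := os.ReadFile("/proc/net/tcp")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/net/tcp:", err)
		return
	}
	fmt.Println("FIN_WAIT2 connections to :8500:", countFinWait2(string(data), 8500))
}
```

Running this in a loop next to the reproduction program should show the number growing with each aborted query, if the behaviour described above holds.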

With Curl

  1. Run a consul agent (consul agent -dev -enable-script-checks -node=web -ui)
  2. Create a service entry:
    curl --request PUT --data '{ "Address": "localhost", "Node": "web",  "Service": { "Service": "testentry"  } }' http://localhost:8500/v1/catalog/register
  3. Query the service entry to get the waitIndex:
    curl localhost:8500/v1/catalog/service/testentry | jq '.[0]."CreateIndex"'
  4. List the TCP connections to consul: ss -atupn '( dport = :8500 )' (or run it via watch to monitor it)
  5. Do a blocking query with the retrieved waitIndex and a short timeout:
     curl --max-time 1 'localhost:8500/v1/catalog/service/testentry?index=<WAITINDEX>'
  6. List the TCP connections to consul again: ss -atupn '( dport = :8500 )'. For every aborted curl query, a FIN-WAIT-2 TCP connection appears.

Operating system and Environment details

  • Linux 5.4.52
  • Consul server 1.8.3
  • Consul go package 1.8.3
  • Go test program compiled with go 1.14.5
  • curl 7.71.1
  • /proc/sys/net/ipv4/tcp_fin_timeout is 60