Consul 0.8.x LAN servers attempting to connect to each other using TCP/8302 #3058

Closed
agy opened this issue May 18, 2017 · 3 comments
Labels
type/docs Documentation needs to be created/updated/clarified

Comments

@agy
Contributor

agy commented May 18, 2017

Description of the Issue (and unexpected/desired result)

While testing upgrades of my Consul servers I noticed the following error message periodically occurring:

    2017/05/18 21:03:12 [DEBUG] memberlist: Failed to join 10.80.146.6: dial tcp 10.80.146.6:8302: i/o timeout
    2017/05/18 21:03:12 [DEBUG] consul: Failed to flood-join "ip-10-80-146-6" at 10.80.146.6: 1 error(s) occurred:

    * Failed to join 10.80.146.6: dial tcp 10.80.146.6:8302: i/o timeout

These servers are all on the same LAN (no WAN connections have been configured), and AWS security groups act as firewall rules that block TCP/8302 connections between them. The "Ports Used" documentation indicates that this port should only be used for WAN traffic.
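To see which peers are affected, the join failures can be pulled out of the agent log. The following is a minimal sketch, assuming the log lines keep the exact shape shown above; the `failed_joins` helper and its regular expression are hypothetical and are not part of Consul.

```python
import re

# Matches memberlist join failures shaped like the log line above, e.g.
#   memberlist: Failed to join 10.80.146.6: dial tcp 10.80.146.6:8302: i/o timeout
# Assumption: IPv4 addresses and this exact message layout.
JOIN_FAILURE = re.compile(
    r"memberlist: Failed to join \S+: dial tcp "
    r"(?P<addr>[\d.]+):(?P<port>\d+): (?P<reason>.+)$"
)


def failed_joins(log_lines, port=8302):
    """Return (peer_ip, reason) pairs for join failures on the given TCP port."""
    results = []
    for line in log_lines:
        match = JOIN_FAILURE.search(line.rstrip())
        if match and int(match.group("port")) == port:
            results.append((match.group("addr"), match.group("reason")))
    return results


sample = [
    "2017/05/18 21:03:12 [DEBUG] memberlist: Failed to join 10.80.146.6: "
    "dial tcp 10.80.146.6:8302: i/o timeout",
    '2017/05/18 21:03:12 [DEBUG] consul: Failed to flood-join "ip-10-80-146-6" '
    "at 10.80.146.6: 1 error(s) occurred:",
]
print(failed_joins(sample))  # [('10.80.146.6', 'i/o timeout')]
```

Only the `memberlist` line is matched; the `consul: Failed to flood-join` summary line is skipped on purpose so each failure is counted once.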

Using ss I can confirm that the connections are being attempted.

ubuntu@ip-10-229-17-230:~$ sudo ss -ntp
State       Recv-Q Send-Q                                             Local Address:Port                                                            Peer Address:Port
SYN-SENT    0      1                                                  10.229.17.230:38380                                                           10.244.1.158:8302                users:(("consul",pid=12398,fd=15))
[...]

Since I'm testing upgrades, the consul members output looks like this:

ubuntu@ip-10-229-17-230:~$ ./consul members
    2017/05/18 21:16:03 [DEBUG] http: Request GET /v1/agent/members (111.235µs) from=127.0.0.1:59698
Node              Address             Status  Type    Build  Protocol  DC
ip-10-229-17-230  10.229.17.230:8301  alive   server  0.8.3  2         us-west-1-staging
ip-10-244-1-158   10.244.1.158:8301   alive   server  0.7.5  2         us-west-1-staging
ip-10-80-146-6    10.80.146.6:8301    alive   server  0.7.5  2         us-west-1-staging
ubuntu@ip-10-229-17-230:~$ ./consul members -wan
Node                                Address             Status  Type    Build  Protocol  DC
ip-10-229-17-230.us-west-1-staging  10.229.17.230:8302  alive   server  0.8.3  2         us-west-1-staging

I have tested the Consul binaries from 0.8.3 down to 0.7.5 and can confirm that this behaviour was introduced in 0.8.0.

Basic connectivity checks:

ubuntu@ip-10-229-17-230:~$ nc -w 2 -vz 10.244.1.158 8302
nc: connect to 10.244.1.158 port 8302 (tcp) timed out: Operation now in progress

ubuntu@ip-10-229-17-230:~$ nc -w 2 -vz 10.244.1.158 8301
Connection to 10.244.1.158 8301 port [tcp/*] succeeded!

ubuntu@ip-10-229-17-230:~$ nc -w 2 -vz 10.244.1.158 8300
Connection to 10.244.1.158 8300 port [tcp/*] succeeded!
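The same TCP checks can be scripted so that all three ports are probed in one pass. This is a rough sketch using only Python's standard `socket` module; the `tcp_port_open` and `check_ports` helper names are made up for illustration, and the 2-second timeout mirrors `nc -w 2`. It checks TCP only, so it says nothing about UDP reachability.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


def check_ports(host, ports=(8300, 8301, 8302), timeout=2.0):
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Peer address taken from the checks above.
    for port, is_open in check_ports("10.244.1.158").items():
        print(port, "open" if is_open else "blocked or closed")
```

Given the `nc` results above, 8300 and 8301 would be expected to report open and 8302 blocked, until the security group is changed.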

What I expect:

The Consul LAN servers should not attempt to connect to each other using TCP/8302.

consul version for Server

Server: Consul v0.8.3

consul info for Server

Server:

ubuntu@ip-10-229-17-230:~$ ./consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 10.244.1.158:8300
	server = true
raft:
	applied_index = 272
	commit_index = 272
	fsm_pending = 0
	last_contact = 36.905173ms
	last_log_index = 272
	last_log_term = 7
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:10.244.1.158:8300 Address:10.244.1.158:8300} {Suffrage:Voter ID:10.229.17.230:8300 Address:10.229.17.230:8300} {Suffrage:Voter ID:10.80.146.6:8300 Address:10.80.146.6:8300}]
	latest_configuration_index = 166
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 7
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 75
	max_procs = 1
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 4
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 19
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

Three-node test cluster, all running Ubuntu 16.04.

The servers are started with:

./consul agent -server -bootstrap-expect 3 -config-dir consul.d

Example of one of the Consul server config files (redacted):

$ cat consul.d/*json
{
  "acl_enforce_version_8": false,
  "ui": true
}
{
  "client_addr": "127.0.0.1",
  "data_dir": "/mnt/consul",
  "start_join": [
    "10.80.146.6",
    "10.229.17.230",
    "10.244.1.158"
  ],
  "recursor": "8.8.8.8",
  "dogstatsd_addr": "127.0.0.1:8125",
  "leave_on_terminate": true,
  "acl_datacenter": "us-west-1-staging",
  "acl_default_policy": "allow",
  "acl_down_policy": "extend-cache",
  "acl_master_token": "REDACTED",
  "encrypt": "REDACTED",
  "bind_addr": "0.0.0.0",
  "datacenter": "us-west-1-staging",
  "log_level": "trace",
  "enable_syslog": true,
  "syslog_facility": "LOCAL7"
}
{
  "node_name": "ip-10-229-17-230",
  "advertise_addr": "10.229.17.230"
}

Reproduction steps

  1. Create a simple 3-node Consul cluster running 0.7.5 with firewall (or security group) rules preventing the server nodes from connecting to each other on TCP/8302.
  2. Update one of the servers to 0.8.3.
  3. Notice that the 0.8.3 server attempts to connect to its peers using TCP/8302 and fails.

Please let me know whether this is expected or a config error on my part, or if you require any further information.

@agy
Contributor Author

agy commented May 18, 2017

To follow up.

When I allow TCP/8302 between the servers, I see log messages like the following:

    2017/05/18 21:35:06 [DEBUG] memberlist: Initiating push/pull sync with: 10.80.146.6:8302
    2017/05/18 21:35:06 [INFO] serf: EventMemberJoin: ip-10-244-1-158.us-west-1-staging 10.244.1.158
    2017/05/18 21:35:06 [INFO] serf: EventMemberJoin: ip-10-80-146-6.us-west-1-staging 10.80.146.6
    2017/05/18 21:35:06 [DEBUG] consul: Successfully performed flood-join for "ip-10-80-146-6" at 10.80.146.6:8302
    2017/05/18 21:35:06 [INFO] consul: Handled member-join event for server "ip-10-244-1-158.us-west-1-staging" in area "wan"
    2017/05/18 21:35:06 [INFO] consul: Handled member-join event for server "ip-10-80-146-6.us-west-1-staging" in area "wan"
    2017/05/18 21:35:11 [DEBUG] memberlist: Stream connection from=10.80.146.6:42764
    2017/05/18 21:35:13 [DEBUG] memberlist: Failed ping: ip-10-80-146-6.us-west-1-staging (timeout reached)
    2017/05/18 21:35:14 [DEBUG] memberlist: Stream connection from=10.80.146.6:42768
    2017/05/18 21:35:15 [WARN] memberlist: Was able to connect to ip-10-80-146-6.us-west-1-staging but other probes failed, network may be misconfigured

This is really confusing since I did not specifically configure WAN networking.

consul info

ubuntu@ip-10-229-17-230:~$ ./consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 1
build:
	prerelease =
	revision = ea2a82b
	version = 0.8.3
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.229.17.230:8300
	server = true
raft:
	applied_index = 498
	commit_index = 498
	fsm_pending = 0
	last_contact = 0
	last_log_index = 498
	last_log_term = 10
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:10.244.1.158:8300 Address:10.244.1.158:8300} {Suffrage:Voter ID:10.229.17.230:8300 Address:10.229.17.230:8300} {Suffrage:Voter ID:10.80.146.6:8300 Address:10.80.146.6:8300}]
	latest_configuration_index = 166
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 10
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 85
	max_procs = 1
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = true
	event_queue = 0
	event_time = 5
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 22
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 3
	query_queue = 0
	query_time = 1

@slackpad
Contributor

Hi @agy, this is working as designed for the new "WAN flood join" feature (from the changelog):

WAN Join Flooding: A new routine was added that looks for Consul servers in the LAN and makes sure that they are joined into the WAN as well. This catches up newly-added servers onto the WAN as soon as they join the LAN, keeping them in sync automatically. [GH-2801]

If you use the WAN at all this is always what you want, though it has confused some folks who don't use it. We hesitated to add extra config complexity, but we are listening to feedback on this. We should update the documentation to make the use of port 8302 clearer, though.
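As a rough illustration of the idea in the changelog entry (not Consul's actual implementation, which is written in Go), the flood join can be thought of as "find LAN servers missing from the WAN pool and join them on the WAN port". The function name, arguments, and data structures below are hypothetical; the node names, addresses, and the `<node>.<datacenter>` WAN naming come from the `consul members` output earlier in this issue.

```python
def flood_join_targets(lan_servers, wan_members, datacenter, wan_port=8302):
    """Return "ip:port" addresses of LAN servers not yet present in the WAN pool.

    lan_servers: dict mapping node name -> IP for servers seen on the LAN.
    wan_members: set of WAN member names, formatted "<node>.<datacenter>".
    """
    targets = []
    for node, ip in sorted(lan_servers.items()):
        if f"{node}.{datacenter}" not in wan_members:
            targets.append(f"{ip}:{wan_port}")
    return targets


# State of the 0.8.3 server before TCP/8302 was allowed:
# three servers on the LAN, only itself on the WAN.
lan = {
    "ip-10-229-17-230": "10.229.17.230",
    "ip-10-244-1-158": "10.244.1.158",
    "ip-10-80-146-6": "10.80.146.6",
}
wan = {"ip-10-229-17-230.us-west-1-staging"}

print(flood_join_targets(lan, wan, "us-west-1-staging"))
# ['10.244.1.158:8302', '10.80.146.6:8302']
```

This lines up with the connection attempts to port 8302 on the two peers seen in the `ss` output and the debug log, even though no WAN join was configured explicitly.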

@slackpad slackpad added the type/docs Documentation needs to be created/updated/clarified label May 19, 2017
@agy
Contributor Author

agy commented May 19, 2017

@slackpad Thanks for the update. I read the release notes and the docs, and it wasn't clear to me that this change would have LAN machines connecting to each other on the WAN port. I do, however, understand the rationale.

This does seem to now be a documentation issue so I'm closing this specific bug report.

For those who come later: I added the following rules on my Consul servers to resolve the issue:

  • Allow TCP/8302 from all other Consul servers
  • Allow UDP/8302 from all other Consul servers
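For a larger cluster, the two rules above can be expanded into an explicit per-server list. This is only a sketch that produces plain tuples; it does not call any AWS or firewall API, and the `wan_gossip_rules` helper and its output shape are assumptions made for illustration.

```python
def wan_gossip_rules(server_ips, port=8302):
    """For each server, list inbound (protocol, port, source_ip) rules
    allowing TCP and UDP on the given port from every other server."""
    rules = {}
    for dest in server_ips:
        rules[dest] = [
            (proto, port, src)
            for src in server_ips
            if src != dest
            for proto in ("tcp", "udp")
        ]
    return rules


# Server addresses from this issue.
servers = ["10.229.17.230", "10.244.1.158", "10.80.146.6"]

for dest, inbound in wan_gossip_rules(servers).items():
    for proto, port, src in inbound:
        print(f"on {dest}: allow {proto}/{port} from {src}")
```

Each of the three servers ends up with four inbound rules: TCP and UDP on 8302 from each of its two peers.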
