ensure online status and route changes are propagated #1564
Conversation
Force-pushed from 9082450 to 2624ea0
@kradalby thanks, I'll test it today
Thank you, the situation has improved (before this PR, nodes were marked as offline forever), but the nodes are still 'flapping' between online and offline states. I found a simple way to check this locally; it's enough to run:

version: "3.8"
services:
  ts01:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'XXXX'
  ts02:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'XXXX'

Leave it running for ~10 minutes; the nodes occasionally go 'offline'.
In the tailscale logs there are a lot of errors like this:
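For reference, a simple way to watch for the flapping from both sides while that compose setup is running (the watch loop is just a suggestion; the tailscale command matches the transcripts later in this thread, and `headscale nodes list` is assumed to be available in your build):

# on the headscale host: poll the node list and watch the online / last-seen information
watch -n 5 'headscale nodes list'
# inside one of the tailscale containers: poll the client's own view of its peers
watch -n 5 'tailscale --socket=/tmp/tailscaled.sock status'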
Am I correct to understand that the exit node/subnet router issue is "solved" once the node is correctly reported online? Or is that a second issue we will still have afterwards?
Some behaviors from the 1561-online-issue branch:
I am using the same configuration as with v0.22.3.
Force-pushed from 10ec28a to 8cad13d
@kradalby
error: builder for '/nix/store/al77c86nv3y90aay4gi2r3sxnvn0hrll-headscale-dev.drv' failed with exit code 1;
last 10 log lines:
> + Tags: []string{},
> + PrimaryRoutes: []netip.Prefix{s"192.168.0.0/24"},
> + LastSeen: s"2009-11-10 23:09:00 +0000 UTC",
> + MachineAuthorized: true,
> + ...
> + },
> )
> FAIL
> FAIL github.com/juanfont/headscale/hscontrol/mapper 0.109s
> FAIL
For full logs, run 'nix log /nix/store/al77c86nv3y90aay4gi2r3sxnvn0hrll-headscale-dev.drv'.
make: *** [Makefile:22: build] Error 1
Force-pushed from 8cad13d to 7e9bead
@joepa37 I forgot to update some tests, please try again.
From the last check I have updated the ACL policy with the following (sorry for not checking it before :/):

acls:
  - action: accept
    src: ["*"]
    dst: ["*:*"]

It seems the ACL configuration I have been using is not working any more.

/etc/headscale/acls.yaml

acls:
  # Mesh Network
  - action: accept
    src: [group:servers]
    dst: [group:servers:*]
  # Administrators can access all server resources
  - action: accept
    src: [group:admin]
    dst: [group:servers:*]
  # Developers can access other devs' HTTP apps
  - action: accept
    src: [group:devs]
    proto: tcp
    dst: ['group:devs:80,443']
  # All clients can access private HTTP apps
  - action: accept
    src: [group:servers, group:admin, group:devs]
    dst: [10.30.0.0/16:80, 10.30.0.0/16:443]
  # Exit Nodes
  - action: accept
    src: [group:admin]
    dst: [
      group:servers:0,
      0.0.0.0/5:*, 8.0.0.0/7:*, 11.0.0.0/8:*, 12.0.0.0/6:*, 16.0.0.0/4:*,
      32.0.0.0/3:*, 64.0.0.0/3:*, 96.0.0.0/6:*, 100.0.0.0/10:*, 100.128.0.0/9:*,
      101.0.0.0/8:*, 102.0.0.0/7:*, 104.0.0.0/5:*, 112.0.0.0/5:*, 120.0.0.0/6:*,
      124.0.0.0/7:*, 126.0.0.0/8:*, 128.0.0.0/3:*, 160.0.0.0/5:*, 168.0.0.0/6:*,
      172.0.0.0/12:*, 172.32.0.0/11:*, 172.64.0.0/10:*, 172.128.0.0/9:*, 173.0.0.0/8:*,
      174.0.0.0/7:*, 176.0.0.0/4:*, 192.0.0.0/9:*, 192.128.0.0/11:*, 192.160.0.0/13:*,
      192.169.0.0/16:*, 192.170.0.0/15:*, 192.172.0.0/14:*, 192.176.0.0/12:*, 192.192.0.0/10:*,
      193.0.0.0/8:*, 194.0.0.0/7:*, 196.0.0.0/6:*, 200.0.0.0/5:*, 208.0.0.0/4:*
    ]

Tests I have repeated
Other tests I have run
Environments:
Hmm, I think we need to track this in a separate issue.
So I think this means that the subnet router and exit node now work, and what is left is:
- This needs to be sped up, particularly to make HA subnet routers useful.
- HA routes are not updated correctly.
Of course I would be keen to hear back from @vsychov as well on the exit/subnet router.
I've pushed a fix that should tell nodes quickly when a node disconnects, can you give that a shot @joepa37?
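A minimal sketch of how that could be checked with the two-node compose file from earlier in this thread (service names ts01/ts02 as in that file; timings are only a guess):

docker compose stop ts01
# on the headscale host: the node list should reflect ts01 going offline shortly afterwards
headscale nodes list
# inside the ts02 container: the remaining client should also see its peer reported offline
tailscale --socket=/tmp/tailscaled.sock status
docker compose start ts01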
It seems like the behavior is actually the opposite of what it's supposed to be. From the client's perspective, both servers appear offline:

tailscale status
100.64.0.3 macbook-pro root macOS -
100.64.0.1 build-phx1-ad1.mesh.ts.net mesh linux idle; offers exit node; offline
100.64.0.2 test-phx1-ad1.mesh.ts.net mesh linux idle; offers exit node; offline

even though they are actually reachable:

tailscale status
100.64.0.3 macbook-pro root macOS -
100.64.0.1 build-phx1-ad1.mesh.ts.net mesh linux active; offers exit node; direct 129.153.71.250:51821; offline, tx 11308 rx 5068
100.64.0.2 test-phx1-ad1.mesh.ts.net mesh linux idle, tx 820 rx 732
@kradalby, I checked it again, and it still doesn't work.
Test Environment:
Steps to Reproduce:
bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:38:49 UTC 2023
100.64.0.1 test-host user2 linux -
100.64.0.2 test-host-fe57zhq6.user1.example.com user1 linux -
100.64.0.3 test-host-wi4tp7j2.user1.example.com user1 linux -
date && headscale route list
ID | Node | Prefix | Advertised | Enabled | Primary
1 | test-host-fe57zhq6 | 0.0.0.0/0 | true | true | -
2 | test-host-fe57zhq6 | ::/0 | true | true | -
3 | test-host-fe57zhq6 | 10.0.0.0/8 | true | true | true
4 | test-host-wi4tp7j2 | 0.0.0.0/0 | true | true | -
5 | test-host-wi4tp7j2 | ::/0 | true | true | -
6 | test-host-wi4tp7j2 | 10.0.0.0/8 | true | true | false
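(As an aside, one way to confirm from the client that an advertised exit node is actually usable, assuming a tailscale CLI recent enough to have `tailscale set`:)

tailscale --socket=/tmp/tailscaled.sock set --exit-node=100.64.0.2
tailscale --socket=/tmp/tailscaled.sock status   # the selected peer should now be shown as the exit node in use
tailscale --socket=/tmp/tailscaled.sock set --exit-node=   # clear it again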
Ok, thanks. I will be away from my computer for a couple of days; I'll try to go back to the drawing board after that.
So this might be true, but
I've pushed two fixes, one for enable, and one for disable/failover nodes. I added a test case to verify that the enable/disable route command is not only caught by headscale but also propagated to the nodes; please feel free to read through it and see if it makes sense. It would be great if you could help test the following:
Thank you
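For the enable/disable part, a rough sketch of the kind of check I have in mind (the route ID comes from the listing, and the exact subcommand/flag spelling may differ between headscale versions; the transcript above uses `headscale route list`):

headscale route list                              # note the ID of the advertised route
headscale route enable -r <ID>                    # enable it (flag spelling may vary by version)
tailscale --socket=/tmp/tailscaled.sock status    # on a client: for an exit route, the advertising peer should start showing "offers exit node" without any restart
headscale route disable -r <ID>
tailscale --socket=/tmp/tailscaled.sock status    # the offer should disappear again shortly afterwards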
Thanks @kradalby, I'll try to test it today
I tested the revision. The exit node propagated very quickly, and there is no longer any flapping between the online and offline status. Tomorrow I'll construct a more intricate setup based on a real network and will test scenarios with route failover and network flapping. There are also some additional issues:
Hostnames are missing the user part, and hostname resolution isn't working.
Thanks for taking the time to explain this, really helpful to me.
✅ Enabling a route: manually approved and auto-approved
PS:
Hmm, I am not seeing this in my test; do you mind sharing your config?
config.yaml
---
# headscale will look for a configuration file named `config.yaml` (or `config.json`) in the following order:
#
# - `/etc/headscale`
# - `~/.headscale`
# - current working directory
# The url clients will connect to.
# Typically this will be a domain like:
#
# https://myheadscale.example.com:443
#
server_url: http://127.0.0.1:8080
# Address to listen to / bind to on the server
#
# For production:
# listen_addr: 0.0.0.0:8080
listen_addr: 127.0.0.1:8080
# Address to listen to /metrics, you may want
# to keep this endpoint private to your internal
# network
#
metrics_listen_addr: 127.0.0.1:9090
# Address to listen for gRPC.
# gRPC is used for controlling a headscale server
# remotely with the CLI
# Note: Remote access _only_ works if you have
# valid certificates.
#
# For production:
# grpc_listen_addr: 0.0.0.0:50443
grpc_listen_addr: 127.0.0.1:50443
# Allow the gRPC admin interface to run in INSECURE
# mode. This is not recommended as the traffic will
# be unencrypted. Only enable if you know what you
# are doing.
grpc_allow_insecure: false
# Private key used to encrypt the traffic between headscale
# and Tailscale clients.
# The private key file will be autogenerated if it's missing.
#
private_key_path: /home/user/headscale/private.key
# The Noise section includes specific configuration for the
# TS2021 Noise protocol
noise:
# The Noise private key is used to encrypt the
# traffic between headscale and Tailscale clients when
# using the new Noise-based protocol. It must be different
# from the legacy private key.
private_key_path: /home/user/headscale/noise_private.key
# List of IP prefixes to allocate tailaddresses from.
# Each prefix consists of either an IPv4 or IPv6 address,
# and the associated prefix length, delimited by a slash.
# It must be within IP ranges supported by the Tailscale
# client - i.e., subnets of 100.64.0.0/10 and fd7a:115c:a1e0::/48.
# See below:
# IPv6: https://github.com/tailscale/tailscale/blob/22ebb25e833264f58d7c3f534a8b166894a89536/net/tsaddr/tsaddr.go#LL81C52-L81C71
# IPv4: https://github.com/tailscale/tailscale/blob/22ebb25e833264f58d7c3f534a8b166894a89536/net/tsaddr/tsaddr.go#L33
# Any other range is NOT supported, and it will cause unexpected issues.
ip_prefixes:
- fd7a:115c:a1e0::/48
- 100.64.0.0/10
# DERP is a relay system that Tailscale uses when a direct
# connection cannot be established.
# https://tailscale.com/blog/how-tailscale-works/#encrypted-tcp-relays-derp
#
# headscale needs a list of DERP servers that can be presented
# to the clients.
derp:
server:
# If enabled, runs the embedded DERP server and merges it into the rest of the DERP config
# The Headscale server_url defined above MUST be using https, DERP requires TLS to be in place
enabled: false
# Region ID to use for the embedded DERP server.
# The local DERP prevails if the region ID collides with other region ID coming from
# the regular DERP config.
region_id: 999
# Region code and name are displayed in the Tailscale UI to identify a DERP region
region_code: "headscale"
region_name: "Headscale Embedded DERP"
# Listens over UDP at the configured address for STUN connections - to help with NAT traversal.
# When the embedded DERP server is enabled stun_listen_addr MUST be defined.
#
# For more details on how this works, check this great article: https://tailscale.com/blog/how-tailscale-works/
stun_listen_addr: "0.0.0.0:3478"
# List of externally available DERP maps encoded in JSON
urls:
- https://controlplane.tailscale.com/derpmap/default
# Locally available DERP map files encoded in YAML
#
# This option is mostly interesting for people hosting
# their own DERP servers:
# https://tailscale.com/kb/1118/custom-derp-servers/
#
# paths:
# - /etc/headscale/derp-example.yaml
paths: []
# If enabled, a worker will be set up to periodically
# refresh the given sources and update the derpmap
# will be set up.
auto_update_enabled: true
# How often should we check for DERP updates?
update_frequency: 24h
# Disables the automatic check for headscale updates on startup
disable_check_updates: false
# Time before an inactive ephemeral node is deleted?
ephemeral_node_inactivity_timeout: 30m
# Period to check for node updates within the tailnet. A value too low will severely affect
# CPU consumption of Headscale. A value too high (over 60s) will cause problems
# for the nodes, as they won't get updates or keep alive messages frequently enough.
# In case of doubts, do not touch the default 10s.
node_update_check_interval: 10s
# # Postgres config
# If using a Unix socket to connect to Postgres, set the socket path in the 'host' field and leave 'port' blank.
db_type: postgres
db_host: localhost
db_port: 5432
db_name: headscale
db_user: headscale
db_pass: password
# If other 'sslmode' is required instead of 'require(true)' and 'disabled(false)', set the 'sslmode' you need
# in the 'db_ssl' field. Refers to https://www.postgresql.org/docs/current/libpq-ssl.html Table 34.1.
db_ssl: false
### TLS configuration
#
## Let's encrypt / ACME
#
# headscale supports automatically requesting and setting up
# TLS for a domain with Let's Encrypt.
#
# URL to ACME directory
acme_url: https://acme-v02.api.letsencrypt.org/directory
# Email to register with ACME provider
acme_email: ""
# Domain name to request a TLS certificate for:
tls_letsencrypt_hostname: ""
# Path to store certificates and metadata needed by
# letsencrypt
# For production:
tls_letsencrypt_cache_dir: /home/user/headscale/cache
# Type of ACME challenge to use, currently supported types:
# HTTP-01 or TLS-ALPN-01
# See [docs/tls.md](docs/tls.md) for more information
tls_letsencrypt_challenge_type: HTTP-01
# When HTTP-01 challenge is chosen, letsencrypt must set up a
# verification endpoint, and it will be listening on:
# :http = port 80
tls_letsencrypt_listen: ":http"
## Use already defined certificates:
tls_cert_path: ""
tls_key_path: ""
log:
# Output formatting for logs: text or json
format: text
level: trace
# Path to a file containing ACL policies.
# ACLs can be defined as YAML or HUJSON.
# https://tailscale.com/kb/1018/acls/
acl_policy_path: ""
## DNS
#
# headscale supports Tailscale's DNS configuration and MagicDNS.
# Please have a look to their KB to better understand the concepts:
#
# - https://tailscale.com/kb/1054/dns/
# - https://tailscale.com/kb/1081/magicdns/
# - https://tailscale.com/blog/2021-09-private-dns-with-magicdns/
#
dns_config:
# Whether to prefer using Headscale provided DNS or use local.
override_local_dns: true
# List of DNS servers to expose to clients.
nameservers:
- 1.1.1.1
# NextDNS (see https://tailscale.com/kb/1218/nextdns/).
# "abc123" is example NextDNS ID, replace with yours.
#
# With metadata sharing:
# nameservers:
# - https://dns.nextdns.io/abc123
#
# Without metadata sharing:
# nameservers:
# - 2a07:a8c0::ab:c123
# - 2a07:a8c1::ab:c123
# Split DNS (see https://tailscale.com/kb/1054/dns/),
# list of search domains and the DNS to query for each one.
#
# restricted_nameservers:
# foo.bar.com:
# - 1.1.1.1
# darp.headscale.net:
# - 1.1.1.1
# - 8.8.8.8
# Search domains to inject.
domains: []
# Extra DNS records
# so far only A-records are supported (on the tailscale side)
# See https://github.com/juanfont/headscale/blob/main/docs/dns-records.md#Limitations
# extra_records:
# - name: "grafana.myvpn.example.com"
# type: "A"
# value: "100.64.0.3"
#
# # you can also put it in one line
# - { name: "prometheus.myvpn.example.com", type: "A", value: "100.64.0.3" }
# Whether to use [MagicDNS](https://tailscale.com/kb/1081/magicdns/).
# Only works if there is at least a nameserver defined.
magic_dns: true
# Defines the base domain to create the hostnames for MagicDNS.
# `base_domain` must be an FQDN, without the trailing dot.
# The FQDN of the hosts will be
# `hostname.user.base_domain` (e.g., _myhost.myuser.example.com_).
base_domain: example.com
# Unix socket used for the CLI to connect without authentication
# Note: for production you will want to set this to something like:
unix_socket: /home/user/headscale/headscale.sock
unix_socket_permission: "0770"
#
# headscale supports experimental OpenID connect support,
# it is still being tested and might have some bugs, please
# help us test it.
# OpenID Connect
# oidc:
# only_start_if_oidc_is_available: true
# issuer: "https://your-oidc.issuer.com/path"
# client_id: "your-oidc-client-id"
# client_secret: "your-oidc-client-secret"
# # Alternatively, set `client_secret_path` to read the secret from the file.
# # It resolves environment variables, making integration to systemd's
# # `LoadCredential` straightforward:
# client_secret_path: "${CREDENTIALS_DIRECTORY}/oidc_client_secret"
# # client_secret and client_secret_path are mutually exclusive.
#
# # The amount of time from a node is authenticated with OpenID until it
# # expires and needs to reauthenticate.
# # Setting the value to "0" will mean no expiry.
# expiry: 180d
#
# # Use the expiry from the token received from OpenID when the user logged
# # in, this will typically lead to frequent need to reauthenticate and should
# # only been enabled if you know what you are doing.
# # Note: enabling this will cause `oidc.expiry` to be ignored.
# use_expiry_from_token: false
#
# # Customize the scopes used in the OIDC flow, defaults to "openid", "profile" and "email" and add custom query
# # parameters to the Authorize Endpoint request. Scopes default to "openid", "profile" and "email".
#
# scope: ["openid", "profile", "email", "custom"]
# extra_params:
# domain_hint: example.com
#
# # List allowed principal domains and/or users. If an authenticated user's domain is not in this list, the
# # authentication request will be rejected.
#
# allowed_domains:
# - example.com
# # Note: Groups from keycloak have a leading '/'
# allowed_groups:
# - /headscale
# allowed_users:
# - [email protected]
#
# # If `strip_email_domain` is set to `true`, the domain part of the username email address will be removed.
# # This will transform `first-name.last-name@example.com` to the user `first-name.last-name`
# # If `strip_email_domain` is set to `false`, the domain part will NOT be removed, resulting in the following
# # user: `first-name.last-name.example.com`
#
# strip_email_domain: true
# Logtail configuration
# Logtail is Tailscale's logging and auditing infrastructure; it allows the control panel
# to instruct tailscale nodes to log their activity to a remote server.
logtail:
# Enable logtail for this headscale's clients.
# As there is currently no support for overriding the log server in headscale, this is
# disabled by default. Enabling this will make your clients send logs to Tailscale Inc.
enabled: false
# Enabling this option makes devices prefer a random port for WireGuard traffic over the
# default static port 41641. This option is intended as a workaround for some buggy
# firewall devices. See https://tailscale.com/kb/1181/firewalls/ for more information.
randomize_client_port: false

docker-compose.yaml

version: "3.8"
services:
  db:
    image: postgres
    command: ["postgres", "-c", "log_statement=all"]
    ports:
      - 5432:5432
    environment:
      POSTGRES_DB: headscale
      POSTGRES_USER: headscale
      POSTGRES_PASSWORD: password
  adminer:
    depends_on:
      - db
    image: adminer
    ports:
      - 8081:8080
  ts00:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/ --advertise-exit-node --advertise-routes=10.0.0.0/8'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'KEY_1'
      TS_HOSTNAME: 'ts00'
  ts01:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/ --advertise-exit-node --advertise-routes=10.0.0.0/8'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'KEY_1'
      TS_HOSTNAME: 'ts01'
  ts-client:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'KEY_2'
      TS_HOSTNAME: 'ts-client'
Ok, after running your config, I am seeing that the nodes struggle to join at all, but if they manage I don't see the name issue. I ran it with sqlite instead of postgres and it ran fine (both joining and the name). @vsychov can you try running your test with SQLite? @joepa37 what database are you running?
@kradalby, I've reproduced the bug on sqlite, but it seems I missed an important detail in the description above; it only manifests after enabling routes. As a result, the steps to reproduce on sqlite are as follows:
Then, inside ts-client, check the status:
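(For reference, a minimal way to read that status from the host, assuming the compose service name and the socket path used in the transcripts above:)

docker compose exec ts-client tailscale --socket=/tmp/tailscaled.sock status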
In my case, I use headscale within GCP's k8s (GKE), and I use a managed PostgreSQL instance because it's more convenient to work with than sqlite in terms of creating database backups. Also, due to #1482 (which was fixed in #1562, but hasn't been merged yet), I need to run a script to clear out this data. It's easier for me to do this using an external database, executing the script in a separate container. During testing, I use PostgreSQL just to be a bit closer to how I use it in real-life situations.
Force-pushed from df0c075 to 3014bc3
I have rewritten the failover logic and added quite a large synthetic test for failing over and ensuring the status is sent as expected. Please help test HA and failover. I should also have fixed the missing-username bug in DNS. I have rethought how the Online status is set; this is currently implemented in the CLI and the HA failover, but it is not yet implemented in the Online map sent to nodes, which might affect things like exit nodes. I'll work on that, but please test it anyway.
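A rough outline of the failover check, based on the compose file shared earlier in this thread (service names ts00/ts01 from that file, the Primary column as in the earlier `headscale route list` output; the sleep is only a guess at propagation time):

headscale route list        # ts00 and ts01 both advertise 10.0.0.0/8; one of them should be marked Primary
docker compose stop ts00    # stop whichever node is currently primary (ts00 here for illustration)
sleep 30
headscale route list        # the Primary flag should have moved to the other node
docker compose start ts00   # bring it back; traffic from ts-client to 10.0.0.0/8 should keep working throughout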
@kradalby, I'll repeat all the tests this week and will test a few more cases. Thanks for the great work.
Thank you, I will work on the things I seem to have broken (Taildrop, DERP-only, and the web-only logout flow) tomorrow.
Force-pushed from c0d9d8b to 07d2c03
This PR further improves the state management system and tries to make sure that all nodes stay in sync continuously. This is greatly enabled by a previous PR dropping support for older clients, which allowed us to use a Patch field that sends only small diffs for client updates. It also reworks how the HA subnet router is handled; it should be a bit easier to follow now. Signed-off-by: Kristoffer Dalby <[email protected]>
Force-pushed from bbb4c35 to 1b895c0
I conducted another series of tests, starting with two tailscale nodes on headscale revision 979569c, on the infrastructure described in my last message. I noticed an improvement in stability; the nodes no longer fluctuate between online and offline status, which is good news. The tailscale nodes were connected with the following parameters:
After enabling the routes, the routing table looked like this:
Traffic from the 10.110.0.0/20 network successfully passed to the 10.114.0.0/20 network and back. Next, I used the latest revision at the moment, af3c097, to test subnet route failover. I added another tailscale node ( 2023-12-03T16:25:53Z:
2023-12-03T16:26:03Z:
Despite all three nodes being online.
I think this PR has grown a bit out of hand for the issue it was going to address, so I propose the following: it should now pass all the tests we have, so it should be reviewed and merged, and then, when I get test feedback on #1561, I will continue to address that in new PRs. As a result of this PR, I did notice two things:
The last one, I think, is a new issue and should also be addressed separately. When this is merged, let us move all discussion of #1561 back to the issue.
@@ -1,5 +1,6 @@
ignored/
tailscale/
.vscode/
You now use vscode??
Sometimes
The new alpha2 version fixed the issue on my side 👍🥳, thanks a lot for the amazing work 🥳
This is an attempt to address #1561.