
ensure online status and route changes are propagated #1564

Merged · 13 commits · Dec 9, 2023

Conversation

kradalby
Collaborator

This is an attempt to address #1561.

@kradalby
Collaborator Author

@vsychov could you test this branch with regard to #1561?

@vsychov
Contributor

vsychov commented Sep 28, 2023

@kradalby thanks, I'll test it today

@vsychov
Contributor

vsychov commented Sep 28, 2023

Thank you, the situation has improved (before this PR, nodes were marked as offline forever), but the nodes are still 'flapping' between online and offline states.

I found a simple way to check this locally; it's enough to run headscale on 127.0.0.1:8080, and use the docker-compose file:

version: "3.8"
services:
  ts01:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'XXXX'

  ts02:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'XXXX'

Leave it running for ~10 minutes; the nodes occasionally go 'offline'.
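A quick way to watch this from the headscale side (my own suggestion, not part of the original report) is to poll the CLI while the compose stack is running:

# Poll the node list every 30 seconds and watch the Online column flip.
watch -n 30 headscale node list

# In parallel, follow the client logs for the long-poll timeouts.
docker compose logs -f ts01 ts02
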
headscale node list:

at 2023-09-28T10:30:20Z (offline):

ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online  | Expired
3  | example-hostname | example-hostname          | [hbrPm]    | [LzCj3] | user1 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-28 10:28:54 | 0001-01-01 00:00:00 | offline | no
4  | example-hostname | example-hostname-fqska9ih | [rXF+e]    | [O8SbL] | user2 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-28 10:28:54 | 0001-01-01 00:00:00 | offline | no

at 2023-09-28T10:33:20Z (online):

ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online | Expired
3  | example-hostname | example-hostname          | [hbrPm]    | [LzCj3] | user1 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-28 10:32:54 | 0001-01-01 00:00:00 | online | no
4  | example-hostname | example-hostname-fqska9ih | [rXF+e]    | [O8SbL] | user2 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-28 10:32:54 | 0001-01-01 00:00:00 | online | no

at 2023-09-28T10:35:55Z (offline):

ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online  | Expired
3  | example-hostname | example-hostname          | [hbrPm]    | [LzCj3] | user1 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-28 10:34:54 | 0001-01-01 00:00:00 | offline | no
4  | example-hostname | example-hostname-fqska9ih | [rXF+e]    | [O8SbL] | user2 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-28 10:34:54 | 0001-01-01 00:00:00 | offline | no

at 2023-09-28T10:37:38Z (online):

ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online | Expired
3  | example-hostname | example-hostname          | [hbrPm]    | [LzCj3] | user1 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-28 10:36:54 | 0001-01-01 00:00:00 | online | no
4  | example-hostname | example-hostname-fqska9ih | [rXF+e]    | [O8SbL] | user2 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-28 10:36:54 | 0001-01-01 00:00:00 | online | no

in tailscale logs lot of errors like this:

headscale-ts02-1  | 2023/09/28 10:26:54 control: map response long-poll timed out!
headscale-ts02-1  | 2023/09/28 10:26:54 Received error: PollNetMap: context canceled
headscale-ts01-1  | 2023/09/28 10:28:54 control: map response long-poll timed out!
headscale-ts01-1  | 2023/09/28 10:28:54 Received error: PollNetMap: context canceled
headscale-ts02-1  | 2023/09/28 10:28:54 control: map response long-poll timed out!
headscale-ts02-1  | 2023/09/28 10:28:54 Received error: PollNetMap: context canceled
headscale-ts01-1  | 2023/09/28 10:30:54 control: map response long-poll timed out!
headscale-ts01-1  | 2023/09/28 10:30:54 Received error: PollNetMap: context canceled
headscale-ts02-1  | 2023/09/28 10:30:54 control: map response long-poll timed out!
headscale-ts02-1  | 2023/09/28 10:30:54 Received error: PollNetMap: context canceled

@kradalby
Collaborator Author

Am I correct in understanding that the exit node/subnet router issue is "solved" once the node is correctly reported as online?

Or is that a second issue we will still have after this one?

@joepa37

joepa37 commented Sep 28, 2023

Some behaviors from the 1561-online-issue branch:

  • ✅ Connectivity works (ping all)
  • ✅ Online updated (really fast)
  • ⚠️ Offline took a while to update (almost a minute or so)
  • ✅ Route HA primary change when offline is updated
  • ❌ Routes not reachable at all
  • ❌ Exit nodes not working

I am using the same configuration as on v0.22.3.

@kradalby
Collaborator Author

@vsychov @joepa37, 10ec28a adds an integration test case that tries to verify that the online status is reported correctly over time. Can you have a look to check that it makes sense and that it replicates the same behaviour as the proposed reproduction case?

@joepa37

joepa37 commented Sep 28, 2023

@kradalby the make command gives me some errors:

error: builder for '/nix/store/al77c86nv3y90aay4gi2r3sxnvn0hrll-headscale-dev.drv' failed with exit code 1;
       last 10 log lines:
       >             + 		Tags:              []string{},
       >             + 		PrimaryRoutes:     []netip.Prefix{s"192.168.0.0/24"},
       >             + 		LastSeen:          s"2009-11-10 23:09:00 +0000 UTC",
       >             + 		MachineAuthorized: true,
       >             + 		...
       >             + 	},
       >               )
       > FAIL
       > FAIL	github.com/juanfont/headscale/hscontrol/mapper	0.109s
       > FAIL
       For full logs, run 'nix log /nix/store/al77c86nv3y90aay4gi2r3sxnvn0hrll-headscale-dev.drv'.
make: *** [Makefile:22: build] Error 1

@kradalby
Collaborator Author

@joepa37 I forgot to update some tests, please try again.

@joepa37

joepa37 commented Sep 28, 2023

For the last check I updated the ACL policy to the following (sorry for not checking this before :/):

acls:
  - action: accept
    src: ["*"]
    dst: ["*:*"]

It seems the ACL configuration I've been using is not working any more.

/etc/headscale/acls.yaml
acls:
  # Mesh Network
  - action: accept
    src: [group:servers]
    dst: [group:servers:*]
  # Administrators can access to all server resources
  - action: accept
    src: [group:admin]
    dst: [group:servers:*]
  # Developers can access other devs HTTP apps
  - action: accept
    src: [group:devs]
    proto: tcp
    dst: ['group:devs:80,443']
  # All clients can access private http apps
  - action: accept
    src: [group:servers, group:admin, group:devs]
    dst: [10.30.0.0/16:80, 10.30.0.0/16:443]
   # Exit Nodes
  - action: accept
    src: [group:admin]
    dst: [
      group:servers:0,
      0.0.0.0/5:*, 8.0.0.0/7:*, 11.0.0.0/8:*, 12.0.0.0/6:*, 16.0.0.0/4:*, 32.0.0.0/3:*, 64.0.0.0/3:*, 96.0.0.0/6:*, 100.0.0.0/10:*, 100.128.0.0/9:*, 101.0.0.0/8:*, 102.0.0.0/7:*, 104.0.0.0/5:*, 112.0.0.0/5:*, 120.0.0.0/6:*, 124.0.0.0/7:*, 126.0.0.0/8:*, 128.0.0.0/3:*, 160.0.0.0/5:*, 168.0.0.0/6:*, 172.0.0.0/12:*, 172.32.0.0/11:*, 172.64.0.0/10:*, 172.128.0.0/9:*, 173.0.0.0/8:*, 174.0.0.0/7:*, 176.0.0.0/4:*, 192.0.0.0/9:*, 192.128.0.0/11:*, 192.160.0.0/13:*, 192.169.0.0/16:*, 192.170.0.0/15:*, 192.172.0.0/14:*, 192.176.0.0/12:*, 192.192.0.0/10:*, 193.0.0.0/8:*, 194.0.0.0/7:*, 196.0.0.0/6:*, 200.0.0.0/5:*, 208.0.0.0/4:*
    ]
  

Tests I have repeated
✅ Connectivity works (ping all)
✅ Online updated (really fast)
⚠️ Offline took a while to update (almost a minute or so)
✅ Route HA primary change when offline is updated
✅ Routes are reachable
✅ Exit nodes are working
❌ HA Route failover not working yet
When the primary node is updated to offline, the headscale route list updates the primary to the second node, but the clients are still not able to reach the route.
I have to manually tailscale down and tailscale up on the client to make the route reachable from the new primary node.

Other tests I have run
✅ Register server with preauthkeys
✅ Register user with OIDC (new "Signed in via your OIDC provider" template)
✅ Taildrop
tailscale ping
tailscale ssh

Environments:
Servers: Ubuntu 22.04
Clients: Windows 11 Pro, macOS Catalina 10.15.7, iPadOS 16.7
Tailscale: 1.48.2, 1.50.0

@kradalby
Collaborator Author

It seems the ACL configuration I've been using is not working any more.

Hmm, I think we need to track this in a separate issue.

Tests I have repeated

So I think this means that subnet routers and exit nodes now work, and what is left is:

⚠️ Offline took a while to update (almost a minute or so)

This needs to be sped up, particularly to make HA subnet routers useful.

❌ HA Route failover not working yet
When the primary node is updated to offline, the headscale route list updates the primary to the second node, but the clients are still not able to reach the route.
I have to manually tailscale down and tailscale up on the client to make the route reachable from the new primary node.

HA routes are not updated correctly.

Of course, I would be keen to hear back from @vsychov as well regarding the exit node/subnet router.

@kradalby
Collaborator Author

I've pushed a fix that should quickly tell nodes when a node disconnects; can you give that a shot, @joepa37?

@joepa37

joepa37 commented Sep 28, 2023

It seems like the behavior is actually the opposite of what it's supposed to be.

From the client's perspective, both servers appear offline:

tailscale status
100.64.0.3      macbook-pro                 root       macOS   -
100.64.0.1      build-phx1-ad1.mesh.ts.net  mesh       linux   idle; offers exit node; offline
100.64.0.2      test-phx1-ad1.mesh.ts.net   mesh       linux   idle; offers exit node; offline

Even though they are actually reachable:

tailscale status
100.64.0.3      macbook-pro                 root     macOS   -
100.64.0.1      build-phx1-ad1.mesh.ts.net  mesh     linux   active; offers exit node; direct 129.153.71.250:51821; offline, tx 11308 rx 5068
100.64.0.2      test-phx1-ad1.mesh.ts.net   mesh     linux   idle, tx 820 rx 732

headscale node list and headscale route list behave the same as before, updating the offline status after a minute or so.

@vsychov
Contributor

vsychov commented Sep 29, 2023

@kradalby, I checked it again, and it still doesn't work:

Test Environment:

Steps to Reproduce:

  1. Execute the following commands to create users and preauthkeys:

    headscale users create user1
    headscale users create user2
    headscale preauthkey create --user user1 --reusable --ephemeral
    headscale preauthkey create --user user2 --reusable --ephemeral
  2. Set up clients in docker-compose with the given configurations:

    ts00:
      image: tailscale/tailscale:v1.48.2
      network_mode: host
      environment:
        TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/ --advertise-exit-node --advertise-routes=10.0.0.0/8'
        TS_USERSPACE: 'true'
        TS_AUTHKEY: 'replace-by-user1-key'
    
    ts01:
      image: tailscale/tailscale:v1.48.2
      network_mode: host
      environment:
        TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/  --advertise-exit-node --advertise-routes=10.0.0.0/8'
        TS_USERSPACE: 'true'
        TS_AUTHKEY: 'replace-by-user1-key'
    
    ts-client:
      image: tailscale/tailscale:v1.48.2
      network_mode: host
      environment:
        TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
        TS_USERSPACE: 'true'
        TS_AUTHKEY: 'replace-by-user2-key'
  3. Inside ts-client, check the status and note the hostnames:

bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:38:49 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6.user1.example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2.user1.example.com user1        linux   -
  4. Approve and check the routes list at 2023-09-29T09:40:32Z:
date && headscale  route list
ID | Node                       | Prefix     | Advertised | Enabled | Primary
1  | test-host-fe57zhq6 | 0.0.0.0/0  | true       | true    | -
2  | test-host-fe57zhq6 | ::/0       | true       | true    | -
3  | test-host-fe57zhq6 | 10.0.0.0/8 | true       | true    | true
4  | test-host-wi4tp7j2 | 0.0.0.0/0  | true       | true    | -
5  | test-host-wi4tp7j2 | ::/0       | true       | true    | -
6  | test-host-wi4tp7j2 | 10.0.0.0/8 | true       | true    | false
  5. Inside ts-client, recheck the status and observe the hostnames. They no longer contain the user part:
bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:40:45 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6..example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2..example.com user1        linux   -

bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:41:35 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6..example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2..example.com user1        linux   -

date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:46:05 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6..example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2..example.com user1        linux   -

  6. Restart headscale and check the status again (still no exit node offered):
bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:47:42 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6..example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2..example.com user1        linux   -
  7. At 2023-09-29T09:55:18Z, headscale node list shows the nodes as offline:
ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online  | Expired
1  | test-host | test-host          | [EpOBA]    | [v8FlV] | user2 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-29 09:54:09 | 0001-01-01 00:00:00 | offline | no
2  | test-host | test-host-fe57zhq6 | [w9oQW]    | [q71Kw] | user1 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-29 09:54:09 | 0001-01-01 00:00:00 | offline | no
3  | test-host | test-host-wi4tp7j2 | [5pB9g]    | [d2Pw6] | user1 | 100.64.0.3, fd7a:115c:a1e0::3 | true      | 2023-09-29 09:54:09 | 0001-01-01 00:00:00 | offline | no
  8. The nodes appear in the list after some time (~10 minutes after the restart), but they are marked as offline:
bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:56:52 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6.user1.example.com user1        linux   idle; offers exit node; offline
100.64.0.3      test-host-wi4tp7j2.user1.example.com user1        linux   idle; offers exit node; offline

@kradalby
Collaborator Author

Ok, thanks, I will be away from my computer for a couple of days, I'll try to go back to the drawing board after that.

@kradalby
Collaborator Author

kradalby commented Oct 1, 2023

It seems like the behavior is actually the opposite of what it's supposed to be.

So this might be true, but "offline" does not actually mean that the node is offline from the network or the internet; it means that it is not connected to the control server (headscale). So if two nodes have each other's last endpoints, they will still be able to connect even if a node is marked as offline.
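As a rough illustration of that distinction, here is a minimal sketch in Go (my own types and names, not headscale's actual code) of treating "online" as "currently holds an open map session with the control server":

package main

import (
	"fmt"
	"sync"
	"time"
)

// onlineTracker records which nodes currently hold an open streaming
// map session ("long poll") against the control server.
type onlineTracker struct {
	mu   sync.Mutex
	open map[string]time.Time // node key -> when the session was opened
}

func newOnlineTracker() *onlineTracker {
	return &onlineTracker{open: make(map[string]time.Time)}
}

// connect is called when a node opens its map session.
func (t *onlineTracker) connect(nodeKey string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.open[nodeKey] = time.Now()
}

// disconnect is called when the session closes (timeout, restart, ...).
func (t *onlineTracker) disconnect(nodeKey string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.open, nodeKey)
}

// isOnline reports control-plane connectivity only: peers that already know
// each other's endpoints can keep talking even when this returns false.
func (t *onlineTracker) isOnline(nodeKey string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.open[nodeKey]
	return ok
}

func main() {
	t := newOnlineTracker()
	t.connect("nodekey:abc123")
	fmt.Println(t.isOnline("nodekey:abc123")) // true: map session open
	t.disconnect("nodekey:abc123")
	fmt.Println(t.isOnline("nodekey:abc123")) // false: only the control-plane state changed
}
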

@kradalby
Collaborator Author

kradalby commented Oct 2, 2023

I've pushed two fixes: one for enabling, and one for disabling/failing over nodes.

I added a test case to verify that the enable/disable route command is not only caught by headscale, but also propagated to the nodes. Please feel free to read through it and see if it makes sense.

It would be great if you could help test the following:

  • Enabling a route
  • Enabling an exit node
  • Disabling a route
  • Disabling an exit node
  • Subnet failover
  • Observations about Online status (please see my previous comment for details)

Thank you

@kradalby changed the title from "update LastSeen in db and mapper" to "ensure online status and route changes are propagated" on Oct 2, 2023
@vsychov
Contributor

vsychov commented Oct 2, 2023

Thanks @kradalby, I'll try to test it today.

@vsychov
Contributor

vsychov commented Oct 2, 2023

I tested revision 17f887c1b902d26ac1c49178164925ffa8b607c8 using Docker on localhost, and it has greatly improved.

The exit node was propagated very quickly, and there is no longer any flapping between the online and offline statuses. Tomorrow, I'll construct a more intricate setup based on a real network and will test scenarios with route failover and network flapping.

There are also some additional issues:

bash-5.1# tailscale --socket=/tmp/tailscaled.sock status
100.64.0.3      ts-client            user2        linux   -
100.64.0.1      ts00..example.com    user1        linux   idle; offers exit node
100.64.0.2      ts01..example.com    user1        linux   idle; offers exit node

Hostnames are missing the user part, and hostname resolution isn't working.
ts00..example.com - obviously doesn't work.
ts00.user1.example.com - also doesn't resolve.

@joepa37

joepa37 commented Oct 4, 2023

So this might be true, but "offline" does not actually mean that the node is offline from the network or the internet; it means that it is not connected to the control server (headscale). So if two nodes have each other's last endpoints, they will still be able to connect even if a node is marked as offline.

Thanks for taking the time to explain this; it is really helpful to me.

It would be great if you could help test the following:

✅ Enabling a route: manually approved and auto approved
⚠️ Enabling an exit node: sometimes appears greyed out on the client side (node offline)
✅ Disabling a route: working as expected
✅ Disabling an exit node: working as expected
❌ Subnet failover: not working at all (the node is marked as offline after a minute or so, then the primary route is switched, but the client can't reach the new primary unless tailscale is restarted)
⚠️ Observations about Online status: sometimes nodes are marked as offline even though they can reach the headscale server

PD:

  • I am using the default config.yaml
  • To force the failover I have run tailscale down on the node holding the primary route (see the command sketch below)
  • The first time headscale registers the nodes, they appear with the wrong name, as @vsychov already mentioned
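A minimal way to exercise that failover scenario (my own sketch of commands, reusing only the CLI forms already shown in this thread):

# On the node currently holding the primary route:
tailscale down

# On the headscale server, watch the Primary column move to the other node:
watch -n 10 headscale route list
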

@kradalby
Collaborator Author

kradalby commented Oct 5, 2023

Hostnames are missing the user part, and hostname resolution isn't working.
ts00..example.com - obviously doesn't work.
ts00.user1.example.com - also doesn't resolve.

Hmm, I am not seeing this in my tests. Do you mind sharing your config?

@vsychov
Contributor

vsychov commented Oct 5, 2023

Hostnames are missing the user part, and hostname resolution isn't working.
ts00..example.com - obviously doesn't work.
ts00.user1.example.com - also doesn't resolve.

hmm, I am not seeing this in my test, do you mind sharing your config?

config.yaml
---
# headscale will look for a configuration file named `config.yaml` (or `config.json`) in the following order:
#
# - `/etc/headscale`
# - `~/.headscale`
# - current working directory

# The url clients will connect to.
# Typically this will be a domain like:
#
# https://myheadscale.example.com:443
#
server_url: http://127.0.0.1:8080

# Address to listen to / bind to on the server
#
# For production:
# listen_addr: 0.0.0.0:8080
listen_addr: 127.0.0.1:8080

# Address to listen to /metrics, you may want
# to keep this endpoint private to your internal
# network
#
metrics_listen_addr: 127.0.0.1:9090

# Address to listen for gRPC.
# gRPC is used for controlling a headscale server
# remotely with the CLI
# Note: Remote access _only_ works if you have
# valid certificates.
#
# For production:
# grpc_listen_addr: 0.0.0.0:50443
grpc_listen_addr: 127.0.0.1:50443

# Allow the gRPC admin interface to run in INSECURE
# mode. This is not recommended as the traffic will
# be unencrypted. Only enable if you know what you
# are doing.
grpc_allow_insecure: false

# Private key used to encrypt the traffic between headscale
# and Tailscale clients.
# The private key file will be autogenerated if it's missing.
#
private_key_path: /home/user/headscale/private.key

# The Noise section includes specific configuration for the
# TS2021 Noise protocol
noise:
  # The Noise private key is used to encrypt the
  # traffic between headscale and Tailscale clients when
  # using the new Noise-based protocol. It must be different
  # from the legacy private key.
  private_key_path: /home/user/headscale/noise_private.key

# List of IP prefixes to allocate tailaddresses from.
# Each prefix consists of either an IPv4 or IPv6 address,
# and the associated prefix length, delimited by a slash.
# It must be within IP ranges supported by the Tailscale
# client - i.e., subnets of 100.64.0.0/10 and fd7a:115c:a1e0::/48.
# See below:
# IPv6: https://github.com/tailscale/tailscale/blob/22ebb25e833264f58d7c3f534a8b166894a89536/net/tsaddr/tsaddr.go#LL81C52-L81C71
# IPv4: https://github.com/tailscale/tailscale/blob/22ebb25e833264f58d7c3f534a8b166894a89536/net/tsaddr/tsaddr.go#L33
# Any other range is NOT supported, and it will cause unexpected issues.
ip_prefixes:
  - fd7a:115c:a1e0::/48
  - 100.64.0.0/10

# DERP is a relay system that Tailscale uses when a direct
# connection cannot be established.
# https://tailscale.com/blog/how-tailscale-works/#encrypted-tcp-relays-derp
#
# headscale needs a list of DERP servers that can be presented
# to the clients.
derp:
  server:
    # If enabled, runs the embedded DERP server and merges it into the rest of the DERP config
    # The Headscale server_url defined above MUST be using https, DERP requires TLS to be in place
    enabled: false

    # Region ID to use for the embedded DERP server.
    # The local DERP prevails if the region ID collides with other region ID coming from
    # the regular DERP config.
    region_id: 999

    # Region code and name are displayed in the Tailscale UI to identify a DERP region
    region_code: "headscale"
    region_name: "Headscale Embedded DERP"

    # Listens over UDP at the configured address for STUN connections - to help with NAT traversal.
    # When the embedded DERP server is enabled stun_listen_addr MUST be defined.
    #
    # For more details on how this works, check this great article: https://tailscale.com/blog/how-tailscale-works/
    stun_listen_addr: "0.0.0.0:3478"

  # List of externally available DERP maps encoded in JSON
  urls:
    - https://controlplane.tailscale.com/derpmap/default

  # Locally available DERP map files encoded in YAML
  #
  # This option is mostly interesting for people hosting
  # their own DERP servers:
  # https://tailscale.com/kb/1118/custom-derp-servers/
  #
  # paths:
  #   - /etc/headscale/derp-example.yaml
  paths: []

  # If enabled, a worker will be set up to periodically
  # refresh the given sources and update the derpmap
  # will be set up.
  auto_update_enabled: true

  # How often should we check for DERP updates?
  update_frequency: 24h

# Disables the automatic check for headscale updates on startup
disable_check_updates: false

# Time before an inactive ephemeral node is deleted?
ephemeral_node_inactivity_timeout: 30m

# Period to check for node updates within the tailnet. A value too low will severely affect
# CPU consumption of Headscale. A value too high (over 60s) will cause problems
# for the nodes, as they won't get updates or keep alive messages frequently enough.
# In case of doubts, do not touch the default 10s.
node_update_check_interval: 10s

# # Postgres config
# If using a Unix socket to connect to Postgres, set the socket path in the 'host' field and leave 'port' blank.
db_type: postgres
db_host: localhost
db_port: 5432
db_name: headscale
db_user: headscale
db_pass: password

# If other 'sslmode' is required instead of 'require(true)' and 'disabled(false)', set the 'sslmode' you need
# in the 'db_ssl' field. Refers to https://www.postgresql.org/docs/current/libpq-ssl.html Table 34.1.
db_ssl: false

### TLS configuration
#
## Let's encrypt / ACME
#
# headscale supports automatically requesting and setting up
# TLS for a domain with Let's Encrypt.
#
# URL to ACME directory
acme_url: https://acme-v02.api.letsencrypt.org/directory

# Email to register with ACME provider
acme_email: ""

# Domain name to request a TLS certificate for:
tls_letsencrypt_hostname: ""

# Path to store certificates and metadata needed by
# letsencrypt
# For production:
tls_letsencrypt_cache_dir: /home/user/headscale/cache

# Type of ACME challenge to use, currently supported types:
# HTTP-01 or TLS-ALPN-01
# See [docs/tls.md](docs/tls.md) for more information
tls_letsencrypt_challenge_type: HTTP-01
# When HTTP-01 challenge is chosen, letsencrypt must set up a
# verification endpoint, and it will be listening on:
# :http = port 80
tls_letsencrypt_listen: ":http"

## Use already defined certificates:
tls_cert_path: ""
tls_key_path: ""

log:
  # Output formatting for logs: text or json
  format: text
  level: trace

# Path to a file containing ACL policies.
# ACLs can be defined as YAML or HUJSON.
# https://tailscale.com/kb/1018/acls/
acl_policy_path: ""

## DNS
#
# headscale supports Tailscale's DNS configuration and MagicDNS.
# Please have a look to their KB to better understand the concepts:
#
# - https://tailscale.com/kb/1054/dns/
# - https://tailscale.com/kb/1081/magicdns/
# - https://tailscale.com/blog/2021-09-private-dns-with-magicdns/
#
dns_config:
  # Whether to prefer using Headscale provided DNS or use local.
  override_local_dns: true

  # List of DNS servers to expose to clients.
  nameservers:
    - 1.1.1.1

  # NextDNS (see https://tailscale.com/kb/1218/nextdns/).
  # "abc123" is example NextDNS ID, replace with yours.
  #
  # With metadata sharing:
  # nameservers:
  #   - https://dns.nextdns.io/abc123
  #
  # Without metadata sharing:
  # nameservers:
  #   - 2a07:a8c0::ab:c123
  #   - 2a07:a8c1::ab:c123

  # Split DNS (see https://tailscale.com/kb/1054/dns/),
  # list of search domains and the DNS to query for each one.
  #
  # restricted_nameservers:
  #   foo.bar.com:
  #     - 1.1.1.1
  #   darp.headscale.net:
  #     - 1.1.1.1
  #     - 8.8.8.8

  # Search domains to inject.
  domains: []

  # Extra DNS records
  # so far only A-records are supported (on the tailscale side)
  # See https://github.com/juanfont/headscale/blob/main/docs/dns-records.md#Limitations
  # extra_records:
  #   - name: "grafana.myvpn.example.com"
  #     type: "A"
  #     value: "100.64.0.3"
  #
  #   # you can also put it in one line
  #   - { name: "prometheus.myvpn.example.com", type: "A", value: "100.64.0.3" }

  # Whether to use [MagicDNS](https://tailscale.com/kb/1081/magicdns/).
  # Only works if there is at least a nameserver defined.
  magic_dns: true

  # Defines the base domain to create the hostnames for MagicDNS.
  # `base_domain` must be a FQDNs, without the trailing dot.
  # The FQDN of the hosts will be
  # `hostname.user.base_domain` (e.g., _myhost.myuser.example.com_).
  base_domain: example.com

# Unix socket used for the CLI to connect without authentication
# Note: for production you will want to set this to something like:
unix_socket: /home/user/headscale/headscale.sock
unix_socket_permission: "0770"
#
# headscale supports experimental OpenID connect support,
# it is still being tested and might have some bugs, please
# help us test it.
# OpenID Connect
# oidc:
#   only_start_if_oidc_is_available: true
#   issuer: "https://your-oidc.issuer.com/path"
#   client_id: "your-oidc-client-id"
#   client_secret: "your-oidc-client-secret"
#   # Alternatively, set `client_secret_path` to read the secret from the file.
#   # It resolves environment variables, making integration to systemd's
#   # `LoadCredential` straightforward:
#   client_secret_path: "${CREDENTIALS_DIRECTORY}/oidc_client_secret"
#   # client_secret and client_secret_path are mutually exclusive.
#
#   # The amount of time from a node is authenticated with OpenID until it
#   # expires and needs to reauthenticate.
#   # Setting the value to "0" will mean no expiry.
#   expiry: 180d
#
#   # Use the expiry from the token received from OpenID when the user logged
#   # in, this will typically lead to frequent need to reauthenticate and should
#   # only been enabled if you know what you are doing.
#   # Note: enabling this will cause `oidc.expiry` to be ignored.
#   use_expiry_from_token: false
#
#   # Customize the scopes used in the OIDC flow, defaults to "openid", "profile" and "email" and add custom query
#   # parameters to the Authorize Endpoint request. Scopes default to "openid", "profile" and "email".
#
#   scope: ["openid", "profile", "email", "custom"]
#   extra_params:
#     domain_hint: example.com
#
#   # List allowed principal domains and/or users. If an authenticated user's domain is not in this list, the
#   # authentication request will be rejected.
#
#   allowed_domains:
#     - example.com
#   # Note: Groups from keycloak have a leading '/'
#   allowed_groups:
#     - /headscale
#   allowed_users:
#     - [email protected]
#
#   # If `strip_email_domain` is set to `true`, the domain part of the username email address will be removed.
#   # This will transform `first-name.last-name@example.com` to the user `first-name.last-name`
#   # If `strip_email_domain` is set to `false` the domain part will NOT be removed resulting to the following
#   user: `first-name.last-name.example.com`
#
#   strip_email_domain: true

# Logtail configuration
# Logtail is Tailscales logging and auditing infrastructure, it allows the control panel
# to instruct tailscale nodes to log their activity to a remote server.
logtail:
  # Enable logtail for this headscales clients.
  # As there is currently no support for overriding the log server in headscale, this is
  # disabled by default. Enabling this will make your clients send logs to Tailscale Inc.
  enabled: false

# Enabling this option makes devices prefer a random port for WireGuard traffic over the
# default static port 41641. This option is intended as a workaround for some buggy
# firewall devices. See https://tailscale.com/kb/1181/firewalls/ for more information.
randomize_client_port: false
docker-compose.yaml
version: "3.8"
services:
  db:
    image: postgres
    command: ["postgres", "-c", "log_statement=all"]
    ports:
      - 5432:5432

    environment:
      POSTGRES_DB: headscale
      POSTGRES_USER: headscale
      POSTGRES_PASSWORD: password

  adminer:
    depends_on:
      - db
    image: adminer
    ports:
      - 8081:8080

  ts00:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/ --advertise-exit-node --advertise-routes=10.0.0.0/8'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'KEY_1'
      TS_HOSTNAME: 'ts00'

  ts01:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/  --advertise-exit-node --advertise-routes=10.0.0.0/8'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'KEY_1'
      TS_HOSTNAME: 'ts01'

  ts-client:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'KEY_2'
      TS_HOSTNAME: 'ts-client'
  • docker compose up -d db
  • run headscale with config.yaml
  • create user1 and user2, and generate key_1 and key_2, replace it in docker-compose.yaml
  • docker compose up ts00 ts01 ts-client

@kradalby
Copy link
Collaborator Author

kradalby commented Oct 6, 2023

Ok, after running your config, I am seeing that the nodes struggle to join at all, but when they do manage to join I don't see the name issue.

I ran it with SQLite instead of Postgres and it ran fine (both joining and the name).

@vsychov can you try running your test with SQLite?
Out of curiosity, why are you running with Postgres instead of SQLite?

@joepa37 what database are you running?
Can you test with the other database as well and report back?
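For reference, switching the posted config from Postgres to SQLite should only require replacing the db_* block with something like the following (the path is a placeholder, not taken from the thread):

db_type: sqlite3
db_path: /home/user/headscale/db.sqlite
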

@vsychov
Contributor

vsychov commented Oct 6, 2023

@kradalby, I've reproduced the bug on sqlite, but it seems I missed an important detail in the description above; it only manifests after enabling routes. As a result, the steps to reproduce on sqlite are as follows:

  • Create user1 and user2, generate key_1 and key_2, then replace them in docker-compose.yaml.
headscale users create user1
headscale users create user2
headscale preauthkey create --user user1 --reusable --ephemeral
headscale preauthkey create --user user2 --reusable --ephemeral
  • Run docker compose up ts00 ts01 ts-client.
  • After that, enable all the routes:
headscale route enable -r 1
headscale route enable -r 2
headscale route enable -r 3
headscale route enable -r 4
headscale route enable -r 5
headscale route enable -r 6

Then, inside ts-client, check the status:

# tailscale --socket=/tmp/tailscaled.sock status
100.64.0.2      ts-client            user2        linux   -
100.64.0.1      ts00..example.com    user1        linux   idle; offers exit node
100.64.0.3      ts01..example.com    user1        linux   idle; offers exit node

Out of curiosity, why are you running with Postgres instead of SQLite?

In my case, I use headscale within GCP's k8s (GKE), and I use a managed PostgreSQL instance because it's more convenient to work with than sqlite in terms of creating database backups. Also, due to #1482 (which was fixed in #1562, but hasn't been merged yet), I need to run a script to clear out this data. It's easier for me to do this using an external database, executing the script in a separate container. During testing, I use PostgreSQL just to be a bit closer to how I use it in real-life situations.

@kradalby force-pushed the 1561-online-issue branch 2 times, most recently from df0c075 to 3014bc3 on October 10, 2023 at 12:57
@kradalby
Collaborator Author

I have rewritten the failover logic and added quite a large synthetic test for failing over and ensuring the status is sent as expected. Please help test HA and failover.

I should also have fixed the missing username bug in DNS.

I have rethought how the Online status is set; this is currently implemented in the CLI and the HA failover, but not yet in the Online map sent to nodes, which might affect things like exit nodes. I'll work on that, but please test it anyway.
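To make the failover behaviour described above concrete, here is a small Go sketch of the kind of primary-route selection being discussed (the struct and field names are my assumptions, not headscale's real types):

package main

import (
	"fmt"
	"net/netip"
)

// route is a simplified view of an advertised subnet route.
type route struct {
	node    string
	prefix  netip.Prefix
	enabled bool
	primary bool
	online  bool // whether the advertising node is connected to headscale
}

// failover makes sure that, for the given prefix, the primary flag sits on an
// enabled route whose node is online; if the current primary went offline,
// it is demoted and the first online candidate is promoted.
func failover(routes []route, prefix netip.Prefix) {
	for _, r := range routes {
		if r.prefix == prefix && r.primary && r.enabled && r.online {
			return // the current primary is still fine
		}
	}
	promoted := false
	for i := range routes {
		if routes[i].prefix != prefix {
			continue
		}
		routes[i].primary = false
		if !promoted && routes[i].enabled && routes[i].online {
			routes[i].primary = true
			promoted = true
		}
	}
}

func main() {
	pfx := netip.MustParsePrefix("10.114.0.0/20")
	routes := []route{
		{node: "ts00", prefix: pfx, enabled: true, primary: true, online: false},
		{node: "ts01", prefix: pfx, enabled: true, primary: false, online: true},
	}
	failover(routes, pfx)
	for _, r := range routes {
		fmt.Println(r.node, "primary:", r.primary)
	}
}
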

@kradalby
Collaborator Author

@vsychov @joepa37

Is the 🚀 to be interpreted as "It is all good now and we should ship it" or "Exciting, will be testing"? 😆

@vsychov
Contributor

vsychov commented Nov 29, 2023

@kradalby, I'll repeat all tests this week and will test a few more cases. Thanks for the great work.

@kradalby
Collaborator Author

Thank you, I will work on the things I seem to have broken (Taildrop, DERP-only, and the web-only logout flow) tomorrow.

Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
This PR further improves the state management system and tries to
make sure that we get all nodes in sync continuously.

This is greatly enabled by a previous PR dropping support for older
clients, which allowed us to use a Patch field that only sends small
diffs for client updates.

It also reworks how the HA subnet router is handled; it should
be a bit easier to follow now.

Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
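To illustrate the patch-style updates mentioned in the commit message above, here is a schematic Go sketch (my own struct, not the actual wire types headscale sends) of shipping only the fields that changed for a peer instead of the full node object:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// peerChangePatch carries only the fields that changed for one peer; nil
// pointers mean "unchanged", which keeps the update small.
type peerChangePatch struct {
	NodeID   uint64     `json:"nodeId"`
	Online   *bool      `json:"online,omitempty"`
	LastSeen *time.Time `json:"lastSeen,omitempty"`
}

func main() {
	online := true
	now := time.Now().UTC()
	patch := peerChangePatch{NodeID: 3, Online: &online, LastSeen: &now}

	// Instead of re-sending the whole peer list, only this small diff goes out.
	b, _ := json.Marshal(patch)
	fmt.Println(string(b))
}
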
@vsychov
Contributor

vsychov commented Dec 3, 2023

I conducted another series of tests, starting with two tailscale nodes on headscale revision 979569c, on the infrastructure described in my last message. I noticed an improvement in stability; the nodes no longer fluctuate between online and offline status, which is good news.

The tailscale nodes were connected with the following parameters:

tailscale-tmp-ams3: tailscale up --login-server https://headscale-test.example.com --auth-key XXXX --accept-dns --accept-routes --shields-up=false --advertise-exit-node --advertise-routes=10.110.0.0/20

tailscale-tmp-test: tailscale up --login-server https://headscale-test.example.com --auth-key XXXX --accept-dns --accept-routes --shields-up=false --advertise-exit-node --advertise-routes=10.114.0.0/20

After enabling the routes, the routing table looked like this:

root@headscale-test-6-5c7d4f7d6b-hn8g5:/# headscale route list
ID | Node               | Prefix        | Advertised | Enabled | Primary
1  | tailscale-tmp-ams3 | ::/0          | true       | true    | -
2  | tailscale-tmp-ams3 | 10.110.0.0/20 | true       | true    | true
3  | tailscale-tmp-ams3 | 0.0.0.0/0     | true       | true    | -
4  | tailscale-tmp-test | 0.0.0.0/0     | true       | true    | -
5  | tailscale-tmp-test | ::/0          | true       | true    | -
6  | tailscale-tmp-test | 10.114.0.0/20 | true       | true    | true

Traffic from the 10.110.0.0/20 network successfully passed to the 10.114.0.0/20 network and back.

Next, I used the latest revision at the time, af3c097, to test subnet route failover. I added another tailscale node (tailscale-tmp2-fra1) with parameters similar to tailscale-tmp-test. After enabling the routes, I discovered a problem: there was no longer a Primary route to the 10.114.0.0/20 network:

2023-12-03T16:25:53Z:

root@headscale-test-6-f65f999c7-kgb9c:/# headscale node list
ID | Hostname            | Name                | MachineKey | NodeKey | User                 | IP addresses                  | Ephemeral | Last seen           | Expiration          | Connected | Expired
1  | tailscale-tmp-test  | tailscale-tmp-test  | [MHyGh]    | [hFr8i] | user.example.com     | 100.64.0.1, fd7a:115c:a1e0::1 | false     | 2023-12-03 16:25:44 | 0001-01-01 00:00:00 | online    | no
3  | tailscale-tmp2-fra1 | tailscale-tmp2-fra1 | [zLP+K]    | [fbCkD] | user-two.example.com | 100.64.0.3, fd7a:115c:a1e0::3 | false     | 2023-12-03 16:25:45 | 0001-01-01 00:00:00 | online    | no
2  | tailscale-tmp-ams3  | tailscale-tmp-ams3  | [1w0kC]    | [Fjkoy] | user.example.com     | 100.64.0.2, fd7a:115c:a1e0::2 | false     | 2023-12-03 16:25:45 | 0001-01-01 00:00:00 | online    | no

2023-12-03T16:26:03Z:

root@headscale-test-6-f65f999c7-kgb9c:/# headscale route list
ID | Node                | Prefix        | Advertised | Enabled | Primary
4  | tailscale-tmp-test  | 0.0.0.0/0     | true       | true    | -
5  | tailscale-tmp-test  | ::/0          | true       | true    | -
6  | tailscale-tmp-test  | 10.114.0.0/20 | true       | true    | false
7  | tailscale-tmp2-fra1 | 0.0.0.0/0     | true       | true    | -
8  | tailscale-tmp2-fra1 | ::/0          | true       | true    | -
9  | tailscale-tmp2-fra1 | 10.114.0.0/20 | true       | true    | false
1  | tailscale-tmp-ams3  | ::/0          | true       | true    | -
2  | tailscale-tmp-ams3  | 10.110.0.0/20 | true       | true    | true
3  | tailscale-tmp-ams3  | 0.0.0.0/0     | true       | true    | -

This is despite all three nodes being online.

@kradalby
Collaborator Author

kradalby commented Dec 6, 2023

I think this PR has grown a bit out of hand for the issue it was meant to address, so I propose the following:

It should now pass all the tests we have, so I think it should be reviewed and merged; then, when I get test feedback on #1561, I will continue to address it in new PRs.

As a result of this PR, I did notice two things:

  • It makes TestDERPServerScenario and TestSSHIsBlockedInACL more flaky, particularly on github actions, which I will file separate issues for to address
  • It made me discover that we do not propagate Expiry information correctly (see new commented out test in last commit)

The latter, I think, is a new issue and should also be addressed separately.

When this is merged, let us move all discussions of #1561 back to the issue.

@kradalby kradalby marked this pull request as ready for review December 6, 2023 10:42
@kradalby kradalby requested a review from juanfont as a code owner December 6, 2023 10:42
Signed-off-by: Kristoffer Dalby <[email protected]>
@@ -1,5 +1,6 @@
ignored/
tailscale/
.vscode/
Owner

You now use vscode??

Collaborator Author

Sometimes

@naimo84

naimo84 commented Dec 11, 2023

The new alpha2 version fixed the issue on my side 👍🥳 Thanks a lot for the amazing work 🥳
