
ensure online status and route changes are propagated #1564

Merged · 13 commits · Dec 9, 2023

Conversation

kradalby
Collaborator

This is an attempt to address #1561.

@kradalby
Collaborator Author

@vsychov could you test this branch with regard to #1561?

@vsychov
Contributor

vsychov commented Sep 28, 2023

@kradalby thanks, I'll test it today

@vsychov
Contributor

vsychov commented Sep 28, 2023

Thank you, the situation has improved (before this PR, nodes were marked as offline forever), but the nodes are still 'flapping' between online and offline states.

I found a simple way to check this locally; it's enough to run headscale on 127.0.0.1:8080, and use the docker-compose file:

version: "3.8"
services:
  ts01:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'XXXX'

  ts02:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'XXXX'

Leave it running for ~10 minutes; the nodes occasionally go 'offline'.
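A quick way to watch this from the headscale side (my own suggestion, not part of the original report) is to poll the CLI while the compose stack is running:

# Poll the node list every 30 seconds and watch the Online column flip.
watch -n 30 headscale node list

# In parallel, follow the client logs for the long-poll timeouts.
docker compose logs -f ts01 ts02
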
headscale node list:

at 2023-09-28T10:30:20Z (offline):

ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online  | Expired
3  | example-hostname | example-hostname          | [hbrPm]    | [LzCj3] | user1 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-28 10:28:54 | 0001-01-01 00:00:00 | offline | no
4  | example-hostname | example-hostname-fqska9ih | [rXF+e]    | [O8SbL] | user2 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-28 10:28:54 | 0001-01-01 00:00:00 | offline | no

at 2023-09-28T10:33:20Z (online):

ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online | Expired
3  | example-hostname | example-hostname          | [hbrPm]    | [LzCj3] | user1 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-28 10:32:54 | 0001-01-01 00:00:00 | online | no
4  | example-hostname | example-hostname-fqska9ih | [rXF+e]    | [O8SbL] | user2 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-28 10:32:54 | 0001-01-01 00:00:00 | online | no

at 2023-09-28T10:35:55Z (offline):

ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online  | Expired
3  | example-hostname | example-hostname          | [hbrPm]    | [LzCj3] | user1 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-28 10:34:54 | 0001-01-01 00:00:00 | offline | no
4  | example-hostname | example-hostname-fqska9ih | [rXF+e]    | [O8SbL] | user2 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-28 10:34:54 | 0001-01-01 00:00:00 | offline | no

at 2023-09-28T10:37:38Z (online):

ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online | Expired
3  | example-hostname | example-hostname          | [hbrPm]    | [LzCj3] | user1 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-28 10:36:54 | 0001-01-01 00:00:00 | online | no
4  | example-hostname | example-hostname-fqska9ih | [rXF+e]    | [O8SbL] | user2 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-28 10:36:54 | 0001-01-01 00:00:00 | online | no

in tailscale logs lot of errors like this:

headscale-ts02-1  | 2023/09/28 10:26:54 control: map response long-poll timed out!
headscale-ts02-1  | 2023/09/28 10:26:54 Received error: PollNetMap: context canceled
headscale-ts01-1  | 2023/09/28 10:28:54 control: map response long-poll timed out!
headscale-ts01-1  | 2023/09/28 10:28:54 Received error: PollNetMap: context canceled
headscale-ts02-1  | 2023/09/28 10:28:54 control: map response long-poll timed out!
headscale-ts02-1  | 2023/09/28 10:28:54 Received error: PollNetMap: context canceled
headscale-ts01-1  | 2023/09/28 10:30:54 control: map response long-poll timed out!
headscale-ts01-1  | 2023/09/28 10:30:54 Received error: PollNetMap: context canceled
headscale-ts02-1  | 2023/09/28 10:30:54 control: map response long-poll timed out!
headscale-ts02-1  | 2023/09/28 10:30:54 Received error: PollNetMap: context canceled

@kradalby
Collaborator Author

Am I correct in understanding that the exit node/subnet router issue is "solved" once the node is correctly reported as online?

Or is that a second issue we will still have after this one?

@joepa37

joepa37 commented Sep 28, 2023

Some behaviors from the 1561-online-issue branch:

  • ✅ Connectivity works (ping all)
  • ✅ Online updated (really fast)
  • ⚠️ Offline took a while to update (almost a minute or so)
  • ✅ Route HA primary change when offline is updated
  • ❌ Routes not reachable at all
  • ❌ Exit nodes not working

I am using the same configuration as on v0.22.3.

@kradalby
Collaborator Author

@vsychov @joepa37, 10ec28a adds an integration test case that tries to verify that the online status is reported correctly over time. Can you have a look to check that it makes sense and that it replicates the same behaviour as the proposed reproduction case?

@joepa37

joepa37 commented Sep 28, 2023

@kradalby the make command gives me some errors:

error: builder for '/nix/store/al77c86nv3y90aay4gi2r3sxnvn0hrll-headscale-dev.drv' failed with exit code 1;
       last 10 log lines:
       >             + 		Tags:              []string{},
       >             + 		PrimaryRoutes:     []netip.Prefix{s"192.168.0.0/24"},
       >             + 		LastSeen:          s"2009-11-10 23:09:00 +0000 UTC",
       >             + 		MachineAuthorized: true,
       >             + 		...
       >             + 	},
       >               )
       > FAIL
       > FAIL	github.com/juanfont/headscale/hscontrol/mapper	0.109s
       > FAIL
       For full logs, run 'nix log /nix/store/al77c86nv3y90aay4gi2r3sxnvn0hrll-headscale-dev.drv'.
make: *** [Makefile:22: build] Error 1

@kradalby
Collaborator Author

@joepa37 I forgot to update some tests, please try again.

@joepa37

joepa37 commented Sep 28, 2023

For the last check I updated the ACL policy to the following (sorry for not checking this before :/):

acls:
  - action: accept
    src: ["*"]
    dst: ["*:*"]

It seems the ACL configuration I've been using is not working any more.

/etc/headscale/acls.yaml
acls:
  # Mesh Network
  - action: accept
    src: [group:servers]
    dst: [group:servers:*]
  # Administrators can access to all server resources
  - action: accept
    src: [group:admin]
    dst: [group:servers:*]
  # Developers can access other devs HTTP apps
  - action: accept
    src: [group:devs]
    proto: tcp
    dst: ['group:devs:80,443']
  # All clients can access private http apps
  - action: accept
    src: [group:servers, group:admin, group:devs]
    dst: [10.30.0.0/16:80, 10.30.0.0/16:443]
   # Exit Nodes
  - action: accept
    src: [group:admin]
    dst: [
      group:servers:0,
      0.0.0.0/5:*, 8.0.0.0/7:*, 11.0.0.0/8:*, 12.0.0.0/6:*, 16.0.0.0/4:*, 32.0.0.0/3:*, 64.0.0.0/3:*, 96.0.0.0/6:*, 100.0.0.0/10:*, 100.128.0.0/9:*, 101.0.0.0/8:*, 102.0.0.0/7:*, 104.0.0.0/5:*, 112.0.0.0/5:*, 120.0.0.0/6:*, 124.0.0.0/7:*, 126.0.0.0/8:*, 128.0.0.0/3:*, 160.0.0.0/5:*, 168.0.0.0/6:*, 172.0.0.0/12:*, 172.32.0.0/11:*, 172.64.0.0/10:*, 172.128.0.0/9:*, 173.0.0.0/8:*, 174.0.0.0/7:*, 176.0.0.0/4:*, 192.0.0.0/9:*, 192.128.0.0/11:*, 192.160.0.0/13:*, 192.169.0.0/16:*, 192.170.0.0/15:*, 192.172.0.0/14:*, 192.176.0.0/12:*, 192.192.0.0/10:*, 193.0.0.0/8:*, 194.0.0.0/7:*, 196.0.0.0/6:*, 200.0.0.0/5:*, 208.0.0.0/4:*
    ]
  

Tests I have repeated
✅ Connectivity works (ping all)
✅ Online updated (really fast)
⚠️ Offline took a while to update (almost a minute or so)
✅ Route HA primary change when offline is updated
✅ Routes are reachable
✅ Exit nodes are working
❌ HA Route failover not working yet
When the primary node is updated to offline, the headscale route list updates the primary to the second node, but the clients are still not able to reach the route.
I have to manually tailscale down and tailscale up on the client to make the route reachable from the new primary node.

Other tests I have run
✅ Register server with preauthkeys
✅ Register user with OIDC (new "Signed in via your OIDC provider" template)
✅ Taildrop
tailscale ping
tailscale ssh

Environments:
Servers: Ubuntu 22.04
Clients: Windows 11 Pro, macOS Catalina 10.15.7, iPadOS 16.7
Tailscale: 1.48.2, 1.50.0

@kradalby
Collaborator Author

It seems the ACL configuration I've been using is not working any more.

Hmm, I think we need to track this in a separate issue.

Tests I have repeated

So I think this means that subnet routers and exit nodes now work, and what is left is:

⚠️ Offline took a while to update (almost a minute or so)

This needs to be sped up, particularly to make HA subnet routers useful.

❌ HA Route failover not working yet
When the primary node is updated to offline, the headscale route list updates the primary to the second node, but the clients are still not able to reach the route.
I have to manually tailscale down and tailscale up on the client to make the route reachable from the new primary node.

HA routes are not updated correctly.

Of course, I would be keen to hear back from @vsychov as well regarding the exit node/subnet router.

@kradalby
Collaborator Author

I've pushed a fix that should quickly tell nodes when a node disconnects; can you give that a shot, @joepa37?

@joepa37

joepa37 commented Sep 28, 2023

It seems like the behavior is actually the opposite of what it's supposed to be.

From the client's perspective, both servers appear offline:

tailscale status
100.64.0.3      macbook-pro                 root       macOS   -
100.64.0.1      build-phx1-ad1.mesh.ts.net  mesh       linux   idle; offers exit node; offline
100.64.0.2      test-phx1-ad1.mesh.ts.net   mesh       linux   idle; offers exit node; offline

Even though they are actually reachable:

tailscale status
100.64.0.3      macbook-pro                 root     macOS   -
100.64.0.1      build-phx1-ad1.mesh.ts.net  mesh     linux   active; offers exit node; direct 129.153.71.250:51821; offline, tx 11308 rx 5068
100.64.0.2      test-phx1-ad1.mesh.ts.net   mesh     linux   idle, tx 820 rx 732

headscale node list and headscale route list behave the same as before, updating the offline status after a minute or so.

@vsychov
Contributor

vsychov commented Sep 29, 2023

@kradalby, I checked it again, and it still doesn't work:

Test Environment:

Steps to Reproduce:

  1. Execute the following commands to create users and preauthkeys:

    headscale users create user1
    headscale users create user2
    headscale preauthkey create --user user1 --reusable --ephemeral
    headscale preauthkey create --user user2 --reusable --ephemeral
  2. Set up clients in docker-compose with the given configurations:

    ts00:
      image: tailscale/tailscale:v1.48.2
      network_mode: host
      environment:
        TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/ --advertise-exit-node --advertise-routes=10.0.0.0/8'
        TS_USERSPACE: 'true'
        TS_AUTHKEY: 'replace-by-user1-key'
    
    ts01:
      image: tailscale/tailscale:v1.48.2
      network_mode: host
      environment:
        TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/  --advertise-exit-node --advertise-routes=10.0.0.0/8'
        TS_USERSPACE: 'true'
        TS_AUTHKEY: 'replace-by-user1-key'
    
    ts-client:
      image: tailscale/tailscale:v1.48.2
      network_mode: host
      environment:
        TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
        TS_USERSPACE: 'true'
        TS_AUTHKEY: 'replace-by-user2-key'
  3. Inside ts-client, check the status and note the hostnames:

bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:38:49 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6.user1.example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2.user1.example.com user1        linux   -
  4. Approve and check the routes list at 2023-09-29T09:40:32Z:
date && headscale  route list
ID | Node                       | Prefix     | Advertised | Enabled | Primary
1  | test-host-fe57zhq6 | 0.0.0.0/0  | true       | true    | -
2  | test-host-fe57zhq6 | ::/0       | true       | true    | -
3  | test-host-fe57zhq6 | 10.0.0.0/8 | true       | true    | true
4  | test-host-wi4tp7j2 | 0.0.0.0/0  | true       | true    | -
5  | test-host-wi4tp7j2 | ::/0       | true       | true    | -
6  | test-host-wi4tp7j2 | 10.0.0.0/8 | true       | true    | false
  5. Inside ts-client, recheck the status and observe the hostnames. They no longer contain the user part:
bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:40:45 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6..example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2..example.com user1        linux   -

bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:41:35 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6..example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2..example.com user1        linux   -

date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:46:05 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6..example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2..example.com user1        linux   -

  6. Restart headscale and check the status again (still no exit node offered):
bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:47:42 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6..example.com user1        linux   -
100.64.0.3      test-host-wi4tp7j2..example.com user1        linux   -
  7. At 2023-09-29T09:55:18Z, headscale node list shows the nodes as offline:
ID | Hostname          | Name                       | MachineKey | NodeKey | User  | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online  | Expired
1  | test-host | test-host          | [EpOBA]    | [v8FlV] | user2 | 100.64.0.1, fd7a:115c:a1e0::1 | true      | 2023-09-29 09:54:09 | 0001-01-01 00:00:00 | offline | no
2  | test-host | test-host-fe57zhq6 | [w9oQW]    | [q71Kw] | user1 | 100.64.0.2, fd7a:115c:a1e0::2 | true      | 2023-09-29 09:54:09 | 0001-01-01 00:00:00 | offline | no
3  | test-host | test-host-wi4tp7j2 | [5pB9g]    | [d2Pw6] | user1 | 100.64.0.3, fd7a:115c:a1e0::3 | true      | 2023-09-29 09:54:09 | 0001-01-01 00:00:00 | offline | no
  8. The nodes appear in the list after some time (~10 minutes after the restart), but they are marked as offline:
bash-5.1# date && tailscale --socket=/tmp/tailscaled.sock status
Fri Sep 29 09:56:52 UTC 2023
100.64.0.1      test-host    user2        linux   -
100.64.0.2      test-host-fe57zhq6.user1.example.com user1        linux   idle; offers exit node; offline
100.64.0.3      test-host-wi4tp7j2.user1.example.com user1        linux   idle; offers exit node; offline

@kradalby
Collaborator Author

Ok, thanks, I will be away from my computer for a couple of days, I'll try to go back to the drawing board after that.

@kradalby
Collaborator Author

kradalby commented Oct 1, 2023

It seems like the behavior is actually the opposite of what it's supposed to be.

So this might be true, but "offline" does not actually mean that the node is offline from the network or the internet; it means that it is not connected to the control server (headscale). So if two nodes have each other's last endpoints, they will still be able to connect even if a node is marked as offline.
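As a rough illustration of that distinction, here is a minimal sketch in Go (my own types and names, not headscale's actual code) of treating "online" as "currently holds an open map session with the control server":

package main

import (
	"fmt"
	"sync"
	"time"
)

// onlineTracker records which nodes currently hold an open streaming
// map session ("long poll") against the control server.
type onlineTracker struct {
	mu   sync.Mutex
	open map[string]time.Time // node key -> when the session was opened
}

func newOnlineTracker() *onlineTracker {
	return &onlineTracker{open: make(map[string]time.Time)}
}

// connect is called when a node opens its map session.
func (t *onlineTracker) connect(nodeKey string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.open[nodeKey] = time.Now()
}

// disconnect is called when the session closes (timeout, restart, ...).
func (t *onlineTracker) disconnect(nodeKey string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.open, nodeKey)
}

// isOnline reports control-plane connectivity only: peers that already know
// each other's endpoints can keep talking even when this returns false.
func (t *onlineTracker) isOnline(nodeKey string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.open[nodeKey]
	return ok
}

func main() {
	t := newOnlineTracker()
	t.connect("nodekey:abc123")
	fmt.Println(t.isOnline("nodekey:abc123")) // true: map session open
	t.disconnect("nodekey:abc123")
	fmt.Println(t.isOnline("nodekey:abc123")) // false: only the control-plane state changed
}
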

@kradalby
Collaborator Author

kradalby commented Oct 2, 2023

I've pushed two fixes: one for enabling, and one for disabling/failing over nodes.

I added a test case to verify that the enable/disable route command is not only caught by headscale, but also propagated to the nodes. Please feel free to read through it and see if it makes sense.

It would be great if you could help test the following:

  • Enabling a route
  • Enabling an exit node
  • Disabling a route
  • Disabling an exit node
  • Subnet failover
  • Observations about Online status (please see my previous comment for details)

Thank you

@kradalby changed the title from "update LastSeen in db and mapper" to "ensure online status and route changes are propagated" on Oct 2, 2023
@vsychov
Contributor

vsychov commented Oct 2, 2023

Thanks @kradalby, I'll try to test it today.

@vsychov
Contributor

vsychov commented Oct 2, 2023

I tested revision 17f887c1b902d26ac1c49178164925ffa8b607c8 using Docker on localhost, and it has greatly improved.

The exit node was propagated very quickly, and there is no longer any flapping between the online and offline statuses. Tomorrow, I'll construct a more intricate setup based on a real network and will test scenarios with route failover and network flapping.

There are also some additional issues:

bash-5.1# tailscale --socket=/tmp/tailscaled.sock status
100.64.0.3      ts-client            user2        linux   -
100.64.0.1      ts00..example.com    user1        linux   idle; offers exit node
100.64.0.2      ts01..example.com    user1        linux   idle; offers exit node

Hostnames are missing the user part, and hostname resolution isn't working.
ts00..example.com - obviously doesn't work.
ts00.user1.example.com - also doesn't resolve.

@joepa37

joepa37 commented Oct 4, 2023

So this might be true, but "offline" does not actually mean that the node is offline from the network or the internet; it means that it is not connected to the control server (headscale). So if two nodes have each other's last endpoints, they will still be able to connect even if a node is marked as offline.

Thanks for taking the time to explain this; it is really helpful to me.

It would be great if you could help test the following:

✅ Enabling a route: manually approved and auto approved
⚠️ Enabling an exit node: sometimes appears greyed out on the client side (node offline)
✅ Disabling a route: working as expected
✅ Disabling an exit node: working as expected
❌ Subnet failover: not working at all (the node is marked as offline after a minute or so, then the primary route is switched, but the client can't reach the new primary unless tailscale is restarted)
⚠️ Observations about Online status: sometimes nodes are marked as offline even though they can reach the headscale server

PD:

  • I am using the default config.yaml
  • To force the failover I have run tailscale down on the node holding the primary route (see the command sketch below)
  • The first time headscale registers the nodes, they appear with the wrong name, as @vsychov already mentioned
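A minimal way to exercise that failover scenario (my own sketch of commands, reusing only the CLI forms already shown in this thread):

# On the node currently holding the primary route:
tailscale down

# On the headscale server, watch the Primary column move to the other node:
watch -n 10 headscale route list
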

@kradalby
Collaborator Author

kradalby commented Oct 5, 2023

Hostnames are missing the user part, and hostname resolution isn't working.
ts00..example.com - obviously doesn't work.
ts00.user1.example.com - also doesn't resolve.

Hmm, I am not seeing this in my tests. Do you mind sharing your config?

@vsychov
Contributor

vsychov commented Oct 5, 2023

Hostnames are missing the user part, and hostname resolution isn't working.
ts00..example.com - obviously doesn't work.
ts00.user1.example.com - also doesn't resolve.

hmm, I am not seeing this in my test, do you mind sharing your config?

config.yaml
---
# headscale will look for a configuration file named `config.yaml` (or `config.json`) in the following order:
#
# - `/etc/headscale`
# - `~/.headscale`
# - current working directory

# The url clients will connect to.
# Typically this will be a domain like:
#
# https://myheadscale.example.com:443
#
server_url: http://127.0.0.1:8080

# Address to listen to / bind to on the server
#
# For production:
# listen_addr: 0.0.0.0:8080
listen_addr: 127.0.0.1:8080

# Address to listen to /metrics, you may want
# to keep this endpoint private to your internal
# network
#
metrics_listen_addr: 127.0.0.1:9090

# Address to listen for gRPC.
# gRPC is used for controlling a headscale server
# remotely with the CLI
# Note: Remote access _only_ works if you have
# valid certificates.
#
# For production:
# grpc_listen_addr: 0.0.0.0:50443
grpc_listen_addr: 127.0.0.1:50443

# Allow the gRPC admin interface to run in INSECURE
# mode. This is not recommended as the traffic will
# be unencrypted. Only enable if you know what you
# are doing.
grpc_allow_insecure: false

# Private key used to encrypt the traffic between headscale
# and Tailscale clients.
# The private key file will be autogenerated if it's missing.
#
private_key_path: /home/user/headscale/private.key

# The Noise section includes specific configuration for the
# TS2021 Noise protocol
noise:
  # The Noise private key is used to encrypt the
  # traffic between headscale and Tailscale clients when
  # using the new Noise-based protocol. It must be different
  # from the legacy private key.
  private_key_path: /home/user/headscale/noise_private.key

# List of IP prefixes to allocate tailaddresses from.
# Each prefix consists of either an IPv4 or IPv6 address,
# and the associated prefix length, delimited by a slash.
# It must be within IP ranges supported by the Tailscale
# client - i.e., subnets of 100.64.0.0/10 and fd7a:115c:a1e0::/48.
# See below:
# IPv6: https://github.com/tailscale/tailscale/blob/22ebb25e833264f58d7c3f534a8b166894a89536/net/tsaddr/tsaddr.go#LL81C52-L81C71
# IPv4: https://github.com/tailscale/tailscale/blob/22ebb25e833264f58d7c3f534a8b166894a89536/net/tsaddr/tsaddr.go#L33
# Any other range is NOT supported, and it will cause unexpected issues.
ip_prefixes:
  - fd7a:115c:a1e0::/48
  - 100.64.0.0/10

# DERP is a relay system that Tailscale uses when a direct
# connection cannot be established.
# https://tailscale.com/blog/how-tailscale-works/#encrypted-tcp-relays-derp
#
# headscale needs a list of DERP servers that can be presented
# to the clients.
derp:
  server:
    # If enabled, runs the embedded DERP server and merges it into the rest of the DERP config
    # The Headscale server_url defined above MUST be using https, DERP requires TLS to be in place
    enabled: false

    # Region ID to use for the embedded DERP server.
    # The local DERP prevails if the region ID collides with other region ID coming from
    # the regular DERP config.
    region_id: 999

    # Region code and name are displayed in the Tailscale UI to identify a DERP region
    region_code: "headscale"
    region_name: "Headscale Embedded DERP"

    # Listens over UDP at the configured address for STUN connections - to help with NAT traversal.
    # When the embedded DERP server is enabled stun_listen_addr MUST be defined.
    #
    # For more details on how this works, check this great article: https://tailscale.com/blog/how-tailscale-works/
    stun_listen_addr: "0.0.0.0:3478"

  # List of externally available DERP maps encoded in JSON
  urls:
    - https://controlplane.tailscale.com/derpmap/default

  # Locally available DERP map files encoded in YAML
  #
  # This option is mostly interesting for people hosting
  # their own DERP servers:
  # https://tailscale.com/kb/1118/custom-derp-servers/
  #
  # paths:
  #   - /etc/headscale/derp-example.yaml
  paths: []

  # If enabled, a worker will be set up to periodically
  # refresh the given sources and update the derpmap
  # will be set up.
  auto_update_enabled: true

  # How often should we check for DERP updates?
  update_frequency: 24h

# Disables the automatic check for headscale updates on startup
disable_check_updates: false

# Time before an inactive ephemeral node is deleted?
ephemeral_node_inactivity_timeout: 30m

# Period to check for node updates within the tailnet. A value too low will severely affect
# CPU consumption of Headscale. A value too high (over 60s) will cause problems
# for the nodes, as they won't get updates or keep alive messages frequently enough.
# In case of doubts, do not touch the default 10s.
node_update_check_interval: 10s

# # Postgres config
# If using a Unix socket to connect to Postgres, set the socket path in the 'host' field and leave 'port' blank.
db_type: postgres
db_host: localhost
db_port: 5432
db_name: headscale
db_user: headscale
db_pass: password

# If other 'sslmode' is required instead of 'require(true)' and 'disabled(false)', set the 'sslmode' you need
# in the 'db_ssl' field. Refers to https://www.postgresql.org/docs/current/libpq-ssl.html Table 34.1.
db_ssl: false

### TLS configuration
#
## Let's encrypt / ACME
#
# headscale supports automatically requesting and setting up
# TLS for a domain with Let's Encrypt.
#
# URL to ACME directory
acme_url: https://acme-v02.api.letsencrypt.org/directory

# Email to register with ACME provider
acme_email: ""

# Domain name to request a TLS certificate for:
tls_letsencrypt_hostname: ""

# Path to store certificates and metadata needed by
# letsencrypt
# For production:
tls_letsencrypt_cache_dir: /home/user/headscale/cache

# Type of ACME challenge to use, currently supported types:
# HTTP-01 or TLS-ALPN-01
# See [docs/tls.md](docs/tls.md) for more information
tls_letsencrypt_challenge_type: HTTP-01
# When HTTP-01 challenge is chosen, letsencrypt must set up a
# verification endpoint, and it will be listening on:
# :http = port 80
tls_letsencrypt_listen: ":http"

## Use already defined certificates:
tls_cert_path: ""
tls_key_path: ""

log:
  # Output formatting for logs: text or json
  format: text
  level: trace

# Path to a file containing ACL policies.
# ACLs can be defined as YAML or HUJSON.
# https://tailscale.com/kb/1018/acls/
acl_policy_path: ""

## DNS
#
# headscale supports Tailscale's DNS configuration and MagicDNS.
# Please have a look to their KB to better understand the concepts:
#
# - https://tailscale.com/kb/1054/dns/
# - https://tailscale.com/kb/1081/magicdns/
# - https://tailscale.com/blog/2021-09-private-dns-with-magicdns/
#
dns_config:
  # Whether to prefer using Headscale provided DNS or use local.
  override_local_dns: true

  # List of DNS servers to expose to clients.
  nameservers:
    - 1.1.1.1

  # NextDNS (see https://tailscale.com/kb/1218/nextdns/).
  # "abc123" is example NextDNS ID, replace with yours.
  #
  # With metadata sharing:
  # nameservers:
  #   - https://dns.nextdns.io/abc123
  #
  # Without metadata sharing:
  # nameservers:
  #   - 2a07:a8c0::ab:c123
  #   - 2a07:a8c1::ab:c123

  # Split DNS (see https://tailscale.com/kb/1054/dns/),
  # list of search domains and the DNS to query for each one.
  #
  # restricted_nameservers:
  #   foo.bar.com:
  #     - 1.1.1.1
  #   darp.headscale.net:
  #     - 1.1.1.1
  #     - 8.8.8.8

  # Search domains to inject.
  domains: []

  # Extra DNS records
  # so far only A-records are supported (on the tailscale side)
  # See https://github.com/juanfont/headscale/blob/main/docs/dns-records.md#Limitations
  # extra_records:
  #   - name: "grafana.myvpn.example.com"
  #     type: "A"
  #     value: "100.64.0.3"
  #
  #   # you can also put it in one line
  #   - { name: "prometheus.myvpn.example.com", type: "A", value: "100.64.0.3" }

  # Whether to use [MagicDNS](https://tailscale.com/kb/1081/magicdns/).
  # Only works if there is at least a nameserver defined.
  magic_dns: true

  # Defines the base domain to create the hostnames for MagicDNS.
  # `base_domain` must be a FQDNs, without the trailing dot.
  # The FQDN of the hosts will be
  # `hostname.user.base_domain` (e.g., _myhost.myuser.example.com_).
  base_domain: example.com

# Unix socket used for the CLI to connect without authentication
# Note: for production you will want to set this to something like:
unix_socket: /home/user/headscale/headscale.sock
unix_socket_permission: "0770"
#
# headscale supports experimental OpenID connect support,
# it is still being tested and might have some bugs, please
# help us test it.
# OpenID Connect
# oidc:
#   only_start_if_oidc_is_available: true
#   issuer: "https://your-oidc.issuer.com/path"
#   client_id: "your-oidc-client-id"
#   client_secret: "your-oidc-client-secret"
#   # Alternatively, set `client_secret_path` to read the secret from the file.
#   # It resolves environment variables, making integration to systemd's
#   # `LoadCredential` straightforward:
#   client_secret_path: "${CREDENTIALS_DIRECTORY}/oidc_client_secret"
#   # client_secret and client_secret_path are mutually exclusive.
#
#   # The amount of time from a node is authenticated with OpenID until it
#   # expires and needs to reauthenticate.
#   # Setting the value to "0" will mean no expiry.
#   expiry: 180d
#
#   # Use the expiry from the token received from OpenID when the user logged
#   # in, this will typically lead to frequent need to reauthenticate and should
#   # only been enabled if you know what you are doing.
#   # Note: enabling this will cause `oidc.expiry` to be ignored.
#   use_expiry_from_token: false
#
#   # Customize the scopes used in the OIDC flow, defaults to "openid", "profile" and "email" and add custom query
#   # parameters to the Authorize Endpoint request. Scopes default to "openid", "profile" and "email".
#
#   scope: ["openid", "profile", "email", "custom"]
#   extra_params:
#     domain_hint: example.com
#
#   # List allowed principal domains and/or users. If an authenticated user's domain is not in this list, the
#   # authentication request will be rejected.
#
#   allowed_domains:
#     - example.com
#   # Note: Groups from keycloak have a leading '/'
#   allowed_groups:
#     - /headscale
#   allowed_users:
#     - [email protected]
#
#   # If `strip_email_domain` is set to `true`, the domain part of the username email address will be removed.
#   # This will transform `first-name.last-name@example.com` to the user `first-name.last-name`
#   # If `strip_email_domain` is set to `false` the domain part will NOT be removed resulting to the following
#   user: `first-name.last-name.example.com`
#
#   strip_email_domain: true

# Logtail configuration
# Logtail is Tailscales logging and auditing infrastructure, it allows the control panel
# to instruct tailscale nodes to log their activity to a remote server.
logtail:
  # Enable logtail for this headscales clients.
  # As there is currently no support for overriding the log server in headscale, this is
  # disabled by default. Enabling this will make your clients send logs to Tailscale Inc.
  enabled: false

# Enabling this option makes devices prefer a random port for WireGuard traffic over the
# default static port 41641. This option is intended as a workaround for some buggy
# firewall devices. See https://tailscale.com/kb/1181/firewalls/ for more information.
randomize_client_port: false
docker-compose.yaml
version: "3.8"
services:
  db:
    image: postgres
    command: ["postgres", "-c", "log_statement=all"]
    ports:
      - 5432:5432

    environment:
      POSTGRES_DB: headscale
      POSTGRES_USER: headscale
      POSTGRES_PASSWORD: password

  adminer:
    depends_on:
      - db
    image: adminer
    ports:
      - 8081:8080

  ts00:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/ --advertise-exit-node --advertise-routes=10.0.0.0/8'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'KEY_1'
      TS_HOSTNAME: 'ts00'

  ts01:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/  --advertise-exit-node --advertise-routes=10.0.0.0/8'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'KEY_1'
      TS_HOSTNAME: 'ts01'

  ts-client:
    image: tailscale/tailscale:v1.48.2
    network_mode: host
    environment:
      TS_EXTRA_ARGS: '--login-server=http://127.0.0.1:8080/'
      TS_USERSPACE: 'true'
      TS_AUTHKEY: 'KEY_2'
      TS_HOSTNAME: 'ts-client'
  • docker compose up -d db
  • run headscale with config.yaml
  • create user1 and user2, and generate key_1 and key_2, replace it in docker-compose.yaml
  • docker compose up ts00 ts01 ts-client

@kradalby
Copy link
Collaborator Author

kradalby commented Oct 6, 2023

Ok, after running your config, I am seeing that the nodes struggle to join at all, but when they do manage to join I don't see the name issue.

I ran it with SQLite instead of Postgres and it ran fine (both joining and the name).

@vsychov can you try running your test with SQLite?
Out of curiosity, why are you running with Postgres instead of SQLite?

@joepa37 what database are you running?
Can you test with the other database as well and report back?
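For reference, switching the posted config from Postgres to SQLite should only require replacing the db_* block with something like the following (the path is a placeholder, not taken from the thread):

db_type: sqlite3
db_path: /home/user/headscale/db.sqlite
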

@vsychov
Contributor

vsychov commented Oct 6, 2023

@kradalby, I've reproduced the bug on sqlite, but it seems I missed an important detail in the description above; it only manifests after enabling routes. As a result, the steps to reproduce on sqlite are as follows:

  • Create user1 and user2, generate key_1 and key_2, then replace them in docker-compose.yaml.
headscale users create user1
headscale users create user2
headscale preauthkey create --user user1 --reusable --ephemeral
headscale preauthkey create --user user2 --reusable --ephemeral
  • Run docker compose up ts00 ts01 ts-client.
  • After that, enable all the routes:
headscale route enable -r 1
headscale route enable -r 2
headscale route enable -r 3
headscale route enable -r 4
headscale route enable -r 5
headscale route enable -r 6

Then, inside ts-client, check the status:

# tailscale --socket=/tmp/tailscaled.sock status
100.64.0.2      ts-client            user2        linux   -
100.64.0.1      ts00..example.com    user1        linux   idle; offers exit node
100.64.0.3      ts01..example.com    user1        linux   idle; offers exit node

Out of curiosity, why are you running with Postgres instead of SQLite?

In my case, I use headscale within GCP's k8s (GKE), and I use a managed PostgreSQL instance because it's more convenient to work with than sqlite in terms of creating database backups. Also, due to #1482 (which was fixed in #1562, but hasn't been merged yet), I need to run a script to clear out this data. It's easier for me to do this using an external database, executing the script in a separate container. During testing, I use PostgreSQL just to be a bit closer to how I use it in real-life situations.

@kradalby force-pushed the 1561-online-issue branch 2 times, most recently from df0c075 to 3014bc3 on October 10, 2023 at 12:57
@kradalby
Collaborator Author

I have rewritten the failover logic and added quite a large synthetic test for failing over and ensuring the status is sent as expected. Please help test HA and failover.

I should also have fixed the missing username bug in DNS.

I have rethought how the Online status is set; this is currently implemented in the CLI and the HA failover, but not yet in the Online map sent to nodes, which might affect things like exit nodes. I'll work on that, but please test it anyway.
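To make the failover behaviour described above concrete, here is a small Go sketch of the kind of primary-route selection being discussed (the struct and field names are my assumptions, not headscale's real types):

package main

import (
	"fmt"
	"net/netip"
)

// route is a simplified view of an advertised subnet route.
type route struct {
	node    string
	prefix  netip.Prefix
	enabled bool
	primary bool
	online  bool // whether the advertising node is connected to headscale
}

// failover makes sure that, for the given prefix, the primary flag sits on an
// enabled route whose node is online; if the current primary went offline,
// it is demoted and the first online candidate is promoted.
func failover(routes []route, prefix netip.Prefix) {
	for _, r := range routes {
		if r.prefix == prefix && r.primary && r.enabled && r.online {
			return // the current primary is still fine
		}
	}
	promoted := false
	for i := range routes {
		if routes[i].prefix != prefix {
			continue
		}
		routes[i].primary = false
		if !promoted && routes[i].enabled && routes[i].online {
			routes[i].primary = true
			promoted = true
		}
	}
}

func main() {
	pfx := netip.MustParsePrefix("10.114.0.0/20")
	routes := []route{
		{node: "ts00", prefix: pfx, enabled: true, primary: true, online: false},
		{node: "ts01", prefix: pfx, enabled: true, primary: false, online: true},
	}
	failover(routes, pfx)
	for _, r := range routes {
		fmt.Println(r.node, "primary:", r.primary)
	}
}
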

@kradalby
Collaborator Author

@vsychov @joepa37

Is the 🚀 to be interpreted as "It is all good now and we should ship it" or "Exciting, will be testing"? 😆

@vsychov
Contributor

vsychov commented Nov 29, 2023

@kradalby, I'll repeat all tests this week and will test a few more cases. Thanks for the great work.

@kradalby
Collaborator Author

Thank you, I will work on the things I seem to have broken (Taildrop, DERP-only, and the web-only logout flow) tomorrow.

Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
This PR further improves the state management system and tries to
make sure that we get all nodes in sync continuously.

This is greatly enabled by a previous PR dropping support for older
clients, which allowed us to use a Patch field that only sends small
diffs for client updates.

It also reworks how the HA subnet router is handled; it should
be a bit easier to follow now.

Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
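To illustrate the patch-style updates mentioned in the commit message above, here is a schematic Go sketch (my own struct, not the actual wire types headscale sends) of shipping only the fields that changed for a peer instead of the full node object:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// peerChangePatch carries only the fields that changed for one peer; nil
// pointers mean "unchanged", which keeps the update small.
type peerChangePatch struct {
	NodeID   uint64     `json:"nodeId"`
	Online   *bool      `json:"online,omitempty"`
	LastSeen *time.Time `json:"lastSeen,omitempty"`
}

func main() {
	online := true
	now := time.Now().UTC()
	patch := peerChangePatch{NodeID: 3, Online: &online, LastSeen: &now}

	// Instead of re-sending the whole peer list, only this small diff goes out.
	b, _ := json.Marshal(patch)
	fmt.Println(string(b))
}
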
@vsychov
Contributor

vsychov commented Dec 3, 2023

I conducted another series of tests, starting with two tailscale nodes on headscale revision 979569c, on the infrastructure described in my last message. I noticed an improvement in stability; the nodes no longer fluctuate between online and offline status, which is good news.

The tailscale nodes were connected with the following parameters:

tailscale-tmp-ams3: tailscale up --login-server https://headscale-test.example.com --auth-key XXXX --accept-dns --accept-routes --shields-up=false --advertise-exit-node --advertise-routes=10.110.0.0/20

tailscale-tmp-test: tailscale up --login-server https://headscale-test.example.com --auth-key XXXX --accept-dns --accept-routes --shields-up=false --advertise-exit-node --advertise-routes=10.114.0.0/20

After enabling the routes, the routing table looked like this:

root@headscale-test-6-5c7d4f7d6b-hn8g5:/# headscale route list
ID | Node               | Prefix        | Advertised | Enabled | Primary
1  | tailscale-tmp-ams3 | ::/0          | true       | true    | -
2  | tailscale-tmp-ams3 | 10.110.0.0/20 | true       | true    | true
3  | tailscale-tmp-ams3 | 0.0.0.0/0     | true       | true    | -
4  | tailscale-tmp-test | 0.0.0.0/0     | true       | true    | -
5  | tailscale-tmp-test | ::/0          | true       | true    | -
6  | tailscale-tmp-test | 10.114.0.0/20 | true       | true    | true

Traffic from the 10.110.0.0/20 network successfully passed to the 10.114.0.0/20 network and back.

Next, I used the latest revision at the time, af3c097, to test subnet route failover. I added another tailscale node (tailscale-tmp2-fra1) with parameters similar to tailscale-tmp-test. After enabling the routes, I discovered a problem: there was no longer a Primary route to the 10.114.0.0/20 network:

2023-12-03T16:25:53Z:

root@headscale-test-6-f65f999c7-kgb9c:/# headscale node list
ID | Hostname            | Name                | MachineKey | NodeKey | User                 | IP addresses                  | Ephemeral | Last seen           | Expiration          | Connected | Expired
1  | tailscale-tmp-test  | tailscale-tmp-test  | [MHyGh]    | [hFr8i] | user.example.com     | 100.64.0.1, fd7a:115c:a1e0::1 | false     | 2023-12-03 16:25:44 | 0001-01-01 00:00:00 | online    | no
3  | tailscale-tmp2-fra1 | tailscale-tmp2-fra1 | [zLP+K]    | [fbCkD] | user-two.example.com | 100.64.0.3, fd7a:115c:a1e0::3 | false     | 2023-12-03 16:25:45 | 0001-01-01 00:00:00 | online    | no
2  | tailscale-tmp-ams3  | tailscale-tmp-ams3  | [1w0kC]    | [Fjkoy] | user.example.com     | 100.64.0.2, fd7a:115c:a1e0::2 | false     | 2023-12-03 16:25:45 | 0001-01-01 00:00:00 | online    | no

2023-12-03T16:26:03Z:

root@headscale-test-6-f65f999c7-kgb9c:/# headscale route list
ID | Node                | Prefix        | Advertised | Enabled | Primary
4  | tailscale-tmp-test  | 0.0.0.0/0     | true       | true    | -
5  | tailscale-tmp-test  | ::/0          | true       | true    | -
6  | tailscale-tmp-test  | 10.114.0.0/20 | true       | true    | false
7  | tailscale-tmp2-fra1 | 0.0.0.0/0     | true       | true    | -
8  | tailscale-tmp2-fra1 | ::/0          | true       | true    | -
9  | tailscale-tmp2-fra1 | 10.114.0.0/20 | true       | true    | false
1  | tailscale-tmp-ams3  | ::/0          | true       | true    | -
2  | tailscale-tmp-ams3  | 10.110.0.0/20 | true       | true    | true
3  | tailscale-tmp-ams3  | 0.0.0.0/0     | true       | true    | -

This is despite all three nodes being online.

@kradalby
Collaborator Author

kradalby commented Dec 6, 2023

I think this PR has grown a bit out of hand for the issue it was meant to address, so I propose the following:

It should now pass all the tests we have, so I think it should be reviewed and merged; then, when I get test feedback on #1561, I will continue to address it in new PRs.

As a result of this PR, I did notice two things:

  • It makes TestDERPServerScenario and TestSSHIsBlockedInACL more flaky, particularly on github actions, which I will file separate issues for to address
  • It made me discover that we do not propagate Expiry information correctly (see new commented out test in last commit)

The latter, I think, is a new issue and should also be addressed separately.

When this is merged, let us move all discussions of #1561 back to the issue.

@kradalby kradalby marked this pull request as ready for review December 6, 2023 10:42
@kradalby kradalby requested a review from juanfont as a code owner December 6, 2023 10:42
Signed-off-by: Kristoffer Dalby <[email protected]>
@@ -1,5 +1,6 @@
ignored/
tailscale/
.vscode/
Owner

You now use vscode??

Collaborator Author

Sometimes

@naimo84

naimo84 commented Dec 11, 2023

The new alpha2 version fixed the issue on my side 👍🥳 Thanks a lot for the amazing work 🥳
