
Codereorg: Part 5, state #1492

Merged
merged 31 commits into from
Sep 19, 2023

Conversation

kradalby
Collaborator

@kradalby kradalby commented Jun 21, 2023

This PR reworks the state management in headscale. Instead of relying on timestamps, frequent checking, and sending the entire world every time, we now send incremental updates based on the change that has happened.

Closes #1390
Fixes #1461

@kradalby kradalby force-pushed the codereorg-p5-state branch 2 times, most recently from 631d1a8 to e4b0e82 Compare June 24, 2023 08:05
@kradalby kradalby marked this pull request as ready for review June 25, 2023 11:31
@kradalby kradalby requested a review from juanfont as a code owner June 25, 2023 11:31
@kradalby kradalby marked this pull request as draft June 25, 2023 12:14
@kradalby kradalby force-pushed the codereorg-p5-state branch 4 times, most recently from 32a0567 to b5be2ef Compare July 5, 2023 17:13
@kradalby
Collaborator Author

kradalby commented Jul 7, 2023

All tests pass locally; it is hard to know whether GitHub Actions is reliable or not.

@kradalby
Collaborator Author

kradalby commented Jul 7, 2023

My observation so far is that tests involving a larger number of nodes are less reliable, particularly those with heavier network requirements:

  • PingAll (the basis of the tests that typically fail)
  • Taildrop

kradalby and others added 5 commits September 11, 2023 05:08
Signed-off-by: Kristoffer Dalby <[email protected]>
This commit replaces the timestamp-based state system with a new
one that has update channels directly to the connected nodes. It
will send an update to all listening clients via the polling
mechanism.

It introduces a new package notifier, which has a concurrency safe
manager for all our channels to the connected nodes.

Signed-off-by: Kristoffer Dalby <[email protected]>
This commit extends the mapper with functions for creating "delta"
MapResponses for different purposes (peer changed, peer removed, derp).

This wires up the new state management with a new StateUpdate struct
letting the poll worker know what kind of update to send to the
connected nodes.

Signed-off-by: Kristoffer Dalby <[email protected]>
Co-Authored-By: Jason <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
@kradalby kradalby force-pushed the codereorg-p5-state branch 2 times, most recently from 93788d5 to 1d7b736 Compare September 11, 2023 10:59
kradalby and others added 5 commits September 11, 2023 06:02
Closes juanfont#1544

Co-Authored-By: Patrick Huang <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
This field is no longer used; it was used in our old state
"algorithm" to determine whether we should send an update.

Signed-off-by: Kristoffer Dalby <[email protected]>
This commit rearranges the poll handler to immediately accept
updates, notify the node's peers, and return, instead of travelling
further down the function. This reduces the DB calls and other
holdups that aren't necessary for sending a "lite response" (a
map response without peers) or accepting an endpoint update.

Signed-off-by: Kristoffer Dalby <[email protected]>
Signed-off-by: Kristoffer Dalby <[email protected]>
@kradalby kradalby marked this pull request as ready for review September 12, 2023 03:19
@vsychov
Contributor

vsychov commented Sep 15, 2023

Hello there,
Thank you for your hard work on headscale!

I started testing these changes and found a bug with the "on-the-fly" state update:

  1. Start headscale.
  2. Connect client (macos): /Applications/Tailscale.app/Contents/MacOS/Tailscale up --login-server=https://headscale-test.example.com/
  3. Connect headscale-test-1 (linux): tailscale login --login-server=https://headscale-test.example.com/ --advertise-exit-node
  4. Connect headscale-test-ams3-1 and headscale-test-ams3-2 (linux): tailscale up --login-server=https://headscale-test.example.com/ --advertise-routes=10.0.0.0/8 --ssh
  5. Approve the routes and verify:
ID | Machine               | Prefix     | Advertised | Enabled | Primary
1  | headscale-test-1      | 0.0.0.0/0  | true       | true    | -
2  | headscale-test-1      | ::/0       | true       | true    | -
4  | headscale-test-ams3-2 | 10.0.0.0/8 | true       | true    | true
3  | headscale-test-ams3-1 | 10.0.0.0/8 | true       | true    | false
  6. Try to ping:
/Applications/Tailscale.app/Contents/MacOS/Tailscale ping 10.110.0.53
2023/09/15 23:06:47 no matching peer
➜  ~ /Applications/Tailscale.app/Contents/MacOS/Tailscale status
100.64.0.1      node1   user.example.com macOS   -
100.64.0.2      headscale-test-1     user.example.com linux   -
100.64.0.3      headscale-test-ams3-1..example.com user.example.com linux   -
100.64.0.4      headscale-test-ams3-2..example.com user.example.com linux   -
  7. The exit node on headscale-test-1 does not appear, and the ping to 10.110.0.53 does not work (though it should work through headscale-test-ams3-1).
  8. The problem is resolved by restarting headscale, but the status does not update until the restart (I waited ~10 min before restarting headscale).

@vsychov
Contributor

vsychov commented Sep 15, 2023

It seems there is also an issue with updating the "last seen"/"online" status. I wanted to test the subnet-router failover: as soon as I executed tailscale down on headscale-test-ams3-2, I expected the traffic to switch to headscale-test-ams3-1. However, headscale-test-ams3-1 was unexpectedly marked as "offline":

root@headscale-f7489674b-4j5jf:/# date
Fri Sep 15 21:30:43 UTC 2023
root@headscale-f7489674b-4j5jf:/# headscale nodes list
ID | Hostname              | Name                  | MachineKey | NodeKey | User                 | IP addresses                  | Ephemeral | Last seen           | Expiration          | Online  | Expired
....
3  | headscale-test-ams3-1 | headscale-test-ams3-1 | [K4v5r]    | [LkHjz] | user.example.com | 100.64.0.3, fd7a:115c:a1e0::3 | false     | 2023-09-15 21:14:13 | 2023-09-16 16:58:33 | offline | no
....
root@headscale-f7489674b-4j5jf:/# 

Although there are no network connectivity issues, tailscale is working, and headscale is accessible via http:

root@headscale-test-ams3-1:~# systemctl status tailscaled.service
● tailscaled.service - Tailscale node agent
     Loaded: loaded (/lib/systemd/system/tailscaled.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2023-09-15 20:57:34 UTC; 32min ago
       Docs: https://tailscale.com/kb/
    Process: 8483 ExecStartPre=/usr/sbin/tailscaled --cleanup (code=exited, status=0/SUCCESS)
   Main PID: 8511 (tailscaled)
     Status: "Connected; user.example.com; 100.64.0.3 fd7a:115c:a1e0::3"
      Tasks: 8 (limit: 1116)
     Memory: 39.5M
        CPU: 1.760s
     CGroup: /system.slice/tailscaled.service
             └─8511 /usr/sbin/tailscaled --state=/var/lib/tailscale/tailscaled.state --socket=/run/tailscale/tailscaled.sock --port=41641
root@headscale-test-ams3-1:~# curl -I https://headscale-test.example.com/api/v1/machine
HTTP/2 401
date: Fri, 15 Sep 2023 21:30:33 GMT
content-type: text/plain; charset=utf-8
content-length: 12
strict-transport-security: max-age=15724800; includeSubDomains

root@headscale-test-ams3-1:~#

@kradalby
Collaborator Author

@vsychov Thank you, I appreciate this a lot. Since this PR is already so messy, I think what I will do is get it in; it would then be great if you could create two issues for me for this (and anything else you find), and we will make them blockers before the alpha.

I would also appreciate your continued testing of this work.

@juanfont
Owner

Approved!

changed := make(types.Machines, len(machineKeys))
lastSeen := make(map[tailcfg.NodeID]bool)
for idx, machineKey := range machineKeys {
peer, err := m.db.GetMachineByID(machineKey)
Owner


Mixing MachineKey with IDs of the DB?

}

func NewNotifier() *Notifier {
return &Notifier{}
}

-func (n *Notifier) AddNode(machineKey string, c chan<- struct{}) {
+func (n *Notifier) AddNode(machineKey string, c chan<- types.StateUpdate) {
Owner


Here seems to be the MachineKey of the Node.

Owner


We might have nodes with the same machinekey but different nodekeys, right?

@kradalby kradalby merged commit 096ac31 into juanfont:main Sep 19, 2023
44 checks passed
@kradalby kradalby deleted the codereorg-p5-state branch September 19, 2023 15:20
@kradalby
Collaborator Author

@vsychov did you try the tests with the route/exit node issue including the last commit? Would you mind opening an issue for that one?

I'll have a look at the online status now.

@vsychov
Contributor

vsychov commented Sep 26, 2023

@kradalby, I'll try my tests on the current main revision, and if this is still reproducible, I'll create issues for it.

@vsychov
Contributor

vsychov commented Sep 26, 2023

Problems still appear; I created issue #1561.

Successfully merging this pull request may close these issues.

Strange shutdown/restart behavior on 0.22.2
3 participants