[Bug] online status doesn't change if connection is interrupted #2129

Open
4 tasks done
moserpjm opened this issue Sep 12, 2024 · 11 comments · May be fixed by #2131
Labels
bug Something isn't working no-stale-bot
Milestone

Comments

@moserpjm

moserpjm commented Sep 12, 2024

Is this a support request?

  • This is not a support request

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

If the connection of a client is interrupted (pull the cable, disconnect from wifi), headscale never changes its status to offline.
The state changes to offline when I restart headscale or caddy (which terminates all connections).
I found the bug while working on an OPNSense plugin for Tailscale. It has CARP support which downs tailscale on the slave node. This should trigger a failover of the routes, but it didn't. During the failover the internet connection of the slave firewall gets interrupted for a few seconds. My CARP hook executed tailscale down in this time window. As a result, both firewalls show up as online and the routes do not fail over.

Expected Behavior

The node should go offline after some time.

Steps To Reproduce

Connect a device and interrupt its internet connection.

Environment

- OS: Flatcar Linux
- Headscale version: v0.23.0-beta.5
- Tailscale version: 1.72.1

I tested it with an Android Phone and my Linux machine. (Ubuntu 22.04)
The reverse proxy is Caddy with one line standard configuration reverse proxy setup. Headscale runs in a Docker container with sqlite DB.

Runtime environment

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

Anything else?

I can provide further logs and dumps if this problem does not appear in another setup.

@moserpjm moserpjm added the bug Something isn't working label Sep 12, 2024
@kradalby kradalby added this to the v0.23.0 milestone Sep 12, 2024
@kradalby
Collaborator

Can you please confirm this issue without the reverse proxy?

@moserpjm
Author

Yes I can but it's going to take some time to build a lab setup. BTW I checked the behaviour of the reverse proxy. It doesn't reuse connections.

@moserpjm
Author

OK. I did a quick and dirty setup. No fancy stuff. Ubuntu 24.04, deb package, no firewall, no proxy, no OIDC.
Connected my Android phone and disabled WiFi and 5G.

Those are the relevant lines in the log:

Sep 12 22:41:00 headscale-test headscale[1391]: 2024-09-12T22:41:00Z INF ../../../home/runner/work/headscale/headscale/hscontrol/auth.go:603 > Node successfully authorized node=localhost
Sep 12 22:41:00 headscale-test headscale[1391]: 2024-09-12T22:41:00Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:705 > node has connected, mapSession: 0xc0004a6300, chan: 0xc00029bce0 node=localhost node.id=1 omitPeers=false readOnly=false stream=true
Sep 12 22:59:32 headscale-test headscale[1391]: 2024-09-12T22:59:32Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:705 > node has disconnected, mapSession: 0xc0004a6300, chan: 0xc00029bce0 node=localhost node.id=1 omitPeers=false readOnly=false stream=true

It set the node offline after 18 minutes. I'm not sure if I waited that long on my fancy setup. Going to test it. Should this take so long? For subnet router HA this is a very long time.

I think it should be pretty easy for you to replicate this setup and attach a debugger. ;)
If you need a server I can quickly do some Ansible magic.

@moserpjm
Author

OK. Same behaviour on my fancy system. It took 16 minutes.

@kradalby
Collaborator

Thanks, I'll set up an integration test to reproduce this too. I had a few minutes yesterday and think I managed to see this with an RPi. I suspect there is something wrong with how the keepalive is sent, which should trigger offline by failing to send.

kradalby added a commit to kradalby/headscale that referenced this issue Sep 13, 2024
Signed-off-by: Kristoffer Dalby <[email protected]>
kradalby added a commit to kradalby/headscale that referenced this issue Sep 13, 2024
Signed-off-by: Kristoffer Dalby <[email protected]>
@kradalby
Collaborator

I have researched this issue, and I am starting to suspect that it has always been broken. I am seeing around 16m in my integration tests, and it seems to be out of Go's control.

So I found this blogpost that describes this behaviour, but from the client side.

I am unsure if this can be solved without implementing some other way to discover if the client is still online. I have an idea, but it requires quite some re-engineering and I think it will have to come in a later version.

Could you please try to see if this behaviour is present in v0.22.3 in the lab you set up?

@kradalby
Collaborator

I suspect what we need to do is figure out how we can use something like PingRequest (https://github.com/tailscale/tailscale/blob/main/tailcfg/tailcfg.go#L1663) to check if a node is there.

I'm going to try to confirm if this is new behaviour, and if it is not, I will say this is out of scope for 0.23.0, and move it to next.

@moserpjm
Author

I know this behaviour very well. Serial port libs also retry for ages. The solution is what JDBC connection pools have done for ages: if there's no traffic for x seconds, send a keepalive. If the max idle time is reached, kill the connection and hope that the underlying library and OS really close it.

@kradalby
Collaborator

The thing is that we send keepalives roughly every minute, but in Go, flushing these messages to a gone connection does not produce any errors.

So we would need a keepalive variant that calls back, which we do not have, and it will require some effort. Since this is likely the current behaviour (I'm trying to verify but have no lab yet, so please help), I will not hold up this release and will try to work it into a future one.

@kradalby
Collaborator

I've confirmed that this issue occurs in 0.22.3, so I will push this to next. We should definitely solve it, but it requires more thought than something we should add just before an upcoming release.

@kradalby kradalby modified the milestones: v0.23.0, Next Sep 13, 2024
@Zeashh

Zeashh commented Sep 16, 2024

I've noticed this issue when networks change, for instance on a phone when I change from cellular data to wifi and vice versa. After changing the network I have to reconnect.

4 participants
@kradalby @moserpjm @Zeashh and others