Docker network connection time outs to host over time #8861

Open · 2 tasks done
rg9400 opened this issue Oct 6, 2020 · 70 comments

Comments

@rg9400 commented Oct 6, 2020

  • I have tried with the latest version of my channel (Stable or Edge)
  • I have uploaded Diagnostics
  • Diagnostics ID: 7E746511-651C-4A74-8C84-91189E8962C1/20201006161122

Expected behavior

I would expect services running inside Docker containers with the WSL 2 backend to be able to reliably communicate with applications running on the host, even with frequent polling.

Actual behavior

Due to #8590, I have to run some applications that require high download speeds on the host. I have multiple applications inside Docker containers running inside a Docker bridge network that poll this application every few seconds. When launching WSL, the applications are able to communicate reliably, but this connection deteriorates over time, and after 1-2 days, I notice frequent connection timed out responses from the application running on the host. Running wsl --shutdown and restarting the Docker daemon fixes the issue temporarily. Shifting applications out of Docker and onto the host fixes their communication issues as well. It may be related to the overall network issues linked above.

To be clear, it can still connect. It just starts timing out more and more often the longer the network/containers have been up.

Information

  • Windows Version: 2004 (OS Build 19041.508)
  • Docker Desktop Version: 2.4.1.0 (48583)
  • Are you running inside a virtualized Windows e.g. on a cloud server or on a mac VM: No

I have had this problem ever since starting to use Docker for Windows with the WSL2 backend.

Steps to reproduce the behavior

  1. Run an application on the Windows host. I tried with NZBGet (host ip: 192.168.1.2)
  2. Poll this application from within a Docker container inside a Docker bridge network living within WSL2. I polled 192.168.1.2:6789 every few seconds
  3. Check back in a day to see if the connection is timing out more frequently.
  4. Restart WSL/the Docker daemon and notice that the connection is suddenly more reliable, though it will begin to deteriorate again over time
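A minimal polling loop along these lines, run from inside any container on the bridge network, is enough to surface the timeouts once they start (the IP, port, and interval are just the values from the steps above):

# Poll the host-side service every 5 seconds and log any request
# that fails or takes longer than the 10-second cap.
while true; do
  curl --silent --output /dev/null --max-time 10 \
       --write-out '%{http_code} %{time_total}s\n' \
       http://192.168.1.2:6789/ \
    || echo "$(date '+%H:%M:%S') request timed out"
  sleep 5
done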
@rg9400 (Author) commented Oct 9, 2020

This seems to improve if you use the recommended host.docker.internal option instead of using the IP of the host machine directly
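In practice this just means pointing the container at the special DNS name rather than the LAN address, e.g.:

# Inside the container: use the gateway name Docker Desktop provides
curl http://host.docker.internal:6789/
# ...instead of the host's LAN address:
# curl http://192.168.1.2:6789/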

@rg9400 (Author) commented Oct 19, 2020

Further update on this. While the above does delay the deterioration, it still eventually happens. After 4-5 days, timeouts start occurring with increasing frequency, eventually reaching the point where almost every call times out, requiring a full restart of WSL and Docker to get things working again.

@markoueis commented Oct 21, 2020

We have the same issue

  1. Using 2.4.0.0
  2. We use host.docker.internal

We have a service running on the host.

If I try to hit host.docker.internal from within a Linux container, I can always get it to trip up eventually, after say 5,000 curl requests to http://host.docker.internal/service (it times out for one request).

If I try http://host.docker.internal/service from the host, it works flawlessly even after 10,000 curl requests.

Sometimes, intermittently, and we can't figure out why, it starts to fail much more frequently (maybe every 100 curl requests).

Something is up with the networking...

Here is a very simple test to show what's going on:
[animated GIF demonstrating the intermittent timeouts]
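A rough plain-shell equivalent of that test, for anyone who wants to reproduce it (the /service path is only a stand-in for whatever endpoint the host is actually serving):

# Fire 5000 requests at the host and count how many time out.
fail=0
for i in $(seq 1 5000); do
  curl --silent --output /dev/null --max-time 2 \
       http://host.docker.internal/service || fail=$((fail + 1))
done
echo "timed-out requests: $fail / 5000"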

@markoueis

In my limited testing, I created a loopback adapter, gave it the IP 10.0.75.2, and used that instead of the host IP. It's much more reliable. It's an ugly workaround, but it might at least help show where the issue lies.

@markoueis

Hey guys, this is still happening pretty consistently. Is anyone looking at the reliability/performance of these things? Is this the wrong place to post this?

@rg9400 (Author) commented Dec 23, 2020

I was able to send this via their support channel and have them reproduce the issue. They diagnosed the cause but said fixing it would involve some major refactoring, so they don't have a target fix date. Below is the issue as described by them:

I can reproduce the bug now. If I query the vpnkit diagnostics with this program https://github.com/djs55/bug-repros/tree/main/tools/vpnkit-diagnostics while the connection is stuck then I observe: (for my particular repro the port number was 51580. I discovered this using wireshark to explore the trace)

$ tcpdump -r capture\\all.pcap port 51580
15:57:03.021934 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195077730 ecr 0,nop,wscale 7], length 0
15:57:04.064094 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195078771 ecr 0,nop,wscale 7], length 0
15:57:06.111633 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195080819 ecr 0,nop,wscale 7], length 0
15:57:10.143908 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195084851 ecr 0,nop,wscale 7], length 0
15:57:18.464142 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195093171 ecr 0,nop,wscale 7], length 0
15:57:34.848536 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195109555 ecr 0,nop,wscale 7], length 0
15:58:07.103411 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195141811 ecr 0,nop,wscale 7], length 0

which is a stuck TCP handshake from the Linux point of view. The same thing is probably visible in a live trace from docker run -it --privileged --net=host djs55/tcpdump -n -i eth0.

Using sysinternals process explorer to examine the vpnkit.exe process, I only see 1 TCP connection at a time (although a larger than ideal number of UDP connections which are DNS-related I think). There's no sign of a resource leak.

When this manifests I can still establish other TCP connections and run the test again -- the impact seems limited to the 1 handshake failure.

The vpnkit diagnostics has a single TCP flow registered:

> cat .\flows
TCP 192.168.65.3:51580 > 192.168.65.2:6789 socket = open last_active_time = 1605023899.0

which means that vpnkit itself thinks the flow is connected, although the handshake never completed.
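For anyone who wants to watch the same symptom live rather than in the vpnkit diagnostics, a capture along these lines (using the djs55/tcpdump image mentioned above; 6789 is just the service port from this repro) should show the repeated SYNs with no SYN-ACK once a connection gets stuck:

# Run tcpdump in the VM's network namespace and show only bare SYNs;
# the same SYN retransmitted over and over means the handshake is stuck.
docker run -it --rm --privileged --net=host djs55/tcpdump \
  -n -i eth0 'tcp port 6789 and tcp[tcpflags] & (tcp-syn|tcp-ack) = tcp-syn'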

@markoueis

Woah, thanks for this update @rg9400. Glad you got it on their radar. So your workaround is to restart Docker and run wsl --shutdown? I've been trying to use another IP (a loopback adapter) instead of host.docker.internal or whatever host.docker.internal points to. But I'm not 100% sure that solves the problem permanently. Maybe it's just a new IP, so it will work for a while and then deteriorate again over time. Based on your explanation of the root cause, that might indeed be the case.

@rg9400 (Author) commented Dec 23, 2020

Yeah, for now I am just living with it and restarting WSL/Docker every now and then when the connection timeouts become too frequent and unbearable.

@markoueis

What can we do to get this worked on? Is there work happening on it, or a ticket we can follow? This still bugs us quite consistently.

@markoueis commented May 19, 2021

I want to keep this thread alive, as this is a massive pain for folks, especially because they don't know it's happening. This needs to become more reliable.

Here is a newer diagnostic id: F4D29FA0-6778-40B8-B312-BADEA278BB3B/20210521171355

Also, I discovered that just killing vpnkit.exe in Task Manager reduces the problem. It restarts almost instantly, and connections resume much more reliably without having to restart containers or anything. But the problem eventually recurs.
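For anyone who wants to script that stopgap instead of reaching for Task Manager, the equivalent is simply (assuming, as described above, that Docker Desktop respawns vpnkit on its own):

# From an elevated PowerShell or Command Prompt on the Windows host
taskkill /F /IM vpnkit.exe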

@stormmuller

We have about 15 services in our docker-compose file, and all of them do an npm install. A cacheless build is impossible because it tries to build all the services at once, and the npm install steps time out because downloading that many packages at once kills the bandwidth.

I'm not using the --parallel flag
I've set the following environment variables:

  • COMPOSE_HTTP_TIMEOUT=240
  • COMPOSE_PARALLEL_LIMIT=2

But none of this seems to change the behavior.
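One untested idea, if the problem really is bandwidth saturation from parallel builds: force the services to build strictly one at a time rather than letting Compose build the whole file in one go, for example:

# Build each service listed in the compose file sequentially.
for svc in $(docker-compose config --services); do
  docker-compose build "$svc"
done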

@bradleyayers

This happens on macOS too, in fact quite reliably, after ~7 minutes and ~13,000 requests hitting an HTTP server:

Server:

$ python3 -mhttp.server 8015

Client (siege):

$ cat <<EOF > siegerc
timeout = 1
failures = 1
EOF
$ docker run --rm -v $(pwd)/siegerc:/tmp/siegerc -t funkygibbon/siege --rc=/tmp/siegerc -t2000s -c2 -d0.1 http://host.docker.internal:8015/api/foo

Output:

New configuration template added to /root/.siege
Run siege -C to view the current settings in that file
** SIEGE 4.0.4
** Preparing 2 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(1) sock.c:240: Connection timed out
siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc

Transactions:		       13949 hits
Availability:		       99.99 %
Elapsed time:		      378.89 secs
Data transferred:	        6.24 MB
Response time:		        0.00 secs
Transaction rate:	       36.82 trans/sec
Throughput:		        0.02 MB/sec
Concurrency:		        0.10
Successful transactions:           0
Failed transactions:	           1
Longest transaction:	        0.05
Shortest transaction:	        0.00

What's interesting is that it gets progressively worse from there: the timeouts happen more and more frequently. Restarting the HTTP server doesn't help, but restarting it on another port does (e.g. from 8019 -> 8020). From there you get another 7 minutes of 100% success before it starts degrading again.

I tried adding an IP alias to my loopback adapter and hitting that instead of host.docker.internal but it had the same behavior (i.e. degraded after 7 minutes). The same goes for using the IP (192.168.65.2) and skipping the DNS resolution.

@rg9400 (Author) commented Oct 14, 2021

This issue remains unresolved. The devs indicated it required major rework, but I haven't heard back from them in 6 months on the progress.

@docker-robott (Collaborator)

Issues go stale after 90 days of inactivity.
Mark the issue as fresh with /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle stale

@rg9400 (Author) commented Jan 12, 2022

/remove-lifecycle stale

@zadirion

I am also affected by this issue. I thought at one point it was because of TCP keepalive on sockets, with sockets not being closed as fast as they were opened, thus exhausting the maximum number of available sockets. But the problem doesn't go away even if my containers stop opening connections for a while; only a restart of Docker and WSL seems to fix this.
This issue should be high priority...
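A quick way to rule the socket-exhaustion theory in or out is to look at the socket counters inside the Docker Desktop VM; nicolaka/netshoot is just a convenient image that ships ss, and any image with iproute2 would do:

# Summarize socket states (look for very large TIME-WAIT / orphan counts)
docker run --rm --net=host nicolaka/netshoot ss -s

# Show the ephemeral port range available for outbound connections
docker run --rm --net=host nicolaka/netshoot cat /proc/sys/net/ipv4/ip_local_port_range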

@artzwinger

I cannot connect from a container to a host port even using telnet.
Network mode is bridge, which is default, but "host" mode also doesn't work.

I tried to guess the host IP, and I also tried this:

extra_hosts:
  - "host.docker.internal:host-gateway"

Neither option worked.

Telnet connection from host machine to this host port does work well.

It was working fine in previous Docker versions! It seems to have broken with some update, maybe from 2021-2022.

@artzwinger

Update: it was my Ubuntu UFW firewall that was blocking containers from connecting to host ports.
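For anyone else hitting the UFW variant of this, allowing traffic from the Docker bridge is usually enough (172.17.0.0/16 is the default bridge subnet; adjust it if you use custom networks):

# Allow containers on the default bridge network to reach host ports
sudo ufw allow from 172.17.0.0/16

# ...or allow anything arriving on the docker0 bridge interface
sudo ufw allow in on docker0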

@raarts commented May 4, 2022

Having this exact problem on MacOS. Restarting Docker fixes the problem (for a while).

@bernhof commented May 5, 2022

We have reports of this occurring across teams on Windows and macOS as well. We have no reports of this issue occurring on Linux.

Someone noticed that on macOS, simply waiting ~15mins often alleviates the problem.

@metacity commented May 11, 2022

We're also experiencing this (using host.docker.internal) on Docker Desktop for Windows. Strangely enough, Docker versions up to 4.5.1 seem to work fine, but versions 4.6.x and 4.7.x instantly bring up the problem. Connections work for some time, but then the timeouts start. All checks run by "C:\Program Files\Docker\Docker\resources\com.docker.diagnose.exe" pass.

@RomanShumkov

I'm experiencing the same problem, with an increasing number of timeouts over time, while using host.docker.internal.

@stamosv commented May 30, 2022

I'm also experiencing the same problem. Downgrading to 4.5.1 appears to solve the issue.

@gregfrog commented Dec 8, 2022

Thanks for the update, and I will file a support request. Apart from anything else, their answer ignores that this is an issue for users who aren't on Windows. Just restarting Docker Desktop all the time isn't an acceptable workaround IMO.

I suspect I am running into this at the moment. If having to restart the VM that Docker runs in, which is in essence a reboot, is not a blocker, what is? Hardware damage?

@tristanbrown

This is absolutely a blocker for me, as I cannot run scheduled tasks reliably.

@roele commented Feb 18, 2023

The following workaround resolved the issue for me
https://emerle.dev/2022/05/06/the-nasty-gotcha-in-docker/

@acedanger

The following workaround resolved the issue for me
https://emerle.dev/2022/05/06/the-nasty-gotcha-in-docker/

Adding an archive in case the post or site goes down.

https://archive.ph/fk6dC
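For reference, the workaround in that post boils down to setting vpnKitMaxPortIdleTime to 0 in Docker Desktop's settings.json and then restarting Docker Desktop. Roughly, on macOS (on Windows the file should live at %APPDATA%\Docker\settings.json; paths can vary by version):

SETTINGS="$HOME/Library/Group Containers/group.com.docker/settings.json"

# Set the vpnkit port idle timeout to 0, which reportedly disables it,
# then restart Docker Desktop for the change to take effect.
jq '.vpnKitMaxPortIdleTime = 0' "$SETTINGS" > "$SETTINGS.tmp" && mv "$SETTINGS.tmp" "$SETTINGS"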

@nk9 commented Feb 20, 2023

While this is useful information, I am not sure that it's actually related to this bug. The error described in the post is "Connection reset by peer." However, the problem in this issue is "Connection timed out." The exact error may differ depending on which software you're using, but the key thing is that you send packets that just never arrive. The connection isn't reset, it just stops moving data and effectively becomes /dev/null.

There are reproduction steps here, and I'm happy to be proven wrong. If someone can run the Python reproduction above and confirm that the problem doesn't occur on recent versions of Docker Desktop with the idle time set to 0, then I'll stand corrected. But @rg9400 spoke with Docker themselves, who acknowledged the problem and said they didn't have a fix. If the solution was as easy as changing vpnKitMaxPortIdleTime, surely they would have mentioned that.

If you would like changes in the behavior of vpnKitMaxPortIdleTime, I suggest you open a different issue.

@robertnisipeanu

I also replied a few months ago with that fix, and my problem was a connection timeout for an nginx reverse proxy and the PING command, not a connection reset.

@tristanbrown

I'm thinking this is a port saturation issue, similar to what's described here. I recently restarted my Docker service, but once the problem crops up again, I'll try going through some of these troubleshooting steps.

@BenjaminPelletier

I'm about 90% sure this issue applies to me as well, but it's devilishly difficult to tell for sure. I'll refer to a tool for reproduction that I wrote in my observations below:

  1. The issue appears to happen about once every 10^1 continuous integration invocations on a project I work on, and each continuous integration run probably has 10^3-10^4 HTTP requests sent between containers on the same GitHub Actions Linux cloud VM
  2. The issue also happens on my development machine, a laptop with MacOS Ventura 13.2.1
  3. All requests I have observed this issue with have been addressed to host.docker.internal, perhaps mainly because nearly all of my requests are addressed there, but while troubleshooting I was unable to reproduce it when sending requests to an IP (using Docker's default bridge network) or to a service name (using a custom bridge network created for the purpose) -- see the reproduction repo for more notes.
  4. The rate of occurrence varies a lot, and not according to any pattern I've been able to identify. The past week, I've had a connection timeout within 10^1-10^2 requests on my development machine with that rate persisting through a laptop reboot. After creating a Docker network to (unsuccessfully) attempt reproduction with containers communicating through that network, not only did the issue not occur using the custom bridge network, but the issue also vanished entirely -- my 100% reliable method of reproduction went to 0%.
  5. The issue does not depend on long handlers; I could reliably (at a ~10% rate) reproduce the issue sending queries to an unconfigured nginx container
  6. The issue does not depend on long timeouts; my simple reproduction used 5-second timeouts
  7. The issue does not depend on a long-running container; I could rm -f the client+server containers, start a new client container with a slightly different image, and have the issue reproducing within the first 100 requests at one time on my laptop
  8. The issue does not depend on external network traversal; all my observations have been for requests between containers on the same system using host.docker.internal.

@mirrorspock

We are running Docker version 20.10.22, build 3a2c30b, on Ubuntu 22.04.2 LTS and are experiencing the same issue.

We are running a Node-RED flow which queries an MSSQL server every 5 minutes, and randomly the connection to the SQL server gets a 30,000 ms timeout; the next attempt will be successful.

@tutcugil

We are experiencing the same issue: almost every 10 minutes, SQL queries from our containers get slower, then it resolves until the next 10-minute period.

Docker Desktop version v4.17.0
Windows Server 2022 - WSL2 1.0.3.0 backend

Is there any update on this?

@rhux commented Apr 12, 2023

The following workaround resolved the issue for me https://emerle.dev/2022/05/06/the-nasty-gotcha-in-docker/

I had also been experiencing this for several months. Doing this workaround appears to have fixed the issue.

@ganeshkrishnan1

Got this issue with Windows 11 on WSL and Docker version 23.0.3, build 3e7cbfd.

We are running this on a server, so this error becomes untenable.

@nk9 commented Apr 25, 2023

Please note that an experimental build of vpnkit has been released in this parallel issue, which attempts to resolve what may be the underlying problem here. Users experiencing this should install the experimental builds if possible and report back to @djs55 in the vpnkit issue on whether the problem is resolved and whether you notice any side effects.

@rg9400 (Author) commented Apr 25, 2023

Per my testing of the experimental build, the issue is significantly improved but not resolved. There are still timeouts, just far fewer. When running thousands of curls, I still notice stuck handshakes that don't close instantly but take a minute or two to resolve. The difference is that most such instances do clear out before the timeout.

I just wanted to confirm that connections still get stuck, even if the overall symptoms are a lot better.

@Junto026 commented Dec 15, 2023

I believe I am facing this same problem on MacOS Sonoma 14.1.1, running Docker Desktop for Mac (Apple Silicon) 4.25.2.

I want to try downgrading to 4.5.0 (it's insane that the issue has been going on that long). Does anybody have an install file? The oldest available here is 4.9.1.

EDIT: Docker Desktop for MacOS (Apple Silicon) can be downloaded here.

EDIT2: Confirmed, downgrading fixed the issue. I’ve been running with stable connections for weeks now.

@sorcer1122 commented Jul 7, 2024

Facing the same issue on Debian 12. I checked the ufw logs and whitelisted the container's IP address with sudo ufw allow from 172.17.0.2; this fixed it.

@kierankhan commented Jul 29, 2024

Pretty stuck on this, as I am not using Docker Desktop, only Docker Engine on Ubuntu. Reverting to 4.5.0 (Docker Engine 20.10.12) breaks everything, so if anyone has other workarounds, let me know.

@Junto026

Pretty stuck on this, as I am not using Docker Desktop, only Docker Engine on Ubuntu. Reverting to 4.5.0 (Docker Engine 20.10.12) breaks everything, so if anyone has other workarounds, let me know.

If you’re running on Linux I don’t think you’ll experience this exact issue. It seems to only happen when running Docker on MacOS or Windows.

@ashwinrayaprolu

I'm facing this issue too.
While not on Windows, I can pretty much replicate this issue on Linux.

seq 1 100000 | xargs -Iname -n1 -P200 curl --write-out ' %{http_code} , %{time_total}s \n' --silent --output /dev/null "http://10.98.160.112/health" | awk '{ if ( $3+0 > 0.06 ) print $1, $3}'

I just used curl's parallel request feature to test the response code and time taken.
Response times start going up.

[screenshot of the command output]

@phygineer commented Aug 27, 2024

Same observations on Mac; still not fixed.

It took me a while to realize this issue was happening.

[two screenshots attached]
