Docker network connection time outs to host over time #8861

Open · 2 tasks done
rg9400 opened this issue Oct 6, 2020 · 70 comments

Comments

@rg9400 commented Oct 6, 2020

  • I have tried with the latest version of my channel (Stable or Edge)
  • I have uploaded Diagnostics
  • Diagnostics ID: 7E746511-651C-4A74-8C84-91189E8962C1/20201006161122

Expected behavior

I would expect services running inside Docker containers with the WSL 2 backend to be able to reliably communicate with applications running on the host, even with frequent polling.

Actual behavior

Due to #8590, I have to run some applications that require high download speeds on the host. I have multiple applications inside Docker containers running inside a Docker bridge network that poll this application every few seconds. When launching WSL, the applications are able to communicate reliably, but this connection deteriorates over time, and after 1-2 days, I notice frequent connection timed out responses from the application running on the host. Running wsl --shutdown and restarting the Docker daemon fixes the issue temporarily. Shifting applications out of Docker and onto the host fixes their communication issues as well. It may be related to the overall network issues linked above.

To be clear, it can still connect. It just starts timing out more and more often the longer the network/containers have been up.

Information

  • Windows Version: 2004 (OS Build 19041.508)
  • Docker Desktop Version: 2.4.1.0 (48583)
  • Are you running inside a virtualized Windows e.g. on a cloud server or on a mac VM: No

I have had this problem ever since starting to use Docker for Windows with the WSL2 backend.

Steps to reproduce the behavior

  1. Run an application on the Windows host. I tried with NZBGet (host ip: 192.168.1.2)
  2. Poll this application from within a Docker container inside a Docker bridge network living within WSL2. I polled 192.168.1.2:6789 every few seconds
  3. Check back in a day to see if the connection is timing out more frequently.
  4. Restart WSL/the Docker daemon and notice that the connection is suddenly more reliable, though it will begin to deteriorate again over time
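A minimal polling loop along these lines, run from inside any container on the bridge network, is enough to surface the timeouts once they start (the IP, port, and interval are just the values from the steps above):

# Poll the host-side service every 5 seconds and log any request
# that fails or takes longer than the 10-second cap.
while true; do
  curl --silent --output /dev/null --max-time 10 \
       --write-out '%{http_code} %{time_total}s\n' \
       http://192.168.1.2:6789/ \
    || echo "$(date '+%H:%M:%S') request timed out"
  sleep 5
done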
@rg9400 (Author) commented Oct 9, 2020

This seems to improve if you use the recommended host.docker.internal option instead of using the IP of the host machine directly
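In practice this just means pointing the container at the special DNS name rather than the LAN address, e.g.:

# Inside the container: use the gateway name Docker Desktop provides
curl http://host.docker.internal:6789/
# ...instead of the host's LAN address:
# curl http://192.168.1.2:6789/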

@rg9400 (Author) commented Oct 19, 2020

Further update on this. While the above does delay the deterioration, it still eventually happens. After 4-5 days, timeouts start occurring with increasing frequency, eventually reaching the point where almost every call times out, requiring a full restart of WSL and Docker to get things working again.

@markoueis commented Oct 21, 2020

We have the same issue

  1. Using 2.4.0.0
  2. We use host.docker.internal

We have a service running on the host.

If I try to hit host.docker.internal from within a Linux container, I can always get it to trip up eventually, after say 5,000 curl requests to http://host.docker.internal/service (it times out for one request).

If I try http://host.docker.internal/service from the host, it works flawlessly even after 10,000 curl requests.

Sometimes, intermittently, and we can't figure out why, it starts to fail much more frequently (maybe every 100 curl requests).

Something is up with the networking...

Here is a very simple test to show what's going on:
[animated GIF demonstrating the intermittent timeouts]
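A rough plain-shell equivalent of that test, for anyone who wants to reproduce it (the /service path is only a stand-in for whatever endpoint the host is actually serving):

# Fire 5000 requests at the host and count how many time out.
fail=0
for i in $(seq 1 5000); do
  curl --silent --output /dev/null --max-time 2 \
       http://host.docker.internal/service || fail=$((fail + 1))
done
echo "timed-out requests: $fail / 5000"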

@markoueis

In my limited testing, I created a loopback adapter, gave it the IP 10.0.75.2, and used that instead of the host IP. It's much more reliable. It's an ugly workaround, but it might at least help show where the issue lies.

@markoueis

Hey guys, this is still happening pretty consistently. Is anyone looking at the reliability/performance of these things? Is this the wrong place to post this?

@rg9400 (Author) commented Dec 23, 2020

I was able to send this via their support channel and have them reproduce the issue. They diagnosed the cause but said fixing it would involve some major refactoring, so they don't have a target fix date. Below is the issue as described by them:

I can reproduce the bug now. If I query the vpnkit diagnostics with this program https://github.com/djs55/bug-repros/tree/main/tools/vpnkit-diagnostics while the connection is stuck then I observe: (for my particular repro the port number was 51580. I discovered this using wireshark to explore the trace)

$ tcpdump -r capture\\all.pcap port 51580
15:57:03.021934 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195077730 ecr 0,nop,wscale 7], length 0
15:57:04.064094 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195078771 ecr 0,nop,wscale 7], length 0
15:57:06.111633 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195080819 ecr 0,nop,wscale 7], length 0
15:57:10.143908 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195084851 ecr 0,nop,wscale 7], length 0
15:57:18.464142 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195093171 ecr 0,nop,wscale 7], length 0
15:57:34.848536 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195109555 ecr 0,nop,wscale 7], length 0
15:58:07.103411 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195141811 ecr 0,nop,wscale 7], length 0

which is a stuck TCP handshake from the Linux point of view. The same thing is probably visible in a live trace from docker run -it --privileged --net=host djs55/tcpdump -n -i eth0.

Using sysinternals process explorer to examine the vpnkit.exe process, I only see 1 TCP connection at a time (although a larger than ideal number of UDP connections which are DNS-related I think). There's no sign of a resource leak.

When this manifests I can still establish other TCP connections and run the test again -- the impact seems limited to the 1 handshake failure.

The vpnkit diagnostics has a single TCP flow registered:

> cat .\flows
TCP 192.168.65.3:51580 > 192.168.65.2:6789 socket = open last_active_time = 1605023899.0

which means that vpnkit itself thinks the flow is connected, although the handshake never completed.
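For anyone who wants to watch the same symptom live rather than in the vpnkit diagnostics, a capture along these lines (using the djs55/tcpdump image mentioned above; 6789 is just the service port from this repro) should show the repeated SYNs with no SYN-ACK once a connection gets stuck:

# Run tcpdump in the VM's network namespace and show only bare SYNs;
# the same SYN retransmitted over and over means the handshake is stuck.
docker run -it --rm --privileged --net=host djs55/tcpdump \
  -n -i eth0 'tcp port 6789 and tcp[tcpflags] & (tcp-syn|tcp-ack) = tcp-syn'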

@markoueis

Woah, thanks for this update @rg9400. Glad you got it on their radar. So your workaround is to restart Docker and run wsl --shutdown? I've been trying to use another IP (a loopback adapter) instead of host.docker.internal or whatever host.docker.internal points to. But I'm not 100% sure that solves the problem permanently. Maybe it's just a new IP, so it will work for a while and then deteriorate again over time. Based on your explanation of the root cause, that might indeed be the case.

@rg9400 (Author) commented Dec 23, 2020

Yeah, for now I am just living with it and restarting WSL/Docker every now and then when the connection timeouts become too frequent and unbearable.

@markoueis

What can we do to get this worked on? Is there work happening on it, or a ticket we can follow? This still bugs us quite consistently.

@markoueis commented May 19, 2021

I want to keep this thread alive, as this is a massive pain for folks, especially because they don't know it's happening. This needs to become more reliable.

Here is a newer diagnostic id: F4D29FA0-6778-40B8-B312-BADEA278BB3B/20210521171355

Also, I discovered that just killing vpnkit.exe in Task Manager reduces the problem. It restarts almost instantly, and connections resume much more reliably without having to restart containers or anything. But the problem eventually recurs.
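For anyone who wants to script that stopgap instead of reaching for Task Manager, the equivalent is simply (assuming, as described above, that Docker Desktop respawns vpnkit on its own):

# From an elevated PowerShell or Command Prompt on the Windows host
taskkill /F /IM vpnkit.exe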

@stormmuller

We have about 15 services in our docker-compose file, and all of them do an npm install. A cacheless build is impossible because it tries to build all the services at once, and the npm install steps time out because downloading that many packages at once kills the bandwidth.

I'm not using the --parallel flag
I've set the following environment variables:

  • COMPOSE_HTTP_TIMEOUT=240
  • COMPOSE_PARALLEL_LIMIT=2

But none of this seems to change the behavior.
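One untested idea, if the problem really is bandwidth saturation from parallel builds: force the services to build strictly one at a time rather than letting Compose build the whole file in one go, for example:

# Build each service listed in the compose file sequentially.
for svc in $(docker-compose config --services); do
  docker-compose build "$svc"
done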

@bradleyayers

This happens on macOS too, in fact quite reliably, after ~7 minutes and ~13,000 requests hitting an HTTP server:

Server:

$ python3 -mhttp.server 8015

Client (siege):

$ cat <<EOF > siegerc
timeout = 1
failures = 1
EOF
$ docker run --rm -v $(pwd)/siegerc:/tmp/siegerc -t funkygibbon/siege --rc=/tmp/siegerc -t2000s -c2 -d0.1 http://host.docker.internal:8015/api/foo

Output:

New configuration template added to /root/.siege
Run siege -C to view the current settings in that file
** SIEGE 4.0.4
** Preparing 2 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(1) sock.c:240: Connection timed out
siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc

Transactions:		       13949 hits
Availability:		       99.99 %
Elapsed time:		      378.89 secs
Data transferred:	        6.24 MB
Response time:		        0.00 secs
Transaction rate:	       36.82 trans/sec
Throughput:		        0.02 MB/sec
Concurrency:		        0.10
Successful transactions:           0
Failed transactions:	           1
Longest transaction:	        0.05
Shortest transaction:	        0.00

What's interesting is that it gets progressively worse from there: the timeouts happen more and more frequently. Restarting the HTTP server doesn't help, but restarting it on another port does (e.g. from 8019 -> 8020). From there you get another 7 minutes of 100% success before it starts degrading again.

I tried adding an IP alias to my loopback adapter and hitting that instead of host.docker.internal but it had the same behavior (i.e. degraded after 7 minutes). The same goes for using the IP (192.168.65.2) and skipping the DNS resolution.

@rg9400 (Author) commented Oct 14, 2021

This issue remains unresolved. The devs indicated it required major rework, but I haven't heard back from them in 6 months on the progress.

@docker-robott (Collaborator)

Issues go stale after 90 days of inactivity.
Mark the issue as fresh with /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30 days of inactivity.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so.

Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows.
/lifecycle stale

@rg9400 (Author) commented Jan 12, 2022

/remove-lifecycle stale

@zadirion

I am also affected by this issue. I thought at one point it was because of TCP keepalive on sockets, with sockets not being closed as fast as they were opened, thus exhausting the maximum number of available sockets. But the problem doesn't go away even if my containers stop opening connections for a while; only a restart of Docker and WSL seems to fix this.
This issue should be high priority...
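A quick way to rule the socket-exhaustion theory in or out is to look at the socket counters inside the Docker Desktop VM; nicolaka/netshoot is just a convenient image that ships ss, and any image with iproute2 would do:

# Summarize socket states (look for very large TIME-WAIT / orphan counts)
docker run --rm --net=host nicolaka/netshoot ss -s

# Show the ephemeral port range available for outbound connections
docker run --rm --net=host nicolaka/netshoot cat /proc/sys/net/ipv4/ip_local_port_range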

@artzwinger

I cannot connect from a container to a host port even using telnet.
Network mode is bridge, which is default, but "host" mode also doesn't work.

I tried to guess the host IP, and I also tried this:

extra_hosts:
  - "host.docker.internal:host-gateway"

Neither option worked.

Telnet connection from host machine to this host port does work well.

It was working fine in previous Docker versions! It seems to have broken with some update, maybe from 2021-2022.

@artzwinger

Update: it was my Ubuntu UFW firewall that was blocking containers from connecting to host ports.
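For anyone else hitting the UFW variant of this, allowing traffic from the Docker bridge is usually enough (172.17.0.0/16 is the default bridge subnet; adjust it if you use custom networks):

# Allow containers on the default bridge network to reach host ports
sudo ufw allow from 172.17.0.0/16

# ...or allow anything arriving on the docker0 bridge interface
sudo ufw allow in on docker0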

@raarts commented May 4, 2022

Having this exact problem on MacOS. Restarting Docker fixes the problem (for a while).

@bernhof commented May 5, 2022

We have reports of this occurring across teams on Windows and macOS as well. We have no reports of this issue occurring on Linux.

Someone noticed that on macOS, simply waiting ~15mins often alleviates the problem.

@metacity commented May 11, 2022

We're also experiencing this (using host.docker.internal) on Docker Desktop for Windows. Strangely enough, Docker versions up to 4.5.1 seem to work fine, but versions 4.6.x and 4.7.x instantly bring up the problem. Connections work for some time, but then the timeouts start. All checks run by "C:\Program Files\Docker\Docker\resources\com.docker.diagnose.exe" pass.

@RomanShumkov

I'm experiencing the same problem, with an increasing number of timeouts over time, while using host.docker.internal.

@stamosv commented May 30, 2022

I'm also experiencing the same problem. Downgrading to 4.5.1 appears to solve the issue.

@gregfrog commented Dec 8, 2022

Thanks for the update, and I will file a support request. Apart from anything else, their answer ignores that this is an issue for users who aren't on Windows. Just restarting Docker Desktop all the time isn't an acceptable workaround IMO.

I suspect I am running into this at the moment. If having to restart the VM that Docker runs in, which is in essence a reboot, is not a blocker, what is? Hardware damage?

@tristanbrown

This is absolutely a blocker for me, as I cannot run scheduled tasks reliably.

@roele commented Feb 18, 2023

The following workaround resolved the issue for me
https://emerle.dev/2022/05/06/the-nasty-gotcha-in-docker/

@acedanger

The following workaround resolved the issue for me
https://emerle.dev/2022/05/06/the-nasty-gotcha-in-docker/

Adding an archive in case the post or site goes down.

https://archive.ph/fk6dC
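For reference, the workaround in that post boils down to setting vpnKitMaxPortIdleTime to 0 in Docker Desktop's settings.json and then restarting Docker Desktop. Roughly, on macOS (on Windows the file should live at %APPDATA%\Docker\settings.json; paths can vary by version):

SETTINGS="$HOME/Library/Group Containers/group.com.docker/settings.json"

# Set the vpnkit port idle timeout to 0, which reportedly disables it,
# then restart Docker Desktop for the change to take effect.
jq '.vpnKitMaxPortIdleTime = 0' "$SETTINGS" > "$SETTINGS.tmp" && mv "$SETTINGS.tmp" "$SETTINGS"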

@nk9 commented Feb 20, 2023

While this is useful information, I am not sure that it's actually related to this bug. The error described in the post is "Connection reset by peer." However, the problem in this issue is "Connection timed out." The exact error may differ depending on which software you're using, but the key thing is that you send packets that just never arrive. The connection isn't reset, it just stops moving data and effectively becomes /dev/null.

There are reproduction steps here, and I'm happy to be proven wrong. If someone can run the Python reproduction above and confirm that the problem doesn't occur on recent versions of Docker Desktop with the idle time set to 0, then I'll stand corrected. But @rg9400 spoke with Docker themselves, who acknowledged the problem and said they didn't have a fix. If the solution was as easy as changing vpnKitMaxPortIdleTime, surely they would have mentioned that.

If you would like changes in the behavior of vpnKitMaxPortIdleTime, I suggest you open a different issue.

@robertnisipeanu

I also replied a few months ago with that fix, and my problem was a connection timeout for an nginx reverse proxy and the PING command, not a connection reset.

@tristanbrown

I'm thinking this is a port saturation issue, similar to what's described here. I recently restarted my Docker service, but once the problem crops up again, I'll try going through some of these troubleshooting steps.

@BenjaminPelletier

I'm about 90% sure this issue applies to me as well, but it's devilishly difficult to tell for sure. I'll refer to a tool for reproduction that I wrote in my observations below:

  1. The issue appears to happen about once every 10^1 continuous integration invocations on a project I work on, and each continuous integration run probably has 10^3-10^4 HTTP requests sent between containers on the same GitHub Actions Linux cloud VM
  2. The issue also happens on my development machine, a laptop with MacOS Ventura 13.2.1
  3. All requests I have observed this issue with have been addressed to host.docker.internal, perhaps mainly because nearly all of my requests are addressed there, but while troubleshooting I was unable to reproduce it when sending requests to an IP (using Docker's default bridge network) or to a service name (using a custom bridge network created for the purpose) -- see the reproduction repo for more notes.
  4. The rate of occurrence varies a lot, and not according to any pattern I've been able to identify. The past week, I've had a connection timeout within 10^1-10^2 requests on my development machine with that rate persisting through a laptop reboot. After creating a Docker network to (unsuccessfully) attempt reproduction with containers communicating through that network, not only did the issue not occur using the custom bridge network, but the issue also vanished entirely -- my 100% reliable method of reproduction went to 0%.
  5. The issue does not depend on long handlers; I could reliably (at a ~10% rate) reproduce the issue sending queries to an unconfigured nginx container
  6. The issue does not depend on long timeouts; my simple reproduction used 5-second timeouts
  7. The issue does not depend on a long-running container; I could rm -f the client+server containers, start a new client container with a slightly different image, and have the issue reproducing within the first 100 requests at one time on my laptop
  8. The issue does not depend on external network traversal; all my observations have been for requests between containers on the same system using host.docker.internal.

@mirrorspock

We are running Docker version 20.10.22, build 3a2c30b, on Ubuntu 22.04.2 LTS and are experiencing the same issue.

We are running a Node-RED flow which queries an MSSQL server every 5 minutes, and randomly the connection to the SQL server gets a 30,000 ms timeout; the next attempt will be successful.

@tutcugil

We are experiencing the same issue: almost every 10 minutes, SQL queries from our containers get slower, then it resolves until the next 10-minute period.

Docker Desktop version v4.17.0
Windows Server 2022 - WSL2 1.0.3.0 backend

Is there any update on this?

@rhux commented Apr 12, 2023

The following workaround resolved the issue for me https://emerle.dev/2022/05/06/the-nasty-gotcha-in-docker/

I had also been experiencing this for several months. Doing this workaround appears to have fixed the issue.

@ganeshkrishnan1

Got this issue with Windows 11 on WSL and Docker version 23.0.3, build 3e7cbfd.

We are running this on a server, so this error becomes untenable.

@nk9 commented Apr 25, 2023

Please note that an experimental build of vpnkit has been released in this parallel issue, which attempts to resolve what may be the underlying problem here. Users experiencing this should install the experimental builds if possible and report back to @djs55 in the vpnkit issue on whether the problem is resolved and whether you notice any side effects.

@rg9400 (Author) commented Apr 25, 2023

Per my testing of the experimental build, the issue is significantly improved but not resolved. There are still timeouts, just far fewer. When running thousands of curls, I still notice stuck handshakes that don't close instantly but take a minute or two to resolve. The difference is that most such instances do clear out before the timeout.

I just wanted to confirm that connections still get stuck, even if the overall symptoms are a lot better.

@Junto026 commented Dec 15, 2023

I believe I am facing this same problem on MacOS Sonoma 14.1.1, running Docker Desktop for Mac (Apple Silicon) 4.25.2.

I want to try downgrading to 4.5.0 (it's insane that the issue has been going on that long). Does anybody have an install file? The oldest available here is 4.9.1.

EDIT: Docker Desktop for MacOS (Apple Silicon) can be downloaded here.

EDIT2: Confirmed, downgrading fixed the issue. I’ve been running with stable connections for weeks now.

@sorcer1122 commented Jul 7, 2024

Facing the same issue on Debian 12. I checked the ufw logs and whitelisted the container's IP address with sudo ufw allow from 172.17.0.2; this fixed it.

@kierankhan commented Jul 29, 2024

Pretty stuck on this, as I am not using Docker Desktop, only Docker Engine on Ubuntu. Reverting to 4.5.0 (Docker Engine 20.10.12) breaks everything, so if anyone has other workarounds, let me know.

@Junto026

Pretty stuck on this, as I am not using Docker Desktop, only Docker Engine on Ubuntu. Reverting to 4.5.0 (Docker Engine 20.10.12) breaks everything, so if anyone has other workarounds, let me know.

If you’re running on Linux I don’t think you’ll experience this exact issue. It seems to only happen when running Docker on MacOS or Windows.

@ashwinrayaprolu

I'm facing this issue too.
While not on Windows, I can pretty much replicate this issue on Linux.

seq 1 100000 | xargs -Iname -n1 -P200 curl --write-out ' %{http_code} , %{time_total}s \n' --silent --output /dev/null "http://10.98.160.112/health" | awk '{ if ( $3+0 > 0.06 ) print $1, $3}'

I just used curl's parallel request feature to test the response code and time taken.
Response times start going up.

[screenshot of the command output]

@phygineer commented Aug 27, 2024

Same observations on Mac; still not fixed.

It took me a while to realize this issue was happening.

[two screenshots attached]
