
Replication connection timeout #1097

Closed
BathoryPeter opened this issue Jun 10, 2024 · 18 comments

@BathoryPeter

Osmosis fails to download minute diffs from the planet server. Not all, but most update attempts run into a connection timeout.

INFO: Reading current server state. [ReplicationState(timestamp=Mon Jun 10 09:20:03 CEST 2024, sequenceNumber=6127042)]
[2024-06-10 09:22:01] 117880 pid 117828 still running                                                    
[2024-06-10 09:23:01] 117958 pid 117828 still running                                                    
Jun 10, 2024 9:23:14 AM org.openstreetmap.osmosis.core.pipeline.common.ActiveTaskManager waitForCompletion
SEVERE: Thread for task 1-read-replication-interval failed                                               
org.openstreetmap.osmosis.core.OsmosisRuntimeException: Unable to read the state from the server.        
        at org.openstreetmap.osmosis.replication.common.ServerStateReader.getServerState(ServerStateReader.java:95)
        at org.openstreetmap.osmosis.replication.common.ServerStateReader.getServerState(ServerStateReader.java:60)
        at org.openstreetmap.osmosis.replication.v0_6.BaseReplicationDownloader.download(BaseReplicationDownloader.java:218)
        at org.openstreetmap.osmosis.replication.v0_6.BaseReplicationDownloader.runImpl(BaseReplicationDownloader.java:293)
        at org.openstreetmap.osmosis.replication.v0_6.BaseReplicationDownloader.run(BaseReplicationDownloader.java:372)
        at java.base/java.lang.Thread.run(Thread.java:829)                                               
Caused by: java.net.ConnectException: Connection timed out (Connection timed out) 

Lowering the maxInterval in configuration.txt to 60s helps a bit, but increasing it to 1h always results in a timeout.
I checked on both my production server and my local PC; the result was the same.

I first noticed the issue on 2024-06-09 at 0:15 UTC.
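
For reference, a sketch of the relevant part of configuration.txt in the osmosis working directory (the baseUrl is the default planet endpoint; maxInterval is in seconds, 60 and 3600 being the values mentioned above):

# configuration.txt (sketch) -- read by osmosis --read-replication-interval
baseUrl=https://planet.openstreetmap.org/replication/minute/
# 60 helps a bit; raising this to 3600 reliably runs into the timeout
maxInterval=60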

@BathoryPeter
Author

I can see a significant drop in the S3 graphs.

@tomhughes
Member

Well you seem to have narrowed in on one tiny window. If you look at the last 24 hours it all looks normal, and all our replication feeds are running fine, so it seems to be an issue specific to your connection to AWS.

@BathoryPeter
Author

The issue is still present. With maxInterval=120 about half of the requests time out, and my replication lag is continuously increasing:

[replication lag graph: replag-pinpoint=1717884000,1718005819]

My server connects from Frankfurt, but I am experiencing the same problem here from Budapest, Hungary.
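
(For anyone who wants to reproduce the lag number: a rough sketch, assuming GNU date is available, that compares the server's current minutely state.txt timestamp with local time.)

# sketch: replication lag against the current minutely state.txt (assumes GNU date)
state=$(curl -s https://planet.openstreetmap.org/replication/minute/state.txt)
ts=$(echo "$state" | sed -n 's/^timestamp=//p' | tr -d '\\')
echo "server state: $ts"
echo "lag: $(( $(date -u +%s) - $(date -u -d "$ts" +%s) )) seconds"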

@tomhughes
Member

As I say, we have machines in at least six locations on five different networks that are pulling from the feed with no problem. They're using osmium rather than osmosis, of course, but I don't see why that would make a difference.

@BathoryPeter
Author

> Well you seem to have narrowed in on one tiny window

The drop on the graph coincides exactly with the first error in my logs.

Would a verbose osmosis log help?

@tomhughes
Member

> Well you seem to have narrowed in on one tiny window

> The drop on the graph coincides exactly with the first error in my logs.

There are brief ups and downs all the time though - the long term average clearly doesn't show any significant decrease.

> Would a verbose osmosis log help?

No, it would not. None of us have used osmosis for years, and in any case it's a network timeout, so what exactly do you expect a verbose log to show? There is a problem with packets from your network getting to and/or from Amazon, and there is not much we can do to help with that.

@BathoryPeter
Author

Hmm, I tried replacing the baseUrl with the Amazon one, and that completely solved the problem:

#baseUrl=https://planet.openstreetmap.org/replication/minute/
baseUrl=https://osm-planet-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/planet/replication/minute
maxInterval=3600
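
(For completeness, a minimal sketch of how that configuration.txt can be exercised by hand; the working directory path and output file are only placeholders, not my actual pipeline:)

# run the download once against configuration.txt / state.txt in workingDirectory (paths illustrative)
osmosis --read-replication-interval workingDirectory=/var/lib/osmosis/replication \
        --simplify-change \
        --write-xml-change /tmp/changes.osc.gz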

@tomhughes
Member

So your problem is reaching he.net in Amsterdam then, by the sounds of it.

@Firefishy
Member

@BathoryPeter Please could you run a traceroute planet.openstreetmap.org or an mtr --report-wide --report-cycles 10 planet.openstreetmap.org?

@BathoryPeter
Author

From Frankfurt Düsseldorf:

traceroute to planet.openstreetmap.org (184.104.179.145), 30 hops max, 60 byte packets
 1  ip-161-97-128-11.static.contabo.net (161.97.128.11)  1.512 ms  1.488 ms  1.510 ms
 2  et-4-0-8.edge6.Dusseldorf1.Level3.net (62.67.22.193)  1.566 ms 10.0.50.1 (10.0.50.1)  1.411 ms  1.279 ms
 3  et-4-0-8.edge6.Dusseldorf1.Level3.net (62.67.22.193)  1.351 ms ae2.3210.edge4.frf1.neo.colt.net (171.75.9.147)  4.772 ms  4.751 ms
 4  ae2.3210.edge4.frf1.neo.colt.net (171.75.9.147)  4.723 ms e0-5.core2.fra1.he.net (216.66.87.197)  5.201 ms ae2.3210.edge4.frf1.neo.colt.net (171.75.9.147)  4.690 ms
 5  e0-5.core2.fra1.he.net (216.66.87.197)  5.677 ms  5.854 ms *
 6  * port-channel1.core3.fra1.he.net (184.104.198.26)  4.823 ms *
 7  port-channel2.core3.fra2.he.net (72.52.92.70)  5.476 ms * *
 8  openstreetmap-foundation.port-channel7.switch2.ams2.he.net (184.104.202.70)  8.809 ms  8.792 ms *
 9  openstreetmap-foundation.port-channel7.switch2.ams2.he.net (184.104.202.70)  13.858 ms * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
Start: 2024-06-10T10:51:17+0200
HOST: carto-map                                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2a02:c206::a                                                0.0%    10    1.1   1.3   1.1   1.8   0.2
  2.|-- ge-7-0-6.bar1.Munich1.Level3.net                            0.0%    10    1.3   5.6   1.2  21.7   6.9
  3.|-- lo-0-0-v6.edge4.Frankfurt1.Level3.net                       0.0%    10    9.7   5.3   4.6   9.7   1.6
  4.|-- e0-6.core2.fra1.he.net                                     10.0%    10    5.8   6.1   5.3  10.4   1.6
  5.|-- ???                                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  6.|-- ???                                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  7.|-- ???                                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  8.|-- openstreetmap-foundation.port-channel7.switch2.ams2.he.net  0.0%    10   13.3  11.4   8.8  13.8   1.9
  9.|-- norbert.openstreetmap.org                                   0.0%    10    7.5   7.6   7.5   8.0   0.1

@mmd-osm

mmd-osm commented Jun 10, 2024

Similar reports from @pa5cal in https://community.openstreetmap.org/t/what-is-the-preferred-way-to-download-planet-diff-files/108854/10

@pa5cal

pa5cal commented Jun 10, 2024

Thanks for linking me here @mmd-osm !

I also get a lot of connection timeouts when using Osmosis and other tools for minutely and changeset diffs.

[graph: other_diff_status-day]

Today I temporarily switched to https://download.openstreetmap.fr/replication/planet/minute

@Firefishy
Member

Firefishy commented Jun 10, 2024

It appears there may have been an issue with apache on the planet.openstreetmap.org webserver. The log was being flooded with "AH03490: scoreboard is full, not at MaxRequestWorkers. Increase ServerLimit." but apache and the server otherwise appeared OK.

I have restarted apache and the logged error has gone away for now.
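
For reference, the scoreboard that AH03490 complains about is sized by the event MPM limits, i.e. directives along these lines (values purely illustrative, not the planet server's actual configuration):

# e.g. mpm_event.conf -- illustrative values only
<IfModule mpm_event_module>
    ServerLimit          32
    ThreadLimit          64
    ThreadsPerChild      25
    MaxRequestWorkers   800
</IfModule>

The error typically means every scoreboard process slot (capped by ServerLimit) is occupied, often by children still finishing gracefully, even though MaxRequestWorkers has not been reached, which is consistent with a restart clearing it.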

@BathoryPeter
Author

I can confirm that the issue is gone.

@pa5cal

pa5cal commented Jun 10, 2024

Thank you very much, @Firefishy !

I have not had a single timeout in the last 15 minutes. At least my services are running normally again and the downloads are available as fast as usual.

For your information: at least on my server, the timeouts described here occur about every three months and, as mentioned, disappear after about 24 hours. I don't know whether Apache gets restarted then or something else changes.

@Firefishy
Member

I suspect this is due to a faulty version of apache; we run a custom build to work around some other apache bugs. We will move back to the distro release in Debian 12 and/or Ubuntu 24.04.

@tomhughes
Member

I don't think it's custom as such; it's just a backport of a later version.

@Firefishy
Member

Closing. If the issue returns, feel free to re-open the ticket.
