
Replication connection timeout #1097

Closed
BathoryPeter opened this issue Jun 10, 2024 · 18 comments

@BathoryPeter

Osmosis fails to download minute diffs from the planet server. Not all, but most update attempts run into a connection timeout.

INFO: Reading current server state. [ReplicationState(timestamp=Mon Jun 10 09:20:03 CEST 2024, sequenceNumber=6127042)]
[2024-06-10 09:22:01] 117880 pid 117828 still running                                                    
[2024-06-10 09:23:01] 117958 pid 117828 still running                                                    
Jun 10, 2024 9:23:14 AM org.openstreetmap.osmosis.core.pipeline.common.ActiveTaskManager waitForCompletion
SEVERE: Thread for task 1-read-replication-interval failed                                               
org.openstreetmap.osmosis.core.OsmosisRuntimeException: Unable to read the state from the server.        
        at org.openstreetmap.osmosis.replication.common.ServerStateReader.getServerState(ServerStateReader.java:95)
        at org.openstreetmap.osmosis.replication.common.ServerStateReader.getServerState(ServerStateReader.java:60)
        at org.openstreetmap.osmosis.replication.v0_6.BaseReplicationDownloader.download(BaseReplicationDownloader.java:218)
        at org.openstreetmap.osmosis.replication.v0_6.BaseReplicationDownloader.runImpl(BaseReplicationDownloader.java:293)
        at org.openstreetmap.osmosis.replication.v0_6.BaseReplicationDownloader.run(BaseReplicationDownloader.java:372)
        at java.base/java.lang.Thread.run(Thread.java:829)                                               
Caused by: java.net.ConnectException: Connection timed out (Connection timed out) 

Lowering the maxInterval in configuration.txt to 60s helps a bit, but increasing it to 1h always results in a timeout.
I checked on both my production server and my local PC; the result was the same.

I first noticed the issue on 2024-06-09 at 0:15 UTC.
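
For reference, a sketch of the relevant part of configuration.txt in the osmosis working directory (the baseUrl is the default planet endpoint; maxInterval is in seconds, 60 and 3600 being the values mentioned above):

# configuration.txt (sketch) -- read by osmosis --read-replication-interval
baseUrl=https://planet.openstreetmap.org/replication/minute/
# 60 helps a bit; raising this to 3600 reliably runs into the timeout
maxInterval=60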

@BathoryPeter
Author

I can see a significant drop in the S3 graphs.

@tomhughes
Member

Well you seem to have narrowed in on one tiny window. If you look at the last 24 hours it all looks normal, and all our replication feeds are running fine, so it seems to be an issue specific to your connection to AWS.

@BathoryPeter
Author

The issue is still present. With maxInterval=120 about half of the requests time out, and my replication lag is continuously increasing:

[replication lag graph: replag-pinpoint=1717884000,1718005819]

My server connects from Frankfurt, but I am experiencing the same problem here from Budapest, Hungary.
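
(For anyone who wants to reproduce the lag number: a rough sketch, assuming GNU date is available, that compares the server's current minutely state.txt timestamp with local time.)

# sketch: replication lag against the current minutely state.txt (assumes GNU date)
state=$(curl -s https://planet.openstreetmap.org/replication/minute/state.txt)
ts=$(echo "$state" | sed -n 's/^timestamp=//p' | tr -d '\\')
echo "server state: $ts"
echo "lag: $(( $(date -u +%s) - $(date -u -d "$ts" +%s) )) seconds"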

@tomhughes
Member

As I say, we have machines in at least six locations on five different networks that are pulling from the feed with no problem. They're using osmium rather than osmosis, of course, but I don't see why that would make a difference.

@BathoryPeter
Author

> Well you seem to have narrowed in on one tiny window

The drop on the graph coincides exactly with the first error in my logs.

Would a verbose osmosis log help?

@tomhughes
Member

> Well you seem to have narrowed in on one tiny window

> The drop on the graph coincides exactly with the first error in my logs.

There are brief ups and downs all the time though - the long term average clearly doesn't show any significant decrease.

> Would a verbose osmosis log help?

No, it would not. None of us have used osmosis for years, and in any case it's a network timeout, so what exactly do you expect a verbose log to show? There is a problem with packets from your network getting to and/or from Amazon, and there is not much we can do to help with that.

@BathoryPeter
Author

Hmm, I tried replacing the baseUrl with the Amazon one, and that completely solved the problem:

#baseUrl=https://planet.openstreetmap.org/replication/minute/
baseUrl=https://osm-planet-eu-central-1.s3.dualstack.eu-central-1.amazonaws.com/planet/replication/minute
maxInterval=3600
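
(For completeness, a minimal sketch of how that configuration.txt can be exercised by hand; the working directory path and output file are only placeholders, not my actual pipeline:)

# run the download once against configuration.txt / state.txt in workingDirectory (paths illustrative)
osmosis --read-replication-interval workingDirectory=/var/lib/osmosis/replication \
        --simplify-change \
        --write-xml-change /tmp/changes.osc.gz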

@tomhughes
Member

So your problem is reaching he.net in Amsterdam then, by the sounds of it.

@Firefishy
Member

@BathoryPeter Please could you run a traceroute planet.openstreetmap.org or an mtr --report-wide --report-cycles 10 planet.openstreetmap.org?

@BathoryPeter
Author

From Frankfurt Düsseldorf:

traceroute to planet.openstreetmap.org (184.104.179.145), 30 hops max, 60 byte packets
 1  ip-161-97-128-11.static.contabo.net (161.97.128.11)  1.512 ms  1.488 ms  1.510 ms
 2  et-4-0-8.edge6.Dusseldorf1.Level3.net (62.67.22.193)  1.566 ms 10.0.50.1 (10.0.50.1)  1.411 ms  1.279 ms
 3  et-4-0-8.edge6.Dusseldorf1.Level3.net (62.67.22.193)  1.351 ms ae2.3210.edge4.frf1.neo.colt.net (171.75.9.147)  4.772 ms  4.751 ms
 4  ae2.3210.edge4.frf1.neo.colt.net (171.75.9.147)  4.723 ms e0-5.core2.fra1.he.net (216.66.87.197)  5.201 ms ae2.3210.edge4.frf1.neo.colt.net (171.75.9.147)  4.690 ms
 5  e0-5.core2.fra1.he.net (216.66.87.197)  5.677 ms  5.854 ms *
 6  * port-channel1.core3.fra1.he.net (184.104.198.26)  4.823 ms *
 7  port-channel2.core3.fra2.he.net (72.52.92.70)  5.476 ms * *
 8  openstreetmap-foundation.port-channel7.switch2.ams2.he.net (184.104.202.70)  8.809 ms  8.792 ms *
 9  openstreetmap-foundation.port-channel7.switch2.ams2.he.net (184.104.202.70)  13.858 ms * *
10  * * *
11  * * *
12  * * *
13  * * *
14  * * *
15  * * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *
Start: 2024-06-10T10:51:17+0200
HOST: carto-map                                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 2a02:c206::a                                                0.0%    10    1.1   1.3   1.1   1.8   0.2
  2.|-- ge-7-0-6.bar1.Munich1.Level3.net                            0.0%    10    1.3   5.6   1.2  21.7   6.9
  3.|-- lo-0-0-v6.edge4.Frankfurt1.Level3.net                       0.0%    10    9.7   5.3   4.6   9.7   1.6
  4.|-- e0-6.core2.fra1.he.net                                     10.0%    10    5.8   6.1   5.3  10.4   1.6
  5.|-- ???                                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  6.|-- ???                                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  7.|-- ???                                                        100.0    10    0.0   0.0   0.0   0.0   0.0
  8.|-- openstreetmap-foundation.port-channel7.switch2.ams2.he.net  0.0%    10   13.3  11.4   8.8  13.8   1.9
  9.|-- norbert.openstreetmap.org                                   0.0%    10    7.5   7.6   7.5   8.0   0.1

@mmd-osm

mmd-osm commented Jun 10, 2024

Similar reports from @pa5cal in https://community.openstreetmap.org/t/what-is-the-preferred-way-to-download-planet-diff-files/108854/10

@pa5cal

pa5cal commented Jun 10, 2024

Thanks for linking me here @mmd-osm !

I also get a lot of connection timeouts when using Osmosis and other tools for minutely and changeset diffs.

[graph: other_diff_status-day]

Today I temporarily switched to https://download.openstreetmap.fr/replication/planet/minute

@Firefishy
Member

Firefishy commented Jun 10, 2024

It appears there may have been an issue with apache on the planet.openstreetmap.org webserver. The log was being flooded with "AH03490: scoreboard is full, not at MaxRequestWorkers. Increase ServerLimit." but apache and the server otherwise appeared OK.

I have restarted apache and the logged error has gone away for now.
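
For reference, the scoreboard that AH03490 complains about is sized by the event MPM limits, i.e. directives along these lines (values purely illustrative, not the planet server's actual configuration):

# e.g. mpm_event.conf -- illustrative values only
<IfModule mpm_event_module>
    ServerLimit          32
    ThreadLimit          64
    ThreadsPerChild      25
    MaxRequestWorkers   800
</IfModule>

The error typically means every scoreboard process slot (capped by ServerLimit) is occupied, often by children still finishing gracefully, even though MaxRequestWorkers has not been reached, which is consistent with a restart clearing it.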

@BathoryPeter
Author

I can confirm that the issue is gone.

@pa5cal

pa5cal commented Jun 10, 2024

Thank you very much, @Firefishy !

I have not had a single timeout in the last 15 minutes. At least my services are running normally again and the downloads are available as fast as usual.

For your information: at least on my server, the timeouts described here occur about every three months and, as mentioned, disappear after about 24 hours. I don't know whether Apache gets restarted then or something else changes.

@Firefishy
Member

I suspect this is due to a faulty version of apache; we run a custom build to work around some other apache bugs. We will move back to the distro release in Debian 12 and/or Ubuntu 24.04.

@tomhughes
Member

I don't think it's custom as such; it's just a backport of a later version.

@Firefishy
Member

Closing. If the issue returns, feel free to re-open the ticket.
