Possible issue in serial sequences on current RADB NRTM #877
Comments
NTT is seeing the same thing on 4.2.8.
And for the record we have had to
I have had a total of 5 instances where we got into this mode since November of 2022. Below is the first error message in each of the instances. Maybe these match up with Troy2914's instances:
2022-11-02 01:50:33,793 irrd[1845133]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for LACNIC: %% ERROR: Serials 103009 - 110626 do not exist
In my log file, I only see "Received NRTM response for ... Serials .... do not exist" occasionally. For instance, the RADB lines above. However, for queries to the LEVEL3 IRR, I see "encountered parsing error while parsing query "-g LEVEL3.... Serials .... do not exist" all the time. In all the cases where the query is TO the LEVEL3 IRR, the query range is and the reported non-existent serials are in the range of a small number to the number specified in the query. This is not the case with the RADB queries.
This is an interesting one. Given that you both experience issues, and some people raised issues about the IRRexplorer instance, a best guess is that RADB's NRTM streams are the common part here. I'll see if my own logs show anything on this. Also wondering if @schelcj has ideas, or whether the timings align with something? I do not know RADB's config, or whether they do anything unusual.
Current implementation
Let me run through how this is implemented so we all know precisely what messages mean, and for my own overview. On the client side, IRRD keeps a serial_newest_mirror which only relates to the mirror source. When running NRTM, it requests serials from serial_newest_mirror + 1 up to LAST. On the server side, responses are built by the NRTMGenerator, which takes a
Log sample 1
2023-12-04 13:39:13,176 irrd[1188550]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 255032 on nrtm.radb.net:43
This is normal behaviour that suggests you are up to date.
Log sample 2
2023-12-04 13:54:14,507 irrd[1189565]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 255032 on nrtm.radb.net:43
This request and response are scenario 1: the remote journal starts at 268299, but you're asking for 255032. You are too far behind, it doesn't have the entries older than 268299. This is only recoverable with a full reload. There are a few possible causes for this situation: your IRRD has been down for a while, the NRTMv3 server keeps only very short journals so the window is very narrow, or there has been a persistent issue before this which was preventing you from keeping up with NRTM. You'll have to dig more into your logs to figure out which, or perhaps RADB can tell us more about anything unusual on their end.
Other log samples
It is not possible to know what was happening there, as you need to know both the query and the response to figure out whether this is scenario 1 or 3.
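For reference, a rough sketch of the two sides described above, with invented names and simplified logic (this is not IRRD's actual code): the client derives its -g query from serial_newest_mirror, and the server's answer depends on where the requested serial falls relative to its journal window.

```python
# Rough sketch only -- not IRRD's actual implementation; names are invented.

# Client side: the -g query is derived from the last serial known for the
# mirrored source (serial_newest_mirror), as seen in the log lines above.
def build_nrtm_query(source, serial_newest_mirror, version=3):
    return f"-g {source}:{version}:{serial_newest_mirror + 1}-LAST"

# Server side: a simplified view of the three outcomes discussed here, given
# the oldest and newest serials currently in the server's journal.
def classify_request(requested_start, oldest_journal, newest_journal):
    if requested_start > newest_journal:
        return "up to date"      # "% Warning: there are no newer updates available"
    if requested_start < oldest_journal:
        return "too far behind"  # "%% ERROR: Serials ... do not exist" -> full reload needed
    return "serve entries"       # stream journal entries requested_start..newest_journal

print(build_nrtm_query("RADB", 255031))          # -g RADB:3:255032-LAST
print(classify_request(255032, 200000, 255031))  # "up to date", as in log sample 1
print(classify_request(255032, 268299, 270000))  # "too far behind", as in log sample 2
                                                 # (270000 is an arbitrary illustrative value)
```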
As one of the operators of IRRExplorer (which uses irrd as one of its backends) I can confirm that we've seen similar problems over the past months as well. We're currently running irrd 4.4.2.
Notice Log Sample 2 is exactly 15 minutes after Log Sample 1. There were no RADB updates on the client side between those two samples. So the first serial to request on the client side should not change. I don’t know where “268299” comes from.
I will make myself available for you whenever you want to see more directly into the LEVEL3 system. I know, that doesn’t give you access to the RADB system at the same time. And of course, we don’t know how to trigger the error. I’m writing a script to more closely monitor the log so I will know shortly after this condition pops its head again.
D
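Since a monitoring script is mentioned above, here is a minimal sketch of one possible check; the log path and the exact message format are assumptions based on the log samples quoted in this issue.

```python
# Minimal log-monitoring sketch; the log path and message format are
# assumptions based on the samples quoted in this issue.
import re
import sys

PATTERN = re.compile(
    r"Received NRTM response for (\S+): %% ERROR: Serials (\d+) - (\d+) do not exist"
)

def scan(log_path):
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            match = PATTERN.search(line)
            if match:
                source, start, end = match.groups()
                print(f"ALERT: {source} reports serials {start}-{end} as missing; "
                      f"a force reload may be needed", file=sys.stderr)

if __name__ == "__main__":
    scan("/var/log/irrd/irrd.log")  # assumed path
```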
We've seen the same internally and have been investigating as well. We saw the same serial constraint violation for the same serial as @dhutchin-ctl. Our nrtm.radb.net is fed via NRTM from the authoritative host, and when this error hit, any further updates were blocked. A reseed resolved the issue, but we are fairly certain we saw this once before, after the v4 migration. We are also adding more monitoring and still investigating.
Ah right, that gives a bit more context. Using my earlier comment on NRTM responses and their precise meaning, we can narrow down. We can assume the client keeps requesting 255032-LAST.
At first, RADB replies:
2023-12-04 13:39:13,300 irrd[1188550]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: % Warning: there are no newer updates available
This is expected in a correctly running setup where you are up to date. This message means RADB has journal entries up to and including 255031, and you have those too in the mirror. All good.
15 minutes later, RADB replies:
2023-12-04 13:54:14,622 irrd[1189565]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 255032 - 268299 do not exist
(The assumption that this is the same 255032-LAST query is consistent with the fact that the response mentions this serial. So we can be quite sure the same query was used both times.)
RADB now reports that its first serial is actually 268299 in the journal. That means it jumped 13267 serials in 15 minutes. 13K changes in 15 minutes is rare but not impossible, but what is strange here is that the journal did not just grow by at least 13K entries; the oldest 13K were also removed from the journal. It is impossible for clients to keep up in that case, or recover from that situation. There are two scenarios I can think of:
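For concreteness, the arithmetic behind the jump described above, as a small sketch (the 15-minute interval comes from the timestamps of the two samples):

```python
# Arithmetic from the two log samples above.
requested_start = 255032        # what the client asked for in both polls
reported_first_serial = 268299  # the boundary RADB reported 15 minutes later
poll_interval_minutes = 15

jump = reported_first_serial - requested_start
print(jump)                                 # 13267 serials gone from the journal window
print(round(jump / poll_interval_minutes))  # ~884 serials per minute

# A client polling every T minutes can only stay inside the window if the
# journal retains at least (change rate * T) entries; here the window moved
# far faster than any client could poll or recover from.
```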
If I am reading your two scenarios properly, those are both things to address on the RADB side?
For the serials jumping: we have a maintainer whose automated processes may have gone sideways and produced a lot of unnecessary updates. We are reaching out to them to address it.
That sounds like a 4.4.3 request :) to add a limit/throttle.
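Purely as an illustration of what such a limit could look like; this is not an existing IRRD option, and the threshold and names below are made up:

```python
# Illustration only: not an existing IRRD feature, threshold and names invented.
# The idea is to flag an NRTM import whose size suggests a runaway maintainer
# process rather than normal churn, before it pushes the journal window past
# what mirroring clients can follow.
MAX_OPERATIONS_PER_IMPORT = 5000  # hypothetical threshold

def import_batch_looks_safe(source, operation_count):
    if operation_count > MAX_OPERATIONS_PER_IMPORT:
        print(f"WARNING: {source} import contains {operation_count} operations, "
              f"above the limit of {MAX_OPERATIONS_PER_IMPORT}; review before applying")
        return False
    return True

print(import_batch_looks_safe("RADB", 13267))  # the jump observed above -> flagged
```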
Back to the original issue: reviewing logs this morning shows that our nrtm.radb.net host retrieved, and applied, updates up to and shortly after 255031, then for unknown reasons tried to reapply that serial and ran into the unique key constraint, preventing all further updates. We only had logging at INFO level so there isn't much more to go on at this time, but we have enabled DEBUG logging to review if this happens again.
Ok, we can see what those logs produce, but the uniqueness constraint in the database isn't something that should ever be hit, really - it's there as a safeguard against data corruption. I also can't immediately think of a case where this can happen, but I cannot say more without knowing which instances you are running, what their configs are, and whether there's any replication going on. Further logs will also help; e.g. I am wondering whether this instance has received 255031 over NRTMv3 before and is trying to insert it twice, or whether it came up with that serial on its own somehow (which it should not).
Working on getting further logs; there is no replication for the hosts in question.
That appears to be what has happened: 255031 was received over NRTMv3, applied, subsequent updates were received and applied, then the update for 255031 was attempted again.
btw, this seems to have happened again on the 25th of December.
with serial number 530173.
I think the problem started earlier than the 25th. Full disclosure... I do an irrd_mirror_force_reload on RADB once a week via cron. The failure coincides with that action. Coincidence? Hard to believe that. Here are some (maybe not all) pertinent entries from the log file.
Start at 12-23 16:52z, the time of the last apparently normal NRTM update (we update every 15 minutes):
2023-12-23 16:52:08,351 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB
So at that time, my interpretation is that there are no updates newer than serial 530173.
At 12-23 17:00z I ran irrd_mirror_force_reload:
At the next NRTM interval (17:07), I see that the current serial has jumped to 535978, but the import via FTP seems to go OK:
Then at the next NRTM interval (which is only 13 seconds later) is when things start to go haywire (hmmm, where did that expression come from?):
And things stay like that until 12-25 12:56z, when the only change is that the range of serials reported in the RADB response to our query (same query as above) is "535979 - 543646". And that continues every 15 minutes until 12-27 15:26, when I did another irrd_mirror_force_reload:
2023-12-27 15:26:40,040 irrd[1578473]: [irrd.storage.database_handler#INFO] force_reload flag set for RADB, serial synchronisation based on current settings: False
2023-12-27 15:30:27,663 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB
Now everything is looking reasonable:
2023-12-27 16:01:00,553 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB
2023-12-27 16:16:02,170 irrd[2869178]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB
Have I left anything out of this report that might be in our logs? Please let me know. RADB folks, please chime in here.
We did not do a force reload until about 20:00 UTC on the 25th, and had our NOC contact [email protected].
We saw the same SQL constraint violation but, it being the holidays, I haven't dug into the logs yet. A reseed was required internally again.
Reviewing a recent incident of this constraint violation, I noticed that this exception is thrown by an
Snippet of logs of operations on the same prefix. Why they are doing this, I don't know.
This is very useful data, and the issue is then probably somewhere in the handling of the upsert buffer or the insert/delete queries. I will dig into the code.
So, the duplicate key error here concerns the journal, where, as a safeguard, serial and source are unique. NRTMv3 exports would otherwise break. The serial_nrtm has three possible sources:
The serial actually inserted by the DB is then added to the known new serials, for use for the next journal entry created and some metadata. This is not thread/multiprocess safe, but there should only be one process running to mirror one source, as ensured by the mirroring scheduler. Questions:
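As a minimal sketch of that uniqueness safeguard; the table and column names here are assumptions for illustration, and IRRD's actual schema may differ:

```python
# Minimal sketch of the journal uniqueness safeguard using SQLAlchemy;
# table and column names are assumed for illustration, IRRD's schema may differ.
from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class JournalEntry(Base):
    __tablename__ = "rpsl_database_journal"  # assumed table name
    pk = Column(Integer, primary_key=True)
    source = Column(String, nullable=False)
    serial_nrtm = Column(Integer, nullable=False)
    # The safeguard discussed above: a given source may never record the same
    # NRTM serial twice, otherwise NRTMv3 exports built from the journal break.
    __table_args__ = (UniqueConstraint("source", "serial_nrtm"),)
```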
RADB:
  authoritative: false
  keep_journal: true
  import_serial_source: "file:///<path>/radb/dbase/RADB.CURRENTSERIAL"
  import_source: "file:///<path>/radb/dbase/radb.db.gz"
  import_timer: 60
  nrtm_host: "<auth_db_ip>"
  nrtm_port: "4343"
  nrtm_access_list: nrtm_public
RPKI is disabled, there are no scope filters, and route object preferences are disabled on this host.
The log level was reset to
We have only seen this error on our host that serves nrtm.radb.net.
I'm on IRRd 4.4.3 and seeing something very similar - I hope this is the right issue to report this to. There are a few sources (not just RADB) that sometimes are out of sync and I need to kick a manual force reload.
Status:
Samples:
!JRADB
A520
{
"RADB": {
"source_type": "regular",
"authoritative": false,
"object_class_filter": null,
"rpki_rov_filter": false,
"scopefilter_enabled": false,
"route_preference": null,
"local_journal_kept": true,
"serial_oldest_journal": null,
"serial_newest_journal": null,
"serial_last_export": null,
"serial_newest_mirror": 2052890,
"last_update": "2024-08-19T16:52:36.911378+00:00",
"synchronised_serials": true
}
}
C
!JNTTCOM
A522
{
"NTTCOM": {
"source_type": "regular",
"authoritative": false,
"object_class_filter": null,
"rpki_rov_filter": false,
"scopefilter_enabled": false,
"route_preference": null,
"local_journal_kept": true,
"serial_oldest_journal": null,
"serial_newest_journal": null,
"serial_last_export": null,
"serial_newest_mirror": 1988580,
"last_update": "2024-08-19T16:26:12.047086+00:00",
"synchronised_serials": true
}
}
I'm not seeing any error; in fact, I do see the NRTM updates:
2024-08-20 08:26:03,669 irrd[681]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for LACNIC from serial 428757 on irr.lacnic.net:43
2024-08-20 08:26:03,671 irrd[680]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for APNIC from serial 12493555 on whois.apnic.net:43
2024-08-20 08:26:03,672 irrd[682]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for NTTCOM from serial 1988577 on rr.ntt.net:43
But they just don't seem to trigger any update; I suspect it might be caused by the
irrd=# select serial_oldest_journal,serial_newest_journal from database_status where source='APNIC';
serial_oldest_journal | serial_newest_journal
-----------------------+-----------------------
|
(1 row)
As a client, your local serial_newest/oldest_journal are not too relevant; the important field is serial_newest_mirror, which identifies the latest known serial from the mirror server, i.e. NRTM will next start from that serial + 1. You are the first person to report this with mirror servers other than RADB. Perhaps the issue is wider, perhaps there is a different issue involved. The first step is to set the log level to DEBUG, and to provide me all logs relating to the affected sources starting from the forced reload, when this reoccurs.
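For client-side monitoring, a small sketch that issues the same !J query shown above over the whois port and extracts serial_newest_mirror; the host, port, and connection handling are assumptions about the local setup:

```python
# Sketch: fetch !J status from a local IRRD over the whois port and extract
# serial_newest_mirror. Host, port and the single-query connection handling
# (read until the server closes) are assumptions about the local setup.
import json
import socket

def whois_query(host, port, query):
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall((query + "\r\n").encode())
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

def newest_mirror_serial(host, source, port=43):
    response = whois_query(host, port, f"!J{source}")
    # The payload is the JSON object between the A<length> header and the C footer.
    start, end = response.find("{"), response.rfind("}") + 1
    return json.loads(response[start:end])[source]["serial_newest_mirror"]

print(newest_mirror_serial("localhost", "RADB"))  # assumed local instance
```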
We are still hitting up against |
Describe the bug
It looks like serial numbers can sometimes be corrupted. Once they are corrupted, NRTM processing of further updates fails. The LEVEL3 instance of IRRdv4 is mirroring RADB. Consider the following entries from irrd.log (grep RADB):
2023-12-04 13:39:13,137 irrd[210576]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB
2023-12-04 13:39:13,175 irrd[1188550]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 255031, force_reload: False, nrtm enabled: True
2023-12-04 13:39:13,176 irrd[1188550]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 255032 on nrtm.radb.net:43
2023-12-04 13:39:13,176 irrd[1188550]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:255032-LAST on nrtm.radb.net port 43
2023-12-04 13:39:13,300 irrd[1188550]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: % Warning: there are no newer updates available
Above are the last entries where I got expected behavior. One "import_timer" interval (900 seconds) later, the following occurred.
2023-12-04 13:54:14,468 irrd[210576]: [irrd.mirroring.scheduler#DEBUG] Started new process RPSLMirrorImportUpdateRunner-RADB for mirror import/export for RADB
2023-12-04 13:54:14,506 irrd[1189565]: [irrd.mirroring.mirror_runners_import#DEBUG] Most recent mirrored serial for RADB: 255031, force_reload: False, nrtm enabled: True
2023-12-04 13:54:14,507 irrd[1189565]: [irrd.mirroring.mirror_runners_import#INFO] Retrieving NRTM updates for RADB from serial 255032 on nrtm.radb.net:43
2023-12-04 13:54:14,508 irrd[1189565]: [irrd.utils.whois_client#DEBUG] Running whois query -g RADB:3:255032-LAST on nrtm.radb.net port 43
2023-12-04 13:54:14,622 irrd[1189565]: [irrd.mirroring.mirror_runners_import#DEBUG] Received NRTM response for RADB: %% ERROR: Serials 255032 - 268299 do not exist
Starting at 2023-12-04 13:54:14, "255032-LAST" seems to be interpreted as "255032 - 268299" and produces an error about non-existent serials. The error persisted (for over 24 hours) until I performed "irrd_mirror_force_reload RADB". Some RADB updates were missed during this period. The updates were recovered when I did the force_reload. I suspect (without great evidence) that the error occurred in RADB's instance of IRRd. I have seen this same situation more than a year ago (so long before IRRdv4 4.4.2). I don't know if I am running the same version of IRRd now as I was the last time I saw this (but maybe).
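(For reference, the same check can be made by hand when this reoccurs; below is a minimal probe sketch, reusing the serial and host from the log above and assuming the probing host is on the server's NRTM access list.)

```python
# Probe sketch: issue the same -g query IRRD sends and report which kind of
# response came back. nrtm.radb.net only answers hosts on its access list,
# so this assumes the probing host is permitted.
import socket

def nrtm_probe(host, source, next_serial, port=43, version=3):
    query = f"-g {source}:{version}:{next_serial}-LAST\r\n"
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(query.encode())
        response = b""
        while True:
            data = sock.recv(4096)
            if not data:
                break
            response += data
    text = response.decode(errors="replace")
    if "no newer updates available" in text:
        return "up to date"
    if "do not exist" in text:
        return "too far behind -- only a full reload will recover"
    return "updates available"

print(nrtm_probe("nrtm.radb.net", "RADB", 255032))
```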
To Reproduce
nrtm_host is set to nrtm.radb.net
Unknown how to reproduce the problem
Expected behaviour
IRRd version you are running
I am running IRRdv4 4.2.6, I don't know what version RADB is running (suspect 4.4.2)
Additional context
None