BGP container crashes after BGP established with 500/1K IPv4 and IPv6 peers #14143

Closed
1 of 2 tasks
stepanblyschak opened this issue Aug 3, 2023 · 5 comments
Labels
bgp triage Needs further investigation

Comments

@stepanblyschak

stepanblyschak commented Aug 3, 2023


Describe the bug

Establish 2k dynamic BGP sessions (1k IPv4, 1k IPv6) and wait about a minute; bgpd then crashes. The crash occurs only when bgpd is started with the SNMP module loaded (the -M snmp command-line option).

The log:

Feb 27 13:54:36.185873 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: Received signal 6 at 1677506076 (si_addr 0x12c00000038, PC 0x7fccf3f46ce1); aborting...
Feb 27 13:54:36.186112 qa-eth-vt03-1-4600ca1 INFO bgp#supervisord: bgpd *** buffer overflow detected ***: terminated
Feb 27 13:54:36.186314 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: zlog_signal+0xf5                   7fccf42fc215     7ffdfd035230 /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0 (mapped at 0x7fccf4260000)
Feb 27 13:54:36.186580 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: PBKDF2_SHA256+0x4e1                7fccf4328851     7ffdfd035370 /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0 (mapped at 0x7fccf4260000)
Feb 27 13:54:36.186852 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: funlockfile+0x50                   7fccf40f6140     7ffdfd0354c0 /lib/x86_64-linux-gnu/libpthread.so.0 (mapped at 0x7fccf40e3000)
Feb 27 13:54:36.187158 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]:     ---- signal ----
Feb 27 13:54:36.187185 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: gsignal+0x141                      7fccf3f46ce1     7ffdfd035a70 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7fccf3f0e000)
Feb 27 13:54:36.187432 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: abort+0x123                        7fccf3f30537     7ffdfd035b90 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7fccf3f0e000)
Feb 27 13:54:36.187724 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: __fsetlocking+0x288                7fccf3f89768     7ffdfd035cc0 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7fccf3f0e000)
Feb 27 13:54:36.187992 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: __fortify_fail+0x22                7fccf401a542     7ffdfd035e00 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7fccf3f0e000)
Feb 27 13:54:36.188351 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: __chk_fail+0x10                    7fccf4018f20     7ffdfd035e20 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7fccf3f0e000)
Feb 27 13:54:36.188567 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: __fdelt_warn+0x17                  7fccf401a497     7ffdfd035e30 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7fccf3f0e000)
Feb 27 13:54:36.188764 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: ?                                  7fccf3c9a475     7ffdfd035e40 /usr/lib/x86_64-linux-gnu/frr/libfrrsnmp.so.0 (mapped at 0x7fccf3c97000)
Feb 27 13:54:36.188919 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: ?                                  7fccf3c9a9e3     7ffdfd035f40 /usr/lib/x86_64-linux-gnu/frr/libfrrsnmp.so.0 (mapped at 0x7fccf3c97000)
Feb 27 13:54:36.189201 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: thread_call+0x7d                   7fccf433a48d     7ffdfd035f50 /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0 (mapped at 0x7fccf4260000)
Feb 27 13:54:36.189461 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: frr_run+0xe8                       7fccf42f44a8     7ffdfd035ff0 /usr/lib/x86_64-linux-gnu/frr/libfrr.so.0 (mapped at 0x7fccf4260000)
Feb 27 13:54:36.189605 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: main+0x356                         55d92d7c5286     7ffdfd036210 /usr/lib/frr/bgpd (mapped at 0x55d92d6e6000)
Feb 27 13:54:36.189928 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: __libc_start_main+0xea             7fccf3f31d0a     7ffdfd036270 /lib/x86_64-linux-gnu/libc.so.6 (mapped at 0x7fccf3f0e000)
Feb 27 13:54:36.190124 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: _start+0x2a                        55d92d7c6f5a     7ffdfd036340 /usr/lib/frr/bgpd (mapped at 0x55d92d6e6000)
Feb 27 13:54:36.190160 qa-eth-vt03-1-4600ca1 CRIT bgp#BGP[56]: in thread agentx_timeout scheduled from ../lib/agentx.c:124 agentx_events_update()
Feb 27 13:54:36.190396 qa-eth-vt03-1-4600ca1 INFO bgp#supervisord: bgpd core_handler: showing active allocations in memory group libfrr
Feb 27 13:54:36.190557 qa-eth-vt03-1-4600ca1 INFO bgp#supervisord: bgpd core_handler: memstats:  Buffer                        :      2 *         24
Feb 27 13:54:36.190557 qa-eth-vt03-1-4600ca1 INFO bgp#supervisord: bgpd core_handler: memstats:  Host config                   :      5 * 
  • Did you check if this is a duplicate issue?
  • Did you test it on the latest FRRouting/frr master branch?

To Reproduce

For example:

  1. Run bgpd with the -M snmp option and SNMP AgentX enabled.
  2. Establish 1k IPv4 and 1k IPv6 dynamic BGP neighbors, then wait about a minute.

Expected behavior

bgpd keeps running with all sessions established; it does not crash.

Versions

  • OS Version: Debian GNU/Linux 11 (bullseye)
  • Kernel: 5.10.0-18-2-amd64

Additional context

The issue happens on SONiC OS.

@ton31337
Member

ton31337 commented Aug 9, 2023

We need a full backtrace to see where it crashes, otherwise this information is useless. Can you test with a vanilla FRR (latest) and verify the crash?

@stepanblyschak
Author

Tested with FRR 8.5.1 and it reproduces.

A more detailed backtrace of bgpd:

(gdb) bt
#0  0x00007f95a1624fe1 in raise () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f95a185f23c in core_handler (signo=6, siginfo=0x7fff09b2dfb0, context=<optimized out>) at ../lib/sigevent.c:261
#2  <signal handler called>
#3  0x00007f95a1476ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f95a1460537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f95a14b83a8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007f95a1549542 in __fortify_fail () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007f95a1547f20 in __chk_fail () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x00007f95a1549497 in __fdelt_warn () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x00007f95a11ad495 in agentx_events_update () at ../lib/agentx.c:146
#10 0x00007f95a18711dd in thread_call (thread=thread@entry=0x7fff09b2e990) at ../lib/thread.c:2006
#11 0x00007f95a1829db8 in frr_run (master=0x560d55bab440) at ../lib/libfrr.c:1198
#12 0x0000560d557413ab in main (argc=<optimized out>, argv=<optimized out>) at ../bgpd/bgp_main.c:520

Here we can see that the crash originates in FRR's SNMP AgentX integration code.
The glibc function __fdelt_warn in the stack trace suggests that a file descriptor greater than 1023 was passed to one of the FD_* macros.

Specifically, we crash at line 146:

else if (FD_ISSET(fd, &fds)) {

So it crashed because we pass fd > 1023 to the FD_ISSET macro:

#9  0x00007f95a11ad495 in agentx_events_update () at ../lib/agentx.c:146
146     ../lib/agentx.c: No such file or directory.
(gdb) p fd
$11 = 1024
(gdb) p maxfd
$12 = 4027
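
To make the failure mode above concrete, here is a minimal sketch of range-checked wrappers around the FD_* macros. The helper names (fd_fits_fd_set, checked_fd_set, checked_fd_isset) are hypothetical and not part of FRR or glibc. The sketch only shows why fd = 1024 is out of range for a plain fd_set; it is not a fix, because descriptors above the limit would simply be ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper (not part of FRR): true when fd can be stored in a
 * plain fd_set, i.e. 0 <= fd < FD_SETSIZE (1024 on glibc). */
static bool fd_fits_fd_set(int fd)
{
	return fd >= 0 && fd < FD_SETSIZE;
}

/* Guarded FD_SET: refuses out-of-range descriptors instead of letting the
 * macro index past the end of the fixed-size bitmap. */
static bool checked_fd_set(int fd, fd_set *set)
{
	assert(set != NULL);
	if (!fd_fits_fd_set(fd))
		return false;
	FD_SET(fd, set);
	return true;
}

/* Guarded FD_ISSET: out-of-range descriptors are reported as "not set". */
static bool checked_fd_isset(int fd, fd_set *set)
{
	assert(set != NULL);
	return fd_fits_fd_set(fd) && FD_ISSET(fd, set);
}
```

With fortified builds, glibc routes FD_SET and FD_ISSET through a bounds check that aborts with "buffer overflow detected" for an out-of-range descriptor, which matches the __fdelt_warn frame in the backtrace.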

maxfd is filled in by snmp_select_info, which receives it as an output parameter:

static void agentx_events_update(void)
{
	int maxfd = 0;
	int block = 1;
	struct timeval timeout = {.tv_sec = 0, .tv_usec = 0};
	fd_set fds;
	struct listnode *ln;
	struct event **thr;
	int fd, thr_fd;

    ...

	FD_ZERO(&fds);
	snmp_select_info(&maxfd, &fds, &timeout, &block);


    ...

	/* "two-pointer" / two-list simultaneous iteration
	 * ln/thr/thr_fd point to the next existing event listener to hit while
	 * fd counts to catch up */
	for (fd = 0; fd < maxfd; fd++) {
		/* caught up */
		if (thr_fd == fd) {
			...
		}
		/* need listener, but haven't hit one where it would be */
		else if (FD_ISSET(fd, &fds)) {
            ...
		}
	}
    ...
}

I see that the net-snmp library internally uses a large FD set data structure and provides snmp_select_info2:

/*
     * snmp_select_info2() is similar to snmp_select_info(), but accepts a
     * pointer to a large file descriptor set instead of a pointer to a
     * regular file descriptor set.
     */
    NETSNMP_IMPORT
    int             snmp_select_info2(int *, netsnmp_large_fd_set *,
                                      struct timeval *, int *);

So, maybe the fix would be to use netsnmp_large_fd_set and corresponding NETSNMP_LARGE_FD_* macros?
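
As a rough illustration of the idea behind a "large" descriptor set, here is a standalone sketch of a heap-allocated bitmap whose capacity is chosen at run time instead of being fixed at FD_SETSIZE. The type struct big_fd_set and its functions are hypothetical names invented for this example; net-snmp's real netsnmp_large_fd_set has its own layout and API, which this does not reproduce.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical type for illustration only. */
struct big_fd_set {
	unsigned long *bits; /* one bit per descriptor */
	size_t nbits;        /* capacity in descriptors */
};

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Allocate a zeroed set able to hold descriptors 0 .. nbits-1. */
static bool big_fd_set_init(struct big_fd_set *s, size_t nbits)
{
	size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

	assert(s != NULL);
	s->bits = calloc(words ? words : 1, sizeof(unsigned long));
	s->nbits = s->bits ? nbits : 0;
	return s->bits != NULL;
}

static void big_fd_set_free(struct big_fd_set *s)
{
	free(s->bits);
	s->bits = NULL;
	s->nbits = 0;
}

/* Mark fd as present; returns false if fd is outside the set's capacity. */
static bool big_fd_set_add(struct big_fd_set *s, int fd)
{
	if (fd < 0 || (size_t)fd >= s->nbits)
		return false;
	s->bits[fd / BITS_PER_WORD] |= 1UL << (fd % BITS_PER_WORD);
	return true;
}

/* Test membership; out-of-range descriptors are reported as absent. */
static bool big_fd_set_has(const struct big_fd_set *s, int fd)
{
	if (fd < 0 || (size_t)fd >= s->nbits)
		return false;
	return (s->bits[fd / BITS_PER_WORD] >> (fd % BITS_PER_WORD)) & 1UL;
}
```

A set initialized with a capacity of 4096 can hold fd 1024 and the other descriptors below the observed maxfd of 4027, which a plain fd_set cannot.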

@donaldsharp
Member

In /etc/frr/daemons, what do you have MAX_FDS set to?

@stepanblyschak
Author

stepanblyschak commented Oct 19, 2023

@donaldsharp I think we don't use /etc/frr/daemons. FRR daemons are started by another program.
Do you mean it is a matter of some configuration?

BTW, on the SONiC system (UPD: inside the BGP container):

admin@arc-switch1004:~$ docker exec -it bgp bash
root@arc-switch1004:/# ulimit -n
1048576
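
For comparison with the ulimit output above, here is a small sketch that checks whether the open-file limit allows descriptors that cannot fit in a plain fd_set. The helper names are hypothetical; getrlimit and RLIMIT_NOFILE are the standard POSIX interface behind `ulimit -n`.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper: true when a process allowed `nofile` open
 * descriptors can obtain one that does not fit in a plain fd_set. */
static bool limit_exceeds_fd_setsize(rlim_t nofile)
{
	return nofile > (rlim_t)FD_SETSIZE;
}

/* Query the soft RLIMIT_NOFILE (the value `ulimit -n` prints) and compare.
 * Returns 1 if it exceeds FD_SETSIZE, 0 if not, -1 if getrlimit() fails. */
static int current_limit_exceeds_fd_setsize(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;
	/* Sanity check: the soft limit never exceeds the hard limit. */
	assert(rl.rlim_cur <= rl.rlim_max);
	return limit_exceeds_fd_setsize(rl.rlim_cur) ? 1 : 0;
}
```

With a limit of 1048576, limit_exceeds_fd_setsize returns true, so a process holding roughly 2k sessions can easily be handed descriptors above 1023.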

@donaldsharp
Member

FRR reads the incoming MAX_FDS and uses that. Please ensure that a value can be set in SONiC and that it is respected. In any event, I believe this is a side issue to the actual problem.

@ton31337 ton31337 closed this as completed Mar 7, 2024
StormLiangMS pushed a commit to sonic-net/sonic-buildimage that referenced this issue Apr 29, 2024
Why I did it
Upgrading FRR to 8.5.4 to include the latest fixes.

Work item tracking
Microsoft ADO (number only):
How I did it
New patches that were added:

Patch	FRR Pull request	Issue fixed
0024-lib-use-snmp-s-large-fd-sets-for-agentx.patch	FRRouting/frr#13396	FRRouting/frr#14143
0025-bgp-community-memory-leak-fix.patch	FRRouting/frr#15466	FRRouting/frr#15459
0026-bgp-fib-suppress-announce-fix.patch	FRRouting/frr#15634	FRRouting/frr#15626
0027-lib-Do-not-convert-EVPN-prefixes-into-IPv4-IPv6-if-n.patch	FRRouting/frr#15418	FRRouting/frr#14419
Removed patches:

Patch	Upstream FRR commit that is present in 8.5.4
0019-zebra-Abstract-dplane_ctx_route_init-to-init-route-w.patch	FRRouting/frr@3f01977
0020-zebra-Fix-crash-when-dplane_fpm_nl-fails-to-process-.patch	FRRouting/frr@fe5f624
0022-bgpd-Don-t-read-the-first-byte-of-ORF-header-if-we-a.patch	FRRouting/frr@3515178
0023-bgpd-Make-sure-we-have-enough-data-to-read-two-bytes.patch	FRRouting/frr@460ee93
0024-bgpd-Do-not-process-NLRIs-if-the-attribute-length-is.patch	FRRouting/frr@f291f1e
0025-bgpd-Use-treat-as-withdraw-for-tunnel-encapsulation-.patch	FRRouting/frr@8a4a88c
0026-zebra-Add-encap-type-when-building-packet-for-FPM.patch	FRRouting/frr@f0f7b28
0028-bgpd-Check-mandatory-attributes-more-carefully-for-U.patch	FRRouting/frr@21418d6
0029-bgpd-Handle-MP_REACH_NLRI-malformed-packets-with-ses.patch	FRRouting/frr@30b5c2a
0030-bgpd-Treat-EOR-as-withdrawn-to-avoid-unwanted-handli.patch	FRRouting/frr@01f232c
0031-bgpd-Ignore-handling-NLRIs-if-we-received-MP_UNREACH.patch	FRRouting/frr@a0c4ec2
0032-zebra-Fix-fpm-multipath-encap-addition.patch	FRRouting/frr@10a9a5f
Realigned patches:

Old Patch	New patch
0005-Add-support-of-bgp-l3vni-evpn.patch	0005-Add-support-of-bgp-l3vni-evpn.patch
0021-zebra-remove-duplicated-nexthops-when-sending-fpm-msg.patch	0019-zebra-remove-duplicated-nexthops-when-sending-fpm-msg.patch
0027-zebra-Fix-non-notification-of-better-admin-won.patch	0020-zebra-Fix-non-notification-of-better-admin-won.patch
Disable-ipv6-src-address-test-in-pceplib.patch	0021-Disable-ipv6-src-address-test-in-pceplib.patch
cross-compile-changes.patch	0022-cross-compile-changes.patch
0033-zebra-The-dplane_fpm_nl-return-path-leaks-memory.patch	0023-zebra-The-dplane_fpm_nl-return-path-leaks-memory.patch
How to verify it
Run the sonic-mgmt test suite.
4 participants