Asynchronous conference bridge operation #3928

nanangizz · 2024-04-16T07:29:28Z

In #3183, video conference bridge adopts asynchronous ports operations to handle non-uniform lock ordering issues. This PR integrates the same mechanism for audio conference bridge.

Beside avoiding deadlock, in general, this change may improve performance a little bit as audio data processing in the conference bridge will be performed without holding a mutex. It gives a serious performance improvement on a heavily loaded system where the high-priority conference bridge thread does not leave time for another thread to hold the conference mutex.

There are some pending tasks (e.g: integrate this into audio switchboard).

Potential backward compatibility issues (major):

Pool release immediately after port removal, consider this code:
```
pool = pj_pool_create(...);
pjmedia_some_port_create(pool, ..., &some_port);
pjmedia_conf_add_port(conf, ..., some_port, ..., &port_id);
...
pjmedia_conf_remove_port(conf, port_id);
pjmedia_port_destroy(some_port);
pj_pool_release(pool); /* <-- This is now a problem!
                              It was fine with a synchronous pjmedia_conf_remove_port() */
```
Note that this issue affects ports that do not use its own pool (i.e: it allocates its instance and all internal states using the supplied memory pool), so releasing the pool immediately after removing the port from conference bridge may cause crash as the port removal is now asynchronous and the port may be still being accessed by the conference bridge.

Now, all PJMEDIA ports (except Bidirectional & Stereo) have been updated to use its own pool, so they should not be affected. If your application uses any custom PJMEDIA ports, the alternatives are:
1. Update the custom PJMEDIA port to use its own pool, so it can control the pool lifetime using the port group lock (if the port does not have group lock, the conference will create one, so the group lock is optional).
2. In case updating the PJMEDIA port is not applicable:
  - release the pool (and any other resources) from port's group lock destroy handler, the handler can be added using pjmedia_port_add_destroy_handler(port, ...) or pj_grp_lock_add_handler(port->grp_lock, ...) after adding the port to conference.
  - workaround (not safe, e.g: for simple/test app): adding a sufficient delay for the pool release, e.g: using pj_thread_sleep(). Calculating the delay needed is quite tricky. For example, a 100 ms delay for most/normal cases should be sufficient, but for a heavily loaded system may not.
PJMEDIA Stream pattern.
PJMEDIA Stream exports a PJMEDIA port interface which can be queried using pjmedia_stream_get_port(). Unfortunately the port does not implements pjmedia_port_destroy(), it is destroyed using pjmedia_stream_destroy() instead, which will release the memory pool. Similar to above issue, if pjmedia_stream_destroy() is invoked immediately after pjmedia_conf_remove_port(), it may cause crash.

Now PJMEDIA Stream has a group lock, so pjmedia_stream_destroy() will just decrement the reference counter, the real destroy will be done when the reference counter reaches zero.
Premature caching pool destroy.
When all PJMEDIA ports have their own pools, there is still possibility that application uses its own pool factory for creating the ports and destroy the pool factory (e.g: invoking pj_caching_pool_destroy()) prematurely before the actual port removal is done.

For example, in a library shutdown scenario, the conference bridge clock has been stopped so actual port removal can only be done later by pjmedia_conf_destroy()/pjsua_destroy(), but in between, application may destroy its pool factory which may cause crash. A possible solution is to delay the pool factory destroy to PJLIB deinitialization via pj_atexit().

Due to the async nature of conf port removal, media port must be prepared to receive further get/put_frame() callbacks until the removal completes.

sauwming · 2024-04-16T08:59:06Z

Since issue 2 is already handled, I assume it's no longer a problem (i.e. no longer causes backward compatibility)?

Which only leaves issue 1. I understand it will be quite a lot to change all existing pjmedia_port(s) to have its own pool, but since it's necessary for this feature, I vote to continue.

And if 1&2 are already resolved, the change in pjsua_app.c is no longer necessary, correct? (i.e. app will not need to change anything).

nanangizz · 2024-04-16T09:39:27Z

Since issue 2 is already handled, I assume it's no longer a problem (i.e. no longer causes backward compatibility)?

Which only leaves issue 1. I understand it will be quite a lot to change all existing pjmedia_port(s) to have its own pool, but since it's necessary for this feature, I vote to continue.

The backward compatibility issues cannot be completely avoided, e.g:

if any custom/proprietary pjmedia ports are used, need to check/update them,
applications may need to be adjusted to avoid issue 3 (pool factory premature destroy).

And if 1&2 are already resolved, the change in pjsua_app.c is no longer necessary, correct? (i.e. app will not need to change anything).

Yes for pjsua_app.c, other apps may need adjustment (as mentioned above).

sauwming · 2024-04-17T02:20:23Z

Regarding issue 3, I don't think app is allowed to call pj_caching_pool_destroy() before pjmedia_conf_destroy() since the conference creation uses the pool, i.e. pjmedia_conf_create(pool, ...)?
Also, using pj_atexit() doesn't seem possible since pjsip can be restarted (instead of shutdown).

The important thing here seems to be that pjmedia_conf_destroy() must guarantee the completion of all pending async operations, if any.

nanangizz · 2024-04-17T03:06:47Z

Regarding issue 3, I don't think app is allowed to call pj_caching_pool_destroy() before pjmedia_conf_destroy() since the conference creation uses the pool, i.e. pjmedia_conf_create(pool, ...)? Also, using pj_atexit() doesn't seem possible since pjsip can be restarted (instead of shutdown).

Here is a sample scenario. App using PJSUA creates its own pool factory (instead of using PJSUA's) for some reason, and it uses the pool/factory to create a PJMEDIA port. It has to destroy the pool factory before shutting down PJSUA.

Re: lib (PJSUA?) restart, pj_atexit() will still be called right?

The important thing here seems to be that pjmedia_conf_destroy() must guarantee the completion of all pending async operations, if any.

Yes, pjmedia_conf_destroy() flushes all pending async ops.

sauwming · 2024-04-17T03:21:23Z

Here is a sample scenario. App using PJSUA creates its own pool factory (instead of using PJSUA's) for some reason, and it uses the pool/factory to create a PJMEDIA port. It has to destroy the pool factory before shutting down PJSUA.

I see.

Re: lib (PJSUA?) restart, pj_atexit() will still be called right?

Ah, right. It's called upon pj_shutdown().

…port check in video conference (to avoid vconf put/get frame from just-added-yet-unsynced ports).

…nto async-aud-conf

sauwming · 2024-10-25T01:58:29Z

pjmedia/src/pjmedia/port.c

                             port->info.name.ptr));
-        pj_assert(!"Port using group lock should implement on_destroy()!");


why are we removing the assertion and error?

The purpose of checking port.on_destroy is a kind of best effort to "detect" own pool availability to avoid premature destroy, I guess it should not cause a hard failure?
Some conditions may cause port.on_destroy not implemented while the port has own pool, e.g:

has its own destructor API, e.g: stream port has pjmedia_stream_destroy(), it has own pool but does not implement port.on_destroy.

old custom/app port implementation with group lock but do not publish it to port.grp_lock (the field is relatively newly added) may implement destroy via group lock instead of port.on_destroy.

sauwming · 2024-10-25T02:03:51Z

pjmedia/src/pjmedia/vid_conf.c

@@ -93,6 +93,7 @@ typedef struct vconf_port
    unsigned             idx;           /**< Port index.                    */
    pj_str_t             name;          /**< Port name.                     */
    pjmedia_port        *port;          /**< Video port.                    */
+    pj_bool_t            is_new;        /**< Port newly added?              */


I'm still not quite clear with the "new port" check. Since it also applies to video, is it considered a bug fix for the current code?

Yes, it is kind of bug fix (for some reason, for video it causes less/no harm).

Currently adding a port to conf needs to be done synchronously to avoid premature destroy (if async, the conf may end-up adding/accessing a destroyed port). However, we cannot increment/change conf->port_cnt there as it is accessed by conf get_frame() without mutex protection, e.g:

pjproject/pjmedia/src/pjmedia/conference.c

Line 2387 in ed32a91

for (i=0, ci=0; i<conf->max_ports && ci < conf->port_cnt; ++i) {

So, the "new port" check is introduced to be able to increment the conf->port_cnt safely.

trengginas · 2024-10-30T01:11:08Z

There's an issue when testing this.
Scenario using pjsua sample app:

set auto-answer=200
on incoming call, the ring port is connected to the sound device port. (the connect port is queued)
the call is answered, and the ring port is disconnected from the sound dev port (port connection does not exist)
the connect port action in the queue is processed

08:00:36.107    inv04AA1314  .......State changed from NULL to INCOMING, event=TSX_STATE
08:00:36.107    pjsua_aud.c  ..Conf connect: 4 --> 0
...
08:00:36.170   conference.c  ....Connect ports 4->0 requested
08:00:36.170   conference.c  ....Connect ports 4->0 queued
08:00:36.170   pjsua_call.c  ..Answering call 0: code=200
...
08:00:36.176    pjsua_app.c  .....Call 0 media 0 [type=audio], status is Active
08:00:36.177    pjsua_aud.c  .....Conf disconnect: 4 -x- 0
08:00:36.177   conference.c  .......Disconnect ports 4->0 requested
08:00:36.177   conference.c  .......Ports connection 4->0 does not exist
...
08:00:36.208   conference.c !.New port 3 (ringback) is added
08:00:36.209   conference.c  .New port 4 (ring) is added
08:00:36.209   conference.c  .New port 5 (sip:192.168.100.51) is added
08:00:36.210   conference.c  .Port 4 (ring) transmitting to port 0 (Wave mapper)
08:00:36.210   conference.c  .Port 5 (sip:192.168.100.51) transmitting to port 0 (Wave mapper)
08:00:36.210   conference.c  .Port 0 (Wave mapper) transmitting to port 5 (sip:192.168.100.51)

sauwming · 2024-11-19T06:45:12Z

I did the pre-release test and this seems to fail the double-hold scenario. Only video seems to be affected (video will not fully recover).

nanangizz · 2024-11-20T04:44:14Z

I did the pre-release test and this seems to fail the double-hold scenario. Only video seems to be affected (video will not fully recover).

See PR #4164

sauwming · 2024-11-20T05:24:00Z

Thanks for fixing it!

Also, I noticed these additional two lines in pjsua app during shutdown:

13:13:33.343         sip_endpoint.c  .Endpoint 0x10ae0a2a8 destroyed
13:13:33.343               ringback  .Pool is not released by application, releasing now
13:13:33.343                   ring  .Pool is not released by application, releasing now

I suppose this is unavoidable due to issue 1 in the PR description above, correct?

nanangizz · 2024-11-20T08:05:27Z

Also, I noticed these additional two lines in pjsua app during shutdown:
13:13:33.343         sip_endpoint.c  .Endpoint 0x10ae0a2a8 destroyed
13:13:33.343               ringback  .Pool is not released by application, releasing now
13:13:33.343                   ring  .Pool is not released by application, releasing now
I suppose this is unavoidable due to issue 1 in the PR description above, correct?

Unfortunately no. It is a new bug, see #4166.
Nice catch btw.

LeonidGoltsblat · 2024-12-31T21:31:25Z

Hi, everybody! I would like to leave feedback here.
Asynchronous commutation is a great idea!
And contrary to what the author wrote "this change may improve performance a little", it gives a serious performance improvement on a heavily loaded system where the high-priority conference bridge thread does not leave time for another thread to hold the conference mutex.

However, we are encountering several unexpected behavior changes and issues. Some of them are outlined below:

Memory Leak Issue: A new cause of memory leak has been identified due to implicit pool creation with each media and conference port creation. An example of a buggy application is pjmedia_test. Further details and fixes can be found in Memory leak caused by change in media port behavior due to implementation of asynchronous conference bridge #4235.
Deadlock Risk: The solution is still not fully free from deadlocks. The OP_REMOVE_PORT function calls the group lock handler while holding the conf->mutex. As with any callback invoked under a lock, there is a potential risk of deadlock. We cannot predict what actions the user’s application may take in the callback, for instance, it might acquire the SIP dialog mutex, which could lead to a lock ordering issue similar to the one described in bug report Lock ordering issue in video conference bridge #3183.
I don't think it's necessary to call handle_op_queue under the mutex lock. Mutex protection is only required to protect conf->op_queue, not the handle_op_queue() function itself. This way, we can invoke the callback outside the conference mutex lock.
Loss of control over data lifetime: The application loses control over the data lifetime. The recommendation to add a 100ms delay before destroying data works fine for applications handling a few simultaneous calls. However, this approach limits performance to only 10 operations per second, which is insufficient for high-load systems. I think we need an API to add/remove the grp_lock port conf handle - it's simple and gives the application the ability to safely close all the resources it needs.

nanangizz · 2025-01-06T06:28:22Z

Hi, everybody! I would like to leave feedback here. Asynchronous commutation is a great idea! And contrary to what the author wrote "this change may improve performance a little", it gives a serious performance improvement on a heavily loaded system where the high-priority conference bridge thread does not leave time for another thread to hold the conference mutex.

First, thanks for the feedbacks!

Re: peformance improvement note, I believe so especially for such condition. Will update the desc.

However, we are encountering several unexpected behavior changes and issues. Some of them are outlined below:

* Memory Leak Issue: A new cause of memory leak has been identified due to implicit pool creation with each media and conference port creation. An example of a buggy application is pjmedia_test. Further details and fixes can be found in [Memory leak caused by change in media port behavior due to implementation of asynchronous conference bridge #4235](https://github.com/pjsip/pjproject/pull/4235).

Thanks for the bug report & the fix.

* Deadlock Risk: The solution is still not fully free from deadlocks. The OP_REMOVE_PORT function calls the group lock handler while holding the conf->mutex. As with any callback invoked under a lock, there is a potential risk of deadlock. We cannot predict what actions the user’s application may take in the callback, for instance, it might acquire the SIP dialog mutex, which could lead to a lock ordering issue similar to the one described in bug report [Lock ordering issue in video conference bridge #3183](https://github.com/pjsip/pjproject/pull/3183).
  I don't think it's necessary to call handle_op_queue under the mutex lock. Mutex protection is only required to protect conf->op_queue, not the handle_op_queue() function itself. This way, we can invoke the callback outside the conference mutex lock.

You're right, will try to fix for both video & audio conf.

* Loss of control over data lifetime: The application loses control over the data lifetime. The recommendation to add a 100ms delay before destroying data works fine for applications handling a few simultaneous calls. However, this approach limits performance to only 10 operations per second, which is insufficient for high-load systems. I think we need an API to add/remove the grp_lock port conf handle - it's simple and gives the application the ability to safely close all the resources it needs.

Ideally app must check & update its ports to have its own pool (btw, the PJMEDIA tonegen and almost all PJMEDIA ports have been updated). It is not supposed to be a recommendation, just a sample of a possible workaround.

nanangizz · 2025-01-06T10:49:19Z

Re: deadlock risk, please check #4243.

LeonidGoltsblat · 2025-01-06T20:16:12Z

@nanangizz

Ideally app must check & update its ports to have its own pool (btw, the PJMEDIA tonegen and almost all PJMEDIA ports have been updated). It is not supposed to be a recommendation, just a sample of a possible workaround.

Real world applications may use pjsip as one of many other libraries and may not be written only in C/C++. The application manages resources outside the pjsip logic and may not be able to use the pool API to allocate memory, but in some cases the closing of application resources must be synchronized with the destruction of conference bridge ports.

For example, an application may implement a special media port proxy to play audio data stored in a BLOB field of a database. With an asynchronous switch, the application cannot close the database connection not only when performing a disconnect, but also when calling remove port, because the conference bridge may try to receive the next frame of data after that. The application must know the exact moment when to release resources, close the database connection, etc. This moment is the "grp_lock handle call".

nanangizz · 2025-01-07T03:04:02Z

@LeonidGoltsblat

You're right, so far I missed the idea and your suggestion that group lock destroy handler can be used as a proper solution to avoid premature pool destroy! Updated the PR desc.

Btw, app can access the port's group lock directly and use the group lock APIs to add/remove handler. Anyway, added helper APIs in #4244. I cannot add you as a reviewer somehow, but feel free to review. Thanks again!

Initial version

b46570b

nanangizz requested review from trengginas and sauwming April 16, 2024 07:44

nanangizz added component: pjmedia type: enhancement component: pjsua-lib component: applications labels Apr 16, 2024

nanangizz added this to the release-2.15 milestone Apr 16, 2024

nanangizz added 11 commits June 6, 2024 17:16

Merge branch 'master' into async-aud-conf

ae5f6ff

Updates PJMEDIA ports to have their own pool, update docs, apply new …

7b422ec

…port check in video conference (to avoid vconf put/get frame from just-added-yet-unsynced ports).

Initial version

d65b662

Updates PJMEDIA ports to have their own pool, update docs, apply new …

128638f

…port check in video conference (to avoid vconf put/get frame from just-added-yet-unsynced ports).

Merge branch 'async-aud-conf' of https://github.com/pjsip/pjproject i…

7072e3c

…nto async-aud-conf

Merge remote-tracking branch 'origin/master' into async-aud-conf

3d5dfad

Fix avi streams group lock handler.

cb52d09

Fix premature destroy of cstream objects (codec, jb, etc).

9f5a211

Add log for debugging

f11b296

Schedule pjsua app pool release using pjsip_endpt_atexit().

5558775

Merge remote-tracking branch 'origin/master' into async-aud-conf

3717170

nanangizz marked this pull request as ready for review September 13, 2024 05:25

Fix init_grp_lock() to add_ref() before add_handler().

ed32a91

sauwming reviewed Oct 25, 2024

View reviewed changes

sauwming approved these changes Oct 25, 2024

View reviewed changes

trengginas approved these changes Oct 29, 2024

View reviewed changes

nanangizz merged commit 4862352 into master Oct 29, 2024
36 checks passed

nanangizz deleted the async-aud-conf branch October 29, 2024 07:31

nanangizz mentioned this pull request Oct 30, 2024

Removed unnecessary prechecks in async conference bridge operations #4121

Merged

nanangizz mentioned this pull request Nov 12, 2024

Fix sound device auto close after conference bridge updated to asynchronous #4148

Merged

nanangizz mentioned this pull request Nov 20, 2024

Fix audio & video conference bridge: adding port may not update port counter #4164

Merged

nanangizz mentioned this pull request Nov 20, 2024

Tonegen releases pool in wrong function #4166

Merged

This was referenced Dec 2, 2024

Fixed video media transport leak #4189

Merged

Fixed bug in audio conference disconnection #4198

Merged

Fixed AudioMediaPort use after free #4200

Merged

nanangizz mentioned this pull request Dec 10, 2024

Crash in destroying extra audio device #4208

Merged

LeonidGoltsblat mentioned this pull request Dec 31, 2024

Memory leak caused by change in media port behavior due to implementation of asynchronous conference bridge #4235

Closed

LeonidGoltsblat mentioned this pull request Jan 5, 2025

Parallel conference bridge #4241

Open

nanangizz mentioned this pull request Jan 6, 2025

Fix possible deadlock in conference (audio & video) #4243

Merged

nanangizz mentioned this pull request Jan 7, 2025

Add helper APIs for adding & removing port destroy handler #4244

Merged

sauwming mentioned this pull request Jan 13, 2025

Fixed memory leak in mips test #4247

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Asynchronous conference bridge operation #3928

Asynchronous conference bridge operation #3928

nanangizz commented Apr 16, 2024 •

edited

Loading

sauwming commented Apr 16, 2024

nanangizz commented Apr 16, 2024

sauwming commented Apr 17, 2024

nanangizz commented Apr 17, 2024

sauwming commented Apr 17, 2024

sauwming Oct 25, 2024

nanangizz Oct 25, 2024

sauwming Oct 25, 2024

nanangizz Oct 25, 2024

trengginas commented Oct 30, 2024

sauwming commented Nov 19, 2024

nanangizz commented Nov 20, 2024

sauwming commented Nov 20, 2024

nanangizz commented Nov 20, 2024 •

edited

Loading

LeonidGoltsblat commented Dec 31, 2024 •

edited

Loading

nanangizz commented Jan 6, 2025

nanangizz commented Jan 6, 2025

LeonidGoltsblat commented Jan 6, 2025

nanangizz commented Jan 7, 2025

		port->info.name.ptr));
		pj_assert(!"Port using group lock should implement on_destroy()!");

Asynchronous conference bridge operation #3928

Asynchronous conference bridge operation #3928

Conversation

nanangizz commented Apr 16, 2024 • edited Loading

Potential backward compatibility issues (major):

sauwming commented Apr 16, 2024

nanangizz commented Apr 16, 2024

sauwming commented Apr 17, 2024

nanangizz commented Apr 17, 2024

sauwming commented Apr 17, 2024

sauwming Oct 25, 2024

Choose a reason for hiding this comment

nanangizz Oct 25, 2024

Choose a reason for hiding this comment

sauwming Oct 25, 2024

Choose a reason for hiding this comment

nanangizz Oct 25, 2024

Choose a reason for hiding this comment

trengginas commented Oct 30, 2024

sauwming commented Nov 19, 2024

nanangizz commented Nov 20, 2024

sauwming commented Nov 20, 2024

nanangizz commented Nov 20, 2024 • edited Loading

LeonidGoltsblat commented Dec 31, 2024 • edited Loading

nanangizz commented Jan 6, 2025

nanangizz commented Jan 6, 2025

LeonidGoltsblat commented Jan 6, 2025

nanangizz commented Jan 7, 2025

nanangizz commented Apr 16, 2024 •

edited

Loading

nanangizz commented Nov 20, 2024 •

edited

Loading

LeonidGoltsblat commented Dec 31, 2024 •

edited

Loading