Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Asynchronous conference bridge operation #3928

Merged
merged 13 commits into from
Oct 29, 2024
Merged

Asynchronous conference bridge operation #3928

merged 13 commits into from
Oct 29, 2024

Conversation

nanangizz
Copy link
Member

@nanangizz nanangizz commented Apr 16, 2024

In #3183, video conference bridge adopts asynchronous ports operations to handle non-uniform lock ordering issues. This PR integrates the same mechanism for audio conference bridge.

Beside avoiding deadlock, in general, this change may improve performance a little bit as audio data processing in the conference bridge will be performed without holding a mutex. It gives a serious performance improvement on a heavily loaded system where the high-priority conference bridge thread does not leave time for another thread to hold the conference mutex.

There are some pending tasks (e.g: integrate this into audio switchboard).

Potential backward compatibility issues (major):

  1. Pool release immediately after port removal, consider this code:

    pool = pj_pool_create(...);
    pjmedia_some_port_create(pool, ..., &some_port);
    pjmedia_conf_add_port(conf, ..., some_port, ..., &port_id);
    ...
    pjmedia_conf_remove_port(conf, port_id);
    pjmedia_port_destroy(some_port);
    pj_pool_release(pool); /* <-- This is now a problem!
                                  It was fine with a synchronous pjmedia_conf_remove_port() */
    

    Note that this issue affects ports that do not use its own pool (i.e: it allocates its instance and all internal states using the supplied memory pool), so releasing the pool immediately after removing the port from conference bridge may cause crash as the port removal is now asynchronous and the port may be still being accessed by the conference bridge.

    Now, all PJMEDIA ports (except Bidirectional & Stereo) have been updated to use its own pool, so they should not be affected. If your application uses any custom PJMEDIA ports, the alternatives are:

    1. Update the custom PJMEDIA port to use its own pool, so it can control the pool lifetime using the port group lock (if the port does not have group lock, the conference will create one, so the group lock is optional).
    2. In case updating the PJMEDIA port is not applicable:
      • release the pool (and any other resources) from port's group lock destroy handler, the handler can be added using pjmedia_port_add_destroy_handler(port, ...) or pj_grp_lock_add_handler(port->grp_lock, ...) after adding the port to conference.
      • workaround (not safe, e.g: for simple/test app): adding a sufficient delay for the pool release, e.g: using pj_thread_sleep(). Calculating the delay needed is quite tricky. For example, a 100 ms delay for most/normal cases should be sufficient, but for a heavily loaded system may not.
  2. PJMEDIA Stream pattern.
    PJMEDIA Stream exports a PJMEDIA port interface which can be queried using pjmedia_stream_get_port(). Unfortunately the port does not implements pjmedia_port_destroy(), it is destroyed using pjmedia_stream_destroy() instead, which will release the memory pool. Similar to above issue, if pjmedia_stream_destroy() is invoked immediately after pjmedia_conf_remove_port(), it may cause crash.

    Now PJMEDIA Stream has a group lock, so pjmedia_stream_destroy() will just decrement the reference counter, the real destroy will be done when the reference counter reaches zero.

  3. Premature caching pool destroy.
    When all PJMEDIA ports have their own pools, there is still possibility that application uses its own pool factory for creating the ports and destroy the pool factory (e.g: invoking pj_caching_pool_destroy()) prematurely before the actual port removal is done.

    For example, in a library shutdown scenario, the conference bridge clock has been stopped so actual port removal can only be done later by pjmedia_conf_destroy()/pjsua_destroy(), but in between, application may destroy its pool factory which may cause crash. A possible solution is to delay the pool factory destroy to PJLIB deinitialization via pj_atexit().

  1. Due to the async nature of conf port removal, media port must be prepared to receive further get/put_frame() callbacks until the removal completes.

@sauwming
Copy link
Member

Since issue 2 is already handled, I assume it's no longer a problem (i.e. no longer causes backward compatibility)?

Which only leaves issue 1. I understand it will be quite a lot to change all existing pjmedia_port(s) to have its own pool, but since it's necessary for this feature, I vote to continue.

And if 1&2 are already resolved, the change in pjsua_app.c is no longer necessary, correct? (i.e. app will not need to change anything).

@nanangizz
Copy link
Member Author

Since issue 2 is already handled, I assume it's no longer a problem (i.e. no longer causes backward compatibility)?

Which only leaves issue 1. I understand it will be quite a lot to change all existing pjmedia_port(s) to have its own pool, but since it's necessary for this feature, I vote to continue.

The backward compatibility issues cannot be completely avoided, e.g:

  • if any custom/proprietary pjmedia ports are used, need to check/update them,
  • applications may need to be adjusted to avoid issue 3 (pool factory premature destroy).

And if 1&2 are already resolved, the change in pjsua_app.c is no longer necessary, correct? (i.e. app will not need to change anything).

Yes for pjsua_app.c, other apps may need adjustment (as mentioned above).

@sauwming
Copy link
Member

Regarding issue 3, I don't think app is allowed to call pj_caching_pool_destroy() before pjmedia_conf_destroy() since the conference creation uses the pool, i.e. pjmedia_conf_create(pool, ...)?
Also, using pj_atexit() doesn't seem possible since pjsip can be restarted (instead of shutdown).

The important thing here seems to be that pjmedia_conf_destroy() must guarantee the completion of all pending async operations, if any.

@nanangizz
Copy link
Member Author

Regarding issue 3, I don't think app is allowed to call pj_caching_pool_destroy() before pjmedia_conf_destroy() since the conference creation uses the pool, i.e. pjmedia_conf_create(pool, ...)? Also, using pj_atexit() doesn't seem possible since pjsip can be restarted (instead of shutdown).

Here is a sample scenario. App using PJSUA creates its own pool factory (instead of using PJSUA's) for some reason, and it uses the pool/factory to create a PJMEDIA port. It has to destroy the pool factory before shutting down PJSUA.

Re: lib (PJSUA?) restart, pj_atexit() will still be called right?

The important thing here seems to be that pjmedia_conf_destroy() must guarantee the completion of all pending async operations, if any.

Yes, pjmedia_conf_destroy() flushes all pending async ops.

@sauwming
Copy link
Member

Here is a sample scenario. App using PJSUA creates its own pool factory (instead of using PJSUA's) for some reason, and it uses the pool/factory to create a PJMEDIA port. It has to destroy the pool factory before shutting down PJSUA.

I see.

Re: lib (PJSUA?) restart, pj_atexit() will still be called right?

Ah, right. It's called upon pj_shutdown().

@nanangizz nanangizz marked this pull request as ready for review September 13, 2024 05:25
port->info.name.ptr));
pj_assert(!"Port using group lock should implement on_destroy()!");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we removing the assertion and error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of checking port.on_destroy is a kind of best effort to "detect" own pool availability to avoid premature destroy, I guess it should not cause a hard failure?
Some conditions may cause port.on_destroy not implemented while the port has own pool, e.g:

  • has its own destructor API, e.g: stream port has pjmedia_stream_destroy(), it has own pool but does not implement port.on_destroy.
  • old custom/app port implementation with group lock but do not publish it to port.grp_lock (the field is relatively newly added) may implement destroy via group lock instead of port.on_destroy.

@@ -93,6 +93,7 @@ typedef struct vconf_port
unsigned idx; /**< Port index. */
pj_str_t name; /**< Port name. */
pjmedia_port *port; /**< Video port. */
pj_bool_t is_new; /**< Port newly added? */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not quite clear with the "new port" check. Since it also applies to video, is it considered a bug fix for the current code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is kind of bug fix (for some reason, for video it causes less/no harm).

Currently adding a port to conf needs to be done synchronously to avoid premature destroy (if async, the conf may end-up adding/accessing a destroyed port). However, we cannot increment/change conf->port_cnt there as it is accessed by conf get_frame() without mutex protection, e.g:

for (i=0, ci=0; i<conf->max_ports && ci < conf->port_cnt; ++i) {

So, the "new port" check is introduced to be able to increment the conf->port_cnt safely.

@nanangizz nanangizz merged commit 4862352 into master Oct 29, 2024
36 checks passed
@nanangizz nanangizz deleted the async-aud-conf branch October 29, 2024 07:31
@trengginas
Copy link
Member

There's an issue when testing this.
Scenario using pjsua sample app:

  • set auto-answer=200
  • on incoming call, the ring port is connected to the sound device port. (the connect port is queued)
  • the call is answered, and the ring port is disconnected from the sound dev port (port connection does not exist)
  • the connect port action in the queue is processed
08:00:36.107    inv04AA1314  .......State changed from NULL to INCOMING, event=TSX_STATE
08:00:36.107    pjsua_aud.c  ..Conf connect: 4 --> 0
...
08:00:36.170   conference.c  ....Connect ports 4->0 requested
08:00:36.170   conference.c  ....Connect ports 4->0 queued
08:00:36.170   pjsua_call.c  ..Answering call 0: code=200
...
08:00:36.176    pjsua_app.c  .....Call 0 media 0 [type=audio], status is Active
08:00:36.177    pjsua_aud.c  .....Conf disconnect: 4 -x- 0
08:00:36.177   conference.c  .......Disconnect ports 4->0 requested
08:00:36.177   conference.c  .......Ports connection 4->0 does not exist
...
08:00:36.208   conference.c !.New port 3 (ringback) is added
08:00:36.209   conference.c  .New port 4 (ring) is added
08:00:36.209   conference.c  .New port 5 (sip:192.168.100.51) is added
08:00:36.210   conference.c  .Port 4 (ring) transmitting to port 0 (Wave mapper)
08:00:36.210   conference.c  .Port 5 (sip:192.168.100.51) transmitting to port 0 (Wave mapper)
08:00:36.210   conference.c  .Port 0 (Wave mapper) transmitting to port 5 (sip:192.168.100.51)

@sauwming
Copy link
Member

I did the pre-release test and this seems to fail the double-hold scenario. Only video seems to be affected (video will not fully recover).

@nanangizz
Copy link
Member Author

I did the pre-release test and this seems to fail the double-hold scenario. Only video seems to be affected (video will not fully recover).

See PR #4164

@sauwming
Copy link
Member

Thanks for fixing it!

Also, I noticed these additional two lines in pjsua app during shutdown:

13:13:33.343         sip_endpoint.c  .Endpoint 0x10ae0a2a8 destroyed
13:13:33.343               ringback  .Pool is not released by application, releasing now
13:13:33.343                   ring  .Pool is not released by application, releasing now

I suppose this is unavoidable due to issue 1 in the PR description above, correct?

@nanangizz
Copy link
Member Author

nanangizz commented Nov 20, 2024

Also, I noticed these additional two lines in pjsua app during shutdown:

13:13:33.343         sip_endpoint.c  .Endpoint 0x10ae0a2a8 destroyed
13:13:33.343               ringback  .Pool is not released by application, releasing now
13:13:33.343                   ring  .Pool is not released by application, releasing now

I suppose this is unavoidable due to issue 1 in the PR description above, correct?

Unfortunately no. It is a new bug, see #4166.
Nice catch btw.

@LeonidGoltsblat
Copy link
Contributor

LeonidGoltsblat commented Dec 31, 2024

Hi, everybody! I would like to leave feedback here.
Asynchronous commutation is a great idea!
And contrary to what the author wrote "this change may improve performance a little", it gives a serious performance improvement on a heavily loaded system where the high-priority conference bridge thread does not leave time for another thread to hold the conference mutex.

However, we are encountering several unexpected behavior changes and issues. Some of them are outlined below:

  • Memory Leak Issue: A new cause of memory leak has been identified due to implicit pool creation with each media and conference port creation. An example of a buggy application is pjmedia_test. Further details and fixes can be found in Memory leak caused by change in media port behavior due to implementation of asynchronous conference bridge #4235.

  • Deadlock Risk: The solution is still not fully free from deadlocks. The OP_REMOVE_PORT function calls the group lock handler while holding the conf->mutex. As with any callback invoked under a lock, there is a potential risk of deadlock. We cannot predict what actions the user’s application may take in the callback, for instance, it might acquire the SIP dialog mutex, which could lead to a lock ordering issue similar to the one described in bug report Lock ordering issue in video conference bridge #3183.
    I don't think it's necessary to call handle_op_queue under the mutex lock. Mutex protection is only required to protect conf->op_queue, not the handle_op_queue() function itself. This way, we can invoke the callback outside the conference mutex lock.

  • Loss of control over data lifetime: The application loses control over the data lifetime. The recommendation to add a 100ms delay before destroying data works fine for applications handling a few simultaneous calls. However, this approach limits performance to only 10 operations per second, which is insufficient for high-load systems. I think we need an API to add/remove the grp_lock port conf handle - it's simple and gives the application the ability to safely close all the resources it needs.

@nanangizz
Copy link
Member Author

Hi, everybody! I would like to leave feedback here. Asynchronous commutation is a great idea! And contrary to what the author wrote "this change may improve performance a little", it gives a serious performance improvement on a heavily loaded system where the high-priority conference bridge thread does not leave time for another thread to hold the conference mutex.

First, thanks for the feedbacks!

Re: peformance improvement note, I believe so especially for such condition. Will update the desc.

However, we are encountering several unexpected behavior changes and issues. Some of them are outlined below:

* Memory Leak Issue: A new cause of memory leak has been identified due to implicit pool creation with each media and conference port creation. An example of a buggy application is pjmedia_test. Further details and fixes can be found in [Memory leak caused by change in media port behavior due to implementation of asynchronous conference bridge #4235](https://github.com/pjsip/pjproject/pull/4235).

Thanks for the bug report & the fix.

* Deadlock Risk: The solution is still not fully free from deadlocks. The OP_REMOVE_PORT function calls the group lock handler while holding the conf->mutex. As with any callback invoked under a lock, there is a potential risk of deadlock. We cannot predict what actions the user’s application may take in the callback, for instance, it might acquire the SIP dialog mutex, which could lead to a lock ordering issue similar to the one described in bug report [Lock ordering issue in video conference bridge #3183](https://github.com/pjsip/pjproject/pull/3183).
  I don't think it's necessary to call handle_op_queue under the mutex lock. Mutex protection is only required to protect conf->op_queue, not the handle_op_queue() function itself. This way, we can invoke the callback outside the conference mutex lock.

You're right, will try to fix for both video & audio conf.

* Loss of control over data lifetime: The application loses control over the data lifetime. The recommendation to add a 100ms delay before destroying data works fine for applications handling a few simultaneous calls. However, this approach limits performance to only 10 operations per second, which is insufficient for high-load systems. I think we need an API to add/remove the grp_lock port conf handle - it's simple and gives the application the ability to safely close all the resources it needs.

Ideally app must check & update its ports to have its own pool (btw, the PJMEDIA tonegen and almost all PJMEDIA ports have been updated). It is not supposed to be a recommendation, just a sample of a possible workaround.

@nanangizz
Copy link
Member Author

Re: deadlock risk, please check #4243.

@LeonidGoltsblat
Copy link
Contributor

@nanangizz

Ideally app must check & update its ports to have its own pool (btw, the PJMEDIA tonegen and almost all PJMEDIA ports have been updated). It is not supposed to be a recommendation, just a sample of a possible workaround.

Real world applications may use pjsip as one of many other libraries and may not be written only in C/C++. The application manages resources outside the pjsip logic and may not be able to use the pool API to allocate memory, but in some cases the closing of application resources must be synchronized with the destruction of conference bridge ports.

For example, an application may implement a special media port proxy to play audio data stored in a BLOB field of a database. With an asynchronous switch, the application cannot close the database connection not only when performing a disconnect, but also when calling remove port, because the conference bridge may try to receive the next frame of data after that. The application must know the exact moment when to release resources, close the database connection, etc. This moment is the "grp_lock handle call".

@nanangizz
Copy link
Member Author

@LeonidGoltsblat

You're right, so far I missed the idea and your suggestion that group lock destroy handler can be used as a proper solution to avoid premature pool destroy! Updated the PR desc.

Btw, app can access the port's group lock directly and use the group lock APIs to add/remove handler. Anyway, added helper APIs in #4244. I cannot add you as a reviewer somehow, but feel free to review. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants