Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

idc: zephyr: set of fixes to IDC in multicore scenario #8753

Closed
wants to merge 3 commits into from

Conversation

kv2019i
Copy link
Collaborator

@kv2019i kv2019i commented Jan 17, 2024

No description provided.

Replace infinite wait with a time-limited wait. In case a blocking
IDC message is not handled within IDC_TIMEOUT, return an error
instead of waiting. This allows to debug stuck core-to-core communication
easier.

Signed-off-by: Kai Vehmanen <[email protected]>
The IDC message objects are stored to static "idc_work", so
the cache operations are unnecessary when pointers to these
messages are passed around.

Signed-off-by: Kai Vehmanen <[email protected]>
@kv2019i kv2019i requested a review from abonislawski January 17, 2024 21:46
@@ -73,8 +73,7 @@ static void idc_handler(struct k_p4wq_work *work)
int payload = -1;
k_spinlock_key_t key;

/* A message is received from another core, invalidate local cache */
sys_cache_data_invd_range(msg, sizeof(*msg));
__ASSERT_NO_MSG(!is_cached(msg));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could just remove these asserts in the final patch. It's nice here in the PR review, but I wonder should this be just dropped.

@lyakh any comment, you added these cache ops in the original version?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kv2019i looks correct

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

best to keep and modify any inline comments around cache handling code. i.e. why are we checking the ptr type.

src/idc/zephyr_idc.c Outdated Show resolved Hide resolved

if (zmsg->work_status == WORK_SUBMITTED) {
work->sync = 1;
ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't other work fields have to be set? E.g. .handler is called unconditionally.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lyakh This is just to influence p4wq_wait() behaviour. We want to sleep on the semaphore (if send_msg is called with BLOCKING flag), but the semaphore is private, so we need to set the work->sync flag to get the same effect.

In V2 version, I've modified this so to set the flag conditionally. If send_msg is called without BLOCKING flag, no need to wait here.

work->sync = 1;
ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
if (ret < 0 && zmsg->work_status == WORK_SUBMITTED)
return ret;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have we ever seen this race occurring? Between what and what? Non-blocking operations usually return an error when the action cannot be completed immediately. Wouldn't that be better?
In fact this seems racy too. What if the handler completes between checking for WORK_SUBMITTED and calling k_p4wq_wait()? Then we'll never be woken up?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can hit this on a MTL with modified multicore tplg. Still trying to find more info on the scenario as I agree this should not happen normally (and it's rare!). I now reworked the wait logic to only rely o p4wq semaphore, so this should be safe against races (and dropped the work_status field).

A race exists between submitting a new IDC message and completion
of a previous message. The p4wq interface mandates k_p4wq_wait() must
be called to claim back ownership of the work object.  The SOF IDC
code does not do this, but relies on the caller not to send multiple
IDC messages to the same destination core.

Add appropriate calls to k_p4wq_wait() to handle this issue. If caller
wants to send an IDC in blocking mode, also call k_p4wq_wait() in
blocking mode. If non-blocking, return -EBUSY immediately.

The check for CPU status is moved earlier in the function. No pointing
in waiting for p4wq semaphore if the target core is already powered
down, so do the power state check as the first step.

Signed-off-by: Kai Vehmanen <[email protected]>
@kv2019i
Copy link
Collaborator Author

kv2019i commented Jan 18, 2024

This is not necessarily fixing rootcauses, but I believe this is making the code more robust and can help in debug. Without this PR, if an ID is sent before previous one is compltely handled by the p4wq thread (of the destionation core), we enter unchartered territory. In debug builds, one will get following assert in these cases:

[    3.259940] <inf> ipc: ipc_cmd: rx   : 0x13010003|0x0
ASSERTION FAIL [item->thread == ((void *)0)] @ ZEPHYR_BASE/lib/os/p4wq.c:236

This PR ensure we will not reuse the work object until p4wq is finished with it.

@kv2019i kv2019i requested review from lyakh and RanderWang January 18, 2024 13:27
@kv2019i
Copy link
Collaborator Author

kv2019i commented Jan 18, 2024

@lyakh Here's one concrete case where this is hit. A set of set-pipeline-state IPCs, sent over to core1 with a non-blocking IDC. Processign one is delayed (I suspect same rootcause as #8756 ), and when the next set-pipeline-state is sent, the IDC work queue is busy:

The real rootcause is most probably IPC processing taking too much time, but without this PR, the work item is silently reused in the IDC thread and results can be unpredictable.

[   57.796150] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 1 timer avg 5635, max 23095, overruns 1
[   58.820255] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 1 timer avg 5533, max 8715, overruns 0
[   59.844440] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 1 timer avg 5528, max 8226, overruns 0
[   59.862658] <inf> ipc: ipc_cmd: rx   : 0x13000003|0x0
[   59.863898] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 2
[   59.864890] <inf> pipe: pipeline_trigger: Sum of KCPS consumption: 6388, core: 1
[   59.866810] <inf> ll_schedule: zephyr_ll_task_done: task complete 0xa0115680 0x20210U
[   59.866895] <inf> ll_schedule: zephyr_ll_task_done: num_tasks 2 total_num_tasks 2
[   59.867320] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 80 < period bytes 384, possible glitch
[   59.867366] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.868305] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.868350] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.869365] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.869410] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.870390] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.870435] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.871400] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.871446] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.872293] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.872338] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.873373] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.873420] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.874380] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.874425] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.875361] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.875406] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.876711] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.876758] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.877400] <inf> ipc: ipc_cmd: rx   : 0x13010003|0x0
[   59.877545] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.877583] <err> zephyr_idc: idc_send_msg: idc_send_msg error -16, target core 1
[   59.877596] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   59.877635] <err> ipc: ipc_cmd: ipc4: FW_GEN_MSG failed with err 4
[   59.878371] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   59.878416] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy

Copy link
Collaborator

@RanderWang RanderWang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like defensive programming.

@lyakh
Copy link
Collaborator

lyakh commented Jan 19, 2024

A set of set-pipeline-state IPCs, sent over to core1 with a non-blocking IDC. Processign one is delayed (I suspect same rootcause as #8756 ), and when the next set-pipeline-state is sent, the IDC work queue is busy:

@kv2019i this seems to be a problem right there: the next IPC must not be sent by the host before the previous one has been processed and responded to! We've spent a lot of effort tracing all possible out-of-order IPC processing flows (LL-thread, other core, LL-thread on other core) and making sure we really complete processing one IPC before the next one can arrive!

Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me -1 this to make sure it isn't merged until fully understood

@kv2019i
Copy link
Collaborator Author

kv2019i commented Jan 19, 2024

@lyakh wrote.

A set of set-pipeline-state IPCs, sent over to core1 with a non-blocking IDC. Processign one is delayed (I suspect same
rootcause as #8756 ), and when the next set-pipeline-state is sent, the IDC work queue is busy:

@kv2019i this seems to be a problem right there: the next IPC must not be sent by the host before the previous one has been processed and responded to! We've spent a lot of effort tracing all possible out-of-order IPC processing flows (LL-thread, other core, LL-thread on other core) and making sure we really complete processing one IPC before the next one can arrive!

I'm not claiming the calls to idc_send_msg() are valid and the PR will not allow such cases to pass. But what this PR does is to explicitly catch misuse of the interface and raise an error instead of silently proceeding to use a worker object on core X while ownership is still on core Y (as per p4wq interface semantics).

In the specific case, these are no multiple IPCs concurrently. A set-pipeline IPC to core1 is handled, a reply is sent from core1, but core1 is busy with other work and the p4wq thread does not signal work to be ready before core0 sends the next one IPC to core1.
This is not a normal case, core1 is overloaded for some reason, but my point remains that we should not silently override the busy IDC work object in this case (especially as in assert-enabled build this will hit an assert but in release builds it will not).

Here's one log where this happens:

KV: first set-pipeline state
[   14.790320] <inf> ipc: ipc_cmd: rx   : 0x13000003|0x0
KV: passed to core 1
[   14.790400] <err> zephyr_idc: idc_send_msg: IDC@1: TX 4000000 (IPC 4010f380)
KV: at 14.791390 the core1 p4wq thread start running, and handles the IPC 
[   14.791390] <err> zephyr_idc: idc_handler: IDC@1: TX 4000000 worker (IPC 4010f380)
[   14.791471] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 2
[   14.791908] <inf> pipe: pipeline_trigger: Sum of KCPS consumption: 6388, core: 1
[   14.793371] <inf> ll_schedule: zephyr_ll_task_done: task complete 0xa0115680 0x20210U
[   14.793418] <inf> ll_schedule: zephyr_ll_task_done: num_tasks 2 total_num_tasks 2
[   14.794043] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 48 < period bytes 384, possible glitch
[   14.794148] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
KV: IPC reply is sent to host, host can send a new IPC (but p4wq worker has NOT signaled the semaphore yet)
[   14.794401] <inf> ipc: ipc_cmd: tx-reply     : 0xb3000000|0x0
[   14.795236] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   14.795281] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
[   14.796288] <err> zephyr_idc: idc_handler: IDC@1: TX 4000000 done (IPC 4010f380)
[   14.796066] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_common_copy(): Copy_bytes 0 + avail bytes 0 < period bytes 384, possible glitch
[   14.796113] <wrn> dai_comp: dai_common_copy: comp:1 0x10004 dai_zephyr_copy(): nothing to copy
KV: next set-pipeline-state IPC comes from host
[   14.796140] <inf> ipc: ipc_cmd: rx   : 0x13010003|0x0
KV: passed to core1
[   14.796225] <err> zephyr_idc: idc_send_msg: IDC@1: TX 4000000 (IPC 4010f380)
KV: and bam, the error. obviously something is wrong on core1 as the semaphore has not been signaled, but without this PR we'd start using the worker object while it's still held by the p4wq worker thread 
[   14.796275] <err> zephyr_idc: idc_send_msg: idc_send_msg error -16, target core 1
[   14.796325] <err> ipc: ipc_cmd: ipc4: FW_GEN_MSG failed with err 4

* In all other cases, we must use k_p4wq_wait() before reuse
* of the object.
*/
if (work->handler) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thing is that IIUC this check is unprotected, so 2 cores can simultaneously check this and proceed to use the work object simultaneously.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lyakh This field is only set once object lifetime. It is NULL initially (as a static object in BSS). When the first message is sent to a core, handler is set to "handler = idc_handler".

I.e. this check is only to prevent p4wq_wait on an uninitialized work queue object.

I guess I could just skip this and and call p4wq_wait() always, but this is depending on p4wq internals a bit to work as expect (so that wait returns immediately if the object is fresh out from BSS).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kv2019i rrright, yes, sorry, forgot about this, but I still think there's a race possibility here: it's the "check-and-take" operation that should be atomic and safe, right? And here it isn't. You added a wait in line 141, but before we submit a new work in line 171, it might be unsafe again? Another core might send an IDC to the same target core and the target core might already be processing it again?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andyross is my understanding of this P4WQ use-case correct? What's the right way to handle this (see above comment)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nashif @ceolin @teburd can anyone help on P4WQ question above ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just seeing this question now, apologies: The API promises that after p4wq_wait() returns the item has been executed and the queue itself won't touch it anymore. It doesn't provide any synchronization otherwise, in particular if this code itself may be executed from more than one core then it can indeed race on the single work item.

It looks to me like there's one work item instantiated for each target CPU, so on 3+ core systems that seems like a real risk unless the is more synchronization at a higher level. You could fix that by implementing a synchronized pool of such work items (c.f. k_mem_slab), or by having one for each unique sender/receiver pair (i.e. 6 on MTL, 12 on TGL).

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I can take any new inline comments in a subsequent update. Would like to see this in daily testing.

@@ -73,8 +73,7 @@ static void idc_handler(struct k_p4wq_work *work)
int payload = -1;
k_spinlock_key_t key;

/* A message is received from another core, invalidate local cache */
sys_cache_data_invd_range(msg, sizeof(*msg));
__ASSERT_NO_MSG(!is_cached(msg));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

best to keep and modify any inline comments around cache handling code. i.e. why are we checking the ptr type.

@@ -151,7 +151,7 @@ int idc_send_msg(struct idc_msg *msg, uint32_t mode)

switch (mode) {
case IDC_BLOCKING:
ret = k_p4wq_wait(work, K_FOREVER);
ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would add a specific message in log here if ret != 0

@lgirdwood
Copy link
Member

let me -1 this to make sure it isn't merged until fully understood

@lyakh are you aligned now ?

@lyakh
Copy link
Collaborator

lyakh commented Feb 20, 2024

let me -1 this to make sure it isn't merged until fully understood

@lyakh are you aligned now ?

@lgirdwood I still think there's a possibility for a race here, e.g. if two secondary cores try to send an IDC message to the primary core at the same time

@lgirdwood
Copy link
Member

let me -1 this to make sure it isn't merged until fully understood

@lyakh are you aligned now ?

@lgirdwood I still think there's a possibility for a race here, e.g. if two secondary cores try to send an IDC message to the primary core at the same time

ok, but is this worse

let me -1 this to make sure it isn't merged until fully understood

@lyakh are you aligned now ?

@lgirdwood I still think there's a possibility for a race here, e.g. if two secondary cores try to send an IDC message to the primary core at the same time

ok, but is this issue less likely than the issue we are seeing today. i.e. is this a stability improvement and yes we can circle back with an update once P4WQ opens are resolved ?

@lyakh
Copy link
Collaborator

lyakh commented Feb 21, 2024

but is this issue less likely than the issue we are seeing today. i.e. is this a stability improvement and yes we can circle back with an update once P4WQ opens are resolved ?

@lgirdwood if @kv2019i says that this improves the test results for him, then apparently yes, it makes the race window narrower. In general I don't find changes to address a problem that end up addressing it only partially particularly good, but if we add a "FIXME" comment here and open a bug for this, we could take this perhaps.

@kv2019i
Copy link
Collaborator Author

kv2019i commented Feb 21, 2024

It seems this has multiple issues, so closing this. FYI @abonislawski

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants