idc: zephyr: set of fixes to IDC in multicore scenario #8753
Conversation
Replace the infinite wait with a time-limited wait. In case a blocking IDC message is not handled within IDC_TIMEOUT, return an error instead of waiting forever. This makes stuck core-to-core communication easier to debug. Signed-off-by: Kai Vehmanen <[email protected]>
The IDC message objects are stored in the static "idc_work" array, so cache operations are unnecessary when pointers to these messages are passed around. Signed-off-by: Kai Vehmanen <[email protected]>
```diff
@@ -73,8 +73,7 @@ static void idc_handler(struct k_p4wq_work *work)
 	int payload = -1;
 	k_spinlock_key_t key;

-	/* A message is received from another core, invalidate local cache */
-	sys_cache_data_invd_range(msg, sizeof(*msg));
+	__ASSERT_NO_MSG(!is_cached(msg));
```
I could just remove these asserts in the final patch. They're nice to have here in the PR review, but I wonder whether they should just be dropped.
@lyakh any comment? You added these cache ops in the original version.
@kv2019i looks correct
Best to keep and update any inline comments around the cache handling code, i.e. explain why we are checking the pointer type.
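A sketch of what such a comment could look like, based on the diff and the commit message above; the wording is illustrative, not the final patch:

```c
/*
 * IDC message objects live in the static (uncached) "idc_work"
 * array, so no cache invalidate is needed here. Assert that the
 * pointer we received is indeed the uncached alias: a cached
 * pointer would mean the sender handed us a stale mapping.
 */
__ASSERT_NO_MSG(!is_cached(msg));
```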
```c
	if (zmsg->work_status == WORK_SUBMITTED) {
		work->sync = 1;
		ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
```
Don't other work fields have to be set? E.g. .handler is called unconditionally.
@lyakh This is just to influence k_p4wq_wait() behaviour. We want to sleep on the semaphore (if send_msg is called with the BLOCKING flag), but the semaphore is private, so we need to set the work->sync flag to get the same effect.
In the V2 version I've modified this to set the flag conditionally: if send_msg is called without the BLOCKING flag, there is no need to wait here.
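A minimal sketch of that V2 idea, assuming the mode argument and IDC_BLOCKING flag from the existing idc_send_msg() signature shown later in the thread; illustrative, not the exact patch:

```c
	/* Only request completion signalling when the caller blocks;
	 * k_p4wq_wait() only sleeps on the private completion
	 * semaphore for items that have sync set.
	 */
	if (mode == IDC_BLOCKING) {
		work->sync = 1;
		ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
		if (ret < 0)
			return ret;
	}
```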
```c
	work->sync = 1;
	ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
	if (ret < 0 && zmsg->work_status == WORK_SUBMITTED)
		return ret;
```
Have we ever seen this race occurring? Between what and what? Non-blocking operations usually return an error when the action cannot be completed immediately; wouldn't that be better?
In fact this seems racy too. What if the handler completes between checking for WORK_SUBMITTED and calling k_p4wq_wait()? Then we'll never be woken up?
I can hit this on an MTL with a modified multicore tplg. Still trying to find more info on the scenario, as I agree this should not happen normally (and it's rare!). I've now reworked the wait logic to rely only on the p4wq semaphore, so this should be safe against races (and dropped the work_status field).
A race exists between submitting a new IDC message and completion of a previous message. The p4wq interface mandates that k_p4wq_wait() must be called to claim back ownership of the work object. The SOF IDC code does not do this, but relies on the caller not to send multiple IDC messages to the same destination core. Add appropriate calls to k_p4wq_wait() to handle this issue. If the caller wants to send an IDC in blocking mode, also call k_p4wq_wait() in blocking mode. If non-blocking, return -EBUSY immediately. The check for CPU status is moved earlier in the function: there is no point in waiting for the p4wq semaphore if the target core is already powered down, so do the power state check as the first step. Signed-off-by: Kai Vehmanen <[email protected]>
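Roughly, the flow the commit message describes might look like this sketch; cpu_is_core_enabled() and the -EACCES return are assumptions standing in for whatever power-state check and error code the patch actually uses:

```c
	/* Power-state check first: no point in waiting on the p4wq
	 * semaphore if the target core is already down.
	 */
	if (!cpu_is_core_enabled(msg->core))
		return -EACCES;

	/* Reclaim ownership of the work object as p4wq requires:
	 * block up to IDC_TIMEOUT for blocking sends, bail out
	 * immediately with -EBUSY for non-blocking ones.
	 */
	ret = k_p4wq_wait(work, mode == IDC_BLOCKING ?
			  K_USEC(IDC_TIMEOUT) : K_NO_WAIT);
	if (ret < 0)
		return mode == IDC_BLOCKING ? ret : -EBUSY;

	/* ... fill in the message and k_p4wq_submit() as before ... */
```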
Force-pushed from f7586a5 to 342553c.
This is not necessarily fixing root causes, but I believe it makes the code more robust and can help in debugging. Without this PR, if an IDC is sent before the previous one is completely handled by the p4wq thread (of the destination core), we enter uncharted territory. In debug builds, one will get the following assert in these cases:
This PR ensures we will not reuse the work object until p4wq is finished with it.
@lyakh Here's one concrete case where this is hit: a set of set-pipeline-state IPCs, sent over to core1 with a non-blocking IDC. Processing one is delayed (I suspect the same root cause as #8756), and when the next set-pipeline-state is sent, the IDC work queue is busy. The real root cause is most probably IPC processing taking too much time, but without this PR, the work item is silently reused in the IDC thread and the results can be unpredictable.
LGTM. I like defensive programming.
@kv2019i this seems to be a problem right there: the next IPC must not be sent by the host before the previous one has been processed and responded to! We've spent a lot of effort tracing all possible out-of-order IPC processing flows (LL-thread, other core, LL-thread on other core) and making sure we really complete processing one IPC before the next one can arrive!
let me -1 this to make sure it isn't merged until fully understood
I'm not claiming the calls to idc_send_msg() are valid, and the PR will not allow such cases to pass. But what this PR does is explicitly catch misuse of the interface and raise an error, instead of silently proceeding to use a work object on core X while ownership is still on core Y (as per the p4wq interface semantics). In this specific case, there are no multiple IPCs in flight concurrently. A set-pipeline IPC to core1 is handled and a reply is sent from core1, but core1 is busy with other work and the p4wq thread does not signal the work as ready before core0 sends the next IPC to core1. Here's one log where this happens:
```c
	 * In all other cases, we must use k_p4wq_wait() before reuse
	 * of the object.
	 */
	if (work->handler) {
```
The thing is that IIUC this check is unprotected, so two cores can simultaneously check this and proceed to use the work object simultaneously.
@lyakh This field is only set once in the object's lifetime. It is NULL initially (as a static object in BSS). When the first message is sent to a core, the handler is set with "handler = idc_handler".
I.e. this check is only there to prevent p4wq_wait() on an uninitialized work queue object.
I guess I could just skip this and call k_p4wq_wait() always, but that depends a bit on p4wq internals working as expected (so that the wait returns immediately if the object is fresh from BSS).
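A sketch of the guard being discussed: the work object comes zero-initialized from BSS, so handler doubles as an "ever submitted" flag (illustrative, based on the description above and the diff):

```c
	/* A fresh work object from BSS has handler == NULL and has
	 * never been submitted, so there is nothing to wait for.
	 * After the first send, handler stays set, and ownership
	 * must be reclaimed via k_p4wq_wait() before reuse.
	 */
	if (work->handler) {
		ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
		if (ret < 0)
			return ret;
	}
	work->handler = idc_handler;
```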
@kv2019i right, yes, sorry, forgot about this. But I still think there's a race possibility here: it's the "check-and-take" operation that should be atomic and safe, right? And here it isn't. You added a wait in line 141, but before we submit a new work in line 171, it might be unsafe again? Another core might send an IDC to the same target core, and the target core might already be processing it again?
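For illustration only, one way such an atomic check-and-take could look; the idc_work_busy flag and the claim/release protocol are hypothetical and not part of this PR:

```c
	/* Hypothetical: a per-target-core busy flag claimed atomically,
	 * so two senders cannot both pass the check and take the item.
	 */
	static atomic_t idc_work_busy[CONFIG_MP_MAX_NUM_CPUS];

	if (!atomic_cas(&idc_work_busy[msg->core], 0, 1))
		return -EBUSY; /* item already claimed by another sender */

	/* ... fill in and submit the work item; the completion path
	 * releases the claim with atomic_clear(&idc_work_busy[msg->core]).
	 */
```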
@andyross is my understanding of this P4WQ use-case correct? What's the right way to handle this (see above comment)?
Just seeing this question now, apologies. The API promises that after k_p4wq_wait() returns, the item has been executed and the queue itself won't touch it anymore. It doesn't provide any synchronization otherwise; in particular, if this code itself may be executed from more than one core, then it can indeed race on the single work item.
It looks to me like there's one work item instantiated for each target CPU, so on 3+ core systems that seems like a real risk unless there is more synchronization at a higher level. You could fix that by implementing a synchronized pool of such work items (cf. k_mem_slab), or by having one for each unique sender/receiver pair (i.e. 6 on MTL, 12 on TGL).
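The per-pair layout suggested above could be sketched like this; the struct zephyr_idc_msg name and the cpu_get_id() helper are assumptions for illustration:

```c
/* Sketch: one work item per ordered (sender, receiver) pair, so
 * concurrent senders targeting the same core never share an item.
 * Diagonal entries stay unused (a core does not send IDC to itself).
 */
static struct zephyr_idc_msg idc_work[CONFIG_MP_MAX_NUM_CPUS][CONFIG_MP_MAX_NUM_CPUS];

struct zephyr_idc_msg *zmsg = &idc_work[cpu_get_id()][msg->core];
```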
LGTM, I can take any new inline comments in a subsequent update. Would like to see this in daily testing.
```diff
@@ -151,7 +151,7 @@ int idc_send_msg(struct idc_msg *msg, uint32_t mode)

 	switch (mode) {
 	case IDC_BLOCKING:
-		ret = k_p4wq_wait(work, K_FOREVER);
+		ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
```
I would add a specific log message here if ret != 0.
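For example (a sketch; LOG_ERR is Zephyr's logging macro, while the actual patch might use SOF's own trace macros instead):

```c
	ret = k_p4wq_wait(work, K_USEC(IDC_TIMEOUT));
	if (ret < 0)
		LOG_ERR("IDC to core %u not completed in %u us: %d",
			msg->core, IDC_TIMEOUT, ret);
```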
@lyakh are you aligned now?
@lgirdwood I still think there's a possibility for a race here, e.g. if two secondary cores try to send an IDC message to the primary core at the same time.
OK, but is this issue less likely than the issue we are seeing today? I.e. is this a stability improvement? And yes, we can circle back with an update once the P4WQ opens are resolved.
@lgirdwood if @kv2019i says that this improves the test results for him, then apparently yes, it makes the race window narrower. In general I don't find changes that only partially address a problem particularly good, but if we add a "FIXME" comment here and open a bug for this, we could perhaps take it.
It seems this has multiple issues, so closing this. FYI @abonislawski |