add riscv-ipi

Signed-off-by: Zhangjin Wu <[email protected]>
tinyclub · Aug 27, 2024 · 9e246fe · 9e246fe
1 parent 56b35cd
commit 9e246fe
Showing 1 changed file with 316 additions and 0 deletions.
diff --git a/_posts/2024-08-27-11-58-54-riscv-ipi.md b/_posts/2024-08-27-11-58-54-riscv-ipi.md
@@ -0,0 +1,316 @@
+---
+layout: post
+author: 'sugarfillet'
+title: 'RISC-V IPI 实现'
+draft: false
+album: 'RISC-V Linux'
+license: 'cc-by-nc-nd-4.0'
+permalink: /riscv-ipi/
+description: 'RISC-V SMP IPI 实现分析'
+category:
+  - 开源项目
+  - Risc-V
+tags:
+  - Linux
+  - RISC-V
+  - IPI
+  - SMP
+  - 多核同步
+  - 负载均衡
+  - 核间中断
+---
+
+> Corrector: [TinyCorrect](https://gitee.com/tinylab/tinycorrect) v0.1 - [codeinline pangu epw]
+> Author:    sugarfillet <[email protected]>
+> Date:      2023/02/09
+> Revisor:   Falcon [email protected]
+> Project:   [RISC-V Linux 内核剖析](https://gitee.com/tinylab/riscv-linux)
+> Proposal:  [RISC-V Linux SMP 技术调研与分析](https://gitee.com/tinylab/riscv-linux/issues/I5MU96)
+> Sponsor:   PLCT Lab, ISCAS
+
+
+## 前言
+
+IPI 全称为 Inter-Processor Interrupt，即处理器之间的中断，在此基础上，可以在 SMP 系统中实现多核同步，多核负载均衡功能。本文对 RISC-V Linux 中的 IPI 实现进行分析。
+
+**说明**：
+
+* 本文的 Linux 版本采用 `Linux v6.2-rc5`
+
+## RISC-V 中断类型
+
+RISC-V 处理器将异常控制流称之为 trap，同时为可以支持 trap 处理的两种特权模式（M-mode、S-mode）分别定义一组寄存器用于执行 trap 处理。当 trap 发生时，Hart 自动执行如下流程，之后 Hart 会跳转到 `xtvec` 设置的异常处理函数。
+
+- 发生异常的指令的 PC 被存入 xepc，且 PC 被设置为 xtvec
+- xcause 根据 trap 类型设置，xtval 被设置成出错的地址或者其它特定异常的信息字
+- 把 xstatus CSR 中的 XIE 置零，屏蔽中断，且 XIE 之前的值被保存在 XPIE 中
+- 发生 trap 时的权限模式被保存在 xstatus 的 XPP 域，然后设置当前模式为 X 模式
+
+> 注：上文的 x 和 X 可以替换为对应模式首字符来理解：S-mode (s 和 S)、M-mode (m 和 M)
+
+RISC-V privileged 文档（文档版本为 20211203）的 Table 3.6 和 Table 4.2 分别对 mcause 和 scause 寄存器的值做了说明，从这两个表中，我们可以知道：trap 分为中断和异常两种，当产生中断时，`xcause` 的 Interupt 位置 1，异常时置 0。中断在两种特权模式下又分为软件中断、时钟中断、外部中断。mcause 寄存器值定义表如下：
+
+| Interrupt | Exception Code | Description                    |
+|-----------|----------------|--------------------------------|
+| 1         | 0              | Reserved                       |
+| 1         | 1              | Supervisor software interrupt  |
+| 1         | 2              | Reserved                       |
+| 1         | 3              | Machine software interrupt     |
+| 1         | 4              | Reserved                       |
+| 1         | 5              | Supervisor timer interrupt     |
+| 1         | 6              | Reserved                       |
+| 1         | 7              | Machine timer interrupt        |
+| 1         | 8              | Reserved                       |
+| 1         | 9              | Supervisor external interrupt  |
+| 1         | 10             | Reserved                       |
+| 1         | 11             | Machine external interrupt     |
+| 1         | 12–15          | Reserved                       |
+| 1         | ≥16            | Designated for platform use    |
+| 0         | 0              | Instruction address misaligned |
+| 0         | 1              | Instruction access fault       |
+| ..        | ..             | ..                             |
+
+M-mode 下，软件中断通过编程 CLINT.MSIP（Core Local Interruptor）寄存器来触发；时钟中断通过编程 CLINT.MTIMECMP 和 CLINT.MTIME 寄存器来触发；而外部中断是通过 PLIC（Platform-Level Interrupt Controller）控制器转发外设中断到 Hart。对 M-mode 的中断处理与应用感兴趣的同学可以参考这个 [课程][1]，课程里通过 CLINT 时钟中断实现抢占式调度，CLINT 软件中断实现兼容的协作式调度，PLIC 外部中断实现串口输入的功能。
+
+S-mode 下，外部中断与 M-mode 类似通过 PLIC 转发外部设备中断；时钟中断在 Linux 中通过 riscv-timer 时钟源来触发，此时钟源底层通过 SBI Timer 扩展或者编程 Sstc 相关寄存器来实现，这里不做展开，详细可参考（`drivers/clocksource/timer-riscv.c`)；而软件中断在 Linux 中主要用于核间的 IPI，底层通过调用 SBI IPI 扩展的 `sbi_send_ipi()` 接口触发。
+
+## HLIC PLIC CLINT
+
+如内核文档 `Documentation/devicetree/bindings/interrupt-controller/riscv,cpu-intc.txt` 中所述，RISC-V 每个 Hart 有关中断控制的本地寄存器由各自的 HLIC (Hart-Level Interrupt Controller) 进行管理。所有的中断类型最终都会路由到 HLIC 进行处理，包括 PLIC 转发的外部中断，CLINT/riscv-timer 产生的时钟中断，以及 CLINT 产生的或者 HLIC 自己管理的软件中断。我们以 `arch/riscv/boot/dts/starfive/jh7100.dtsi` 为例来说明互相的连接关系：
+
+- 在每个 CPU 节点中定义一个 HLIC 控制器，`compatible` 选项定义为 "riscv,cpu-intc"。
+- plic 节点定义一个 "sifive,plic-1.0.0" 中断控制器，通过 "interrupts-extended" 映射 11 号和 9 号中断到每个 HLIC 的外部中断（这里的 11 和 9 分别代表 M-mode 和 S-mode 外部中断在 xcause 对应的异常码）。
+- clint 节点定义一个 "sifive,clint0" 时钟设备，该节点通过 "interrupts-extended" 映射 3 号和 7 号中断到每个 HLIC 的软件中断和时钟中断（这里的 3 和 7 分别代表 M-mode 软件中断和时钟中断在 mcasue 对应的异常码）。
+- riscv-timer 设备在此设备的初始化流程中的 `irq_create_mapping(domain, RV_IRQ_TIMER);` 调用中映射时钟中断到 HLIC 的中断域（这里的 `RV_IRQ_TIMER` (5) 代表 S-mode 时钟中断在 xcause 对应的异常码），该设备不体现在 dts 中。
+- 软件中断则由 HLIC 自己来管理，此中断不体现在 dts 中。
+
+```c
+// arch/riscv/boot/dts/starfive/jh7100.dtsi
+
+              U74_0: cpu@0 {
+                        compatible = "sifive,u74-mc", "riscv";
+                        ...
+                        mmu-type = "riscv,sv39";
+                        riscv,isa = "rv64imafdc";
+
+                        cpu0_intc: interrupt-controller {
+                                compatible = "riscv,cpu-intc";
+                                interrupt-controller;
+                                #interrupt-cells = <1>;
+                        };
+                };
+
+               clint: clint@2000000 {
+                        compatible = "starfive,jh7100-clint", "sifive,clint0";
+                        reg = <0x0 0x2000000 0x0 0x10000>;
+                        interrupts-extended = <&cpu0_intc 3 &cpu0_intc 7
+                                               &cpu1_intc 3 &cpu1_intc 7>;
+                };
+
+                plic: interrupt-controller@c000000 {
+                        compatible = "starfive,jh7100-plic", "sifive,plic-1.0.0";
+                        reg = <0x0 0xc000000 0x0 0x4000000>;
+                        interrupts-extended = <&cpu0_intc 11 &cpu0_intc 9
+                                               &cpu1_intc 11 &cpu1_intc 9>;
+                        interrupt-controller;
+                        #address-cells = <0>;
+                        #interrupt-cells = <1>;
+                        riscv,ndev = <133>;
+                };
+```
+
+> CLINT 设备提供时钟中断和软件中断不会和 RISC-V-timer 设备的时钟中断以及 HLIC 管理的软件中断冲突么？
+
+如 RISC-V Kconfig 文件所示，CLINT 设备只在不支持 MMU 的 RISC-V 处理器上运行 M-mode Linux 的环境中使能，并提供 `clint_ipi_ops` 操作集来提供软件中断的支持，而 S-mode Linux 的默认的时钟源为 riscv-timer。所以上文的 dts 以及 QEMU 默认的 dts 中，"mmu-type" 选项和 "clint" 节点似乎不应该同时存在，同时存在的结果就是 clint 设备不工作。
+
+```c
+// arch/riscv/Kconfig.socs :33
+
+config SOC_VIRT
+        bool "QEMU Virt Machine"
+        select CLINT_TIMER if RISCV_M_MODE
+        ...
+        help
+          This enables support for QEMU Virt Machine.
+
+// arch/riscv/Kconfig : 14
+
+config RISCV
+       ...
+       select CLINT_TIMER if !MMU
+       ...
+```
+
+总之，RISC-V 系统的中断连接如下，其中 S-mode Linux 不访问 CLINT 设备：
+
+```
+ -------
+|       |<--> Soft  <----------------------v
+| HLIC  |<--- Timer ----- [riscv-timer] [CLINT]
+|       |<--- Exter ----- [PLIC] --|<--- Int1
+ -------                           |<--- Int2
+
+```
+
+## HLIC 初始化
+
+HLIC 在 RISC-V Linux 中通过 "riscv,cpu-intc" 中断控制器来实现，其初始化函数 `riscv_intc_init()` 在 `init_IRQ()` 阶段被调用。
+
+此函数调用 `irq_domain_add_linear()` 注册中断域，并设置根中断处理函数为 `riscv_intc_irq()`，之后通过设置热插拔状态 `CPUHP_AP_IRQ_RISCV_STARTING` 在热插拔线程的 `ONLINE` 阶段调用 `riscv_intc_cpu_starting()` 函数以开启当前 CPU 的 sie 寄存器的 SSIE 位，表示开启 S-mode 的软件中断。关键代码如下：
+
+```c
+// drivers/irqchip/irq-riscv-intc.c : 95
+
+IRQCHIP_DECLARE(riscv, "riscv,cpu-intc", riscv_intc_init);
+
+riscv_intc_init()
+   // only need to do INTC initialization for boot hart
+
+   irq_domain_add_linear(node, BITS_PER_LONG, &riscv_intc_domain_ops, NULL); // Allocate and register a linear revmap irq_domain.
+    __irq_domain_add
+        init domain->*  domain->ops = ops;
+        debugfs_add_domain_dir(domain);
+        list_add(&domain->link, &irq_domain_list);
+
+   set_handle_irq(&riscv_intc_irq); // set root irq handler
+
+   cpuhp_setup_state(CPUHP_AP_IRQ_RISCV_STARTING,
+      "irqchip/riscv/intc:starting",
+      riscv_intc_cpu_starting,   // csr_set(CSR_IE, BIT(RV_IRQ_SOFT));
+      riscv_intc_cpu_dying);  // csr_clear(CSR_IE, BIT(RV_IRQ_SOFT));
+```
+
+HIIC 中断处理函数 `riscv_intc_irq()`，获取 `pt_regs` 的 cause 成员（即 scause 寄存器），如果为软件中断则执行 `handle_IPI()` 进行 IPI 的处理，否则调用 `generic_handle_domain_irq()` 执行时钟中断和外部中断，此函数最终会调用到具体的中断处理函数，比如：RISC-V-timer 为 `riscv_timer_interrupt()`。
+
+```c
+// drivers/irqchip/irq-riscv-intc.c : 139
+
+riscv_intc_irq(struct pt_regs *regs)
+   cause = regs->cause & ~CAUSE_IRQ_FLAG;
+   case RV_IRQ_SOFT:
+      handle_IPI(regs);
+   default:
+      generic_handle_domain_irq(intc_domain, cause);
+        return handle_irq_desc(irq_resolve_mapping(domain, hwirq));
+          // find irq_desc in ` irq_domain_get_irq_data(domain, hwirq)` by cause
+          generic_handle_irq_desc(desc);
+            desc->handle_irq(desc); // handle_percpu_devid_irq
+                desc->action->handler()
+
+// drivers/clocksource/timer-riscv.c : 195
+
+riscv_timer_init_dt()
+  riscv_clock_event_irq = irq_create_mapping(domain, RV_IRQ_TIMER);
+  error = request_percpu_irq(riscv_clock_event_irq, riscv_timer_interrupt, riscv-timer, &riscv_clock_event);
+```
+
+## IPI 中断的触发与处理
+
+RISC-V Linux 中提供 `send_ipi_mask()`、`send_ipi_single()` 两个函数用于在指定 `cpu` 或者 `cpumask` 触发 `op` 参数指定的 IPI 中断事件类型。这两个函数首先在 percpu 变量 `ipi_data[cpu].bits` 中设置 IPI 事件类型表示触发此类事件，之后调用 SBI 的 IPI 扩展接口 `sbi_send_ipi()` 发送 IPI。关键代码如下：
+
+```c
+// arch/riscv/kernel/smp.c : 120
+
+send_ipi_single(int cpu, enum ipi_message_type op)
+send_ipi_mask(const struct cpumask *mask, enum ipi_message_type op)
+     set_bit(op, &ipi_data[cpu].bits);
+     ipi_ops->ipi_inject(mask); // sbi_ipi_ops.ipi_inject sbi_send_cpumask_ipi
+       if SBI_EXT_IPI __sbi_send_ipi_v02
+          hartid = cpuid_to_hartid_map(cpuid);
+          ret = sbi_ecall(SBI_EXT_IPI, SBI_EXT_IPI_SEND_IPI, hmask, hbase, 0, 0, 0, 0); //  sbi_send_ipi()
+```
+
+RISC-V SBI 规范（规范版本为 1.0-rc1）第六章的 "sPI: s-mode IPI" 扩展中描述了 `sbi_send_ipi()` 接口，当调用此接口时会向 `hart_mask` 中定义的所有 Hart 发送 IPI，而目标 Hart 在接收时以 S 模式的软件中断处理。规范原文引用如下：
+
+> Send an inter-processor interrupt to all the harts defined in hart_mask. Interprocessor interrupts
+> manifest at the receiving harts as the supervisor software interrupts.
+
+`handle_IPI()` 函数负责执行 IPI 中断的处理，从 percpu 变量 `ipi_data[cpu].bits` 中取出 IPI 事件类型，调用不同的处理函数进行事件处理，比如：`IPI_CALL_FUNC` 事件的处理函数为 `generic_smp_call_function_interrupt()` 函数。同时使用 `ipi_data[cpu].stats[]` 数组进行不同 IPI 事件的计数，从而可以在 `/proc/interrupts` 文件中查询。
+
+```c
+// arch/riscv/kernel/smp.c :154
+
+handle_IPI()
+
+   unsigned long *pending_ipis = &ipi_data[cpu].bits;
+   unsigned long *stats = ipi_data[cpu].stats; // show_ipi_stats /proc/interrupts
+   ops = xchg(pending_ipis, 0);
+   if (ops & (1 << IPI_RESCHEDULE)) {
+     stats[IPI_RESCHEDULE]++;
+     scheduler_ipi();
+   }
+   IPI_CALL_FUNC
+     generic_smp_call_function_interrupt();
+   IPI_CPU_STOP
+     ipi_stop();
+   IPI_CPU_CRASH_STOP
+     ipi_cpu_crash_stop(cpu, get_irq_regs());
+   IPI_IRQ_WORK
+     irq_work_run();
+   IPI_TIMER
+     tick_receive_broadcast();
+```
+
+## IPI 中断事件
+
+在上文的 IPI 中断触发与处理机制的基础上，RISC-V Linux 提供 6 种 IPI 中断事件的支持，以实现具体的跨 CPU 功能。这些事件在 `ipi_message_type` 枚举与 `ipi_names` 数组中定义如下：
+
+```
+static const char * const ipi_names[] = {
+        [IPI_RESCHEDULE]        = "Rescheduling interrupts",
+        [IPI_CALL_FUNC]         = "Function call interrupts",
+        [IPI_CPU_STOP]          = "CPU stop interrupts",
+        [IPI_CPU_CRASH_STOP]    = "CPU stop (for crash dump) interrupts",
+        [IPI_IRQ_WORK]          = "IRQ work interrupts",
+        [IPI_TIMER]             = "Timer broadcast interrupts",
+};
+```
+
+- IPI_RESCHEDULE
+
+  SMP 系统中，调度器更加倾向于把任务负载分摊到每个 CPU 上，不至于出现单核繁忙的情况，那么当调度器把任务负载从一个 CPU 卸载到其他的空闲 CPU 时，就会触发 `IPI_RESCHEDULE` 事件，而目标 CPU 中当前任务如果设置了重新调度位，则执行调度。`IPI_RESCHEDULE` 事件通过调用 `smp_send_reschedule()` 函数来触发，而在目标 CPU 上的 IPI 处理函数 `handle_IPI()` 中则调用 `scheduler_ipi()` 函数选择性地执行重新调度。
+
+- IPI_CALL_FUNC
+
+  Linux 提供 `smp_call_function()` 函数用来在多个 CPU 上执行函数，比如：在 ftrace 更新全局的跟踪函数后会调用此接口在其他 CPU 上执行 `smp_rmb()` 用于通知其他 CPU 此全局函数的更新，在 sysrq 中也会调用此接口打印每个 CPU 上的调用栈等相关信息。`smp_call_function()` 接口把要执行的函数及其参数存储在目标 CPU 的调用函数队列 `call_single_queue` 上，之后会调用 `arch_send_call_function_ipi_mask()` 触发 `IPI_CALL_FUNC` 事件。而在目标 CPU 上的 IPI 处理过程中则调用 `generic_smp_call_function_interrupt()` 函数从调用函数队列中取出**调用函数**去执行。
+
+- IPI_CPU_STOP
+
+  `IPI_CPU_STOP` 事件在 `panic()` 流程中调用 `smp_send_stop()` 触发，用于停止其他 CPU，而目标 CPU 处理此事件时，则通过 `ipi_stop()` 执行 wfi 循环。
+
+- IPI_CPU_CRASH_STOP
+
+  `IPI_CPU_CRASH_STOP` 事件在 crash 之后执行 kexec 之前时调用 `crash_smp_send_stop()` 触发，用来停止未 crash 的 CPU 并保存其寄存器，而目标 CPU 处理此事件时，调用 `ipi_cpu_crash_stop(cpu, get_irq_regs())` 保存进程信息和寄存器信息到 core 文件，之后调用 `cpu_ops[cpu]->cpu_stop()` 进入 SBI HSM 扩展的 STOP 状态。
+
+- IPI_IRQ_WORK
+
+  Irq Work 机制用于在中断上下文中执行回调函数，`IPI_IRQ_WORK` 事件在 irq_work 入队列时调用 `arch_irq_work_raise()` 触发，而目标 CPU 处理此事件时，调用 `irq_work_run()` 执行 irq_work 的回调函数。
+
+- IPI_TIMER
+
+  当 CPU 进入 idle 状态时可能会关闭本地时钟，系统时钟通过调用 `tick_broadcast()` 触发 `IPI_TIMER` 事件让 CPU 从 idle 状态退出，而目标 CPU 处理此事件时，调用 `tick_receive_broadcast()` 执行当前 CPU 上的时钟源（clock event）的时钟处理函数。
+
+以上 IPI 事件的触发和处理函数整理成下表，以便查询：
+
+| IPI type           | trigger func                     | deal func                           |
+|--------------------|----------------------------------|-------------------------------------|
+| IPI_RESCHEDULE     | smp_send_reschedule              | scheduler_ipi                       |
+| IPI_CALL_FUNC      | arch_send_call_function_ipi_mask | generic_smp_call_function_interrupt |
+| IPI_CPU_STOP       | smp_send_stop                    | ipi_stop                            |
+| IPI_CPU_CRASH_STOP | crash_smp_send_stop              | ipi_cpu_crash_stop                  |
+| IPI_IRQ_WORK       | arch_irq_work_raise              | irq_work_run                        |
+| IPI_TIMER          | tick_broadcast                   | tick_receive_broadcast              |
+
+## 小结
+
+RISC-V Linux 中通过 HLIC 来集中处理三种中断类型（软件中断、时钟中断、外部中断），其中 S-mode 的软件中断主要用于 IPI。通过调用 SBI IPI 扩展接口 `sbi_send_ipi()` 向目标 Hart 发送 IPI，在目标 Hart 在收到 IPI 中断时，HLIC 的中断处理函数 `riscv_intc_irq()` 以 S 模式的软件中断来处理。在此基础上又定义 6 种不同的 IPI 事件以实现各种具体的跨 CPU 功能。
+
+## 参考资料
+
+- [riscv-operating-system-mooc][1]
+- [riscv irq handle][2]
+- [time framework][3]
+
+[1]: https://gitee.com/unicornx/riscv-operating-system-mooc
+[2]: https://tinylab.org/riscv-irq-analysis-part3-interrupt-handling-cpu/
+[3]: http://www.wowotech.net/timer_subsystem/time-subsyste-architecture.html