Skip to content
This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

Rebase onto 6.8-rc1 #125

Closed
wants to merge 10,000 commits into from
Closed

Rebase onto 6.8-rc1 #125

wants to merge 10,000 commits into from

Conversation

Byte-Lab
Copy link
Collaborator

Mostly a clean merge. Please take a look specifically at the following to make sure I didn't botch anything:

  • cgroup_add_file() in kernel/cgroup/cgroup.c
  • All of the changes in ext.c

nathanchance and others added 30 commits January 17, 2024 18:08
Fangrui noted that the comment around CONFIG_AS_HAS_NON_CONST_LEB128
could be made more accurate because explicit .sleb128 directives are not
emitted, only .uleb128 directives are. Rename the symbol to
CONFIG_AS_HAS_NON_CONST_ULEB128 as a result.

Further clarifications include replacing "symbol deltas" with the more
accurate "label differences", noting that this issue has been resolved
in newer binutils (2.41+), and it only occurs when a port uses RISC-V
style linker relaxation.

Suggested-by: Fangrui Song <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Reviewed-by: Charlie Jenkins <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
…sions"

Nathan Chancellor <[email protected]> says:

This series disables DWARF5 for LLVM versions where it is known to be
broken due to linker relaxation.

* b4-shazam-merge:
  lib/Kconfig.debug: Update AS_HAS_NON_CONST_LEB128 comment and name
  riscv: Restrict DWARF5 when building with LLVM to known working versions
  riscv: Hoist linker relaxation disabling logic into Kconfig

Link: llvm/llvm-project@bbc0f99
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
In commit afc76b8 ("riscv: Using PATCHABLE_FUNCTION_ENTRY instead
of MCOUNT") RISC-V added support for -fpatchable-function-entry, which
removes the need for recordmcount.

Select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY to tell the build
system not to run recordmcount.

Link: https://lore.kernel.org/linux-riscv/CAAYs2=j3Eak9vU6xbAw0zPuoh00rh8v5C2U3fePkokZFibWs2g@mail.gmail.com/T/#t
Link: https://lore.kernel.org/linux-riscv/Y4jtfrJt+%2FQ5nMOz@spud/
Signed-off-by: Song Shuai <[email protected]>
Tested-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
Similar to commit 0c0593b ("x86/ftrace: Make function graph use
ftrace directly") and commit c4a0ebf ("arm64/ftrace: Make
function graph use ftrace directly"), RISC-V has no need for a special
graph tracer hook. The graph_ops::func function can be used to install
the return_hooker.

This cleanup only changes the FTRACE_WITH_REGS implementation, leaving
the mcount-based implementation is unaffected.

Perform the simplification, and also cleanup the register save/restore
macros.

Signed-off-by: Song Shuai <[email protected]>
Tested-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
Select the DYNAMIC_FTRACE_WITH_DIRECT_CALLS to provide the
register_ftrace_direct[_multi] interfaces allowing users to register
the customed trampoline (direct_caller) as the mcount for one or more
target functions. And modify_ftrace_direct[_multi] are also provided
for modifying direct_caller.

To make the direct_caller and the other ftrace hooks (e.g.
function/fgraph tracer, k[ret]probes) co-exist, a temporary register
is nominated to store the address of direct_caller in
ftrace_regs_caller. After the setting of the address direct_caller by
direct_ops->func and the RESTORE_REGS in ftrace_regs_caller,
direct_caller will be jumped to by the `jr` inst.

Add DYNAMIC_FTRACE_WITH_DIRECT_CALLS support for RISC-V.

Signed-off-by: Song Shuai <[email protected]>
Tested-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
Add RISC-V variants of the ftrace-direct* samples.

Tested-by: Evgenii Shatokhin <[email protected]>
Signed-off-by: Song Shuai <[email protected]>
Tested-by: Guo Ren <[email protected]>
Signed-off-by: Guo Ren <[email protected]>
Acked-by: Björn Töpel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
Björn Töpel <[email protected]> says:

This series includes a three ftrace improvements for RISC-V:

1. Do not require to run recordmcount at build time (patch 1)
2. Simplification of the function graph functionality (patch 2)
3. Enable DYNAMIC_FTRACE_WITH_DIRECT_CALLS (patch 3 and 4)

The series has been tested on Qemu/rv64 virt/Debian sid with the
following test configs:
  CONFIG_FTRACE_SELFTEST=y
  CONFIG_FTRACE_STARTUP_TEST=y
  CONFIG_SAMPLE_FTRACE_DIRECT=m
  CONFIG_SAMPLE_FTRACE_DIRECT_MULTI=m
  CONFIG_SAMPLE_FTRACE_OPS=m

All tests pass.

* b4-shazam-merge:
  samples: ftrace: Add RISC-V support for SAMPLE_FTRACE_DIRECT[_MULTI]
  riscv: ftrace: Add DYNAMIC_FTRACE_WITH_DIRECT_CALLS support
  riscv: ftrace: Make function graph use ftrace directly
  riscv: select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
A common issue in Makefile is a race in parallel building.

You need to be careful to prevent multiple threads from writing to the
same file simultaneously.

Commit 3939f33 ("ARM: 8418/1: add boot image dependencies to not
generate invalid images") addressed such a bad scenario.

A similar symptom occurs with the following command:

  $ make -j$(nproc) ARCH=riscv Image Image.gz loader loader.bin vmlinuz.efi
    [ snip ]
    SORTTAB vmlinux
    OBJCOPY arch/riscv/boot/Image
    OBJCOPY arch/riscv/boot/Image
    OBJCOPY arch/riscv/boot/Image
    OBJCOPY arch/riscv/boot/Image
    OBJCOPY arch/riscv/boot/Image
    GZIP    arch/riscv/boot/Image.gz
    AS      arch/riscv/boot/loader.o
    AS      arch/riscv/boot/loader.o
    Kernel: arch/riscv/boot/Image is ready
    PAD     arch/riscv/boot/vmlinux.bin
    GZIP    arch/riscv/boot/vmlinuz
    Kernel: arch/riscv/boot/loader is ready
    OBJCOPY arch/riscv/boot/loader.bin
    Kernel: arch/riscv/boot/loader.bin is ready
    Kernel: arch/riscv/boot/Image.gz is ready
    OBJCOPY arch/riscv/boot/vmlinuz.o
    LD      arch/riscv/boot/vmlinuz.efi.elf
    OBJCOPY arch/riscv/boot/vmlinuz.efi
    Kernel: arch/riscv/boot/vmlinuz.efi is ready

The log "OBJCOPY arch/riscv/boot/Image" is displayed 5 times.
(also "AS      arch/riscv/boot/loader.o" twice.)

It indicates that 5 threads simultaneously enter arch/riscv/boot/
and write to arch/riscv/boot/Image.

It occasionally leads to a build failure:

  $ make -j$(nproc) ARCH=riscv Image Image.gz loader loader.bin vmlinuz.efi
    [ snip ]
    SORTTAB vmlinux
    OBJCOPY arch/riscv/boot/Image
    OBJCOPY arch/riscv/boot/Image
    OBJCOPY arch/riscv/boot/Image
    OBJCOPY arch/riscv/boot/Image
    PAD     arch/riscv/boot/vmlinux.bin
  truncate: Invalid number: 'arch/riscv/boot/vmlinux.bin'
  make[2]: *** [drivers/firmware/efi/libstub/Makefile.zboot:13: arch/riscv/boot/vmlinux.bin] Error 1
  make[2]: *** Deleting file 'arch/riscv/boot/vmlinux.bin'
  make[1]: *** [arch/riscv/Makefile:167: vmlinuz.efi] Error 2
  make[1]: *** Waiting for unfinished jobs....
    Kernel: arch/riscv/boot/Image is ready
    GZIP    arch/riscv/boot/Image.gz
    AS      arch/riscv/boot/loader.o
    AS      arch/riscv/boot/loader.o
    Kernel: arch/riscv/boot/loader is ready
    OBJCOPY arch/riscv/boot/loader.bin
    Kernel: arch/riscv/boot/loader.bin is ready
    Kernel: arch/riscv/boot/Image.gz is ready
  make: *** [Makefile:234: __sub-make] Error 2

Image.gz, loader, vmlinuz.efi depend on Image. loader.bin depends
on loader. Such dependencies are not specified in arch/riscv/Makefile.

Signed-off-by: Masahiro Yamada <[email protected]>
Acked-by: Ard Biesheuvel <[email protected]>
Reviewed-by: Samuel Holland <[email protected]>
Tested-by: Samuel Holland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
The Hamming Weight of a number is the total number of bits set in it, so
the cpop/cpopw instruction from Zbb extension can be used to accelerate
hweight() API.

Signed-off-by: Xiao Wang <[email protected]>
Reviewed-by: Charlie Jenkins <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
The Zkr extension is ratified and provides 16 bits of entropy seed when
reading the SEED CSR.

We can implement arch_get_random_seed_longs() by doing multiple csrrw to
that CSR and filling an unsigned long with valid entropy bits.

Acked-by: Conor Dooley <[email protected]>
Signed-off-by: Samuel Ortiz <[email protected]>
Signed-off-by: Clément Léger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
The patch can optimize the running times of insmod command by modify ELF
relocation function.
In the 5.10 and latest kernel, when install the riscv ELF drivers which
contains multiple symbol table items to be relocated, kernel takes a lot
of time to execute the relocation. For example, we install a 3+MB driver
need 180+s.
We focus on the riscv architecture handle R_RISCV_HI20 and R_RISCV_LO20
type items relocation function in the arch\riscv\kernel\module.c and
find that there are two-loops in the function. If we modify the begin
number in the second for-loops iteration, we could save significant time
for installation. We install the same 3+MB driver could just need 2s.

Signed-off-by: Amma Lee <[email protected]>
Signed-off-by: Maxim Kochetkov <[email protected]>
Reviewed-by: Charlie Jenkins <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
Add feature detector of kernel-side arg:ctx (__arg_ctx) tag support. If
this is detected, libbpf will avoid doing any __arg_ctx-related BTF
rewriting and checks in favor of letting kernel handle this completely.

test_global_funcs/ctx_arg_rewrite subtest is adjusted to do the same
feature detection (albeit in much simpler, though round-about and
inefficient, way), and skip the tests. This is done to still be able to
execute this test on older kernels (like in libbpf CI).

Note, BPF token series ([0]) does a major refactor and code moving of
libbpf-internal feature detection "framework", so to avoid unnecessary
conflicts we keep newly added feature detection stand-alone with ad-hoc
result caching. Once things settle, there will be a small follow up to
re-integrate everything back and move code into its final place in
newly-added (by BPF token series) features.c file.

  [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=814209&state=*

Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Refactor btf_get_prog_ctx_type() a bit to allow reuse of
bpf_ctx_convert_map logic in more than one places. Simplify interface by
returning btf_type instead of btf_member (field reference in BTF).

To do the above we need to touch and start untangling
btf_translate_to_vmlinux() implementation. We do the bare minimum to
not regress anything for btf_translate_to_vmlinux(), but its
implementation is very questionable for what it claims to be doing.
Mapping kfunc argument types to kernel corresponding types conceptually
is quite different from recognizing program context types. Fixing this
is out of scope for this change though.

Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Add enforcement of expected types for context arguments tagged with
arg:ctx (__arg_ctx) tag.

First, any program type will accept generic `void *` context type when
combined with __arg_ctx tag.

Besides accepting "canonical" struct names and `void *`, for a bunch of
program types for which program context is actually a named struct, we
allows a bunch of pragmatic exceptions to match real-world and expected
usage:

  - for both kprobes and perf_event we allow `bpf_user_pt_regs_t *` as
    canonical context argument type, where `bpf_user_pt_regs_t` is a
    *typedef*, not a struct;
  - for kprobes, we also always accept `struct pt_regs *`, as that's what
    actually is passed as a context to any kprobe program;
  - for perf_event, we resolve typedefs (unless it's `bpf_user_pt_regs_t`)
    down to actual struct type and accept `struct pt_regs *`, or
    `struct user_pt_regs *`, or `struct user_regs_struct *`, depending
    on the actual struct type kernel architecture points `bpf_user_pt_regs_t`
    typedef to; otherwise, canonical `struct bpf_perf_event_data *` is
    expected;
  - for raw_tp/raw_tp.w programs, `u64/long *` are accepted, as that's
    what's expected with BPF_PROG() usage; otherwise, canonical
    `struct bpf_raw_tracepoint_args *` is expected;
  - tp_btf supports both `struct bpf_raw_tracepoint_args *` and `u64 *`
    formats, both are coded as expections as tp_btf is actually a TRACING
    program type, which has no canonical context type;
  - iterator programs accept `struct bpf_iter__xxx *` structs, currently
    with no further iterator-type specific enforcement;
  - fentry/fexit/fmod_ret/lsm/struct_ops all accept `u64 *`;
  - classic tracepoint programs, as well as syscall and freplace
    programs allow any user-provided type.

In all other cases kernel will enforce exact match of struct name to
expected canonical type. And if user-provided type doesn't match that
expectation, verifier will emit helpful message with expected type name.

Note a bit unnatural way the check is done after processing all the
arguments. This is done to avoid conflict between bpf and bpf-next
trees. Once trees converge, a small follow up patch will place a simple
btf_validate_prog_ctx_type() check into a proper ARG_PTR_TO_CTX branch
(which bpf-next tree patch refactored already), removing duplicated
arg:ctx detection logic.

Suggested-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Add a bunch of global subprogs across variety of program types to
validate expected kernel type enforcement logic for __arg_ctx arguments.

Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
On kernel that don't support arg:ctx tag, before adjusting global
subprog BTF information to match kernel's expected canonical type names,
make sure that types used by user are meaningful, and if not, warn and
don't do BTF adjustments.

This is similar to checks that kernel performs, but narrower in scope,
as only a small subset of BPF program types can be accommodated by
libbpf using canonical type names.

Libbpf unconditionally allows `struct pt_regs *` for perf_event program
types, unlike kernel, which supports that conditionally on architecture.
This is done to keep things simple and not cause unnecessary false
positives. This seems like a minor and harmless deviation, which in
real-world programs will be caught by kernels with arg:ctx tag support
anyways. So KISS principle.

This logic is hard to test (especially on latest kernels), so manual
testing was performed instead. Libbpf emitted the following warning for
perf_event program with wrong context argument type:

  libbpf: prog 'arg_tag_ctx_perf': subprog 'subprog_ctx_tag' arg#0 is expected to be of `struct bpf_perf_event_data *` type

Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
Andrii Nakryiko says:

====================
Tighten up arg:ctx type enforcement

Follow up fixes for kernel-side and libbpf-side logic around handling arg:ctx
(__arg_ctx) tagged arguments of BPF global subprogs.

Patch #1 adds libbpf feature detection of kernel-side __arg_ctx support to
avoid unnecessary rewriting BTF types. With stricter kernel-side type
enforcement this is now mandatory to avoid problems with using `struct
bpf_user_pt_regs_t` instead of actual typedef. For __arg_ctx tagged arguments
verifier is now supporting either `bpf_user_pt_regs_t` typedef or resolves it
down to the actual struct (pt_regs/user_pt_regs/user_regs_struct), depending
on architecture), but for old kernels without __arg_ctx support it's more
backwards compatible for libbpf to use `struct bpf_user_pt_regs_t` rewrite
which will work on wider range of kernels. So feature detection prevent libbpf
accidentally breaking global subprogs on new kernels.

We also adjust selftests to do similar feature detection (much simpler, but
potentially breaking due to kernel source code refactoring, which is fine for
selftests), and skip tests expecting libbpf's BTF type rewrites.

Patch #2 is preparatory refactoring for patch #3 which adds type enforcement
for arg:ctx tagged global subprog args. See the patch for specifics.

Patch #4 adds many new cases to ensure type logic works as expected.

Finally, patch #5 adds a relevant subset of kernel-side type checks to
__arg_ctx cases that libbpf supports rewrite of. In libbpf's case, type
violations are reported as warnings and BTF rewrite is not performed, which
will eventually lead to BPF verifier complaining at program verification time.

Good care was taken to avoid conflicts between bpf and bpf-next tree (which
has few follow up refactorings in the same code area). Once trees converge
some of the code will be moved around a bit (and some will be deleted), but
with no change to functionality or general shape of the code.

v2->v3:
  - support `bpf_user_pt_regs_t` typedef for KPROBE and PERF_EVENT (CI);
v1->v2:
  - add user_pt_regs and user_regs_struct support for PERF_EVENT (CI);
  - drop FEAT_ARG_CTX_TAG enum leftover from patch #1;
  - fix warning about default: without break in the switch (CI).
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
netdevsim tests aren't very well integrated with kselftest,
which has its advantages and disadvantages. But regardless
of the intended integration - a config file to know what kernel
to build is very useful, add one.

Fixes: fc4c93f ("selftests: add basic netdevsim devlink flash testing")
Signed-off-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
As a followup to commit 03fb856 ("selftests: bonding: add missing
build configs"), add more networking-specific config options which are
needed for bonding tests.

For testing, I used the minimal config generated by virtme-ng and I added
the options in the config file. All bonding tests passed.

Fixes: bbb774d ("net: Add tests for bonding and team address list management") # for ipv6
Fixes: 6cbe791 ("kselftest: bonding: add num_grat_arp test") # for tc options
Fixes: 222c94e ("selftests: bonding: add tests for ether type changes") # for nlmon
Suggested-by: Jakub Kicinski <[email protected]>
Signed-off-by: Benjamin Poirier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
Currently the ARM64_WORKAROUND_SPECULATIVE_UNPRIV_LOAD workaround isn't
quite right, as it is supposed to be applied after the last explicit
memory access, but is immediately followed by an LDR.

The ARM64_WORKAROUND_SPECULATIVE_UNPRIV_LOAD workaround is used to
handle Cortex-A520 erratum 2966298 and Cortex-A510 erratum 3117295,
which are described in:

* https://developer.arm.com/documentation/SDEN2444153/0600/?lang=en
* https://developer.arm.com/documentation/SDEN1873361/1600/?lang=en

In both cases the workaround is described as:

| If pagetable isolation is disabled, the context switch logic in the
| kernel can be updated to execute the following sequence on affected
| cores before exiting to EL0, and after all explicit memory accesses:
|
| 1. A non-shareable TLBI to any context and/or address, including
|    unused contexts or addresses, such as a `TLBI VALE1 Xzr`.
|
| 2. A DSB NSH to guarantee completion of the TLBI.

The important part being that the TLBI+DSB must be placed "after all
explicit memory accesses".

Unfortunately, as-implemented, the TLBI+DSB is immediately followed by
an LDR, as we have:

| alternative_if ARM64_WORKAROUND_SPECULATIVE_UNPRIV_LOAD
| 	tlbi	vale1, xzr
| 	dsb	nsh
| alternative_else_nop_endif
| alternative_if_not ARM64_UNMAP_KERNEL_AT_EL0
| 	ldr	lr, [sp, #S_LR]
| 	add	sp, sp, #PT_REGS_SIZE		// restore sp
| 	eret
| alternative_else_nop_endif
|
| [ ... KPTI exception return path ... ]

This patch fixes this by reworking the logic to place the TLBI+DSB
immediately before the ERET, after all explicit memory accesses.

The ERET is currently in a separate alternative block, and alternatives
cannot be nested. To account for this, the alternative block for
ARM64_UNMAP_KERNEL_AT_EL0 is replaced with a single alternative branch
to skip the KPTI logic, with the new shape of the logic being:

| alternative_insn "b .L_skip_tramp_exit_\@", nop, ARM64_UNMAP_KERNEL_AT_EL0
| 	[ ... KPTI exception return path ... ]
| .L_skip_tramp_exit_\@:
|
| 	ldr	lr, [sp, #S_LR]
| 	add	sp, sp, #PT_REGS_SIZE		// restore sp
|
| alternative_if ARM64_WORKAROUND_SPECULATIVE_UNPRIV_LOAD
| 	tlbi	vale1, xzr
| 	dsb	nsh
| alternative_else_nop_endif
| 	eret

The new structure means that the workaround is only applied when KPTI is
not in use; this is fine as noted in the documented implications of the
erratum:

| Pagetable isolation between EL0 and higher level ELs prevents the
| issue from occurring.

... and as per the workaround description quoted above, the workaround
is only necessary "If pagetable isolation is disabled".

Fixes: 471470b ("arm64: errata: Add Cortex-A520 speculative unprivileged load workaround")
Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
For historical reasons, the non-KPTI exception return path is duplicated for
EL1 and EL0, with the structure:

	.if \el == 0
	[ KPTI handling ]
	ldr     lr, [sp, #S_LR]
 	add	sp, sp, #PT_REGS_SIZE		// restore sp
	[ EL0 exception return workaround ]
	eret
	.else
	ldr     lr, [sp, #S_LR]
 	add	sp, sp, #PT_REGS_SIZE		// restore sp
	[ EL1 exception return workaround ]
	eret
	.endif
	sb

This would be simpler and clearer with the common portions factored out,
e.g.

	.if \el == 0
	[ KPTI handling ]
	.endif

	ldr     lr, [sp, #S_LR]
 	add	sp, sp, #PT_REGS_SIZE		// restore sp

	.if \el == 0
	[ EL0 exception return workaround ]
	.else
	[ EL1 exception return workaround ]
	.endif

	eret
	sb

This expands to the same code, but is simpler for a human to follow as
it avoids duplicates the restore of LR+SP, and makes it clear that the
ERET is associated with the SB.

There should be no functional change as a result of this patch.

Signed-off-by: Mark Rutland <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: James Morse <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
When writing ZA we currently unconditionally flush the buffer used to store
it as part of ensuring that it is allocated. Since this buffer is shared
with ZT0 this means that a write to ZA when PSTATE.ZA is already set will
corrupt the value of ZT0 on a SME2 system. Fix this by only flushing the
backing storage if PSTATE.ZA was not previously set.

This will mean that short or failed writes may leave stale data in the
buffer, this seems as correct as our current behaviour and unlikely to be
something that userspace will rely on.

Fixes: f90b529 ("arm64/sme: Implement ZT0 ptrace support")
Signed-off-by: Mark Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
There is no need to check for SVE support when changing vector lengths,
even if the system is SME only we still need SVE storage for the streaming
SVE state.

Fixes: d4d5be9 ("arm64/fpsimd: Ensure SME storage is allocated after SVE VL changes")
Signed-off-by: Mark Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
When sme_alloc() is called with existing storage and we are not flushing we
will always allocate new storage, both leaking the existing storage and
corrupting the state. Fix this by separating the checks for flushing and
for existing storage as we do for SVE.

Callers that reallocate (eg, due to changing the vector length) should
call sme_free() themselves.

Fixes: 5d0a8d2 ("arm64/ptrace: Ensure that SME is set up for target when writing SSVE state")
Signed-off-by: Mark Brown <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
Remove the errant blank lines to make the desired empty row separators
around the Fujitsu and ASR entries in the main table, rather than them
being their own separate tables which then look odd in the HTML view.

Signed-off-by: Robin Murphy <[email protected]>
Link: https://lore.kernel.org/r/b6637654eda761e224f828a44a7bbc1eadf2ef88.1705511145.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <[email protected]>
Accessing an ethernet device that is powered off or clock gated might
cause the CPU to hang. Add ethnl_ops_begin/complete in
ethnl_set_features() to protect against this.

Fixes: 0980bfc ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Ludvig Pärsson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
Using the address operator on the array doesn't work:

./include/linux/seq_buf.h:27:27: error: initialization of ‘char *’
  from incompatible pointer type ‘char (*)[128]’
  [-Werror=incompatible-pointer-types]
   27 |                 .buffer = &__ ## NAME ## _buffer,       \
      |                           ^

Apart from fixing that, we can improve DECLARE_SEQ_BUF() by using a
compound literal to define the buffer array without attaching a name
to it. This makes the macro a single statement, allowing constructs
such as:

  static DECLARE_SEQ_BUF(my_seq_buf, MYSB_SIZE);

to work as intended.

Link: https://lkml.kernel.org/r/[email protected]

Cc: [email protected]
Acked-by: Kees Cook <[email protected]>
Fixes: dcc4e57 ("seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str()")
Signed-off-by: Nathan Lynch <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
…devices

__loop_update_dio only checks the alignment requirement for block backed
file systems, but misses them for the case where the loop device is
created directly on top of another block device.  Due to this creating
a loop device with default option plus the direct I/O flag on a > 512 byte
sector size file system will lead to incorrect I/O being submitted to the
lower block device and a lot of error from the lock layer.  This can
be seen with xfstests generic/563.

Fix the code in __loop_update_dio by factoring the alignment check into
a helper, and calling that also for the struct block_device of a block
device inode.

Also remove the TODO comment talking about dynamically switching between
buffered and direct I/O, which is a would be a recipe for horrible
performance and occasional data loss.

Fixes: 2e5ab5f ("block: loop: prepare for supporing direct IO")
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
This doc hasn't been touched in a while, in the meantime some
new io schedulers were added (e.g. all of mq), some with ioprio
support.

Also reword the introduction to remove reference to CFQ and the
limitation that io priorities only work on reads, which is no longer
true.

Signed-off-by: Christian Loehle <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Lately, a bug was found when many TC filters are added - at some point,
several bugs are printed to dmesg [1] and the switch is crashed with
segmentation fault.

The issue starts when gen_pool_free() fails because of unexpected
behavior - a try to free memory which is already freed, this leads to BUG()
call which crashes the switch and makes many other bugs.

Trying to track down the unexpected behavior led to a bug in eRP code. The
function mlxsw_sp_acl_erp_table_alloc() gets a pointer to the allocated
index, sets the value and returns an error code. When gen_pool_alloc()
fails it returns address 0, we track it and return -ENOBUFS outside, BUT
the call for gen_pool_alloc() already override the index in erp_table
structure. This is a problem when such allocation is done as part of
table expansion. This is not a new table, which will not be used in case
of allocation failure. We try to expand eRP table and override the
current index (non-zero) with zero. Then, it leads to an unexpected
behavior when address 0 is freed twice. Note that address 0 is valid in
erp_table->base_index and indeed other tables use it.

gen_pool_alloc() fails in case that there is no space left in the
pre-allocated pool, in our case, the pool is limited to
ACL_MAX_ERPT_BANK_SIZE, which is read from hardware. When more than max
erp entries are required, we exceed the limit and return an error, this
error leads to "Failed to migrate vregion" print.

Fix this by changing erp_table->base_index only in case of a successful
allocation.

Add a test case for such a scenario. Without this fix it causes
segmentation fault:

$ TESTS="max_erp_entries_test" ./tc_flower.sh
./tc_flower.sh: line 988:  1560 Segmentation fault      tc filter del dev $h2 ingress chain $i protocol ip pref $i handle $j flower &>/dev/null

[1]:
kernel BUG at lib/genalloc.c:508!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 6 PID: 3531 Comm: tc Not tainted 6.7.0-rc5-custom-ga6893f479f5e #1
Hardware name: Mellanox Technologies Ltd. MSN4700/VMOD0010, BIOS 5.11 07/12/2021
RIP: 0010:gen_pool_free_owner+0xc9/0xe0
...
Call Trace:
 <TASK>
 __mlxsw_sp_acl_erp_table_other_dec+0x70/0xa0 [mlxsw_spectrum]
 mlxsw_sp_acl_erp_mask_destroy+0xf5/0x110 [mlxsw_spectrum]
 objagg_obj_root_destroy+0x18/0x80 [objagg]
 objagg_obj_destroy+0x12c/0x130 [objagg]
 mlxsw_sp_acl_erp_mask_put+0x37/0x50 [mlxsw_spectrum]
 mlxsw_sp_acl_ctcam_region_entry_remove+0x74/0xa0 [mlxsw_spectrum]
 mlxsw_sp_acl_ctcam_entry_del+0x1e/0x40 [mlxsw_spectrum]
 mlxsw_sp_acl_tcam_ventry_del+0x78/0xd0 [mlxsw_spectrum]
 mlxsw_sp_flower_destroy+0x4d/0x70 [mlxsw_spectrum]
 mlxsw_sp_flow_block_cb+0x73/0xb0 [mlxsw_spectrum]
 tc_setup_cb_destroy+0xc1/0x180
 fl_hw_destroy_filter+0x94/0xc0 [cls_flower]
 __fl_delete+0x1ac/0x1c0 [cls_flower]
 fl_destroy+0xc2/0x150 [cls_flower]
 tcf_proto_destroy+0x1a/0xa0
...
mlxsw_spectrum3 0000:07:00.0: Failed to migrate vregion
mlxsw_spectrum3 0000:07:00.0: Failed to migrate vregion

Fixes: f465261 ("mlxsw: spectrum_acl: Implement common eRP core")
Signed-off-by: Amit Cohen <[email protected]>
Signed-off-by: Ido Schimmel <[email protected]>
Signed-off-by: Petr Machata <[email protected]>
Acked-by: Paolo Abeni <[email protected]>
Link: https://lore.kernel.org/r/4cfca254dfc0e5d283974801a24371c7b6db5989.1705502064.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <[email protected]>
Kent Overstreet and others added 22 commits January 21, 2024 13:27
Signed-off-by: Kent Overstreet <[email protected]>
bcachefs_format.h has gotten too big; let's do some organizing.

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Add line breaks - inode_to_text() is now much easier to read.

Signed-off-by: Kent Overstreet <[email protected]>
…l/git/powerpc/linux

Pull powerpc fixes from Aneesh Kumar:

 - Increase default stack size to 32KB for Book3S

Thanks to Michael Ellerman.

* tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Increase default stack size to 32KB
…nux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "Updates for time and clocksources:

   - A fix for the idle and iowait time accounting vs CPU hotplug.

     The time is reset on CPU hotplug which makes the accumulated
     systemwide time jump backwards.

   - Assorted fixes and improvements for clocksource/event drivers"

* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
  clocksource/drivers/ep93xx: Fix error handling during probe
  clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
  clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
  clocksource/timer-riscv: Add riscv_clock_shutdown callback
  dt-bindings: timer: Add StarFive JH8100 clint
  dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
…hefs

Pull more bcachefs updates from Kent Overstreet:
 "Some fixes, Some refactoring, some minor features:

   - Assorted prep work for disk space accounting rewrite

   - BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
     makes our trigger context more explicit

   - A few fixes to avoid excessive transaction restarts on
     multithreaded workloads: fstests (in addition to ktest tests) are
     now checking slowpath counters, and that's shaking out a few bugs

   - Assorted tracepoint improvements

   - Starting to break up bcachefs_format.h and move on disk types so
     they're with the code they belong to; this will make room to start
     documenting the on disk format better.

   - A few minor fixes"

* tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
  bcachefs: Improve inode_to_text()
  bcachefs: logged_ops_format.h
  bcachefs: reflink_format.h
  bcachefs; extents_format.h
  bcachefs: ec_format.h
  bcachefs: subvolume_format.h
  bcachefs: snapshot_format.h
  bcachefs: alloc_background_format.h
  bcachefs: xattr_format.h
  bcachefs: dirent_format.h
  bcachefs: inode_format.h
  bcachefs; quota_format.h
  bcachefs: sb-counters_format.h
  bcachefs: counters.c -> sb-counters.c
  bcachefs: comment bch_subvolume
  bcachefs: bch_snapshot::btime
  bcachefs: add missing __GFP_NOWARN
  bcachefs: opts->compression can now also be applied in the background
  bcachefs: Prep work for variable size btree node buffers
  bcachefs: grab s_umount only if snapshotting
  ...
Still need to fix a couple issues in ext.c

Linux 6.8-rc1

Signed-off-by: David Vernet <[email protected]>
strlcpy is removed upstream. Let's use the replacement strscpy_pad()
instead.

Signed-off-by: David Vernet <[email protected]>
The most recent struct bpf_struct_ops implementation has a cfi_stubs
field so as to enable CFI calls for indirect struct_ops callbacks. We
have to add a bunch of blank callbacks so we generate the correct CFI
hash and let the compiler generate the correct CFI for the actual
trampolines.

In addition, use the __bpf_kfunc annotation on scx kfuncs.

Signed-off-by: David Vernet <[email protected]>
@Byte-Lab Byte-Lab requested a review from htejun January 22, 2024 21:14
@htejun
Copy link
Collaborator

htejun commented Jan 23, 2024

I'm unsure whether we want to be pulling in rc releases into the main branch for a couple reasons:

  • bpf/for-next has always been our base when we needed to receive new bpf features. Note that bpf/for-next doesn't stay synced to linus#master and conflicts are possible when we pull in both bpf/for-next and linus#master.
  • Early RCs tend to be pretty rough and pulling them in doesn't really give us any practical benefits.
  • Tracking the latest stable release is likely more useful for most users including distros, which doesn't mean the devel branch is going to track stable kernels and no matter what we do there's going to be some manual work to cut and maintain kernel releases. The overall amount of work may be lower if we maintain -rcX kernel as separate releases rather than always pulling the into the main dev branch.

@Byte-Lab
Copy link
Collaborator Author

Byte-Lab commented Jan 23, 2024

bpf/for-next has always been our base when we needed to receive new bpf features. Note that bpf/for-next doesn't stay synced to linus#master and conflicts are possible when we pull in both bpf/for-next and linus#master.

I was under the impression that we were going to be moving away from pulling bpf/for-next for the reason you mentioned? I assumed we'd just be tracking Linus' branch and doing merges as needed. Is the plan to keep merging bpf-next?

Early RCs tend to be pretty rough and pulling them in doesn't really give us any practical benefits.

The practical benefit is that we're able to use and test our code against the latest "release" to uncover any issues; especially for bpf changes. I agree though that it's more convenient if we can be testing new features on top of a kernel we have confidence in, and that seems like the more reasonable thing to optimize for.

Tracking the latest stable release is likely more useful for most users including distros, which doesn't mean the devel branch is going to track stable kernels and no matter what we do there's going to be some manual work to cut and maintain kernel releases. The overall amount of work may be lower if we maintain -rcX kernel as separate releases rather than always pulling the into the main dev branch.

Hmm, I think I'm a bit confused. If distros are tracking the latest stable release, then this should have no impact on them, correct? That said, I'm fine with not rebasing against rc kernels if that's your preference. I was originally going to cut a 6.8-rc1 release in https://github.com/sched-ext/scx-kernel-releases, but decided to do this first to save whomever cuts the next release the trouble of dealing with the merge conflicts.

@htejun
Copy link
Collaborator

htejun commented Jan 23, 2024

I was under the impression that we were going to be moving away from pulling bpf/for-next for the reason you mentioned? I assumed we'd just be tracking Linus' branch and doing merges as needed. Is the plan to keep merging bpf-next?

I don't think we're moving way from basing on bpf/for-next necessarily. It just hasn't been necessary to pull in bpf/for-next for a while. The main reason we've been pulling bpf/for-next is because there often were something that we needed in bpf/for-next. Other than that, for the devel branch, we don't want to straggle too far behind linus#master while staying more or less stable.

There isn't a black-or-white answer to how close we want to track linus#master but there are costs associated with both tracking too close and falling behind too far.

Too close, we introduce instability to devel tree without actually getting much practical value out of it. If there are issues in linus#master, our dev tree experiencing the same issues is unlikely to be meaningful benefit either us or upstream. Another cost associated with following too close is higher cost of maintaining stable release kernel. As long as we're based on the latest release kernel, we can just keep pulling in changes. Once we move off to an rc, it isn't as simple anymore.

Lagging too far behind can bring surprises too late and if fall behind the latest release kernel, prepping kernels for distros becomes cumbersome too.

While not black and white, we can come up with rules of thumb. We don't want to fall behind fully released kernels and we don't want to pull in early rc's either. It likely makes the most sense to pull in later rc's which are close to release - say, rc5+ or so. At that point, both linus#master and -stable for the released kernel are a lot more stable, so the costs of both pulling in the new kernel and maintaining stable releases are a lot lower than earlier during the cycle.

Of course, if bpf/for-next has a feature that we want or some -rc release contains a necessary bug fix, we pull that in and eat the cost.

Tracking the latest stable release is likely more useful for most users including distros, which doesn't mean the devel branch is going to track stable kernels and no matter what we do there's going to be some manual work to cut and maintain kernel releases. The overall amount of work may be lower if we maintain -rcX kernel as separate releases rather than always pulling the into the main dev branch.

Hmm, I think I'm a bit confused. If distros are tracking the latest stable release, then this should have no impact on them, correct? That said, I'm fine with not rebasing against rc kernels if that's your preference. I was originally going to cut a 6.8-rc1 release in https://github.com/sched-ext/scx-kernel-releases, but decided to do this first to save whomever cuts the next release the trouble of dealing with the merge conflicts.

The cost for us is maintaining the scx-kernel-releases repo. If we want to maintain -rc releases too, I think it'd be better to keep them in scx-kernel-releases. As long as the fixes are in a few commits, forward carrying them with cherry-picks shouldn't be too bad and is easier to figure out than the other way around.

@arighi
Copy link
Collaborator

arighi commented Jan 23, 2024

On a similar topic, can we create a sched_ext-ci branch (or sched_ext-stable, or similar), so that the scx CI can use this branch, to make sure a regression in the kernel doesn't trigger failures in the schedulers?

(And obviously for the kernel CI, instead, we will still use the kernel from the latest push/pull request)

@Byte-Lab
Copy link
Collaborator Author

Ack, will close this for now. We can rebase onto linus#master when 6.8 is released. For now I'll just do the rc releases on scx-kernel-releases as Peter Jung (@ptr1337) requested for CachyOS.

@Byte-Lab Byte-Lab closed this Jan 23, 2024
@Byte-Lab Byte-Lab deleted the rebase_68 branch March 14, 2024 18:52
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.