Add large code model information. #388

Merged
merged 5 commits into from
Aug 9, 2024

Conversation

kuanlinchentw

Hi,

This PR adds a description of the large code model.
I was wondering whether we need a large+fpic model.
In general, the position-independent code model puts external symbol addresses into the GOT.
Is there any case where the GOT has to be laid out more than ±2 GiB away from the code?

@kito-cheng kito-cheng requested review from jrtc27 and kito-cheng June 27, 2023 02:39
@kuanlinchentw kuanlinchentw force-pushed the master branch 2 times, most recently from 4c454df to a902324 Compare September 27, 2023 03:10
@rui314
Collaborator

rui314 commented Sep 27, 2023

I think I'd prefer to define a set of relocations to materialize a 64-bit address with four instructions and let the linker relax it to one to three instructions depending on the offset to the materialized address. That approach is easier to implement than the address pool and doesn't need a writable text segment.

I'd also think it could be faster than reading addresses from the address pool because 1) the processor could fuse 3 or 4 instructions into a single macro-op, and 2) loading an address from the address pool is just a waste of resources if the materialized address happens to be not too far from PC.

@kito-cheng
Collaborator

@rui314 I am not sure we can generate an arbitrary 64-bit address within four instructions. Would you mind sharing the instruction sequence?

@rui314
Collaborator

rui314 commented Sep 27, 2023

@kito-cheng Apologies, we can't materialize a 64-bit value with four instructions in RISC-V. We actually need six instructions to, for example, load a value from an arbitrary 64-bit address as follows:

lui   t0, <highest20>
addi  t0, t0, <higher12>
slli  t0, t0, 32
auipc t1, <hi20>
add   t1, t1, t0
ld    t1, <lo12>(t1)

which can be relaxed to the following 5 instructions if the symbol is within ±2^44 bytes:

addi    t0, zero, <higher12>
c.slli  t0, 32
auipc   t1, <hi20>
add     t1, t1, t0
ld      t1, <lo12>(t1)

and of course to the following two instructions if it's within ±2GiB.

auipc   t1, <hi20>
ld      t1, <lo12>(t1)

It looks to me that the RISC-V psABI's design choice to allow the linker to shrink the section really shines for this use case.
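
The field split behind the six-instruction sequence can be checked with a small model. What follows is a minimal Python sketch, not taken from any toolchain: the function names `split_offset`, `materialize`, and `sequence_length` are hypothetical, and the field names simply follow the placeholders in the comment above. It assumes RV64 semantics in which `lui` and `auipc` sign-extend their 32-bit result and `addi` and `ld` take a signed 12-bit immediate. The thresholds in `sequence_length` are whatever falls out of that arithmetic, so they are an assumption of this model rather than a statement of the proposed relaxation rules.

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def upper_part(offset):
    """Return (upper, hi20, lo12) such that
    offset == (upper << 32) + sext(hi20 << 12, 32) + lo12."""
    lo12 = sext(offset, 12)
    rest = offset - lo12                  # now a multiple of 4096
    hi20 = (rest >> 12) & 0xFFFFF
    upper = (rest - sext(hi20 << 12, 32)) >> 32
    return upper, hi20, lo12

def split_offset(offset):
    """Split a PC-relative offset into the four immediates of the sequence."""
    upper, hi20, lo12 = upper_part(offset)
    higher12 = sext(upper, 12)
    highest20 = ((upper - higher12) >> 12) & 0xFFFFF
    return highest20, higher12, hi20, lo12

def materialize(pc, highest20, higher12, hi20, lo12):
    """Model the register values and return the address used by the final ld."""
    t0 = sext(highest20 << 12, 32) & MASK64      # lui   t0, highest20
    t0 = (t0 + higher12) & MASK64                # addi  t0, t0, higher12
    t0 = (t0 << 32) & MASK64                     # slli  t0, t0, 32
    t1 = (pc + sext(hi20 << 12, 32)) & MASK64    # auipc t1, hi20
    t1 = (t1 + t0) & MASK64                      # add   t1, t1, t0
    return (t1 + lo12) & MASK64                  # ld    t1, lo12(t1)

def sequence_length(offset):
    """Number of instructions needed in this model: 2, 5, or 6."""
    upper, _, _ = upper_part(offset)
    if upper == 0:
        return 2          # auipc + ld is enough
    if -2048 <= upper < 2048:
        return 5          # upper part fits a single addi from zero
    return 6
```

For example, `materialize(pc, *split_offset(target - pc))` should give back `target` for any 64-bit `pc` and `target`, which is a quick way to convince oneself that the sign-extension carries between the fields are handled consistently.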

@jrtc27
Collaborator

jrtc27 commented Sep 27, 2023

Creating new ABIs that only support position-dependent code seems like a bit of a questionable thing to be doing in this day and age

@kuanlinchentw
Author

kuanlinchentw commented Sep 28, 2023

I think using a constant pool for the large model doesn't cost that much, because the compiler can use anchors to tag variables and load each variable by its offset from the anchor.
Example:
If we want the values of global variables A and B, we don't have to load a constant pool entry twice, once for A and once for B.

auipc   t0, hi20(.LC0)
ld      t1, lo12(.LC0)(t0)
lw      a4, 0(t1)
lw      a0, 4(t1)
.LC0:
       .dword  .LANCHOR0
       .bss
       .set    .LANCHOR0,. + 0
a:
       .zero   4
b:
       .zero   4
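
The saving from anchors in the example above can be put into numbers with a rough model. This is a sketch under simplifying assumptions, not a description of any compiler: it assumes each pool entry is loaded once per function with `auipc` + `ld`, that every variable offset fits a 12-bit load immediate, and that each access costs one load. The names `pool_entries` and `instruction_count` are hypothetical.

```python
def pool_entries(accesses, use_anchor):
    """accesses: list of (section, offset) pairs for the globals a function reads.
    Returns the set of 8-byte constant pool entries the function needs."""
    if use_anchor:
        # One anchor per section; variables are reached by offset from it.
        return {section for section, _ in accesses}
    # Without anchors, every distinct variable needs its own pool entry.
    return set(accesses)

def instruction_count(accesses, use_anchor):
    """auipc + ld per pool entry, plus one load per variable access."""
    return 2 * len(pool_entries(accesses, use_anchor)) + len(accesses)

# The two 4-byte variables a and b from the example, both in .bss.
example = [(".bss", 0), (".bss", 4)]
```

With `example`, the anchored form needs one pool entry and four instructions, matching the `auipc`/`ld`/`lw`/`lw` listing, while the unanchored form needs two entries and six instructions.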

@rui314
Collaborator

rui314 commented Sep 28, 2023

I know there are many extremely large programs out there that might already need the large code model, but to my knowledge, most of these programs are server-side and run in datacenters. They naturally need to be built as position-independent executables, and their text segments need to be read-only (or execute-only if possible). This made me wonder about your motivation to define a position-dependent-only ABI in the first place.

So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

@MaskRay
Collaborator

MaskRay commented Sep 28, 2023

I have some notes about large code models in aarch64/powerpc64/x86-64: https://maskray.me/blog/2023-05-14-relocation-overflow-and-code-models#aarch64-code-models

I know that certain JIT programs may use large code models, possibly just the position-dependent form.

I think using a constant pool for the large model doesn't cost that much, because the compiler can use anchors to tag variables and load each variable by its offset from the anchor.

Agree.

For server side large x86-64 applications, they can use the medium code model. This larger range makes it unlikely for AArch64 to encounter relocation overflow issues before the binary becomes excessively oversized for x86-64.

@aswaterman
Contributor

to my knowledge, most of these programs are server-side and run in datacenters. They naturally need to be built as position-independent executables, and their text segments need to be read-only

Without commenting on the merits of this particular code model, I'll remark that there is a distinct and very real use case: RV64 embedded systems, which might not consume that much memory in total but need to cope with a sparse address space. The text/rodata might be separated by gigabytes from the absolute-addressed I/O, and there might be multiple regions of each. There's no virtual memory, so it isn't possible to remap the relevant regions to improve virtual spatial locality.
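
To make the sparse-address-space case concrete, here is a small sketch that checks which regions of a memory map can reach each other with a single `auipc` plus 12-bit immediate pair. The memory map is entirely made up for illustration, and `within_pcrel_range` is a hypothetical helper; the range bounds follow from a sign-extended 20-bit upper immediate plus a signed 12-bit lower immediate.

```python
def within_pcrel_range(pc, target):
    """True if target is reachable from pc with auipc + a signed 12-bit immediate."""
    offset = target - pc
    return -(1 << 31) - (1 << 11) <= offset < (1 << 31) - (1 << 11)

# Hypothetical RV64 embedded memory map (addresses are assumptions).
memory_map = {
    "flash": 0x2000_0000,
    "dram": 0x8000_0000,
    "mmio": 0x40_0000_0000,
}

def unreachable_from(region):
    """Names of regions whose base cannot be addressed PC-relatively from `region`."""
    pc = memory_map[region]
    return sorted(name for name, base in memory_map.items()
                  if not within_pcrel_range(pc, base))
```

In this made-up map, code in `flash` can reach `dram` but not `mmio`, even though the total amount of memory in use is small.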

@kuanlinchentw
Author

kuanlinchentw commented Oct 2, 2023

Actually, using constant pools as the large code model can still produce position-independent executables. It only needs the static linker to leave dynamic relocations so that the loader or the memory manager can add the offset when the executable is remapped.
In my first comment, I was just wondering whether there is a real case where we need large+fpic.
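
The loader step described here can be sketched as follows. This is a toy model, assuming the pool entries are covered by relative dynamic relocations in the style of `R_RISCV_RELATIVE` (load address = load bias + link-time address); the data layout and the name `apply_relative_relocs` are hypothetical.

```python
MASK64 = (1 << 64) - 1

def apply_relative_relocs(pool, reloc_offsets, load_bias):
    """pool: dict mapping pool-entry offset -> link-time address stored there.
    reloc_offsets: offsets of entries that carry a relative relocation.
    Returns a new pool with the load bias added to each relocated entry."""
    relocated = dict(pool)
    for offset in reloc_offsets:
        relocated[offset] = (pool[offset] + load_bias) & MASK64
    return relocated

# Two address entries and one plain constant that must not be adjusted.
link_time_pool = {0x00: 0x1_0000, 0x08: 0x2_0000, 0x10: 42}
loaded_pool = apply_relative_relocs(link_time_pool, [0x00, 0x08], 0x55_5555_0000)
```

The code that reads the pool stays unchanged after relocation, which is why the text segment itself needs no patching in this scheme.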

@jrtc27
Collaborator

jrtc27 commented Oct 2, 2023

Yes, constant pools are equivalent to a hand-rolled GOT.

@kuanlinchentw
Author

Yes, constant pools are equivalent to a hand-rolled GOT.

Yes. It's a nice description. Thanks.

@rui314
Collaborator

rui314 commented Oct 2, 2023

So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

I think I'm still waiting for a response to this comment...

@kito-cheng
Collaborator

kito-cheng commented Oct 2, 2023

lui   t0, <highest20>
addi  t0, t0, <higher12>
slli  t0, t0, 32
auipc t1, <hi20>
add   t1, t1, t0
ld    t1, <lo12>(t1)

Can we use lui rather than auipc? I think using lui throughout would make it easier to share the high part (the first five instructions), which should let the compiler share one high part between different low parts.

With auipc we must either force the whole instruction sequence to stay together or add a relocation that lets the last instruction point to the auipc instruction, like PCREL_LO12_*.

So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

I think I'm still waiting for a response to this comment...

I was involved in the design and implementation of this code model when I was still a colleague of @kuanlinchentw, so I guess I can give a few details from my brain dump. The design comes with several advantages: 1) it is simple to implement, because the implementation can be borrowed from AArch64 :P, and 2) NO new relocation is required.

However, the disadvantages are obvious: 1) every address needs to be loaded from the constant pool, and 2) the pool has duplicated entries.
But we think these disadvantages can be ignored in most use cases of the large code model, since it is mostly used in MMU-less situations, and we also have the ePIC proposal, which could address some special use cases in the embedded world.

IIRC, the long instruction sequence scheme has also been discussed before somewhere (publicly?), but it comes with more implementation overhead: new relocations and new linker relaxation. Also, the psABI TG didn't exist at that time, so we were trying to avoid touching the psABI as much as possible.

@kuanlinchentw
Author

So, before diving into the details, I think we need to take a step back and start by understanding the context of this change. I'd like to understand your motivation, explore potential alternative specifications, and learn why you believe this is the best way to achieve the goal.

I think I'm still waiting for a response to this comment...

As @kito-cheng mentioned, it's easy to implement from the compiler's point of view, and it doesn't require modifying binutils.
For the compiler, each variable access can be a dependent load instruction after the anchor value is set.
This avoids using lots of pseudo instructions that may not be scheduled apart.
We may have considered using a set of relocations to materialize a 64-bit address before.
But there is a trade-off between the compiler scheduler and linker relaxation.
If the compiler expands the instruction sequence in order to schedule it, it's hard for the linker to relax.
Even if the linker can recognize the sequence and relax it, the deleted instructions may affect the scheduling result.
As for the disadvantages @kito-cheng mentioned, I think they are still an issue.
Obviously, it wastes space on redundant entries. Maybe the compiler can generate mergeable constant sections to reduce the harm.
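
The effect of merging duplicated pool entries can be illustrated with a toy model. This sketch treats each pool entry as a symbol name and deduplicates across object files; it is only an illustration of the space saving. How a real linker would merge entries that carry relocations is outside this sketch, and `merge_pools` is a hypothetical name.

```python
ENTRY_SIZE = 8  # one 64-bit address per pool entry

def merge_pools(object_pools):
    """object_pools: one list of referenced symbol names per object file.
    Returns a dict mapping each symbol to its offset in the merged pool."""
    merged = {}
    for pool in object_pools:
        for symbol in pool:
            if symbol not in merged:
                merged[symbol] = len(merged) * ENTRY_SIZE
    return merged

pools = [["a", "b"], ["b", "c"], ["a"]]
unmerged_bytes = sum(len(pool) for pool in pools) * ENTRY_SIZE
merged_bytes = len(merge_pools(pools)) * ENTRY_SIZE
```

Here five entries (40 bytes) shrink to three (24 bytes) once duplicates across the three objects are folded together.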

@rui314
Collaborator

rui314 commented Oct 3, 2023

If no new feature is required for it, what's the point of adding a new section to the psABI document for it? Does the AArch64 psABI have a section for their counterpart?

@kuanlinchentw
Author

If no new feature is required for it, what's the point of adding a new section to the psABI document for it? Does the AArch64 psABI have a section for their counterpart?

It needs a new code model option, just like medany, and different code generation.
Yes. AArch64 defines small, kernel, medium and large models, and there is a section about code models.

@rui314
Collaborator

rui314 commented Oct 3, 2023

I couldn't find a section in https://github.com/ARM-software/abi-aa/blob/844a79fd4c77252a11342709e3b27b2c9f590cf1/aaelf64/aaelf64.rst about how to use a constant pool to load an object's address from memory. Could you share the URL?

@kuanlinchentw
Author

@rui314
Collaborator

rui314 commented Oct 3, 2023

And which code model? It looks like the "large" code model in the AArch64 psABI is different from this proposal because the AArch64's large code model requires that GOT is within 2 GiB from the text segment and seems like addresses are read from GOT.

@kuanlinchentw
Author

And which code model? It looks like the "large" code model in the AArch64 psABI is different from this proposal because the AArch64's large code model requires that GOT is within 2 GiB from the text segment and seems like addresses are read from GOT.

I think you can find an example at https://github.com/ARM-software/abi-aa/blob/2982a9f3b512a5bfdc9e3fea5d3b298f9165c36b/sysvabi64/sysvabi64.rst#get-the-address-of-a-symbol-defined-in-the-same-elf-file

I think the GOT distance there refers to the literal pool, not the normal GOT, because it doesn't support PIC.
[Three images attached]

@rui314
Collaborator

rui314 commented Oct 3, 2023

If "GOT" in the documentation doesn't mean the .got section, that's super confusing, but if that's the case, that's their problem and not ours. Thank you for pointing that out.

@kito-cheng
Collaborator

Some comments from the last LLVM sync meeting:

The constant pool and the long instruction sequence each have their own use cases, so we may allow both schemes and let the user choose between them with an option; the same goes for function calls.

Also some other comments from the last psABI call:

We didn't (officially) reserve intra-procedure-call scratch registers. AArch64 lists r16 and r17 as IP0 and IP1 and explicitly says they may be clobbered during a procedure call. That might be an issue when we implement range extension thunks.

However, we actually already use t0, t1, t2 and t3 in the PLT, so we could use the same set of registers to implement that, and we should then specify it explicitly in the psABI. The only concern is that it will look like an incompatible ABI change, but this is less risky since it's kind of de facto behavior.

@jrtc27
Collaborator

jrtc27 commented Dec 21, 2023

Some comments from the last LLVM sync meeting:

The constant pool and the long instruction sequence each have their own use cases, so we may allow both schemes and let the user choose between them with an option; the same goes for function calls.

Also some other comments from the last psABI call:

We didn't (officially) reserve intra-procedure-call scratch registers. AArch64 lists r16 and r17 as IP0 and IP1 and explicitly says they may be clobbered during a procedure call. That might be an issue when we implement range extension thunks.

However, we actually already use t0, t1, t2 and t3 in the PLT, so we could use the same set of registers to implement that, and we should then specify it explicitly in the psABI. The only concern is that it will look like an incompatible ABI change, but this is less risky since it's kind of de facto behavior.

No; using custom calling conventions within an object has always been allowed (and that’s a thing that’s done across architectures), but range extension thunks clobbering registers that weren’t previously reserved for it would break that. It’s only safe to do in the PLT case because people know PLTs exist and they need to be careful.

@kito-cheng
Collaborator

No; using custom calling conventions within an object has always been allowed (and that’s a thing that’s done across architectures), but range extension thunks clobbering registers that weren’t previously reserved for it would break that. It’s only safe to do in the PLT case because people know PLTs exist and they need to be careful.

Yeah, fair enough. So I think we should move forward without range extension thunks, then extend that later with the necessary changes (e.g. adding a new tag) if needed.

Collaborator

@kito-cheng kito-cheng left a comment


I intend to move this forward and then extend it later, e.g. by adding the long instruction sequence scheme. One concern is that this will require new relocations and extra implementation work, so it should be split into a separate step to prevent this from being stuck here too long.

For now, I think it would be great to add a note like: "NOTE: We intend to extend the large code model with different code generation strategies in the future." to mention that we will add the long instruction sequence scheme in the future; range extension thunks may also be included in the future.

@kito-cheng
Collaborator

ping @MaskRay @rui314, would you like to give your blessing to move this forward?

@sorear
Collaborator

sorear commented Feb 18, 2024

My biggest concern here is that we're allocating the name "large" and creating a compatibility promise for a short-term code model. If in the future we have a fully designed large model, gcc won't be able to switch to it for -mcmodel=large because that will regress functionality for anyone with an old binutils, so the new, better code model will be stuck with a worse name.

@kito-cheng There is a fourth option - use a real GOT. RISC-V does not have a meaningful concept of a GOT base, so there's nothing forcing the GOT to be contiguous; interleave text and GOT in 4 GiB chunks to support GOTPCREL_HI20 relocations in the large model. Obviously this won't work if you're generating a.out and need RX and RW memory to be a single contiguous range each, but it should work for ELF.

I'm a strong supporter of range extension thunks and implemented them for the riscv Go linker a while ago. Ideally we would support them with both 4-byte and 8-byte call sites, which means we need a new relocation type JAL_THUNK anyway, so adding CALL_THUNK might not be so bad.
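
The decision a linker faces at each call site can be sketched as a simple range check. This is a hypothetical illustration and does not model the proposed `JAL_THUNK` or `CALL_THUNK` relocations: `call_kind` is a made-up name, the `jal` reach is the signed 21-bit offset of the base ISA, and the `auipc` + `jalr` reach is the signed 20-bit upper immediate combined with a signed 12-bit lower immediate.

```python
JAL_REACH = 1 << 20  # jal: signed 21-bit byte offset, i.e. ±1 MiB

def call_kind(site, target):
    """Classify how a call at address `site` could reach `target`."""
    offset = target - site
    if -JAL_REACH <= offset < JAL_REACH:
        return "jal"                      # 4-byte call site
    if -(1 << 31) - (1 << 11) <= offset < (1 << 31) - (1 << 11):
        return "auipc+jalr"               # 8-byte call site
    # Out of direct reach: route through a thunk placed near the call site.
    return "thunk"
```

A thunk in the last case would need a scratch register to build the far address, which is where the concern about clobbering registers that were never reserved for this purpose comes in.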

@sorear sorear mentioned this pull request Feb 20, 2024
@kivoimusa

kivoimusa commented Feb 21, 2024

I think I can post this here briefly:
I am running Ubuntu 20.04 LTS on an AMD 64-bit processor and I got a compiler error when executing my linked embedded Python in the caffe framework. The compiler tells me to recompile with -fPIC. This causes memory relocation and I don't know why the linker and the compiler are failing to use a linked static library.
My caffe build is a large code base of more than 32GB. The program is built from source as per the manual. I have tried to look for solutions on Stack Overflow with no success.
I would really appreciate your assistance.
Kivoi Musa

@jrtc27
Collaborator

jrtc27 commented Feb 21, 2024

I think I can post this here briefly: I am running Ubuntu 20.04 LTS on an AMD 64-bit processor and I got a compiler error when executing my linked embedded Python in the caffe framework. The compiler tells me to recompile with -fPIC. This causes memory relocation and I don't know why the linker and the compiler are failing to use a linked static library. My caffe build is a large code base of more than 35GB. The program is built from source as per the manual. I have tried to look for solutions on Stack Overflow with no success. I would really appreciate your assistance. Kivoi Musa

This is the specification for the RISC-V instruction set's ABI, and your 64-bit AMD processor is not a RISC-V processor; unless you're cross-compiling for RISC-V (doubtful?) you seem quite lost and this is not the place for this kind of question since it's for a completely different processor instruction set.

qihangkong pushed a commit to rvgpu/llvm that referenced this pull request Apr 18, 2024
Implement large code model for GlobalAddressSDNode, BlockAddressSDNode
and ExternalSymbolSDNode.

See discussion on
riscv-non-isa/riscv-elf-psabi-doc#388.

co-authored by: Kuan-Lin Chen <[email protected]>
@kito-cheng
Collaborator

@sorear

I am inclined to accept the current proposal with optional range extension thunk*1 support. We already have a note saying we may have other code generation strategies, so that leaves us room to add more large code model variants in the future. I am not really comfortable with the multiple-GOT design: it's complicated, and it would be challenging to specify in a customized linker script.

*1 Add a note mentioning that function calls may use the auipc+jalr sequence if the linker supports range extension thunks.

@kito-cheng
Collaborator

I will move forward and merge this PR after the next psABI meeting. The GCC implementation was merged a while ago, and LLVM has also provided a PoC.

Collaborator

@kito-cheng kito-cheng left a comment


LGTM

@kito-cheng kito-cheng merged commit 79fbbc8 into riscv-non-isa:master Aug 9, 2024
4 checks passed
tclin914 added a commit to llvm/llvm-project that referenced this pull request Sep 9, 2024
Implement large code model for GlobalAddressSDNode and ExternalSymbolSDNode.

See discussion on
riscv-non-isa/riscv-elf-psabi-doc#388.

---------

Co-authored-by: Kuan-Lin Chen <[email protected]>
asb added a commit to asb/riscv-c-api-doc that referenced this pull request Sep 10, 2024
With riscv-non-isa/riscv-elf-psabi-doc#388
landed it makes sense to have a define for the large code model for
consistency with medany and medlow.
dlav-sc pushed a commit to dlav-sc/llvm-project that referenced this pull request Sep 10, 2024
Implement large code model for GlobalAddressSDNode and ExternalSymbolSDNode.

See discussion on
riscv-non-isa/riscv-elf-psabi-doc#388.

---------

Co-authored-by: Kuan-Lin Chen <[email protected]>
VitaNuo pushed a commit to VitaNuo/llvm-project that referenced this pull request Sep 12, 2024
Implement large code model for GlobalAddressSDNode and ExternalSymbolSDNode.

See discussion on
riscv-non-isa/riscv-elf-psabi-doc#388.

---------

Co-authored-by: Kuan-Lin Chen <[email protected]>
9 participants