riscv-elf.md: add new definitions for the compact code model #154

Closed
wants to merge 20 commits into from

Contributor

@ebahapo ebahapo commented Jun 19, 2020

Add the basic structures to support the compact code model.

@bluewww

bluewww commented Jun 19, 2020

Typo there
### Meidum

riscv-elf.md Outdated
@@ -441,6 +477,8 @@ rules about 2✕XLEN aligned arguments being passed in "aligned" register pairs.
* EF_RISCV_RVE (0x0008): This bit is set when the binary targets the E ABI.
* EF_RISCV_TSO (0x0010): This bit is set when the binary requires the RVTSO
memory consistency model.
* EF_RISCV_COMPACT (0x0020): This bit is set when the binary targets the
compact code model.
Collaborator

Why do we need an ELF header flag for this? There are a limited number of flag bits, and we will need to be careful not to use them unnecessarily. If we do need a flag, maybe we should set aside a group of bits for a code model field instead of one bit per code model? If we consider Maciej's FDPIC to be a code model, then we have 2 supported and 2 proposed.
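
As a rough illustration of the "group of bits" idea, the sketch below packs a code model into a two-bit field of `e_flags` instead of spending one bit per model. `EF_RISCV_RVE` and `EF_RISCV_TSO` use the values quoted in the hunk above; the field position, mask, and encodings (`EF_CMODEL_SHIFT`, `EF_CMODEL_MASK`, `CMODELS`) are hypothetical and are not defined by the psABI.

```python
# Flag values quoted in the diff hunk above.
EF_RISCV_RVE = 0x0008
EF_RISCV_TSO = 0x0010

# HYPOTHETICAL: a 2-bit code-model field instead of one bit per model.
# The shift, mask, and encodings are invented for illustration only.
EF_CMODEL_SHIFT = 5
EF_CMODEL_MASK = 0b11 << EF_CMODEL_SHIFT
CMODELS = ("medlow", "medany", "compact", "fdpic")


def set_cmodel(e_flags, model):
    """Return e_flags with the hypothetical code-model field set."""
    cleared = e_flags & ~EF_CMODEL_MASK
    return cleared | (CMODELS.index(model) << EF_CMODEL_SHIFT)


def get_cmodel(e_flags):
    """Decode the hypothetical code-model field."""
    return CMODELS[(e_flags & EF_CMODEL_MASK) >> EF_CMODEL_SHIFT]
```

With two bits, the two supported and two proposed models fit in the space that two single-purpose flags would otherwise occupy.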

Contributor

I’m likely to propose something around ROPI/RWPI soon, which will need a bit too (it’s subtly different from both Compact and FDPIC).

On further thought, maybe there should be a bit dedicated to whether GP has to be preserved or not (something Compact and FDPIC both require, and which I don’t expect ROPI/RWPI to need), and maybe the rest of compatibility comes down to the relocations present in the ELF?

Contributor Author

When using LTO, the compiler will have to know what the desired code model is. A case could be made for medlow and medany as well.

Contributor

Ah, that's fair (though the LLVM LTO implementation is going to use module metadata when I get around to it).

Collaborator

FYI medlow and medany can be linked together. The result will be medlow. But we don't track that anywhere. You just get a link error if you try to put medlow code above 0x80000000 in a 64-bit address space.
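
A small sketch of why that link error appears, assuming medlow materializes addresses with an absolute `lui` + `addi` pair and that `lui` sign-extends its 32-bit result on RV64. The helper names are made up for this example.

```python
def split_hi20_lo12(value):
    """Split a signed value into the lui hi20 part and the addi lo12 part."""
    hi20 = (value + 0x800) >> 12      # round so that lo12 lands in [-2048, 2047]
    lo12 = value - (hi20 << 12)
    return hi20, lo12


def medlow_reachable(addr):
    """True if a 64-bit address can be formed by lui + addi on RV64."""
    # Treat the address as a signed 64-bit value, because lui sign-extends.
    signed = addr - (1 << 64) if addr >= (1 << 63) else addr
    hi20, _ = split_hi20_lo12(signed)
    return -(1 << 19) <= hi20 < (1 << 19)   # hi20 must fit in 20 signed bits
```

On this model, `0x80000000` is rejected while `0xFFFFFFFF80000000` is accepted, because the `lui` result is sign-extended.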

Contributor Author

Actually, the small and medium code models should save, change and restore the gp in a DSO as well in order to access globals of local scope with relaxed code.

Collaborator

Anything which temporarily modifies gp is incompatible because signal handlers can contain gp-relative references, and neither Linux, glibc, nor musl will restore the main-program gp before invoking a signal handler. You cannot link anything which writes to gp with medany/medlow code.
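
A toy model of the hazard, not real kernel or libc behavior: `Hart`, `signal_handler`, and `dso_function` are invented names, and memory is a plain dictionary. The handler is written the way relaxed medlow/medany code would be, so it assumes gp still holds the main program's `__global_pointer$`.

```python
class Hart:
    """Toy CPU state: just the gp register and a flat memory."""

    def __init__(self, gp, memory):
        self.gp = gp
        self.memory = memory

    def gp_load(self, offset):
        # A gp-relative load, as produced by linker relaxation.
        return self.memory[self.gp + offset]


def signal_handler(hart):
    # Assumes gp == main program's __global_pointer$.
    return hart.gp_load(8)


def dso_function(hart, dso_gp, deliver_signal):
    saved = hart.gp
    hart.gp = dso_gp                  # temporarily repoint gp at the DSO's data
    # Nothing restores the main-program gp before the handler runs.
    seen = signal_handler(hart) if deliver_signal else None
    hart.gp = saved                   # restored only on the way out
    return seen
```

If the signal lands while gp is repointed, the handler silently reads the DSO's object at offset 8 instead of the main program's.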

Contributor Author

That's a good point. If the signal handlers are built using the compact code model, there should be no problem. But, if built using the small or medium code models, then the only way to avoid any issue would be to disable relaxation for them.

Contributor

@ebahapo In medlow/medany, GP is not currently saved around calls, even for cross-DSO calls. Is this merely because __global_pointer$ is only defined in the main executable (and therefore the gp-relative relaxations won't happen for shared objects), or am I missing something?

My understanding from when we discussed this before, is that in the compact code model, gp has to be saved around calls to extern functions (I think it might also have to be saved around local calls, because you don't know if a local call tail-calls into an external function). I could be confusing this with FDPIC though, so am I missing a detail as to how the compact model works?

Contributor Author

@ebahapo ebahapo Aug 6, 2020

@lenary, that's right, no relaxation for small and medium DSOs because, if they defined their own __global_pointer$, the gp would have to be preserved and set by all dynamic functions.

Actually, only compact DSOs would have to preserve and set the gp. Executables have just one __global_pointer$ and the gp is set at startup as for small and medium executables.

riscv-elf.md Outdated
57 | R_RISCV_32_PCREL | PC-relative reference | _word32_ | S + A - P
58 | R_RISCV_IRELATIVE | Runtime relocation | _wordclass_ | `ifunc_resolver(B + A)`
59 | R_RISCV_64_PCREL | PC-relative reference | _word64_ | S + A - P
60 | R_RISCV_GPREL_HI20 | GP-relative reference | _U-type_ | S + A - GP | `%gprel_hi(symbol)`
Collaborator

Do we have an implementation of the compact model yet? I'm not aware of one. If not, then I think it is premature to allocate relocations for it when we don't even know if it is workable yet.

Contributor Author

No, there is no working prototype yet. Should the numbers be omitted?

Collaborator

I haven't maintained an ABI before, so I don't know if there is existing practice. I would like to avoid holes in the numbering scheme though, and if we don't have a working accepted implementation, then we can't be sure that this list of relocs is sufficient and necessary. Though that suggests that maybe we shouldn't add the reloc list yet, or maybe mark them as proposed and put them in a separate list.

@sorear
Collaborator

sorear commented Aug 14, 2020

Having now read the whole proposal, I rather dislike it. I don't think this is the appropriate solution for its stated objectives. I apologize for the belatedness of this reply.


For its minimal stated objectives, the proposal is unnecessarily complicated and unnecessarily incompatible with existing code. We can define a viable -mcmodel=large that requires zero new relocations, changes to existing relocations, or ELF flags, and is completely linking-compatible with the default code model. (I would argue that this is the essential difference between a "code model" and an "ABI": linking code from two code models succeeds, but produces the more restrictive model.)

You can access data anywhere in the address space using a pc-relative GOT entry. The auipc + load + load sequence requires, when compressed, 10 bytes, the same as the post-relaxation version of this proposal. A -mcmodel=large would allow data to be accessed anywhere by forcing all accesses to programmer-defined data objects to go through GOT references. The performance impact for most programs will be small because most programs access static data very infrequently. Even though the large model is oriented toward access to large amounts of static data, such a large amount of data is likely to be in arrays, and arrays imply loops where the address setup can be hoisted.

Distributions compile most programs with -fpie for security reasons, and the current gcc and clang compilers always generate GOT references for data if -fpie is passed. My proposal is no slower than -fpie code generation currently is. -fpie could be improved (x86_64 generates non-GOT pcrel references), but the fact that it has not been is further evidence that the performance impact is not compelling for most users.

I would rather not spend any time on a relaxation scheme specific to the large model until there is clear and compelling evidence that one is needed.
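
To make the GOT-based access concrete, here is a minimal model of the `auipc` + load + load sequence, assuming RV64 and a 64-bit GOT entry. Memory is a dictionary and `load_via_got` is a hypothetical helper; the point is only that the GOT entry must be within `auipc` reach of the pc, while the data it points to can be anywhere.

```python
# Reach of auipc (hi20) plus a 12-bit signed load offset on RV64.
PCREL_MIN = -(1 << 31) - (1 << 11)
PCREL_MAX = (1 << 31) - (1 << 11) - 1


def load_via_got(memory, pc, got_entry):
    """Model `auipc; ld (GOT entry); ld (data)`."""
    offset = got_entry - pc
    if not PCREL_MIN <= offset <= PCREL_MAX:
        raise OverflowError("GOT entry is out of auipc range")
    data_addr = memory[got_entry]     # first load: absolute 64-bit address
    return memory[data_addr]          # second load: the data itself
```

The second load uses a full 64-bit address taken from the GOT, so the distance between code and data never enters the range check.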

"large" also implies support for code larger than 2^31 bytes. This can be done with linker changes only and does not require psABI or compiler adjustments. The linker merely needs to be taught to generate trampolines for R_RISCV_JAL and R_RISCV_CALL relocations which are out of range, and to generate multiple GOTs if GOTPCREL relocations occur over a sufficiently large range. Trampolines are implemented as auipc; l[wd]; jr — the same as a PLT entry. ld.bfd has support for trampoline generation on several architectures, including arm, and I believe that it supports multiple GOTs on some architectures (possibly powerpc?) but I have not confirmed this. Trampoline support has already been added to the (non-ELF) Go linker.

Binaries with more than 2GB of text are even more of a niche case than binaries with more than 2GB of data, so we do not need to implement this in the linkers immediately. However, this needs to be regarded as a linker limitation and not a code model limitation; you are using the correct large code model, you just need to add missing code to the linker. As such, the code model should be called large.
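
A sketch of the decision the linker would have to make, under the assumption that `R_RISCV_JAL` reaches ±1MiB (a 21-bit signed, even offset) and `R_RISCV_CALL` reaches roughly ±2GiB via `auipc` + `jalr`. `needs_trampoline` is a hypothetical helper, not a function from ld.bfd or lld.

```python
JAL_MIN, JAL_MAX = -(1 << 20), (1 << 20) - 2
CALL_MIN, CALL_MAX = -(1 << 31) - (1 << 11), (1 << 31) - (1 << 11) - 1


def needs_trampoline(reloc, pc, target):
    """True if the branch at pc cannot reach target directly."""
    offset = target - pc
    if reloc == "R_RISCV_JAL":
        return not JAL_MIN <= offset <= JAL_MAX
    if reloc == "R_RISCV_CALL":
        return not CALL_MIN <= offset <= CALL_MAX
    raise ValueError("unsupported relocation: " + reloc)
```

An out-of-range call would then be redirected to a nearby `auipc; l[wd]; jr` stub of the kind described above.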


The proposal as written appears to mix up three things:

  • A true large code model, albeit misnamed

  • Accessing all writable data in the main executable, including the GOT, via gp and __global_pointer$ — in other words, RWPI

  • Accessing data in shared libraries by first using a PC-relative large model access to materialize __global_pointer$ in a register, and then using RWPI-like accesses relative to the materialized value

I believe that both a large code model and RWPI should be implemented, but as they are orthogonal they should be proposed, reviewed, implemented, and enabled separately, not combined in a single "compact model" PR.

I am not convinced at this time that the third part of your proposal makes sense as a thing to do. There is no savings of bytes, instructions, or loads in shared-library functions which access any number of interposable globals or one non-interposable global. Saving loads requires a function to access two non-interposable globals, saving instructions would require a function to access two non-interposable globals which successfully reach the small data region. Both cases are much less likely than a function accessing one global, which is already an uncommon case.

The third piece as proposed also has a major ABI compatibility issue since it uses gp without restoring the main-program value prior to invoking signal handlers or callback functions. However this is easily fixed by using any call-clobbered register instead of gp. The psABI text implies that GPREL_I and GPREL_S relocations can be used with any base register as long as that base register contains __global_pointer$, so it doesn't need to be an ABI definition, it can be allocated by the compiler. This should be made explicit in the psABI.

(I checked what ld.bfd does in this case. ld.bfd has an undocumented feature where GPREL relocations can be used to access addresses near zero, and overwrites the rs1 field with either gp or zero. Since the ld.bfd behavior can be argued to violate the psABI, is undocumented, and is not used by compilers, I would argue that it is a bug in ld.bfd and should be changed. It would of course be possible to define new versions of the GPREL relocations with unambiguous semantics, but the relocation type field is only 8 bits on ELFCLASS32 and they should not be used frivolously.)

@ebahapo
Contributor Author

ebahapo commented Aug 14, 2020

> For its minimal stated objectives, the proposal is unnecessarily complicated and unnecessarily incompatible with existing code. We can define a viable -mcmodel=large that requires zero new relocations, changes to existing relocations, or ELF flags, and is completely linking-compatible with the default code model. (I would argue that this is the essential difference between a "code model" and an "ABI": linking code from two code models succeeds, but produces the more restrictive model.)

This proposal does not attempt to address the situation when both text and data are more than 2GiB in size. Rather, when text and data are more than 2GiB apart. It just so happens that this compact code model also supports data more than 2GiB in size.

> You can access data anywhere in the address space using a pc-relative GOT entry. The auipc + load + load sequence requires, when compressed, 10 bytes, the same as the post-relaxation version of this proposal. A -mcmodel=large would allow data to be accessed anywhere by forcing all accesses to programmer-defined data objects to go through GOT references. The performance impact for most programs will be small because most programs access static data very infrequently. Even though the large model is oriented toward access to large amounts of static data, such a large amount of data is likely to be in arrays, and arrays imply loops where the address setup can be hoisted.

If the GOT is more than 2GiB away from the PC, then the existing relocations overflow and cannot be used to reach it.

> The proposal as written appears to mix up three things:
>
> • A true large code model, albeit misnamed

Truly, a compact code model.

> • Accessing all writable data in the main executable, including the GOT, via gp and __global_pointer$, in other words, RWPI

This is a side effect from leveraging the existing data structures meant for PIC. A welcome one, methinks.

> • Accessing data in shared libraries by first using a PC-relative large model access to materialize __global_pointer$ in a register, and then using RWPI-like accesses relative to the materialized value
>
> I am not convinced at this time that the third part of your proposal makes sense as a thing to do. There is no savings of bytes, instructions, or loads in shared-library functions which access any number of interposable globals or one non-interposable global. Saving loads requires a function to access two non-interposable globals, saving instructions would require a function to access two non-interposable globals which successfully reach the small data region. Both cases are much less likely than a function accessing one global, which is already an uncommon case.

Not sure what your point is. DSOs can use global data with non public scope (non interposable). However, most systems that support DSOs have VM, so this compact code model is not necessary then. Those systems without VM that still support DSOs in unconventional ways would still benefit from this code model.

> The third piece as proposed also has a major ABI compatibility issue since it uses gp without restoring the main-program value prior to invoking signal handlers or callback functions. However this is easily fixed by using any call-clobbered register instead of gp. The psABI text implies that GPREL_I and GPREL_S relocations can be used with any base register as long as that base register contains __global_pointer$, so it doesn't need to be an ABI definition, it can be allocated by the compiler. This should be made explicit in the psABI.
>
> (I checked what ld.bfd does in this case. ld.bfd has an undocumented feature where GPREL relocations can be used to access addresses near zero, and overwrites the rs1 field with either gp or zero. Since the ld.bfd behavior can be argued to violate the psABI, is undocumented, and is not used by compilers, I would argue that it is a bug in ld.bfd and should be changed. It would of course be possible to define new versions of the GPREL relocations with unambiguous semantics, but the relocation type field is only 8 bits on ELFCLASS32 and they should not be used frivolously.)

This is true. For instance, DSOs could use any register to play the role of the "GP". That would address the case of signal handlers.

@sorear
Collaborator

sorear commented Aug 14, 2020

> This proposal does not attempt to address the situation when both text and data are more than 2GiB in size. Rather, when text and data are more than 2GiB apart.

> However, most systems that support DSOs have VM, so this compact code model is not necessary then.

If your goal is to support statically linked programs with code and all writable data more than 2GiB apart, what you want is RWPI, which is a standard concept with well-understood implications and interactions with other standard concepts.

> Those systems without VM that still support DSOs in unconventional ways would still benefit from this code model.

Supporting shared objects in a system without VM means that a single copy of the text can access writable data at several different addresses, which means that the offset from text to writable data cannot be a compile time constant and cannot be loaded from a rodata address. As such your proposal is not useful in such systems. What is useful in such systems is a FDPIC ABI, another standard concept.

@jrtc27
Collaborator

jrtc27 commented Aug 14, 2020

Re multiple GOTs, this is supported for at least MIPS in both BFD and LLD, since MIPS originally used 16-bit offsets in its GOT which quickly overflow, so now each .o gets its own GOT and the linker then optimistically merges them as much as it can.

(There is the optional -mxgot flag to use 32-bit offsets, but due to the TLS entries being required to be at the end of the GOT (MIPS is weird, seriously, its GOT is terribly designed, there are strict ordering requirements) and not having corresponding relocations for getting 32-bit offsets, -mxgot is a bit of a waste of time that solves the problem for code that doesn't use certain types of TLS but does not fully address the problem. Moreover it requires an additional compiler flag, whereas multi-GOT support is an always-on feature of the linker that "just works".)
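
The "optimistically merges" step could look roughly like the greedy first-fit pass below. This is only a sketch of the idea: `merge_gots` and the `max_entries` cap are assumptions for illustration, and it is not the algorithm BFD or LLD actually implement.

```python
def merge_gots(per_object_gots, max_entries):
    """Greedily merge per-object GOTs (sets of symbols) under a size cap."""
    merged = []
    for symbols in per_object_gots:
        for got in merged:
            # Symbols shared between objects only need one entry.
            if len(got | symbols) <= max_entries:
                got |= symbols
                break
        else:
            # No existing GOT has room, so this object starts a new one.
            merged.append(set(symbols))
    return merged
```

Each resulting set stands for one GOT that is small enough to be addressed with the short offset.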

@ebahapo
Contributor Author

ebahapo commented Aug 17, 2020

> > Those systems without VM that still support DSOs in unconventional ways would still benefit from this code model.
>
> Supporting shared objects in a system without VM means that a single copy of the text can access writable data at several different addresses, which means that the offset from text to writable data cannot be a compile time constant and cannot be loaded from a rodata address. As such your proposal is not useful in such systems. What is useful in such systems is a FDPIC ABI, another standard concept.

As far as I understand, this proposal shares structures used by the FDPIC proposal and does not interfere with it.

@sorear
Collaborator

sorear commented Aug 17, 2020

> As far as I understand, this proposal shares structures used by the FDPIC proposal and does not interfere with it.

Yes, the RWPI part shares a lot of relocations with FDPIC.

However, you're claiming that this proposal is useful for nommu shared libraries and I am having trouble understanding that. If __global_pointer$ is located at a fixed location relative to pc, and pc is in ROM or a shared text area, then __global_pointer$ must also be in a shared area, which prevents a single library loaded into several processes from having different data in each process.

Conversely, if you're considering systems that do not have multiple processes and text sharing, this is effectively the loadable kernel module situation and it can be handled using plain RWPI and a loader that works at the .o level.

What am I missing?

@ebahapo
Contributor Author

ebahapo commented Aug 17, 2020

> However, you're claiming that this proposal is useful for nommu shared libraries and I am having trouble understanding that. If __global_pointer$ is located at a fixed location relative to pc, and pc is in ROM or a shared text area, then __global_pointer$ must also be in a shared area, which prevents a single library loaded into several processes from having different data in each process.

On the contrary, without an MMU, this proposal is not enough for DSOs. I did add an example of how they could be supported, by preserving and setting the gp, until the issue with signal handlers was pointed out, but that is not its goal. I do believe that it could work if signal handlers are forbidden from using the gp, but the goal of this proposal is to allow code to reside far away from RW data, and it can live without supporting DSOs, or with another register used instead of the gp register.

ebahapo and others added 16 commits January 8, 2021 10:56
Change the relocations table to include the respective calculations and organize the information in specific columns.
Use the instruction types per the current v2.2 ISA spec.
Add the missing information for the relocations intended primarily for DWARF records.
Fix the calculation of `R_RISCV_RVC_LUI`.
Despite its name it's PC-relative.
Add a brief description of the existing code models.
R_RISCV_RELAX applies to previous reloc not to a pair of instructions.
R_RISCV_CALL and R_RISCV_CALL_PLT are now interchangeable.
Add the basic structures to support the compact code model.
Add the relocation type R_RISCV_GPREL_STORE.
The compact code model can be linked with other code models, so the ELF
header is probably not the best place for this information to reside.  For
LTO purposes, this information can be preserved in metadata.

Additionally, remove some typos.
Add the TLS relocations for the compact code model.
Add the PLT entries for the compact code model.
@ebahapo
Contributor Author

ebahapo commented Jan 23, 2021

Ping, please.

riscv-elf.md Outdated
56 | R_RISCV_SET32 | Local label assignment | _word32_ | S + A
57 | R_RISCV_32_PCREL | PC-relative reference | _word32_ | S + A - P
58 | R_RISCV_IRELATIVE | Runtime relocation | _wordclass_ | `ifunc_resolver(B + A)`
59 | R_RISCV_64_PCREL | PC-relative reference | _word64_ | S + A - P
Collaborator

Unrelated?

Contributor Author

It's used in the first PLT entry for the compact code model.

riscv-elf.md Outdated
```
1: auipc t0, %hi_pcrel(2f) # address of 2f
addi t0, %lo_pcrel(1b)
ld t2, (t0) # difference between .got.plt - 2f
```
Collaborator

Why isn't this inlined?

Contributor Author

Please, elaborate.

riscv-elf.md Outdated
```
1: lui t3, %hi([email protected] - .got.plt) # offset to the function pointer
addi t3, %lo([email protected] - .got.plt)
jal t1, [email protected]
```
Collaborator

Wouldn't this all be a lot simpler if you just required that gp be valid on entry? Then you can just do a GP-relative load and look much more like the non-compact models. What is the reasoning for doing it this way?

Contributor Author

Shared objects do not have a valid gp set.

riscv-elf.md Outdated
jr t3
nop
nop
nop
Collaborator

The number of nops changes based on the pointer size.

Contributor Author

The compact code model does not apply to RV32, unless you mean RV128.

and fills in the GOT entry for subsequent calls to the function:
For the compact code model, the third entry in the PLT has a stub that
calculates the absolute address of a function pointer in the GOT.
It occupies three 16 byte entries:
Collaborator

Not enough space for RV128 with the current scheme, though arguable whether it's a meaningful combination.

Evandro Menezes added 2 commits January 29, 2021 18:27
Add examples for the TLS pseudo instructions.
@ebahapo
Contributor Author

ebahapo commented Feb 4, 2021

Ping, pretty please.

@jrtc27
Collaborator

jrtc27 commented Feb 4, 2021

Pinging isn't going to help the fact that this is adding a whole new ABI that needs to go through thorough analysis before being declared official, especially when there are outstanding concerns described many months ago.

@ebahapo
Contributor Author

ebahapo commented Feb 4, 2021

What concerns do you believe are outstanding and how would you suggest this be analyzed?

We have a prototype downstream and I'd be glad to inquire about sharing our results.

@kito-cheng
Collaborator

kito-cheng commented Apr 14, 2021

Our colleagues have reported that their memory layout is:

  • rodata and data are placed close together
  • bss is placed very far from rodata and data

And it seems the compact code model can't resolve this issue; maybe we need a real large code model...

@ebahapo
Contributor Author

ebahapo commented Apr 14, 2021

The cost of a large code model is... large. When all data is up to 2GiB in size and shares the same 2GiB range, the cost can be smaller, as in this compact code model.

@jrtc27
Collaborator

jrtc27 commented Apr 14, 2021

However, such large applications are exceedingly rare and a compact code "model" is a whole new ABI not a code model due to changing how PLTs and GP work, and thus needs much more of a reason to exist than a large code model. Distributions aren't going to be shipping two sets of libraries, for example.

@kito-cheng
Collaborator

I know the cost of a large code model is large. Most of the demand for a large code model comes from bare-metal environments without an MMU; in such situations they don't have much choice. Of course the best solution is changing the memory layout of the hardware platform, but it's generally hard to ask the hardware guys to change things... :P

@kito-cheng
Collaborator

A few more words about the large code model: in my experience, most users who need a large code model need it not because the program is too large, but because of the platform/hardware memory layout.

But that might be my bias, since most of my working experience is in the embedded world.

@ebahapo
Contributor Author

ebahapo commented Apr 15, 2021

> However, such large applications are exceedingly rare and a compact code "model" is a whole new ABI not a code model due to changing how PLTs and GP work, and thus needs much more of a reason to exist than a large code model. Distributions aren't going to be shipping two sets of libraries, for example.

I think that this proposal does not change how either the PLT or the GP works. It definitely expands how they work, while supporting the current way in which they work. In other targets, code models have been supported very much in the same way as this.

As @kito-cheng pointed out, the current code models do not support bare metal embedded applications without MMU, such as code ROM far away from data RAM, for whatever hardware reason. Distributions usually don't address embedded systems and embedded systems do rebuild the libraries to fit their needs, so what distributions do is tangential to the needs of embedded applications.

@jrtc27
Collaborator

jrtc27 commented Apr 15, 2021

> > However, such large applications are exceedingly rare and a compact code "model" is a whole new ABI not a code model due to changing how PLTs and GP work, and thus needs much more of a reason to exist than a large code model. Distributions aren't going to be shipping two sets of libraries, for example.
>
> I think that this proposal does not change how either the PLT or the GP works. It definitely expands how they work, while supporting the current way in which they work. In other targets, code models have been supported very much in the same way as this.
>
> As @kito-cheng pointed out, the current code models do not support bare metal embedded applications without MMU, such as code ROM far away from data RAM, for whatever hardware reason. Distributions usually don't address embedded systems and embedded systems do rebuild the libraries to fit their needs, so what distributions do is tangential to the needs of embedded applications.

If the desire is to support embedded systems then that would be better off as part of the EABI, not the Unix psABI.

@kito-cheng
Collaborator

Closing due to lack of updates; the large code model (#388) and ePIC (#343) are ongoing now.

@kito-cheng kito-cheng closed this Feb 15, 2024
@ebahapo ebahapo deleted the patch-compact branch February 24, 2024 21:00