riscv-elf.md: add new definitions for the compact code model #154
Typo there
riscv-elf.md
@@ -441,6 +477,8 @@ rules about 2✕XLEN aligned arguments being passed in "aligned" register pairs.
* EF_RISCV_RVE (0x0008): This bit is set when the binary targets the E ABI.
* EF_RISCV_TSO (0x0010): This bit is set when the binary requires the RVTSO
  memory consistency model.
* EF_RISCV_COMPACT (0x0020): This bit is set when the binary targets the
  compact code model.
Why do we need an ELF header flag for this? There are a limited number of flag bits, and we will need to be careful not to use them unnecessarily. If we do need a flag, maybe we should set aside a group of bits for a code model field instead of one bit per code model? If we consider Maciej's FDPIC to be a code model, then we have 2 supported and 2 proposed.
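To make the "group of bits" suggestion concrete, here is a minimal Python sketch of a 2-bit code-model field in `e_flags`, used in place of the single EF_RISCV_COMPACT (0x0020) bit. The field position (bits 5-6), the names `EF_RISCV_CODE_MODEL_SHIFT`/`EF_RISCV_CODE_MODEL_MASK`, and the numbering of the four models are hypothetical and not allocated by the psABI; only EF_RISCV_RVC, EF_RISCV_RVE, and EF_RISCV_TSO are existing flags.

```python
EF_RISCV_RVC = 0x0001
EF_RISCV_RVE = 0x0008
EF_RISCV_TSO = 0x0010

# Hypothetical field: bits 5-6 hold a code-model number (not in the psABI).
EF_RISCV_CODE_MODEL_SHIFT = 5
EF_RISCV_CODE_MODEL_MASK = 0x3 << EF_RISCV_CODE_MODEL_SHIFT  # 0x0060

# Hypothetical numbering: two supported and two proposed code models.
CODE_MODELS = {"medlow": 0, "medany": 1, "compact": 2, "fdpic": 3}
MODEL_NAMES = {v: k for k, v in CODE_MODELS.items()}


def set_code_model(e_flags: int, model: str) -> int:
    """Return e_flags with the code-model field replaced, other bits kept."""
    field = CODE_MODELS[model] << EF_RISCV_CODE_MODEL_SHIFT
    return (e_flags & ~EF_RISCV_CODE_MODEL_MASK) | field


def get_code_model(e_flags: int) -> str:
    """Decode the hypothetical code-model field from e_flags."""
    value = (e_flags & EF_RISCV_CODE_MODEL_MASK) >> EF_RISCV_CODE_MODEL_SHIFT
    return MODEL_NAMES[value]
```

Two bits cover four models while consuming one more flag bit than the single-bit proposal; whether medlow and medany even need distinct values is debatable, since they can be linked together.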
I'm likely to propose something around ROPI/RWPI soon, which will need a bit too (it's subtly different from both compact and FDPIC).
On further thought, maybe there should be a bit dedicated to whether GP has to be preserved or not (something compact and FDPIC both require, and I don't expect ROPI/RWPI to need), and maybe the rest of compatibility is down to the relocations present in the ELF?
When using LTO, the compiler will have to know what the desired code model is. A case could be made for `medlow` and `medany` as well.
Ah, that's fair (though the LLVM LTO implementation is going to use module metadata when I get around to it).
FYI medlow and medany can be linked together. The result will be medlow. But we don't track that anywhere. You just get a link error if you try to put medlow code above 0x80000000 in a 64-bit address space.
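The medlow limit is plain arithmetic. A minimal Python sketch, assuming medlow forms absolute addresses on RV64 with `lui` (sign-extended 20-bit upper immediate) followed by a signed 12-bit `addi` or load/store offset; the helper name `fits_medlow` is hypothetical:

```python
# Absolute addresses reachable by lui + addi on RV64:
# lui yields sign-extended (imm20 << 12); addi adds a signed 12-bit value.
MEDLOW_MIN = -0x80000800  # lui 0x80000 (sign-extended) + (-2048)
MEDLOW_MAX = 0x7FFFF7FF   # lui 0x7ffff + 2047


def fits_medlow(addr: int) -> bool:
    """True if a 64-bit address is reachable by medlow absolute addressing."""
    addr &= (1 << 64) - 1
    if addr >= 1 << 63:
        addr -= 1 << 64  # reinterpret as a signed 64-bit value
    return MEDLOW_MIN <= addr <= MEDLOW_MAX
```

For example, `fits_medlow(0x80000000)` is `False`, which is the link error described above. Medany code addresses PC-relatively instead, so it has no such absolute limit, and a medlow+medany link inherits only the medlow constraint.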
Actually, the small and medium code models should save, change and restore the `gp` in a DSO as well in order to access globals of local scope with relaxed code.
Anything which temporarily modifies gp is incompatible because signal handlers can contain gp-relative references, and neither Linux, glibc, nor musl will restore the main-program gp before invoking a signal handler. You cannot link anything which writes to gp with medany/medlow code.
That's a good point. If the signal handlers are built using the compact code model, there should be no problem. But, if built using the small or medium code models, then the only way to avoid any issue would be to disable relaxation for them.
@ebahapo In medlow/medany, GP is not currently saved around calls, even for cross-DSO calls. Is this merely because `__global_pointer$` is only defined in the main executable (and therefore the gp-relative relaxations won't happen for shared objects), or am I missing something?
My understanding from when we discussed this before is that in the compact code model, `gp` has to be saved around calls to extern functions (I think it might also have to be saved around local calls, because you don't know if a local call tail-calls into an external function). I could be confusing this with FDPIC though, so am I missing a detail as to how the compact model works?
@lenary, that's right, no relaxation for small and medium DSOs because, if they defined their own `__global_pointer$`, the `gp` would have to be preserved and set by all dynamic functions.
Actually, only compact DSOs would have to preserve and set the `gp`. Executables have just one `__global_pointer$` and the `gp` is set at startup as for small and medium executables.
riscv-elf.md
| 57 | R_RISCV_32_PCREL | PC-relative reference | _word32_ | S + A - P | |
| 58 | R_RISCV_IRELATIVE | Runtime relocation | _wordclass_ | `ifunc_resolver(B + A)` | |
| 59 | R_RISCV_64_PCREL | PC-relative reference | _word64_ | S + A - P | |
| 60 | R_RISCV_GPREL_HI20 | GP-relative reference | _U-type_ | S + A - GP | `%gprel_hi(symbol)` |
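The Calculation column is plain integer arithmetic on S, A, P, and GP. The Python sketch below evaluates `S + A - P` for R_RISCV_32_PCREL and splits the proposed `S + A - GP` of R_RISCV_GPREL_HI20 into a U-type upper part, using the usual RISC-V %hi/%lo rounding so that the remainder fits a signed 12-bit I/S-type immediate. The function names are hypothetical, and a companion low-part GP-relative relocation is assumed rather than shown in this excerpt.

```python
from typing import Tuple


def r_riscv_32_pcrel(S: int, A: int, P: int) -> int:
    """S + A - P written to a 32-bit word; must fit as a signed value."""
    value = S + A - P
    if not -(1 << 31) <= value < (1 << 31):
        raise OverflowError("R_RISCV_32_PCREL: S + A - P out of range")
    return value & 0xFFFFFFFF  # two's-complement bits stored in _word32_


def gprel_hi_lo(S: int, A: int, GP: int) -> Tuple[int, int]:
    """Split S + A - GP into a U-type hi20 field and a signed lo12 value."""
    value = S + A - GP
    hi = (value + 0x800) >> 12  # round so that lo lands in [-2048, 2047]
    lo = value - (hi << 12)
    if not -(1 << 19) <= hi < (1 << 19):
        raise OverflowError("R_RISCV_GPREL_HI20: S + A - GP out of range")
    return hi & 0xFFFFF, lo  # hi20 bits for the U-type immediate
```

Python's `>>` is an arithmetic (flooring) shift, which gives the correct upper part for negative offsets, e.g. `gprel_hi_lo(0, 0, 0x3000)` returns `(0xFFFFD, 0)`.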
Do we have an implementation of the compact model yet? I'm not aware of one. If not, then I think it is premature to allocate relocations for it when we don't even know if it is workable yet.
No, there is no working prototype yet. Should the numbers be omitted?
I haven't maintained an ABI before, so I don't know if there is existing practice. I would like to avoid holes in the numbering scheme though, and if we don't have a working accepted implementation, then we can't be sure that this list of relocs is sufficient and necessary. Though that suggests that maybe we shouldn't add the reloc list yet, or maybe mark them as proposed and put them in a separate list.
Having now read the whole proposal, I rather dislike it. I don't think this is the appropriate solution for its stated objectives. I apologize for the belatedness of this reply. For its minimal stated objectives, the proposal is unnecessarily complicated and unnecessarily incompatible with existing code. We can define a viable
You can access data anywhere in the address space using a pc-relative GOT entry. The auipc + load + load sequence requires, when compressed, 10 bytes, the same as the post-relaxation version of this proposal. A
Distributions compile most programs with
I would rather not spend any time on a relaxation scheme specific to the large model until there is clear and compelling evidence that one is needed.
"large" also implies support for code larger than 2^31 bytes. This can be done with linker changes only and does not require psABI or compiler adjustments. The linker merely needs to be taught to generate trampolines for R_RISCV_JAL and R_RISCV_CALL relocations which are out of range, and to generate multiple GOTs if GOTPCREL relocations occur over a sufficiently large range. Trampolines are implemented as
Binaries with more than 2GB of text are even more of a niche case than binaries with more than 2GB of data, so we do not need to implement this in the linkers immediately. However, this needs to be regarded as a linker limitation and not a code model limitation; you are using the correct large code model, you just need to add missing code to the linker. As such, the code model should be called
The proposal as written appears to mix up three things:
I believe that both a large code model and RWPI should be implemented, but as they are orthogonal they should be proposed, reviewed, implemented, and enabled separately, not combined in a single "compact model" PR.
I am not convinced at this time that the third part of your proposal makes sense as a thing to do. There are no savings of bytes, instructions, or loads in shared-library functions which access any number of interposable globals or one non-interposable global. Saving loads requires a function to access two non-interposable globals; saving instructions would require a function to access two non-interposable globals which successfully reach the small data region. Both cases are much less likely than a function accessing one global, which is already an uncommon case.
The third piece as proposed also has a major ABI compatibility issue since it uses
(I checked what ld.bfd does in this case. ld.bfd has an undocumented feature where GPREL relocations can be used to access addresses near zero, and overwrites the rs1 field with either gp or zero. Since the ld.bfd behavior can be argued to violate the psABI, is undocumented, and is not used by compilers, I would argue that it is a bug in ld.bfd and should be changed. It would of course be possible to define new versions of the GPREL relocations with unambiguous semantics, but the relocation type field is only 8 bits on ELFCLASS32 and they should not be used frivolously.)
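The trampoline point in this review reduces to a per-relocation range check in the linker. A minimal Python sketch, assuming the standard reach of `jal` (signed 21-bit byte offset, about ±1 MiB) and of the `auipc` + `jalr` pair used by R_RISCV_CALL and R_RISCV_CALL_PLT (hi20/lo12 split, about ±2 GiB); the function name is hypothetical and the trampoline contents are not shown:

```python
# Reach of the call-style relocations, as byte offsets from P.
JAL_MIN, JAL_MAX = -(1 << 20), (1 << 20) - 2  # jal: signed 21-bit, even
CALL_MIN, CALL_MAX = -0x80000800, 0x7FFFF7FF  # auipc + jalr pair


def needs_trampoline(reloc: str, P: int, S: int, A: int = 0) -> bool:
    """True if a call at P to S + A cannot be encoded directly."""
    offset = S + A - P
    if reloc == "R_RISCV_JAL":
        return not JAL_MIN <= offset <= JAL_MAX
    if reloc in ("R_RISCV_CALL", "R_RISCV_CALL_PLT"):
        return not CALL_MIN <= offset <= CALL_MAX
    raise ValueError(f"not a call relocation: {reloc}")
```

A linker would redirect each out-of-range call to a nearby stub that can reach the target, so the compiler-visible code model stays unchanged, which is the point made above.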
This proposal does not attempt to address the situation when both text and data are more than 2GiB in size. Rather, it addresses the situation when text and data are more than 2GiB apart. It just so happens that this compact code model also supports data more than 2GiB in size.
If the GOT is more than 2GiB away from the PC, then the existing relocations overflow and cannot be used to reach it.
Truly, a compact code model.
This is a side effect from leveraging the existing data structures meant for PIC. A welcome one, methinks.
Not sure what your point is. DSOs can use global data with non-public scope (non-interposable). However, most systems that support DSOs have VM, so this compact code model is not necessary there. Those systems without VM that still support DSOs in unconventional ways would still benefit from this code model.
This is true. For instance, DSOs could use any register to play the role of the "GP". That would address the case of signal handlers.
If your goal is to support statically linked programs with code and all writable data more than 2GiB apart, what you want is RWPI, which is a standard concept with well-understood implications and interactions with other standard concepts.
Supporting shared objects in a system without VM means that a single copy of the text can access writable data at several different addresses, which means that the offset from text to writable data cannot be a compile time constant and cannot be loaded from a rodata address. As such your proposal is not useful in such systems. What is useful in such systems is a FDPIC ABI, another standard concept.
Re multiple GOTs, this is supported for at least MIPS in both BFD and LLD, since MIPS originally used 16-bit offsets in its GOT which quickly overflow, so now each .o gets its own GOT and the linker then optimistically merges them as much as it can. (There is the optional
As far as I understand, this proposal shares structures used by the FDPIC proposal and does not interfere with it.
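As a rough illustration of "each .o gets its own GOT and the linker then optimistically merges them", here is a first-fit sketch in Python. It is a simplified, hypothetical model, not the actual BFD or LLD algorithm: each object lists the symbols it needs GOT entries for, and `capacity` stands in for however many entries one GOT can hold within the reach of its base register.

```python
from typing import Dict, List, Set, Tuple


def merge_gots(per_object: Dict[str, Set[str]],
               capacity: int) -> List[Tuple[List[str], Set[str]]]:
    """Greedily merge per-object GOTs (first fit, hypothetical model)."""
    members: List[List[str]] = []  # objects sharing each merged GOT
    gots: List[Set[str]] = []      # symbols held by each merged GOT
    for obj, symbols in per_object.items():
        if len(symbols) > capacity:
            raise ValueError(f"{obj} alone needs more than {capacity} entries")
        for i, got in enumerate(gots):
            # Shared symbols are deduplicated, so merging may add few entries.
            if len(got | symbols) <= capacity:
                got |= symbols
                members[i].append(obj)
                break
        else:
            # No existing GOT has room: start a new one for this object.
            gots.append(set(symbols))
            members.append([obj])
    return list(zip(members, gots))
```

With `capacity=3`, objects needing `{x, y}`, `{y, z}`, and `{p, q, r}` end up in two GOTs: the first two objects share `{x, y, z}`, and the third gets its own.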
Yes, the RWPI part shares a lot of relocations with FDPIC. However, you're claiming that this proposal is useful for nommu shared libraries and I am having trouble understanding that. If Conversely, if you're considering systems that do not have multiple processes and text sharing, this is effectively the loadable kernel module situation and it can be handled using plain RWPI and a loader that works at the What am I missing?
On the contrary, without MMU, this proposal is not enough for DSOs. I did add the case in an example of how they could be supported, by preserving and setting the
Change the relocations table to include the respective calculations and organize the information in specific columns.
Use the instruction types per the current v2.2 ISA spec.
Add the missing information for the relocations intended primarily for DWARF records.
Despite its name it's PC-relative.
Add a brief description of the existing code models.
Additionally, address some comments by @jim-wilson and @Lenari.
R_RISCV_RELAX applies to previous reloc not to a pair of instructions. R_RISCV_CALL and R_RISCV_CALL_PLT are now interchangeable.
Add the basic structures to support the compact code model.
Add the relocation type R_RISCV_GPREL_STORE.
The compact code model can be linked with other code models, so the ELF header is probably not the best place for this information to reside. For LTO purposes, this information can be preserved in metadata. Additionally, remove some typos.
Add the TLS relocations for the compact code model.
Add the PLT entries for the compact code model.
Ping, please.
| 56 | R_RISCV_SET32 | Local label assignment | _word32_ | S + A | |
| 57 | R_RISCV_32_PCREL | PC-relative reference | _word32_ | S + A - P | |
| 58 | R_RISCV_IRELATIVE | Runtime relocation | _wordclass_ | `ifunc_resolver(B + A)` | |
| 59 | R_RISCV_64_PCREL | PC-relative reference | _word64_ | S + A - P | |
Unrelated?
It's used in the first PLT entry for the compact code model.
```
1: auipc t0, %pcrel_hi(2f)      # upper part of the PC-relative offset to 2f
   addi  t0, t0, %pcrel_lo(1b)  # t0 = address of 2f
   ld    t2, 0(t0)              # t2 = .got.plt - 2f (the value stored at 2f)
```
Why isn't this inlined?
Please elaborate.
```
1: lui   t3, %hi([email protected] - .got.plt)      # offset to the function pointer
   addi  t3, t3, %lo([email protected] - .got.plt)
   jal   t1, [email protected]
```
Wouldn't this all be a lot simpler if you just required that `gp` be valid on entry? Then you can just do a GP-relative load and look much more like the non-compact models. What is the reasoning for doing it this way?
Shared objects do not have a valid `gp` set.
jr t3
nop
nop
nop
The number of `nop`s changes based on the pointer size.
The compact code model does not apply to RV32, unless you mean RV128.
and fills in the GOT entry for subsequent calls to the function:
For the compact code model, the third entry in the PLT has a stub that
calculates the absolute address of a function pointer in the GOT.
It occupies three 16-byte entries:
Not enough space for RV128 with the current scheme, though arguable whether it's a meaningful combination.
Add examples for the TLS pseudo instructions.
Ping, pretty please.
Pinging isn't going to help the fact that this is adding a whole new ABI that needs to go through thorough analysis before being declared official, especially when there are outstanding concerns described many months ago.
What concerns do you believe are outstanding and how would you suggest this to be analyzed? We have a prototype downstream and I'd be glad to inquire about sharing our results.
Our colleagues have reported that their memory layout is:
And it seems like the compact code model can't resolve such an issue; maybe we need a real large code model...
The cost of a large code model is... large. When all data fits within 2GiB and shares the same 2GiB range, the cost can be smaller, as in this compact code model.
However, such large applications are exceedingly rare, and a compact code "model" is a whole new ABI, not a code model, due to changing how PLTs and GP work, and thus needs much more of a reason to exist than a large code model. Distributions aren't going to be shipping two sets of libraries, for example.
I know the cost of the large code model is large; most of the demand for the large code model comes from bare-metal environments without an MMU. In such situations they don't have much choice. Of course the best solution is changing the memory layout of the hardware platform, but it's hard to ask hardware guys to change things in general... :P
A few more words about the large code model: in my experience, most users who need the large code model need it not because the program is too large, but because of the platform/hardware memory layout. But that might be my bias, since most of my working experience is in the embedded world.
I think that this proposal does not change how either the PLT or the GP work. It definitely expands how they work, while supporting the current way in which they work. In other targets, code models have been supported very much in the same way as this. As @kito-cheng pointed out, the current code models do not support bare-metal embedded applications without an MMU, such as code ROM far away from data RAM, for whatever hardware reason. Distributions usually don't address embedded systems, and embedded systems do rebuild the libraries to fit their needs, so what distributions do is tangential to the needs of embedded applications.
If the desire is to support embedded systems then that would be better off as part of the EABI, not the Unix psABI. |