-
Notifications
You must be signed in to change notification settings - Fork 6.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
llext: fix issues identified in #74509 #74568
Conversation
The ELF files should be aligned to at least sizeof(elf_word) to avoid issues. Use a larger value to ease debugging, since it reduces the differences in addresses between similar runs. Signed-off-by: Luca Burelli <[email protected]>
This patch sets the default value for LLEXT_STORAGE_WRITABLE to 'y' on the Xtensa architecture. This is necessary because it does not currently support the read-only mode for the LLEXT storage. Make sure the default reflects this instead of asking the user to manually set it. Signed-off-by: Luca Burelli <[email protected]>
This patch reworks the testcase.yaml and sample.yaml files for the llext subsystem to further reduce the number of tests performed by CI while improving overall coverage. The following changes are introduced by this commit: - Remove the arch_allow field from the common section to allow any arch to be tested in the build_only test. All other tests explicitly narrow down the arches they are applicable to. - In addition to platforms with active issues, also exclude a number of platforms that are always skipped by the runtime filter due to RAM/Flash limitations. - Add integration_platforms to limit the test count to a few selected platforms which are representative of the different arches. - Remove a number of duplicate SLID tests and group them into a single test that covers both ARM and Xtensa architectures. - Test the relocatable case on ARM as well. Signed-off-by: Luca Burelli <[email protected]>
8b99d2d
to
fed8e3e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The section merge commit did not consider the fact that relocations and symbols are relative to the set of original sections, not the merged ones. Offsets must be calculated and used properly in symbols and linking logic.
This is one thing that bugged me the moment I reviewed that PR, but the code convinced me it was handled... 😅
Off-topic note: IMO, the naming convention could use a little overhaul; I find myself very easily confused between sect
and mem
stuff.
In-file ELF section stuff being named sect{tion}
is fine, but everything related to in-memory representation should be renamed.
Something like segment
would fit nicely and aligns with the ELF format - despite not being exactly ELF segments, what we're doing is pretty close.
Other than this, looks good to me, besides a few comments:
- arch:arm:CONFIG_ARM_MPU=n | ||
- arch:xtensa:CONFIG_LLEXT_STORAGE_WRITABLE=y | ||
- arch:arm:CONFIG_ARM_AARCH32_MMU=n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this cause Kconfig failures if e.g. building for AArch64 target?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, AFAIK this kind of setting a Kconfig to false has no effect if the Kconfig itself is not defined.
The line ARM_MPU=n
has always been ignored on qemu_cortex_a9
, and resulted in a runtime filter (since that CPU has an MMU and it always ended up being active).
harness: keyboard | ||
extra_configs: | ||
- arch:arm:CONFIG_ARM_MPU=n | ||
- arch:arm:CONFIG_ARM_AARCH32_MMU=n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question.
"number (%p) is not at .data section start (%p)", | ||
symbol_ptr, section_ptr); | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test relying on the exact .data
layout sounds a bit fragile to me.
Could this be relaxed to merely checking if symbol is in .data
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, since the API does not provide the size of the section. 😕 🤦
I agree this can break easily. However, I have added comments both there and in the code to make that clear, and it's being tested by CI now, so any commit breaking this would be flagged early.
I think in the next revision I will add a separate one-liner extension for this test - instead of reusing hello_world - so that will make this extra clear.
subsys/llext/llext_priv.h
Outdated
static inline const void *llext_elf_sect_addr(struct llext_loader *ldr, struct llext *ext, | ||
unsigned int sect_idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Not fond of the function name.
Argument name is also confusing.
Proposal:
static inline const void *llext_elf_sect_addr(struct llext_loader *ldr, struct llext *ext, | |
unsigned int sect_idx) | |
static inline const void *llext_get_loaded_section_addr(struct llext_loader *ldr, struct llext *ext, | |
unsigned int sh_ndx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the sh_ndx
, makes it clear it is an ELF section index, we should use that more often.
The function itself has a quirk used by the Xtensa code path: it does try to map directly on the ELF via peek
if the section is not loaded. That's why I first found your proposed name a little confusing, but that made me realize I will leave this visible where it's needed, and be unambiguous here (either a valid ext->mem
ptr or NULL).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mathieuchopstm for the insightful comments, will apply to the next round.
Re/ OT naming: I was also thinking about a PR changing the names of "merged sections" to something else (I had "regions" in mind, but segments is perfect!).
Most importantly ldr->sects
needs to become ldr->segs
; also LLEXT_MEM
could better be LLEXT_SEG
(or at least make it evident in comments MEM it's a "memory segment"). Not sure this change would be accepted during LTS freeze though, WDYT?
- arch:arm:CONFIG_ARM_MPU=n | ||
- arch:xtensa:CONFIG_LLEXT_STORAGE_WRITABLE=y | ||
- arch:arm:CONFIG_ARM_AARCH32_MMU=n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, AFAIK this kind of setting a Kconfig to false has no effect if the Kconfig itself is not defined.
The line ARM_MPU=n
has always been ignored on qemu_cortex_a9
, and resulted in a runtime filter (since that CPU has an MMU and it always ended up being active).
"number (%p) is not at .data section start (%p)", | ||
symbol_ptr, section_ptr); | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, since the API does not provide the size of the section. 😕 🤦
I agree this can break easily. However, I have added comments both there and in the code to make that clear, and it's being tested by CI now, so any commit breaking this would be flagged early.
I think in the next revision I will add a separate one-liner extension for this test - instead of reusing hello_world - so that will make this extra clear.
subsys/llext/llext_priv.h
Outdated
static inline const void *llext_elf_sect_addr(struct llext_loader *ldr, struct llext *ext, | ||
unsigned int sect_idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the sh_ndx
, makes it clear it is an ELF section index, we should use that more often.
The function itself has a quirk used by the Xtensa code path: it does try to map directly on the ELF via peek
if the section is not loaded. That's why I first found your proposed name a little confusing, but that made me realize I will leave this visible where it's needed, and be unambiguous here (either a valid ext->mem
ptr or NULL).
Not exactly familiar with all rules related to code freeze, but I think it might be OK as long as the change is purely mechanical? (i.e., only symbol renaming). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some changes I don't believe aren't quite right here. @lyakh should absolutely test this patch to ensure issues are in fact fixed in SoF for 3.7.
- qemu_arc/qemu_arc_hs5x | ||
- qemu_arc/qemu_arc_hs6x | ||
- qemu_cortex_m0 | ||
- qemu_xtensa/dc233c/mmu | ||
integration_platforms: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove qemu_xtensa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MMU-enabled Xtensa is not compatible with llext at the moment: there is no writable_mpu
test, as it would require to split up the ELF file in mem_domains (with all the relevant alignment issues...).
I could leave the platform in, but it would re-do the same tests of the non-MMU qemu_xtensa
variant.
@@ -62,6 +62,7 @@ config LLEXT_SHELL_MAX_SIZE | |||
|
|||
config LLEXT_STORAGE_WRITABLE | |||
bool "llext storage is writable" | |||
default y if XTENSA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look quite right, xtensa could easily be loading from non-writable storage as well. The architecture has nothing to do with the Kconfig imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, however as I made explicit here, the Xtensa port currently needs this to ensure the relocation logic in llext_link_plt
works.
There has never been a working Xtensa llext.simple.readonly
test; they have always been arch-masked. While reviewing all YAML files I found out the shell sample did not even set STORAGE_WRITABLE=y
for Xtensa; instead of adding this, I chose to set the appropriate default so it automatically gets set correctly everywhere, even in users' experimental projects.
Should I revert to explicitly adding that in all Xtensa test instances?
Add a new test case that verifies that llext_find_section() returns the correct offset for a symbol in a loadable extension. This exploits the fact that in the STORAGE_WRITABLE cases, the symbol addresses will be inside the ELF file buffer, so they can be easily compared. Signed-off-by: Luca Burelli <[email protected]>
fed8e3e
to
f8f4431
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do I understand it correctly, that these 9 patches are needed to fix errors, introduced by the last 2 commits in #73949? If so, I'd rather revert those 2 commits and re-submitted their fixed versions. Would that work?
subsys/llext/llext_link.c
Outdated
got_offset = rela.r_offset + tgt->sh_offset - | ||
size_t offset; | ||
|
||
if (ldr->hdr.e_type == ET_REL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it this exact location that we discussed before and that cannot work for SOF because there we perform symbol linking on an image in a temporary storage, i.e. addresses where it's stored aren't the same as where it will be executed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, maybe you are right, I totally forgot (... as there is no comment and no test whatsoever for that scenario). I just took the opportunity to do away with the "looking up sections by name" stuff, but I will leave the Xtensa paths exactly as before.
4 of the 9 patches, as described in the PR text, actually relax the test scenarios to enable testing ARM relocatable, build on all arches, and |
* larger value eases debugging, since it reduces the differences in addresses | ||
* between similar runs. | ||
*/ | ||
#define ELF_ALIGN __aligned(4096) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when testing with QEMU I seem to have come across out-of-memory issues a few times, i.e. some configurations couldn't be tested because QEMU wasn't allocating enough memory for them (as configured for those tests). Wouldn't this change make those cases more likely? Not sure if we can increase memory size in those configurations or it's explicitly fixed to match some real hardware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not observed any regression on the integration platforms with this. Do you have an example?
* arch-specific code paths. This code should be merged with | ||
* the logic below once the differences are resolved. | ||
*/ | ||
if (IS_ENABLED(CONFIG_XTENSA)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shouldn't be about xtensa reallym but I think this is actually guaranteed by the ELF standard - the names of .rel*
and .rela*
sections, e.g. it says that a .rela
section for a section, called "XXX" should be called .relaXXX
, I actually think that the above .rela.data
, .rela.rodata
and .rela.bss
should be handled together with other .rela*
sections, but I had no test-case to check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please review the associated commit description. That is not Xtensa specific because of the naming, it's because if you enter link_plt
you need to ultimately implement a different architecture-specific relocation function (arch_elf_relocate_local
), and that is defined only on Xtensa.
Furthermore, as the previous comments re/SOF imply, there are subtle undocumented issues in that code path that make it brittle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to unify the relocate arch api imo, plt or pie shouldn’t require a different entry
f8f4431
to
eeaaee0
Compare
this latest version did pass my local test, let me push an SOF PR referencing it for a more complete test |
Yes, latest push should have addressed all comments by @mathieuchopstm and @lyakh (including the revert of the changes in the SOF path). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appears OK to me besides this:
subsys/llext/llext_load.c
Outdated
@@ -114,14 +114,17 @@ static int llext_load_elf_data(struct llext_loader *ldr, struct llext *ext) | |||
|
|||
ldr->sect_cnt = ldr->hdr.e_shnum; | |||
|
|||
size_t sect_map_sz = ldr->sect_cnt * sizeof(ldr->sect_map[0]); | |||
size_t sect_map_sz = ldr->sect_cnt * sizeof(struct llext_elf_sect_map); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is an example of why this shouldn't be done. Firstly this change isn't needed, right? The original code here would still work with this your change. Secondly - that's exactly the advantage of using sizeof(object)
instead of sizeof(type)
wherever possible. Let's keep this line as is
@@ -73,7 +73,7 @@ struct llext_loader { | |||
elf_shdr_t sects[LLEXT_MEM_COUNT]; | |||
elf_shdr_t *sect_hdrs; | |||
bool sect_hdrs_on_heap; | |||
enum llext_mem *sect_map; | |||
struct llext_elf_sect_map *sect_map; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need a forward declaration for this structure - this header isn't including llext_priv.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was one in the first draft, I removed it on purpose - C standard explicitly allows you to declare opaque pointers to structures this way. But it makes sense to leave a comment referring to the file, thanks for spotting this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pillo79 wow, which C version explicitly allows that? I only know about compilers warning about such missing declarations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is apparently in the language since the first standard (C89/C90) if cppreference is to be trusted 🤯
I was completely unaware of this as well, but it appears to indeed be supported fine; GCC 3.4.6 from March 2006 accepts it without problem, even in -std=c89
mode, according to Compiler Explorer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's plain C - it's the reason why you may get a weird "this type is only usable inside the function" message from your compiler when you make a typo in a struct
argument during a function call definition.
That type is really an implementation detail, and I wanted to hide it as much as possible - since it's declared in a @cond
block, it will not even be picked up by documentation. I will add the forward declaration and explicitly hide it, but it looked ugly and redundant to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's plain C - it's the reason why you may get a weird "this type is only usable inside the function" message from your compiler when you make a typo in a
struct
argument during a function call definition.
hm, interesting. So you can use a pointer to an incomplete type in a structure, but not a function parameter. And forward declarations are only needed for the latter? Even though I haven't found a definitive answer apart from
Note that a new struct name may also be introduced just by using a struct tag within another declaration
in https://en.cppreference.com/w/c/language/struct but yes, looks like that's the case. I stand corrected, learned something new, thanks!
eeaaee0
to
a869972
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
A few details that can be modified if other modifications must be made:
subsys/llext/llext_link.c
Outdated
|
||
if (mem_idx == LLEXT_MEM_COUNT) { | ||
LOG_ERR("Section %d has no corresponding memory region", shdr->sh_info); | ||
return -EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: -ENOEXEC.
subsys/llext/llext_link.c
Outdated
if (shdr->sh_entsize != sizeof(elf_rel_t)) { | ||
LOG_ERR("Invalid relocation entry size %zd for section %d", | ||
(size_t)shdr->sh_entsize, i); | ||
return -EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: -ENOEXEC is probably a better error code.
subsys/llext/llext_link.c
Outdated
if (shdr->sh_entsize != sizeof(elf_rela_t)) { | ||
LOG_ERR("Invalid relocation entry size %zd for section %d", | ||
(size_t)shdr->sh_entsize, i); | ||
return -EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: -ENOEXEC
subsys/llext/llext_link.c
Outdated
(size_t)shdr->sh_info, | ||
(size_t)shdr->sh_size, | ||
(size_t)shdr->sh_entsize); | ||
return -EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: -ENOEXEC.
return -EINVAL; | ||
} | ||
break; | ||
case SHT_RELA: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Considering we don't actually seem to handle .rela
sections properly (elf_rela_t.r_addend
appears unused in the whole codebase?), maybe this whole branch should return -ENOTSUP
for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to, but I would surely break something in the Xtensa port - it actually has code dealing with them (and ignoring r_addend
) just below:
zephyr/subsys/llext/llext_link.c
Lines 226 to 227 in a869972
if (strcmp(name, ".rela.plt") == 0 || | |
strcmp(name, ".rela.dyn") == 0) { |
The recent changes to the loader code merged related sections together, making sure the merged sections are self-coherent. This, however, did not take into account that the original ELF sections are now a _subset_ of the merged section - and might not start from the beginning of the merged section. This patch converts the `sect_map` member of `struct llext_loader` to a structure with two fields: - mem_idx: the memory area index where the ELF section is mapped - offset: the offset of the ELF section inside the memory area The offset is calculated after all sections are merged and the final groups are defined. This will allow the loader to correctly calculate the address of symbols and relocations in the merged sections. Signed-off-by: Luca Burelli <[email protected]>
The optimization in llext_load() to avoid using the generic path for sections that are cached in memory was broken for two reasons: - it was comparing an ELF section index to LLEXT_MEM_BSS, which is a llext_mem enum, and - it was using the wrong section address for the cached sections since the "merged sections" feature was introduced in 709b2e4. This patch fixes both issues using the new llext_loaded_sect_ptr() helper function. Signed-off-by: Luca Burelli <[email protected]>
Currently, the code uses the section name to identify the target section of a relocation. This is not reliable, as the section name is not guaranteed to be in a specific format. Instead, use the sh_info field of the relocation section header to identify the target section. This is a tricky change, as it requires a workaround for the Xtensa port, whose code path diverges here into the `link_plt` function and ultimately different arch-specific code. Avoiding this divergence will require additional refactorings. Signed-off-by: Luca Burelli <[email protected]>
The function llext_section_by_name() is used only in one place, and it expects the caller to have the section headers cache available. This cache is freed after the ELF file is loaded, so the function is not usable in the context where it is called. Remove the function and replace the call with a direct search in the ELF file section headers array, as was done before 08eb314. Signed-off-by: Luca Burelli <[email protected]>
a869972
to
db8e9ad
Compare
PR #74509 expanded the llext test suite and identified a number of shortcomings related to the latest changes. In particular:
llext_find_section
uses a section header cache that Is basically not available when the API can be used. Partially revert the changes and reduce the number of sections identified by name in the code.This PR fixes the above issues, with a green CI run on all the new test cases.
The first 4 commits are taken directly from the PR above (and can be viewed separately there).