Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llext: fix issues identified in #74509 #74568

Merged
merged 8 commits into from
Jun 26, 2024

Conversation

pillo79
Copy link
Collaborator

@pillo79 pillo79 commented Jun 19, 2024

PR #74509 expanded the llext test suite and identified a number of shortcomings related to the latest changes. In particular:

  • The section merge commit did not consider the fact that relocations and symbols are relative to the set of original sections, not the merged ones. Offsets must be calculated and used properly in symbols and linking logic.
  • llext_find_section uses a section header cache that Is basically not available when the API can be used. Partially revert the changes and reduce the number of sections identified by name in the code.

This PR fixes the above issues, with a green CI run on all the new test cases.

The first 4 commits are taken directly from the PR above (and can be viewed separately there).

The ELF files should be aligned to at least sizeof(elf_word) to avoid
issues. Use a larger value to ease debugging, since it reduces the
differences in addresses between similar runs.

Signed-off-by: Luca Burelli <[email protected]>
This patch sets the default value for LLEXT_STORAGE_WRITABLE to 'y' on
the Xtensa architecture. This is necessary because it does not currently
support the read-only mode for the LLEXT storage.

Make sure the default reflects this instead of asking the user to
manually set it.

Signed-off-by: Luca Burelli <[email protected]>
This patch reworks the testcase.yaml and sample.yaml files for the llext
subsystem to further reduce the number of tests performed by CI while
improving overall coverage.

The following changes are introduced by this commit:

- Remove the arch_allow field from the common section to allow any arch
  to be tested in the build_only test. All other tests explicitly narrow
  down the arches they are applicable to.
- In addition to platforms with active issues, also exclude a number of
  platforms that are always skipped by the runtime filter due to
  RAM/Flash limitations.
- Add integration_platforms to limit the test count to a few selected
  platforms which are representative of the different arches.
- Remove a number of duplicate SLID tests and group them into a single
  test that covers both ARM and Xtensa architectures.
- Test the relocatable case on ARM as well.

Signed-off-by: Luca Burelli <[email protected]>
@pillo79 pillo79 added the DNM This PR should not be merged (Do Not Merge) label Jun 19, 2024
@pillo79 pillo79 removed the DNM This PR should not be merged (Do Not Merge) label Jun 20, 2024
@pillo79 pillo79 marked this pull request as ready for review June 20, 2024 08:25
Copy link
Collaborator

@mathieuchopstm mathieuchopstm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The section merge commit did not consider the fact that relocations and symbols are relative to the set of original sections, not the merged ones. Offsets must be calculated and used properly in symbols and linking logic.

This is one thing that bugged me the moment I reviewed that PR, but the code convinced me it was handled... 😅

Off-topic note: IMO, the naming convention could use a little overhaul; I find myself very easily confused between sect and mem stuff.
In-file ELF section stuff being named sect{tion} is fine, but everything related to in-memory representation should be renamed.
Something like segment would fit nicely and aligns with the ELF format - despite not being exactly ELF segments, what we're doing is pretty close.

Other than this, looks good to me, besides a few comments:

- arch:arm:CONFIG_ARM_MPU=n
- arch:xtensa:CONFIG_LLEXT_STORAGE_WRITABLE=y
- arch:arm:CONFIG_ARM_AARCH32_MMU=n
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this cause Kconfig failures if e.g. building for AArch64 target?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, AFAIK this kind of setting a Kconfig to false has no effect if the Kconfig itself is not defined.
The line ARM_MPU=n has always been ignored on qemu_cortex_a9, and resulted in a runtime filter (since that CPU has an MMU and it always ended up being active).

harness: keyboard
extra_configs:
- arch:arm:CONFIG_ARM_MPU=n
- arch:arm:CONFIG_ARM_AARCH32_MMU=n
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question.

subsys/llext/llext_link.c Show resolved Hide resolved
"number (%p) is not at .data section start (%p)",
symbol_ptr, section_ptr);
}
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test relying on the exact .data layout sounds a bit fragile to me.
Could this be relaxed to merely checking if symbol is in .data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, since the API does not provide the size of the section. 😕 🤦

I agree this can break easily. However, I have added comments both there and in the code to make that clear, and it's being tested by CI now, so any commit breaking this would be flagged early.
I think in the next revision I will add a separate one-liner extension for this test - instead of reusing hello_world - so that will make this extra clear.

Comment on lines 60 to 61
static inline const void *llext_elf_sect_addr(struct llext_loader *ldr, struct llext *ext,
unsigned int sect_idx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Not fond of the function name.
Argument name is also confusing.
Proposal:

Suggested change
static inline const void *llext_elf_sect_addr(struct llext_loader *ldr, struct llext *ext,
unsigned int sect_idx)
static inline const void *llext_get_loaded_section_addr(struct llext_loader *ldr, struct llext *ext,
unsigned int sh_ndx)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the sh_ndx, makes it clear it is an ELF section index, we should use that more often.

The function itself has a quirk used by the Xtensa code path: it does try to map directly on the ELF via peek if the section is not loaded. That's why I first found your proposed name a little confusing, but that made me realize I will leave this visible where it's needed, and be unambiguous here (either a valid ext->mem ptr or NULL).

Copy link
Collaborator Author

@pillo79 pillo79 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mathieuchopstm for the insightful comments, will apply to the next round.

Re/ OT naming: I was also thinking about a PR changing the names of "merged sections" to something else (I had "regions" in mind, but segments is perfect!).
Most importantly ldr->sects needs to become ldr->segs; also LLEXT_MEM could better be LLEXT_SEG (or at least make it evident in comments MEM it's a "memory segment"). Not sure this change would be accepted during LTS freeze though, WDYT?

- arch:arm:CONFIG_ARM_MPU=n
- arch:xtensa:CONFIG_LLEXT_STORAGE_WRITABLE=y
- arch:arm:CONFIG_ARM_AARCH32_MMU=n
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, AFAIK this kind of setting a Kconfig to false has no effect if the Kconfig itself is not defined.
The line ARM_MPU=n has always been ignored on qemu_cortex_a9, and resulted in a runtime filter (since that CPU has an MMU and it always ended up being active).

"number (%p) is not at .data section start (%p)",
symbol_ptr, section_ptr);
}
#endif
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, since the API does not provide the size of the section. 😕 🤦

I agree this can break easily. However, I have added comments both there and in the code to make that clear, and it's being tested by CI now, so any commit breaking this would be flagged early.
I think in the next revision I will add a separate one-liner extension for this test - instead of reusing hello_world - so that will make this extra clear.

Comment on lines 60 to 61
static inline const void *llext_elf_sect_addr(struct llext_loader *ldr, struct llext *ext,
unsigned int sect_idx)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the sh_ndx, makes it clear it is an ELF section index, we should use that more often.

The function itself has a quirk used by the Xtensa code path: it does try to map directly on the ELF via peek if the section is not loaded. That's why I first found your proposed name a little confusing, but that made me realize I will leave this visible where it's needed, and be unambiguous here (either a valid ext->mem ptr or NULL).

@mathieuchopstm
Copy link
Collaborator

Not sure this change would be accepted during LTS freeze though, WDYT?

Not exactly familiar with all rules related to code freeze, but I think it might be OK as long as the change is purely mechanical? (i.e., only symbol renaming).

Copy link
Collaborator

@teburd teburd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes I don't believe aren't quite right here. @lyakh should absolutely test this patch to ensure issues are in fact fixed in SoF for 3.7.

- qemu_arc/qemu_arc_hs5x
- qemu_arc/qemu_arc_hs6x
- qemu_cortex_m0
- qemu_xtensa/dc233c/mmu
integration_platforms:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove qemu_xtensa

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MMU-enabled Xtensa is not compatible with llext at the moment: there is no writable_mpu test, as it would require to split up the ELF file in mem_domains (with all the relevant alignment issues...).
I could leave the platform in, but it would re-do the same tests of the non-MMU qemu_xtensa variant.

@@ -62,6 +62,7 @@ config LLEXT_SHELL_MAX_SIZE

config LLEXT_STORAGE_WRITABLE
bool "llext storage is writable"
default y if XTENSA
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look quite right, xtensa could easily be loading from non-writable storage as well. The architecture has nothing to do with the Kconfig imo.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, however as I made explicit here, the Xtensa port currently needs this to ensure the relocation logic in llext_link_plt works.

There has never been a working Xtensa llext.simple.readonly test; they have always been arch-masked. While reviewing all YAML files I found out the shell sample did not even set STORAGE_WRITABLE=y for Xtensa; instead of adding this, I chose to set the appropriate default so it automatically gets set correctly everywhere, even in users' experimental projects.

Should I revert to explicitly adding that in all Xtensa test instances?

@teburd teburd added the DNM This PR should not be merged (Do Not Merge) label Jun 20, 2024
@teburd
Copy link
Collaborator

teburd commented Jun 20, 2024

Marking DNM until @lyakh has a chance to test as the regression affected SoF. Once @lyakh reviews and either -/+ 1's DNM should be removed by myself or others.

Add a new test case that verifies that llext_find_section() returns the
correct offset for a symbol in a loadable extension. This exploits the
fact that in the STORAGE_WRITABLE cases, the symbol addresses will be
inside the ELF file buffer, so they can be easily compared.

Signed-off-by: Luca Burelli <[email protected]>
Copy link
Collaborator

@lyakh lyakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do I understand it correctly, that these 9 patches are needed to fix errors, introduced by the last 2 commits in #73949? If so, I'd rather revert those 2 commits and re-submitted their fixed versions. Would that work?

got_offset = rela.r_offset + tgt->sh_offset -
size_t offset;

if (ldr->hdr.e_type == ET_REL) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it this exact location that we discussed before and that cannot work for SOF because there we perform symbol linking on an image in a temporary storage, i.e. addresses where it's stored aren't the same as where it will be executed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, maybe you are right, I totally forgot (... as there is no comment and no test whatsoever for that scenario). I just took the opportunity to do away with the "looking up sections by name" stuff, but I will leave the Xtensa paths exactly as before.

@pillo79
Copy link
Collaborator Author

pillo79 commented Jun 20, 2024

do I understand it correctly, that these 9 patches are needed to fix errors, introduced by the last 2 commits in #73949? If so, I'd rather revert those 2 commits and re-submitted their fixed versions. Would that work?

4 of the 9 patches, as described in the PR text, actually relax the test scenarios to enable testing ARM relocatable, build on all arches, and find_section. The other patches fix the bugs that have been identified.
I don't see why would we want to revert and merge a big blob of code instead of separate, reviewable fixes.

* larger value eases debugging, since it reduces the differences in addresses
* between similar runs.
*/
#define ELF_ALIGN __aligned(4096)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when testing with QEMU I seem to have come across out-of-memory issues a few times, i.e. some configurations couldn't be tested because QEMU wasn't allocating enough memory for them (as configured for those tests). Wouldn't this change make those cases more likely? Not sure if we can increase memory size in those configurations or it's explicitly fixed to match some real hardware.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not observed any regression on the integration platforms with this. Do you have an example?

* arch-specific code paths. This code should be merged with
* the logic below once the differences are resolved.
*/
if (IS_ENABLED(CONFIG_XTENSA)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't be about xtensa reallym but I think this is actually guaranteed by the ELF standard - the names of .rel* and .rela* sections, e.g. it says that a .rela section for a section, called "XXX" should be called .relaXXX, I actually think that the above .rela.data, .rela.rodata and .rela.bss should be handled together with other .rela* sections, but I had no test-case to check

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review the associated commit description. That is not Xtensa specific because of the naming, it's because if you enter link_plt you need to ultimately implement a different architecture-specific relocation function (arch_elf_relocate_local), and that is defined only on Xtensa.
Furthermore, as the previous comments re/SOF imply, there are subtle undocumented issues in that code path that make it brittle.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to unify the relocate arch api imo, plt or pie shouldn’t require a different entry

@lyakh
Copy link
Collaborator

lyakh commented Jun 20, 2024

this latest version did pass my local test, let me push an SOF PR referencing it for a more complete test

@pillo79
Copy link
Collaborator Author

pillo79 commented Jun 20, 2024

Yes, latest push should have addressed all comments by @mathieuchopstm and @lyakh (including the revert of the changes in the SOF path).
Thanks for the help in testing!

Copy link
Collaborator

@mathieuchopstm mathieuchopstm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appears OK to me besides this:

subsys/llext/llext_link.c Show resolved Hide resolved
@@ -114,14 +114,17 @@ static int llext_load_elf_data(struct llext_loader *ldr, struct llext *ext)

ldr->sect_cnt = ldr->hdr.e_shnum;

size_t sect_map_sz = ldr->sect_cnt * sizeof(ldr->sect_map[0]);
size_t sect_map_sz = ldr->sect_cnt * sizeof(struct llext_elf_sect_map);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is an example of why this shouldn't be done. Firstly this change isn't needed, right? The original code here would still work with this your change. Secondly - that's exactly the advantage of using sizeof(object) instead of sizeof(type) wherever possible. Let's keep this line as is

subsys/llext/llext_link.c Show resolved Hide resolved
@@ -73,7 +73,7 @@ struct llext_loader {
elf_shdr_t sects[LLEXT_MEM_COUNT];
elf_shdr_t *sect_hdrs;
bool sect_hdrs_on_heap;
enum llext_mem *sect_map;
struct llext_elf_sect_map *sect_map;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need a forward declaration for this structure - this header isn't including llext_priv.h

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was one in the first draft, I removed it on purpose - C standard explicitly allows you to declare opaque pointers to structures this way. But it makes sense to leave a comment referring to the file, thanks for spotting this!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pillo79 wow, which C version explicitly allows that? I only know about compilers warning about such missing declarations

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is apparently in the language since the first standard (C89/C90) if cppreference is to be trusted 🤯

I was completely unaware of this as well, but it appears to indeed be supported fine; GCC 3.4.6 from March 2006 accepts it without problem, even in -std=c89 mode, according to Compiler Explorer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's plain C - it's the reason why you may get a weird "this type is only usable inside the function" message from your compiler when you make a typo in a struct argument during a function call definition.

That type is really an implementation detail, and I wanted to hide it as much as possible - since it's declared in a @cond block, it will not even be picked up by documentation. I will add the forward declaration and explicitly hide it, but it looked ugly and redundant to me.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's plain C - it's the reason why you may get a weird "this type is only usable inside the function" message from your compiler when you make a typo in a struct argument during a function call definition.

hm, interesting. So you can use a pointer to an incomplete type in a structure, but not a function parameter. And forward declarations are only needed for the latter? Even though I haven't found a definitive answer apart from

Note that a new struct name may also be introduced just by using a struct tag within another declaration

in https://en.cppreference.com/w/c/language/struct but yes, looks like that's the case. I stand corrected, learned something new, thanks!

subsys/llext/llext_link.c Show resolved Hide resolved
@pillo79
Copy link
Collaborator Author

pillo79 commented Jun 21, 2024

All comments but this one by @teburd are addressed.
@lyakh I added only one extra check on sizeof(elf_rel/rela_t) from the one you tested. Do you wish to do another SOF CI run?

Copy link
Collaborator

@mathieuchopstm mathieuchopstm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
A few details that can be modified if other modifications must be made:


if (mem_idx == LLEXT_MEM_COUNT) {
LOG_ERR("Section %d has no corresponding memory region", shdr->sh_info);
return -EINVAL;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: -ENOEXEC.

if (shdr->sh_entsize != sizeof(elf_rel_t)) {
LOG_ERR("Invalid relocation entry size %zd for section %d",
(size_t)shdr->sh_entsize, i);
return -EINVAL;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: -ENOEXEC is probably a better error code.

if (shdr->sh_entsize != sizeof(elf_rela_t)) {
LOG_ERR("Invalid relocation entry size %zd for section %d",
(size_t)shdr->sh_entsize, i);
return -EINVAL;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: -ENOEXEC

(size_t)shdr->sh_info,
(size_t)shdr->sh_size,
(size_t)shdr->sh_entsize);
return -EINVAL;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: -ENOEXEC.

return -EINVAL;
}
break;
case SHT_RELA:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Considering we don't actually seem to handle .rela sections properly (elf_rela_t.r_addend appears unused in the whole codebase?), maybe this whole branch should return -ENOTSUP for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to, but I would surely break something in the Xtensa port - it actually has code dealing with them (and ignoring r_addend) just below:

if (strcmp(name, ".rela.plt") == 0 ||
strcmp(name, ".rela.dyn") == 0) {

The recent changes to the loader code merged related sections together,
making sure the merged sections are self-coherent. This, however, did
not take into account that the original ELF sections are now a _subset_
of the merged section - and might not start from the beginning of the
merged section.

This patch converts the `sect_map` member of `struct llext_loader` to a
structure with two fields:
- mem_idx: the memory area index where the ELF section is mapped
- offset: the offset of the ELF section inside the memory area

The offset is calculated after all sections are merged and the final
groups are defined. This will allow the loader to correctly calculate
the address of symbols and relocations in the merged sections.

Signed-off-by: Luca Burelli <[email protected]>
The optimization in llext_load() to avoid using the generic path for
sections that are cached in memory was broken for two reasons:
- it was comparing an ELF section index to LLEXT_MEM_BSS, which is a
  llext_mem enum, and
- it was using the wrong section address for the cached sections since
  the "merged sections" feature was introduced in 709b2e4.

This patch fixes both issues using the new llext_loaded_sect_ptr()
helper function.

Signed-off-by: Luca Burelli <[email protected]>
Currently, the code uses the section name to identify the target section
of a relocation. This is not reliable, as the section name is not
guaranteed to be in a specific format. Instead, use the sh_info field of
the relocation section header to identify the target section.

This is a tricky change, as it requires a workaround for the Xtensa
port, whose code path diverges here into the `link_plt` function and
ultimately different arch-specific code. Avoiding this divergence
will require additional refactorings.

Signed-off-by: Luca Burelli <[email protected]>
The function llext_section_by_name() is used only in one place, and it
expects the caller to have the section headers cache available. This
cache is freed after the ELF file is loaded, so the function is not
usable in the context where it is called.

Remove the function and replace the call with a direct search in the
ELF file section headers array, as was done before 08eb314.

Signed-off-by: Luca Burelli <[email protected]>
@teburd teburd removed the DNM This PR should not be merged (Do Not Merge) label Jun 25, 2024
@nashif nashif merged commit d4ea1da into zephyrproject-rtos:main Jun 26, 2024
22 of 23 checks passed
@henrikbrixandersen henrikbrixandersen added the area: llext Linkable Loadable Extensions label Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: llext Linkable Loadable Extensions area: Samples Samples
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants