Describe the bug
I was notified Friday morning that my demand paging PR had been reverted as it had been identified as the culprit in a failure in tests/kernel/fatal/exception. The crash occurred at very early boot, right after installing page tables.
However, after digging deeper, this turns out not to be the case: my PR merely exposed a separate issue in another PR that went in at about the same time.
When I reproduced the failure locally, I found that the memory addresses of the page tables were shifting between zephyr_prebuilt.elf and zephyr.elf:
$ nm -n zephyr/zephyr.elf | grep ptables
0000000000120000 R z_x86_kernel_ptables
$ nm -n zephyr/zephyr_prebuilt.elf | grep ptables
000000000011f000 D z_x86_kernel_ptables
This will totally break the system. We absolutely rely on memory addresses of symbols not shifting in between builds, because certain auto-generated CPU structures (such as page tables) will not function correctly if the information used to build them in zephyr_prebuilt.elf becomes stale.
In this particular case, this resulted in the memory mapping for the kernel image not being properly sized. One of the changes in the demand paging PR is that only the kernel image is now memory-mapped, with the other page frames left unmapped and available for anonymous memory mappings.
In addition to this problem, this can also break userspace, as the kernel object tables are created from symbol addresses in zephyr_prebuilt.elf. If the addresses of kernel object symbols change between zephyr.elf and zephyr_prebuilt.elf, no syscalls will work. This is why the kernel object tables themselves are located at the very end of RAM, so that other addresses do not shift.
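The invariant described above (every symbol keeps the same address in zephyr_prebuilt.elf and zephyr.elf) can be checked mechanically. The following is a minimal sketch, not part of the Zephyr build system: the function names `parse_nm` and `shifted_symbols` are made up for illustration, and the sample data is a small subset of the `nm -n` listings in this report, assuming the smaller addresses come from zephyr_prebuilt.elf.

```python
import re

# One line of `nm -n` output: hex address, type letter, symbol name.
NM_LINE = re.compile(r"^([0-9a-fA-F]+)\s+(\S)\s+(\S+)$")


def parse_nm(text):
    """Return {symbol: address} parsed from `nm -n` text output."""
    symbols = {}
    for line in text.splitlines():
        m = NM_LINE.match(line.strip())
        if m:
            symbols[m.group(3)] = int(m.group(1), 16)
    return symbols


def shifted_symbols(prebuilt_text, final_text):
    """List (symbol, prebuilt_addr, final_addr) for symbols whose address
    differs, sorted by prebuilt address so the first entry shows where the
    drift begins."""
    pre = parse_nm(prebuilt_text)
    fin = parse_nm(final_text)
    diffs = [(name, pre[name], fin[name])
             for name in pre.keys() & fin.keys()
             if pre[name] != fin[name]]
    return sorted(diffs, key=lambda d: (d[1], d[0]))


# Sample data copied from the listings in this report (subset only).
prebuilt = """\
0000000000108070 R __device_handles_start
0000000000108090 R __device_handles_end
000000000011f000 D z_x86_kernel_ptables
"""
final = """\
0000000000108070 R __device_handles_start
0000000000108092 R __device_handles_end
0000000000120000 R z_x86_kernel_ptables
"""

for name, old, new in shifted_symbols(prebuilt, final):
    print(f"{name}: {old:#x} -> {new:#x} ({new - old:+#x})")
```

In practice the two input strings would be the captured output of `nm -n` on each ELF file. The first entry reported is the lowest-addressed symbol that moved, which is a reasonable place to start looking for the section that changed size.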
I then looked at the image to see where the addresses started shifting. The culprits are symbols prefixed with __devicehdl:
0000000000108070 R __device_handles_start
-0000000000108070 V __devicehdl_DT_N_S_soc_S_uart_2f8
+0000000000108070 R __devicehdl_sys_init_z_clock_driver_init0
0000000000108070 R __init_APPLICATION_start
0000000000108070 R __init_end
0000000000108070 R __init_POST_KERNEL_start
0000000000108070 R __init_SMP_start
-0000000000108080 V __devicehdl_DT_N_S_soc_S_uart_3f8
-000000000010808a V __devicehdl_sys_init_z_clock_driver_init0
-0000000000108090 R __app_shmem_regions_end
-0000000000108090 R __app_shmem_regions_start
...snip...
+0000000000108078 R __devicehdl_DT_N_S_soc_S_uart_3f8
+0000000000108088 R __devicehdl_DT_N_S_soc_S_uart_2f8
+0000000000108092 R __app_shmem_regions_end
+0000000000108092 R __app_shmem_regions_start
The overall size of these handles is increasing:
0000000000108070 R __device_handles_start
-0000000000108090 R __device_handles_end
+0000000000108092 R __device_handles_end
We are off by 2 bytes. As bad luck would have it, this particular test case resulted in the size of the kernel image being pushed out by an additional page due to alignment requirements.
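To illustrate how a 2-byte growth can move a page-aligned symbol by a whole page, here is a small arithmetic sketch. It assumes 4 KiB pages, and the unaligned end address `0x11f000` is hypothetical, chosen only so that the results match the two z_x86_kernel_ptables addresses shown earlier. The actual layout preceding the page tables is not shown in this report.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def align_up(addr, align=PAGE_SIZE):
    """Round addr up to the next multiple of align (a power of two)."""
    return (addr + align - 1) & ~(align - 1)


# Hypothetical end of the data placed before the page tables.
end_prebuilt = 0x11f000          # lands exactly on a page boundary
end_final = end_prebuilt + 2     # same data after growing by 2 bytes

print(hex(align_up(end_prebuilt)))  # page tables start here in the first link
print(hex(align_up(end_final)))     # and one full page later in the second
```

If the data had ended anywhere else within the page, both values would round up to the same boundary and the shift would be hidden, which may be why the problem does not show up in every test.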
To Reproduce
Back up in the tree to right before where my demand paging PR was reverted and run tests/kernel/fatal/exception using sentinel.conf.
I am not sure why other userspace tests aren't also failing. We might be getting lucky, with alignment directives restoring the synchronization between symbol addresses after they get messed up.
Expected behavior
This new DTS infrastructure must not change in size between builds.
Impact
Showstopper.