Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Zephyr] kernel/timer/timer_monotonic fails on 6.2-based QEMU #66

Closed
abrodkin opened this issue Mar 11, 2022 · 11 comments
Closed

[Zephyr] kernel/timer/timer_monotonic fails on 6.2-based QEMU #66

abrodkin opened this issue Mar 11, 2022 · 11 comments
Assignees

Comments

@abrodkin
Copy link
Member

west -v build -b qemu_arc_em tests/kernel/timer/timer_monotonic -t run

...

*** Booting Zephyr OS build zephyr-v3.0.0-1009-g3a4c3702324f  ***
Running test suite timer_fn
===================================================================
START - test_timer
k_ticks_to_cyc_floor32(1) = 100000
sys_clock_hw_cycles_per_sec() = 10000000
START - test monotonic timer
diff = -899967 (t_last = 999972 : t_now = 100005);i = 29972

    Assertion failed at WEST_TOPDIR/zephyr/tests/kernel/timer/timer_monotonic/src/main.c:78: test_timer: (errors is true)
errors = 1

 FAIL - test_timer in 3.178 seconds
===================================================================
Test suite timer_fn failed.
===================================================================
PROJECT EXECUTION FAILED

Here's the test source: https://github.com/zephyrproject-rtos/zephyr/tree/main/tests/kernel/timer/timer_monotonic

@cupertinomiranda
Copy link

I cannot reproduce or debug this issue without a qemu command that executes this and a binary.
In this terms the issue should be in zephyr repo and not here.

@abrodkin
Copy link
Member Author

qemu-system-arc -cpu arcem -m 8M -nographic -no-reboot -monitor none -global cpu.firq=false -global cpu.num-irqlevels=15 -global cpu.num-irq=25 -global cpu.ext-irq=20 -global cpu.freq_hz=10000000 -global cpu.timer0=true -global cpu.timer1=true -global cpu.has-mpu=true -global cpu.mpu-numreg=16 -net none -pidfile qemu.pid -chardev stdio,id=con,mux=on -serial chardev:con -mon chardev=con,mode=readline -icount shift=6,align=off,sleep=off -rtc clock=vm -kernel timer_monotonic
timer_monotonic.zip

cupertinomiranda pushed a commit that referenced this issue Mar 14, 2022
This patches makes counter only appear to reset after the timer
interrupt gets to be served.
Also removed the condition that would make timers less reactive.
@cupertinomiranda
Copy link

Test passes with new fix in commit 2625a6f.

@abrodkin
Copy link
Member Author

Indeed, results now are all clear:

./scripts/twister --platform qemu_arc_em --platform qemu_arc_hs --platform qemu_arc_hs6x
ZEPHYR_BASE unset, using "..."
INFO    - Zephyr version: zephyr-v3.0.0-1286-g38938b4e594a
INFO    - JOBS: 12
INFO    - Using 'zephyr' toolchain.
INFO    - Building initial testcase list...
INFO    - 1651 test scenarios (4686 configurations) selected, 3483 configurations discarded due to filters.
INFO    - Adding tasks to the queue...
INFO    - Added initial list of jobs to queue
INFO    - Total complete:  552/1203  45%  skipped: 3601, failed:    0
ERROR   - qemu_arc_em               tests/kernel/mem_protect/syscalls/kernel.memory_protection.syscalls FAILED: Exited with 2
ERROR   - see: .../twister-out/qemu_arc_em/tests/kernel/mem_protect/syscalls/kernel.memory_protection.syscalls/handler.log
INFO    - Total complete: 1203/1203  100%  skipped: 3851, failed:    1
INFO    - 834 of 835 test configurations passed (99.88%), 1 failed, 3851 skipped with 0 warnings in 4056.32 seconds
INFO    - In total 5425 test cases were executed, 55082 skipped on 3 out of total 442 platforms (0.68%)
INFO    - 706 test configurations executed on platforms, 129 test configurations were only built.

Failed on qemu_arc_em test shows:

START - test_syscall_torture
Running syscall torture test with 4 threads on 1 cpu(s)
E: ***** Exception vector: 0x2, cause code: 0x1, parameter 0x0
E: Address 0x8000137c
E: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
E: Current thread: 0x80400928 (unknown)
E: Halting system

But that needs to be looked at separately.

@shahab-vahedi
Copy link
Member

shahab-vahedi commented Apr 4, 2022

It seems this change is inflicting a problem on my linux images. it might be actually that it is revealing an issue in our Linux Kernel. That's why I filed one there.

@stephanosio
Copy link

stephanosio commented Apr 16, 2022

For some reason, this exact issue is still seen with the Zephyr SDK 0.14.1, which includes 2625a6f, although it is seen only in the AWS CI machine and not locally reproducible.

https://github.com/zephyrproject-rtos/zephyr-testing/runs/6046357806?check_suite_focus=true#step:11:2236

stephanosio added a commit to stephanosio/zephyr that referenced this issue Apr 16, 2022
This commit excludes the kernel syscall test for the `qemu_arc_em`
platform because this test may fail with the ARC QEMU 6.2 on certain
host systems.

Refer to the following issues for more details:

* foss-for-synopsys-dwc-arc-processors/qemu#66
* zephyrproject-rtos#44862

Signed-off-by: Stephanos Ioannidis <[email protected]>
stephanosio added a commit to zephyrproject-rtos/zephyr-testing that referenced this issue Apr 16, 2022
This commit excludes the kernel syscall test for the `qemu_arc_em`
platform because this test may fail with the ARC QEMU 6.2 on certain
host systems.

Refer to the following issues for more details:

* foss-for-synopsys-dwc-arc-processors/qemu#66
* zephyrproject-rtos/zephyr#44862

Signed-off-by: Stephanos Ioannidis <[email protected]>
nashif pushed a commit to zephyrproject-rtos/zephyr that referenced this issue Apr 19, 2022
This commit excludes the kernel syscall test for the `qemu_arc_em`
platform because this test may fail with the ARC QEMU 6.2 on certain
host systems.

Refer to the following issues for more details:

* foss-for-synopsys-dwc-arc-processors/qemu#66
* #44862

Signed-off-by: Stephanos Ioannidis <[email protected]>
shahab-vahedi pushed a commit that referenced this issue Apr 20, 2022
coreboot-org-bot pushed a commit to coreboot/zephyr-cros that referenced this issue Apr 20, 2022
This commit excludes the kernel syscall test for the `qemu_arc_em`
platform because this test may fail with the ARC QEMU 6.2 on certain
host systems.

Refer to the following issues for more details:

* foss-for-synopsys-dwc-arc-processors/qemu#66
* zephyrproject-rtos/zephyr#44862

Signed-off-by: Stephanos Ioannidis <[email protected]>
GitOrigin-RevId: 9706404
Change-Id: I08b455a5f8474d14c3915a2ee58e6792d819d292
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/3596486
Tested-by: CopyBot Service Account <[email protected]>
Reviewed-by: Al Semjonovs <[email protected]>
Commit-Queue: Al Semjonovs <[email protected]>
Tested-by: Al Semjonovs <[email protected]>
@cupertinomiranda
Copy link

After further analysis of this issue with Claudiu, we got to the conclusion that my patch on QEMU is not Ok.
According through PRM there it is not guaranteed that an interrupt is immediately served after the timer interrupt is raised.
On the other hand this time interval is flagged by IP in CONTROL0, being that the reasoning behind IP flag.

Zephyr itself when reading the COUNT0 should be verifying if IP flag from CONTROL0 is not set by QEMU.
If it was then at this particular case Zephyr knows that monotonicity is not guaranteed, since the timer as already been reset and the interrupt is not served yet.

Be aware that although less likely, mostly because we do not not have external sources of interrupts in our testing, that the same can happen in hardware, and for that reason the software is incorrect.

@ruuddw
Copy link
Member

ruuddw commented Apr 21, 2022

Zephyr itself when reading the COUNT0 should be verifying if IP flag from CONTROL0 is not set by QEMU ...
... and for that reason the software is incorrect

The ARC timer driver checks and handles the IP flag, in the 'elapsed()' function. Can you elaborate what's not correct in the Zephyr driver?

@cupertinomiranda
Copy link

The refered test fails in the following:

	for (i = 0U; i < 1000000; i++) {
		t_now = k_cycle_get_32();

		if (t_now < t_last) {
			diff = t_now - t_last;
			TC_PRINT("diff = %d (t_last = %u : t_now = %u);"
				"i = %u\n", diff, t_last, t_now, i);
			errors++;
		}
		t_last = t_now;
	}

Using QEMU, this test fails since the interrupt for the timer is not immediately handled, but rather only executed in the end of the TLB (i.e. a control-flow block).
In my initial assessment I though that would be a QEMU problem since the interrupt would not be immediately served, but thinking it through no architecture can guarantee that behavior.
Looking it once more, I can see that QEMU implementation is accurate with what is described in the PRM, which is to reset the counter when reaching limit and queue the interrupt. All this happens atomically with setting the IP flag.

In the example above, the monotonicity of k_cycle_get_32() is verified. The Zephyr problem to my eyes is that the k_cycle_get_32 function is directly tied to what COUNT0 returns, which will be not monotonic for the Zephyr infrastructure, when IP flag is set.
Maybe the logic around elapsed is not correct. Or IP should be verified somewhere else as well.

Looking at Zephyr code, without really trying it, I would assume that having timer0_count_register_get defined like this would fix the monotonicity problem, although might not be the most optimal implementation:

  static ALWAYS_INLINE uint32_t timer0_count_register_get(void)
  {
    static uint32_t last = 0;
    uint32_t ctrl = timer0_control_register_get();
    if (ctrl & _ARC_V2_TMR_CTRL_IP == 0) {
      last = z_arc_v2_aux_reg_read(_ARC_V2_TMR0_COUNT);
    }
    return last
  }

@ruuddw
Copy link
Member

ruuddw commented Apr 25, 2022

Zephyr 'elapsed' handles the wrapping/pending IP already, no need to add that to the count_register_get.
Are you sure the IP and count register updates are atomic in QEMU?
Also, the 'fix' in 2625a6f modifies the count value when IP is pending, that doesn't seem right to me. Certainly not in combination with software handlers that also handle the IP overflow in software: it now gets adjusted two times.
Maybe the only issue was the 'delta' that got adjusted before 2625a6f?

cupertinomiranda pushed a commit that referenced this issue Apr 26, 2022
Something that was actually correct in commit:
  2625a6f
@cupertinomiranda
Copy link

With commit 7772126, which is a subset of the previous reverted 2625a6f, the test now passes. Closing the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants