Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stack_sentinel: rare ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429 Threads may not be created in ISRs #16915

Closed
marc-hb opened this issue Jun 19, 2019 · 6 comments · Fixed by #18241
Assignees
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug

Comments

@marc-hb
Copy link
Collaborator

marc-hb commented Jun 19, 2019

Describe the bug
Intermittent failure of sanitycheck test:

qemu_xtensa               tests/kernel/fatal/kernel.common.stack_sentinel

Spotted only once. Generally doesn't fail. The previous run was in the exact same conditions and didn't fail.

To Reproduce
Steps to reproduce the behavior:

  1. Run this test in sanitycheck many times. On a loaded machine maybe?
  2. Pray

Expected behavior
PASS

Impact
Low.

Environment (please complete the following information):

  • OS: Ubuntu 18.0.4.2
  • Toolchain Zephyr SDK 0.10.0
  • Commit SHA or Version used: b58aa20

Additional context
JOBS=50

Screenshots or console output

qemu_xtensa               tests/kernel/fatal/kernel.common.stack_sentinel    FAILED: unexpected byte (qemu)`
***** Booting Zephyr build zephyr-v1.14.0-1527-g565bf3263d3b *****
Running test suite fatal
===================================================================
starting test - test_fatal
test alt thread 1: generic CPU exception
 ** FATAL EXCEPTION
 ** CPU 0 EXCCAUSE 0 PS 0x00060020 PC 0x600044c0 VADDR 0x00000000
 **  A0 0xa0001a72  SP 0x600044c0  A2 0x00000000  A3 0x00000000
 **  A4 0x00000000  A5 0x60004b10  A6 0x00000000  A7 0x00060520
 **  A8 0xa0000f68  A9 0x00000040 A10 0x00000000 A11 0x60004500
 ** A12 0x60004b10 A13 0x60004af0 A14 0xfffffff5 A15 0x00060520
 ** SAR 0x00000000
Current thread ID = 0x60006088
Faulting instruction address = 0xdeaddead
Caught system error -- reason 0
test alt thread 2: initiate kernel oops
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x60006104
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6

    Assertion failed at ZEPHYR_BASE/tests/kernel/fatal/src/main.c:247: test_fatal: (crash_reason not equal to _NANO_ERR_KERNEL_OOPS)
bad reason code got 6 expected 5

test alt thread 3: initiate kernel panic
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x60006104
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6
test stack sentinel overflow - timer irq
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x60006104
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6

    Assertion failed at ZEPHYR_BASE/tests/kernel/fatal/src/main.c:197: check_stack_overflow: (crash_reason not equal to _NANO_ERR_STACK_CHK_FAIL)
bad reason code got 6 expected 2

FAIL - test_fatal
===================================================================
Test suite fatal failed.
===================================================================
PROJECT EXECUTION FAILED
@marc-hb marc-hb added the bug The issue is a bug, or the PR is fixing a bug label Jun 19, 2019
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 19, 2019

Another strange and also rare "Fatal fault in ISR!", does it seem related?

qemu_xtensa/tests/kernel/stack/stack_api/kernel.stack/handler.log

***** Booting Zephyr build zephyr-v1.14.0-1631-g15e8179a3c10 *****
Running test suite stack_api
===================================================================
starting test - test_stack_thread2thread
PASS - test_stack_thread2thread
===================================================================
starting test - test_stack_user_thread2thread
SKIP - test_stack_user_thread2thread
===================================================================
starting test - test_stack_thread2isr
 ** FATAL EXCEPTION
 ** CPU 0 EXCCAUSE 28 PS 0x00060120 PC 0x6000360b VADDR 0x00000008
 **  A0 0xa0003485  SP 0x60004d20  A2 0x00000008  A3 0x6000665c
 **  A4 0x00060723  A5 0x600058c0  A6 0xfffffff5  A7 0x00060620
 **  A8 0x60004dc0  A9 0x60004da0 A10 0x00000000 A11 0x60004df4
 ** A12 0x00000010 A13 0x60004e20 A14 0x00000000 A15 0x00000000
 ** SAR 0x00000000
Current thread ID = 0x6000653c
Faulting instruction address = 0xdeaddead
Fatal fault in ISR! Spinning...

@andrewboie
Copy link
Contributor

please file a different GH issue for the other crash, the test cases and the errors produced are unrelated

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jun 20, 2019

Another nearly identical instance:

qemu_xtensa/tests/kernel/fatal/kernel.common.stack_sentinel/handler.log

***** Booting Zephyr OS build zephyr-v1.14.0-1662-ga75771488cf2 *****
Running test suite fatal
===================================================================
starting test - test_fatal
test alt thread 1: generic CPU exception
 ** FATAL EXCEPTION
 ** CPU 0 EXCCAUSE 0 PS 0x00060020 PC 0x60004580 VADDR 0x00000000
 **  A0 0xa0001a7a  SP 0x60004580  A2 0x00000000  A3 0x00000000
 **  A4 0x00000000  A5 0x60004bd0  A6 0x00000000  A7 0x00060520
 **  A8 0xa0000f6c  A9 0x00000040 A10 0x00000000 A11 0x600045c0
 ** A12 0x60004bd0 A13 0x60004bb0 A14 0xfffffff5 A15 0x00060520
 ** SAR 0x00000000
Current thread ID = 0x60006148
Faulting instruction address = 0xdeaddead
Caught system error -- reason 0
test alt thread 2: initiate kernel oops
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x600061c4
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6

    Assertion failed at ZEPHYR_BASE/tests/kernel/fatal/src/main.c:275: test_fatal: (crash_reason not equal to _NANO_ERR_KERNEL_OOPS)
bad reason code got 6 expected 5

test alt thread 3: initiate kernel panic
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x600061c4
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6
test stack sentinel overflow - timer irq
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x600061c4
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6

    Assertion failed at ZEPHYR_BASE/tests/kernel/fatal/src/main.c:225: check_stack_overflow: (crash_reason not equal to _NANO_ERR_STACK_CHK_FAIL)
bad reason code got 6 expected 2

test stack sentinel overflow - swap
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x600061c4
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6
FAIL - test_fatal
===================================================================
Test suite fatal failed.
===================================================================
PROJECT EXECUTION FAILED

@ioannisg ioannisg added the priority: medium Medium impact/importance bug label Jun 25, 2019
marc-hb added a commit to marc-hb/zephyr that referenced this issue Jun 27, 2019
Quoting bug "Configure QEMU to run independent of the host clock zephyrproject-rtos#14173"

  We have been struggling for years with issues related to how QEMU
  attempts to synchronize guest timer interrupts with the host clock,
  for example zephyrproject-rtos#12553. The symptom is that heavily loaded sanitycheck
  runs have tests spuriously failing due to timing related issues.

  This creates noise in our CI runs which masks true bugs in our system
  which manifest only intermittently, causing real issues that will
  happen all the time at scale to be 'swept under the rug'; right now
  any time a test fails sanitycheck retries it a few times and only
  consecutive failures produce an error.

There's also a lot of relevant information and more links in:
"List of tests that keep failing sporadically" zephyrproject-rtos#12553

This new "emu_time" tag helps by letting users either select or exclude
the tests that really need accurate time to pass and have a high chance
to actually be impacted by this emulation issue. As an example, it's
only with 'sanitycheck --exclude emu_time' that I could spot and file
intermittent but non-emu_time issue zephyrproject-rtos#16915. As Andrew predicted above,
it was drown in emu_time noise before that.

Conversely, "--tag emu_time" can be used by developers focusing on
fixing qemu's -icount feature, for instance zephyrproject-rtos#14173 or others.

Even before qemu's -icount is fixed, Continuous Integration could be
split in two separate runs: A. --tag emu_time and B. --exclude
emu_time. Only A tests would be allowed retries which would stop hiding
other, unrelated intermittent issues affecting B tests.

This initial commit does not pretend to exhaustively tag all affected
tests. However it's an already functional and useful start of 14 tests
collected from and tested over weeks of local sanitycheck runs and
_partially_ reviewed by qemu clock expert Andy Ross.

This commit also increases the individual timeout of 7 tests that have
been observed to consistently take much longer than the
median (2-3s). This flags how unusually long these are, lets users
temporarily reduce the very long 60s default timeout in their local
workspace and finally should reduce the chance of them timing out on a
heavily loaded system. Set their timeout to 3-4 times the duration
observed in CI and locally.

Signed-off-by: Marc Herbert <[email protected]>
@andrewboie andrewboie added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Aug 13, 2019
@andrewboie
Copy link
Contributor

ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs

This seems to be the culprit. The check is here:

k_tid_t z_impl_k_thread_create(struct k_thread *new_thread,
			      k_thread_stack_t *stack,
			      size_t stack_size, k_thread_entry_t entry,
			      void *p1, void *p2, void *p3,
			      int prio, u32_t options, s32_t delay)
{
	__ASSERT(!z_is_in_isr(), "Threads may not be created in ISRs");

I think there is some issue on how z_is_in_isr() is implemented in xtensa.

@andrewboie andrewboie added priority: medium Medium impact/importance bug and removed priority: low Low impact/importance bug labels Aug 13, 2019
@andrewboie
Copy link
Contributor

I'm currently unable to reproduce this issue with the current master branch, even on a heavily loaded system.

@andrewboie
Copy link
Contributor

/* Entry setup for all exceptions and interrupts.  Arrive here with
 * the stack pointer decremented across a base save area, A0-A3 and
 * PS/PC already spilled to the stack in the BSA, and A2 containing a
 * level-specific C handler function.
 *
 * This is a macro (to allow for unit testing) that expands to a
 * handler body to which the vectors can jump.  It takes two static
 * (!) arguments: a special register name (which should be set up to
 * point to some kind of per-CPU record struct) and offsets within
 * that struct which contains an interrupt stack top and a "nest
 * count" word.
 */
.macro EXCINT_HANDLER SR, NEST_OFF, INTSTACK_OFF
	/* A2 contains our handler function which will get clobbered
	 * by the save.  Stash it into the unused "a1" slot in the
	 * BSA and recover it immediately after.  Kind of a hack.
	 */
	s32i a2, a1, BSA_SCRATCH_OFF

	ODD_REG_SAVE
	call0 xtensa_save_high_regs

	l32i a2, a1, 0
	l32i a2, a2, BSA_SCRATCH_OFF

	/* There's a gotcha with level 1 handlers: the INTLEVEL field
	 * gets left at zero and not set like high priority interrupts
	 * do.  That works fine for exceptions, but for L1 interrupts,
	 * when we unmask EXCM below, the CPU will just fire the
	 * interrupt again and get stuck in a loop blasting save
	 * frames down the stack to the bottom of memory.  It would be
	 * good to put this code into the L1 handler only, but there's
	 * not enough room in the vector without some work there to
	 * squash it some.  Next choice would be to make this a macro
	 * argument and expand two versions of this handler.  An
	 * optimization FIXME, I guess.
	 */
	rsr.PS a0
	movi a3, PS_INTLEVEL_MASK
	and a0, a0, a3
	bnez a0, _not_l1
	rsr.PS a0
	movi a3, PS_INTLEVEL(1)
	or a0, a0, a3
	wsr.PS a0
_not_l1:

	/* Unmask EXCM bit so C code can spill/fill in window
	 * exceptions.  Note interrupts are already fully masked by
	 * INTLEVEL, so this is safe.
	 */
	rsr.PS a0
	movi a3, ~(PS_EXCM_MASK)
	and a0, a0, a3
	wsr.PS a0
	rsync

	/* A1 already contains our saved stack, and A2 our handler.
	 * So all that's needed for CROSS_STACK_CALL is to put the
	 * "new" stack into A3.  This can be either a copy of A1 or an
	 * entirely new area depending on whether we find a 1 in our
	 * SR[off] macro argument.
	 */
	rsr.\SR a3
	l32i a0, a3, \NEST_OFF
	beqz a0, _switch_stacks_\@

	/* Use the same stack, just copy A1 to A3 after incrementing NEST */
	addi a0, a0, 1
	s32i a0, a3, \NEST_OFF
	mov a3, a1
	j _do_call_\@

_switch_stacks_\@:
	addi a0, a0, 1
	s32i a0, a3, \NEST_OFF
	l32i a3, a3, \INTSTACK_OFF

_do_call_\@:
	CROSS_STACK_CALL

	/* Decrement nest count */
	rsr.\SR a3
	l32i a0, a3, \NEST_OFF
	addi a0, a0, -1
	s32i a0, a3, \NEST_OFF

	/* Last trick: the called function returned the "next" handle
	 * to restore to in A6 (the call4'd function's A2).  If this
	 * is not the same handle as we started with, we need to do a
	 * register spill before restoring, for obvious reasons.
	 * Remember to restore the A1 stack pointer as it existed at
	 * interrupt time so the caller of the interrupted function
	 * spills to the right place.  Also mask interrupts (which
	 * have been unmasked during the handler execution) while we
	 * muck with the windows.  The restore will unmask them
	 * correctly.
	 */
	beq a6, a1, _restore_\@
	rsil a0, XCHAL_NMILEVEL
	l32i a1, a1, 0
	addi a1, a1, BASE_SAVE_AREA_SIZE
	SPILL_ALL_WINDOWS
	mov a1, a6

_restore_\@:
	j _restore_context
.endm

This is the code that increments/decrements nested.
AFAICT, interrupts are NOT masked on the way out, after CROSS_STACK_CALL, when the nest count is decremented, despite the fact that this decrement operation is not atomic, which could plausibly explain what's happening if we get a nested IRQ at the wrong time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel bug The issue is a bug, or the PR is fixing a bug priority: medium Medium impact/importance bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants