stack_sentinel: rare ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429 Threads may not be created in ISRs #16915

marc-hb · 2019-06-19T00:29:36Z

Describe the bug
Intermittent failure of sanitycheck test:

qemu_xtensa               tests/kernel/fatal/kernel.common.stack_sentinel

Spotted only once. Generally doesn't fail. The previous run was in the exact same conditions and didn't fail.

To Reproduce
Steps to reproduce the behavior:

Run this test in sanitycheck many times. On a loaded machine maybe?
Pray

Expected behavior
PASS

Impact
Low.

Environment (please complete the following information):

OS: Ubuntu 18.0.4.2
Toolchain Zephyr SDK 0.10.0
Commit SHA or Version used: b58aa20

Additional context
JOBS=50

Screenshots or console output

qemu_xtensa               tests/kernel/fatal/kernel.common.stack_sentinel    FAILED: unexpected byte (qemu)`
***** Booting Zephyr build zephyr-v1.14.0-1527-g565bf3263d3b *****
Running test suite fatal
===================================================================
starting test - test_fatal
test alt thread 1: generic CPU exception
 ** FATAL EXCEPTION
 ** CPU 0 EXCCAUSE 0 PS 0x00060020 PC 0x600044c0 VADDR 0x00000000
 **  A0 0xa0001a72  SP 0x600044c0  A2 0x00000000  A3 0x00000000
 **  A4 0x00000000  A5 0x60004b10  A6 0x00000000  A7 0x00060520
 **  A8 0xa0000f68  A9 0x00000040 A10 0x00000000 A11 0x60004500
 ** A12 0x60004b10 A13 0x60004af0 A14 0xfffffff5 A15 0x00060520
 ** SAR 0x00000000
Current thread ID = 0x60006088
Faulting instruction address = 0xdeaddead
Caught system error -- reason 0
test alt thread 2: initiate kernel oops
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x60006104
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6

    Assertion failed at ZEPHYR_BASE/tests/kernel/fatal/src/main.c:247: test_fatal: (crash_reason not equal to _NANO_ERR_KERNEL_OOPS)
bad reason code got 6 expected 5

test alt thread 3: initiate kernel panic
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x60006104
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6
test stack sentinel overflow - timer irq
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x60006104
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6

    Assertion failed at ZEPHYR_BASE/tests/kernel/fatal/src/main.c:197: check_stack_overflow: (crash_reason not equal to _NANO_ERR_STACK_CHK_FAIL)
bad reason code got 6 expected 2

FAIL - test_fatal
===================================================================
Test suite fatal failed.
===================================================================
PROJECT EXECUTION FAILED

The text was updated successfully, but these errors were encountered:

marc-hb · 2019-06-19T23:29:37Z

Another strange and also rare "Fatal fault in ISR!", does it seem related?

qemu_xtensa/tests/kernel/stack/stack_api/kernel.stack/handler.log

***** Booting Zephyr build zephyr-v1.14.0-1631-g15e8179a3c10 *****
Running test suite stack_api
===================================================================
starting test - test_stack_thread2thread
PASS - test_stack_thread2thread
===================================================================
starting test - test_stack_user_thread2thread
SKIP - test_stack_user_thread2thread
===================================================================
starting test - test_stack_thread2isr
 ** FATAL EXCEPTION
 ** CPU 0 EXCCAUSE 28 PS 0x00060120 PC 0x6000360b VADDR 0x00000008
 **  A0 0xa0003485  SP 0x60004d20  A2 0x00000008  A3 0x6000665c
 **  A4 0x00060723  A5 0x600058c0  A6 0xfffffff5  A7 0x00060620
 **  A8 0x60004dc0  A9 0x60004da0 A10 0x00000000 A11 0x60004df4
 ** A12 0x00000010 A13 0x60004e20 A14 0x00000000 A15 0x00000000
 ** SAR 0x00000000
Current thread ID = 0x6000653c
Faulting instruction address = 0xdeaddead
Fatal fault in ISR! Spinning...

andrewboie · 2019-06-19T23:33:07Z

please file a different GH issue for the other crash, the test cases and the errors produced are unrelated

marc-hb · 2019-06-20T18:26:03Z

Another nearly identical instance:

qemu_xtensa/tests/kernel/fatal/kernel.common.stack_sentinel/handler.log

***** Booting Zephyr OS build zephyr-v1.14.0-1662-ga75771488cf2 *****
Running test suite fatal
===================================================================
starting test - test_fatal
test alt thread 1: generic CPU exception
 ** FATAL EXCEPTION
 ** CPU 0 EXCCAUSE 0 PS 0x00060020 PC 0x60004580 VADDR 0x00000000
 **  A0 0xa0001a7a  SP 0x60004580  A2 0x00000000  A3 0x00000000
 **  A4 0x00000000  A5 0x60004bd0  A6 0x00000000  A7 0x00060520
 **  A8 0xa0000f6c  A9 0x00000040 A10 0x00000000 A11 0x600045c0
 ** A12 0x60004bd0 A13 0x60004bb0 A14 0xfffffff5 A15 0x00060520
 ** SAR 0x00000000
Current thread ID = 0x60006148
Faulting instruction address = 0xdeaddead
Caught system error -- reason 0
test alt thread 2: initiate kernel oops
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x600061c4
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6

    Assertion failed at ZEPHYR_BASE/tests/kernel/fatal/src/main.c:275: test_fatal: (crash_reason not equal to _NANO_ERR_KERNEL_OOPS)
bad reason code got 6 expected 5

test alt thread 3: initiate kernel panic
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x600061c4
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6
test stack sentinel overflow - timer irq
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x600061c4
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6

    Assertion failed at ZEPHYR_BASE/tests/kernel/fatal/src/main.c:225: check_stack_overflow: (crash_reason not equal to _NANO_ERR_STACK_CHK_FAIL)
bad reason code got 6 expected 2

test stack sentinel overflow - swap
ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs
@ ZEPHYR_BASE/lib/os/assert.c:30:
***** Kernel Panic! *****
Current thread ID = 0x600061c4
Faulting instruction address = 0xdeaddead
Caught system error -- reason 6
FAIL - test_fatal
===================================================================
Test suite fatal failed.
===================================================================
PROJECT EXECUTION FAILED

Quoting bug "Configure QEMU to run independent of the host clock zephyrproject-rtos#14173" We have been struggling for years with issues related to how QEMU attempts to synchronize guest timer interrupts with the host clock, for example zephyrproject-rtos#12553. The symptom is that heavily loaded sanitycheck runs have tests spuriously failing due to timing related issues. This creates noise in our CI runs which masks true bugs in our system which manifest only intermittently, causing real issues that will happen all the time at scale to be 'swept under the rug'; right now any time a test fails sanitycheck retries it a few times and only consecutive failures produce an error. There's also a lot of relevant information and more links in: "List of tests that keep failing sporadically" zephyrproject-rtos#12553 This new "emu_time" tag helps by letting users either select or exclude the tests that really need accurate time to pass and have a high chance to actually be impacted by this emulation issue. As an example, it's only with 'sanitycheck --exclude emu_time' that I could spot and file intermittent but non-emu_time issue zephyrproject-rtos#16915. As Andrew predicted above, it was drown in emu_time noise before that. Conversely, "--tag emu_time" can be used by developers focusing on fixing qemu's -icount feature, for instance zephyrproject-rtos#14173 or others. Even before qemu's -icount is fixed, Continuous Integration could be split in two separate runs: A. --tag emu_time and B. --exclude emu_time. Only A tests would be allowed retries which would stop hiding other, unrelated intermittent issues affecting B tests. This initial commit does not pretend to exhaustively tag all affected tests. However it's an already functional and useful start of 14 tests collected from and tested over weeks of local sanitycheck runs and _partially_ reviewed by qemu clock expert Andy Ross. This commit also increases the individual timeout of 7 tests that have been observed to consistently take much longer than the median (2-3s). This flags how unusually long these are, lets users temporarily reduce the very long 60s default timeout in their local workspace and finally should reduce the chance of them timing out on a heavily loaded system. Set their timeout to 3-4 times the duration observed in CI and locally. Signed-off-by: Marc Herbert <[email protected]>

andrewboie · 2019-08-13T20:39:18Z

ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429
        Threads may not be created in ISRs

This seems to be the culprit. The check is here:

k_tid_t z_impl_k_thread_create(struct k_thread *new_thread,
			      k_thread_stack_t *stack,
			      size_t stack_size, k_thread_entry_t entry,
			      void *p1, void *p2, void *p3,
			      int prio, u32_t options, s32_t delay)
{
	__ASSERT(!z_is_in_isr(), "Threads may not be created in ISRs");

I think there is some issue on how z_is_in_isr() is implemented in xtensa.

andrewboie · 2019-08-13T22:06:34Z

I'm currently unable to reproduce this issue with the current master branch, even on a heavily loaded system.

andrewboie · 2019-08-13T22:18:36Z

/* Entry setup for all exceptions and interrupts.  Arrive here with
 * the stack pointer decremented across a base save area, A0-A3 and
 * PS/PC already spilled to the stack in the BSA, and A2 containing a
 * level-specific C handler function.
 *
 * This is a macro (to allow for unit testing) that expands to a
 * handler body to which the vectors can jump.  It takes two static
 * (!) arguments: a special register name (which should be set up to
 * point to some kind of per-CPU record struct) and offsets within
 * that struct which contains an interrupt stack top and a "nest
 * count" word.
 */
.macro EXCINT_HANDLER SR, NEST_OFF, INTSTACK_OFF
	/* A2 contains our handler function which will get clobbered
	 * by the save.  Stash it into the unused "a1" slot in the
	 * BSA and recover it immediately after.  Kind of a hack.
	 */
	s32i a2, a1, BSA_SCRATCH_OFF

	ODD_REG_SAVE
	call0 xtensa_save_high_regs

	l32i a2, a1, 0
	l32i a2, a2, BSA_SCRATCH_OFF

	/* There's a gotcha with level 1 handlers: the INTLEVEL field
	 * gets left at zero and not set like high priority interrupts
	 * do.  That works fine for exceptions, but for L1 interrupts,
	 * when we unmask EXCM below, the CPU will just fire the
	 * interrupt again and get stuck in a loop blasting save
	 * frames down the stack to the bottom of memory.  It would be
	 * good to put this code into the L1 handler only, but there's
	 * not enough room in the vector without some work there to
	 * squash it some.  Next choice would be to make this a macro
	 * argument and expand two versions of this handler.  An
	 * optimization FIXME, I guess.
	 */
	rsr.PS a0
	movi a3, PS_INTLEVEL_MASK
	and a0, a0, a3
	bnez a0, _not_l1
	rsr.PS a0
	movi a3, PS_INTLEVEL(1)
	or a0, a0, a3
	wsr.PS a0
_not_l1:

	/* Unmask EXCM bit so C code can spill/fill in window
	 * exceptions.  Note interrupts are already fully masked by
	 * INTLEVEL, so this is safe.
	 */
	rsr.PS a0
	movi a3, ~(PS_EXCM_MASK)
	and a0, a0, a3
	wsr.PS a0
	rsync

	/* A1 already contains our saved stack, and A2 our handler.
	 * So all that's needed for CROSS_STACK_CALL is to put the
	 * "new" stack into A3.  This can be either a copy of A1 or an
	 * entirely new area depending on whether we find a 1 in our
	 * SR[off] macro argument.
	 */
	rsr.\SR a3
	l32i a0, a3, \NEST_OFF
	beqz a0, _switch_stacks_\@

	/* Use the same stack, just copy A1 to A3 after incrementing NEST */
	addi a0, a0, 1
	s32i a0, a3, \NEST_OFF
	mov a3, a1
	j _do_call_\@

_switch_stacks_\@:
	addi a0, a0, 1
	s32i a0, a3, \NEST_OFF
	l32i a3, a3, \INTSTACK_OFF

_do_call_\@:
	CROSS_STACK_CALL

	/* Decrement nest count */
	rsr.\SR a3
	l32i a0, a3, \NEST_OFF
	addi a0, a0, -1
	s32i a0, a3, \NEST_OFF

	/* Last trick: the called function returned the "next" handle
	 * to restore to in A6 (the call4'd function's A2).  If this
	 * is not the same handle as we started with, we need to do a
	 * register spill before restoring, for obvious reasons.
	 * Remember to restore the A1 stack pointer as it existed at
	 * interrupt time so the caller of the interrupted function
	 * spills to the right place.  Also mask interrupts (which
	 * have been unmasked during the handler execution) while we
	 * muck with the windows.  The restore will unmask them
	 * correctly.
	 */
	beq a6, a1, _restore_\@
	rsil a0, XCHAL_NMILEVEL
	l32i a1, a1, 0
	addi a1, a1, BASE_SAVE_AREA_SIZE
	SPILL_ALL_WINDOWS
	mov a1, a6

_restore_\@:
	j _restore_context
.endm

This is the code that increments/decrements nested.
AFAICT, interrupts are NOT masked on the way out, after CROSS_STACK_CALL, when the nest count is decremented, despite the fact that this decrement operation is not atomic, which could plausibly explain what's happening if we get a nested IRQ at the wrong time.

marc-hb added the bug The issue is a bug, or the PR is fixing a bug label Jun 19, 2019

marc-hb assigned andyross and andrewboie Jun 19, 2019

marc-hb added the area: Kernel label Jun 19, 2019

ioannisg added the priority: medium Medium impact/importance bug label Jun 25, 2019

marc-hb mentioned this issue Jun 27, 2019

CI testing must be retry-free #14649

Closed

marc-hb mentioned this issue Jun 27, 2019

tests: new sanitycheck tag "emu_time"; increase some timeouts #17107

Closed

andrewboie added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Aug 13, 2019

andrewboie added priority: medium Medium impact/importance bug and removed priority: low Low impact/importance bug labels Aug 13, 2019

andrewboie mentioned this issue Aug 13, 2019

xtensa: mask interrupts earlier #18241

Merged

andrewboie closed this as completed in #18241 Aug 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stack_sentinel: rare ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429 Threads may not be created in ISRs #16915

stack_sentinel: rare ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429 Threads may not be created in ISRs #16915

marc-hb commented Jun 19, 2019

marc-hb commented Jun 19, 2019

andrewboie commented Jun 19, 2019

marc-hb commented Jun 20, 2019 •

edited

Loading

andrewboie commented Aug 13, 2019

andrewboie commented Aug 13, 2019

andrewboie commented Aug 13, 2019

stack_sentinel: rare ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429 Threads may not be created in ISRs #16915

stack_sentinel: rare ASSERTION FAIL [!(z_arch_curr_cpu()->nested != 0U)] @ ZEPHYR_BASE/kernel/thread.c:429 Threads may not be created in ISRs #16915

Comments

marc-hb commented Jun 19, 2019

marc-hb commented Jun 19, 2019

andrewboie commented Jun 19, 2019

marc-hb commented Jun 20, 2019 • edited Loading

andrewboie commented Aug 13, 2019

andrewboie commented Aug 13, 2019

andrewboie commented Aug 13, 2019

marc-hb commented Jun 20, 2019 •

edited

Loading