Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

murdock: enable esp32-wroom-32 for CI testing #11449

Merged
merged 2 commits into from
Dec 6, 2019

Conversation

kaspar030
Copy link
Contributor

Contribution description

I've added three esp32-wroom-32 boards to the CI. This PR enables them for testing.

Testing procedure

Let CI run with "run tests" set.

Issues/PRs references

none

@kaspar030 kaspar030 added CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Area: CI Area: Continuous Integration of RIOT components CI: run tests If set, CI server will run tests on hardware for the labeled PR Platform: ESP Platform: This PR/issue effects ESP-based platforms labels Apr 25, 2019
@kaspar030
Copy link
Contributor Author

hrmpf, the reset circuitry isn't working properly. the boards needs capacitors between EN and GND. :(

@kaspar030 kaspar030 added State: WIP State: The PR is still work-in-progress and its code is not in its final presentable form yet and removed CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Apr 25, 2019
'\t 1 | pending Q | 15',
'\t 2 | running Q | 7'
r'\s+pid | state Q | pri',
r'\s+1 | pending Q |\s+\d+',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to ESP32? SCHED_PRIO_LEVELS seems to be not architecture specific but a common variable that might change from application to application as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not ESP32 related. I'll cut out the commit into another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the esp Makefile states # if any WiFi interface is used, the number of priority levels has to be 32.

Do you remember the reason why?

A couple of tests fail because the priorities change with SCHED_PRIO_LEVELS=32.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, number of priority levels is simply to small if esp_wifi is used. If esp_wifi is used there are 3 additional high-priority threads:

  • WiFi hardware handler wifi with priority 1
  • WiFi event loop wifi-event-loop with priority 4 which is defined by the ESP32 SDK.
  • WiFi netdev netdev-esp-wifi with priority 10 which is simply the priority defined by GNRC_NETIF_PRIO = (THREAD_PRIORITY_MAIN - 5).
> ps
	pid | name                 | state    Q | pri | stack  ( used) | base addr  | current     
	  - | isr_stack            | -        - |   - |   2048 (  400) | 0x3ffb0220 | 0x3ffb0a20
	  1 | wifi-event-loop      | bl rx    _ |   4 |   2100 ( 1020) | 0x3ffaee18 | 0x3ffaf430 
	  2 | idle                 | pending  Q |  31 |   2048 (  448) | 0x3ffb4b80 | 0x3ffb51e0 
	  3 | main                 | running  Q |  15 |   3072 ( 1616) | 0x3ffb5380 | 0x3ffb5d00 
	  4 | pktdump              | bl rx    _ |  14 |   3072 (  560) | 0x3ffb9da0 | 0x3ffba770 
	  5 | ipv6                 | bl rx    _ |  12 |   2048 (  752) | 0x3ffb7890 | 0x3ffb7e40 
	  6 | udp                  | bl rx    _ |  13 |   2048 (  576) | 0x3ffbb3f0 | 0x3ffbb9b0 
	  7 | wifi                 | bl rx    _ |   1 |   3636 ( 1848) | 0x3ffbe004 | 0x3ffbec10 
	  8 | netdev-esp-wifi      | bl rx    _ |  10 |   2048 (  848) | 0x3ffb6490 | 0x3ffb6a10 
	  9 | RPL                  | bl rx    _ |  13 |   2048 (  492) | 0x3ffbabec | 0x3ffbb200 
	    | SUM                  |            |     |  24168 ( 8560)

If you would use the default number of priority levels, THREAD_PRIORITY_MAIN would be 7 and GNRC_NETIF_PRIO would be 2. This could lead to a deadlock situations because upper layer netdev-esp-wifi has higher priority than lower layer thread wifi-event-loop which is dealing with events from hardware driver thread wifi. At least, there will be instabilities if thread wifi-event-loop can't handle events from WiFi hardware just in time.

Following log shows the priorities with default number of priority levels.

> ps
	pid | name                 | state    Q | pri | stack  ( used) | base addr  | current     
	  - | isr_stack            | -        - |   - |   2048 (  400) | 0x3ffb0220 | 0x3ffb0a20
	  1 | wifi-event-loop      | bl rx    _ |   4 |   2100 ( 1020) | 0x3ffaee18 | 0x3ffaf430 
	  2 | idle                 | pending  Q |  15 |   2048 (  416) | 0x3ffb4b40 | 0x3ffb51a0 
	  3 | main                 | running  Q |   7 |   3072 ( 1616) | 0x3ffb5340 | 0x3ffb5cc0 
	  4 | pktdump              | bl rx    _ |   6 |   3072 (  560) | 0x3ffb9d60 | 0x3ffba730 
	  5 | ipv6                 | bl rx    _ |   4 |   2048 (  752) | 0x3ffb7850 | 0x3ffb7e00 
	  6 | udp                  | bl rx    _ |   5 |   2048 (  576) | 0x3ffbb3b0 | 0x3ffbb970 
	  7 | wifi                 | bl rx    _ |   1 |   3636 ( 1864) | 0x3ffbdfc4 | 0x3ffbebd0 
	  8 | netdev-esp-wifi      | bl rx    _ |   2 |   2048 (  848) | 0x3ffb6450 | 0x3ffb69d0 
	  9 | RPL                  | bl rx    _ |   5 |   2048 (  492) | 0x3ffbabac | 0x3ffbb1c0 
	    | SUM                  |            |     |  24168 ( 8544)

It seems that there are too many threads with same priority. I don't know remember where, but I have read anywhere in Wiki or the online doc that each thread should have it's own priority since threads with same priority don't lead to a context switch as long as both are in state pending.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I have seen that there is still an inconsistency in cpu/esp32/Makefile.include. SCHED_PRIO_LEVELS is always set to 32.

CFLAGS  += -DSCHED_PRIO_LEVELS=32
...
# if any WiFi interface is used,  the number of priority levels has to be 32
ifneq (,$(filter esp_wifi_any,$(USEMODULE)))
    CFLAGS += -DSCHED_PRIO_LEVELS=32
endif

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look to the tests that failed. All of them are tests that don't require module esp_wifi. So, I would prefer to reactivate the conditional setting of SCHED_PRIO_LEVELS.

 CFLAGS  += -DSCHED_PRIO_LEVELS=32
-CFLAGS  += -DSCHED_PRIO_LEVELS=32
...
# if any WiFi interface is used,  the number of priority levels has to be 32
ifneq (,$(filter esp_wifi_any,$(USEMODULE)))
    CFLAGS += -DSCHED_PRIO_LEVELS=32
endif

This should help the tests which insist on certain PIDs and will work for networking applications as before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I would prefer to reactivate the conditional setting of SCHED_PRIO_LEVELS.

ok!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This did the trick for most tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ps regex is currently wrong, as | is a regex character. It should be escaped to have a proper test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are not necessary any longer. After changing the thread priority levels in PR #12753 everything is working without changes.

@@ -140,6 +140,8 @@ LINKFLAGS += -Wl,--warn-unresolved-symbols
FLASH_MODE ?= dout # FIX configuration, DO NOT CHANGE
FLASH_FREQ = 40m # FIX configuration, DO NOT CHANGE
FLASH_SIZE ?= 2MB
FLASHFILE ?= $(ELFFILE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For which purpose is this define required? Is it something used by murdock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the "test-murdock" target needs to know which file to send to the RasPi worker nodes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When setting FLASHFILE, the flash/preflash targets should actually be using FLASHFILE instead of ELFFILE. I have two commits in #8838 that I can PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This now has its PR #11648

# used by Murdock
export RESET = dtr $(PORT) 0 && dtr $(PORT) 1
else
export RESET ?= esptool.py --before default_reset run
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it do the job you need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not, it doesn't work with the "make test" target. In that use-case, the test terminal program has the tty open, then "make reset" is called. esptool tries to connect to the bootloader instead of just resetting via RTS/DTR, and fails because the test terminal is hijacking the communication.

Also, on the RPi workers, spawning python takes seconds. ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there not a python alternative for RIOT that could be included in the repo ?

I mean, flashing is done with python, the tests are run with python, so using python for this would be easier than having to install external programs.

And the RPI could use its own C implementation anyway by overwriting RESET and RESET_FLAGS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally local testing is not much different from what murdock does. And for Murdock, spawning python is slow. We should maybe just integrate the (5 line) c tool into the RIOT repo.

@gschorcht
Copy link
Contributor

hrmpf, the reset circuitry isn't working properly. the boards needs capacitors between EN and GND. :(

I guess it depends on the board you really have. The original ESP32 DevKitC V4 which is the master for all clones with ESP32-WROOM-32 modules has a capacitor between EN and GND.

@kaspar030
Copy link
Contributor Author

I guess it depends on the board you really have.

They look like this, but I got them cheaply from China, so they might be clones. The printing on the backside is a little off.

The original ESP32 DevKitC V4 which is the master for all clones with ESP32-WROOM-32 modules has a capacitor between EN and GND.

I tried adding one, but only had 100uF caps lying around. Weirdly, esptool.py can now properly change to boot mode, reset or flash the device and get it's chip id etc. but the user code doesn't run anymore (board outputs only the garbage as seen in this PR's failed tests).

@gschorcht
Copy link
Contributor

but the user code doesn't run anymore (board outputs only the garbage as seen in this PR's failed tests).

How exactly looks the garbage? Maybe, there is a timing problem when changing the system clock and the UART hasn't the right baudrate anymore.

@kaspar030 kaspar030 added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Apr 29, 2019
@kaspar030
Copy link
Contributor Author

How exactly looks the garbage? Maybe, there is a timing problem when changing the system clock and the UART hasn't the right baudrate anymore.

That's how it looked (wrong baudrate). Anyhow, I changed the caps to 2.2uF (from 100uF) and now it seems to work fine! Let's see how many tests run through...

@gschorcht
Copy link
Contributor

That's how it looked (wrong baudrate). Anyhow, I changed the caps to 2.2uF (from 100uF) and now it seems to work fine! Let's see how many tests run through...

So was just the timing.

@kaspar030
Copy link
Contributor Author

I've pulled out a general fix for tests/posix_semaphore in #11467.

@kaspar030
Copy link
Contributor Author

@gschorcht could you maybe take over this PR? I think all the CI specific parts should be working. Some tests are still failing for reasons you might be better suited to fix...

@gschorcht gschorcht self-assigned this May 6, 2019
@gschorcht
Copy link
Contributor

gschorcht commented May 24, 2019

@kaspar030 I took a look to the remaining tests that fail. For most of them it is not really clear to me why they fail. Even though ESP32 produce the right output, the python script runs into timeout. For example, tests/thread_msg_block_w_queue gives the following results:

Starting RIOT kernel on PRO cpu
I (296) [main_trampoline]: main(): This is RIOT! (Version: buildtest)
sender_thread start
main thread alive

Timeout in expect script at "child.expect('sender_thread start\r\n')" (tests/thread_msg_block_w_queue/tests/01-run.py:14)

Even though the test application prints out sender_thread start, the python script gets a timeout while waiting for it. This is the same for tests/thread_msg_block_wo_queue, tests/thread_basic and even tests/xtimer_usleep_short:

Starting RIOT kernel on PRO cpu
I (295) [main_trampoline]: main(): This is RIOT! (Version: buildtest)
This test will call xtimer_usleep for values from 500 down to 0
going to sleep 500 usecs...
going to sleep 499 usecs...
...
going to sleep 0 usecs...
[SUCCESS]

Timeout in expect script at "child.expect(u"This test will call xtimer_usleep for values from \\d+ down to \\d+\r\n")" (tests/xtimer_usleep_short/tests/01-run.py:15)

tests/posix_semaphore failed since the timer takes 1000102 us instead of 1000000 us due the 100 us margin. This problem is solved by your PR #11467.

tests/ssp fails because the compiler optimizes out the whole test_funcfunction 😟 Maybe the best way is to blacklist ESP32 boards for this test.

@kaspar030
Copy link
Contributor Author

Even though the test application prints out sender_thread start, the python script gets a timeout while waiting for it. This is the same for tests/thread_msg_block_wo_queue, tests/thread_basic and even tests/xtimer_usleep_short:

might be the line endings (\r\n). I'll try using $ instead.

@kaspar030
Copy link
Contributor Author

might be the line endings (\r\n).

This really does seem line ending related. Somehow sometimes there's a "0x0a" missing.

@gschorcht
Copy link
Contributor

gschorcht commented Nov 28, 2019

@kaspar030 CI tests succeed. The PR still needs squashing.

@kaspar030
Copy link
Contributor Author

The PR still needs squashing.

#12816 is just temporarily merged here, but not merged yet. need to wait...

@gschorcht
Copy link
Contributor

The PR still needs squashing.

#12816 is just temporarily merged here, but not merged yet. need to wait...

Yes, of course. For any reason, I thought PR #12816 has been merged already, my fault.

@kaspar030 kaspar030 removed the State: WIP State: The PR is still work-in-progress and its code is not in its final presentable form yet label Dec 5, 2019
@kaspar030
Copy link
Contributor Author

#12816 is just temporarily merged here, but not merged yet. need to wait...

#12816 is merged, I rebased this one.

@kaspar030
Copy link
Contributor Author

tests/pkg_spiffs failed with timeout...

@gschorcht
Copy link
Contributor

tests/pkg_spiffs failed with timeout...

It doesn't seem to be related to the test. The log only shows

dwqw: command timed out

whatever it means.

@gschorcht
Copy link
Contributor

It failed again on pi-6356c652.

@kaspar030
Copy link
Contributor Author

dwqw: command timed out

whatever it means.

That means the test took longer than 5 minutes to complete and has been cancelled. Currently Murdock cannot capture the test output in that case.

@gschorcht
Copy link
Contributor

gschorcht commented Dec 5, 2019

dwqw: command timed out

whatever it means.

That means the test took longer than 5 minutes to complete and has been cancelled. Currently Murdock cannot capture the test output in that case.

I see. Might it be a problem of the ESP32 node on pi-6356c652. This was the node which also had the problems the last time. The test itself should work after we increased the timeout with commit 47a4ff0.

@kaspar030
Copy link
Contributor Author

I see. Might it be a problem of the ESP32 node on pi-6356c652. This was the node which also had the problem the last time.

Well, the ESP is a new one. Lets hope that Pi isn't eating esp boards...

Here's the local uart output of the test application:

Help: Press s to start test, r to print it is ready
START
main(): This is RIOT! (Version: 2020.01-devel-1274-g2b934d)
EXCEPTION!! exccause=0 (IllegalInstructionCause) @800df20d excvaddr=00000000


heap: 184120 (free 183292) byte

register set
pc      : 401238e8      ps      : 00060130      exccause: 00000000      excvaddr: 00000000
epc1    : 401238e8      epc2    : 00000000      epc3    : 00000000      epc4    : 00000000
a0      : 800df20d      a1      : 3ffb2800      a2      : 3ffb2894      a3      : 3ffb2870
a4      : 00000001      a5      : 3ffb28a8      a6      : 00000000      a7      : 00000001
a8      : 800da710      a9      : 3ffb2800      a10     : 3ffb289c      a11     : 3ffb2800
a12     : 3ffb289c      a13     : 3ffb107c      a14     : 00000000      a15     : 00000001
lbeg    : 40084bc8      lend    : 40084bd3      lcount  : 00000000
ets Jun  8 2016 00:22:57

rst:0x7 (TG0WDT_SYS_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DOUT, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:4004
load:0x40078000,len:7304
load:0x40080000,len:7016
entry 0x40080344
W (29) boot: PRO CPU has been reset by WDT.
W (29) boot: WDT reset info: PRO CPU PC=0x40082e04
W (29) boot: WDT reset info: APP CPU PC=0x3ccc4d07

Help: Press s to start test, r to print it is ready

@gschorcht
Copy link
Contributor

Ups, that's a crash. Let me check.

@gschorcht
Copy link
Contributor

I can reproduce the crash which is new. SPIFFS was definitely working with the last tests.

@gschorcht
Copy link
Contributor

gschorcht commented Dec 5, 2019

It seems that the problem came in with the merge of PR #12810. Even though ESP32 flash driver was touched by this PR, the changes are only changes in log output during intialization. The problem doesn't occur when compiled with debug option. I have to investigate it further.

@gschorcht
Copy link
Contributor

It seems that the problem came in with the merge of PR #12810. Even though ESP32 flash driver was touched by this PR, the changes are only changes in log output during intialization. The problem doesn't occur when compiled with debug option. I have to investigate it further.

This is one of the problems I really hate. The test works with debug compile option, it works in QEMU, it works when board configuration is printed at startup and it works if any printf is added to the test or the SPI flash driver. But it can't be used with OCD. How to figure out the problem under these conditions? Right now, I'm completely at a loss.

Sure, we could enable the output of the startup information as it was before as a workaround for this test. However, this would probably not solve the cause of the problem.

My guess is that the problem is caused by an asynchronous UART interrupt while the cache for the program code in SPI flash is disabled due to a write operation to the SPI flash drive.

@gschorcht
Copy link
Contributor

gschorcht commented Dec 6, 2019

@kaspar030 Problem solved. I had to place the code of VFS module in IRAM because the cache for program code in SPI flash memory is disabled during SPI flash write operations. PR #12890 fixes the problem.

I have also reduced the test timeouts for tests/pkg_spiffs and pkg_littlefs.

@kaspar030
Copy link
Contributor Author

@kaspar030 Problem solved.

Nice catch! I've rebased with #12890 merged.

@gschorcht
Copy link
Contributor

Wow, all light are green. Do we wanna to merge it now?

@kaspar030
Copy link
Contributor Author

Wow, all light are green. Do we wanna to merge it now?

Yeah! I'll let maintainers@ know to watch out for esp32 CI results for a while.

@kaspar030 kaspar030 merged commit 220d642 into RIOT-OS:master Dec 6, 2019
@kaspar030 kaspar030 deleted the murdock_add_esp32-wroom-32 branch December 6, 2019 12:29
@kaspar030
Copy link
Contributor Author

Good job @gschorcht!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: CI Area: Continuous Integration of RIOT components CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR CI: run tests If set, CI server will run tests on hardware for the labeled PR Platform: ESP Platform: This PR/issue effects ESP-based platforms
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants