Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pgm_read_byte() does not work well without delay() #3140

Closed
milkpirate opened this issue Apr 15, 2017 · 44 comments
Closed

pgm_read_byte() does not work well without delay() #3140

milkpirate opened this issue Apr 15, 2017 · 44 comments

Comments

@milkpirate
Copy link

milkpirate commented Apr 15, 2017

I have the following code:

PROGMEM const uint8_t table[2][3] = {
  {0x7B,0x60,0x5D},
  {0x7D,0x48,0x3E}
};

void setup() {
  Serial.begin(115200);
}

void loop() {
  uint8_t data = pgm_read_byte(&table[1][1]);
  Serial.println(data, HEX);
  delay(250);
}

which results in:

Exception (3):
epc1=0x40201c0e epc2=0x00000000 epc3=0x00000000 excvaddr=0x4022e71c depc=0x00000000

ctx: cont
sp: 3ffef200 end: 3ffef3c0 offset: 01a0

>>>stack>>>
3ffef3a0:  feefeffe feefeffe feefeffe 40202098
3ffef3b0:  feefeffe feefeffe 3ffee3a0 40100114
<<<stack<<<

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v09f0c112

But swapping Serial.println(data, HEX); and delay(250); then works. Why?

EDIT:
Shorter delays make it work as well even a "not existing" one:

PROGMEM const uint8_t table[2][3] = {
  {0x7B,0x60,0x5D},
  {0x7D,0x48,0x3E}
};

void setup() {
  Serial.begin(115200);
}

void loop() {
  uint8_t data = pgm_read_byte(&table[1][1]);
  delayMicroseconds(0);
  Serial.println(data, HEX);
  delay(250);
}
@milkpirate milkpirate changed the title pgm_read_byte() does not work well with delay() pgm_read_byte() does not work well without delay() Apr 20, 2017
@igrr
Copy link
Member

igrr commented Apr 20, 2017

Could you please upload the elf file for the code which demonstrates the crash? That would really help.

@milkpirate
Copy link
Author

Elf file (1.8.0): http://www21.zippyshare.com/v/Vhm4iyHc/file.html (sorry could not find the option to attach files).

@igrr
Copy link
Member

igrr commented Apr 21, 2017

OK, thanks. The compiler sees that offset in the array is constant; so it can calculate everything that happens in the pgm_read_byte macro at compile time. It sees that in this particular case, shifting and masking is not needed (because the byte you are loading happens to be 4-byte aligned) and it optimizes away all the code inside pgm_read_byte, replacing it with a single l8ui instruction. Which obviously doesn't work because the cache port supports only 32-bit access.

I guess we can hard-code the load/shift/mask part inside an __volatile__ assembly block so that this optimization doesn't happen.

/cc @Makuna

@milkpirate
Copy link
Author

Ahhh, OK perfect! Could you also explain why the delayMicroseconds(0); helps? Would attributing the pgm_read_byte function itself as __volatile__ suffice? As far as I know, writing correct inline assembly is pretty hard (NOT meaning you're not able to).

@Makuna
Copy link
Collaborator

Makuna commented Apr 21, 2017

Would also decorating the code with a #pragma that stops optimizations work?
Could you modify this line in your sketch and try it?

// uint8_t data = pgm_read_byte(&table[1][1]); // old 
uint8_t data = pgm_read_byte(&(table[1][1])); // new, take the address of the specific table entry 

@milkpirate
Copy link
Author

Ok also tried pgm_read_byte(&(table[1][1])); (is C++ that picky?) no change. Even tried pgm_read_byte(&(1[table][1])); still no luck :P

@Makuna
Copy link
Collaborator

Makuna commented Apr 22, 2017

replace pgm_read_byte with...

#define pgm_read_byte(addr) 		                                           \
(__extension__({                                                               \
    PGM_P __local = (PGM_P)(addr);  /* isolate varible for macro expansion */         \
    ptrdiff_t __offset = ((uint32_t)__local & 0x00000003); /* byte aligned mask */            \
    const uint32_t* __addr32 = (const uint32_t*)((const uint8_t*)(__local)-__offset); \
    uint32_t __data32 = *__addr32; /* cache data locally as 32bits */ \
    uint8_t __result = (__data32 >> (__offset * 8));                        \
    __result;                                                                  \
}))

I will do some testing tomorrow

@igrr
Copy link
Member

igrr commented Apr 22, 2017

@Makuna, I think what you are trying to do is to trick the compiler while speaking C. I think this won't work for the following reason. What the compiler is doing is fully valid given the architectural model it has. The backend is not aware that there are memory regions from which byte loads are not allowed. The only way to express this (i think) is to write this part in assembly. Every C expression you write which can be evaluated at compile time can be simplified by the compiler to a byte load.
The only other way is to not let the compiler do optimization based on known arguments, that is, turn pgm_read_byte into a (non-inline) function.

@Makuna
Copy link
Collaborator

Makuna commented Apr 22, 2017

@igrr Actually it would be a bug if the compiler disregards sizes if defined; the original was written that the size is ambiguous giving the compiler the option to do the wrong thing. I want to try one other option, which is to just insert a cast. These aren't tricks, they are considered hints. Most developers I know would consider reverting to assembler to get the compiler to do the right thing as the last option to try.

@Makuna
Copy link
Collaborator

Makuna commented Apr 22, 2017

It seems that GCC requires the volatile keyword to force the hints and data size is not enough. There are many references to this when you search online about GCC over optimizing in similar cases across other domains.

@devyte
Copy link
Collaborator

devyte commented Apr 22, 2017 via email

@Makuna
Copy link
Collaborator

Makuna commented Apr 22, 2017

@devyte I don't understand the second question, as the change is to define one as volatile. could you be more specific?

I don't remember the reason other than following the Arduino originals as close as possible (they were also macros). It was almost two years ago when I wrote those ;-)

Do note, the inline prefix does not guarantee to make it inline, its just a hint to the compiler that they can be.

@devyte
Copy link
Collaborator

devyte commented Apr 23, 2017 via email

@igrr
Copy link
Member

igrr commented Apr 23, 2017

Using an intermediate volatile variable would indeed work; performance hit would probably be comparable to the use of an inline assembly block. I don't mind either solution.

Regarding "should be portable across architectures"... we are dealing with a very architecture specific thing (reading from flash as it was RAM). Even for an xtensa-based ESP32 we do a different thing. So I would place portability as the last item on the list of possible requirements.

@Makuna
Copy link
Collaborator

Makuna commented Apr 23, 2017

Did either of you note the reference to the pull request?
I did not implement it as assembly, I just use the volatile variable; my confusion to your question was due to assuming you looked at the changes and were referencing them.

@igrr
Copy link
Member

igrr commented Apr 23, 2017

I did :)

Just for comparison, can you check what code is generated for two cases:

  • load from a location known at compile time
  • load from a location determined at run time

@milkpirate
Copy link
Author

milkpirate commented Apr 23, 2017

I guess using volatile would be best. That tells the compiler what to consider and still leave all the optimization up to it. Inline assembly would force you to to what the compiler might do better anyway and as i said writing good online assembly is hard.

EDIT: I tried the following:

 PROGMEM const uint8_t table[2][3] = {
  {0x7B,0x60,0x5D},
  {0x7D,0x48,0x3E}
};

#undef pgm_read_byte                       // not sure how to correctly undefine this macro
#undef pgm_read_byte(addr)
#define pgm_read_byte(addr)                                                \
(__extension__({                                                               \
    PGM_P __local = (PGM_P)(addr);  /* isolate varible for macro expansion */         \
    ptrdiff_t __offset = ((uint32_t)__local & 0x00000003); /* byte aligned mask */            \
    const uint32_t* __addr32 = (const uint32_t*)((const uint8_t*)(__local)-__offset); \
    uint32_t __data32 = *__addr32; /* cache data locally as 32bits */ \
    uint8_t __result = (__data32 >> (__offset * 8));                        \
    __result;                                                                  \
}))

void setup() {
  Serial.begin(115200);
}

void loop() {
  uint8_t data = pgm_read_byte(&table[1][1]);
  Serial.println(data, HEX);
  delay(250);
}

Same behavior as described above.

@milkpirate
Copy link
Author

Ok

PROGMEM const uint8_t table[] = {
  0x7B, 0x60, 0x5D, 0x7D, 0x48, 0x3E, 0x66, 0xC8, 0x3F
};

#undef pgm_read_byte(addr)
#define pgm_read_byte(addr)                                                \
(__extension__({                                                               \
    PGM_P __local = (PGM_P)(addr);  /* isolate varible for macro expansion */         \
    ptrdiff_t __offset = ((uint32_t)__local & 0x00000003); /* byte aligned mask */            \
    const uint32_t* __addr32 = (const uint32_t*)((const uint8_t*)(__local)-__offset); \
    volatile uint8_t __result = ((*__addr32) >> (__offset * 8));                        \
    __result;                                                                  \
}))

void setup() {
  Serial.begin(115200);
}

void loop() {
  for(uint8_t i=0; i<sizeof(table); i++)
  {
    uint8_t data = pgm_read_byte(&table[i]);
    Serial.println(data, HEX);
  }
  delay(250);
}

Works as desired!

@Makuna
Copy link
Collaborator

Makuna commented Apr 25, 2017

Glad to see you confirmed it. I ran it through a series of sketch's that test it.
I had some sketches that I tested the initial work with (they passed back then) that I ran with the current release and they broke in the same way your sketch did. I guess the compiler or an option got changed that made this optimization happen sometime after the initial work. They passed with the pull I put out.

I suspect we should leave this open until the pull request gets merged in.

@milkpirate milkpirate reopened this Apr 25, 2017
@igrr
Copy link
Member

igrr commented Apr 25, 2017

@Makuna poke regarding #3140 (comment) (i.e. it would be great if you attach the generated assembly code to the PR; so that we can compare with the hand-coded assembly approach).

@milkpirate
Copy link
Author

So the latest code I posted works but I guess it is not because of the volatile at that place. Since the following does not:

PROGMEM const uint8_t table[2][3] = {
  {0x7B,0x60,0x5D},
  {0x7D,0x48,0x3E}
};

#undef pgm_read_byte(addr)
#define pgm_read_byte(addr)                                                \
(__extension__({                                                               \
    PGM_P __local = (PGM_P)(addr);  /* isolate varible for macro expansion */         \
    ptrdiff_t __offset = ((uint32_t)__local & 0x00000003); /* byte aligned mask */            \
    const uint32_t* __addr32 = (const uint32_t*)((const uint8_t*)(__local)-__offset); \
    volatile uint8_t __result = ((*__addr32) >> (__offset * 8));                        \
    __result;                                                                  \
}))

void setup() {
  Serial.begin(115200);
}

void loop() {
  uint8_t data = pgm_read_byte(&table[1][1]);
  Serial.println(data, HEX);
  delay(250);
}

But turning

    ptrdiff_t __offset = ((uint32_t)__local & 0x00000003); /* byte aligned mask */            \
    const uint32_t* __addr32 = (const uint32_t*)((const uint8_t*)(__local)-__offset); \
    volatile uint8_t __result = ((*__addr32) >> (__offset * 8));                        \

into

    volatile ptrdiff_t __offset = ((uint32_t)__local & 0x00000003); /* byte aligned mask */            \
    const uint32_t* __addr32 = (const uint32_t*)((const uint8_t*)(__local)-__offset); \
    uint8_t __result = ((*__addr32) >> (__offset * 8));                        \

makes it work. Puhh I hope I am not annoying someone with my noobish posts...

@Makuna
Copy link
Collaborator

Makuna commented Apr 25, 2017

@igrr If you wish to give me instructions on how to get that, I can run it and provide it. I always forget the details and have only used it once almost two years ago.

@Makuna
Copy link
Collaborator

Makuna commented Apr 25, 2017

@milkpirate You are placing the volatile in the wrong location. The issue is that the compiler is optimization the statement ((*__addr32) >> (__offset * 8)), so the *_addr32 must be pulled out and made volatile local variable to force it to not optimize the fetch to just an 8 bit read as flash will only allow a 32bit read.

@milkpirate
Copy link
Author

milkpirate commented Apr 25, 2017

@Makuna yeah I know. It dosent make sense to me either but thats how it works...
@igrr Here are some elfs for you:

PROGMEM const uint8_t table[2][3] = {
  {0x7B,0x60,0x5D},
  {0x7D,0x48,0x3E}
};

X

void setup() {
  Serial.begin(115200);
}

void loop() {
  uint8_t data = pgm_read_byte(&table[1][1]);
  Serial.println(data, HEX);
  delay(250);
}

X = nothing: http://www109.zippyshare.com/v/RE2Wp929/file.html (not working)
X = http://www109.zippyshare.com/v/MaNDLU5E/file.html (not working)

#undef pgm_read_byte(addr)
#define pgm_read_byte(addr)                                                \
(__extension__({                                                               \
    PGM_P __local = (PGM_P)(addr);  /* isolate varible for macro expansion */         \
    ptrdiff_t __offset = ((uint32_t)__local & 0x00000003); /* byte aligned mask */            \
    const uint32_t* __addr32 = (const uint32_t*)((const uint8_t*)(__local)-__offset); \
    volatile uint8_t __result = ((*__addr32) >> (__offset * 8));                        \
    __result;                                                                  \
}))

X = http://www109.zippyshare.com/v/IvfOo3cB/file.html (working)

#undef pgm_read_byte(addr)
#define pgm_read_byte(addr)                                                \
(__extension__({                                                               \
    PGM_P __local = (PGM_P)(addr);  /* isolate varible for macro expansion */         \
    volatile ptrdiff_t __offset = ((uint32_t)__local & 0x00000003); /* byte aligned mask */            \
    const uint32_t* __addr32 = (const uint32_t*)((const uint8_t*)(__local)-__offset); \
    uint8_t __result = ((*__addr32) >> (__offset * 8));                        \
    __result;                                                                  \
}))

@milkpirate
Copy link
Author

milkpirate commented Apr 26, 2017

Similar behavior with pgm_read_word(). Following crashs as well. Adding volatile in front of ptrdiff_t ... seems to fix this as well. Still dont know why.

 PROGMEM const uint16_t table[2][3] = {
  {0x7B,0x60,0x5D},
  {0x7D,0xF548,0x3E}
};

void setup() {
  Serial.begin(115200);
}

void loop() {
  uint16_t data = pgm_read_word(&table[1][1]);
  Serial.println(data, HEX);
  delay(250);
}

@milkpirate
Copy link
Author

Hey. Just had a look at the release note of the espressif ESP8266_NONOS_SDK v2.1.0. It states:

  1. Fix NMI handle crash on unaligned memory access, BBP309;

Could that relate to the problems above?

@igrr
Copy link
Member

igrr commented May 10, 2017

No, the problems above were fully addressed in this issue. They stem from the fact that our previous code for reading PROGMEM would in some cases compile to an 8- or 16-bit load instruction, which caused bus fault when reading from cache (which only supports 32-bit loads). Note that this is not an alignment issue — reads were aligned, just the read size was wrong.

The issue in SDK release notes is about a bug in the NMI interrupt vector, which was related to unaligned memory access.

@igrr
Copy link
Member

igrr commented May 10, 2017

Will keep this open for reference, until the release.

@igrr igrr reopened this May 10, 2017
@igrr igrr added this to the 2.4.0 milestone May 10, 2017
@lucysrausch lucysrausch mentioned this issue May 23, 2017
11 tasks
@wkraft-fablabka
Copy link
Contributor

It looks like, that with the current GITHUB-Version of ESP8266 core for Arduino
pgm_read_byte
ist now broken.

I have an application based on https://github.com/markruys/arduino-Max72xxPanel, which itself makes use of Adafruits GFX-library (https://github.com/adafruit/Adafruit-GFX-Library).

You can see the problems by compiling the Ticker-Example from the max72xx-library, after changing pinCS to 12.

The sketch crashes inside the GFX-library, where letters from a font are send to the panel.

The font itself is defined:

static const unsigned char font[] PROGMEM = {
        0x00, 0x00, 0x00, 0x00, 0x00,
	0x3E, 0x5B, 0x4F, 0x5B, 0x3E,
        //... 

The crash occurs in Adafruit_GFX::drawChar:

uint8_t line = pgm_read_byte(&font[c * 5 + i]);
// c is unsigned char, i is int8_t

I tried to cast the result of c*5+i to uint16_t or uint32_t, without success.

The library worked with the previous implementation of pgm_read_byte() like a charm.

@wkraft-fablabka
Copy link
Contributor

This is a reduced example, which shows the crash:

#include <SPI.h>
#include <Adafruit_GFX.h>
#include <Max72xxPanel.h>  // https://github.com/markruys/arduino-Max72xxPanel.git

Max72xxPanel matrix = Max72xxPanel(12, 1, 1);

void setup() {
  matrix.setIntensity(7); // Use a value between 0 and 15 for brightness
  Serial.begin(115200);
}

void loop() {
    Serial.println("Before drawChar CRASH!");
    matrix.drawChar(0, 1, 65, HIGH, LOW, 1); // <<< === Crash
    Serial.println("After drawChar, Hooray!");
    delay(10000);  // for test
 
}


/*  Those are the parts of the library, where pgm_read_byte is crashing!
 *   The old variant without assembly works fine!
  void Adafruit_GFX::drawChar(int16_t x, int16_t y, unsigned char c,  uint16_t color, uint16_t bg, uint8_t size) {
        ...
        for(int8_t i=0; i<5; i++ ) { // Char bitmap = 5 columns
            uint8_t line = pgm_read_byte(&font[c * 5 + i]);
        }
        ...        
  }

  static const unsigned char font[] PROGMEM = {
    0x00, 0x00, 0x00, 0x00, 0x00,
    0x3E, 0x5B, 0x4F, 0x5B, 0x3E,
    0x3E, 0x6B, 0x4F, 0x6B, 0x3E,
    ...
    0x00, 0x00, 0x00, 0x00, 0x00  // #255 NBSP
  };
 */

@wkraft-fablabka
Copy link
Contributor

Identified the problem:
After that change:
0b67266
pgm_read_byte and pgm_read_word are no longer a #define but a function!

That kills different libraries, which are assuming, that those two items are #defines, like in the original AVR-Arduino world. Here those libraries make shure, that after including a pgmspace.h-file, those items are really defined. If not, they redefine it.

// Many (but maybe not all) non-AVR board installs define macros
// for compatibility with existing PROGMEM-reading AVR code.
// Do our own checks and defines here for good measure...

#ifndef pgm_read_byte
 #define pgm_read_byte(addr) (*(const unsigned char *)(addr))
#endif
#ifndef pgm_read_word
 #define pgm_read_word(addr) (*(const unsigned short *)(addr))
#endif

Its now possible, to change those third party libraries with some elif statements, but as those are more and several people will struggle after updating the ESP8266-runtime, it should be changed here to be compatible to the previous world.

@wkraft-fablabka
Copy link
Contributor

Created a pull request, fixing this issue

@devyte
Copy link
Collaborator

devyte commented Jun 1, 2017 via email

@d-a-v
Copy link
Collaborator

d-a-v commented Jun 1, 2017

what about a

#define pgm_read_byte(addr) pgm_read_byte_inlined(addr)

with

static inline __uint8_t pgm_read_byte_inlined(const void* addr)  __attribute__((always_inline));

That way the define would exist and would end up in an inlined function (if the inline function is considered better).

@igrr
Copy link
Member

igrr commented Jun 2, 2017

Agree with @d-a-v suggestion except the 'always inline' part. That's a considerable chunk of code, I think we should let the compiler decide whether to inline it or not in each particular case.

The other thing is that the variant proposed in the PR would fail to do truncation to target type if one writes int x = pgm_read_byte(address).

@devyte
Copy link
Collaborator

devyte commented Jun 2, 2017

I agree with @igrr.

@wkraft-fablabka
Copy link
Contributor

agree with those arguments!
No problem at all with inlined functions, as long, as they are "hidden" behind a #define.

One more question to raise:
Comparing my code to the original, I have a "leftover" in the assembler code from my searches for the problem.
My first idea was, that a addressing related issue occurs in assembler code, as in my situation the variables leading to the crash were distributed over different libraries. So I digged into the Xtensa Instruction Set and changed L32I.n to _L32I.
This because I found for L32I.n:

L32I.N is similar to L32I, but has a 16-bit encoding and supports a smaller range of
offset values encoded in the instruction word.

and for L32I I found:

The assembler may convert L32I instructions to L32I.N when the Code Density
Option is enabled and the immediate operand falls within the available range. Prefixing
the L32I instruction with an underscore (_L32I) disables this optimization and forces
the assembler to generate the wide form of the instruction.

I don't know, if that is important.

@igrr
Copy link
Member

igrr commented Jun 2, 2017

Since the immediate is equal to zero, l32i.n can be used. It could have been written as l32i as well. Not much difference here.

@wkraft-fablabka
Copy link
Contributor

Thanks, good to know. That was just one idea, when I searched for the reasons, why my application is now suddenly crashing.

wkraft-fablabka added a commit to wkraft-fablabka/Arduino that referenced this issue Jun 2, 2017
pgm_read_byte() and pgm_read_word() need to be macros, as some third party libraries and sketches testing macro existence.
igrr pushed a commit that referenced this issue Jun 5, 2017
pgm_read_byte() and pgm_read_word() need to be macros, as some third party libraries and sketches testing macro existence.
@igrr igrr closed this as completed Jan 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants