Fix random crashing of ClientContext::write(Stream) and write_P(PGM_P buf, size_t size) (#2504) #4530

mongozmaki · 2018-03-17T10:19:01Z

I experienced random crashes when sending large PROGMEM files with ESP8266WebServer.send_P over the ESP8266 WiFi Access Point.

This affects any call to WiFiClient::write(Stream) and WiFiClient::write_P(...) if the underlying tcp_write returns an error (in my case ERR_MEM, "come back later").

Might fix issue #2504 and maybe others.

The bug originates in BufferedStreamDataSource (DataSource.h) (used in ClientContext::_write_some()).

In ClientContext::_write_some:
If tcp_write(...) returns an error (e.g. ERR_MEM) (File: ClientContext.h:446), then DataSource::release_buffer is not called (which is fine).
However, if a BufferedStreamDataSource is used, the stream data was already read from the stream by DataSource::get_buffer(next_chunk) (File: ClientContext.h:441). The next time DataSource::get_buffer is called, data is read from the stream again (wrong data, because the stream has already advanced).
So from this point on, the stream reports fewer bytes left than the DataSource.
That eventually causes the assertion assert(cb == size) in BufferedStreamDataSource::get_buffer to fail.

The solution is to remember the stream position and recognise if DataSource::get_buffer gets called multiple times without the corresponding release.
If the stream data was already read earlier, data isn’t read from the stream again.
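
The desynchronisation described above can be reproduced off-device with a small model. This is only a sketch under stated assumptions, not the code from DataSource.h: CountingStream, NaiveStreamSource and transfer_desyncs are hypothetical names, and a nullptr return stands in for the failing assert(cb == size).

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical stand-in for an Arduino Stream: yields bytes 0, 1, 2, ... once.
class CountingStream {
public:
    explicit CountingStream(size_t size) : size_(size) {}
    size_t available() const { return size_ - pos_; }
    size_t readBytes(uint8_t* dst, size_t n) {
        const size_t cb = n < available() ? n : available();
        for (size_t i = 0; i < cb; ++i) dst[i] = static_cast<uint8_t>(pos_ + i);
        pos_ += cb;
        return cb;
    }
private:
    size_t size_;
    size_t pos_ = 0;
};

// Simplified model of the pre-fix behaviour: every get_buffer() pulls fresh
// bytes from the stream, even if the previous buffer was never released.
class NaiveStreamSource {
public:
    NaiveStreamSource(CountingStream& s, size_t size) : stream_(s), size_(size) {}
    size_t available() const { return size_ - pos_; }
    // Returns nullptr where the real code would hit assert(cb == size).
    const uint8_t* get_buffer(size_t size) {
        buf_.resize(size);
        const size_t cb = stream_.readBytes(buf_.data(), size);
        return cb == size ? buf_.data() : nullptr;
    }
    void release_buffer(size_t size) { pos_ += size; }
private:
    CountingStream& stream_;
    size_t size_;
    size_t pos_ = 0;
    std::vector<uint8_t> buf_;
};

// Sends `total` bytes in `chunk`-sized pieces. The simulated write with call
// index `fail_at` fails once (pass -1 for no failure). Returns true if the
// stream ran dry before the data source did, i.e. the two went out of sync.
bool transfer_desyncs(size_t total, size_t chunk, int fail_at) {
    assert(chunk > 0);
    CountingStream stream(total);
    NaiveStreamSource src(stream, total);
    int call = 0;
    while (src.available() > 0) {
        const size_t n = chunk < src.available() ? chunk : src.available();
        if (src.get_buffer(n) == nullptr) return true;  // stream exhausted early
        const bool write_ok = (call++ != fail_at);      // simulated tcp_write result
        if (write_ok) src.release_buffer(n);            // skipped on error
    }
    return false;
}
```

A single failed write is enough: the chunk read before the failure is lost, and the last get_buffer() finds the stream empty while the data source still reports bytes available.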

To reproduce, edit ClientContext::_write_some

…
bool need_output = false;
int rand_err = 0; // <-- Artificial "random" error count
while( will_send && _datasource) {
    size_t next_chunk =
        will_send > _write_chunk_size ? _write_chunk_size : will_send;
    const uint8_t* buf = _datasource->get_buffer(next_chunk);
    if (state() == CLOSED) {
        need_output = false;
        break;
    }
    //err_t err = tcp_write(_pcb, buf, next_chunk, TCP_WRITE_FLAG_COPY); // <-- Original
    //Simulate err == -1 every 4th time
    err_t err;
    ++rand_err;
    if (rand_err % 4 == 0) {
        err = -1;
    }
    else {
        err = tcp_write(_pcb, buf, next_chunk, TCP_WRITE_FLAG_COPY);
    }
    DEBUGV(":wrc %d %d %d\r\n", next_chunk, will_send, (int) err);
    ...
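
The counter-based error injection above can also be kept in a small helper so the send loop itself stays readable. This is a sketch only; EveryNthFailure is a hypothetical name and not part of the ESP8266 core.

```cpp
#include <cassert>

// Hypothetical helper: reports a failure on every n-th call (n, 2n, 3n, ...).
class EveryNthFailure {
public:
    explicit EveryNthFailure(int n) : n_(n) { assert(n > 0); }
    bool should_fail() { return ++calls_ % n_ == 0; }
private:
    int n_;
    int calls_ = 0;
};
```

Inside the loop, the result of should_fail() would decide whether err is set to -1 or taken from the real tcp_write call, exactly as the rand_err counter does above.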

d-a-v

Looks good !

devyte · 2018-03-18T03:55:05Z

I understand the explanation and the code changes, but I haven't looked at the calling code, so I'm missing a bit of context. In any case, this looks sane to me. My one question: when they get out of sync, and before calling get_buffer(), which gets them in sync again, what happens with available()? Is the current code for that correct?

mongozmaki · 2018-03-18T11:56:05Z

You mean DataSource::available()? The behaviour of that didn't change. get_buffer does not alter the available value (which is _size - _pos) because the _pos is only incremented in DataSource::release_buffer().
The current code is correct in my opinion.

release_buffer basically puts them in sync again.

Let's assume get_buffer is called multiple times without calling release_buffer (this can happen in ClientContext::_write_some if tcp_write returns an error):

Note: chunk_size is min(max chunk size,free tcp buffer, data_source available). data_source available does not change if release_buffer is not called. max chunk size is fixed. Only free tcp buffer changes.

1. _datasource->get_buffer(next_chunk); reads the data from the stream into its own buffer and returns a pointer. The internal streamPos is advanced.
2. tcp_write returns an error for some reason (e.g. network busy).
3. release_buffer is skipped (leaving the data source position untouched).
4. Next time _datasource->get_buffer(next_chunk) is called:
4.a. If chunk_size is the same as before, no stream data is read and the same buffer is returned. streamPos does not change.
4.b. If chunk_size is smaller than last time, same as 4.a.
4.c. If chunk_size is greater, only the remaining data is read from the stream and appended to the buffer. streamPos is updated.
5. If tcp_write is finally successful, release_buffer is called and the data source position is advanced (this affects DataSource::available).
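
The steps above can be sketched as a small off-device model. It is an illustration under assumptions, not the merged code: CountingStream and ResumableStreamSource are hypothetical names, the member names pos_ and stream_pos_ only mirror the _pos and _streamPos mentioned in this thread, and release_buffer here insists on a full release.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical stand-in for an Arduino Stream: yields bytes 0, 1, 2, ... once.
class CountingStream {
public:
    explicit CountingStream(size_t size) : size_(size) {}
    size_t available() const { return size_ - pos_; }
    size_t position() const { return pos_; }
    size_t readBytes(uint8_t* dst, size_t n) {
        const size_t cb = n < available() ? n : available();
        for (size_t i = 0; i < cb; ++i) dst[i] = static_cast<uint8_t>(pos_ + i);
        pos_ += cb;
        return cb;
    }
private:
    size_t size_;
    size_t pos_ = 0;
};

class ResumableStreamSource {
public:
    ResumableStreamSource(CountingStream& s, size_t size) : stream_(s), size_(size) {}

    // Unchanged by get_buffer(): only release_buffer() moves pos_.
    size_t available() const { return size_ - pos_; }

    const uint8_t* get_buffer(size_t size) {
        assert(size <= available());
        // Bytes already pulled from the stream but not yet released.
        const size_t buffered = stream_pos_ - pos_;
        if (size > buffered) {
            // Case 4.c: read only the missing tail and append it.
            buf_.resize(size);  // keeps the first `buffered` bytes intact
            const size_t cb = stream_.readBytes(buf_.data() + buffered, size - buffered);
            assert(cb == size - buffered);
            stream_pos_ += cb;
        }
        // Cases 4.a and 4.b: nothing is read, the same buffer is returned.
        return buf_.data();
    }

    // Step 5: only a release moves the data source position.
    void release_buffer(size_t size) {
        assert(size == stream_pos_ - pos_);  // full release only in this sketch
        pos_ += size;
    }

private:
    CountingStream& stream_;
    size_t size_;
    size_t pos_ = 0;         // bytes released so far
    size_t stream_pos_ = 0;  // bytes read from the stream so far
    std::vector<uint8_t> buf_;
};
```

Calling get_buffer(8) twice in a row leaves the stream position at 8, a following get_buffer(12) advances it to 12, and available() stays constant until release_buffer(12) is called.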

However, I just realized that the assert(_pos == _streamPos) in release_buffer is not optimal.
In reality this should not cause a problem, because release_buffer is called with the same size as get_buffer. But to make it more robust, I will add some code so that release_buffer can actually be called with a size less than or equal to the get_buffer size.

mongozmaki · 2018-03-18T13:32:31Z

I added code to allow partial release of the buffer.
Is this change in release_buffer visible in the merge request?

Although this is not used right now, it makes the code more future-proof IMHO. This can be useful if a buffer of a certain size X is requested but, for some reason, fewer bytes are used (e.g. by tcp_write). Then release_buffer can be called with less than X. Unused data already read from the stream is saved and returned by the next get_buffer call.
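
A partial release of this kind can be sketched as follows. This is a simplified model and not the code from the PR: CountingStream and PartialReleaseSource are hypothetical names, and the unused tail is kept by erasing the released prefix from a std::vector, which is one possible choice among several.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical stand-in for an Arduino Stream: yields bytes 0, 1, 2, ... once.
class CountingStream {
public:
    explicit CountingStream(size_t size) : size_(size) {}
    size_t available() const { return size_ - pos_; }
    size_t position() const { return pos_; }
    size_t readBytes(uint8_t* dst, size_t n) {
        const size_t cb = n < available() ? n : available();
        for (size_t i = 0; i < cb; ++i) dst[i] = static_cast<uint8_t>(pos_ + i);
        pos_ += cb;
        return cb;
    }
private:
    size_t size_;
    size_t pos_ = 0;
};

class PartialReleaseSource {
public:
    PartialReleaseSource(CountingStream& s, size_t size) : stream_(s), size_(size) {}
    size_t available() const { return size_ - pos_; }

    const uint8_t* get_buffer(size_t size) {
        assert(size <= available());
        const size_t buffered = stream_pos_ - pos_;  // read but not yet released
        if (size > buffered) {
            buf_.resize(size);  // keeps the saved bytes at the front
            const size_t cb = stream_.readBytes(buf_.data() + buffered, size - buffered);
            assert(cb == size - buffered);
            stream_pos_ += cb;
        }
        return buf_.data();
    }

    // May be called with fewer bytes than were requested from get_buffer().
    void release_buffer(size_t size) {
        assert(size <= stream_pos_ - pos_);
        // Drop the consumed prefix; the unused tail moves to the front and is
        // handed out again by the next get_buffer().
        buf_.erase(buf_.begin(), buf_.begin() + static_cast<ptrdiff_t>(size));
        pos_ += size;
    }

private:
    CountingStream& stream_;
    size_t size_;
    size_t pos_ = 0;
    size_t stream_pos_ = 0;
    std::vector<uint8_t> buf_;
};
```

After get_buffer(8) and release_buffer(5), the three leftover bytes are served first, and a later get_buffer(6) reads only three new bytes from the stream.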

mongozmaki · 2018-03-21T08:11:56Z

I dug further into the issue of tcp_write(...) returning ERR_MEM. It seems that I get this error when I'm running out of heap memory.
Freeing up heap helps prevent this bug from striking. However, my proposed fix is still valid.

d-a-v · 2018-03-21T10:18:40Z

@mongozmaki I once again reviewed your PR and it seems fine to me.
However, it'd be nice if @igrr reviewed it too.

Freeing up heap helps prevent this bug from striking. However, my proposed fix is still valid.

Which bug is it, the same one that made you write this fixing PR or another one ?
Was it the OOM debug option that showed it to you ?

mongozmaki · 2018-03-21T14:21:56Z

I'll try to explain it a bit better.

This PR fixes the problem that occurs if tcp_write(...) in ClientContext::_write_some() returns an error (like ERR_MEM). After such an error, the DataSource and the underlying Stream are no longer at the same position (the stream is too far ahead). My code fixes that out-of-sync problem.

Now to the issue that led me to this bug in the first place (which might or might not be another bug):
My program crashed irregularly while sending large PROGMEM data via the webserver. I found that when tcp_write returns ERR_MEM, the transmission eventually crashes (when the stream reaches its end).

However, the question remains why tcp_write sometimes returns ERR_MEM even if tcp_sndbuf (in ClientContext::_write_some) reports more free memory than what is finally written by tcp_write.

This happened especially when compiling with the "lwip2-Higher Bandwidth" option.
With my fix it didn't crash anymore, but the transmission was much slower (than with the "lwip2-Lower Memory" option) because tcp_write returned ERR_MEM much more often.

I cannot reproduce this other issue reliably for now. All I can guess is that tcp_write returns ERR_MEM more often if free heap is low (which is amplified by the "lwip2-Higher Bandwidth" option because of larger buffers in memory).

I will investigate further and hopefully reproduce this second issue in sample code.

The debug message to check out is the line DEBUGV(":wrc %d %d %d\r\n", next_chunk, will_send, (int) err); in ClientContext::_write_some, especially if err == ERR_MEM. WiFi debug needs to be enabled.

mongozmaki · 2018-03-21T14:29:04Z

In the end, this second issue could also just be an out-of-memory problem.

earlephilhower · 2018-03-21T14:45:52Z

I've been peripherally following this and it looks like a good fix, thanks @mongozmaki .

If you're getting ERR_MEM, it could be that, even though you have enough total free space in the heap, the UMM allocator can't find a large enough contiguous block to satisfy the request. That is fragmentation, which can hurt on a system with no MMU and a very small shared heap used by everything.

So if ERR_MEM is returned only when TCP is doing a malloc()/realloc()/calloc() that returns NULL, I'm not sure it's a bug so much as a "feature." The upper wrapper and app layers would need to handle it appropriately.
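
The difference between total free space and the largest contiguous block can be illustrated with a toy model. This is only a sketch with made-up block sizes; HeapSnapshot is a hypothetical type, it is not how the UMM allocator is implemented, and allocator bookkeeping overhead is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <stddef.h>
#include <vector>

// Hypothetical snapshot of a heap: the sizes of its free blocks, in bytes.
struct HeapSnapshot {
    std::vector<size_t> free_blocks;

    size_t total_free() const {
        return std::accumulate(free_blocks.begin(), free_blocks.end(), size_t{0});
    }

    size_t largest_free() const {
        return free_blocks.empty()
                   ? 0
                   : *std::max_element(free_blocks.begin(), free_blocks.end());
    }

    // An allocation needs one contiguous block, so only the largest block
    // matters, not the sum of all free blocks.
    bool can_allocate(size_t n) const { return n <= largest_free(); }
};
```

With free blocks of 1024, 512, 1200 and 800 bytes, 3536 bytes are free in total, yet a request for 1568 bytes cannot be satisfied.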

mongozmaki · 2018-03-21T15:29:05Z

Ok great!
Yes, I guess memory is the problem here.
If I find some other reason, I'll let you know.
Thanks!

d-a-v · 2018-03-21T15:56:42Z

The explanation for lwIP's ERR_MEM: link.

As the name ERR_MEM is misleading (it is totally unrelated to low heap), maybe you can add the explanation to your PR for further reference?

ERR_MEM is to lwIP what errno == EAGAIN is to libc in O_NONBLOCK mode.

Whatever error lwIP's tcp_write() is returning (fatal or ERR_MEM/EAGAIN), _write_some returns and its caller can check the session(pcb)'s state, ie. will close if fatal, or try again until ~~space is released by PHY effectively sending packets~~ already tcp_written data are sent, received by peer and acked back.
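
The "try again later" handling described above can be sketched with a fake send buffer. Everything here is an assumption made for illustration: FakeSendBuffer, send_all, kErrOk and kErrMem are hypothetical names (the values 0 and -1 mirror lwIP's ERR_OK and ERR_MEM), and acknowledgements are simulated by a direct call instead of arriving from the peer.

```cpp
#include <cassert>
#include <stddef.h>

constexpr int kErrOk = 0;    // mirrors lwIP's ERR_OK
constexpr int kErrMem = -1;  // mirrors lwIP's ERR_MEM ("come back later")

// Hypothetical fake of a TCP send buffer: accepts data while space is left
// and frees the space again once the peer has acknowledged it.
class FakeSendBuffer {
public:
    explicit FakeSendBuffer(size_t capacity) : capacity_(capacity) {}
    size_t capacity() const { return capacity_; }
    int write(size_t n) {
        if (used_ + n > capacity_) return kErrMem;
        used_ += n;
        return kErrOk;
    }
    void ack_all() { used_ = 0; }  // simulated: everything in flight was acked
private:
    size_t capacity_;
    size_t used_ = 0;
};

// Sends `total` bytes in `chunk`-sized pieces. On kErrMem the same chunk is
// retried after the buffer has drained; it is never skipped.
// Stores the number of bytes sent in *sent_out and returns the retry count.
int send_all(FakeSendBuffer& tcp, size_t total, size_t chunk, size_t* sent_out) {
    assert(chunk > 0 && chunk <= tcp.capacity());  // otherwise a retry can never succeed
    size_t sent = 0;
    int retries = 0;
    while (sent < total) {
        const size_t n = chunk < total - sent ? chunk : total - sent;
        if (tcp.write(n) == kErrOk) {
            sent += n;      // only now may the data source advance
        } else {
            ++retries;
            tcp.ack_all();  // in real code: return and wait for the ack callback
        }
    }
    *sent_out = sent;
    return retries;
}
```

The point of the sketch is that a temporary error leaves the sender's position untouched, which is the property the data source has to preserve as well.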

d-a-v · 2018-03-21T16:04:47Z

@mongozmaki
While dealing with this issue, you should enable the OOM debug option in tools menu to be sure you are not running out of heap.

mongozmaki · 2018-03-21T23:55:44Z

@d-a-v
I'm not sure what you mean by adding an explanation. I refined the comments a bit.

mongozmaki · 2018-03-22T00:26:11Z

@d-a-v
The OOM Debug option did not really help in this case. Getting :oom(1568)@? outputs.

I created test code for the PR bugfix. It provokes ERR_MEM errors and causes the ESP to crash or omit some data.

It basically streams a large HTML file which - after loading - checks itself for completeness.
Each time before the file is requested, the heap is cut in half.

Start the code and open the ESP IP address in a browser. After some loading (512kB) it should display Payload complete!.
Refresh the page 2-4 times and the ESP should crash or the webpage will display something like Payload INCOMPLETE!!! Only 524032 of 524288 chars!!!
Watch the Serial output also.

For better debugging of tcp_write errors, you can add a debug message in ClientContext::_write_some if an error occurs:

...
if (err == ERR_OK) {
   ...
} else {
   //HERE FOR EXAMPLE
   Serial.printf("err: %d, left: %d, can_send: %d, will_send: %d, next_chunk %d, written: %d\n", (int)err, left, can_send, will_send, next_chunk, _written);
   break;
}
 ...

After applying my PR, the HTML page should always report complete. The ESP might crash eventually because it is running out of memory.

mongozmaki · 2018-03-22T00:27:03Z

#include <ESP8266WiFi.h>
#include <ESP8266WebServer.h>
#include <Stream.h>

#include <list>

const char* ssid = "..............";
const char* password = "...............";

ESP8266WebServer server(80);
std::list < uint8_t *> mem_eaten;


class LargeFakeStream : public Stream {

public:
    LargeFakeStream(size_t size) : pos_(0), fill_size_(size) {

        html_prologue_ = R"foo(
<html>
<head>
<script>
function check() {
let s = document.getElementById('payload').innerHTML;
let len_exp = )foo" + String(fill_size_) + R"foo(;
document.getElementById('info').innerHTML = (s.length==len_exp)?'Payload complete!':('Payload INCOMPLETE!!! Only '+s.length+' of '+len_exp+' chars!!!');
}
</script>
</head>
<body onload='check()'>
<p id='info'>Loading...</p>
<textarea rows='25' cols='256' id='payload'>)foo";

        html_epilogue_ = R"foo(</textarea>
</body>
</html>)foo";


        epi_start_ = html_prologue_.length() + fill_size_;
    }

    size_t size() const {
        return html_prologue_.length() + fill_size_ + html_epilogue_.length();
    }

    size_t write(uint8_t s) override {
        // No-op: this stream is read-only, so no byte is ever written.
        (void)s;
        return 0;
    }

    int available() override {
        return size() - pos_;
    }

    int peek() override {
        if (pos_ < html_prologue_.length()) {
            return html_prologue_.charAt(pos_);
        }
        if (pos_ >= epi_start_) {
            return html_epilogue_.charAt(pos_ - epi_start_);
        }
        return 'a';
    }

    int read() override {
        int c = peek();
        ++pos_;
        return c;
    }

private:
    size_t pos_;
    size_t fill_size_;

    String html_prologue_;
    String html_epilogue_;
    size_t epi_start_;
};


void handleLargeStream() {
    Serial.println("\nHandle large stream...");
    const size_t free_heap = ESP.getFreeHeap();
    Serial.printf("  Free heap before: %d bytes\n", free_heap);
    
    const  size_t eat_portion = free_heap / 2;
    Serial.printf("  Eating %d bytes...", eat_portion);
    mem_eaten.push_back(new uint8_t[eat_portion]);
    Serial.println("DONE");

    Serial.printf("  Free heap now: %d bytes\n", ESP.getFreeHeap());

    Serial.print("  Transfer large file...");
    LargeFakeStream ts(512 * 1024);
   

    ulong t1 = millis();
    server.setContentLength(ts.size());
    server.send(200, "text/html", "");
    server.client().write(ts);

    ulong dt = millis() - t1;
    Serial.printf("took %d ms at %.2f kb/sec.\n", dt, (ts.size() / 1024.0f) / (0.001f*dt));
}


void setup(void) {
    Serial.begin(115200);

    WiFi.mode(WIFI_STA);
    WiFi.begin(ssid, password);
    Serial.println("");

    // Wait for connection
    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }
    Serial.println("");
    Serial.print("Connected to ");
    Serial.println(ssid);
    Serial.print("IP address: ");
    Serial.println(WiFi.localIP());

 
    server.onNotFound(handleLargeStream);

    server.begin();
    Serial.println("HTTP server started");
}


void loop(void) {
    server.handleClient();
    delay(0);
}

d-a-v · 2018-03-22T01:03:20Z

@mongozmaki Sorry for the misunderstanding. I did not mean that you should add more comments to your PR, which already looks fine.

About OOM, the message you get :oom(nnn)@? means that a m/re/calloc(nnn) returned NULL at some point ('?' indicates an unknown caller, libc or binary firmware). So you are indeed running out of heap.

Give us some more time to check again, as this is an update to an important part of the WiFi layer.

earlephilhower

Went through the code and it looks good. Only potential issue might be checking for new[] error when making a larger buffer (since no exceptions on ESP8266), but honestly having it crash sooner, here, may be easier to debug than passing up a NULL and waiting for the main app to use it w/o checking.

… buf, size_t size) (esp8266#2504) (esp8266#4530) * Fix random crashing of ClientContext::write(Stream) and write_P(PGM_P buf, size_t size) (esp8266#2504) * - Allow partial buffer release * - Refined comments (cherry picked from commit 3267443)

Fix random crashing of ClientContext::write(Stream) and write_P(PGM_P buf, size_t size) (esp8266#2504)

befa1bb

d-a-v approved these changes Mar 17, 2018

d-a-v requested review from earlephilhower, igrr and devyte March 17, 2018 17:55

Merge branch 'master' into bugfix/clientcontext_write

cba8edf

devyte approved these changes Mar 18, 2018

Harald Frostel added 3 commits March 18, 2018 13:39

Merge branch 'master' of https://github.com/esp8266/Arduino into bugfix/clientcontext_write

061ca2d

- Allow partial buffer release

d67641b

Merge branch 'bugfix/clientcontext_write' of https://github.com/mongozmaki/Arduino into bugfix/clientcontext_write

c768893

Harald Frostel and others added 3 commits March 18, 2018 23:54

Merge branch 'master' of https://github.com/esp8266/Arduino into bugfix/clientcontext_write

f0e0843

Merge branch 'master' of https://github.com/esp8266/Arduino into bugfix/clientcontext_write

b55f2f8

Merge branch 'master' into bugfix/clientcontext_write

22e77cf

Merge branch 'master' into bugfix/clientcontext_write

926fafc

Harald Frostel added 3 commits March 22, 2018 00:41

Merge branch 'bugfix/clientcontext_write' of https://github.com/mongozmaki/Arduino into bugfix/clientcontext_write

a064260

- Refined comments

64fae1c

Merge branch 'master' of https://github.com/esp8266/Arduino into bugfix/clientcontext_write

677d847

Merge branch 'master' into bugfix/clientcontext_write

2ec0023

earlephilhower approved these changes Mar 22, 2018

Merge branch 'master' into bugfix/clientcontext_write

a9a9bfa

devyte merged commit 3267443 into esp8266:master Mar 22, 2018

mongozmaki deleted the bugfix/clientcontext_write branch March 22, 2018 07:34
