Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix random crashing of ClientContext::write(Stream) and write_P(PGM_P buf, size_t size) (#2504) #4530

Merged
merged 14 commits into from
Mar 22, 2018

Conversation

mongozmaki
Copy link
Contributor

I experienced random crashes when sending large progmem files with ESP8266Webserver.send_P over the ESP8266 WiFi Access Point.

Affects any call of WiFiClient::write(Stream) and WiFiClient::write_P(...) if underlying tcp_write returns error (in my case ERR_MEM ("come back later")).

Might fix issue #2504 and maybe others.

The bug originates in BufferedStreamDataSource (DataSource.h) (used in ClientContext::_write_some()).

In ClientContext::_write_some:
If tcp_write(...) returns error (e.g. ERR_MEM) (File: ClientContext.h:446), than DataSource::release_buffer is not called (which is fine).
However, if a BufferedStreamDataSource is used, the stream data was already read from the stream by DataSource::get_buffer(next_chunk) (File: ClientContext.h:441). Next time DataSource::get_buffer is called, data is read from the stream again (wrong data, because stream advanced already).
So from this point on, the stream reports less bytes left than the DataSource.
That leads to the assertion assert(cb == size) in BufferedStreamDataSource::get_buffer to fail eventually.

The solution is to remember the stream position and recognise if DataSource::get_buffer gets called multiple times without the corresponding release.
If the stream data was already read earlier, data isn’t read from the stream again.

To reproduce, edit ClientContext::_write_some

…
bool need_output = false;
int rand_err = 0; // <-- Artificial "random" error count
while( will_send && _datasource) {
    size_t next_chunk =
        will_send > _write_chunk_size ? _write_chunk_size : will_send;
    const uint8_t* buf = _datasource->get_buffer(next_chunk);
    if (state() == CLOSED) {
        need_output = false;
        break;
    }
    //err_t err = tcp_write(_pcb, buf, next_chunk, TCP_WRITE_FLAG_COPY); // <-- Original
    //Simulate err == -1 every 4th time
    err_t err;
    ++rand_err;
    if (rand_err % 4 == 0) {
        err = -1;
    }
    else {
        err = tcp_write(_pcb, buf, next_chunk, TCP_WRITE_FLAG_COPY);
    }
    DEBUGV(":wrc %d %d %d\r\n", next_chunk, will_send, (int) err);
    ...

Copy link
Collaborator

@d-a-v d-a-v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good !

@devyte
Copy link
Collaborator

devyte commented Mar 18, 2018

I understand the explanation and the code changes, but I haven't looked at the callng code, so I'm missing a bit of context. In any case, this looks sane to me. My one question: when they get out of sync, and before calling get_buffer(), which gets them in sync again, what happens with available()? is the current code for that correct?

@mongozmaki
Copy link
Contributor Author

You mean DataSource::available()? The behaviour of that didn't change. get_buffer does not alter the available value (which is _size - _pos) because the _pos is only incremented in DataSource::release_buffer().
The current code is correct in my opinion.

release_buffer basically put them in sync again.

Lets assume get_buffer is called multiple times without calling release_buffer (can happend in ClientContext::_write_some if tcp_write returns error):

Note: chunk_size is min(max chunk size,free tcp buffer, data_source available). data_source available does not change if release_buffer is not called. max chunk size is fixed. Only free tcp buffer changes.

  1. _datasource->get_buffer(next_chunk);
    Reads the data from the stream to own buffer and returns pointer. Internal streamPos is advanced.
  2. tcp_write returns error for some reason (e.g. network busy)
  3. release_buffer is skipped (leaving data source position untouched)
  4. Next time _datasource->get_buffer(next_chunk) is called:
    4.a. If chunk_size is the same as before, no stream data is read and same buffer is returned. streamPos does not change.
    4.b. If chunk_size is smaller than last time, same as 4.a.
    4.c. If chunk_size is greater, only remaining data is read from the stream and appended to the buffer. streampos is updated.
  5. If tcp_write is finally successful, release_buffer is called and data source position is advanced (affects DataSource::available)

However, I just realized, that the assert(_pos == _streamPos) in release_buffer is not optimal.
In reality this should not cause a problem because release_buffer is called with the same size than get_buffer. But to make it more robust, I will add some code so that release_buffer can actually be called with a size less or equal to the get_buffer size.

@mongozmaki
Copy link
Contributor Author

I added code to allow partial release of the buffer.
Is this change in release_buffer visible in the merge request?

Although this is not used right now, it makes the code more future proof IMHO. This can be useful if a buffer of certain size X is requested but for some reason, less bytes are used (e.g. by tcp_write). Than release buffer can be called with less than X. Unused data already read from the stream is saved and returned by the next get_buffer call.

@mongozmaki
Copy link
Contributor Author

I dug further into the issue of tcp_write(...) returning ERR_MEM. It seems that I'm getting this issue if I'm running out of heap memory.
Freeing up heap helps preventing this bug to strike. However my proposed fix is still valid.

@d-a-v
Copy link
Collaborator

d-a-v commented Mar 21, 2018

@mongozmaki I once again reviewed your PR and it seems fine to me.
However It'd be nice if @igrr reviewed it too.

Freeing up heap helps preventing this bug to strike. However my proposed fix is still valid

Which bug is it, the same one that made you write this fixing PR or another one ?
Was it the OOM debug option that showed it to you ?

@mongozmaki
Copy link
Contributor Author

I try to explain it a bit better.

This PR is a fix to the problem if tcp_write(...) in ClientContext::_write_some() returns an error (like ERR_MEM). Than the DateSource and the underlying Stream are not at the same position anymore after such an error (stream to far ahead). My code fixes that out-of-sync problem.

Now I come to the issue why I found this bug in the first place (which might or might not be another bug):
My program crashed irregularly while sending large progmem data via webserver. I found that when tcp_write returns ERR_MEM, the transmission crashes eventually (when stream reached its end).

However, the question still remains, why tcp_write returns ERR_MEM sometimes even if tcp_sndbuf (in ClientContext::_write_some) reports more free memory than what is finally written by tcp_write.

This happend especially, when compiling with "lwip2-Higher Bandwidth" option.
With my fix it didn't crash anymore but the transmission was much slower (than with "lwip2-Lower Memory" option) because I got much more ERR_MEM retured by tcp_write.

I cannot reproduce this other issue relably for now. All I can guess is, that tcp_write returns ERR_MEM more often, if free heap is low (which is amplified by "lwip2-Higher Bandwidth" option because of larger buffers in mem).

I will further investigate and hopefully reproduce this second issue in a sample code.

The debug message to check out is the line DEBUGV(":wrc %d %d %d\r\n", next_chunk, will_send, (int) err); in ClientContext::_write_some, especially if err == ERR_MEM. WiFi debug needs to be enabled.

@mongozmaki
Copy link
Contributor Author

At the end this second issue could also just be an out-of-memory problem.

@earlephilhower
Copy link
Collaborator

I've been peripherally following this and it looks like a good fix, thanks @mongozmaki .

If you're getting ERR_MEM it could be that, even though you have enough total free space in the heap the UMM allocator can't find a large enough contiguous block to satisfy the request. Fragmentation, which can hurt on a system with no MMU and a very small shared heap used by everything.

So if ERR_MEM is returned only when TCP is doing a malloc()/realloc()/calloc() that returns NULL, I'm not sure it's a bug so much as a "feature." The upper wrapper and app layers would need to handle it appropriately.

@mongozmaki
Copy link
Contributor Author

Ok great!
Yes, I guess memory is the problem here.
If I find some other reason, I'll let you know.
Thanks!

@d-a-v
Copy link
Collaborator

d-a-v commented Mar 21, 2018

The explanation for lwIP's ERR_MEM: link.

As this name ERR_MEM is misleading (totally unrelated with low HEAP), maybe you can add the explanation in your PR for further reference ?

ERR_MEM is to lwIP what is libc's errno==EAGAIN in O_NONBLOCK mode.

Whatever error lwIP's tcp_write() is returning (fatal or ERR_MEM/EAGAIN), _write_some returns and its caller can check the session(pcb)'s state, ie. will close if fatal, or try again until space is released by PHY effectively sending packets already tcp_written data are sent, received by peer and acked back.

@d-a-v
Copy link
Collaborator

d-a-v commented Mar 21, 2018

@mongozmaki
While dealing with this issue, you should enable the OOM debug option in tools menu to be sure you are not running out of heap.

@mongozmaki
Copy link
Contributor Author

@d-a-v
I'm not sure what you mean by adding an explanation. I refined the comments as bit.

@mongozmaki
Copy link
Contributor Author

@d-a-v
The OOM Debug option did not really help in this case. Getting :oom(1568)@? outputs.

I created a test code for the PR bugfix. It provokes ERR_MEM errors and causes the ESP to crash or omit some data.

It basically streams a large HTML file which - after loading - checks itself for completeness.
Each time before the file is requested, the heap is cut in half.

Start the code and open the ESP IP address in a browser. After some loading (512kB) it should display Payload complete!.
Refresh the page 2-4 times and the ESP should crash or the webpage will display something like Payload INCOMPLETE!!! Only 524032 of 524288 chars!!!
Watch the Serial output also.

For better debuging of tcp_write errors, you can add a debug message in ClientContext::_write_some if an error occurs:

...
if (err == ERR_OK) {
   ...
} else {
   //HERE FOR EXAMPLE
   Serial.printf("err: %d, left: %d, can_send: %d, will_send: %d, next_chunk %d, written: %d\n", (int)err, left, can_send, will_send, next_chunk, _written);
   break;
}
 ...

After applying my PR, the HTML page should always report complete. The ESP might crash eventually because it is running out of memory.

@mongozmaki
Copy link
Contributor Author

#include <ESP8266WiFi.h>
#include <ESP8266WebServer.h>
#include <Stream.h>

#include <list>

const char* ssid = "..............";
const char* password = "...............";

ESP8266WebServer server(80);
std::list < uint8_t *> mem_eaten;


class LargeFakeStream : public Stream {

public:
    LargeFakeStream(size_t size) : pos_(0), fill_size_(size) {

        html_prologue_ = R"foo(
<html>
<head>
<script>
function check() {
let s = document.getElementById('payload').innerHTML;
let len_exp = )foo" + String(fill_size_) + R"foo(;
document.getElementById('info').innerHTML = (s.length==len_exp)?'Payload complete!':('Payload INCOMPLETE!!! Only '+s.length+' of '+len_exp+' chars!!!');
}
</script>
</head>
<body onload='check()'>
<p id='info'>Loading...</p>
<textarea rows='25' cols='256' id='payload'>)foo";

        html_epilogue_ = R"foo(</textarea>
</body>
</html>)foo";


        epi_start_ = html_prologue_.length() + fill_size_;
    }

    size_t size() const {
        return html_prologue_.length() + fill_size_ + html_epilogue_.length();
    }

    size_t write(uint8_t s) override {
        //NOP
        return s;
    }

    int available() override {
        return size() - pos_;
    }

    int peek() override {
        if (pos_ < html_prologue_.length()) {
            return html_prologue_.charAt(pos_);
        }
        if (pos_ >= epi_start_) {
            return html_epilogue_.charAt(pos_ - epi_start_);
        }
        return 'a';
    }

    int read() override {
        int c = peek();
        ++pos_;
        return c;
    }

private:
    size_t pos_;
    size_t fill_size_;

    String html_prologue_;
    String html_epilogue_;
    size_t epi_start_;
};


void handleLargeStream() {
    Serial.println("\nHandle large stream...");
    const size_t free_heap = ESP.getFreeHeap();
    Serial.printf("  Free heap before: %d bytes\n", free_heap);
    
    const  size_t eat_portion = free_heap / 2;
    Serial.printf("  Eating %d bytes...", eat_portion);
    mem_eaten.push_back(new uint8_t[eat_portion]);
    Serial.println("DONE");

    Serial.printf("  Free heap now: %d bytes\n", ESP.getFreeHeap());

    Serial.print("  Transfer large file...");
    LargeFakeStream ts(512 * 1024);
   

    ulong t1 = millis();
    server.setContentLength(ts.size());
    server.send(200, "text/html", "");
    server.client().write(ts);

    ulong dt = millis() - t1;
    Serial.printf("took %d ms at %.2f kb/sec.\n", dt, (ts.size() / 1024.0f) / (0.001f*dt));
}


void setup(void) {
    Serial.begin(115200);

    WiFi.mode(WIFI_STA);
    WiFi.begin(ssid, password);
    Serial.println("");

    // Wait for connection
    while (WiFi.status() != WL_CONNECTED) {
        delay(500);
        Serial.print(".");
    }
    Serial.println("");
    Serial.print("Connected to ");
    Serial.println(ssid);
    Serial.print("IP address: ");
    Serial.println(WiFi.localIP());

 
    server.onNotFound(handleLargeStream);

    server.begin();
    Serial.println("HTTP server started");
}


void loop(void) {
    server.handleClient();
    delay(0);
}

@d-a-v
Copy link
Collaborator

d-a-v commented Mar 22, 2018

@mongozmaki Sorry for the misunderstanding. I was not meaning you add more comments to your PR, which already looks fine.

About OOM, the message you get :oom(nnn)@? means that a m/re/calloc(nnn) returned NULL at some point ('?' indicates an unknown caller, libc or binary firmware). So you are indeed running out of heap.

Let us some more time to check again as it is an update to an important part of the WiFi layer.

Copy link
Collaborator

@earlephilhower earlephilhower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went through the code and it looks good. Only potential issue might be checking for new[] error when making a larger buffer (since no exceptions on ESP8266), but honestly having it crash sooner, here, may be easier to debug than passing up a NULL and waiting for the main app to use it w/o checking.

@devyte devyte merged commit 3267443 into esp8266:master Mar 22, 2018
@mongozmaki mongozmaki deleted the bugfix/clientcontext_write branch March 22, 2018 07:34
bryceschober pushed a commit to bryceschober/Arduino that referenced this pull request Apr 5, 2018
… buf, size_t size) (esp8266#2504) (esp8266#4530)

* Fix random crashing of ClientContext::write(Stream) and write_P(PGM_P buf, size_t size) (esp8266#2504)

* - Allow partial buffer release

* - Refined comments

(cherry picked from commit 3267443)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants