-
Notifications
You must be signed in to change notification settings - Fork 13.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Panic on WPA2 Enterprise Connection #8082
Comments
Can confirm, I also have this issue |
@wujiangang Can you help with this issue? There is a bug introduced somewhere (seems like it's in wpabuf_free) in the nonos-sdk that frees memory that was never allocated (or to be more precise the pointer is pointing to a memory region that's invalid), would be great if that could get fixed. |
@d-a-v Unfortunately it looks like this SDK issue won't get fixed. What would you think about relaxing the free() check a little to work around this issue? Maybe just silently drop if free() is called on invalid memory (and optionally check if it's being called by wpabuf_free())? I could also patch out that call from the library, but I'm not sure if you want to have a modified binary in this repository? Or maybe you know someone from espressif who you can contact to get this fixed in the SDK? |
I have no more power than you do :] About relaxing the call to |
Yes, I have tried it (and others have aswell, see https://bbs.espressif.com/viewtopic.php?t=5962&start=20#p71905), it's really just this free()-call that's causing issues. |
Thanks for pointing us ! Unconditionally patching the I think something like this would be acceptable: umm_antidote(true); // disable umm_check_poison()
wifi_station_set_wpa2_enterprise_auth(1);
wifi_station_set_enterprise_identity((uint8*)username, strlen(username));
wifi_station_set_enterprise_username((uint8*)username, strlen(username));
wifi_station_set_enterprise_password((uint8*)password, strlen(password));
wifi_station_set_enterprise_ca_cert((byte*)ca_cert, strlen(ca_cert));
wifi_station_connect();
umm_antidote(false); // re/enable umm_check_poison() (s/antidote/whatever/g) |
I thought about something like ignoring |
Making |
|
Never mind it is not that simple. The poison check fails after the block. The block starts in DRAM and extends far beyond DRAM address space. Most likely umm_block data is invalid resulting in an oversized allocation calculation. |
A double-free could probably easily be detected by adding a |
Did you try to add a |
I don't have the latest version installed at the moment, I guess I need to setup the latest version on a test system (I don't want to overwrite the old working version). The issue only comes up when a connection is established, so there's no point in adding additional debug output between each line of the MCVE code as it will only show that the issue is happening somewhere during the connection process (I think the stacktrace also shows that). |
Using some experimental
ref: SDK version: 2.2.2-dev(38a443e) |
Are there debug symbols in some file? I'm a little surprised the exact line is shown, that's all closed source isn't it?.... Just so I understand it correctly: free() is called 2 times by the exact same line of the exact same function? And that's causing all the issues? Those register dumps look absolutely identical to me. That'd make it a lot harder to patch it though.... I am wondering if it even ends at a double-free or if they do a triple-free if you would let them.... The only hackaround would be to ignore the first free and only execute the second one (since the first one shouldn't be there apparently), but things are getting really dirty then.... |
No. Yes. However, the malloc API gives us hints. The SDK uses the portable heap library or something like that. The functions have fields that include the module/file name and line number that is calling. The
I think that is a valid concern. I am not sure what I remember, I saw too many unresolved things. Unfortunately, I have run out of time. I have some work that needs to be finished this week. So I have to resist poking at this any further. void IRAM_ATTR vPortFree(void *ptr, const char* file, int line) |
@mhightower83 Which SDK and core did you use for your tests? I setup a test environment with the latest git-master and just went through the pre-3 and the latest 2.2.1+119 and I am always getting this exception, which is kind of funny as I am only using a single heap and those functions should be esssentially empty.....
The sketch has been slightly adapted:
|
From: Serial.printf("SDK version: %s\n", system_get_sdk_version()); SDK version: 2.2.2-dev(38a443e) |
I have figured out what caused the crashing: I need to define a debug serial port, otherwise it will just crash with a stacktrace like the one from above. If I don't define a debug serial port it, even with that poison check removed there are crashes. There seems to be a massive corruption somewhere, I am trying to get additional output for debugging in umm_local.c and modified the lines after 114 like this:
so I should get a nice error message, right? Wrong... All I get is
When I comment those debug lines out again I can actually see a little more:
I went a little further and simply disallowed
That gives me this:
It got allocated properly, so I reduced the "trace range" a little and added:
So now I will see if it got
First line allocs it, first free frees it again and then it end's with another free.... So the last one is most likely the one we actually want. So now I want the line numbers aswell:
So we are interested in line 213. Unfortunately there are 2 cases where a line 213 is calling and accessing file only shows the LoadStoreError from above. Anyways, now I am interested in that line 213 only, so I only log those:
So just 2 cases where a line 213 frees something (it could still be in different files), unfortunately I still get that LoadStoreError if I try to access the file pointer. Anyways, the __panic_func can do it, so I simply call it to get a full stacktrace:
sweet, first call is in wpabuf.c:213. I kinda get the feeling that's wpabuf_free() there (just guessing, but it doesn't really matter). Anyways, on to the next one:
So same caller as the "bad" call from above. That's not good, so we can't patch these, but we might not even have to since there is still that second call in line 935, so let's see where that is being called from:
So eap.c:935 is calling free() on something that has been free'd already. This would be the right call to suppress. So let's look at it in a disassembler:
I added 2 comments for better readability for non-assembler-people reading this, a1 is the stack pointer. This looks pretty much like they call The last question that I was wondering about was if my fix is now leaking memory (I mean I kicked out a call to free(), so the possibility is definitely there), and the answer is unfortunately yes. My first test shows 8 bytes less free heap when I reconnect the first time, 40 bytes less free heap for the following 2 (re-)connections, then 8 again and so on. I guess |
Nice hacking 🚀
Ultimately yes. If you can do that for the rest of the firmware this would help alot 😝
Logging again malloc/free calls may allow you to locate what is leaked. |
Well unless I'm mistaken the correct implementation is really just a I think the implementation in https://github.com/espressif/esp-idf/blob/2467aa7f6c177ed040e385657304b0ea413e8133/components/wpa_supplicant/src/eap_peer/eap.c#L815 is how it should look like. I wonder if we could replace the entire closed source libwpa, libwpa2 and libwps stuff with this open source implementation or if there are too many changes necessary then to make it compatible with the other libraries. I believe for ESP32 it can be self-built from the source code in the esp-idf? Or are there parts missing so they just built the closed source libraries for ESP32 based on the repo and if you want changes you need to get them in that repo somehow? I've seen the fix shell script(and I believe the PATH-export in it is missing a "../", at least if its supposed to be called from one of the NONOS* directories ;) ), I've looked at it again and the symbol redefinition sounds suitable for this case. The heap stays steady until I cause a reconnection by kicking it from the AP. I don't know if that's "normal" for this SDK (I wouldn't be surprised at all) or if this is (another) WPA Enterprise issue. The memory tracking isn't very nice, I had to suppress some messages (for 44 bytes allocations for example) as otherwise I'd get crashes, and you get a whole load of messages so it's super hard to follow them through (I guess I'd have to write a parser that parses my log and keeps track of the leak then). If there is a 8 byte leak and a 40 byte leak somewhere I would have to look for 8 and 40 byte allocations? Or is there an overhead involved aswell so that the actual requested memory is shorter? 8 bytes is really short, if there's overhead involved aswell that'd probably be a 4 or 6 byte allocation.... |
The input to the function is a structure pointer a2.
It should not be a problem unless a2[0xC0] and a2[0xB8] hold the same value. Unless I am tired and confused |
I think so, unless |
@mhightower83 You are totally right, I got confused with the (kinda out of order) instructions right before the vPortFree stuff is loaded, but that's the zeroing and not related to the function call... I think we are super close though, wpabuf_free() is doing this:
So they are essentially freeing 2 pointers in there. 0xC0 - 8 would be 0xB8, but that's an addition and not a subtraction there? |
Ah sorry my analysis was partially wrong: 0xB8 is eapKeyData, it's not the wpabuf that is being freed twice here apparently, if I look again at one of my debug outputs from above:
And the first free (for our actual memory, ignore the wpabuf that momentarily existed there) is happening right after
Looks like a free() call which leaves the pointer untouched afterwards to me. If it would be set to null here properly then everything would be alright..... Fortunately there is some-ish space if we don't care about the line number and file name, 2 instructions which would be enough to set a register to 0 and write it to the offset 0xB8. It would completely mess up the file and line stuff though, there would be an invalid pointer passed as file, so if something ever relies on that being valid it would break completely. To make things worse, the pointer is also accessed in eap_sm_process_request, when a new key is received the old memory is freed. So if we would not touch that assembly above and if there is a way that eap_sm_process_request get's called before the eap_sm_abort where our pointer is nulled we would have the same double-free again, just at another location. Long story short: The pointer must be nulled there (or the free must be removed there, which needs to happen manually as it's not like we can replace an entire function here), doing it where I first did it is not a good idea. Maybe I could have it fixed by espressif when I point to the exact line number and state the exact code that needs to be added, but even then this fix will only go into master, I don't think there is any chance they would backport it to 2.2.x, and getting master into Arduino is quite tricky (I tried earlier today and failed with the partition stuff).... |
If you try to free a NULL pointer, it is ignored. |
Yeah I know, the problem is though that the SDK frees it in the assembly I posted above but does not update the pointer to null. If they would update it properly it would not be freed a second time as it's null. I corrected my previous comment at some places, I was referring to the pointer as "memory", like this it should be more clear what I actually mean. In fact I had quite a few mistakes/unclear wordings in that comment, I've corrected those now. Anyways, it's wayy to late so I'll pick this up tomorrow, I'll probably write a simple patch function that checks if at the desired offset the expected data is located and then updates it accordingly, so this could be integrated in fix_sdk_libs.sh. |
Sounds like you found a working solution. Update: |
I had a look at the latest libwpa2.a from NONOS-SDK and it looks like the issue is fixed there. So if we pull that in some day the issue will get fixed eventually. For all the other SDKs I decided the best fix would be to remove that vPortFree() call entirely, I have already prepared the modifications for the fix_sdk_libs.sh, just need to checkout a clean master, apply those and then prepare a PR. Have you also observed that 8-40-40-8... memory leak when you kick the ESP and it reconnects? I think you need code that specifically instructs the ESP to reconnect (so basically put all the stuff in a connect() function and call that whenever there's no connection), otherwise it will not reconnect automatically. |
Using this modification to heap.cpp #define DEBUG_PRINTF ets_uart_printf
void IRAM_ATTR print_free_loc(const char str[], void *ptr, const char* file, int line)
{
{
DEBUG_PRINTF("\n%s: %p, ", str, ptr);
bool inISR = ETS_INTR_WITHINISR();
if (NULL == file || (inISR && (uint32_t)file >= 0x40200000)) {
DEBUG_PRINTF("File: %p", file);
} else if (!inISR && (uint32_t)file >= 0x40200000) {
char buf[strlen_P(file) + 1];
strcpy_P(buf, file);
DEBUG_PRINTF(buf);
} else {
DEBUG_PRINTF(file);
}
DEBUG_PRINTF(":%d\n", line);
}
}
void IRAM_ATTR vPortFree(void *ptr, const char* file, int line)
{
static void *delayed_free = NULL;
#if defined(UMM_POISON_CHECK) || defined(UMM_INTEGRITY_CHECK)
// While umm_free internally determines the correct heap, UMM_POISON_CHECK
// and UMM_INTEGRITY_CHECK do not have arguments. They have to rely on the
// current heap to identify which one to analyze.
//
// Should not need this for UMM_POISON_CHECK_LITE, it directly handles
// multiple heaps. DEBUG_ESP_OOM not tied to any one heap.
HeapSelectDram ephemeral;
#endif
if (NULL == ptr) {
} else
if (line == 935) {
if (delayed_free != ptr) {
print_free_loc("free", delayed_free, file, line);
heap_vPortFree(delayed_free, file, line);
}
delayed_free = NULL;
print_free_loc("free", ptr, file, line);
} else
if (line == 672) {
if (delayed_free) {
print_free_loc("leak", delayed_free, file, line);
}
print_free_loc("stow", ptr, file, line);
delayed_free = ptr;
return;
}
return heap_vPortFree(ptr, file, line);
} I did not see any leaks get reported from those two locations. Side notes on Heap |
Are these line numbers (935, 672) constant accross the different versions of FW which we can enable in menus / config, |
My patch is unique for each libwpa2.a version. There's currently 3 versions of that file in this repo. However, since the latest NONOS-SDK fixes it we should never need to patch again for future versions. I only modify the libwpa2.a by changing the free-call into a nop. |
So we have two ways for fixing this bug:
Are you referring to nonos-sdk-v3 ? |
Yes, those are the 2 ways. I prefer the first one as for the second one there could be the possibility of a "collision" of line numbers and as those are unnecessary instructions that get executed each time. With "latest" I mean "latest master" of the SDK on GitHub, the one that's not (yet) part of this repo. I have looked at it in a disassembler just to see if they fixed it. So if someone does add the new SDK 3.0.4 (I think that's the current version) that would probably fix this specific problem aswell. |
My intent with this approach was just to explore if we might be leaking memory by not freeing at 672. |
Fixes: esp8266#8082 This patches the callx0 instruction to a nop in eap.o which is part of libwpa2.a. It looks like espressif fixed the Bug in newer SDK versions, so if we update to the latest NONOS-SDK it is most likely not necessary to add/adapt this patch. Also modifies the fix_sdk_libs.sh script as it even changed files if no changes were necessary, for example adding multiple system_func1 exports.
I've opened a PR for my changes now as it seems to be a good fix and doesn't seem to cause any regressions. |
Fixes: esp8266#8082 This patches the callx0 instruction to a nop in eap.o which is part of libwpa2.a. It looks like espressif fixed the Bug in newer SDK versions, so if we update to the latest NONOS-SDK it is most likely not necessary to add/adapt this patch. Also modifies the fix_sdk_libs.sh script as it even changed files if no changes were necessary, for example adding multiple system_func1 exports.
Two more observations for the WPA2-Enterprise option:
|
For 2.: Did you correctly call the matching clear-methods for all the set-methods? Those are necessary to free the memory according to the SDK documentation. |
Yes this set: wifi_station_clear_enterprise_identity();
wifi_station_clear_enterprise_password();
wifi_station_clear_enterprise_cert_key();
wifi_station_clear_enterprise_ca_cert();
wifi_station_clear_enterprise_username();
wifi_station_clear_enterprise_new_password(); It looks like those only free the memory allocated when calling |
You are using username/password, right? I was using certificate based authentication when testing. Line 757 is for the identity, I assume you're using the anonymous identity? Not that it would make things better, if you supply an identity it will just change to line 750, but with the same issue as far as I can see... Looks like eap_peer_config_deinit is incomplete... Line 775 seems to be for the password. What we can do to fix this is probably renaming that eap_peer_config_deinit, providing our own version that clears the identity and password (it currently only clears the username if I see that correctly) and new_password, so we are no longer leaking password, identity and new_password (allocated in line 766 if you supply it). It would explain my 40 Byte leak: 23 for the identity string, 12 for poison, 4 for header, that's 39, must be multiple of 8, so 40. It won't explain why you only saw a 32 byte leak though (you mentioned that earlier, not sure if that's still accurate), especially if you also supplied a password. It also doesn't really explain my 8-40-40...-Pattern. |
I have been trying TLS, TTLS, and PEAP. They all seem to share some form of leak. PEAP was the one I was last looking at. The following is for TLS. This is a snippet from the allocation dump I am working on. The [nnnnn] value is the real size specified for the allocation. I think this dump is correct, I have fixed a few issues already to get to this. Take it with a grain of salt.
For TLS, as you go from initial start, Enterprise set, connect, disconnect, and Enterprise clear; total Allocations grew from 19, 24, 32, 28, 28. As you cycle through connect disconnect the number of Identity allocations increases by one. |
Well I know what needs to be free'd but I don't know how to do it yet: eap_peer_config_deinit is not exported, it's only a section in the eap.o object, so weak function overrides and that kind of stuff won't work. I tried to compile my own .o (using the xtensa-gcc with the -c option) and then replacing the section using update-section of objcopy, but there's relocation information attached to it, and even if I replace that aswell it won't work and will crash with an illegal instruction. That leak also still seems to be there in latest master of the SDK, unlike the double-free which was fixed in the meantime. |
* Fix double-free when connecting to WPA2-Enterprise networks Fixes: #8082 This patches the callx0 instruction to a nop in eap.o which is part of libwpa2.a. It looks like espressif fixed the Bug in newer SDK versions, so if we update to the latest NONOS-SDK it is most likely not necessary to add/adapt this patch. Also modifies the fix_sdk_libs.sh script as it even changed files if no changes were necessary, for example adding multiple system_func1 exports. * Apply suggestions from code review
…8529) * Fix double-free when connecting to WPA2-Enterprise networks Fixes: esp8266#8082 This patches the callx0 instruction to a nop in eap.o which is part of libwpa2.a. It looks like espressif fixed the Bug in newer SDK versions, so if we update to the latest NONOS-SDK it is most likely not necessary to add/adapt this patch. Also modifies the fix_sdk_libs.sh script as it even changed files if no changes were necessary, for example adding multiple system_func1 exports. * Apply suggestions from code review
Basic Infos
Platform
Settings in IDE
Problem Description
Immediately after connecting to a WPA2 Enterprise encrypted network I receive an exception. This seems to be related to the way free is called from a function in eap.c (which seems to be part of the SDK). I am not sure if this is a bug here or in the SDK, but apparently an address that is unmapped was passed to free. I am not sure if maybe the stack got corrupted somewhere as that function hierarchy doesn't really make sense to me.....
I have everything working on the super old version 2.3.0 (which uses completely different functions for setting WPA Enterprise up) but I want to update.
I have reproduced this on all Espressif Firmware Versions that are available.
MCVE Sketch
Debug Messages
The text was updated successfully, but these errors were encountered: