Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Website breaks while ahoy keeps running #660

Closed
Argafal opened this issue Feb 7, 2023 · 29 comments
Closed

Website breaks while ahoy keeps running #660

Argafal opened this issue Feb 7, 2023 · 29 comments
Assignees
Labels

Comments

@Argafal
Copy link
Contributor

Argafal commented Feb 7, 2023

Running 0.5.80 on an ESP8266. This problem was also present in 0.5.78 (and maybe before).

Error description: the website stops working. There is no obvious user interaction or inverter condition that I can identify as a possible cause of the problem. The problem shows like this:

Screenshot 2023-02-07 at 12-24-05 Index

MQTT data keeps coming, so AHOY seems to be running just fine. I do not get any error messages related to the webserver on the serial console. This might be related to #645, however in #645 it is reported that the website is not reachable at all, whereas I still see parts of it.

@Argafal
Copy link
Contributor Author

Argafal commented Feb 7, 2023

MQTT has stopped reported production values now and only reports uptime. Here are two pages of recent serial output. Right at half way through the log the pattern of messages changes, related? I don't know the ahoy code well enough to understand what causes the change in output.

I have now rebooted ahoy.

I: Payload (62): 00 01 01 63 01 43 01 3B 04 7B 04 5E 00 04 65 72 00 03 F8 15 01 1E 01 19 01 70 01 BB 01 F2 06 5B 07 26 00 04 75 6A 00 04 7C 83 01 64 01 8F 09 4D 13 8A 15 3B 00 DD 00 E4 03 E7 00 F7 00 02
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch61 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 43 D7 00 00 00 01 00 00 00 00 DB 0D 45
W: (#2) nothing received
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch75 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 43 F5 00 00 00 01 00 00 00 00 7A 8D 5A
W: (#0) nothing received
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch3 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 44 13 00 00 00 02 00 00 00 00 C4 9A 1E
RX whatHappened Ch61 001
RX whatHappened Ch61 001
RX whatHappened Ch40 001
RX whatHappened Ch40 001
I: RX 27B Ch61 | 95 74 40 35 30 74 40 35 30 01 00 01 01 64 01 3F 01 37 04 6D 04 51 00 04 65 75 D0
I: RX 27B Ch61 | 95 74 40 35 30 74 40 35 30 02 00 03 F8 18 01 21 01 1C 01 70 01 B5 01 EC 06 47 20
I: RX 27B Ch40 | 95 74 40 35 30 74 40 35 30 03 07 12 00 04 75 6E 00 04 7C 87 01 68 01 93 09 4B DA
I: RX 27B Ch40 | 95 74 40 35 30 74 40 35 30 84 13 8A 14 FD 00 DC 00 E2 03 E7 00 F6 00 02 AC 21 C2
I: procPyld: cmd:  0xb
I: procPyld: txid: 0x95
I: Payload (62): 00 01 01 64 01 3F 01 37 04 6D 04 51 00 04 65 75 00 03 F8 18 01 21 01 1C 01 70 01 B5 01 EC 06 47 07 12 00 04 75 6E 00 04 7C 87 01 68 01 93 09 4B 13 8A 14 FD 00 DC 00 E2 03 E7 00 F6 00 02
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch23 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 44 31 00 00 00 01 00 00 00 00 65 5E 49
W: (#2) nothing received
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch40 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 44 4F 00 00 00 01 00 00 00 00 C7 77 A0
W: (#0) nothing received
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch61 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 44 6D 00 00 00 02 00 00 00 00 66 B3 EB
RX whatHappened Ch23 001
RX whatHappened Ch3 001
RX whatHappened Ch3 001
RX whatHappened Ch3 001
I: RX 27B Ch23 | 95 74 40 35 30 74 40 35 30 01 00 01 01 63 01 40 01 38 04 71 04 54 00 04 65 78 B3
I: RX 27B Ch3 | 95 74 40 35 30 74 40 35 30 02 00 03 F8 1B 01 24 01 1F 01 70 01 B6 01 ED 06 4B 2B
I: RX 27B Ch3 | 95 74 40 35 30 74 40 35 30 03 07 16 00 04 75 72 00 04 7C 8B 01 6C 01 97 09 4B CE
I: RX 27B Ch3 | 95 74 40 35 30 74 40 35 30 84 13 87 15 0B 00 DC 00 E2 03 E7 00 F7 00 02 45 E0 11
I: procPyld: cmd:  0xb
I: procPyld: txid: 0x95
I: Payload (62): 00 01 01 63 01 40 01 38 04 71 04 54 00 04 65 78 00 03 F8 1B 01 24 01 1F 01 70 01 B6 01 ED 06 4B 07 16 00 04 75 72 00 04 7C 8B 01 6C 01 97 09 4B 13 87 15 0B 00 DC 00 E2 03 E7 00 F7 00 02
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch75 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 44 8B 00 00 00 01 00 00 00 00 02 15 DF
W: (#2) nothing received
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch3 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 44 A9 00 00 00 01 00 00 00 00 A3 95 C0
W: (#0) nothing received
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch23 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 44 C7 00 00 00 02 00 00 00 00 C1 35 60
RX whatHappened Ch61 001
RX whatHappened Ch61 001
RX whatHappened Ch40 001
I: RX 27B Ch61 | 95 74 40 35 30 74 40 35 30 02 00 03 F8 1D 01 26 01 21 01 70 01 B6 01 ED 06 48 12
I: RX 27B Ch61 | 95 74 40 35 30 74 40 35 30 03 07 14 00 04 75 76 00 04 7C 90 01 70 01 9C 09 4E C1
I: RX 27B Ch40 | 95 74 40 35 30 74 40 35 30 84 13 89 14 F4 00 E2 00 E1 03 E7 00 F7 00 02 60 50 49
W: Frame 1 missing: Request Retransmit
I: TX 11B Ch40 | 15 74 40 35 30 86 99 51 75 81 9E
RX whatHappened Ch75 001
I: RX 27B Ch75 | 95 74 40 35 30 74 40 35 30 01 00 01 01 62 01 3C 01 38 04 60 04 52 00 04 65 7A DB
I: procPyld: cmd:  0xb
I: procPyld: txid: 0x95
I: Payload (62): 00 01 01 62 01 3C 01 38 04 60 04 52 00 04 65 7A 00 03 F8 1D 01 26 01 21 01 70 01 B6 01 ED 06 48 07 14 00 04 75 76 00 04 7C 90 01 70 01 9C 09 4E 13 89 14 F4 00 E2 00 E1 03 E7 00 F7 00 02
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch61 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 44 E5 00 00 00 01 00 00 00 00 60 F1 37
W: (#2) nothing received
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch75 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 45 03 00 00 00 01 00 00 00 00 C1 42 DE
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch3 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 45 21 00 00 00 02 00 00 00 00 60 86 95
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 50)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch23 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 45 3F 00 00 00 01 00 00 00 00 C0 43 FE
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch40 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 45 5D 00 00 00 01 00 00 00 00 A2 F2 53
I: enqueued cmd failed/timeout
I: (#1) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch61 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 45 7B 00 00 00 02 00 00 00 00 C3 04 EE
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch75 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 45 99 00 00 00 01 00 00 00 00 67 90 2C
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch3 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 45 B7 00 00 00 01 00 00 00 00 C6 45 6A
I: enqueued cmd failed/timeout
I: (#1) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch23 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 45 D5 00 00 00 02 00 00 00 00 A4 B0 93
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch40 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 45 F3 00 00 00 01 00 00 00 00 C5 46 32
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch61 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 46 11 00 00 00 01 00 00 00 00 6E 66 44
I: enqueued cmd failed/timeout
I: (#1) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket
I: TX 27B Ch75 | 15 74 40 35 30 86 99 51 75 80 0B 00 63 E2 46 2F 00 00 00 02 00 00 00 00 0F 3A 4B
I: enqueued cmd failed/timeout
I: (#2) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 2
I: (#2) Requesting Inv SN 1161XXX4
I: (#2) enqueuedCmd: 0xb
I: (#2) sendTimePacket
I: TX 27B Ch3 | 15 74 40 42 54 86 99 51 75 80 0B 00 63 E2 46 4D 00 00 00 01 00 00 00 00 6D CF AE
I: enqueued cmd failed/timeout
I: (#0) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 0
I: (#0) Requesting Inv SN 1141XXX9
I: (#0) enqueuedCmd: 0xb
I: (#0) sendTimePacket
I: TX 27B Ch23 | 15 72 22 17 79 86 99 51 75 80 0B 00 63 E2 46 6B 00 00 00 01 00 00 00 00 0C 7D 47
I: enqueued cmd failed/timeout
I: (#1) I: no Payload received! (retransmits: 0)
I: resetPayload: id: 1
I: (#1) Requesting Inv SN 1161XXX0
I: (#1) enqueuedCmd: 0xb
I: (#1) sendTimePacket

@lumapu
Copy link
Owner

lumapu commented Feb 7, 2023

I think we have some kind of memory problem. The website issue comes up because the API only returns null. There is no recover from this state and happens to me from time to time.
The log shows that there is data received RX but suddenly there will be no new data.

@Argafal
Copy link
Contributor Author

Argafal commented Feb 7, 2023

Yes, I agree. I documented an OOM in issue #644, may or may not be related.

How can I help to narrow this down further? Is building the esp8266-debug target the right way of getting more debug output? (Building this target currently fails for me, so I can't try it yet, see issue #653)

@lumapu
Copy link
Owner

lumapu commented Feb 7, 2023

Saw both of the related issues. Don't know about that, maybe digging deeper the next days.
I try to reduce code which dynamically allocates memory, so yesterday I removed another new. I think it cannot be solved with one commit, it's more like a process over several of commits.
You can try to review the code and check for memory leaks. Maybe you find some section where the code can be optimized to reduce memory.
We use several subroutines which enlarges the stack once they're called. Maybe we have to think about a desing which is more flat without a deep hierachy.

This was referenced Feb 7, 2023
@beegee3
Copy link
Contributor

beegee3 commented Feb 8, 2023

@Argafal in short:
dont't know if it is related to the problem, but I saw two strange things in your log file:

  1. in the first half of the log inverter no. 1 sends a complete payload (no frame missing), what should lead to an output on MQTT or the web page. But next line is "enqueued cmd failed/timeout". I'm wondering, how this can happen? I don't see this possibility in the program code
  2. you set the number of retransmits to 50. This is too much. In particular combined with short intervals. Retransmits are for the case, that frames are missing and are send immediately to the inverter. With a number of 50 you may flood the inverter and it will stop answering for some time. Hence a retransmit value of 5 should be sufficient and perhaps an interval value not lower than 10 s. The latter means that you should get updates from all of your 3 inverters within 30 seconds. If you need a lower interval, please decrease it starting with 10 and take a look into the serial log to see if it's going worse.

@Argafal
Copy link
Contributor Author

Argafal commented Feb 8, 2023

@beegee3 Much appreciate that you commented on the log file. The first point you make is most curious. Maybe @lumapu understands it better?

About the second point, I recently changed retransmits from 5 to 50. The reason was: I noticed that if any inverter didn't come online within the first 10-15 minutes after boot it never seems to come online. The only thing that helps is a reboot. I assumed that this means 5 connection attempts failed and ahoy has permanently given up on that inverter. Increasing the number of attempts to infinite (or in my case 50) seemed the smart thing to do: I know the inverter is there and producing, thus: "Ahoy, please keep on trying." Do I misunderstand what retransmits actually does/means?

@stefan123t
Copy link
Collaborator

@beegee3 your point 1. is the same I was about to make after looking thoroughly at the log file.
The inverter sent a complete payload and Ahoy should process that

iv->addValue(i, payload, rec); 
...
iv->doCalculations();
notify(mPayload[iv->id].txCmd);
...
iv->setQueuedCmdFinished();

decode and add the values to the internal inverter structure using memcpy
do some calculations on various composite metrics
and notify other parts of the code about the changes it received and processed.

But then instead of finishing the enqueuedCmd with success,
it complains about the command being failed/timeout ?

See here:
https://github.com/lumapu/ahoy/blob/development03/src/hm/hmPayload.h#L261-L322

@Argafal regarding Retransmits & Retries, it has been shown in the earliest HackRF traces between Hoymiles DTU WLite and HM-600 inverters that the NRF24L01+ in the inverter is only sending 15 Retries of the same packet in sequence.
But I guess you have modified the Retransmits, which is the number of times the DTU will send another 0x8N packet to retrieve the Nth packet of the complete message. So here you will remain in the same Send loop for 50 iterations and try on receiving any of the up to 4 packets each complete message / payload can be made up.
According to the logfiles you posted via the Discord / Pastebin I have seen that one of your inverters has some difficulty getting the message over without Retransmits.
I am a bit indifferent whether this is a smart / bad thing to do, if it helps 👍
However there might be difficulties in the order of the packets that the Hoymiles inverter expects in a new communication with the DTU. We may not have the correct order yet, so this may contribute to the success of reaching the inverter.
In the past we used 0x0B RealTimeRunData_Debug (4 packets) first and now we ask for 0x01 InverterDevInform_Simple (should be only 1) in the beginning.

@tastendruecker123
Copy link
Contributor

I have experienced this as well. For what it's worth, I was able to reboot ahoy remotely by calling /reboot in this state. Since it does look like a memory allocation issue, perhaps we should add an automatic reboot as a temporary workaround. Maybe the httpd portion of the code could keep an eye on its own responses and trigger a reboot if needed.

@beegee3
Copy link
Contributor

beegee3 commented Feb 8, 2023

@Argafal @stefan123t delving deeper into the programming I found the explanation for 1.:
the timeout message belongs to the next inverter (no. 2). It's here: https://github.com/lumapu/ahoy/blob/development03/src/hm/hmPayload.h#L125-L129
Inverter no. 1 enqueuedCmd ended successfully and the process on inverter no. 2 started.
This one hasn't answered before (cf. a few lines above "(# 2) nothing received").

The timeout message is confusing because of the missing inverter no. and some hint like "last time ...",
but I think @lumapu liked to replace lines 125 and 126 with 127 - 129 and just forgot to delete the first two lines 😆

Sorry for wasting your time 😒

@stefan123t
Copy link
Collaborator

@beegee3 @Argafal I gave the timestamps in the sendTimePacket(s) some more attention:

# 63 E2 43 D7
$ date --date="@$(echo 'ibase=16; 63E243D7'|bc)" '+%F %H:%M:%S'
2023-02-07 13:28:07
# 63 E2 44 E5
$ date --date="@$(echo 'ibase=16; 63E244E5'|bc)" '+%F %H:%M:%S'
2023-02-07 13:32:37

But this seems to have been somewhere around lunch time, so the inverters should have eventually received enough sunshine.

@Argafal
Copy link
Contributor Author

Argafal commented Feb 9, 2023

I can rule out to little sun light as a part of the problem. Super sunny the last two days.

However, I think you are onto something with the retransmits/intervals. With retransmits=5, interval=30s I waited a long time for the inverters and sometimes never caught them. With retransmits=50, interval=30s it was probably the same. Maybe worse, hard to measure. I now use retransmits=5, interval=10s, and boy that seems to have made for a big improvement. Inverters are now found much more reliably and quickly upon boot-up. Is there a reason to have a 30s default? @knickohr what is your interval time?

Related question: does ahoy ever give up on an inverter entirely, and if so, when?

On the negative side, whilst I now quickly find the inverters the webserver breaks away much more quickly than before. I don't make it through even a few minutes now before the website becomes an empty template (screenshot above). If there is anything to test or any additional debug to activate to narrow the issue down, I now seem to reproduce the problem easily.

@Argafal
Copy link
Contributor Author

Argafal commented Feb 9, 2023

Additional symptom: I notice that the website occasionally recovers after a minute or two.

@knickohr
Copy link

knickohr commented Feb 9, 2023

I‘m still at default 30s interval time.

@beegee3
Copy link
Contributor

beegee3 commented Feb 9, 2023

@Argafal interval describes the time period between two requests. Since a loop iterates over the inverters it needs
(number of inverters) * interval seconds that the same inverter is requested again (with your current setting 3 * 10 = 30 seconds).
The default setting interval=30 comes from older times where the logic was different in a way.
retransmits is the maximum number of single requests if a frame is missing. That is the inverter has a counter in his answers (=frames) starting with 1. In the last frame 0x80 is added to the counter. That way it is easy to detect missing frames. These will be requested again immediately (up to x times with x = your retransmits value).
If there is no answer at all there is no retransmit.

@lumapu
Copy link
Owner

lumapu commented Feb 15, 2023

@Argafal please check again with latest version 0.5.89

@Argafal
Copy link
Contributor Author

Argafal commented Feb 17, 2023

I confirm that this issue still occurs with 0.5.89.

@tastendruecker123
Copy link
Contributor

@Argafal How are you using Ahoy? Is more than one client accessing the website (maybe browser + a script collecting values from the API) and are you writing limit to the inverter using either MQTT or the API?

@Argafal
Copy link
Contributor Author

Argafal commented Feb 17, 2023

  • AHOY sends inverter data on MQTT, whenever available. No interval specified.
  • I do not set any settings/limits via MQTT.
  • I occasionally refresh the ahoy website for the purpose of our current debugging.
  • I have exactly one tab open. I refresh a single time and wait for the answer before sending the next refresh. Does my refresh maybe sometimes coincide with the automatic refresh of the ahoy webserver homepage, which causes twice the load?
  • I have enabled serial output and read the serial log.
  • I did not enable further debugging when building.

@tastendruecker123
Copy link
Contributor

I have run into the same issue with a single inverter and multiple clients, but you don't seem to be hammering the website nearly as much as I was at the time.

It might not be a memory leak as such, but more an issue with memory fragmentation that can occur if one is creating and destroying larger String objects. Dynamically allocating and releasing blocks of memory will lead to 'holes' in the heap that are simply too small the accommodate the Strings that you're trying to put in there. I'm guessing that's also why the web server is returning 'null'. It can't allocate a large enough block of memory to hold the response. That would also explain why the site sometimes receivers, if a larger block of memory is released elsewhere.

Quick primer on the subject:
https://cpp4arduino.com/2018/11/06/what-is-heap-fragmentation.html

To diagnose the problem, the same site links to a code fragment that's able to check the size of the largest available block of memory and calculate the level of fragmentation on and ESP8266. It may be wise to include this piece of code to get a better idea of what's causing the issue.

As a 'quick fix' that would be very quick to implement, we may want to check whether it's possible to allocate a large enough block of memory and reset the micro if that's not the case.

With a bit more time and effort we could use the String.reserve() function for suspicious Strings to avoid much of the dynamic memory allocation.

To properly fix the problem permanently, we may have to avoid dynamically allocated String altogether.

@beegee3
Copy link
Contributor

beegee3 commented Feb 17, 2023

@tastendruecker123: the latest version 0.5.89 worked on the problem you describe. It seems to be fine now🤞

@lumapu: because of @Argafal's comment

I have enabled serial output and read the serial log

I looked at the code for creating the serial log output (web.h). Maybe there is a possible issue in the function serialCb where the serial buffer mSerialBuf of size WEB_SERIAL_BUF_SIZE is filled. There a two lines if((...) <= WEB_SERIAL_BUF_SIZE) {...} to avoid overflow errors. So it is possible that the buffer can be filled completely and does not contain a '0' indicating the end of the buffer, i.e. an overflow error occurs on sending it to the website.
In summary the two if clauses should be corrected to if((...) < WEB_SERIAL_BUF_SIZE) {...} (delete the = sign).

@Argafal
Copy link
Contributor Author

Argafal commented Feb 18, 2023

  • I have changed the serial buffer check as outlined by @beegee3 above, in two places. Code base 0.5.89.
  • I changed from a computer USB to a USB power supply to see if power could be an issue.
  • The webserver still dies. MQTT continues. Same symptoms as before.
  • I think the serial buffer change is still a good fix to include. Maybe it helps with Memory leak - Out of Memory Exception #644, for example.
  • What happens if serial debug is activated but no one is listening? Does a buffer fill up? Or do messages get discarded?

@lumapu
Copy link
Owner

lumapu commented Feb 19, 2023

the messages at discarded or not collected. I don't think the serial buffer will lead to a problem

This was referenced Feb 20, 2023
@niekniek89
Copy link

I have the same problem as mentioned above.
I'm on the latest developer version.
after +- 24 hours of working properly, a reboot takes place.
Before that, the web server is half/unreachable.

lumapu added a commit that referenced this issue Feb 25, 2023
reduced maximum allowed JSON size for API to 6000Bytes #660
small fix: output command at `prepareDevInformCmd` #692
improved inverter handling for MQTT #671
@lumapu
Copy link
Owner

lumapu commented Mar 9, 2023

should be solved with current dev version eg. 0.5.96

@lumapu lumapu added the fixed dev fixed label Mar 9, 2023
@lumapu lumapu self-assigned this Mar 9, 2023
@Argafal
Copy link
Contributor Author

Argafal commented Mar 9, 2023

So cool, well done. I'd love to hear what the solution was in the end. Do you care to share? :)

@lumapu
Copy link
Owner

lumapu commented Mar 9, 2023

main issue were strtok functions. It generates huge heap fragmentation which leads to the problems.
Never thought that before, functionality is now the same but working around strtok

@Argafal
Copy link
Contributor Author

Argafal commented Mar 10, 2023

Great work, @lumapu!

@Argafal
Copy link
Contributor Author

Argafal commented Mar 25, 2023

Problem ist scheinbar doch wieder aufgetaucht.... Schade.

Könnten Betroffene (und auch Nicht-Betroffene) bitte kurz diese Fragen beantworten:

  • ESP8266 oder ESP32?
  • Version?
  • Wie viele Inverter und welche?
  • Inverter-Abfrage Interval und Retries aus den Einstellungen? (Default: 30s und 5 retries)
  • MQTT Server konfiguriert: ja/nein?
  • Falls ja, auf welcher Hardware läuft der MQTT Server? (z.B. richtiger Rechner? oder Raspberry Pi? welche Variante?)
  • MQTT-Interval aus den Einstellungen? (Default: 0)
  • WebUI aktiv benutzt: ja/nein?
  • Typische heap_fragmentation Wert nach 20 Minuten Laufzeit? Siehe /system
  • Regelmässige Probleme dass die WebUI kaputt geht, ja/nein? (siehe screenshot weiter oben)

@Argafal Argafal reopened this Mar 25, 2023
@theunknownstarman
Copy link

theunknownstarman commented Mar 25, 2023

  • ESP8266
  • 0.5.92
  • 1 Hoymiles 1500
  • 30 Sekunden
  • MQTT: ja
  • Homeassistant
  • MQTT Default auf "0"
  • WEB UI aktiv
  • 10

Absturz wäre der falsche Begriff:
WLAN Verbindung für WEB UI geht nicht mehr (MQTT Datentransport funktioniert weiterhin).
Nach WLAN reconnect (via WLAN Überwachung wird der AHOY zum reconnect gezwungen) ist WEB UI wieder normal bis zum nächsten "Problem" erreichbar.

Meiner Einschätzung nach existiert ein Problem mit dem Port des Webservers der UI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants