0.0.9 - Enviro Urban - Exception while uploading - Caused board to hang for 2hrs #119
Comments
Thank you for raising this. It looks like it got stuck during the Why it got stuck during that phase, I do not know yet. I'll fire my own Enviro Urban up and see if I can reproduce. |
Ah, the logs time reset to zero as I unplugged the battery before connecting to USB to view the logs. |
I had my own Enviro Urban lock up overnight (solid white light, but no red light). Unfortunately I needed it to be logging again so just reset it back to life. Hopefully the issue will appear again. |
Just to update that I am still investigating this. It's a hard one to diagnose, as it's so infrequent, even when logging every minute. Best I can gather though, it's not an issue with the board going to sleep, but rather the board waking up. We do some shenanigans during bootup before Micropython kicks in, and somewhere in that is the issue. Likely in this file, if you're curious https://github.com/pimoroni/pimoroni-pico/blob/main/micropython/_board/picow_enviro/wakeup_gpio.patch |
I too had a solid white light on enviro indoor a day or two back, but after power cycling several hours later (having spotted the issue) the board has been working fine since. I'm using a custom webhook rather than mqtt, if that matters. |
@TechWilk Thanks for the confirmation. I have not experienced this in Enviro Indoor myself yet, which has been running fine for over a week. Only issue I had was the RTC slowly going out of sync, causing an issue with Adafruit.io uploads. This makes me wonder if it's the underlying micropython version at fault, as my Enviro Indoor and Weather are using v1.19.7 as their base, with the v0.0.9 enviro firmware onboard. Whereas my Enviro Grow and Urban are v1.19.10 with v0.0.9. 🤔 |
Have been doing a little experiment (2 x Urbans and 2 x Indoors), all taking readings every 2 minutes and uploading every 3 readings. I wanted to see how the readings varied between them when they are all in the same conditions. After about 3 hrs one Urban has experienced the issue. Again it has the red light and white LED on. Note this is a different Urban to the one in the original issue that was raised. To get it to work again I had to remove power, and then it seems OK again. Will continue to monitor. They are all running identical 0.0.9 https://github.com/pimoroni/enviro/releases/download/v0.0.9/pimoroni-picow_enviro-v1.19.10-micropython-v0.0.9.uf2 The only thing different to v0.0.9 is I am sending over secure mqtt #122 on these ones but that should not have any impact. Logs File see events at 14:48:
|
This morning had 1 x Indoor and 1 x Urban in a failed state with the red and white light on. For the Urban it looks like once one of the uploads failed it stopped; from the logs it does not look like it even tried to go back to sleep, see entries at 09:08. At 10:25 was when I removed the power and plugged in to USB to get logs:
For the Indoor the logs are slightly different but again it had to be powered off to get the logs and kick it back into life. See time 06:00, but this one did appear to try to go to sleep after an upload failure; the entries after that are when I powered off and plugged back into USB to get the logs.
The common theme seems to be that it only occurs after an upload failure. So perhaps it might be possible to simulate by disabling the remote destination (turn mqtt off) and see if it forces the error. The logs for both also never show a "shutting down" entry, so not sure if it is due to not waking up or it is just not shutting down properly after an upload failure. Note that until these failures they had been uploading fine, taking readings every 2 minutes and uploading every 3 readings. |
Indoor & Urban had the same issue again today: For the Indoor this may be down to the battery level this time, which was showing about 4% with my power reader, but voltage should still be in tolerance. Urban had 84% battery so don't think that would be a reason this time. Indoor Logs, see 05:55. Last log entry is when I disconnected battery and reconnected on USB to get logs.
Urban logs, see 06:48. Have included the previous logs so you can see it worked before. Last log entry is when I disconnected battery and reconnected on USB to get logs.
|
Hi @dave-ct, Thank you so much for this thorough testing! I can clearly see there being some correlation between the mqtt upload failing and the boards locking up, and it does seem to be during shutdown rather than startup (meaning this may be a different issue to the one I and others have reported, which is definitely during startup). I will have a look in the code and see if there is something specific to do with uploads failing that would cause this kind of lock-up. Hopefully I can reproduce without setting up an mqtt server. |
@ZodiusInfuser if there is any extra logging/tests you want me to try let me know. Both Urbans have failed today just now with the same issue after an upload failure, see another example below at 11:24 and then connected to USB at 12:04.
|
@dave-ct to help with debugging, could you update the local

    from enviro import logging
    from enviro.constants import UPLOAD_SUCCESS, UPLOAD_FAILED
    from enviro.mqttsimple import MQTTClient
    import ujson
    import config

    def log_destination():
      logging.info(f"> uploading cached readings to MQTT broker: {config.mqtt_broker_address}")

    def upload_reading(reading):
      server = config.mqtt_broker_address
      username = config.mqtt_broker_username
      password = config.mqtt_broker_password
      nickname = reading["nickname"]
      try:
        # attempt to publish reading
        mqtt_client = MQTTClient(reading["uid"], server, user=username, password=password, keepalive=60)
        mqtt_client.connect()
        mqtt_client.publish(f"enviro/{nickname}", ujson.dumps(reading), retain=True)
        mqtt_client.disconnect()
        return UPLOAD_SUCCESS
      except:
        logging.debug(f" - an exception occurred when uploading")
      return UPLOAD_FAILED

to this:

    from enviro import logging
    from enviro.constants import UPLOAD_SUCCESS, UPLOAD_FAILED
    from enviro.mqttsimple import MQTTClient
    import ujson
    import config

    def log_destination():
      logging.info(f"> uploading cached readings to MQTT broker: {config.mqtt_broker_address}")

    def upload_reading(reading):
      server = config.mqtt_broker_address
      username = config.mqtt_broker_username
      password = config.mqtt_broker_password
      nickname = reading["nickname"]
      try:
        # attempt to publish reading
        mqtt_client = MQTTClient(reading["uid"], server, user=username, password=password, keepalive=60)
        mqtt_client.connect()
        mqtt_client.publish(f"enviro/{nickname}", ujson.dumps(reading), retain=True)
        mqtt_client.disconnect()
        return UPLOAD_SUCCESS
      except Exception as exc:
        import sys, io
        buf = io.StringIO()
        sys.print_exception(exc, buf)
        logging.debug(f" - an exception occurred when uploading.", buf.getvalue())
      return UPLOAD_FAILED

This will log the actual exception that occurs when the MQTT broker fails. Meanwhile I have set up an Enviro Urban to log and send data to a MQTT broker that doesn't exist, to see if I can get some reproduction. This always causes an exception to occur (red flashing light), but is perhaps not the same exception you are experiencing, hence the need for extra logging. |
@ZodiusInfuser Just wanted to confirm, it is nothing to do with the secure mqtt I have set up, as my Urban which is physically outside also failed today and this just uses the default mqtt destination with no changes. Seems like the same issue at 23:05.
I will add the extra logging you mentioned above on the Urbans and Indoors to see if I can get more details. Although sometimes it seems like it gets a bit further than other times, based on the log messages. |
Thanks for that. The fact it fails at slightly different points after the exception could be a red herring, in that it's either still the exception at fault, or it's not the exception at all but some other factor. If the latter, then I have no idea how to begin diagnosing that! 😢 |
@ZodiusInfuser as these four boards all need to use secure mqtt I have modified it slightly, but it should capture the results as per the logging you suggested. This is the mqtt.py I am using with the extra logging added in.
|
Ah of course, thank you. By the way, for that SSL change, if you have tested it and are confident it works, could you raise a Pull Request for it please? There's a few things that may need tweaking, but that can be discussed in the PR |
I've just had the solid light issue (with flashing red led) occur. Best I can tell from the logs it's the same startup issue I was already aware of, as there were going to sleep messages, whereas your logs clearly show that not happening. |
@ZodiusInfuser Just got a failure on an Urban, it did work a few times beforehand:
Is it possible to do a build with the micropython version from 0.0.8 with the 0.0.9 firmware? Might prove if it's the updated micropython version. |
Thanks for that. I'll have a dig and see what that error means. In the meantime, yes you can try different micropython versions. Easiest way is to install the v0.0.8 micropython using this https://github.com/pimoroni/enviro/releases/download/v0.0.8/enviro-v0.0.8.uf2 Then put the v0.0.9 FW on using this https://github.com/pimoroni/enviro/releases/download/v0.0.9/enviro-v0.0.9-filesystem-only.uf2. Save a local copy of your config and mqtt files first though |
Have updated 2 x Urbans and 2 x Indoors to the 0.0.8 micropython with the 0.0.9 firmware using the above method, restored the saved config/mqtt/ca.crt files for each one and will monitor and see how they get on... |
Thanks @dave-ct. It will be interesting to hear. By the way, I noticed that in the v0.0.9

    def connect(self, clean_session=True):
    #def connect(self, clean_session=True, timeout=30): # TODO this was added to 0.0.8
        self.sock = socket.socket()
        #self.sock.settimeout(timeout) # TODO this was added to 0.0.8
        addr = socket.getaddrinfo(self.server, self.port)[0][-1]

This may resolve the ECONNABORTED exception, which the internet suggests is caused by the server rejecting the connection, or at least change it into an ETIMEDOUT exception. In either case, I should update the logic for |
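As a rough illustration of the timeout idea in the comment above, the sketch below bounds a connection attempt with `settimeout`, so an unreachable or silent broker raises an `OSError` instead of blocking indefinitely. It is plain Python rather than the actual `mqttsimple` code, and `open_socket` with its default of 30 seconds is a hypothetical helper introduced only for this example.

```python
import socket

def open_socket(server, port, timeout=30):
    """Open a TCP connection, giving up after `timeout` seconds.

    Hypothetical helper for illustration; not part of enviro's mqttsimple.
    """
    sock = socket.socket()
    sock.settimeout(timeout)  # applies to connect() and to later send/recv calls
    try:
        # Ask for an IPv4 stream address so it matches the default socket family.
        addr = socket.getaddrinfo(server, port, socket.AF_INET, socket.SOCK_STREAM)[0][-1]
        sock.connect(addr)
    except OSError:
        sock.close()  # do not leak the socket when the attempt fails
        raise
    return sock
```

A caller would wrap `open_socket` in the same `try`/`except` that already surrounds the upload, so a timeout ends up reported as a failed upload rather than a stalled board.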
@dave-ct I just wanted to let you know that I too tested with v0.0.8's Micropython and the v0.0.9 firmware, but still had the issue occur. Also this is the first time I have had your exact issue of needing to remove power to the board in order to fix things. I have left a board at the office in this error state so I can diagnose it further, but let me explain some of my understanding thus far. Apologies, it's rather long as I've kind of used this to air my own thoughts on what is going on.
When Enviro is on battery and not doing anything, it is in (let's call it) "super sleep", where the Pico W literally has its 3.3V power supply disabled. For all intents and purposes it is offline in this mode, with the only things getting power on Enviro being the RTC and user button. When either the RTC triggers its alarm, the user button is poked, or USB power is provided, the Pico W receives 3.3V and starts booting, turning the white LED on, setting HOLD_VSYS_EN to high to keep the board alive, and disabling the RTC_ALARM (and this red flashing light). When Enviro wants to go back to "super sleep" it sets HOLD_VSYS_EN to low, and as long as there's no external source preventing it (such as USB power), its 3.3V power gets deactivated.
The situation you (and others) are experiencing is a lockup during boot (we'll get to possible reasons later), sometime after the white LED is enabled but before Micropython has fully loaded, where the white LED would start pulsing. This point is likely before the RTC_ALARM has been reset, as indicated by the red light still flashing on your boards. That means that 3.3V power is constantly provided to the Pico W. This is significant because normally on Enviro, when reset is pressed whilst the board is awake, it will cause the board to turn off, requiring a "Poke" to start it running again. What we are now both experiencing is the reset button just resetting the Pico W.
Normally this would be fine, but the white LED just turns on and nothing happens. This should not be the case with a board reset, so there must be something persistently stalling the boot process of the RP2040. Incidentally this explains why you need to remove power from the board, as that is the only way to fully reset the state of everything on the board. For Enviro Urban the things this could be are: the RTC, the BME280, the analog microphone, or the Pico W's WiFi chip itself. If it is indeed the WiFi causing these lock-ups, then that's a much deeper problem to solve that ventures outside of the Enviro hardware and software and over to Raspberry Pi themselves. So that's where we're at... I wish I had an easy answer for you, but it's either one of those other internal components of Enviro that is causing the issue or the Pico W's WiFi itself. I will delve deeper over the next few days, and see if there's been any general Pico W reports that corroborate this theory. |
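The power behaviour described in this comment can be pictured with a small toy model. This is only a sketch of the reasoning, not firmware: the class name, the attribute names, and the assumption that pressing reset releases HOLD_VSYS_EN are all introduced here for illustration.

```python
class EnviroPowerModel:
    """Toy model of the 3.3V power latch described above (not real firmware)."""

    def __init__(self):
        self.hold_vsys_en = False  # driven high by the Pico W once it starts booting
        self.rtc_alarm = False     # stays asserted until the firmware clears it
        self.usb_power = False

    def has_3v3(self):
        # Any one of these sources keeps the 3.3V supply enabled.
        return self.hold_vsys_en or self.rtc_alarm or self.usb_power

    def press_reset(self):
        # Assumption: a reset restarts the RP2040, so the pin it was driving is released.
        self.hold_vsys_en = False
        return "reboots" if self.has_3v3() else "powers off"


# Healthy, awake board on battery: the alarm has been cleared, so reset cuts power.
awake = EnviroPowerModel()
awake.hold_vsys_en = True
print(awake.press_reset())

# Board hung during boot: the alarm is still asserted, so reset only restarts the Pico W.
hung = EnviroPowerModel()
hung.hold_vsys_en = True
hung.rtc_alarm = True
print(hung.press_reset())
```

Under these assumptions the model reproduces the observation in the thread: with the RTC alarm still asserted, only removing every power source brings the board back to a clean state.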
@ZodiusInfuser, thanks for the detail and efforts to find the solution. Since using v0.0.8's Micropython and the v0.0.9 firmware I have had no issues so far on any of the devices; even when there have been upload failures it recovers itself when it wakes up next. I will monitor over the next couple of days, as previously with the v0.0.9 Micropython it was happening regularly. |
Interesting. Perhaps my method of flashing my boards wasn't optimal. After all, I too have had an Enviro Weather and Indoor running for a few weeks now and they had v0.0.8's Micropython with the v0.0.9 (pre-release) Firmware on 🤔 Actually. Let me try it properly and get the system to generate a build with the right MP (v1.19.7) and v0.0.9 firmware 😆 : https://github.com/pimoroni/enviro/actions/runs/3624876714 |
No issues this morning, will also try the system build above later tonight to make sure. |
Thanks for the update. I've loaded an Enviro grow up with the firmware, so will leave that running through the work day. |
Putting this here for my own reference. It looks like there is nothing of note different between v1.19.7 and v1.19.10 that would explain the lock-ups 😕 pimoroni/pimoroni-pico@1.19.7...v1.19.10 |
@ZodiusInfuser could there be any changes in the upstream version of micropython that the Pimoroni build uses? Not sure how it works, but does it incorporate the official micropython release from the 18 June (1.19.1), or does it grab the latest updates to core micropython or one of the nightly builds which might have some changes in it? |
That was the first thing I looked for, but the commit code we're using hasn't changed between our release versions: From what I recall, 1.19.1 predates the Pico W, so we are using a newer build (but not the absolute latest). We would prefer to use a stable numbered release, but there has yet to be one. The last time we updated was for our 1.19.6. An update on my Enviro Grow. It has locked up several times today. Both times pressing the reset button recovered it from the state, so whatever happened with the Urban is seriously bad. I tried looking at the Urban with a scope to see if there was any merit to the theory of WiFi comms locking things up, but did not see any activity on the lines. So instead I tried scoping one of the Pico W's flash chip lines and it consistently fails at the same point. Locked Enviro Urban during boot: Don't know how useful this information is yet, so I'll need to correlate this data with different points in the code. Or try and get in touch with someone at Pi to help as this is starting to get beyond me :sad: |
The technique of recovering from a processor hang using a timer implemented in the PIO section of the Pico appears to work. The PIO code uses a 32 bit count down timer which negates HOLD_VSYS_EN_PIN when the count reaches zero. If the code has not reached the end of sleep within 4 minutes the power is removed. Note I duplicated the code to set the RTC alarm to just before arming the watchdog.
|
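To make the countdown idea concrete, here is a software stand-in for the behaviour described in the comment above. It is not the PIO program from that comment: `CountdownWatchdog`, `tick` and the `release_hold` callback are hypothetical names, and real hardware would count clock cycles rather than be ticked from Python.

```python
class CountdownWatchdog:
    """Software stand-in for a PIO countdown timer (illustration only)."""

    def __init__(self, timeout_s, release_hold):
        self.remaining = timeout_s
        self.release_hold = release_hold  # callback that would drive HOLD_VSYS_EN low
        self.fired = False

    def tick(self, elapsed_s=1):
        # Once fired, the watchdog stays fired; power would already be gone.
        if self.fired:
            return
        self.remaining -= elapsed_s
        if self.remaining <= 0:
            self.fired = True
            self.release_hold()


events = []
wd = CountdownWatchdog(4 * 60, lambda: events.append("HOLD_VSYS_EN low"))
for _ in range(4 * 60):
    wd.tick()  # the main code never reached sleep(), so nothing stopped the timer
print(events)
```

The key property is that the timer runs independently of the main program, so a hung processor cannot prevent the hold pin from being released.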
@Julia7676 Thanks for sharing, will give it a test |
That's brilliant @Julia7676! I'm embarrassed to say I never thought of using the PIO as a watchdog! I was too focused on using the system watchdog, but the timeout wasn't really long enough to be useful. I even tried ways of feeding the watchdog within the micropython C code itself, like in between it executing each line of user Python code, but that's not how their system is written. Just to get my understanding correct, this code initialises the watchdog during startup and sets the RTC in case code never reaches that part of the sleep() function? I also see you use the same text file trick as I do, to record that an issue did occur :) Regarding the comment about moving the RTC setting to a common function, note that there is this incoming pull request that modifies some of the code surrounding the RTC that may impact your logic: https://github.com/pimoroni/enviro/pull/132/files One possible extension of this, that could benefit users running their Enviros off USB, would be to have the PIO trigger an interrupt (assuming Python lets you create an interrupt handler). That function could then call |
@ZodiusInfuser I had not tested all the RTC parts as I did not have issues with that (only the mqtt bits we did). But have updated 2 x Urbans, 2 x Indoors and 1 x Weather with the code from https://github.com/pimoroni/enviro/pull/132/files |
I have been experimenting with the watchdog this past week. I am recovering about 1 to 2 hangs a day. I increased the number of hangs by generating a lot of WiFi traffic after each upload. I should have mentioned that the watchdog_live file needs to be deleted just before normal shutdown.
There is a small issue that will prevent the watchdog being effective when the RTC chip has its time changed, such as the first power up from cold. This probably can be ignored but might catch you out when testing. One way of testing is to remove the hold_vsys_en_pin.init(Pin.IN) line from sleep() while on USB power and then watch the vsys_en_pin go low. I have also been looking at reinstating battery monitoring. (It is a very good way of crashing the processor!) Can someone point me at the issue # so I can report my progress? After a lot of false starts I am close to a possible solution. |
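One plausible reading of the `watchdog_live` file mentioned above is a marker that exists only while a run is in progress. The sketch below shows that pattern under this assumption; the three function names are hypothetical and the actual code in the thread may differ.

```python
import os

WATCHDOG_MARKER = "watchdog_live"  # file name taken from the comment above

def arm_marker(path=WATCHDOG_MARKER):
    # Written at startup; if it is still present at the next boot, the last run hung.
    with open(path, "w") as f:
        f.write("armed")

def clear_marker(path=WATCHDOG_MARKER):
    # Called just before a normal shutdown.
    try:
        os.remove(path)
    except OSError:
        pass  # already removed, nothing to do

def previous_run_hung(path=WATCHDOG_MARKER):
    # os.stat raises OSError when the file does not exist.
    try:
        os.stat(path)
        return True
    except OSError:
        return False
```

On boot, the firmware would call `previous_run_hung()` before `arm_marker()`, and log a message if it returns `True`.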
@Julia7676 thanks for the pointer, that may be why when one of them froze it did not recover. Have updated and will see how they get on. |
Hi @Julia7676, thank you for your experimenting. Any chance you can raise a pull-request with your watchdog in? That way it will be easier for people to grab and test, and for code changes to be tracked. In the meantime I will merge the adafruit-io-fixes branch, to avoid later conflicts. Regarding battery monitoring, there is no specific issue listed for the battery monitoring being disabled (because it was disabled and noted by the last release). You are more than welcome to create one though. My attempt to resolve it has been to inject ADC reading within Micropython itself before the WiFi gets initialised. This seemed to work for me when I tested locally, but getting our CI system to generate a usable customer build is proving to be a challenge: |
Have given it a go with the following init.py but still getting hangs.
The logs show as follows
|
Hi @dave-ct, Just used PyCharm to compare your code with mine. It took me a little while to spot that you had put the watchdog stuff in, as your code is verbatim with mine in that part of the file. After pondering why it was not working I realised that you are correct in questioning the 4000000ms delay. This equates to a little over 1hr 6 minutes. The 2nd value in the DELAYOFF constructor is the required delay in seconds. This needs to be more than the expected execution time but must not be so long that it happens after the alarm. I was thinking 4 * 60, but my finger typed 4 * 1000. Sorry. With that delay value, the watchdog had no chance of working. I did my testing with a delay of 90 seconds and a reading interval of 2 minutes. The simple fix is to replace 4 * 1000 with 4 * 60, but if you know how long it is to the alarm, the delay value should be just less than this by a safety margin to ensure they occur in the correct order. My thinking is that if you can get the RTC code to work out the seconds to the alarm, you can give the maximum time for execution without compromising the operation of the watchdog. My job now is to go and retrieve my live weather enviro and change it so that it works. |
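The arithmetic in the comment above can be checked in a few lines, together with a sketch of deriving the delay from the time left until the alarm. `watchdog_delay_s` and its 30 second default margin are hypothetical, chosen only to illustrate the "just less than the alarm" rule.

```python
def watchdog_delay_s(seconds_to_alarm, safety_margin_s=30):
    """Return a watchdog delay that expires shortly before the next RTC alarm.

    Hypothetical helper: the margin keeps the power cut and the alarm in order.
    """
    delay = seconds_to_alarm - safety_margin_s
    if delay <= 0:
        raise ValueError("alarm is too close to fit a watchdog delay")
    return delay


mistyped_s = 4 * 1000  # what was typed: 4000 seconds
intended_s = 4 * 60    # what was meant: 240 seconds

print(mistyped_s // 60, "minutes and", mistyped_s % 60, "seconds")
print(watchdog_delay_s(2 * 60))  # delay for a 2 minute reading interval
```

With a 2 minute reading interval the mistyped value expires long after the next alarm, which is why the watchdog never had a chance to act.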
@Julia7676 thanks for confirming and the explanation. |
@ZodiusInfuser have created a pull request with the code from @Julia7676, might need some further testing but will allow others to test before it is merged. Pull request #144. |
@dave-ct Yes, long updates after an outage are possible. A design decision is needed to decide what should happen. We could allow the destination module to request an extension to the watchdog and RTC, or force the destination module to yield to allow the processor to shut down so that the next scheduled reading is not delayed. |
@dave-ct Another option would be to set the contingency alarm to say plus 30 minutes and the watchdog to 29 minutes. Normally the contingency alarm would be overridden by sleep(). The downside of this is that recovery from processor hangs will take a bit longer. My preferred fix would be to add a break to the loop in upload_readings(). This could limit the number of readings per go, or test to see how close we are to the next alarm. |
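The "break in the loop" idea from the comment above could look roughly like this. It is a standalone sketch, not the real `upload_readings()`: the function name, the `max_per_wake` limit, the `deadline_s` parameter and the injectable `now` clock are all assumptions made for illustration.

```python
import time

def upload_with_budget(readings, upload_one, deadline_s, max_per_wake=10, now=time.time):
    """Upload cached readings until a count limit or a deadline is reached.

    Returns the number of readings uploaded; the rest stay cached.
    """
    uploaded = 0
    for reading in readings:
        if uploaded >= max_per_wake or now() >= deadline_s:
            break  # leave the remainder for the next wake-up
        if not upload_one(reading):
            break  # stop on the first failure; a retry now is likely to fail too
        uploaded += 1
    return uploaded
```

Here `deadline_s` would be set a little before the watchdog expires, so a large backlog after an outage is drained over several wake-ups instead of delaying the next scheduled reading.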
@Julia7676 I think the simplest solution is just to add two settings to the config file, one for each timer, so each user can tune to their own preferences. So far my external Weather and Urban ones have been going strong for about 1-2 months (readings every 10 minutes and uploads every 3) and only the ones inside, that take more frequent readings, have the issue. The main aim for me is that I don't want to retrieve the Weather enviro from the top of a pole every time it freezes, and I would be happy if it takes 30 minutes to recover and not take readings. But I see your preferred approach is much more elegant. |
@dave-ct In that case I would put two advanced settings in the config file: the contingency alarm set at 30 (minutes) and the watchdog delay set at 29*60, with an option of say zero to disable the watchdog. It would make sense to change the number sent to delay of to be in minutes vis
|
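A minimal sketch of the two settings discussed above, assuming both are expressed in minutes and that zero disables the watchdog. The function and parameter names are hypothetical and do not come from the enviro config file.

```python
def watchdog_seconds(contingency_alarm_min=30, watchdog_min=29):
    """Translate two hypothetical config values into a watchdog delay in seconds.

    Returns None when the watchdog is disabled (watchdog_min == 0).
    """
    if watchdog_min == 0:
        return None
    if watchdog_min >= contingency_alarm_min:
        # The watchdog must cut power before the contingency alarm fires,
        # otherwise the alarm cannot wake the board afterwards.
        raise ValueError("watchdog must fire before the contingency alarm")
    return watchdog_min * 60


print(watchdog_seconds())       # defaults suggested in the thread: 30 and 29 minutes
print(watchdog_seconds(30, 0))  # watchdog disabled
```

Validating the ordering at startup catches the kind of unit or typing slip discussed earlier in the thread before the board is deployed.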
@Julia7676 @ZodiusInfuser have made changes so you can configure the watchdog timer in the config file and also disable if needed, set to default of |
@Julia7676 @ZodiusInfuser I have been testing the new PIO watchdog code for a few days and tonight after a failure it looks like the
|
Have had the same issue happen on another Urban Logs
Weather Logs
Indoor Logs
|
Hi, I'm having huge issues with my Enviro Urban, too. I'm currently running 2 Enviro Indoors flawlessly, while the Enviro Urban works for a while, then it stops, usually remaining with the red led or the pulsing white. But in my case the log ends with the log "connecting to wifi network":
I've tried with new batteries, with a different wifi router and network, I tried the stock 0.0.9 firmware and also the latest master branch with the latest MicroPython... but the issue is still there... |
Hi @lornova. Lockups are sadly an ongoing issue with Enviro, that seem to stem from the Pico's WiFi in some form. Such lockups can either occur during Micropython boot-up (for which there is currently no fix) or during Enviro's shutdown. It is unclear from your log which of the two you are experiencing, but lockups following a WiFi issue do tend to be shutdown lockups. Also the white LED pulsing is another sign of it being a shutdown lockup. The fact it only happens for your Urban and not Indoor is puzzling. One thing you can try is to grab the code from this Pull Request (#144). It aims to deal with shutdown lockups by using the Pico's PIO to trigger a board reset (when on battery) after a certain time. If this lets your Urban keep working, then that confirms it was a shutdown lockup. If not then it is a startup lock-up, which I cannot offer a solution for at this time. |
Thank you, I'll try with that patch.
I believe that I'm experiencing both lockups, as sometimes the board doesn't boot at all and just has the red led as soon as I power it. I also sometimes have issues when connecting to the PC, with the board not being recognized as a USB device (I've seen that in another issue, I increased the sleep at startup and I'll keep monitoring the issue).
I have kept my Urban on the windowsill, inside the Stevenson screen, for a couple of weeks, and back then I had almost no lockups. Then it suddenly stopped working and my fear was that it broke due to the dew. Then I used flash_nuke.uf2 and started from scratch (I had corrupted time in the RTC and I had to manually fix it via the interactive Python interpreter). Since then I fear placing the board outside... |
I seem to be seeing this issue with my Urban now. The Weather and Indoor don't seem to be affected. Power cycling fixes it. I'll try pulling the fix from #144 and see if it helps. I'm taking readings every 5 minutes, and sending every 6 readings. 2023-02-11 11:45:03 [debug / 115kB] > performing startup |
I ran the PR #144 on my weather board that has been experiencing the crash where it fails to wake up and hangs with the solid white light and no log entries after going to sleep as mentioned in #87. This watchdog failed to help there and the logs looked identical, the RTC was correct time, so the RP2040 just appears to be in a state where it can't be awoken and a reset is needed, I'm unsure if the watchdog fired or if that was also hung. I flashed the board to micropython 1.19.17 beforehand. I ran the board for a couple of days successfully on USB, but moving the board out the weather station I got a couple of crashes with I think a red light flashing or solid, requiring power reset where the last log is about failing to upload to influxdb (local database server on the LAN). I am hoping #144 mitigates that while the cause is hunted down, I will try it outside again with the watchdog running and see if the watchdog catches it. |
Late to the party but wanted to at least chime in on what I am seeing. I too am getting the solid white light. I have to cut power, turn it back on and hit the poke before it works right again. Sometimes I get my data, sometimes it's just gone. I see it getting posted in the log, but checking the local server log I can see the POST request never made it to the server. Now I have to reset it pretty much every day. From what I have read, this is failure to wake up, so I will try having it run continuously. I've never gotten an error in the log. 2023-09-10 21:45:03 [info / 138kB] > performing startup 2023-09-13 17:50:03 [debug / 120kB] > taking new reading |
I fixed nearly all of my issues by applying fixes to the wifi handling presented by another user. I captured these into PR #199 and have been running stable since then. Worth a try to see if your issue is caused by the same? |
@ZodiusInfuser have tried 0.0.9 on the Urban today (also on 2 x enviro Indoors which don't seem to have any issues so far). On the Urban it worked for about 4hrs then stopped. Looking at the logs, an error occurred while uploading via mqtt, which then tried to go back to sleep but looks like it failed (last entry at 15:20), as the red LED was flashing probably for about two hours and this was at about 17:00. When I brought it in I took a picture before disconnecting the battery, as it also seems the white LED is partially on.
Log Files
Photo of board LEDs when I brought it in (red LED was flashing and white LED was part lit).