Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Devices without internet access fail ~50% of API calls (w/ ESP8266/nodeMCU) #90

Closed
jknaack opened this issue Aug 23, 2023 · 12 comments · Fixed by #99
Closed

Devices without internet access fail ~50% of API calls (w/ ESP8266/nodeMCU) #90

jknaack opened this issue Aug 23, 2023 · 12 comments · Fixed by #99

Comments

@jknaack
Copy link
Contributor

jknaack commented Aug 23, 2023

If your device doesn't have access to an NTP server, ~33% of the time, the API is unresponsive. You can test this as follows:

  1. Blocking internet access from your OpenGarage device at your router
  2. Browse to your device in a web browser (say Firefox)
  3. Open the debug tools (F12) and monitor the network traffic
  4. Refresh the page
  5. Reboot the OpenGarage

You'll start to see that about 1/2 the calls to /jc fail. During the times that they fail/timeout, nothing works.

Some other interesting observations:

If you are in this situation, you may see other side effects. For example, I noticed that my network signal strength was much weaker if I was blocked from accessing the internet than when I had access. This is a particularly subtle and problematic side effect, and is probably why I even noticed an issue in the first place. I thought I just didn't have enough signal strength in my garage, but it turned out the issue was that I didn't have internet access.

This can also happen if you have a misconfigurated DNS. It turned out that my pi.hole was blocking DNS requests from this device, but when I fixed that it still didn't work because I was blocking the device from the internet anyway.

The root cause turns out to be the delays in the time_keeping() function:
https://github.com/OpenGarage/OpenGarage-Firmware/blob/8335fe6b9af40cde95a49ae925bcd83757fc6b6b/OpenGarage/main.cpp#L1254C1-L1322C1

In particular, the default NTP servers are time.google.com, time.cloudflare.com, and time.windows.com, all require an internet connection to use, and if you don't have one available, the delays at https://github.com/OpenGarage/OpenGarage-Firmware/blob/8335fe6b9af40cde95a49ae925bcd83757fc6b6b/OpenGarage/main.cpp#L1278C1-L1278C16 cause the device to hang while delays are being called.

I'm not sure why API calls fail ~50% of the time instead of ~33% of the time, because there are 30 seconds of delay/retries, followed by 60 seconds of not calling delay. I would expect failures 30/90 seconds, but it is what it is.

I believe the correct solution is to stop using delay completely, because delay blocks device critical, behind-the-scenes operations, like managing the wifi state. I'll submit a MR if you are interested in it.

jknaack added a commit to jknaack/OpenGarage-Firmware that referenced this issue Aug 23, 2023
…th a very simple state machine to handle transitions and delays using the regular loop
@jknaack
Copy link
Contributor Author

jknaack commented Aug 23, 2023

I meant to mention there are a few workarounds depending on the root issue.

If your issue is your DNS can't resolve, you can put an IP address in the NTP Server Advanced setting.
If your issue is that you don't allow any external internet access, you'll have to set up an NTP server inside your private network and set that in the NTP Server Advanced Settings.

@jknaack
Copy link
Contributor Author

jknaack commented Aug 24, 2023

I've implemented a solution that keeps the current behavior and connected that merge request. If you are interested, I've also implemented a solution that uses a backoff strategy for errors. It basically starts at 1 second and doubles the wait time until 30 minutes, then it resets to 1 second. This means you'll get ~25 calls every hour to the ntp server during a failure state, rather than ~600/hour, but it'll never completely give up.

You can see the implementation at https://github.com/jknaack/OpenGarage-Firmware/tree/no_internet_causes_api_hangs_backoff_strategy

@rayshobby
Copy link
Contributor

What's the reason you want to block internet access to OpenGarage? Most users want to be able to remotely access OpenGarage to monitor the garage door status and trigger actions. Also, the device needs a valid NTP server to be able to obtain current time. It doesn't have a real-time clock on board so it needs to get the time from the Internet. If you block Internet access, you can at least provide a custom NTP server, for example, most routers can function as NTP server. So I am not sure how common is the situation described here, that is: internet access is blocked, and also you can't provide a local NTP server to the device.

@jknaack
Copy link
Contributor Author

jknaack commented Aug 24, 2023

Hi there, thank you for reaching out! I really appreciate all you've done with this device.

I try to keep all of my IoT devices isolated from the internet, and only access them from inside the network, via a VPN when I'm not at home. While not the most common setup, it's pretty common in the zero-trust model where you want to be confident that all IoT devices are locally-only.

I looked through the code and did a good bit of testing, and I can't find anything functionally wrong when the clock restarts to the epoch (1970-01-01) after a reboot and not calling an NTP server. API calls to open, close, and check the status of the garage work fine, and I believe a local MQTT connection would behave as well (although I haven't tested that). Blynk, OTC, and IFTTT aren't going to work if you restrict access no matter what, so I didn't investigate how those would handle the time being out of sync.

Like I mentioned in the workarounds, I can set up an NTP server, but it's an unexpected requirement, manifests in surprisingly negative ways (see below), and is a pretty large expectation for something that can be addressed by adhering to best practices (my understanding is that any calls to delay on these devices is bad because also blocks background processes, like wifi handling).

What was particularly surprising was how badly the delay degraded the WiFi. I was almost convinced that either my device's wifi hardware had gone bad or the router was failing because unless I moved the device within line-of-sight of the router, and usually not more than a few feet away, both the router and the device showed terrible signal. Apparently sleeping in a loop really hurts the wifi background stuff. Unless you have a debugger attached, there's really no way to figure out that this is the problem. It wasn't until I hooked up a debugger and noticed that the failures only happened during ntp error logging that I caught the correlation.

@rayshobby
Copy link
Contributor

NTP time sync is required for logging -- without the current time, the logged events will all be at incorrect times. You might not need this feature but most users want the logging feature.

@jknaack
Copy link
Contributor Author

jknaack commented Aug 25, 2023

That's true, logging would only be accurate as a time-since-boot log, and would reset every time it boots rather than the actual time. There's probably no perfect solution in that case, but it seems like the solution of a degraded network connection (meaning failed requests) AND an incorrect clock seems sub-optimal to a system that just has a clock-failure but otherwise works.

As for whether logging is important, I agree that it is important, but it seems like it is preferable to have the clock logging be incorrect (or disabled?) and use the device if an NTP server is unavailable, rather than having the clock incorrect and not be able to use it.

I'm not sure if your position is that without correct time logging, it's not even appropriate to allow opening/closing/status checking on the garage at all. If it is, it seems like that expectation still shouldn't impact the wifi connection itself, but rather should reject requests and provide an error to the user; possibly via an error code, a message on the home page or the log page, etc.

Another option would be to have the log page (or the home page?) warn the user that the clock is not set correctly, but the system could work as expected otherwise. The system could either disable logging, or log since uptime in those cases.

There are plenty of other options as well, and they are all really tradeoffs in a degraded state (inaccurate clock). In general, in a degraded state, the system should behave as well as possible, and currently, that doesn't happen. Currently, if the clock can't be set, the wifi signal itself is degraded, sometimes to the point of completely losing connectivity, which seems like the incorrect type of degradation.

For what it's worth, it's also very difficult to diagnose the problem, which means it's going to be very difficult for anyone else to fix if they run in to it. It's easy enough to recognize that an NTP server would fix the issue now, but this was very difficult to diagnose. I suspect it's very uncommon for wifi connection issues to be caused by NTP issues; normally I would expect the inverse to be the case, if I don't have a good wifi connection, I shouldn't be surprised that I have issues calling other devices.

@idxman01
Copy link

What's the reason you want to block internet access to OpenGarage? Most users want to be able to remotely access OpenGarage to monitor the garage door status and trigger actions.

The vlan isolation jknaack describes is very common in the custom home automation scenario. That’s what I do as well using HomeAssistant, run locally on a vm. It makes api (or mqtt) calls to OG internally and remote access is handled through HA.

@edwardreed81
Copy link

I've been fighting poor connectivity on my 2 OG devices for months now and finally ran across this issue.

I also have my OG devices isolated on a specific IoT VLAN with no internet access. I put in my pfSense router-hosted NTP server and voila, connectivity was back to normal.

This is a really odd fix.

@BillyCroan BillyCroan mentioned this issue Oct 31, 2023
@rayshobby
Copy link
Contributor

So if NTP sync is what's causing the problem, would it suffice to provide an option that disables NTP. This is pretty easy to do. That way it will largely eliminate the time_keeping function when this option is turned on.

I could also borrow the code from OpenSprinkler, which handles NTP sync more gracefully -- it's implemented so in OpenSprinkler because OpenSprinkler has an on-board real-time clock, so if one NTP fails, it's not a big deal, as long as it gets one valid time once say every couple of days it should be fine. In contrast, in OpenGarage there is no real-time clock, so it's important to ensure correct time.

@jknaack
Copy link
Contributor Author

jknaack commented Jan 8, 2024

I don't know how OpenSprinkler handles it, but I would assume an OpenGarage instance running on an isolated VLAN will never get a time, at least mine (and several others above, won't). At least part of the time_keeping() function is necessary to keep curr_utc_time and curr_utc_hour up-to-date. If those don't move forward, other things will break.

The original MR I posted disables the time if the IP address is 0.0.0.0. It's also easy enough to add as an option, but would requires a UI change as well as taking up a tiny bit more disk space. I don't know which is best, though.

Disabling NTP only fixes half the problem, though. The usage of delay will still exist and will cause network connection issues on an isolated network until you disable NTP (assuming that's a feature). It's not obvious that a user should turn off NTP to fix the wifi connection issue, and the only way to fix that is to avoid using delay completely. (Avoiding delay also helps with "regular" deployment, although they are not in a tight loop so not causing as many issues.)

I also have another branch that uses disables NTP, and also uses a backoff strategy to retry instead of retrying calling an NTP server every N seconds. It allows the device to run smoothly for users who don't have internet access, but doesn't disable NTP in case they do hook up the internet. At the same time, it significantly reduces the number of network failures compared to the original version (also linked above in this comment).

@joaquinagomes
Copy link

Do you have the compiled firmware with this problem fixed? When I reboot my router the OG hangs more than 50% of the time and I have to physically power off and power on ESP32 for OG to work again.

@rayshobby
Copy link
Contributor

A solution to address this issue has been added to the next firmware branch:
c8d1880
it is similar to PR #91 but is simpler as it doesn't use state transitions.

To respond to @jknaack's earlier question: disabling NTP doesn't mean completely commenting out everything in the time_keeping function. It just means not calling the time function. The calculation of curr_utc_time will still run as that only depends on the millis() function, which uses the mcu's internal clock and has nothing to do with NTP. This can be done easily but I didn't put it as an option in the firmware as I feel this would be rarely used.

In any case, if it can't get correct time, all the log records will have wrong times. So if you plan to use OpenGarage with no internet, AND you don't provide a local NTP server, then the log records will all be polluted. If you don't plan to look at the log, I guess this is ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants