Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execute MQTT client synchronously in main loop() #1201

Closed

Conversation

schlimmchen
Copy link
Contributor

processing a published valued on a subscribed topic is currently running in a task that is not the task executing the main loop(). that's because the espMqttClient(Secure) was constructed without arguments, which selects the constructor with two arguments (priority and core), both of which have default values. that constructor selects espMqttClientTypes::UseInternalTask::YES, causing a task to be created in whose context the MQTT client loop is executed, which in turn executes all subscriber callbacks.

MQTT subscribers assume they are running in the same context as the main loop(). most code assumes exactly that. as the scheduler is preemptive and very little code is interlocked, we have to make sure to meet the programmer's expectations.

this changeset calls the MQTT client loop in the context of the main loop() and enforces the use of espMqttClientTypes::UseInternalTask::NO.

this fixes a real-world issue observed by @the-lonely-one in hoylabs#268 (comment): using MQTT, the inverter power limit was set with an interval of ~20 seconds. after at most 15 hours, the inverter would not be sent any limit command any more.

that symptom could be observed due to #1093, which is already merged in the downstream project OpenDTU-OnBattery. the CMD_PENDING flag would be set after the command was queued and processed, due to preemptive scheduling. since the command is gone, the CMD_PENDING flag is never reset.

this change was already tested by me, @helgeerbe, and more importantly by @the-lonely-one, who confirmed that the observed issue is gone with this change.

please note that even if #1093 is never merged (:disappointed:), there is still an issue with the non-thread-safety of this code and other pieces of code:

    ActivePowerControlCommand* cmd = _radio->enqueCommand<ActivePowerControlCommand>();
    cmd->setActivePowerLimit(limit, type);
    cmd->setTargetAddress(serial());
    SystemConfigPara()->setLastLimitCommandSuccess(CMD_PENDING);

if the scheduler changes contexts right after the first or second line, the second or third line will write freed memory when returning to the context that enqueued the command: the command will be dequeued in another context, rejected as it has no target serial set, and will then be deleted (shared_ptr goes out of scope).

it might be beneficial to interlock this and all other public interfaces to the hoymiles library. until then, I consider this changeset an important bugfix. even if the public interface of the hoymiles library is at some point interlocked, I still argue that this changeset is important. as I wrote above: most people assume their code is executed in the context of the main loop(), and more problems are to be expected if MQTT subscribers run in a separate context.

processing a published valued on a subscribed topic is currently running
in a task that is not the task executing the main loop(). that's because
the espMqttClient(Secure) was constructed without arguments, which
selects the constructor with two arguments (priority and core), both of
which have default values. that constructor selects
espMqttClientTypes::UseInternalTask::YES, causing a task to be created
in whose context the MQTT client loop is executed.

MQTT subscribers assume they are running in the same context as the main
loop(). most code assumes exactly that. as the scheduler is preemptive
and very little (none at all?) code is interlocked, we have to make sure
to meet the programmer's expectations.

this changeset calls the MQTT client loop in the context of the main
loop() and enforces the use of espMqttClientTypes::UseInternalTask::NO.
@tbnobody
Copy link
Owner

tbnobody commented Aug 2, 2023

In this case you would fix the problem with the mqtt interface but not with the web api which also runs in a seperate thread... therefor this is not the solution for the entire problem.

most people assume their code is executed in the context of the main loop()

In this case I would argue that the people learn it :)

@schlimmchen
Copy link
Contributor Author

I'm with you on this train of thought. The whole project should work well with multiple tasks, which should improve responsiveness. And even help making use of all cores... Personally, I am not afraid of using mutexes.

And I also already suspected that I will find the same problem in the web server implementation. Especially since multiple people in the downstream project are suspecting that the OpenDTU(-OnBattery) is unstable (reboots) if the web app is used for longer periods of time or with multiple clients at the same time.

However, I don't see mutexes anywhere except for the Hoymiles library, and even there the public API is not yet thread-safe, as I pointed out. It takes significant effort to fix all of this. Not even the access to the global config is thread-safe. Until at least an attempt on thread-safety is made, I would prefer if we could agree that we need to reduce the contexts in which externally triggered events are handled to a single one.

That includes MQTT subscribers and web application handlers in particular.

What do you think? Start tackling thread-safety immediately, risking deadlocks and missing vulnerable blocks, or reduce the contexts to gain stability until we can find time to address thread-safety throughout the whole project, one component at a time?

@tbnobody
Copy link
Owner

tbnobody commented Aug 2, 2023

In a first step, only completly assembled commands should be placed into the queue.
It will look like this:

    auto cmd = _radio->prepareCommand<ActivePowerControlCommand>();
    cmd->setActivePowerLimit(limit, type);
    cmd->setTargetAddress(serial());
    _radio->enqueCommand(cmd);

And the command queue is also thread safe now

@tbnobody
Copy link
Owner

tbnobody commented Aug 2, 2023

What do you think? Start tackling thread-safety immediately, risking deadlocks and missing vulnerable blocks, or reduce the contexts to gain stability until we can find time to address thread-safety throughout the whole project, one component at a time?

Since the entire web server is implemented to run in a separate thread, and I am unwilling to change it to a synchronous one, the root cause must be fixed in any case.

@schlimmchen
Copy link
Contributor Author

In a first step, only completrly assembled commands should be placed into the queue.

Again, I am with you, I had this exact same aspect in mind when I saw that. It's not critical any more if the interface is locked by a mutex, but I think it is bad design to add an incomplete data structure to a container.

and I am unwilling to change it to a synchronous one

Hm. I really was hoping I could convince you to consider this a hotfix.

the root cause must be fixed in any case.

Yes. It's nice to see that you want it done properly. Yet, I don't see it happening short-term.

So, what's the plan? I am sure you understood the issue and I think you are taking it seriously. How can we make progress? Do you think about tackling this by yourself?

I saw you using xSemaphoreTake and xSemaphoreGive and macros around it. Is that just personal taste and experience or are there issues with std::mutex and std::lock_guard and friends on ESP32?

@stefan123t
Copy link

@schlimmchen

I saw you using xSemaphoreTake and xSemaphoreGive and macros around it. Is that just personal taste and experience or are there issues with std::mutex and std::lock_guard and friends on ESP32?

I don't know if this is from the AhoyDTU legacy of some code or the same choise was made in OpenDTU too. What I recall is that the std::mutex and other std:: implementations would not work / compile on ESP8266 in Ahoy.

That said it was determined that it is necessary to handle the NRF (and probably CMT) interrupts in due time and therefor the Mutex/Semaphore were probably impemented in OpenDTU and some other way in Ahoy.

I find your idea to "reduce the contexts in which externally triggered events are handled to a single one" quite compelling. Though I do not know how this would work for the timing sensitive NRF and CMT code or whether this is required for the Async WebServer and TCP libraries which are included from upstream, as well ?

IMHO this would require first a change in the used MQTT Client library too, in order to make it Async. Or are we using an Async MQTT library already ?

@tbnobody
Copy link
Owner

tbnobody commented Aug 4, 2023

but I think it is bad design to add an incomplete data structure to a container.

Thats the reason why it is implemented differently now.

So, what's the plan?

If you look at the commits from yesterday you will see some commits related to that issues. In total it should address more locations as a change of the mqtt library scheduler would have done.

IMHO this would require first a change in the used MQTT Client library too, in order to make it Async. Or are we using an Async MQTT library already ?

The MQTT handling is currently running in a separate thread. It is not handled in the context of the main loop. (Same as AsyncTCP which is used by the webserver and also spawns a separate thread)

@schlimmchen
Copy link
Contributor Author

Though I do not know how this would work for the timing sensitive NRF and CMT code

Thanks for chiming in, @stefan123t. I am not an expert on the Hoymiles library, though I learned a lot the last weeks. As far as I can tell, all the heavy lifting of the Hoymiles library is done in main loop() anyways. So, there is no change required in the Hoymiles library when trying to reduce the amount of concurrent tasks within the project.

If you look at the commits from yesterday you will see some commits related to that issues.

Nice! Now let's move all lines [...]()->setLast[....]Success(CMD_PENDING); above the respective _radio->enqueCommand() line and I am happy with this aspect. I argue that this change is necessary because the last command success being CMD_PENDING might be invalid: Fully assembled command is enqueued in webserver or MQTT context, then scheduling happens, command is processed, and later the original context sets CMD_PENDING even though the command is already processed and CMD_PENDING will not be reset.

More feedback:

  • What's the use of prepareCommand? The command instances can be as well created using std::make_shared directly.
  • All commands are derived from CommandAbstract. There is no need to template enqueCommand(). It can simply be void enqueCommand(std::shared_ptr<CommandAbstract> cmd).

None of it is important. Take it or leave it. I like exchanging notes, maybe you like getting feedback. I usually do.

Are you planning to keep hunting these concurrency issues?

@stefan123t
Copy link

@schlimmchen thanks for the explanation behind your reasoning. To be honest I and Lukas have been hunting (and haunted 😀) by these concurrency issues in the AhoyDTU code for quite some time. Though tbnobody's code is much higher quality and I do not know what his current overall status with regards to concurrency of various parts of the code is. The original code by Hoymiles for the DTU Pro has a couple of timers and timeouts implemented to get the low-level timing hopefully mostly right.

Here are the two commits that tbnobody mentioned above:
832df5a
0bdee6e

I have checked the current master code for the xSemaphoreTake / Give macros you described:

  • src/Datastore.cpp
  • ‎lib/CMT2300a/cmt_spi3.c
  • lib/Hoymiles/src/Hoymiles.cpp
  • lib/Hoymiles/src/parser/DevInfoParser.cpp
  • lib/Hoymiles/src/parser/SystemConfigParaParser.cpp
  • lib/Hoymiles/src/parser/AlarmLogParser.cpp
  • lib/Hoymiles/src/parser/StatisticsParser.cpp

The parsers are used to read the response from the inverter where as Hoymiles.cpp (and cmt_spi3.c) are used to communicate with the inverter via NRF / CMT modules. Actually @LennartF22 and the creators of the openDTU Fusion v2.1 PCB are waiting for the merge from https://github.com/LennartF22/OpenDTU/tree/spi-eth and @dAjaY85 for a Display library update which would also include some changes to share the SPI for CMT, Display and NRF. Actually the NRF library is the reason tbnobody uses GPL-v2-or-later as being a derived work of https://github.com/nRF24/RF24 I do not know exactly what RF24 lib code maintainers actually use to make theirs IRQ / thread safe. Whereas the CMT code is from an application example.

Now to your proposed changes:

Nice! Now let's move all lines ...->setLast[....]Success(CMD_PENDING); above the respective _radio->enqueCommand() line and I am happy with this aspect. I argue that this change is necessary because the last command success being CMD_PENDING might be invalid: Fully assembled command is enqueued in webserver or MQTT context, then scheduling happens, command is processed, and later the original context sets CMD_PENDING even though the command is already processed and CMD_PENDING will not be reset.

What's the use of prepareCommand? The command instances can be as well created using std::make_shared directly.

Yes, this could probably be used directly, but I think tbnobody wants to make its use / reason explicit.

While I can understand why you would want to reverse the order, ie. especially the last two lines:

    auto cmd = _radio->prepareCommand<AlarmDataCommand>();
    cmd->setTime(now);
    cmd->setTargetAddress(serial());

    _radio->enqueCommand(cmd);
    EventLog()->setLastAlarmRequestSuccess(CMD_PENDING);

I would expect these two actions to be atomic and therefor enclosed in one of those xSemaphoreTake / Give macros to make this thread safe too.

@schlimmchen
Copy link
Contributor Author

This change is clearly not wanted here. I can understand that. Thanks @tbnobody for making improvements regarding the enqueuing of Hoymiles radio messages and using locking mechanisms to guard the public interface of the Hoymiles lib.

With the introduction of TaskScheduler, this change is also effectively gone in the downstream project OpenDTU-OnBattery. We will observe whether or not the problem we observed there is coming back. It probably does not, as you tightened down the public API of the Hoymiles library pretty well.

@schlimmchen schlimmchen deleted the mqtt-no-task-tbnobody branch February 29, 2024 17:52
Copy link

github-actions bot commented Apr 1, 2024

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants