Connection lost when one slave does not respond #2431
With v3.7.4, we have a more intelligent disconnect than earlier. If a device does not respond, a counter is increased. This means that, in normal situations, where you read from several devices, you only get a delay, not a disconnect. Sometimes the USB hangs, so not disconnecting at all is not a valid option.
Be aware that retries is something else.
If you want more help, then please add a debug log that shows what happens.
I think your fix is a good solution. Thanks!
Yeah, I've unfortunately had to resort to monkeypatching out this broken (for my use case, and it seems a number of others) auto-disconnect behavior by overriding the class method that calls the disconnect. However, I have to update the monkeypatch nearly every pymodbus release, and I haven't figured out a better option to prevent everything from breaking horribly. Unfortunately, my change adding a configuration option to inhibit the disconnects was rejected.
I mean, it literally is the best option for many use cases, especially for non-USB hardware RS-485 ports with any sort of code that relies on auto-probing devices, where you expect requests to not always return responses. Making it difficult to disable this auto-disconnect brokenness is super annoying.
Not expecting a response is actually something you can define. I have no idea how you monkeypatch the disconnect, but there are a number of easy and future-proof ways to do it; we do it in our test harness. And yes, your PR was rejected, but with an explanation: we cannot maintain every special case!
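Presumably "define" here refers to the no_response_expected flag visible in the execute() signature quoted below. A minimal sketch, assuming a pymodbus 3.8-style client whose convenience methods accept that flag (a broadcast write, the classic case where no reply ever comes):

from pymodbus.client import AsyncModbusSerialClient

async def fire_and_forget(client: AsyncModbusSerialClient) -> None:
    # With no_response_expected=True, execute() returns right after sending,
    # so none of the timeout/retry/disconnect accounting runs.
    await client.write_coil(0, True, slave=0, no_response_expected=True)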
Except it's not that we know there won't be a response. We're using various requests to probe for devices at known addresses for device auto-configuration/detection, so we want a response; a lack of a response is not really an error, but simply, in most cases, a device not being present. A disconnect is highly problematic here and should never happen. For example:

# NOTE: internal pymodbus imports (3.8.x paths; these move between releases)
import asyncio

from pymodbus.exceptions import ConnectionException, ModbusIOException
from pymodbus.logging import Log
from pymodbus.pdu import ModbusPDU
from pymodbus.transaction import TransactionManager

async def execute(
    self, no_response_expected: bool, request: ModbusPDU
) -> ModbusPDU | None:
    """Execute requests asynchronously.

    REMARK: this method is identical to sync_execute, apart from the lock and try/except.
    Any changes in either method MUST be mirrored !!!
    """
    if not self.transport:
        Log.warning("Not connected, trying to connect!")
        if not await self.connect():
            raise ConnectionException(
                "Client cannot connect (automatic retry continuing) !!"
            )
    async with self._lock:
        request.transaction_id = self.getNextTID()
        count_retries = 0
        while count_retries <= self.retries:
            self.response_future = asyncio.Future()
            self.pdu_send(request)
            if no_response_expected:
                return None
            try:
                response = await asyncio.wait_for(
                    self.response_future, timeout=self.comm_params.timeout_connect
                )
                return response
            except asyncio.exceptions.TimeoutError:
                count_retries += 1
        txt = f"No response received after {self.retries} retries, continue with next request"
        Log.error(txt)
        raise ModbusIOException(txt)
TransactionManager.execute = execute
I really don't think it's that special a case, and the disconnect-triggering logic is almost entirely non-configurable (it has strange magic numbers, like 3 retries before disconnecting, that are not easy to override). IMO, user code should in many cases determine when a disconnect is needed, as the logic may not be possible to implement generically in a reliable way.
Remember the saying: "you can please all people some of the time and you can please some people all of the time, but you cannot please all people all of the time." For the non-professional user the function helps. We need to support the code long term, because nearly all PRs are left unmaintained. The history of the transaction code is that you got a major change merged, which sadly broke the sync version seriously, so we decided to replace it with a much cleaner version. As for monkeypatching, why not simply replace execute with your own version? The execute interface has been stable for years. That would also allow you to remove the whole retry logic, which must be disturbing as well.
I mean, arguably it helps for professional usage as well, in cases where one needs app-level control of disconnect behavior. My use case is industrial controls, so it's not like I'm using it for simple home-automation purposes either. In my experience, professional stuff tends to have more config knobs for controlling precise behavior than non-professional.
Yeah, unfortunately that sync version was so complex it was nearly impossible for me to properly understand/debug without major refactoring.
I'm confused: this execute function seems to be constantly changing implementation and moving around (it just recently moved from the client to the transaction manager). I'm literally swapping it for my own version with the disconnect logic removed in the above example.
Most of the retry logic is not too bad, although there were unhandled cases I had to deal with. For those, I instead have wrapper functions that add extra retry conditions/logic (say, for certain error or incomplete responses) on top to increase reliability, but that doesn't break every release the way the auto-disconnect patch does, since it's not tweaking internal logic.
If there are unhandled cases, then you should either create an issue or a pull request. Although many of your pull requests were abandoned by you, some were merged... for which all pymodbus users are thankful. I have spent, and am spending, quite a lot of hours maintaining and refactoring pymodbus, so please bear with me that I cannot merge PRs that are too special. I always try to ask for the reasons (like the rs485 one, which I would like to see merged) and often accept PRs that I do not agree with. The sync transaction is a good example; sadly my review was not deep enough, and we ended up with a non-functioning sync version.
And btw, there is a simple solution to the disconnect problem. The app issues a request, the device does not respond, execute times out and raises an error, BUT the connection is not broken until after 3 consecutive failed requests... so if this happens, just do a request on a slave that responds. With the new trace method, it can even be a virtual slave.
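For illustration, a sketch of that workaround (slave IDs and the register address are placeholders): after a probe times out, touch a slave that is known to answer, so the consecutive-no-response counter is cleared before it reaches its limit.

from pymodbus.client import AsyncModbusSerialClient
from pymodbus.exceptions import ModbusIOException

KNOWN_GOOD_SLAVE = 1  # placeholder: any device that reliably answers

async def probe_slave(client: AsyncModbusSerialClient, slave_id: int) -> bool:
    """Return True if slave_id answered; keep the connection alive either way."""
    try:
        await client.read_holding_registers(0, count=1, slave=slave_id)
        return True
    except ModbusIOException:
        # An absent device, not a broken link: one successful request resets
        # the internal no-response counter, so the auto-disconnect never fires.
        await client.read_holding_registers(0, count=1, slave=KNOWN_GOOD_SLAVE)
        return False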
It just seems very strange to me that letting application code handle all the transport connection logic itself, by disabling the internal auto-disconnect logic, is not something pymodbus supports, especially since pymodbus inherently deals with many different use cases and device types that can vary widely in real-world behavior. For example, it would obviously be a bug if an application or networking library were to unconditionally reconnect your WiFi interface simply because a specific IP address is not reachable. Even though disconnecting and reconnecting could in some cases fix an unreachable IP address, it is just as likely to cause many more issues, since an unreachable IP does not necessarily indicate a problem with the WiFi interface itself. Yet this is effectively equivalent to what the auto-disconnect logic in pymodbus is doing here when a slave is not reachable.
It's unclear to me what should be changed there other than rebasing. For example, trying to implement some suggestions in the thread, like userspace timing code (i.e., sleeps and such for sends), just seems like a bad idea and not something I think I would be able to implement; I was simply trying to add the option to pass through the params to pyserial, which is fairly limited in scope.
Yeah, I was trying to refactor that, but since I pretty much never work with that sort of sync-IO Python code, it's a bit outside my area of expertise (aside from the readability issues).
Unless I'm missing something, this technique sounds extremely complex and error-prone to implement, and AFAIU it would carry a large performance penalty for device detection: as I understand it, it would greatly increase the number of requests needed to test for the presence of slave devices, unless the virtual method has zero performance penalty. Many slave devices we scan for aren't connected, so I think it would roughly double the time needed to scan for slaves if we had to make an unnecessary request between each detection request that doesn't get a response.
During scans we start off with 0 slaves detected, so we have no real devices to send requests to in order to reset the disconnect counter; it's unclear to me how to use the trace method to reliably trick the auto-disconnect code into never triggering.
The idea I suggested adds a delay of a few milliseconds, and adding the trace method is just a couple of lines... in any case far easier to develop and maintain than monkeypatching the disconnect code. The virtual slave would only run in Python code, with no timeout and no network, and therefore zero penalty. I am actually using the method in a little pet project, a serial forwarder with auto-detection == auto-configuration, apart from the ports. But of course it is your call, not mine.
You could also simply reset the variable that controls the disconnect when a device does not respond; that is a one-liner, without the need for monkeypatching. Just thinking you have chosen a very complex way to solve a simple problem. But of course that is your prerogative.
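For reference, a sketch of that one-liner. Note that count_no_responses is an internal attribute, not public API: it lived on the client in 3.7.x (client/base.py, as in the report below) and on the TransactionManager in later releases, so it may shift again.

from pymodbus.exceptions import ModbusIOException

async def probe_and_reset(client, slave_id: int) -> bool:
    try:
        await client.read_holding_registers(0, count=1, slave=slave_id)
        return True
    except ModbusIOException:
        # Forget the misses so the counter never reaches
        # accept_no_response_limit and triggers a disconnect.
        client.count_no_responses = 0
        return False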
Hmm, not sure how to make this work without having to reset the variable at all the pymodbus operation call sites, but it did give me an idea for a better way to monkeypatch out the disconnects without having to actually modify execute. This should prevent the disconnect from ever triggering:

TransactionManager.count_no_responses = property(
    lambda self: 0, lambda self, _: None
)
TransactionManager.accept_no_response_limit = property(
    lambda self: 1, lambda self, _: None
)
Fine, if it works for you. TransactionManager is not part of the API, so using property is just adding overhead. Apart from that, the lambdas seem a lot more complicated than needed.
Versions
Pymodbus Specific
Description
Greetings!
I have a bus with 2 slaves (EVSE and smartmeter).
Sometimes, the EVSE times out. When that occurs, pymodbus seems to mark the whole modbus connection as disconnected.
As a result, the next requests fail with an error "Not connected", no matter what slave these requests are for.
I made a test case:
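In outline (a sketch: the port, register address, and slave IDs are placeholders; ID 1 is the meter, ID 2 the EVSE):

import asyncio

from pymodbus.client import AsyncModbusSerialClient
from pymodbus.exceptions import ConnectionException, ModbusIOException

async def main() -> None:
    client = AsyncModbusSerialClient("/dev/ttyUSB0", baudrate=9600)
    await client.connect()
    # The EVSE (ID 2) stops answering: every call times out, and each failure
    # bumps the internal no-response counter until the client disconnects.
    for _ in range(4):
        try:
            await client.read_holding_registers(0, count=1, slave=2)
        except ModbusIOException:
            pass
    try:
        # The meter (ID 1) is healthy, yet this now fails with "Not connected".
        await client.read_holding_registers(0, count=1, slave=1)
    except ConnectionException as exc:
        print(f"meter request also failed: {exc}")
    client.close()

asyncio.run(main())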
I looked in the code:
client/base.py, line 119:
When one slave does not respond after more than accept_no_response_limit attempts, the client disconnects, even though the other slaves on that bus are fine. There also seems to be a wait time before reconnection, so the next request, for another slave, fails as well.
In my specific case, this EVSE sometimes stops responding for one second and then resumes. That is enough for pymodbus to do its 3 retries and then decide to disconnect, because accept_no_response_limit is also 3; the next request, which is for the meter, then fails as well.
I simply increased client.accept_no_response_limit to fix it, but it feels a bit like a kludge! I'm not sure what the behavior should be...
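I.e., something like this (the value is arbitrary, just well above the retry burst):

client.accept_no_response_limit = 25  # default is 3; a short dropout can no longer trip it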