
Connection lost when one slave does not respond #2431

Closed
peufeu2 opened this issue Oct 30, 2024 · 16 comments · Fixed by #2433

Comments

@peufeu2
Contributor

peufeu2 commented Oct 30, 2024

Versions

  • Python: 3.11
  • OS: Linux 5.15.80-sunxi
  • Pymodbus: 3.7.4
  • Modbus Hardware (if used): Waveshare USB-RS485 interface

Pymodbus Specific

  • Client: rtu - async

Description

Greetings!

I have a bus with 2 slaves (EVSE and smartmeter).

Sometimes, the EVSE times out. When that occurs, pymodbus seems to mark the whole modbus connection as disconnected.

As a result, the next requests fail with an error "Not connected", no matter what slave these requests are for.

I made a test case:

  • One USB interface, one bus, two slaves: evse (the actual evse, address 3) and mevse (smartmeter attached to it, address 1)
  • I set the wrong address (2 instead of 3) for one slave (evse), so it always times out
  • This blocks requests to the other slave (mevse) with "Not connected" error.

I looked in the code:

client/base.py, line 119:

if self.count_no_responses >= self.accept_no_response_limit:
    self.ctx.connection_lost(asyncio.TimeoutError("Server not responding"))

When one slave does not respond, after more than accept_no_response_limit attempts, pymodbus disconnects even though the other slaves on that bus are fine. There also seems to be a wait time before reconnection, so the next request, even for another slave, fails as well.

In my specific case, this EVSE sometimes stops responding for one second and then resumes. That is enough for pymodbus to do its 3 retries, then decide to disconnect because accept_no_response_limit is also 3; the next request that comes in for the meter then fails as well.

I simply increased client.accept_no_response_limit to fix it, but it feels a bit like a kludge! I'm not sure what the behavior should be...

@janiversen
Collaborator

With v3.7.4, we have a more intelligent disconnect than earlier:

  • If a device does not respond, a counter is increased.
  • When a good response is received (from any device), the counter is reset.
  • When the counter exceeds the limit (3), the connection is terminated.

This means that in normal situations, where you read from several devices, you only get a delay, not a disconnect.

Sometimes the USB hangs, so never disconnecting at all is not a valid option.
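The counter behavior described above can be sketched in plain Python. This is an illustration of the heuristic only, with hypothetical names; it is not the actual pymodbus implementation, and it uses the `>=` comparison quoted from client/base.py earlier in the thread:

```python
class NoResponseTracker:
    """Illustration of the disconnect heuristic described above.

    Hypothetical stand-in; pymodbus keeps equivalent state internally.
    """

    def __init__(self, limit: int = 3):
        self.limit = limit      # analogue of accept_no_response_limit
        self.count = 0          # consecutive no-responses seen
        self.connected = True

    def on_response(self) -> None:
        # A good response from *any* device resets the counter.
        self.count = 0

    def on_timeout(self) -> None:
        self.count += 1
        if self.count >= self.limit:
            # Too many consecutive no-responses: assume the link is dead.
            self.connected = False
```

So timeouts interleaved with responses from another slave never disconnect, while three consecutive timeouts do.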

@janiversen
Collaborator

Be aware retries is something else.

@janiversen
Collaborator

If you want more help, then please add a debug log, that shows what happens.

@peufeu2
Contributor Author

peufeu2 commented Oct 30, 2024

I think your fix is a good solution. Thanks!

@jameshilliard
Contributor

jameshilliard commented Dec 20, 2024

When that occurs, pymodbus seems to mark the whole modbus connection as disconnected.

Yeah, I've unfortunately had to resort to monkeypatching out this auto-disconnect behavior (broken for my use case, and seemingly a number of others) by overriding the class method that calls the disconnect. I have to update the monkeypatch nearly every pymodbus release, but I haven't found a better option to prevent everything from breaking horribly. Unfortunately, my change adding a configuration option to inhibit the disconnects was rejected.

Sometimes the USB hangs, so not disconnecting at all is not a valid option.

I mean, it literally is the best option for many use cases, especially for non-USB hardware RS-485 ports with any sort of code that relies on auto-probing devices, where you expect requests to not always return responses. Making it difficult to disable this auto-disconnect brokenness is super annoying.

@janiversen
Collaborator

Not expecting a response is actually something you can define.

I have no idea how you monkeypatch the disconnect, but there are a number of ways to do it easily and future-proof. We do it in our test harness.

And yes, your PR was rejected, but with an explanation: we cannot maintain every special case!

@jameshilliard
Contributor

not expecting a response is actually something you can define.

Except it's not that we know there won't be a response. We're using various requests to probe for devices and known addresses for device auto-configuration/detection, so we do want a response, but a lack of one is not really an error; in most cases it simply means a device is not present. A disconnect is highly problematic here and should never happen.

For example in 3.8.2, I'm doing this to monkey patch all the disconnect logic out of the execute function:

import asyncio

from pymodbus.exceptions import ConnectionException, ModbusIOException
from pymodbus.logging import Log
from pymodbus.pdu import ModbusPDU
from pymodbus.transaction import TransactionManager


async def execute(
    self, no_response_expected: bool, request: ModbusPDU
) -> ModbusPDU | None:
    """Execute requests asynchronously.

    REMARK: this method is identical to sync_execute, apart from the lock and try/except.
            any changes in either method MUST be mirrored !!!
    """
    if not self.transport:
        Log.warning("Not connected, trying to connect!")
        if not await self.connect():
            raise ConnectionException(
                "Client cannot connect (automatic retry continuing) !!"
            )
    async with self._lock:
        request.transaction_id = self.getNextTID()
        count_retries = 0
        while count_retries <= self.retries:
            self.response_future = asyncio.Future()
            self.pdu_send(request)
            if no_response_expected:
                return None
            try:
                response = await asyncio.wait_for(
                    self.response_future, timeout=self.comm_params.timeout_connect
                )
                return response
            except asyncio.exceptions.TimeoutError:
                count_retries += 1
        txt = f"No response received after {self.retries} retries, continue with next request"
        Log.error(txt)
        raise ModbusIOException(txt)


TransactionManager.execute = execute

And yes your PR was rejected, but with an explanation, we cannot maintain every special case !

I really don't think it's that special a case, and the disconnect-triggering logic is almost entirely non-configurable (it has strange magic numbers, like 3 no-responses before disconnecting, that are not easy to override). IMO, user code should in many cases determine when a disconnect is needed, as the logic may not be possible to implement generically in a reliable way.

@janiversen
Collaborator

Remember the saying:

"you can please all people some of the time and you can please some people all of the time, but you cannot please all people all of the time"

For the non-professional user the function helps.

We need to support the code long-term, because nearly all PRs are left unmaintained. The history of the transaction code is that you got a major change merged which sadly broke the sync version seriously, so we decided to replace it with a much cleaner version.

As for monkey patching: why not simply replace execute with your own version? The execute interface has been stable for years. That would also allow you to remove the whole retry logic, which must be disturbing as well.

@jameshilliard
Contributor

For the non professional user the function helps.

I mean, arguably it helps for professional usage as well, for cases where one needs app-level control of disconnect behavior. My use case is industrial controls, so it's not like I'm using it for simple home automation either. Professional stuff tends to have more config knobs for controlling precise behavior than non-professional, in my experience.

The history of transaction is that you got a major change merged, that sadly broke the sync version seriously, so we decided to replace it, with a much cleaner version.

Yeah, unfortunately that sync version was so complex it was nearly impossible for me to properly understand/debug without major refactoring.

As for monkey patching, why not simply replace execute with your own version, the execute interface have been stable for years.

I'm confused; this execute function seems to be constantly changing implementation and moving around (it just recently moved from the client to the transaction manager). I'm literally swapping it for my own version with the disconnect logic removed in the above example.

That would also allow you to remove the whole retry logic, which must be disturbing as well.

Most of the retry logic is not too bad, although there were unhandled cases I had to deal with. For those, I instead have wrapper functions that add additional retry conditions/logic (say, for certain error/incomplete responses) on top to increase reliability. That doesn't break as much as the auto-disconnect does every release, since it's not tweaking internal logic.

@janiversen
Collaborator

If there are unhandled cases, then you should either create an issue or a pull request.

Although many of your pull requests were abandoned by you, some were merged...which all pymodbus users are thankful for.

I have spent, and am spending, quite a lot of hours maintaining and refactoring pymodbus, so please bear with me that I cannot merge PRs that are too special. I always try to ask for the reasons (like the RS-485 one, which I would like to see merged) and often accept PRs that I do not agree with.

The sync transaction is a good example: sadly my review was not deep enough, and we ended up with a non-functioning sync version.

@janiversen
Collaborator

And btw, there is a simple solution to the disconnect problem.

The app issues a request, the device does not respond, execute times out and raises an error, BUT the connection is not broken until after 3 consecutive no-responses.....so if this happens, just do a request on a slave that responds.

With the new trace method, it can even be a virtual slave.
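A minimal sketch of that keep-alive idea. The helper name, structure, and FakeClient are illustrative so the sketch runs without hardware; only read_holding_registers mirrors the pymodbus client call. Catching asyncio.TimeoutError is also an assumption here: a real pymodbus client surfaces a ModbusIOException after its own retries.

```python
import asyncio


async def probe_with_keepalive(client, address: int, keepalive_slave: int = 1):
    """Probe `address`; on timeout, immediately query a known-good slave.

    Per the comment above, a good response from any device resets the
    client's consecutive no-response counter, so the connection is never
    torn down during a scan. `client` is anything duck-typing a pymodbus
    async client.
    """
    try:
        return await client.read_holding_registers(0, count=1, slave=address)
    except asyncio.TimeoutError:
        # Keep-alive: a response from a live slave resets the counter.
        await client.read_holding_registers(0, count=1, slave=keepalive_slave)
        return None


class FakeClient:
    """Stand-in client so the sketch is self-contained."""

    def __init__(self, dead_slaves=(9,)):
        self.dead_slaves = set(dead_slaves)
        self.calls = []

    async def read_holding_registers(self, address, count=1, slave=1):
        self.calls.append(slave)
        if slave in self.dead_slaves:
            raise asyncio.TimeoutError
        return f"registers from slave {slave}"
```

Scanning with this helper costs one extra request per missing device, which is the performance concern raised later in the thread.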

@jameshilliard
Contributor

I have and am using quite a lot of hours maintaining and refactoring pymodbus, so please bear with me, that I cannot merge PR that are too special.

It just seems very strange to me that having the ability for application code to handle all the transport connection logic itself by disabling the internal auto-disconnect logic is not something pymodbus supports, especially since pymodbus inherently deals with many different use cases and device types which may have a large variation in real world behavior.

For example, it would obviously be a bug if an application/networking library unconditionally reconnected your WiFi interface simply because a specific IP address is not reachable. Even though disconnecting/reconnecting could in some cases fix an unreachable IP address, it's just as likely to cause many more issues, since an unreachable IP does not necessarily mean there is an issue with the WiFi interface specifically. Yet this is effectively equivalent to what the auto-disconnect logic in pymodbus is doing here when a slave is not reachable.

I always try ask for the reasons (like the rs485, which I would like to see merged) and often accept PRs that I do not agree with.

It's unclear to me what should be changed there other than rebasing. For example, trying to implement some suggestions in the thread, like userspace timing code (i.e. sleeps and such for sends), just seems like a bad idea and not something I think I would be able to implement. I was simply trying to add the option to pass through the params to pyserial, which is fairly limited in scope.

The sync transaction is a good example, sadly my review was not deep enough and we ended up with a non functioning sync version.

Yeah, I was trying to refactor that, but since I pretty much never work with that sort of sync IO Python code, it's a bit out of my area of expertise (aside from the readability issues).

App issues a request, where the device do not respond, execute times out, and raises a an error, BUT connection is not broken until after 3 consecutive requests.....so if this happens just do a request on a slave that responds.

Unless I'm missing something, this technique sounds extremely complex and error-prone to implement, and AFAIU it would carry a large performance penalty for device detection: it would greatly increase the number of requests needed to test for the presence of slave devices, unless the virtual method has zero penalty. Many slave devices we scan for aren't connected, so I think it would roughly double the time needed to scan for slaves if we had to make an extra request between each detection request that doesn't get a response.

With the new trace method, it can even be a virtual slave.

During scans we start off with 0 slaves detected, so we have no real devices to send requests to in order to reset the disconnect counter. It's unclear to me how to use the trace method to reliably trick the auto-disconnect code into never triggering.

@janiversen
Collaborator

janiversen commented Dec 21, 2024

The idea I suggested adds a delay of a few milliseconds, and adding the trace method is just a couple of lines.....in any case far easier to develop and maintain than monkeypatching the disconnect code. The virtual slave would run only in Python code, with no timeout and no network, and therefore with zero penalty.

I am actually using the method in a little pet project, a serial forwarder with auto detection == auto configuration apart from the ports.

But of course it is your call not mine.

@janiversen
Collaborator

You could also simply reset the variable that controls the disconnect when a device does not respond. That is a one-liner, without the need for monkey patching.

Just thinking you have chosen a very complex way to solve a simple problem. But of course that is your prerogative.

@jameshilliard
Contributor

You could also simply reset the variable that controls the disconnect when a device do not respond, that is a one liner, without the need for monkey patching.

Hmm, I'm not sure how to make this work without having to reset the variable at all the pymodbus operation call sites, but it did give me an idea for a better way to monkeypatch out the disconnects without having to actually modify execute.

This should prevent count_no_responses and accept_no_response_limit from changing at all and thus ever triggering a disconnect.

from pymodbus.transaction import TransactionManager

# Class-level properties shadow the instance attributes: the getter always
# reports zero missed responses, and the no-op setter silently swallows any
# assignment, so the disconnect threshold can never be reached.
TransactionManager.count_no_responses = property(
    lambda self: 0, lambda self, _: None
)
TransactionManager.accept_no_response_limit = property(
    lambda self: 1, lambda self, _: None
)
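The trick relies on a Python data-model detail: a class-level property intercepts instance attribute access, including the assignments made inside __init__. A self-contained illustration with a stand-in class (TM is not pymodbus):

```python
class TM:
    """Stand-in for TransactionManager, purely to show the mechanism."""

    def __init__(self):
        # With the property installed below, this assignment goes through
        # the no-op setter and is silently discarded.
        self.count_no_responses = 0


# Install the property on the class *before* any instance is created.
TM.count_no_responses = property(
    lambda self: 0,        # getter: always report zero missed responses
    lambda self, _: None,  # setter: swallow every assignment
)

tm = TM()
tm.count_no_responses += 7  # reads 0, then the setter discards the 7
```

tm.count_no_responses stays 0 no matter what is assigned, which is why pinning the counter this way keeps the limit check from ever firing.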

@janiversen
Collaborator

Fine, if it works for you. TransactionManager is not part of the API, so using property is just adding overhead.

Apart from that, the lambdas seem a lot more complicated than needed.
