Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[STM32] Rectify IRQ Handling for Serial Devices #4780

Closed
wants to merge 1 commit into from

Conversation

betzw
Copy link
Contributor

@betzw betzw commented Jul 19, 2017

Description

STM32's serial device drivers currently feature some race conditions which might lead to loose interrupts coming from the serial device.

Status

READY

Fixes

  • clear TxIrq before calling irq_handler()
  • do not clear RxIrq but delegate this to irq_handler()
    (which is usually done by the higher level interrupt handler irq_handler() by simply calling serial_getc())
  • use correct mask USART_CR1_TCIE when checking if TxIrq is enabled

Note

This PR just fixes the issue for STM32F4 but is likely to be needed to be applied for all STM32 platfroms.

CCs

@bcostm, @adustm, @LMESTM, @nikapov-ST, @screamerbg

- clear `TxIrq` before calling `irq_handler()`
- do not clear `RxIrq` but delegate this to `irq_handler()`
  (which can do this by simply calling `serial_getc()`)
- use correct mask when checking if `TxIrq` is enabled
@LMESTM
Copy link
Contributor

LMESTM commented Jul 19, 2017

@betzw thanks for sharing this. Note that there has been ongoing work to improve the same piece of code (or part of it) in #4734. Could you consider using #4734 patch and see if this solves part of your problem, then maybe reapply any possibly missing part on top of it ?

@betzw
Copy link
Contributor Author

betzw commented Jul 19, 2017

Unfortunately, both PRs seem not to be with respect to the same base version, so its difficult to compare them.
Anyway, for sure PR #4734 is extending over a wider range, even though I am not sure that it correctly resolves for example the TX race condition issue. You should double check!

Just to more clear, I am quite sure that PR #4734 does not resolved the TX race condition issue! On the contrary it might introduce an endless busy loop.

@betzw
Copy link
Contributor Author

betzw commented Jul 19, 2017

Furthermore, patch in line #450 seems not to be catched!

@LMESTM
Copy link
Contributor

LMESTM commented Jul 19, 2017

@betzw As I said, this may not cover everything, but this is already ready to be merged.
So think that the best way forward is maybe that you rebase this PR on top of #4734 to share the part of your fix that is still needed. Do you think this is feasible ?

@betzw
Copy link
Contributor Author

betzw commented Jul 19, 2017

In your place I would stop the merge of PR #4734 and review it!

@LMESTM
Copy link
Contributor

LMESTM commented Jul 19, 2017

Then could you please help and make your comments in #4734 ?

@LMESTM
Copy link
Contributor

LMESTM commented Jul 19, 2017

Typically your PR address

  • clear TxIrq before calling irq_handler()
    => Avoid data loss with serial interrupt used at high baudrates  #4734 assumes that higher level should clear or disable the interrupt. There was a discussion in the PR about whether we should clear it or leave it up to the application. I understand that you prefer to clear it AFTER the handler has been called but then isn't there a risk again to miss an interrupt ?
  • do not clear RxIrq but delegate this to irq_handler()
    (which is usually done by the higher level interrupt handler irq_handler() by simply calling serial_getc())
    => also addressed in Avoid data loss with serial interrupt used at high baudrates  #4734
  • use correct mask USART_CR1_TCIE when checking if TxIrq is enabled
    => can be kept in this PR.

@betzw
Copy link
Contributor Author

betzw commented Jul 19, 2017

  • clear TxIrq before calling irq_handler()
    => Avoid data loss with serial interrupt used at high baudrates  #4734 assumes that higher level should clear or disable the interrupt. There was a discussion in the PR about whether we should clear it or leave it up to the application. I understand that you prefer to clear it AFTER the handler has been called but then isn't there a risk again to miss an interrupt ?

No, I think you need to clear it before calling the irq handler (i.e. irq_handler())

Ok.

  • use correct mask USART_CR1_TCIE when checking if TxIrq is enabled
    => can be kept in this PR.

Or inserted in yours.

@betzw
Copy link
Contributor Author

betzw commented Jul 19, 2017

I also think that you should not leave it up to the higher level irq handler (or application) to clear it, because most likely there is no API for doing so and further there is mbed code which does rely on the lower level device driver to clear it (e.g. ATParser and the like)!

@LMESTM
Copy link
Contributor

LMESTM commented Jul 19, 2017

@betzw - ok clear now, thanks.
I'm fine taking your inputs into account in my PR. I'll have a detailed look tomorrow.

@kjbracey
Copy link
Contributor

@betzw - I'm curious about you singling out ATParser. The original patch was tested with UARTSerial, whose interrupt handlers were derived from the BufferedSerial in ATParser. Those interrupt handlers take care to ensure the interrupt condition is cleared in the UART by reading all data or making sure they write until full, and disabling the TX interrupt if nothing left to send.

The interrupt must be unlatched manually in the HAL if accessing the UART is not sufficient to unlatch a pending condition, but in this case I believe there is no need for it - data register accesses are sufficient.

However, it sounds plausible that a manual clear before calling the handler would cover cases where the handler failed to clear the condition. (At a slight performance cost).

@betzw
Copy link
Contributor Author

betzw commented Jul 20, 2017

So, if I understand correctly you agree upon clearing the IRQ before calling the higher lever handler, correct?
If this is the case, don't let us loose time about discussions which in my opinion do not really help to find a better solution (above all if these are triggered by statements which might not be a 100% guessed) and clear the interrupt before calling the handler. Ok, this comes with a slight performance cost, but will make mbed more robust and interrupt handler implementation easier as (in most cases) you do not have to care about clearing anymore.

@LMESTM
Copy link
Contributor

LMESTM commented Jul 21, 2017

@betzw Some update after I looked into required updates to #4734 as discussed in this post yesterday

There are 2 aspects in #4734, 1) do not clear RXNE flag or that would cause possible data loss, and 2) move TxIrq to TXE interrupt (Tx Data register empty) instead of TC (transmission complete) because this is the way it is defined in MBED API and also this improves overall perf.

Because of 2) above I think that your comments on #4734 cannot be applied.
=> clear TxIrq before calling irq_handler
the TXE inerrupt flag is only cleared by writing to the DataRegister (therefore sending data) so this actually cannot be cleared by the driver. It is really up to application / middle layer to disable the IRQ, which is done for insatnce in UARTserial here: https://github.com/kjbracey-arm/mbed-os/blob/82f06ff6e811a6511fddb8480f6e974573b93367/drivers/UARTSerial.cpp#L342
This avoids the endless loop.

=> use correct mask USART_CR1_TCIE when checking if TxIrq is enabled
As we moved from TC to TXE interrupt, the correct mask is already being used.

Let me know your thoughts - also maybe could you actually run your tests using the version from #4734 and check if working ok ?

@betzw
Copy link
Contributor Author

betzw commented Jul 21, 2017

@LMESTM OK for me!
So I am closing this PR.

@betzw betzw closed this Jul 21, 2017
@betzw betzw reopened this Aug 9, 2017
@betzw
Copy link
Contributor Author

betzw commented Aug 9, 2017

@LMESTM,
Even if only occasionally, using TXE (as merged in with PR #4734) causes characters to be randomly lost in the WiFi driver.
When resorting back to TC (including all the modifications suggested by this PR) I cannot observe such a behavior anymore.
Any idea?

@theotherjimmy
Copy link
Contributor

@betzw It looks like you have a conflict. Could you rebase to resolve this conflict?

@betzw
Copy link
Contributor Author

betzw commented Aug 11, 2017

I would prefer to wait for @LMESTM comments on this!

@LMESTM
Copy link
Contributor

LMESTM commented Aug 16, 2017

@betzw Hi. I don't know about Wifi driver so it's hard to tell.
Using TC instead of TXE has 2 consequences : lower serial throughout from MCU to Wifi, and also MCU knows that full set characters has been sent when calls to Tx returns.

So when using TXE as recommended in MBED, you may need to check:

  1. if there is any need for HW or SW flow control ?
  2. make sure to wait for some time (max time for sending 1 char at given baudrate) before doign any possible GPIO signaling with Wifi device, or before entering low power / sleep modes. I have raised this issue 1 year back to MBED team (In few cases STM32 UART tx might not complete #1798 and There is no tests to verify that UART TX is complete #1849) but this hasn't been addressed yet. There is a new issue recently opened here; C HAL function to check Serial FIFO status #4408 which would allow to check that transmission has fully completed before entering sleep modes for instance.

Is is possible that any of the 2 above points could happen in your case ?

@betzw
Copy link
Contributor Author

betzw commented Aug 16, 2017

@LMESTM not sure if I understood everything ...

MCU knows that full set characters has been sent when calls to Tx returns.

Could you pls. elaborate on this.

So when using TXE as recommended in MBED, you may need to check: if there is any need for HW or SW flow control ?

Do you mean that using TXE as interrupt source (rather than TC) is influencing things regarding flow control?
For what reason TXE is recommended in mbed, what is changing from the mbed point of view?

@LMESTM
Copy link
Contributor

LMESTM commented Aug 16, 2017

@betzw
When using TXE instead of TC, you'll be increasing the overall baudrate and maybe could reach a limit (therefore the question about flow control ...). But I'd rather suggest looking at my point 2)

Concerning TXE usage, TxIrq in SerialBase Class is defined as "TxIrq for transmit buffer empty"
https://github.com/ARMmbed/mbed-os/blob/master/drivers/SerialBase.h#L100

One year back (when raising the shared issues) I suggested to use TC instead for reliability but the proposals were rejected because of performance degradation, typically with classes like BufferedSerial which indeed assume that TxIrq means "Tx buffer empty". Maybe the list of interrupts may be updated to also add TxCompleteIrq which would allow application to select the one it needs.

Could you try to check if the data loss still happens in case you disable sleep modes ?

@betzw
Copy link
Contributor Author

betzw commented Aug 16, 2017

@LMESTM

When using TXE instead of TC, you'll be increasing the overall baudrate and maybe could reach a limit (therefore the question about flow control ...). But I'd rather suggest looking at my point 2)

Well, following what an renowned HW engineer said to me when I was discussing with him about this issue, there is no significant difference between using TXE and TC in terms of real transmission speed.

"TxIrq for transmit buffer empty"

This is true also when TC triggers the interrupt ...

Could you try to check if the data loss still happens in case you disable sleep modes ?

The executable is already compiled with --profile=debug, so hal_sleep() is nerver called.

@betzw
Copy link
Contributor Author

betzw commented Aug 16, 2017

And no, no GPIO signaling is used after that the WiFi module is reset at startup.

@betzw
Copy link
Contributor Author

betzw commented Aug 16, 2017

@LMESTM, I have just taken a look on one of the MCU reference manual and have found the following:

After writing the last data into the USART_DR register, it is mandatory to wait for TC=1
before disabling the USART
...

@LMESTM
Copy link
Contributor

LMESTM commented Aug 16, 2017

After writing the last data into the USART_DR register, it is mandatory to wait for TC=1
before disabling the USART ...

And also before entering sleep mode or trying to change baudrate or switch clock off or also before releasing or re-using the buffer that is being transmitted ... Does the wifi driver do anything close to this ? Can you point to the code that is making use of Serial API and which hits the issue ? My only guess is that there is a need to "wait" after sending the data until we have a API to be informed that Tx is complete (and then you'll have to use this API)

To cope with every requirements, the Serial API should be extended to offer the API to clearly distinguish between "Room in Tx buffer" and "Tx complete". There have been similar discussions in other systems https://lists.zephyrproject.org/pipermail/zephyr-devel/2017-May/007642.html

@betzw
Copy link
Contributor Author

betzw commented Aug 16, 2017

With respect to the serial/usart the WiFi driver itself doesn't do much with it directly. All is delegated to mbed class ATParser and such class BufferedSerial to handle the communication.
Anyway, I am really disturbed by the fact that I cannot imagine what is going wrong when using the TXE version of the driver and therefore was hoping that you might have an explanation. For sure the USART must complete the transaction before turning it off (in which way ever) or changing configuration, but this does currently not happen ...?!?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants