Feature: Accelerate BMPremote SPI data phase by removing inter-byte gaps #1946

Open
wants to merge 6 commits into
base: main

Conversation

ALTracer
Contributor

@ALTracer ALTracer commented Oct 1, 2024

Detailed description

  • This is a perf fix to an existing feature.
  • The existing problem is significant gaps between SPI bytes in both read and write xfers as driven by bmpflash + blackpill-f411ce (and likely others).
  • This PR solves it by providing a continuous block xfer primitive (no IRQ and no DMA involved)

Tested to reduce bmpflash read -b int dump time from 35 to 31 seconds for an 8192 KiB w25q64 chip (using 12 MHz). The atomic section blocks interrupts for 170 microseconds (256 byte read); without it, my patch made the board hang reliably (no read timeouts). I will likely rewrite this once more using direct register manipulation instead of the libopencm3 SPI API. Short reads, like SFDP, still show the normal gaps between command bytes (I didn't change them) but no gaps in the data page phase.
The acceleration is achieved by keeping a byte (actually 8/16-bit SPI word) in flight behind the DR shadow register, which is how it is intended to be used. DMA bindings are harder and may result in channel/stream conflicts.
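
To illustrate the "byte in flight" idea, here is a minimal host-side sketch. It does not use libopencm3 or real registers: `mock_spi_s` and every function below are hypothetical stand-ins that model a one-word TX buffer sitting behind the shift register, and the fake slave simply echoes inverted data. The point is only to show why queuing word i while word i-1 is still shifting keeps the shifter busy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of an SPI data register: one buffered TX word
 * sits behind the word currently in the shift register. Not real hardware. */
typedef struct {
	uint8_t tx_buf, shift, rx_buf;
	bool tx_full, shifting, rx_full;
	unsigned idle_events; /* times the shifter finished with nothing queued */
	unsigned overruns;    /* received words lost because RX was not drained */
} mock_spi_s;

/* Advance the model by one word time */
static void mock_spi_tick(mock_spi_s *const spi)
{
	if (spi->shifting) {
		if (spi->rx_full)
			++spi->overruns;
		spi->rx_buf = (uint8_t)~spi->shift; /* fake slave echoes inverted data */
		spi->rx_full = true;
		spi->shifting = false;
		if (!spi->tx_full)
			++spi->idle_events;
	}
	if (!spi->shifting && spi->tx_full) {
		spi->shift = spi->tx_buf;
		spi->shifting = true;
		spi->tx_full = false;
	}
}

static void mock_spi_write(mock_spi_s *const spi, const uint8_t value)
{
	spi->tx_buf = value;
	spi->tx_full = true;
	if (!spi->shifting)
		mock_spi_tick(spi); /* an idle shifter takes the word immediately */
}

static uint8_t mock_spi_read(mock_spi_s *const spi)
{
	while (!spi->rx_full) /* stands in for polling the RX-not-empty flag */
		mock_spi_tick(spi);
	spi->rx_full = false;
	return spi->rx_buf;
}

/* Byte-wise: each word completes before the next one is queued */
void xfer_bytewise(mock_spi_s *const spi, uint8_t *const data, const size_t count)
{
	for (size_t i = 0; i < count; ++i) {
		mock_spi_write(spi, data[i]);
		data[i] = mock_spi_read(spi);
	}
}

/* Pipelined: queue word i while word i-1 is still shifting */
void xfer_pipelined(mock_spi_s *const spi, uint8_t *const data, const size_t count)
{
	if (!count)
		return;
	mock_spi_write(spi, data[0]);
	for (size_t i = 1; i < count; ++i) {
		while (spi->tx_full) /* stands in for polling the TX-empty flag */
			mock_spi_tick(spi);
		mock_spi_write(spi, data[i]);
		data[i - 1] = mock_spi_read(spi);
	}
	data[count - 1] = mock_spi_read(spi);
}
```

In this model the byte-wise loop lets the shifter go idle once per byte, while the pipelined loop only goes idle at the end of the block. On real hardware the pipelined loop must drain each received word before the next one completes, which is presumably why the interrupt-free window matters. As a sanity check on the quoted figure: 256 bytes × 8 bits at 12 MHz is about 171 microseconds.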

Your checklist for this pull request

Closing issues

@ALTracer
Contributor Author

Rebased to main.
A one-line change in PR1968 triggers a warning from reusing tx buffer as a rx buffer now.
I can split them into distinct TX and RX buffers.
The IRQ-masking DR-polling version can deal with one of them possibly being NULL. I'd like to rely on TX buffer being non-null for simplicity, and containing zero-bytes which can be submitted to SPI DR.
Another avenue is actually leveraging a DMA stream on F4 (or channel on F1) where platforms allow, and where it does not conflict with aux USART and SWO USART.
What is the API here -- submitting two unidirectional transfers, or submitting one transfer but specifying the Tx header length (and the rest is Rx)? I'm assuming full-duplex is time-divided into Tx then Rx. Block reads (and block writes) to 25-series flash are already implemented as a byte-wise Tx command then block-wise Rx.
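
The two API shapes being weighed can be sketched side by side. This is only an illustration under stated assumptions: `spi_exchange_fn`, `spi_write_block`, `spi_read_block`, `spi_xfer_header_then_read` and `fake_exchange` are all hypothetical names, not existing Black Magic Probe functions, and the fake slave just answers with an incrementing counter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word-exchange hook standing in for the real SPI driver */
typedef uint8_t (*spi_exchange_fn)(uint8_t tx);

/* Fake slave for illustration: answers every word with an incrementing counter */
uint8_t fake_counter;

uint8_t fake_exchange(const uint8_t tx)
{
	(void)tx;
	return fake_counter++;
}

/* Option A: two unidirectional calls, Tx first and then Rx */
void spi_write_block(const spi_exchange_fn xfer, const uint8_t *const tx, const size_t count)
{
	for (size_t i = 0; i < count; ++i)
		(void)xfer(tx[i]); /* received words are discarded */
}

void spi_read_block(const spi_exchange_fn xfer, uint8_t *const rx, const size_t count)
{
	for (size_t i = 0; i < count; ++i)
		rx[i] = xfer(0U); /* clock out zero bytes while receiving */
}

/* Option B: one call; the first tx_length bytes of buffer are sent,
 * and the remaining bytes up to total_length are overwritten with Rx data */
void spi_xfer_header_then_read(
	const spi_exchange_fn xfer, uint8_t *const buffer, const size_t tx_length, const size_t total_length)
{
	for (size_t i = 0; i < total_length; ++i) {
		const uint8_t rx = xfer(i < tx_length ? buffer[i] : 0U);
		if (i >= tx_length)
			buffer[i] = rx;
	}
}
```

Option A keeps distinct TX and RX buffers, which would sidestep the buffer-reuse warning mentioned above. Option B needs only one buffer and one call, at the cost of the caller having to track where the header ends.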

Member

@dragonmux dragonmux left a comment

Apologies that this fell by the wayside. I've done an initial review, as with the revised platform configurations and such this is worth getting in for v2.0.

@@ -77,6 +77,7 @@ bool platform_spi_deinit(spi_bus_e bus);

bool platform_spi_chip_select(uint8_t device_select);
uint8_t platform_spi_xfer(spi_bus_e bus, uint8_t value);
void platform_spi_xfer_block(spi_bus_e bus, uint8_t *const data, size_t count);
Member

The const here is correct on the buffer in the function definition itself, but should be dropped from this declaration per the clang-tidy lint about useless const.
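
A minimal sketch of the pattern being requested, using a hypothetical `fill_block` function rather than the real `platform_spi_xfer_block`. In C, top-level qualifiers on parameters are ignored when checking that a declaration and a definition are compatible, so the two forms below refer to the same function. The clang-tidy check in question is presumably `readability-avoid-const-params-in-decls`.

```c
#include <stddef.h>
#include <stdint.h>

/* Declaration, as it would appear in a header: no top-level const on parameters,
 * because it tells the caller nothing (arguments are passed by value) */
void fill_block(uint8_t *data, size_t count, uint8_t value);

/* Definition: top-level const is kept, since here it stops the body
 * from accidentally reassigning its own parameters */
void fill_block(uint8_t *const data, const size_t count, const uint8_t value)
{
	for (size_t i = 0; i < count; ++i)
		data[i] = value;
}
```

Note that a const on the pointed-to type (`const uint8_t *data`) is different: it is part of the interface and must appear in both places.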

@@ -67,9 +67,13 @@ void bmp_spi_read(const spi_bus_e bus, const uint8_t device, const uint16_t comm
bmp_spi_setup_xfer(bus, device, command, address);
/* Now read back the data that elicited */
uint8_t *const data = (uint8_t *const)buffer;
#if 0
Member

If you're adding this new functionality, please just add it - if you want to preserve working with the old API then introduce a #define in the platform header that can be tested for here to switch to the new implementation. This will then also fix the builds.

Contributor Author

Yes, I felt like PLATFORM_HAS_SPI and PLATFORM_HAS_SPI_BLOCKWISE or so would be nice to add. The first macro would guard dummy implementations in all but two platforms; the second would dispatch to a block xfer function instead of the slow byte-wise call chains.
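
A rough sketch of how such a dispatch could look. The macro name is taken from the comment above; everything else (`stub_spi_xfer`, `stub_spi_xfer_block`, `read_payload` and the call counters) is a hypothetical stand-in so the example compiles by itself, and is not the actual bmp_spi_read code.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed to be provided by a platform header on platforms with the fast path */
#define PLATFORM_HAS_SPI_BLOCKWISE 1

/* Stand-ins for the platform functions; they only count how often they are called */
unsigned bytewise_calls;
unsigned block_calls;

uint8_t stub_spi_xfer(const uint8_t value)
{
	++bytewise_calls;
	return value;
}

void stub_spi_xfer_block(uint8_t *const data, const size_t count)
{
	(void)data;
	(void)count;
	++block_calls;
}

/* Data phase of a read: one block call where available, per-byte calls otherwise */
void read_payload(uint8_t *const data, const size_t length)
{
#if defined(PLATFORM_HAS_SPI_BLOCKWISE)
	stub_spi_xfer_block(data, length);
#else
	for (size_t i = 0; i < length; ++i)
		data[i] = stub_spi_xfer(0U);
#endif
}
```

With the macro defined, a 256-byte page costs one call instead of 256, and platforms that do not define it keep building against the old byte-wise API.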
But first I wanted to evaluate flash size increase from this feature.

return;
}

#if 0
Member

What is the benefit and drawback of these two approaches? Can the simpler more expressive loop get similar performance if interrupts are suspended with an atomic context?

Contributor Author

No, it can't, because it blocking-waits for the entire duration of each 8/16-bit SPI word in https://github.com/libopencm3/libopencm3/blob/201f5bcfb3fa70ee34818152463e7139f24db377/lib/stm32/common/spi_common_all.c#L189-L190
The upside is that it never submits an extra word in flight to keep data pumping, and hence cannot miss an Rx byte.

Member

OK, fair enough. Then please drop the simpler loop, as there's no point keeping it in this new code.

@dragonmux dragonmux added this to the v2.0 release milestone Nov 20, 2024
@dragonmux dragonmux added the Enhancement (General project improvement) and BMP Firmware (Black Magic Probe Firmware, not PC hosted software) labels Nov 20, 2024
* Existing implementation has to walk up and down the function stack per byte,
  which is fine for commands and general poking
* 256-byte long page reads and writes can be accelerated because the length is known ahead of time
* Keep a byte in flight on stm32f1/f4 SPI (this is simpler than IRQ or DMA)