Working SPI ISR optimizations with ~30% performance improvement #10

bigjosh · 2015-12-20T02:07:06Z

No functional changes, only local optimizations in the ShiftPWM.h file.

Performance changes...

Saves 2 cycles per bit by replacing the existing loop enclosing a sequence of 8 calls to
add_one_pin_to_byte() with the singe function send_spi_bytes().

Writing the whole send sequence in a single inline ASM allowed for explicit pre-decrement
indexing. This optimization did not happen naturally inside the loop because of compiler limitations.

Old emitted ASM per bit:

        add_one_pin_to_byte(sendbyte, counter, --ledPtr);
     3e2:   20 81           ld  r18, Z
     3e4:   52 17           cp  r21, r18
     3e6:   47 95           ror r20
     3e8:   31 97           sbiw    r30, 0x01   ; 1

New emitted ASM per bit...

     3d6:   02 90           ld  r0, -Z
     3d8:   20 15           cp  r18, r0
     3da:   37 95           ror r19

There is also a savings of 1 cycle per byte because the compiler redundantly compares the loop variable to zero after decrementing it.

Non-performance changes...

Using an EOR against a preloaded register rather than a NEG invert the output bits instead of a compare and branch. This is performance neutral if the option is selected, and costs a single cycle per byte if it is not since the code would have been statically eliminated in the old version.
Preloading the load balancing step factor into a register (0 or 8 depending on ShiftPWM_balanceLoad) and then always adding this to the counter on each loop pass. This is performance neutral if the option is selected, and costs a single cycle per byte if it is not since the code would have been statically eliminated in the old version.

These changes were motivated by keeping the ASM code clean and continuous. In order to allow static elimination based on a const variable, I would have either had to...

Break up the asm() into multiple parts. In my experience, this increases the chances that the compiler will mess up the emitted code, especially on older versions of the Arduino IDE.
Make 4 versions of the send_spi_bytes() function to cover each case of the options being selected. This is optimal, but ugly.

Overall, I think the per-bit savings get the code fast enough that it is keeping up with the SPI hardware, so any additional savings would likely be wasted waiting for the SPI transmit to complete.

That said, if there was motivation to support 2X SPI mode, then there are some tricks we could use to keep up with that. Let me know if you think this is a relevant use case.

Ossilicope traces of the before and after outputs attached.

Thanks!

-josh

…o functional changes Replaced the existing loop enclosing a sequence of 8 calls to add_one_pin_to_byte() with the singe function send_spi_bytes(). Writing the whole send sequence in a single inline ASM allowed for pre-decrement indexing and some other smaller optimization in count checking and bit inversion.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Working SPI ISR optimizations with ~30% performance improvement #10

Working SPI ISR optimizations with ~30% performance improvement #10

bigjosh commented Dec 20, 2015

Working SPI ISR optimizations with ~30% performance improvement #10

Are you sure you want to change the base?

Working SPI ISR optimizations with ~30% performance improvement #10

Conversation

bigjosh commented Dec 20, 2015

Performance changes...

Non-performance changes...