This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

feat: use loop vectorization for faster cpu broadcasting #85

Closed
wants to merge 22 commits into from

Conversation

avik-pal
Member

@avik-pal avik-pal commented Jul 13, 2024

Updates

  • uses LoopVectorization (LV) for broadcasting on CPU where possible.
  • refactors fast_activation!! to call fast_broadcast!!; the former is now deprecated.
  • chainrules bypasses the LV impl completely.
    • for common activation functions, we see a >2x performance boost. Fallback performance is unchanged, which is good.
  • faster broadcasting in general:
    • dropout functions
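
For illustration only, here is a minimal sketch of the general idea of using an explicit `@turbo` loop on dense CPU arrays and falling back to a plain broadcast otherwise. The name `relu_sketch!` and the dispatch rule are hypothetical and are not taken from this PR's code; the sketch assumes only that LoopVectorization.jl is installed and uses ReLU as a stand-in for a common activation function.

```julia
using LoopVectorization

# Hypothetical helper, not the PR's implementation.
# CPU path: dense Arrays of hardware floats get an explicit @turbo loop.
function relu_sketch!(y::Array{T}, x::Array{T}) where {T <: Union{Float32, Float64}}
    z = zero(T)  # hoist the constant out of the loop
    @turbo for i in eachindex(y, x)
        y[i] = max(x[i], z)
    end
    return y
end

# Fallback path: anything else (views, GPU arrays, other element types)
# goes through an ordinary in-place broadcast.
function relu_sketch!(y::AbstractArray, x::AbstractArray)
    y .= max.(x, zero(eltype(x)))
    return y
end

x = Float32[-1, 2, -3, 4]
y = relu_sketch!(similar(x), x)           # takes the @turbo path
v = relu_sketch!(zeros(Float32, 2), view(x, 1:2))  # takes the fallback path
```

Dispatching on the concrete `Array` type keeps the vectorized loop away from array types it may not support, while every other input still gets a correct result from the generic method.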

Current Issues

  • CUDA CI time seems to have gone up
  • AMDGPU fused_dense_bias_activation gradients fail for certain inputs

TODOs

  • for enzyme we need a workaround with mixed forward mode.
  • need some proper benchmarks showcasing the improvements. we don't need any upstream Lux changes so benchmarking here is quite easy.
  • needs some testing for gradient correctness.
  • special case for broadcasting bias for convolutions. right now it falls back to FB
  • faster broadcasting for other impls:
    • normalization functions
    • affine functions
  • do we want to add a public bias_activation API?

@avik-pal avik-pal force-pushed the ap/update branch 2 times, most recently from c15f291 to 0184472 Compare July 13, 2024 21:51
@avik-pal avik-pal force-pushed the ap/loop_vectorization branch from f6fe2f4 to 582017a Compare July 13, 2024 22:33
@avik-pal avik-pal force-pushed the ap/loop_vectorization branch 2 times, most recently from 0ef320e to 9069e94 Compare July 13, 2024 23:23
Base automatically changed from ap/update to main July 14, 2024 00:09
@avik-pal avik-pal force-pushed the ap/loop_vectorization branch 4 times, most recently from cc2694a to 3c9f8ae Compare July 14, 2024 02:20
@avik-pal avik-pal force-pushed the ap/loop_vectorization branch 13 times, most recently from 0a73a5c to 5250f6c Compare July 14, 2024 18:40
@avik-pal avik-pal force-pushed the ap/patches branch 27 times, most recently from 27d1475 to ed5b6d7 Compare July 20, 2024 22:40
@avik-pal
Member Author

Branches have diverged too much. I will open a fresh PR for this.

@avik-pal avik-pal closed this Jul 20, 2024
@avik-pal avik-pal deleted the ap/loop_vectorization branch July 31, 2024 16:26