[Optimization] Implicit gemm rewrite #2545
base: main
Conversation
It looks awesome! Feels great to reuse a lot of components. There are still some improvements that we can make in our "design paradigm", especially in how we pass around the config. But this is beyond the scope of this PR.
I have a few comments, but it would also be great for @louisfd to review.
pub struct CmmaHalf<EG: Numeric, Stage: StageSize> {
    pub _eg: PhantomData<EG>,
    pub _stage: PhantomData<Stage>,
}
Could the Cmma struct be generic over the accumulation precision?
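One way the suggestion could look (a hypothetical sketch, not the PR's actual types): a single struct generic over both the global element type and an accumulation precision parameter, with the half-accumulation variant reduced to an alias. The marker types below stand in for the real numeric and stage-size traits.

```rust
use std::marker::PhantomData;

// Hypothetical sketch: the struct takes the accumulation precision EA as a
// type parameter instead of hard-coding half-precision accumulation.
pub struct Cmma<EG, EA, Stage> {
    pub _eg: PhantomData<EG>,
    pub _ea: PhantomData<EA>, // accumulation precision, e.g. f16 vs f32
    pub _stage: PhantomData<Stage>,
}

// Zero-sized marker types standing in for real numeric/stage types.
pub struct F16;
pub struct F32;
pub struct S4x4x2;

// The half-accumulation variant then becomes just an alias:
pub type CmmaHalf<EG, Stage> = Cmma<EG, F16, Stage>;
```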
        Self::LhsLoader::advance_view(&mut lhs_loader, k_step);
        Self::RhsLoader::advance_view(&mut rhs_loader, k_step);
    }
Somehow adding a sync_units after the for loop improved performance for the matmul. I think it makes sure all units in a plane are synchronized following the loop, which improves the execution of subsequent operations.
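The suggested placement can be sketched as follows. This is an illustration only: in the real kernel the loop body runs the stage matmul and sync_units is a CubeCL intrinsic; here it is stubbed out so the structure is self-contained.

```rust
// Stand-in for the plane-wide barrier (a CubeCL intrinsic in the real code).
fn sync_units() {
    // no-op stub for illustration
}

// Simplified shape of the main k-loop; returns the final k offset.
fn execute_main_loop(num_loops: u32, k_step: u32) -> u32 {
    let mut k = 0;
    for _ in 0..num_loops {
        // ... fill stages, run the stage matmul, advance lhs/rhs views ...
        k += k_step;
    }
    // Suggested addition: synchronize once after the loop so all units in
    // the plane are aligned before the accumulator read-out that follows.
    sync_units();
    k
}
```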
I'll benchmark it
///
///
Empty lines in comment block 😅
use crate::kernel::conv::homogeneous::base::config;

#[cube]
/// Input to the convolution, responsible for filling the stage and providing a reader for it.
/// Advances along the k-dimension to fill the stage with further data.
pub trait Loader<EG: Numeric, ES: Numeric, G: global::Config>:
    CubeType + 'static + Send + Sync
{
    /// The stage reader which matches the input of the underlying stage matmul.
    type StageReader: CubeType;

    /// Fills the stage at the current k offset and returns a reader for it.
    fn fill_stage(this: &mut Self, #[comptime] config: G) -> Self::StageReader;

    /// Move the k offset by k_offset
    fn advance_view(this: &mut Self, k_offset: u32);
}

#[cube]
impl<EG: Numeric, ES: Numeric, S: stage::Config, L: LoadingStrategy>
    Loader<EG, ES, config::Config<S>> for RhsLoader<EG, ES, S, L>
{
    type StageReader = RhsReader<ES>;

    fn fill_stage(this: &mut Self, #[comptime] config: config::Config<S>) -> Self::StageReader {
        CyclicLoading::load_to_slice::<EG, ES, config::Config<S>>(
            &this.tensor_view,
            &mut this.stage.as_slice_mut(),
            Ident::Rhs,
            config,
        );
        RhsReader::new(this.stage)
    }

    fn advance_view(this: &mut Self, k_offset: u32) {
        this.tensor_view.update_view(k_offset, Ident::Rhs);
Do you need to duplicate this? I don't believe we actually need to have the constraint G: global::Config in the trait; only G is good enough. The Algorithm trait can make the link between the two types.
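The point can be illustrated with a minimal sketch (hypothetical names, not the PR's code): the trait takes the config as an unconstrained type parameter, and the config-trait bound is only stated where it is actually used, on the concrete impl.

```rust
// The trait leaves G unconstrained; no global-config bound is needed here.
trait Loader<G> {
    fn advance_view(this: &mut Self, k_offset: u32, config: &G);
}

// A stand-in for the global config trait the impl actually relies on.
trait GlobalConfig {
    fn check_bounds(&self) -> bool;
}

struct SimpleConfig;
impl GlobalConfig for SimpleConfig {
    fn check_bounds(&self) -> bool {
        true
    }
}

struct DemoLoader {
    offset: u32,
}

// The bound appears only on the impl, where the config methods are used.
impl<G: GlobalConfig> Loader<G> for DemoLoader {
    fn advance_view(this: &mut Self, k_offset: u32, config: &G) {
        if config.check_bounds() {
            this.offset += k_offset;
        }
    }
}
```

Callers that need the constrained behavior pick it up from the impl, so the trait definition stays free of the duplicate bound.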
You're right, this used to be required when the config was a generic on the function, but it's now superfluous.
Pull Request Template
Checklist
The run-checks all script has been executed.
Related Issues/PRs
Requires tracel-ai/cubecl#309 to land first
Changes
Adds a brand new implicit GEMM implementation that uses the matmul primitives in cubecl. This is slower for small k sizes, but much faster for large ones, and more flexible. I'm keeping the current implementation because it's significantly faster for certain sizes and uses a significantly different loader strategy (loading only within each warp, which skips cross-warp syncs).
Adds a number of new convolution benchmarks to test performance with different sizes and characteristics.
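For context on the small-k vs. large-k trade-off: implicit GEMM conventionally maps a convolution onto a virtual matmul without materializing the im2col matrix. The mapping below is the standard one and is illustrative only, not taken from the PR's code.

```rust
// Standard implicit-GEMM dimension mapping for a convolution:
//   M = batch * out_h * out_w   (output positions)
//   N = out_channels            (filters)
//   K = in_channels * kh * kw   (reduction dimension)
fn implicit_gemm_dims(
    batch: u32,
    out_h: u32,
    out_w: u32,
    out_c: u32,
    in_c: u32,
    kh: u32,
    kw: u32,
) -> (u32, u32, u32) {
    (batch * out_h * out_w, out_c, in_c * kh * kw)
}
```

For example, a 3x3 convolution over 256 input channels yields K = 2304, where the matmul-primitive path amortizes well, while a 1x1 convolution over a handful of channels yields a tiny K where per-iteration overhead dominates.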
Testing
All non-group tests pass, and CRAFT has the expected output with all layers using the new implicit GEMM. This tests many different and relatively large layers. Adds two new regression tests for bugs discovered during implementation.