This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

chore: bump crate-ci/typos from 1.24.6 to 1.25.0
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.6 to 1.25.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](https://github.com/crate-ci/typos/compare/v1.24.6...v1.25.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
dependabot[bot] authored and avik-pal committed Oct 7, 2024
1 parent e6dd65c commit ba739d3
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion .github/workflows/QualityCheck.yml
@@ -16,4 +16,4 @@ jobs:
       - name: Checkout Actions Repository
         uses: actions/checkout@v4
       - name: Check spelling
-        uses: crate-ci/typos@v1.24.6
+        uses: crate-ci/typos@v1.25.0
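
The commit metadata labels this bump `version-update:semver-minor`. As a minimal sketch of how that label follows from the two version tags, the Python below compares the components in order and reports the first one that differs. The helper names `parse_version` and `classify_bump` are hypothetical and are not part of Dependabot or `typos`; the sketch also assumes plain `MAJOR.MINOR.PATCH` tags with an optional leading `v`.

```python
def parse_version(text):
    """Split 'v1.24.6' or '1.24.6' into a (major, minor, patch) tuple of ints."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def classify_bump(old, new):
    """Return 'major', 'minor', or 'patch' for the first component that differs.

    Returns 'none' when both versions are identical.
    """
    old_parts = parse_version(old)
    new_parts = parse_version(new)
    for label, before, after in zip(("major", "minor", "patch"), old_parts, new_parts):
        if before != after:
            return label
    return "none"


# The two tags from the diff: only the minor component changes first.
print(classify_bump("v1.24.6", "v1.25.0"))  # prints "minor"
```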

1 comment on commit ba739d3

@github-actions

LuxLib Benchmarks
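
Each row in the table that follows gives a benchmark name, the timing for the current commit, the timing for the previous commit, and their ratio (current divided by previous, so values above 1 mean the current commit was slower). A minimal sketch of reading the rows and flagging large slowdowns is given here. The sample rows are copied from the table, while the `parse_row` and `regressions` helpers and the 1.5 threshold are illustrative assumptions, not the benchmark bot's actual logic. The commit changes only a CI workflow line, so outliers such as the 16.37 AMDGPU row plausibly reflect run-to-run noise rather than a change in library code.

```python
import re

# Sample rows copied verbatim from the benchmark table.
ROWS = [
    "layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5541 ns 6104.5 ns 0.91",
    "layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 11174375 ns 682487 ns 16.37",
    "bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 1667 ns 2.25",
]

# A row is "<name> <current> ns <previous> ns <ratio>".
ROW_PATTERN = re.compile(
    r"^(?P<name>.+) (?P<current>[\d.]+) ns (?P<previous>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Return (name, current_ns, previous_ns, reported_ratio) for one table row."""
    match = ROW_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized benchmark row: {line!r}")
    return (
        match["name"],
        float(match["current"]),
        float(match["previous"]),
        float(match["ratio"]),
    )


def regressions(rows, threshold=1.5):
    """List (name, ratio) for rows whose current/previous ratio exceeds threshold.

    The threshold is a hypothetical cutoff chosen for illustration.
    """
    flagged = []
    for line in rows:
        name, current, previous, _reported = parse_row(line)
        ratio = current / previous
        if ratio > threshold:
            flagged.append((name, round(ratio, 2)))
    return flagged


for name, ratio in regressions(ROWS):
    print(f"{ratio:6.2f}x  {name}")
```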

Benchmark suite Current: ba739d3 Previous: e6dd65c Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5541 ns 6104.5 ns 0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5208.5 ns 6125 ns 0.85
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6834 ns 7166 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4917 ns 6042 ns 0.81
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 102997 ns 105660 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 422395 ns 401954 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10125 ns 9979 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10167 ns 10000 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9917 ns 10125 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10020.5 ns 10063 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 530333 ns 495391 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 11174375 ns 682487 ns 16.37
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 2854 ns 1812 ns 1.58
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1375 ns 1708 ns 0.81
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 1667 ns 2.25
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 2792 ns 2104 ns 1.33
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 19948 ns 20067 ns 0.99
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 33501 ns 31000 ns 1.08
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3834 ns 4041 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4250 ns 3625 ns 1.17
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4208 ns 4542 ns 0.93
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4416 ns 4250.5 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 131207.5 ns 133056 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 146692 ns 146031 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58167 ns 58042 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39792 ns 39959 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38209 ns 39792 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83208 ns 83333 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36515 ns 36918.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 80481 ns 76900 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2038875 ns 2030417 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2083750 ns 2081666.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2035541 ns 2084437 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2003250 ns 2002333 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 217066 ns 220443 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1203774 ns 1433294 ns 0.84
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 146333.5 ns 146500 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 147458 ns 164208.5 ns 0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 174542 ns 150937.5 ns 1.16
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 150167 ns 189709 ns 0.79
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167907.5 ns 166381.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 171622 ns 187972 ns 0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1119853.5 ns 1113437 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1129187.5 ns 1109375 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1072541 ns 1117083.5 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1117229.5 ns 1112084 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 620063 ns 646028 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1023002 ns 1026270 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5021.5 ns 6250.5 ns 0.80
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5083 ns 4917 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6417 ns 5562.5 ns 1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4584 ns 4708 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 79500 ns 82687 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 59431 ns 59005.5 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8833 ns 8958 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8458 ns 8833 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9083 ns 9167 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8958 ns 8875 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 540188.5 ns 554954 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 390145 ns 384224 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17750 ns 18208 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17000 ns 22250 ns 0.76
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22125 ns 20500 ns 1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18146 ns 17833.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 61981.5 ns 62129 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78051 ns 77001 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212750 ns 234334 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 257833 ns 229500 ns 1.12
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221375 ns 224000 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221750 ns 219041.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 323096 ns 329979.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 463260 ns 465894 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 666 ns 584 ns 1.14
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 708 ns 0.88
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 875 ns 750 ns 1.17
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 645.5 ns 0.97
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 18860 ns 19107 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 30120 ns 32171 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1458 ns 1458 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1375 ns 1334 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1625 ns 1542 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 114822.5 ns 114910.5 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 123847 ns 124841 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7500 ns 7417 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5333 ns 5354.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5333 ns 5458 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10459 ns 10042 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23715.5 ns 23654 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 46501 ns 48941 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 227792 ns 256833 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 241750 ns 269917 ns 0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 241584 ns 269000 ns 0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 227125 ns 213417 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 188481.5 ns 184585 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 591832 ns 588346 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4084 ns 4084 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4125 ns 4084 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 4125 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4125 ns 4083 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23784 ns 23536 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 45550 ns 47570 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16750 ns 16500 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16792 ns 16667 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16791 ns 17042 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16500 ns 16500 ns 1
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 184666.5 ns 185621 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 171442 ns 171902 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 493292 ns 493500 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 312833 ns 313000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 310584 ns 312583 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 847917 ns 847333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113490 ns 113322 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 243193 ns 242543 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2121291 ns 2121250 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1584833 ns 1582666 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1574875 ns 1584000 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3034896 ns 3043250.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 228348 ns 230454 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 739108 ns 746137 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 7021 ns 7000.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6792 ns 6479.5 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7958 ns 6708 ns 1.19
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6875 ns 6458 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 82934 ns 83715.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 57300 ns 59480 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11520.5 ns 12396 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11708 ns 11500 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12062.5 ns 12104.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10896 ns 11333.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 598177.5 ns 600141.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 401725 ns 410324 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 541 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 541 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23280.5 ns 23331 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 48351 ns 51010 ns 0.95
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2083 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2166 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2209 ns 2167 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2166 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 217524 ns 233774 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 178702 ns 182892 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8542 ns 8417 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9229.5 ns 9563 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11042 ns 10021 ns 1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8042 ns 8583 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 92171 ns 110268 ns 0.84
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 76060.5 ns 71861 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 19125 ns 18042 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18895.5 ns 18416.5 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 19375 ns 19083.5 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18458 ns 18187.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 534402.5 ns 612118 ns 0.87
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 379154 ns 379663 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33745.5 ns 34018 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 45241 ns 48210 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9104 ns 9000 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9583 ns 9250 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9187.5 ns 9541.5 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10042 ns 9187.5 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 242113 ns 263691 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 367124 ns 363818.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398958 ns 399291 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215291 ns 215375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 213750 ns 215291 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756041 ns 756375 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111898 ns 111229 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 77281 ns 74750 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1396458 ns 1397958 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 859875 ns 860270.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 847958 ns 859500 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2356833.5 ns 2356875 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 199002 ns 199160 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 322423 ns 325203 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7250 ns 7458.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7625.5 ns 7583 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9062.5 ns 8250 ns 1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7229 ns 7188 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 126183.5 ns 138757.5 ns 0.91
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 57821 ns 59831 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16959 ns 12708.5 ns 1.33
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14354.5 ns 16250 ns 0.88
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14792 ns 16708 ns 0.89
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15042 ns 12250 ns 1.23
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 851673 ns 903568 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 420849.5 ns 426569.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 32959 ns 25146 ns 1.31
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 29083.5 ns 29875 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 30875 ns 29563 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25770.5 ns 28708 ns 0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 184566 ns 186563 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 110921 ns 112512 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 160875 ns 158917 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 124458 ns 155729 ns 0.80
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 145396 ns 147416.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 157729 ns 143875 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1005586 ns 1016648 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 576731 ns 580615 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75875 ns 74583 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75042 ns 75291 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 80959 ns 84145.5 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74437.5 ns 80750 ns 0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 190691 ns 192007 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 124242 ns 121601 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 300833 ns 303292 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 322542 ns 318458 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 298292 ns 310583.5 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219396 ns 286500 ns 0.77
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1023572 ns 1028367 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 692382 ns 694997 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 13000 ns 13208 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 13500 ns 13209 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14833 ns 14416.5 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 13208 ns 12583 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 136120 ns 137690 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 234302 ns 235293 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27083.5 ns 25916.5 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26395.5 ns 26042 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27146 ns 27125 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27770.5 ns 27750 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 907766 ns 917440.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 693402 ns 677137 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11500 ns 11021.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 10875 ns 12104 ns 0.90
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13249.5 ns 12667 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11666 ns 11084 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 119510.5 ns 118805.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 240667.5 ns 238257.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 23021 ns 22625 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 23312.5 ns 23354.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 23917 ns 23500 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 22708 ns 23125 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 664160.5 ns 678428 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 675107 ns 679757 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66750 ns 66333 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 63542 ns 64583.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 68709 ns 68500 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 65000 ns 64792 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 101310 ns 101302 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 234673 ns 234893 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 466062.5 ns 486625 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 478625 ns 486083 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 472875 ns 478646 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 518125 ns 464625 ns 1.12
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 484379 ns 490708 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 712597 ns 709767 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7479 ns 7562.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7687.5 ns 7875 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9958 ns 8500 ns 1.17
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7667 ns 7292 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 134386 ns 136584.5 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 57600 ns 57580 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15750 ns 14459 ns 1.09
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16333 ns 14417 ns 1.13
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15250 ns 14625 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15291 ns 16625 ns 0.92
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 880162.5 ns 882666 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 398914 ns 396884 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6151875 ns 6159458 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 3226750 ns 3225666 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 3223292 ns 3225333 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11913583 ns 11918958 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 350966 ns 345241.5 ns 1.02
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 302008 ns 301508 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19126979 ns 19144854.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 11161229.5 ns 11111958.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 11077916 ns 11126458 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36533646 ns 36537562.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1006948.5 ns 1009913 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1127082 ns 1164436.5 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1042 ns 1083 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1042 ns 1125 ns 0.93
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1042 ns 1042 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1000 ns 1041 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23502 ns 23469 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 209393 ns 209702 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3958 ns 4000 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4083 ns 4000 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4041 ns 4000 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3917 ns 4000 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 270232 ns 270402 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 623846 ns 624936 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7833 ns 7896 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8042 ns 7624.5 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9750 ns 9041 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7625 ns 8792 ns 0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 116542 ns 116551 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 69700 ns 67301 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12375 ns 12375 ns 1
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 12458 ns 12354.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12917 ns 13458 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12292 ns 11521 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 604932 ns 608379 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 357073.5 ns 355544 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22511.5 ns 22683 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 46531 ns 48621 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3167 ns 2917 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3166 ns 3000 ns 1.06
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3333 ns 3458 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2875 ns 2917 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 194011 ns 194883.5 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 158126.5 ns 160881 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12125 ns 11833 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12333 ns 11771 ns 1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13708 ns 12666 ns 1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11937.5 ns 11708 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 115429.5 ns 114987 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 237322 ns 237082 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22000 ns 22270.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24459 ns 23625 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23396 ns 23145.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21792 ns 22417 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 554065.5 ns 559620 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 651546.5 ns 657467.5 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4375 ns 4417 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4416 ns 4417 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4291 ns 4375 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24232 ns 23954 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 48651 ns 47821 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16208 ns 16375 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16500 ns 16375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16042 ns 16500 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16250 ns 16250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 316149 ns 319321 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 208227 ns 205182 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2083 ns 2209 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2083 ns 2208 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2083 ns 2209 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2000 ns 2084 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34761 ns 34739 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 205252 ns 207283 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 17937.5 ns 17729.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 19271 ns 19291.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 18584 ns 19125 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 18375 ns 17500 ns 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 283100 ns 284503 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 682562.5 ns 683047 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59229.5 ns 58771 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 60896 ns 61500 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 60959 ns 62167 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 53792 ns 51041 ns 1.05
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66317 ns 66683 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 100931 ns 96771 ns 1.04
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 195625 ns 189875 ns 1.03
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 149417 ns 148499.5 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 138292 ns 141104 ns 0.98
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 219291 ns 271312 ns 0.81
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 208292.5 ns 208001 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 554746 ns 556366 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 85062 ns 83188 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 127458 ns 116270.5 ns 1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86104 ns 87667 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 86812.5 ns 88791 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192707 ns 190555.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 169152 ns 168726.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1926791.5 ns 1885521 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1918312.5 ns 1906833 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1895083 ns 1922167 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1862750 ns 1922208.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 503729 ns 505315 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 915670 ns 918625.5 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21463.5 ns 21748.5 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 41990 ns 40920 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 244422 ns 243459 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 183082 ns 176522 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 11375 ns 11042 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 10292 ns 9834 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 12166 ns 11166.5 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9084 ns 9417 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 113574.5 ns 115799 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 237182 ns 235862 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9583 ns 9916 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 12396 ns 11000 ns 1.13
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10750 ns 10437.5 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9458 ns 9625 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 489512 ns 492386 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 632057 ns 634956.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57959 ns 58666 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39208 ns 39500 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38708 ns 39333 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83375 ns 83750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38522 ns 38435 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78311 ns 79261 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1724708.5 ns 1932333.5 ns 0.89
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1941208 ns 1949916 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1947834 ns 1971250 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1891208.5 ns 1900375 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 210148.5 ns 211772 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 998640 ns 1010796 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 269083 ns 276583 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 268833 ns 268541 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 275875 ns 270583.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 269729.5 ns 269542 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 193164 ns 196349 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 282737.5 ns 281833 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 587166.5 ns 662208 ns 0.89
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 614875 ns 709250 ns 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 651500 ns 685042 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 652062 ns 690770.5 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 993619.5 ns 994716 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 899480 ns 902690 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2202416 ns 2181125 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2216125 ns 2197167 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2192812.5 ns 2214166 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2220500 ns 2217666 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 179761.5 ns 156988.5 ns 1.15
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 415294 ns 421825 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5520708 ns 5477291.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5537000 ns 5530250 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5449958.5 ns 5519334 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5515167 ns 5543313 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 930917 ns 938151 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1711728 ns 1722729 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 477542 ns 478167 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 257375 ns 257208 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 255375 ns 257292 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 908666 ns 908750 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46830 ns 46532.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 245313 ns 246353 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2116979 ns 2133375 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1589770.5 ns 1588083 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1579645.5 ns 1587417 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3037833.5 ns 3041125 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 274670.5 ns 256675 ns 1.07
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 769148 ns 775668 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57875 ns 58000 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39000 ns 39625 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38458 ns 39375 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83333 ns 83500 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28067 ns 27930.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75041 ns 73260 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2047334 ns 2017271 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2049854.5 ns 2083062.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2059333 ns 2080584 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1987666.5 ns 1994312.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 227893 ns 224353 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1038901 ns 1036751 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58000 ns 58292 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39333 ns 39917 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38333 ns 39750 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83125 ns 83458 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48807.5 ns 48290 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 67171 ns 69781 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1934875 ns 1920208 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1962667 ns 1966666.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1938167 ns 1956354.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1827396 ns 1892750 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 233324 ns 231868 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 914834.5 ns 917180 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 34314.5 ns 33423 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 45171 ns 47961 ns 0.94
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6542 ns 6750 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7083 ns 6625 ns 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7000 ns 6916 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6958 ns 6542 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 202653 ns 205663 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 366114 ns 364303.5 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 333 ns 0.75
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32763 ns 31975 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 38131 ns 40370 ns 0.94
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2792 ns 3667 ns 0.76
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3000 ns 3625 ns 0.83
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3459 ns 3209 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2875 ns 3250 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 184852 ns 182875 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 151962 ns 146242 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 494188 ns 468625 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 500333.5 ns 492396 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 470041.5 ns 470250 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 489437 ns 466354 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 134801.5 ns 134348 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 322243 ns 349229 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4053479 ns 4091499.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4072375 ns 4078417 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4033500 ns 4081499.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4070625 ns 4051646 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 680027 ns 673570.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1463545 ns 1482381 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49933854 ns 49972812 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 26023000 ns 26026291 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 25982541.5 ns 25991500 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97045646 ns 97072458 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1626445 ns 1599973.5 ns 1.02
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1047410 ns 1057326.5 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 155000104.5 ns 154932104.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 89050542 ns 89308062.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 88666916.5 ns 88895875 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 295479666.5 ns 295925812.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6477658 ns 6475879 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5560101.5 ns 5578679 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 20062.5 ns 18917 ns 1.06
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 15500 ns 16000 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 13833.5 ns 13708 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15708.5 ns 16437.5 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 20427 ns 19926 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 25781 ns 27550 ns 0.94
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11063 ns 10937 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 7895.5 ns 7770.5 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 7937.5 ns 7708 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17375 ns 17291 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 248558 ns 243495.5 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 143922 ns 147112 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8417 ns 8750 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 10229 ns 9708.5 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10375 ns 10667 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8646 ns 8646 ns 1
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 119635 ns 119480.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 239173 ns 237342 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10041.5 ns 10312.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10667 ns 11041 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10750 ns 10667 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10145.5 ns 10770.5 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 591757 ns 585828 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 654107 ns 655982 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10375 ns 10020.5 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9770.5 ns 9333 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11312.5 ns 10396 ns 1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9500 ns 9500 ns 1
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 117527.5 ns 115334.5 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 72401 ns 70430.5 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 14292 ns 15292 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 17708 ns 17375 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14834 ns 15542 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14750 ns 16250 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 562161 ns 558960.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 345113 ns 346234 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 34287 ns 33420.5 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 207072 ns 208233 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8625 ns 8875 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9667 ns 8917 ns 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8667 ns 9375 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8687.5 ns 8125 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 224465.5 ns 223663.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 658996 ns 660067.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 17292 ns 15833 ns 1.09
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 13771 ns 14958 ns 0.92
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 12458.5 ns 13166.5 ns 0.95
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10770.5 ns 12042 ns 0.89
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 20290 ns 20351 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 186982 ns 188642 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 35625 ns 35334 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 35625 ns 35396 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 35834 ns 35354.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 35666 ns 35459 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 261247.5 ns 258908.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 589266 ns 593676 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 450208 ns 453584 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 494583.5 ns 448854.5 ns 1.10
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 456791.5 ns 458979 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 461833 ns 463708 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194699 ns 194627 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 360324 ns 361629 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4069833 ns 4069291 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4063479 ns 4057666 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4038041.5 ns 4066166.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4038167 ns 4041000 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 514235 ns 509044 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1354948.5 ns 1369935 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 788948625 ns 786136291 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 416422208.5 ns 416023146 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 415183312.5 ns 416822792 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1509932250 ns 1513689687.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22522291.5 ns 22552578.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14572928 ns 14622705 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2530024250 ns 2527797917 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1506878542 ns 1507508250 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1519381125 ns 1513719042 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4752439166 ns 4744640792 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118941901 ns 119636395 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 87857404.5 ns 87882829 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 77417 ns 78083.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 77625 ns 79375 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79500 ns 79292 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76875 ns 77417 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 194658.5 ns 195081 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 106561 ns 106236.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 284458 ns 291584 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 286188 ns 232333.5 ns 1.23
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 197750 ns 275646 ns 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 192708 ns 268875 ns 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1005733 ns 999623 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 630306 ns 637827 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199829146 ns 199983542 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 104009479.5 ns 103920208 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 103995667 ns 103978083 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 389216083 ns 389299042 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5833781 ns 5843844.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3615787 ns 3606828 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 620952291.5 ns 620238542 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 354227354.5 ns 353393416.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 354977104.5 ns 352881646 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1182226250 ns 1193561791 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26559529 ns 26518526 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21846736 ns 22094133 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7250 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5375 ns 5375 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5250 ns 5375 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10292 ns 9875 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27179 ns 26733.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48210 ns 46490 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212666.5 ns 220979 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222542 ns 224417 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221917 ns 223500 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206167 ns 207583 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 217340.5 ns 215495 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 523165 ns 519876 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8708 ns 10312.5 ns 0.84
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8958 ns 9479 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10667 ns 9895.5 ns 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8813 ns 9937.5 ns 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 115467 ns 113347 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 73431 ns 71090 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7584 ns 9604 ns 0.79
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 11521 ns 11437.5 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 10042 ns 0.85
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8062.5 ns 10145.5 ns 0.79
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 494404 ns 491382 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 316873 ns 314464 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 708 ns 0.71
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 708 ns 709 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 708 ns 583 ns 1.21
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 25358 ns 24930.5 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 47920 ns 48911 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9250 ns 12375 ns 0.75
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 11396 ns 14958 ns 0.76
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10875 ns 9000 ns 1.21
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9750 ns 9666 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 246651 ns 246496 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 388584 ns 386995 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 110834 ns 110750 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 87791 ns 90417 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 87792 ns 88125 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 154959 ns 155146 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 23405 ns 23300 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 189432 ns 190702 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 539625 ns 534625 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 562458 ns 562249.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 535812.5 ns 542812.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 535000 ns 535250 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 220513 ns 217557.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 604586.5 ns 610017 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5354 ns 5375 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 7042 ns 6709 ns 1.05
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 8229.5 ns 7375 ns 1.12
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6541 ns 6520.5 ns 1.00
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17715 ns 17156 ns 1.03
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 71815.5 ns 71171 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11750 ns 12833 ns 0.92
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 11459 ns 11375 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 10792 ns 10145.5 ns 1.06
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 17125 ns 16708.5 ns 1.02
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 206057.5 ns 204040 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 379023.5 ns 364443 ns 1.04
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39250 ns 38834 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51250 ns 50542 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 50583 ns 51417 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13750 ns 13854.5 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 21128.5 ns 21940 ns 0.96
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 84216 ns 84996 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36208 ns 36917 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 30584 ns 31042 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 29250 ns 28125 ns 1.04
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57375 ns 77979.5 ns 0.74
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 184668 ns 180753 ns 1.02
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 414734 ns 397599 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1583.5 ns 1854.5 ns 0.85
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 2000 ns 1958 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2187 ns 2209 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1833.5 ns 1666.5 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 19835 ns 19375 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 25650 ns 27490 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2292 ns 2208 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2459 ns 2167 ns 1.13
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2458 ns 2416 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2187.5 ns 2125 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 197459.5 ns 194356 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 134722 ns 136311 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5021 ns 5166.5 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5167 ns 5520.5 ns 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5500 ns 6396 ns 0.86
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5959 ns 5187.5 ns 1.15
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 141255 ns 140899.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 59291 ns 57270 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8396 ns 9020.5 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9208 ns 9437.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9791 ns 8583 ns 1.14
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8417 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 823637 ns 815402.5 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 383144 ns 388544 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 54917 ns 55083 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 54291 ns 54292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 54250 ns 54375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 56541 ns 56417 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37246 ns 36794 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 204842 ns 206892 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 477000 ns 478792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 496604 ns 535375 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 494271 ns 496937 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 467792 ns 474395.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 259843 ns 257604 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 794468 ns 810628 ns 0.98
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3306791 ns 3331771 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1761916 ns 1763000 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 1756167 ns 1769417 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6310604.5 ns 6317646 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 205873.5 ns 204848.5 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 214142 ns 209783 ns 1.02
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11469395.5 ns 11521375.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 6567229 ns 6550500 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 6474021 ns 6561792 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21232020.5 ns 21242604 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 743103.5 ns 741852 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1064100 ns 1060031 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7125 ns 6292 ns 1.13
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4791 ns 5666 ns 0.85
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7042 ns 7042 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5333 ns 5209 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 130642.5 ns 132073.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 55570 ns 54021 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7333 ns 10375 ns 0.71
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8500 ns 9584 ns 0.89
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7417 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7625 ns 7667 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 721790 ns 718413.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 371284 ns 375894 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 124000 ns 144542 ns 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 105458 ns 124479.5 ns 0.85
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 100416.5 ns 101625 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 93688 ns 150583 ns 0.62
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 149649.5 ns 148583.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 203312 ns 182281 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020750 ns 2030666.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2021041 ns 2034833.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1993771 ns 2034166.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2025000 ns 2024125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 676279 ns 674148 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1107011 ns 1114502 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 33958.5 ns 32917 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 34334 ns 35208 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 32584 ns 33334 ns 0.98
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 708 ns 645.5 ns 1.10
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16105 ns 15722 ns 1.02
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 78881 ns 79041 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2479.5 ns 3208 ns 0.77
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 4000 ns 3958 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3125 ns 3084 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2292 ns 2333 ns 0.98
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 139246 ns 136962.5 ns 1.02
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 352743.5 ns 340914 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7209 ns 7292 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5417 ns 5417 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5291 ns 5333 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10208 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36300 ns 35974 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49595.5 ns 50280 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 217854 ns 215209 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222916.5 ns 228896 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220604.5 ns 220729.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206125 ns 205917 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 241210 ns 240303 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 515535 ns 519340 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22201 ns 21966 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 41991 ns 42521 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14708 ns 14709 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14708 ns 14792 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14750 ns 14834 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14708 ns 14708 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 301554 ns 299460 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 195902 ns 188891.5 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 116166.5 ns 128584 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 130416 ns 128208 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 104479 ns 106604 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 105250 ns 119354 ns 0.88
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 135232 ns 132553 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 169232 ns 183902 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1928583 ns 1924833.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1925875 ns 1932167 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1895041.5 ns 1926479 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1745875 ns 1925542 ns 0.91
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 664669 ns 662628 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1220022.5 ns 1065881 ns 1.14
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18583 ns 17958 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18792 ns 18625 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22250 ns 20812 ns 1.07
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18250 ns 19584 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107671 ns 104706.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77341 ns 81176 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216667 ns 217417 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216667 ns 265209 ns 0.82
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217812.5 ns 222291 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 227125 ns 222917 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 497386 ns 497576 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 470184 ns 466715 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 26145.5 ns 24687 ns 1.06
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 28562 ns 29083 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 26792 ns 27250 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1458 ns 1417 ns 1.03
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16337 ns 16449.5 ns 0.99
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 86810 ns 80571 ns 1.08
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4875 ns 4729.5 ns 1.03
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5104 ns 5917 ns 0.86
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5333 ns 5459 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4833 ns 4875 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 203656 ns 201398 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 391324 ns 373024 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222125 ns 223084 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222583 ns 223479.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 226333 ns 225458.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 223333 ns 222541 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 222346 ns 220423 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 273793 ns 274373 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 500833 ns 497687.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 504334 ns 497958 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 498167 ns 501646 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 497542 ns 507125 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1053089 ns 1033721 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 851353.5 ns 858214 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20667 ns 20625 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20313 ns 22500 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23083 ns 21791 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20000 ns 20042 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 113758.5 ns 112240 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79011 ns 77390 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213084 ns 213084 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213541 ns 218104.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214291 ns 219292 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215500 ns 217125 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 724087 ns 716111 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 538870.5 ns 532795 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6666 ns 6708 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6666.5 ns 7416 ns 0.90
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9125 ns 8166 ns 1.12
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6584 ns 6791 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 134050 ns 133925.5 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 67330 ns 65140 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10875 ns 9709 ns 1.12
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10603.5 ns 12458 ns 0.85
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10584 ns 11125 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10750 ns 10583 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 782883 ns 779907 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 386274 ns 379434 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5000 ns 7250 ns 0.69
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4625 ns 5250 ns 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6541 ns 6834 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6375 ns 4917 ns 1.30
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 136660 ns 135559.5 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 58460 ns 56400 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7667 ns 7542 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7916.5 ns 7792 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7750 ns 7875 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7750 ns 7625 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 747431 ns 742169 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 392653 ns 389854 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14573000 ns 14503334 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 7702333.5 ns 7723249.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 7661229.5 ns 7705416.5 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27919750 ns 27810125 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 552572 ns 535378 ns 1.03
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 402049 ns 390439 ns 1.03
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46551750 ns 46519500 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 26549208 ns 26614709 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 26263166.5 ns 26530062.5 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85671542 ns 85657500 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 3391019 ns 2847450.5 ns 1.19
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3300103 ns 3284834 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 67042 ns 68958 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 67375 ns 69084 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 70583 ns 68500 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 68291 ns 68166 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 103426.5 ns 104098 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 229352.5 ns 232172 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 468625 ns 480417 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 497666.5 ns 475791 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 469292 ns 474812.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 468500 ns 481041.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 709808.5 ns 714971 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 786728 ns 793828 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 750 ns 0.78
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32664 ns 32749 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47181 ns 49671 ns 0.95
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8833 ns 9875 ns 0.89
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9750 ns 9875 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9708 ns 9375 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9792 ns 9208 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 281049 ns 282467 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 373464 ns 373314 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9666 ns 9708 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9708 ns 9708 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9625 ns 9625 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9666 ns 9666 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23531 ns 23485 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 211602 ns 211472 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 50250 ns 50208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 50250 ns 50042 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 50125 ns 50709 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 50167 ns 50209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 276186.5 ns 277646 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 603776 ns 614117 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 54916 ns 55291 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 54333 ns 54458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 54292 ns 54334 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 56125 ns 56458 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28315 ns 28038.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 204202 ns 206412 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 515312.5 ns 479020.5 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 495208 ns 525042 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 494875 ns 499937 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 465271 ns 462667 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 238356 ns 240355 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 843049 ns 838988 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 657146 ns 609500 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 678750 ns 661417 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 625021 ns 659375 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 649917 ns 653812.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189901 ns 192690.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 230582 ns 262482 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2239292 ns 2226104 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2249895.5 ns 2247458 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2176354.5 ns 2238104 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2265625 ns 2244458.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 926422 ns 927304 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1211101.5 ns 1364114 ns 0.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21083 ns 20208 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22187.5 ns 22354.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23666 ns 22167 ns 1.07
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19959 ns 19375 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112183.5 ns 109169 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 81261 ns 77150.5 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 254333 ns 222958 ns 1.14
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220666 ns 220604.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220750 ns 227521 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226708 ns 225417 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 705957 ns 712641 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 548680 ns 558770.5 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 583 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23346 ns 23081 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 47671 ns 48321 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9500 ns 9208.5 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9917 ns 9250 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9959 ns 10666 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10083 ns 9791.5 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 260912 ns 263338 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 400874 ns 399114 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10500 ns 10500 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8895.5 ns 8770.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11625 ns 10499.5 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8750 ns 10083 ns 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 116855 ns 115864 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 67861 ns 68530 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7687.5 ns 7917 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8000 ns 7750 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 8125 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7812.5 ns 7875 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 481589 ns 487126 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 324483 ns 322433 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1666 ns 1708 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2042 ns 1667 ns 1.22
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2104.5 ns 2125 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1459 ns 1541 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 19805 ns 19744 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 190981 ns 191542 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3520.5 ns 3584 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3792 ns 3708.5 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3854.5 ns 3937.5 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3583 ns 3625 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 211153.5 ns 212174.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 578046 ns 580786 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 147645.5 ns 147562.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 106542 ns 106562 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 106708.5 ns 107333 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225875 ns 225583 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 23334 ns 23301 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 35995.5 ns 34030 ns 1.06
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 144708 ns 160417 ns 0.90
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 104000 ns 87959 ns 1.18
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 87625 ns 100250 ns 0.87
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 252562.5 ns 252167 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 210178 ns 211748 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 230212 ns 214182 ns 1.07
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7125 ns 7291 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5375 ns 5333 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5292 ns 5250 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10250 ns 10417 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33945.5 ns 33560.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49690 ns 50310 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219375 ns 253958.5 ns 0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 260458 ns 253021.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228500.5 ns 235708 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222499.5 ns 212792 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 257172 ns 260417 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 523825 ns 524496 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13625 ns 12375 ns 1.10
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 13479 ns 12583 ns 1.07
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15125 ns 13896 ns 1.09
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 13333 ns 12792 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 132277 ns 134512.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 234872 ns 235902 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24084 ns 23959 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23645.5 ns 24479.5 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24708.5 ns 25291 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24459 ns 24583 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 830067.5 ns 831522 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 681347 ns 684542 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9792 ns 9708 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 10063 ns 9917 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11375 ns 11625 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9291.5 ns 9209 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 120374.5 ns 120339 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 73601 ns 72241 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14541 ns 13750 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14813 ns 14187.5 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14812.5 ns 15083.5 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14875 ns 14084 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 637361.5 ns 638601 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 368293 ns 363914 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10333 ns 9208.5 ns 1.12
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9687.5 ns 10000.5 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12041.5 ns 11166 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10125.5 ns 10167 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 119012 ns 118694 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 73051 ns 72320 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12792 ns 13208.5 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13395.5 ns 13020.5 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13375 ns 13396 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13166 ns 12292 ns 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 525610 ns 529419 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 342408 ns 342414 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 31416.5 ns 30416.5 ns 1.03
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 32520.5 ns 33666.5 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 28917 ns 30542 ns 0.95
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 2167 ns 1917 ns 1.13
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16642 ns 16576 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 78711 ns 77461 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5583.5 ns 5291.5 ns 1.06
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 4958 ns 4896 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5250 ns 5291.5 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6584 ns 6417 ns 1.03
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 137549 ns 137601 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 383954 ns 379919 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 24843 ns 24898 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48221 ns 49280 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6750 ns 0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6708.5 ns 6500 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6916.5 ns 6916.5 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6875 ns 6667 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 183051 ns 184245 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391009 ns 386844 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 1958 ns 2125 ns 0.92
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2042 ns 2167 ns 0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2084 ns 2084 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2041 ns 2083 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 25908 ns 25661 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 207502 ns 208752 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17333.5 ns 17250 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17333 ns 17292 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17625 ns 18584 ns 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18000 ns 18416.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 266084 ns 269097.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 691847 ns 693937 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 153459 ns 150875 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 175583.5 ns 177416.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 150250 ns 153625 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 150417 ns 157791 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192072 ns 191062 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 176432 ns 174992 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1193541 ns 1338521 ns 0.89
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1327291.5 ns 1328479 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1298166.5 ns 1328250 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1330166.5 ns 1330083.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 864717 ns 866603 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1114311 ns 1114201.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25604.5 ns 26208.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25333 ns 29479.5 ns 0.86
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28625 ns 27062.5 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25541 ns 24833 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 232128 ns 228889.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 115071 ns 116211 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 118791.5 ns 117584 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 126708 ns 140791 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 118625 ns 126021 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 117979 ns 119916.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 994805 ns 992184 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 588415.5 ns 594546 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 250 ns 334 ns 0.75
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 334 ns 334 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23227 ns 23038 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 46150 ns 49341 ns 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6833 ns 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6750 ns 6604 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6958 ns 7042 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6750 ns 6791 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 199656 ns 200303.5 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 393763.5 ns 388994 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6250 ns 6375 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6500 ns 5875 ns 1.11
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7291.5 ns 7812.5 ns 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5291 ns 6458 ns 0.82
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 137884.5 ns 139406.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 233922 ns 235513 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10104.5 ns 10083.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10125 ns 10167 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10562.5 ns 10417 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10250 ns 9959 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 853228 ns 853447 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 672507 ns 676147 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 708 ns 750 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 708 ns 750 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 667 ns 1.12
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 708 ns 750 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22896 ns 23007 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 209942 ns 209722 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4834 ns 4958 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5042 ns 5000 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5125 ns 5125 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4834 ns 4917 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 220625.5 ns 221201.5 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 580650 ns 585401 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8750 ns 8708 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8708 ns 8833.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10395.5 ns 9812.5 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8167 ns 8625 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 118921.5 ns 118248.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 71421 ns 71271 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8292 ns 8959 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8791 ns 9041.5 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8958 ns 9333.5 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8916 ns 8687.5 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 567449 ns 566922 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 346934 ns 343484 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 125791.5 ns 126584 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 96000 ns 96271 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 96187.5 ns 96479.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 181542 ns 183375 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46439 ns 46672 ns 1.00
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 93231 ns 99821 ns 0.93
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 302834 ns 330333 ns 0.92
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 166542 ns 166292 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 166917 ns 170250 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 567708 ns 572041.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 186141 ns 187343 ns 0.99
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 466525 ns 487975 ns 0.96
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 398250 ns 398958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215167 ns 215334 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 214291 ns 215041 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756250 ns 753500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43722 ns 43980 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 80301 ns 81451 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1402813 ns 1401520.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 862208 ns 862917 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 854333 ns 861417 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2359583.5 ns 2361042 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 247149 ns 253211 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 350254 ns 349378.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 657333 ns 651917 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 621958.5 ns 658334 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 628854 ns 662479 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 542146 ns 579395.5 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 185394 ns 189789 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 258293 ns 261218 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2469895.5 ns 2487416 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2491916.5 ns 2468708 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2389875 ns 2451333 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2478250 ns 2415666 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 934339.5 ns 951768.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1448647.5 ns 1454255 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 34271 ns 33000 ns 1.04
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 34250.5 ns 36083.5 ns 0.95
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 32312.5 ns 32167 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 916.5 ns 1041.5 ns 0.88
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16189.5 ns 16094 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 71551 ns 77491 ns 0.92
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3166.5 ns 3187 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3437.5 ns 3208 ns 1.07
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3541 ns 3417 ns 1.04
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3125 ns 3209 ns 0.97
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 134833 ns 136515 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 339494 ns 349978 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 437000 ns 437166.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 432458 ns 433083 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 432833 ns 434750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 449416 ns 449916 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 42351 ns 42836 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 238133 ns 238823 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4152625 ns 4154959 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4271667 ns 4268667 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4252417 ns 4254625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4062020.5 ns 4048000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 231247 ns 236422 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1229715 ns 1232498 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3959 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3875 ns 3916 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3916 ns 3917 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34451.5 ns 34298 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 38680 ns 40891 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15458 ns 15583 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15708 ns 15666 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15625 ns 15708 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15459 ns 15459 ns 1
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 252640 ns 255323 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 169682 ns 170142 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 403417 ns 403708 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 221209 ns 221167 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 220042 ns 220959 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760791 ns 756709 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113133 ns 113380 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 87381 ns 89671 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1431749.5 ns 1430083 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 886583 ns 886645.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 881812.5 ns 879208.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2383750 ns 2383084 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 229435.5 ns 238474 ns 0.96
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 350874 ns 354939 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 459 ns 625 ns 0.73
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 584 ns 584 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 24713 ns 24737 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 207622 ns 210152 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7458.5 ns 8042 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8041.5 ns 7750 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8292 ns 8020.5 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7792 ns 8084 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 202392.5 ns 206918.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 689378 ns 691747 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 833145.5 ns 829437 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 466667 ns 466125 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 467771 ns 467854 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1542833 ns 1548750 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130433 ns 130261 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 166542 ns 166677 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2696000 ns 2692000 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1539437.5 ns 1529979 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1533500 ns 1534291.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4930000 ns 4940020.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 233723 ns 232798.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 771469 ns 770132.5 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31721 ns 32356 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 48111 ns 48991 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6312.5 ns 6583 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6812.5 ns 6625 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6875 ns 6708 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6500 ns 6625 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 217171.5 ns 227984 ns 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 362335 ns 356278.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1777250 ns 1758084 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1758812.5 ns 1756792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1730917 ns 1737458 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1776250 ns 1733750 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 184219 ns 188495 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 354280 ns 357369 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4352917 ns 4372937 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4382542 ns 4370667 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4351834 ns 4369375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4391416 ns 4362583.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 837734 ns 853700 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1247440 ns 1252878 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6771 ns 6792 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7937.5 ns 7209 ns 1.10
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7333 ns 7333 ns 1
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6687.5 ns 7312.5 ns 0.91
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 22420 ns 22968 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 36840.5 ns 37681 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 45312.5 ns 48354 ns 0.94
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 48146 ns 69083 ns 0.70
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33917 ns 33542 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 52729.5 ns 44979 ns 1.17
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 206304 ns 210612 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 232673 ns 235022 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 22146 ns 21334 ns 1.04
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 23896 ns 24750 ns 0.97
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 22417 ns 22583.5 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5334 ns 5417 ns 0.98
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18024 ns 18352 ns 0.98
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 83860.5 ns 90001 ns 0.93
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12000 ns 12187 ns 0.98
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 9437.5 ns 9250 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 9583 ns 9625 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18250 ns 18375 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 218264 ns 219960 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 367444 ns 383514 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406417 ns 407000 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 223333 ns 223500 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 222292 ns 223250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762750 ns 762333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46291 ns 47174.5 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 88691 ns 90560 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1428625 ns 1429042 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 892375 ns 893625 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 886833 ns 893041 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2386333 ns 2387667 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 279641 ns 278164 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 379995 ns 378859 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 436833 ns 435708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 432708 ns 431625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 429500 ns 432333 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 449500 ns 450291 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52933 ns 54012 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 235598 ns 238112 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4147167 ns 4144125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4260354 ns 4245667 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4227333 ns 4258583 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4030354.5 ns 4033625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 252356.5 ns 257888 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1204784 ns 1222232 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9583 ns 9459 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7292 ns 7250 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7250 ns 7250 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 13500 ns 13458 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23984 ns 24527 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 212683 ns 211892 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 49416 ns 49500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 49459 ns 49708 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 49167 ns 49417 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 49625 ns 49208.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 333606 ns 339671 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 652008 ns 654987 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 106875 ns 125000 ns 0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 113729 ns 89417 ns 1.27
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 88666 ns 86583 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 89666.5 ns 120666.5 ns 0.74
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 191172 ns 191941.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 200642 ns 200372 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027750.5 ns 2022250 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2023896 ns 2017666.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1986666 ns 2024042 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2015667 ns 2020812.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 507573.5 ns 516999 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1086742.5 ns 1090611 ns 1.00
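
Reading the table: the Ratio column appears to be Current divided by Previous (for example, 104000 / 87959 ≈ 1.18), so values above 1 mean the current commit measured slower. Below is a minimal Python sketch for re-checking a row, assuming each row keeps the flattened `<name> <current> ns <previous> ns <ratio>` layout shown above. `parse_row` and `recomputed_ratio` are hypothetical helper names used for illustration only; they are not part of github-action-benchmark.

```python
def parse_row(line: str):
    """Split one flattened row into (name, current_ns, previous_ns, ratio).

    Assumes the row ends with: <current> ns <previous> ns <ratio>.
    The benchmark name itself may contain spaces, so we split from the right.
    """
    name, current, unit_a, previous, unit_b, ratio = line.rsplit(None, 5)
    if unit_a != "ns" or unit_b != "ns":
        raise ValueError(f"unexpected units in row: {line!r}")
    return name, float(current), float(previous), float(ratio)


def recomputed_ratio(current_ns: float, previous_ns: float) -> float:
    """Current / Previous, rounded to two decimals like the table."""
    return round(current_ns / previous_ns, 2)


# A row copied from the table above.
row = ("bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) "
       "104000 ns 87959 ns 1.18")
name, cur, prev, ratio = parse_row(row)
print(name, recomputed_ratio(cur, prev), ratio)
```

Many of the sub-microsecond CPU rows (a few hundred ns) swing by 10 to 25 percent between runs, so a single ratio far from 1 on such a row is weak evidence of a real change.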

This comment was automatically generated by workflow using github-action-benchmark.
