Add native support for BFloat16. #51470
Conversation
I don't think twice-precision stuff matters for either Float16 or BFloat16. Twice-precision Float16 is still basically unusable because it doesn't let you represent bigger or smaller values, and twice-precision BFloat16 is basically unusable because it only gives you 4 digits of accuracy. That said, IEEE compatibility matters roughly equally for both IMO.
It does, for ranges. Base has explicit tests for this behavior, at least for Float16, and since the pre-#37510 behavior (using UInt16) always extended to Float32 and demoted back to 16 bits after every operation, we couldn't break that, so we need the demote pass for Float16. With BFloat16, there's no such pre-existing behavior, and BFloat16s.jl doesn't test ranges. We may still want the demoting behavior if we ever want to implement ranges in the same way Base does, though (i.e. using TwicePrecision).
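For context, a minimal sketch of the kind of error-free transformation that TwicePrecision builds on (Knuth's two-sum). Its correctness proof assumes every individual operation is rounded to the working precision, which is exactly what the demote pass guarantees:

```julia
# Knuth's two-sum: recovers the exact rounding error of a + b.
# Assumes each +/- below is correctly rounded to T; if the back-end
# keeps excess precision across operations, `err` comes out wrong.
function two_sum(a::T, b::T) where {T<:AbstractFloat}
    s = a + b
    v = s - a
    err = (a - (s - v)) + (b - v)
    return s, err  # a + b == s + err, exactly
end
```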
ah. ranges. yes. 😢.
It is required, since you are declaring the results to be consistent in inference, and this would violate that assumption. OTOH, maybe we should just mark these intrinsics as not-consistent / unpredictable in inference, since that annotation isn't really valid for Float64 either anyway (NaN is not consistent).
Is the issue that LLVM will not round on every operation but we will?
The issue is that it is unknown and unpredictable what LLVM will do and what result you will get each time.
So one of the nice things about us doing the demote pass is that we had equivalent behavior between the software and hardware implementations. GCC added this much later under
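To make that equivalence concrete, a minimal sketch (hypothetical `sw_add`/`sw_mul` names) of the per-operation widen-and-demote scheme that both the software fallback and the demote pass implement for Float16:

```julia
# Every 16-bit op is performed in Float32 and immediately rounded back;
# a hardware implementation must produce bit-identical results to this.
sw_add(x::Float16, y::Float16) = Float16(Float32(x) + Float32(y))
sw_mul(x::Float16, y::Float16) = Float16(Float32(x) * Float32(y))
```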
Do we need to implement conversion functions?
Yes, as LLVM can emit calls to them. They are part of this PR.
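For reference, a hedged sketch (names illustrative) of the round-to-nearest-even truncation that such a runtime conversion routine (e.g. compiler-rt's `__truncsfbf2`) has to perform:

```julia
# Round-to-nearest-even Float32 -> bfloat16 bits (illustrative helper).
function trunc_to_bf16(x::Float32)
    u = reinterpret(UInt32, x)
    isnan(x) && return UInt16(0x7FC0)  # canonicalize to a quiet NaN
    # Add a bias so the truncating shift rounds to nearest, ties to even:
    bias = 0x00007fff + ((u >> 16) & 0x00000001)
    return UInt16((u + bias) >> 16)
end
```

For example, `trunc_to_bf16(1.0f0) == 0x3f80`, the bfloat16 bit pattern of 1.0.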
Added support for BFloat16 to the Float16 demote pass.
Maybe add a test to llvmpasses? I imagine just copying the float16 one but using bfloat.
The Base part is done here. I'll work on BFloat16s.jl now, so maybe we shouldn't merge this until I've validated this functionality there. While updating the demote pass, I noticed that on X86 we don't demote Float16 when we have avx512fp16, while on ARM we require fp16fml. The latter defines scalar operations on Float16, while AVX512 only defines vector instructions. If we only have vector instructions, shouldn't we still demote? The same applies for BFloat16/avx512bf16, which doesn't even have scalar support on ARM.
What does GCC 13 do with
AVX512fp16 supports the full set of floating-point instructions, so it just uses the vectorized ones on a single value. For bfloat, though, the only things available are convert and dot product, so I think it shouldn't change outside of that. And from https://godbolt.org/z/Kc36svedM it just does the operations
Ah, interesting. Doesn't seem to be the case for AVX512bf16, so I'll have to remove that part.
Also, it looks like LLVM now implements excess precision (https://reviews.llvm.org/D136176 is probably related), so I don't think we need the demote pass for Float16 anymore. But that can happen in a follow-up PR.
Force-pushed from 2827502 to b01110b.
b01110b has me a bit concerned. We need to pass zext/sext for platform ABI reasons. As an example, on PPC you can't pass a 16-bit type in registers, so you need to extend it. IIUC the change above is breaking for custom primitive types of size 16 (or other sizes on other platforms). So we should set
After switching to LLVM for BFloat16 in #51470 (i.e., relying on `Intrinsics.sub_float` etc. instead of hand-rolling bit-twiddling implementations), we also need to provide fallback runtime implementations for these intrinsics. This is too bad; I had hoped to put as much BFloat16-related functionality as possible in BFloat16s.jl. This required modifying the unary operator preprocessor macros in order to differentiate between Float16 and BFloat16; I didn't generalize that to all intrinsics as the code is hairy enough already (and it's currently only useful for fptrunc/fpext).
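As a hedged sketch of what "relying on `Intrinsics.sub_float` etc." means on the package side (the method definitions are illustrative, not BFloat16s.jl's actual code):

```julia
# On Julia 1.11+, Core.BFloat16 is the primitive type added by #51470;
# arithmetic can be forwarded to the intrinsics instead of bit-twiddling.
const BFloat16 = Core.BFloat16

Base.:(+)(x::BFloat16, y::BFloat16) = Core.Intrinsics.add_float(x, y)
Base.:(-)(x::BFloat16, y::BFloat16) = Core.Intrinsics.sub_float(x, y)
Base.Float32(x::BFloat16) = Core.Intrinsics.fpext(Float32, x)
BFloat16(x::Float32) = Core.Intrinsics.fptrunc(BFloat16, x)
```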
`numsToZero` relies on being able to sample arbitrary `AbstractFloat` with `rand`, which seemingly isn't possible with the new `Core.BFloat16` introduced in JuliaLang/julia#51470. See JuliaLang/julia#53651 for the upstream issue tracking this. 1.11 also introduces `AnnotatedString`, which interacts badly with the local scope of `@testset` and trying to lazily `join` things that may degenerate in inference to `AbstractString`. The type assertion is a quick "fix", since other than moving `irb` outside of that scope, inference will continue to mess with the test, even though no `AnnotatedString` could ever actually be produced.
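Concretely, the breaking pattern is sampling through the abstract type (a minimal reproduction sketch; `randfloat` is a hypothetical stand-in for what `numsToZero` does):

```julia
# Generic sampling through AbstractFloat, as `numsToZero` relies on:
randfloat(::Type{T}) where {T<:AbstractFloat} = rand(T)

randfloat(Float16)        # works: Base provides rand for Float16/32/64
randfloat(Core.BFloat16)  # fails on 1.11; see JuliaLang/julia#53651
```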
This PR adds native support for the LLVM `bfloat` type, through a new `BFloat16` type. It doesn't, however, add any language-level functionality, only the bare minimum (e.g. runtime conversion routines), and it will thus still be required to use the BFloat16s.jl package.

One element that needs to be discussed is that I didn't add a BFloat16 demote pass. This means that the back-end will be able to perform multiple operations in extended precision before demoting back to 16 bits at the end, resulting in different results than if you were to perform the operations separately. That wasn't acceptable for Float16, because of TwicePrecision-like hacks and IEEE compatibility, but hopefully we don't require this for BFloat16.
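A hedged numerical sketch of the difference (assuming BFloat16s.jl's `BFloat16`, which has 8 significand bits):

```julia
using BFloat16s  # assumed; provides the BFloat16 type

x = BFloat16(1 + 1/128)  # exactly 1 + 2^-7 (the ulp at 1.0 is 2^-7)
c = BFloat16(1 + 1/64)   # exactly 1 + 2^-6

# Exactly, x*x == 1 + 2^-6 + 2^-14. With per-operation rounding (as a
# demote pass would enforce), x*x rounds to 1 + 2^-6 and y == 0. If the
# back-end instead keeps the product in extended precision across both
# operations, y comes out as 2^-14 after the final demotion.
y = x*x - c
```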
Draft, as I still need to update BFloat16s.jl, and test on other platforms.
Alternative to #50607. LLVM only supports a limited number of types, so it currently doesn't seem worth the complexity to be able to dynamically register new types with codegen.
Fixes #41075, but we'll need LLVM 17 before we can emit AVX512BF16 instructions.
cc @chriselrod @vchuravy