
iree-codegen-cpu-materialize-host-encoding fails with invalid reference to GlobalOp #18628

Open
sogartar opened this issue Sep 27, 2024 · 19 comments
Labels
bug 🐞 Something isn't working codegen/llvm LLVM code generation compiler backend

Comments

sogartar (Contributor) commented Sep 27, 2024

What happened?

I have input MLIR that is exported from the test in this PR.
Compiling it with assertions enabled fails with:

iree-compile: repo/compiler/src/iree/compiler/Dialect/HAL/Analysis/DeviceAnalysis.cpp:180: auto mlir::iree_compiler::IREE::HAL::DeviceAnalysis::gatherDeviceTargets(mlir::Attribute, mlir::Operation *, SetVector<IREE::HAL::DeviceTargetAttr> &)::(anonymous class)::operator()(auto) const [symRefAttr:auto = mlir::SymbolRefAttr]: Assertion `globalOp && "global reference must be valid"' failed.

Full error message is in the reproducer zip.

Steps to reproduce your issue

  1. Download the reproducer sharded-toy-llama.zip.
  2. Run compile.sh.

What component(s) does this issue relate to?

Compiler

Version information

5a2dd56

Additional context

No response

benvanik (Collaborator) commented

That pass is fundamentally incompatible with multi-device; you need to disable data tiling.

sogartar (Contributor Author) commented

@benvanik How do you disable data tiling? I tried with

--iree-opt-data-tiling=false
--iree-llvmcpu-disable-arm-sme-tiling

But I get the same error.

sogartar (Contributor Author) commented

@hanhanW do you know how to disable it?

hanhanW (Contributor) commented Sep 27, 2024

Adding --iree-opt-data-tiling=false should disable the data-tiling. I can't tell much without looking at logs.

sogartar (Contributor Author) commented

@hanhanW, I forgot to mention in the description that the full error message is in the reproducer archive.

sogartar (Contributor Author) commented

@benvanik, aside from the problem that this pass should not be called at all, should flow.tensor.transfer ops that reference a nonexistent device be considered illegal? The verifier did not complain. We have

%2 = flow.tensor.transfer %1 : tensor<?x?x?x?xf32>{%c3, %c26, %c2, %c14} to #hal.device.affinity<@__device_1>

but there is no corresponding

util.global private @__device_1 ...

flow.tensor.transfer does not have the op interface SymbolUserOpInterface and does not verify the referenced symbol.
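To illustrate the gap, a minimal hypothetical sketch (function names, types, and shapes are made up for illustration): a module like the following would pass today's verifier even though @__device_1 is never declared, because the symbol sits inside the affinity attribute rather than in anything the op itself knows to check.

```mlir
module {
  util.func public @f(%arg0: tensor<4xf32>) -> tensor<4xf32> {
    // @__device_1 has no matching util.global, yet nothing flags it.
    %0 = flow.tensor.transfer %arg0 : tensor<4xf32> to #hal.device.affinity<@__device_1>
    util.return %0 : tensor<4xf32>
  }
}
```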

benvanik (Collaborator) commented Sep 30, 2024

The attribute is what is referencing the symbol, not the op. We verify these later on but because this pass is running in the wrong place it happens before the verification. I'm guessing you went in and deleted ops manually? There should not be a case where this happens without hand-editing (if there is that's what we need to track down).

sogartar (Contributor Author) commented Sep 30, 2024

I'm guessing you went in and deleted ops manually?

I have not hand-edited. The input IR references the symbol through promises. E.g.

%56 = flow.tensor.transfer %cast_2 : tensor<?x?xi32>{%dim_4, %dim_6} to #hal.device.promise<@__device_1>

The attribute is what is referencing the symbol, not the op.

One could make the same argument for function call ops: the attribute contains the symbol, but the op still verifies the symbol, and it cannot be any other way, since the attribute itself is not bound to the graph (it does not have that context).

There should not be a case where this happens without hand-editing (if there is that's what we need to track down).

@benvanik I will track where the origin of the problem is. But shouldn't we add this to the flow.tensor.transfer verification? It will be a lot easier to track this problem with such a verification.

benvanik (Collaborator) commented

Promises are ok - affinities are not. Promises do not need the symbol to exist.

The op does not know that the affinity attribute even has a symbol. The cases are different.
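The distinction, sketched with the two attribute forms quoted earlier in this thread (types and names illustrative):

```mlir
// A promise names a device that is only bound later in the pipeline;
// no util.global needs to exist at this point:
%a = flow.tensor.transfer %x : tensor<4xf32> to #hal.device.promise<@__device_1>

// An affinity is a direct reference, so a matching global must exist:
util.global private @__device_1 : !hal.device
%b = flow.tensor.transfer %y : tensor<4xf32> to #hal.device.affinity<@__device_1>
```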

IanNod (Contributor) commented Sep 30, 2024

Here is an mlir-print-ir-after-all dump compiling for hip giving the same issue with following iree compile command:

./build/tools/iree-compile ../sharktank/program.mlir --iree-input-type=torch --iree-hal-target-device=hip[0] --iree-hal-target-device=hip[1] --mlir-print-ir-after-all --iree-opt-data-tiling=false --iree-global-opt-enable-early-materialization=false -o program.vmfb &> hip.txt

https://sharkpublic.blob.core.windows.net/sharkpublic/ian/hip.txt

hanhanW (Contributor) commented Sep 30, 2024

There is a bug in getFuncExecutableTargetAttrs on data-tiling path. The bug is likely in HAL device analysis and I can take a look.

There is another issue about disabling data-tiling. I thought that --iree-opt-data-tiling=false should work, but it does not work on my machine. It still runs the set_encoding passes even with the flag. I'll take a look.

sogartar (Contributor Author) commented

The input to the iree-auto-input-conversion pass lacks information about @__device_1; it only has the default local device.

module attributes {hal.device.targets = [#hal.device.target<"local", [#hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "znver4", cpu_features = "+prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,+xsaves,-avx512fp16,-usermsr,-sm4,-egpr,+sse4.1,+avx512ifma,+xsave,+sse4.2,-tsxldtrk,-sm3,-ptwrite,-widekl,+invpcid,+64bit,+xsavec,-avx10.1-512,+avx512vpopcntdq,+cmov,-avx512vp2intersect,+avx512cd,+movbe,-avxvnniint8,-ccmp,-amx-int8,-kl,-avx10.1-256,+evex512,-avxvnni,-rtm,+adx,+avx2,-hreset,-movdiri,-serialize,-sha512,+vpclmulqdq,+avx512vl,-uintr,-cf,+clflushopt,-raoint,-cmpccxadd,+bmi,-amx-tile,+sse,-avx10.2-256,+gfni,-avxvnniint16,-amx-fp16,-zu,-ndd,+xsaveopt,+rdrnd,+avx512f,-amx-bf16,+avx512bf16,+avx512vnni,-push2pop2,+cx8,+avx512bw,+sse3,+pku,-nf,+fsgsbase,+clzero,+mwaitx,-lwp,+lzcnt,+sha,-movdir64b,-ppx,+wbnoinvd,-enqcmd,-avx10.2-512,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,+bmi2,+fma,+popcnt,-avxifma,+f16c,+avx512bitalg,+rdpru,+clwb,+mmx,+sse2,+rdseed,+avx512vbmi2,-prefetchi,+rdpid,-fma4,+avx512vbmi,+shstk,+vaes,-waitpkg,-sgx,+fxsr,+avx512dq,+sse4a", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 64 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>]> : !hal.device], iree.consteval} {

Later at iree-hal-materialize-target-devices pass this becomes

util.global private @__device_0 = #hal.device.target<"local", [#hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "znver4", cpu_features = "+prfchw,-cldemote,+avx,+aes,+sahf,+pclmul,-xop,+crc32,+xsaves,-avx512fp16,-usermsr,-sm4,-egpr,+sse4.1,+avx512ifma,+xsave,+sse4.2,-tsxldtrk,-sm3,-ptwrite,-widekl,+invpcid,+64bit,+xsavec,-avx10.1-512,+avx512vpopcntdq,+cmov,-avx512vp2intersect,+avx512cd,+movbe,-avxvnniint8,-ccmp,-amx-int8,-kl,-avx10.1-256,+evex512,-avxvnni,-rtm,+adx,+avx2,-hreset,-movdiri,-serialize,-sha512,+vpclmulqdq,+avx512vl,-uintr,-cf,+clflushopt,-raoint,-cmpccxadd,+bmi,-amx-tile,+sse,-avx10.2-256,+gfni,-avxvnniint16,-amx-fp16,-zu,-ndd,+xsaveopt,+rdrnd,+avx512f,-amx-bf16,+avx512bf16,+avx512vnni,-push2pop2,+cx8,+avx512bw,+sse3,+pku,-nf,+fsgsbase,+clzero,+mwaitx,-lwp,+lzcnt,+sha,-movdir64b,-ppx,+wbnoinvd,-enqcmd,-avx10.2-512,-avxneconvert,-tbm,-pconfig,-amx-complex,+ssse3,+cx16,+bmi2,+fma,+popcnt,-avxifma,+f16c,+avx512bitalg,+rdpru,+clwb,+mmx,+sse2,+rdseed,+avx512vbmi2,-prefetchi,+rdpid,-fma4,+avx512vbmi,+shstk,+vaes,-waitpkg,-sgx,+fxsr,+avx512dq,+sse4a", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 64 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>]> : !hal.device

Whatever is generating the jit IR for iree-auto-input-conversion is dropping the device definitions that are present before iree-consteval-jit-globals.

#device_target_local_0_ = #hal.device.target<"local", {ordinal = 0 : index}, [#executable_target_embedded_elf_x86_64_]> : !hal.device
#device_target_local_1_ = #hal.device.target<"local", {ordinal = 1 : index}, [#executable_target_embedded_elf_x86_64_]> : !hal.device
module @module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  util.global private @__device_0 = #device_target_local_0_
  util.global private @__device_1 = #device_target_local_1_

sogartar (Contributor Author) commented Sep 30, 2024

It makes sense that iree-consteval-jit-globals would ignore device placement while evaluating constants and use only the default 'local' device.
Maybe it should point each declared device at the single CPU device that is defined at compile time.
E.g.

#device_target_local_0_ = #hal.device.target<"local", {ordinal = 0 : index}, [#executable_target_embedded_elf_x86_64_]> : !hal.device
module @module {
  util.global private @__device_0 = #device_target_local_0_
  util.global private @__device_1 = #device_target_local_0_

hanhanW (Contributor) commented Oct 1, 2024

Sorry that I was busy on other stuff today, so no more updates from my side. I'll prioritize this issue tomorrow.

sogartar (Contributor Author) commented Oct 1, 2024

I am working on a PR for iree-consteval-jit-globals to define all the required devices to point to the single default/local device in the JITed module.

benvanik (Collaborator) commented Oct 1, 2024

Hold up - that's not the correct approach.

benvanik (Collaborator) commented Oct 1, 2024

Two issues: data tiling as it is plumbed through codegen today is fundamentally incompatible with jit eval, and jit eval today should be stripping all transfer ops and device references.

Data tiling needs to separate itself from executable target and only use that as a way to seed default information. Data tiling must always be overridable independently of executable target once it gets to codegen. That is, we must be able to codegen data tiling behavior for one executable target on a different executable target. Today it always assumes it can use the executable target to derive the information it needs. Things like jit eval and hoisting do not work if we can't do that - and we need jit eval and hoisting.

In order for that to work jit eval needs to ensure that information is present in the IR separate from the target used for jitting (the host). This means that the split of executable target and tiling information needs to happen before jit eval such that what device is running any particular op is independent from what data tiling is being performed. After that point no device information from the original program is required in the jit eval program, and the only device that should exist is the host default device. All device affinities and transfer ops should be stripped.

So don't alias devices - remove them all. There should be no transfer ops in a jit eval function, and no device affinities specified.
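A sketch of what that stripping might look like inside a jit eval function (illustrative IR, not actual pass output; ops and types are assumptions):

```mlir
// Before: the eval function still carries a device affinity.
%0 = flow.tensor.transfer %input : tensor<4xf32> to #hal.device.affinity<@__device_1>
%1 = arith.addf %0, %0 : tensor<4xf32>

// After: the transfer is elided and its operand is used directly;
// no device globals or affinities remain in the jitted module.
%1 = arith.addf %input, %input : tensor<4xf32>
```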

sogartar (Contributor Author) commented Oct 1, 2024

OK, I will add stripping of flow.tensor.transfer ops in the iree-consteval-jit-globals pass.

I will have to see what other ops can carry device affinities and strip those as well. Maybe some ops can have just their affinities stripped, without removing the whole op, when the op performs some "actual" work.

sogartar (Contributor Author) commented Oct 2, 2024

Here is a PR #18663 that strips the affinities.
