
[Bug]: rocblas link fails with relocation R_X86_64_PC32 out of range #1448

Open
aagit opened this issue Jun 24, 2024 · 27 comments

@aagit

aagit commented Jun 24, 2024

Describe the bug

Build fails during final shared lib linking.

To Reproduce

Steps to reproduce the behavior:

  1. build rocblas version 6.0.2 with export ROCM_GPUS="gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102"
  2. cmake -G Ninja
    -DBUILD_FILE_REORG_BACKWARD_COMPATIBILITY=OFF
    -DROCM_SYMLINK_LIBS=OFF
    -DHIP_PLATFORM=amd
    -DAMDGPU_TARGETS=${ROCM_GPUS}
    -DCMAKE_INSTALL_LIBDIR=$ROCM_LIB
    -DCMAKE_INSTALL_BINDIR=$ROCM_BIN
    -DBUILD_WITH_TENSILE=ON
    -DBUILD_WITH_PIP=OFF
  3. See error

Expected behavior

Build should not fail.

Log-files

ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7f991): relocation R_X86_64_PC32 out of range: -2179713377 is not in [-2147483648, 2147483647]; references section '.gcc_except_table.rocblas_gemm_ex3'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fa49): relocation R_X86_64_PC32 out of range: -2179713445 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._Z27exception_to_rocblas_statusNSt15__exception_ptr13exception_ptrE'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fa75): relocation R_X86_64_PC32 out of range: -2179713457 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._Z11log_profileIJRA7_KcRPS0_S2_S4_S2_S4_S2_S4_RA13_S0_S4_S2_RcS2_S7_RA2_S0_RiS9_SA_S9_SA_RA6_S0_dRA4_S0_SA_SE_SA_RA5_S0_dSE_SA_SE_SA_SG_R18rocblas_gemm_algo_RA15_S0_SA_SC_19rocblas_gemm_flags_EEvP15_rocblas_handleS3_DpOT_'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7faa1): relocation R_X86_64_PC32 out of range: -2179713477 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._ZNSt8__detaillsIcSt11char_traitsIcEEERSt13basic_ostreamIT_T0_ES7_RKNS_14_Quoted_stringIPKS4_S4_EE'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fad5): relocation R_X86_64_PC32 out of range: -2179713485 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._Z22log_bench_scalar_valueIfLi0EENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPKcPKT_'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fb05): relocation R_X86_64_PC32 out of range: -2179713501 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._ZN16argument_profileISt5tupleIJPKcS2_S2_21rocblas_atomics_mode_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_cS2_cS2_iS2_iS2_iS2_dS2_iS2_iS2_dS2_iS2_iS2_18rocblas_gemm_algo_S2_iS2_19rocblas_gemm_flags_EEED2Ev'
ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fb3d): relocation R_X86_64_PC32 out of range: -2179713529 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._ZN16argument_profileISt5tupleIJPKcS2_S2_21rocblas_atomics_mode_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_cS2_cS2_iS2_iS2_iS2_dS2_iS2_iS2_dS2_iS2_iS2_18rocblas_gemm_algo_S2_iS2_19rocblas_gemm_flags_EEEclEOS6_'
ld.lld: error: lto.tmp:(.rodata._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_+0x0): relocation R_X86_64_PC32 out of range: 2180923961 is not in [-2147483648, 2147483647]; references section '.text._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_'
>>> referenced by rocblas_gemm_ex3.cpp

ld.lld: error: lto.tmp:(.gcc_except_table._Z27exception_to_rocblas_statusNSt15__exception_ptr13exception_ptrE+0x14): relocation R_X86_64_PC32 out of range: 2192096096 is not in [-2147483648, 2147483647]; references section '.data'
>>> referenced by rocblas_gemm_ex3.cpp

ld.lld: error: lto.tmp:(.rodata._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_+0x4): relocation R_X86_64_PC32 out of range: 2180924065 is not in [-2147483648, 2147483647]; references section '.text._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_'
>>> referenced by rocblas_gemm_ex3.cpp

ld.lld: error: lto.tmp:(.gcc_except_table._Z27exception_to_rocblas_statusNSt15__exception_ptr13exception_ptrE+0x18): relocation R_X86_64_PC32 out of range: 2192096100 is not in [-2147483648, 2147483647]; references section '.data'
>>> referenced by rocblas_gemm_ex3.cpp

ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x7fa): relocation R_X86_64_PC32 out of range: -2180446035 is not in [-2147483648, 2147483647]; references '.L.str.36'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp

ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x801): relocation R_X86_64_PC32 out of range: -2180875840 is not in [-2147483648, 2147483647]; references '.L.str.35'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp

ld.lld: error: lto.tmp:(.rodata._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_+0x8): relocation R_X86_64_PC32 out of range: 2180924159 is not in [-2147483648, 2147483647]; references section '.text._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_'
>>> referenced by rocblas_gemm_ex3.cpp

ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x808): relocation R_X86_64_PC32 out of range: -2180734041 is not in [-2147483648, 2147483647]; references '.L.str'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp

ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x80f): relocation R_X86_64_PC32 out of range: -2180325041 is not in [-2147483648, 2147483647]; references '.L.str.19'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp

ld.lld: error: lto.tmp:(.rodata._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_+0xc): relocation R_X86_64_PC32 out of range: 2180924253 is not in [-2147483648, 2147483647]; references section '.text._Z44rocblas_copy_alpha_beta_to_host_if_on_deviceIvE15rocblas_status_P15_rocblas_handleRPKT_S6_R15rocblas_union_uS8_i20rocblas_computetype_'
>>> referenced by rocblas_gemm_ex3.cpp

ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x816): relocation R_X86_64_PC32 out of range: -2180325041 is not in [-2147483648, 2147483647]; references '.L.str.20'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp

ld.lld: error: lto.tmp:(function rocblas_gemm_ex3: .text.rocblas_gemm_ex3+0x836): relocation R_X86_64_PC32 out of range: -2180302935 is not in [-2147483648, 2147483647]; references '.L.str.37'
>>> referenced by rocblas_gemm_ex3.cpp:176 (/root/rpmbuild/BUILD/rocBLAS-rocm-6.0.2/library/src/blas_ex/rocblas_gemm_ex3.cpp:176)
>>> defined in lto.tmp

ld.lld: error: library/src/librocblas.so.4.0:(.eh_frame+0x7fb71): relocation R_X86_64_PC32 out of range: -2179713541 is not in [-2147483648, 2147483647]; references section '.gcc_except_table._ZNK24rocblas_internal_ostream3dupEv'
ld.lld: error: too many errors emitted, stopping now (use --error-limit=0 to see all errors)
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.

Environment

Should not matter, it is not a runtime issue.

Software version
rocm-core rocm-core-6.0.2-1.fc40.x86_64
rocblas rocblas-6.0.2-3.fc40.x86_64

Additional context

Although I don't see this reported among the GitHub issues, it should be a very well-known issue. So I wonder whether the plan is to never fix it?

If the above assumption is correct, I would like to know whether upstream is willing to take a fix for it, assuming a fix is possible.

@IMbackK

IMbackK commented Jun 24, 2024

You are running into the issue that ld can only link objects whose sections are at most a 32-bit signed offset away from each other.
As you enable more targets, rocblas gets larger and eventually exceeds this limit. Yes, this is a huge problem with how ROCm is architected and it desperately needs some kind of resolution, but for now the only solution is to build for fewer targets.
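For reference, a quick way to see how close a build is to that limit is to look at the size of the shared object and of its biggest sections (the path is the one from the log above; ls, readelf and grep are standard tools):

    ls -lh library/src/librocblas.so.4.0
    # section sizes; .hip_fatbin (GPU code) and .text (host code) typically dominate
    readelf -S --wide library/src/librocblas.so.4.0 | grep -E '\.text|\.hip_fatbin'

Once the layout pushes host sections more than 2^31 bytes apart, the R_X86_64_PC32 relocations shown in the log can no longer reach.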

@IMbackK

IMbackK commented Jun 24, 2024

If you want to remove an architecture, I would recommend gfx803, as this architecture is currently broken anyhow unless you disable the asm kernels provided by Tensile.

@aagit
Author

aagit commented Jun 24, 2024

Thanks for the quick feedback.

Yes, if I built for fewer targets it would succeed, but I already removed gfx1103 because I've been building for an older codebase where gfx1103 could not be enabled. Removing gfx803 would only hide the problem and kick the can down the road; it doesn't appear to be a satisfactory long-term solution.

If we don't work on a solution for this now, the end result is that every ROCm-accelerated app binary has to be built multiple times against independent and incompatible ROCm builds, just as if they were separate GPU compute stacks with nothing in common. This also multiplies the build time and the disk space requirements of every app, maybe not by N, but close.

It would also provide a subpar experience to the end user, who then has to figure out the right binary to install and invoke, instead of ROCm handling that GPU detail transparently.

@IMbackK

IMbackK commented Jun 24, 2024

Yup, this is the major reason why ROCm supports so few GPUs, and if they don't address it soon it has the potential to sink ROCm, since it forces them to drop support for old GPUs extremely fast (ever accelerating in pace as ROCm gets larger), which ultimately destroys customer confidence.

@Mystro256

@cgmb I think you had some other suggestions by using generic targets, but I can't remember how much progress has happened there.

@mahmoodw
Contributor

Thank you for bringing this issue to our attention. We appreciate your feedback and suggestions.

We recommend building with the suggested targets for the corresponding ROCm release. The default target list for 6.0 includes:

  • gfx900
  • gfx906:xnack-
  • gfx908:xnack-
  • gfx90a:xnack+
  • gfx90a:xnack-
  • gfx940
  • gfx941
  • gfx942
  • gfx1010
  • gfx1012
  • gfx1030
  • gfx1100
  • gfx1101
  • gfx1102

The team is aware of the issue and is exploring possible solutions.

Thank you for your understanding and cooperation.

@IMbackK

IMbackK commented Jun 25, 2024

@cgmb I think you had some other suggestions by using generic targets, but I can't remember how much progress has happened there.

Sure, https://llvm.org/docs/AMDGPUUsage.html#amdgpu-generic-processor-table could be used, at the cost of some performance for the targets other than gfx10-3-generic. Ultimately this just kicks the can further down the road, but for now, yes, this would be sufficient.

Right now there is also no support for ELFABIVERSION_AMDGPU_HSA_V6, so those targets don't work yet, but I presume they will soon.
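For illustration only, a target list built from the generic processors in that LLVM table might look like the line below; whether the rocBLAS/Tensile build currently accepts these names is an assumption, not something verified in this thread:

    export ROCM_GPUS="gfx9-generic;gfx10-1-generic;gfx10-3-generic;gfx11-generic"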

@aagit
Author

aagit commented Jun 25, 2024

Would it be possible to split librocblas.so.4.0 into librocblas-gfx900.so.4.0, librocblas-gfx90a.so.4.0, librocblas-gfxXYZ.so.4.0, ... so that each individual gfx target lands in a different shared library, and then have the main librocblas.so.4.0 dynamically load only the gfx targets available in hardware, either during initialization of the main library or, even better, lazily on demand?

@TorreZuk TorreZuk self-assigned this Jun 27, 2024
@TorreZuk
Contributor

@aagit that separate per-gfx .so design has been evaluated as one possible solution, but we are also looking at other strategies. For now, until the full list of gfx targets that lands in a specific release requires a new build and packaging pattern, we suggest you build and package the version-specific set of gfx targets listed in the top-level CMakeLists.txt. This corresponds to our build script's default option.

@aagit
Author

aagit commented Jun 28, 2024

I appreciate your suggestion above. I agree it's the least bad solution for the time being, and I have already followed it. If there are other ways to fix it, would you share them so they can be discussed here? Overall I would recommend picking the simplest fix and shipping it ASAP, because while working on a ROCm-accelerated app I noticed that ROCm has already been packaged in the open by building it N times and installing it into incompatible paths. The technical justification is to work around this issue (so it's as if there were /opt/rocm1, /opt/rocm2, /opt/rocm3, ... /opt/rocmN installed, each one supporting a small subset of gfx targets so that the link does not fail and gfx8 and gfx1103 can be enabled too).

If the duplication were just on the ROCm side it would (perhaps) be a lesser concern, but this causes all apps to be rebuilt N times, and the build time is multiplied by N. Last but not least, the end user then has to pick the right binary (among the N available) for their GPU or it won't work, possibly just because of minor path differences.

For example: I built an app linked against ROCm that way, and the total size of the N builds against N ROCms was 96GB. Then I ran hardlink . and it dropped the size to 92GB. Then I ran hardlink . -t -p and it dropped the size to 32GB.

What I described in #1448 (comment) is already happening. My view is that such a way of packaging ROCm is not sustainable even if the extra energy requirements of the build system could be met, because it provides a subpar experience to the end user compared to the competing GPU compute stacks, where building an app once is enough. I have already followed your suggestion above, of course, but it is now a matter of opinion whether the workaround is worse than the disease. So I don't see a clear path to unwind the ROCm build loop until this issue is fixed... Thanks!

@IMbackK

IMbackK commented Jun 28, 2024

Another temporary option, if you don't want to drop any GPUs in your builds, might be to build "gfx90a" instead of just "gfx90a:xnack-". The xnack+ configuration is very rare and omitting it doesn't leave any user totally in the cold (just with possibly reduced performance depending on workload), and "gfx90a" should emit code that works in both xnack+ and xnack- modes.
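As a concrete sketch (untested here), that would turn the target list from the reproduction steps into:

    export ROCM_GPUS="gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102"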

@IMbackK

IMbackK commented Jun 28, 2024

All gfx9 GPUs support xnack+; the fact that only gfx90a is built both ways is a clear hint as to how common this is.

@TorreZuk
Contributor

TorreZuk commented Jul 2, 2024

We have changed to only build our source kernels with xnack "any" for gfx90a after commit 6a267fd. We expect to adjust our gfx list before release, and as always we ensure there are no linking issues on all supported OSes and with any final target list. Other subdivisions of the library along functional lines are also possible, but none are trivial changes. Clang compiler and linker mcmodel flag changes are also possible with the current library design, along with the target variations mentioned in earlier comments.

This bug should likely be considered fixed and the issue closed, as when you built rocblas with our supported gfx list you didn't get the error. A new issue could be created, as your N-different-ROCm use case is unclear to me, and I don't understand why the app is rebuilt and linked against all of them rather than built against the latest. If your application is open source, please refer to it in your new issue and detail why it is built separately for each gfx. Or, if this is really just a request to support more gfx targets, word it as such along with your use case and gfx list. If you rebuilt ROCm or rocblas with one gfx in each version, please also clarify that in your new feature request issue. It could be that your new issue belongs in ROCm if it is not particular to rocBLAS.

@IMbackK

IMbackK commented Jul 8, 2024

I think wanting to build a version of rocblas for all targets that work, as opposed to just the default targets, is a reasonable desire, especially from a distro maintainer's perspective, where AMD's support status for a specific architecture is not important; it only needs to work.
Further, as you expand rocblas and its supported target architectures, you will eventually hit a wall here that will force you to change how this works. It is, for instance, impossible for you to support the full range of your released devices with the current implementation, unlike how cuBLAS supports a huge range of devices, down to ones much older than gfx900.

Thus, at the very least, I request that this be left open as a feature request.

@aagit
Author

aagit commented Jul 8, 2024

https://src.fedoraproject.org/rpms/python-torch/blob/rawhide/f/python-torch.spec#_998
https://src.fedoraproject.org/rpms/python-torch/blob/rawhide/f/python-torch.spec#_37
https://src.fedoraproject.org/rpms/rocm-rpm-macros/blob/rawhide/f/default
https://src.fedoraproject.org/rpms/rocm-rpm-macros/blob/rawhide/f/gfx9
https://src.fedoraproject.org/rpms/rocm-rpm-macros/blob/rawhide/f/gfx90a
https://src.fedoraproject.org/rpms/rocblas/blob/rawhide/f/rocblas.spec#_117
https://src.fedoraproject.org/rpms/rocblas/blob/rawhide/f/rocblas.spec#_142
https://src.fedoraproject.org/rpms/rocsolver/blob/rawhide/f/rocsolver.spec#_136

rocsolver also has the loop, but because it depends on rocblas it's not possible to tell if rocsolver is like yet another app depending on rocblas (including pytorch) having to be rebuilt N times as a dependency on rocblas being rebuilt N times.

For example, note that there's no ROCm build loop in packages that don't depend on rocblas, like the OpenCL runtime:

https://src.fedoraproject.org/rpms/rocm-runtime/blob/rawhide/f/rocm-runtime.spec#_50

For the time being rocblas appears to be the origin of the ROCm build loop caused by the link failure above, but if other ROCm parts share this same issue, then once rocblas is fixed the fix can trickle down there too, I assume. The core does not seem to have this issue, and in fact it is being built only once, as one would expect.

I can open a new issue, but we'd lose part of the context, so for now I'm posting it here. We can always open a new issue later. The goal is to fix the build so it doesn't fail at link time; once fixed, it wouldn't add any noticeable feature to the software other than succeeding at a build that previously failed, as far as I can tell.

Thank you!

@aagit
Author

aagit commented Jul 8, 2024

About the possible solutions mentioned, I agree that your -mcmodel= (I suppose "medium") suggestion for the short term appears the most attractive solution. In fact (wishful thinking), I wonder if there's a chance it could already be switched on through some environment variable.

@cgmb
Contributor

cgmb commented Jul 8, 2024

The compiler team is looking into adjusting the layout of the shared libraries. I'm not sure of the details of their proposal, but the gist of it is to move all the offload bundles to either the beginning or the end of the library, which will ensure that all host code is close together (and therefore does not need more than 32-bit offsets). I'm not sure of the status of that proposal, but it seemed promising.

About the possible solutions mentioned, I agree that your -mcmodel= (I suppose "medium") suggestion for the short term appears the most attractive solution.

The last time I checked, the mcmodel flag did not actually work when compiling HIP code. Although, perhaps that has changed.
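For anyone who wants to re-check that, a minimal sketch would be to pass the flag through the host compiler flags of the reproduction cmake invocation (only the changed flag is shown; the remaining flags from the reproduction steps stay as they are). Whether the HIP toolchain accepts or honors it is exactly the open question above:

    cmake -G Ninja -DCMAKE_CXX_FLAGS="-mcmodel=medium" -DAMDGPU_TARGETS="${ROCM_GPUS}" -DBUILD_WITH_TENSILE=ON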

it's not possible to tell if rocsolver is like yet another app depending on rocblas (including pytorch) having to be rebuilt N times as a dependency on rocblas being rebuilt N times.

The rocBLAS ABI does not change depending on the gfx architecture it was built for. Fedora has built rocBLAS for multiple architectures as separate packages, but you can build your application against any of those rocblas packages and it will work with all of them.

I don't think they're rebuilding the rocSOLVER or pytorch libraries because of rocBLAS. It is likely because those libraries/packages also benefit from the same architecture-splitting that they did with rocBLAS. @Mystro256 or @trixirt might be able to shed more light on Fedora's choices here.

@trixirt
Contributor

trixirt commented Jul 8, 2024

We faced this problem last year, getting started, when I was first building rocBLAS for all the targets. There was discussion within Fedora about how to work around this, and what we have in place is the solution. The builds are split along major family lines. This was done to keep the explanation of where your GPU sits in the split simple. So at the moment we have gfx8, gfx9, gfx10 and gfx11 in F40. The prefix for this is /usr/lib64/rocm/gfxXX. There is also a special set, 'default', which is the union of gfx10 and gfx11; these install into the normal prefix /usr. The main tradeoff that was made was to include as many targets as possible. This splitting does make packaging more challenging, but that is why we have spec files and do the work to make it happen in Fedora, with pytorch being an example of making it happen. Time to build is not a major concern.
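As a hypothetical example of consuming one of those split installs, an application build could be pointed at a family prefix instead of /usr; the prefix below follows the Fedora layout quoted above, the flag is plain CMake, and whether the split packages ship the needed CMake config files under that prefix has not been verified here:

    cmake -S . -B build -DCMAKE_PREFIX_PATH=/usr/lib64/rocm/gfx9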

If/when something changes in the upstream, I and other folks in Fedora ROCm packaging sig will readdress how rocBLAS and similar are built to maximize inclusiveness of gpu targets.

IMO this is a case of perfection being the enemy of good.

@IMbackK

IMbackK commented Jul 8, 2024

One thing to note about this split-package option is, of course, that any system with multiple GPUs of different architectures will not be supported by your scheme. For this reason I would strongly recommend building as few versions of rocblas as possible, at most 2, to raise the chance of any given heterogeneous system being supported.

@aagit
Author

aagit commented Jul 9, 2024

@IMbackK agreed: running different gfxN GPUs in the same app has become impossible as a result of the 5 ROCm builds with the fine-grained split. That is on top of having to pick the right rpm and/or binary of the AI app or it won't work even with one GPU.

I asked some time ago to do at most 2 builds, so thanks for suggesting it too.

The "Good" to me is that there's just 1 binary of every AI app (be it llama-cpp, vllm, mojo, triton, pytorch, etc..) and there's no multiplication x5 of both rpm and binaries of every AI/GPU app under the sun, so when the AI dev that normally runs on the popular, but proprietary, GPU compute stack tries his favorite AI app on Open Source ROCm, it just works without extra complications compared to the previous experience.

Rejecting all the suggestions above, some of which would achieve full inclusiveness with a single build of rocblas (or at most 2 builds, to gain minor extra optimizations), is to me what defines the "Perfection".

@aagit
Author

aagit commented Jul 9, 2024

Hello,

The rocBLAS ABI does not change depending on the gfx architecture it was built for. Fedora has built rocBLAS for multiple architectures as separate packages, but you can build your application against any of those rocblas packages and it will work with all of them.

It's hard to see how my app can work against all rocblas packages, because it won't know where to find the files it needs at runtime, unless such knowledge is injected with "module load" just before the build, by rebuilding it N times with N different "module load" commands.

In other words, the reason for the app being rebuilt N times could, in the best case, be just path differences, as a workaround for the fact that rocblas wasn't meant to be built N times and installed in N different places.

rocblas rebuilt N times is a workaround for this rocblas build-time link failure, and, if you wish, the apps rebuilt N times are further orthogonal workarounds for the path differences caused by the first workaround; the build system takes one more hit at every step of the way, workaround on top of workaround.

I would have been content to get away with just 1 app binary for testing, using ROCBLAS_TENSILE_LIBPATH (or some other hidden env variable I found randomly while trying to force it to find the right files for the needed rocblas). That didn't appear to work and it segfaulted. Possibly I made some mistake, and maybe it could have worked if I had insisted in that direction, but I don't think a user should be required to set ROCBLAS_TENSILE_LIBPATH by hand to some directory specific to the GPU in use, just so the app doesn't spew some error about missing files in /usr/ with no hint on how to resolve it.

Among other suggestions to avoid rebuilding all apps N times, before filing this issue I asked whether the N rocblas builds could be installed in the same path location, hoping that one app binary would then just work. The problem is that there are some file collisions: each build generates files in the same location that can't be automatically disambiguated. I had a quick look at the collisions in the non-ELF-x86 parts and they didn't seem hard to disambiguate, but now I wonder if even /usr/lib64/librocblas.so (the otherwise >2G shared lib) would be one of them. I could imagine this path would be workable if you then accept not being able to run on all gfx targets using the same container image (not just at runtime from the same binary among the N built, which as discussed earlier is already not possible).

Thanks!

@IMbackK

IMbackK commented Jul 10, 2024

I would have been content to get away with just 1 app binary for testing, using ROCBLAS_TENSILE_LIBPATH (or some other hidden env variable I found randomly while trying to force it to find the right files for the needed rocblas). That didn't appear to work and it segfaulted. Possibly I made some mistake, and maybe it could have worked if I had insisted in that direction, but I don't think a user should be required to set ROCBLAS_TENSILE_LIBPATH by hand to some directory specific to the GPU in use, just so the app doesn't spew some error about missing files in /usr/ with no hint on how to resolve it.

This can't work: rocblas contains GPU code inside librocblas.so itself, not just in its modules, and that GPU code must be available for all GPUs in the system supported by the runtime.

You can, however, just build several librocblas.so with different prefixes, and then use LD_LIBRARY_PATH or LD_PRELOAD to select which one; no application rebuild required.
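For example (the prefixes and the application name below are hypothetical; only the mechanism is real):

    # pick the rocblas build matching the installed GPU at run time
    LD_LIBRARY_PATH=/opt/rocblas-gfx90a/lib ./my_app
    # or force one specific library outright
    LD_PRELOAD=/opt/rocblas-gfx1100/lib/librocblas.so.4 ./my_app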

@Headcrabed

Hello, is this issue still being worked on?

@TorreZuk
Contributor

Yes, we have ongoing work on this topic; when anything relevant lands in the develop branch I will comment here. Larger changes may not occur until a major release. gfx940 and gfx941 could be removed from the target list you build, IMO.

@TorreZuk
Contributor

Just to keep you all in the loop, one possible solution has landed in develop commit bb81a83 in which we enable llvm clang's hip offload compiler option --offload-compress. See https://clang.llvm.org/docs/ClangCommandLineReference.html This may provide a significant (around 90%) size reduction with little impact on build time, so along with removing the gfx940 and gfx941 targets as suggested earlier, it will greatly reduce the linked .so size for develop branch. The compiler flag requires the ROCm 6.2 clang or later. We are still evaluating it, but if it passes all our requirements it should be used by default in a future release.
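For anyone wanting to try it before that release, a rough sketch (assuming the flag is simply forwarded through the compiler flags of the reproduction cmake invocation; this is not taken from commit bb81a83) with a ROCm 6.2 or later clang would be:

    cmake -G Ninja -DCMAKE_CXX_FLAGS="--offload-compress" -DAMDGPU_TARGETS="${ROCM_GPUS}" -DBUILD_WITH_TENSILE=ON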

@trixirt
Contributor

trixirt commented Sep 28, 2024

Just to keep you all in the loop, one possible solution has landed in develop commit bb81a83 in which we enable llvm clang's hip offload compiler option --offload-compress. See https://clang.llvm.org/docs/ClangCommandLineReference.html This may provide a significant (around 90%) size reduction with little impact on build time, so along with removing the gfx940 and gfx941 targets as suggested earlier, it will greatly reduce the linked .so size for develop branch. The compiler flag requires the ROCm 6.2 clang or later. We are still evaluating it, but if it passes all our requirements it should be used by default in a future release.

Thanks for the suggestion, and it is simple to implement!
I did a number of experiments and the compression is no worse than 80%.
Another bit of info: DEBUG makes .hip_fatbin about 2x larger, and if you have no GPU debugger (Fedora doesn't yet) it doesn't make sense to have it.
rocSPARSE and rocSOLVER could also benefit from this option.
Here is what I did in Fedora:
https://src.fedoraproject.org/rpms/rocblas/c/5aa46bcd5960478990cd692f9339afb4265050d3?branch=rawhide

@IMbackK

IMbackK commented Oct 2, 2024

The high compression ratio is not a big surprise given that there are really only 2 ISAs. I would expect all of GCN+CDNA and all of RDNA to each generate essentially the same code aside from edge cases.
