
Make in-place FFT optional #155

Merged: 2 commits into develop on Oct 8, 2024
Conversation

samhatfield
Collaborator

It seems that hipFFT has a problem with in-place FFTs. Running on LUMI-G (MI250x), this error is thrown at runtime:

GPU runtime error at 1
GPU runtime error in file '/pfs/lustrep3/scratch/project_465000454/hatfield/jube-suites/ectrans/ectrans_gpu/000005/000000_clone/work/ectrans-bundle/source/ectrans/src/trans/gpu/algor/hicfft.hip.cpp'
GPU runtime error at 2
GPU runtime error line '48'
GPU runtime error at 3
GPU runtime error 6: HIPFFT_EXEC_FAILED
terminating!
ABOR1     [PROC=1,THRD=1] from [/pfs/lustrep3/scratch/project_465000454/hatfield/jube-suites/ectrans/ectrans_gpu/000005/000000_clone/work/ectrans-bundle/source/ectrans/src/trans/gpu/algor/hicfft.h +48] : Error in FFT

Some investigations with @PaulMullowney showed that if you create a separate buffer for the output, the error vanishes.

With this change we are now finally able to run the optimised GPU version of ecTrans (now the GPU version of ecTrans) on AMD GPUs, at least with a single MI250x GPU on LUMI-G. Multi-GPU runs are still a work in progress.

This PR makes in-place FFT compile-time adjustable. For now, in-place is enabled for cuFFT, but disabled for hipFFT.

@PaulMullowney and I will see what we can do to flag this to hipFFT developers so we can eventually remove this option.

Note that earlier HIPFFT_PARSE_ERRORs no longer appear with ROCm 6.0.2, which is the new default for LUMI-G.

Note that this requires testing on an Nvidia GPU. Hopefully we have not ruined things there.

This PR builds on #150 so best to merge that one first.

@samhatfield
Collaborator Author

@lukasm91 can you see this having any performance implications for CUDA/cuFFT execution? I think I've put all the expensive stuff inside the guards so it shouldn't be compiled on an Nvidia platform.

@samhatfield samhatfield mentioned this pull request Sep 19, 2024
@samhatfield samhatfield force-pushed the samhatfield/add_in_place_fft_ifdef branch from ce5c7cf to dea414f Compare September 19, 2024 14:40
Collaborator

@lukasm91 left a comment

Yes, this looks like it should not have performance implications (and I also quickly checked it on our machines).

src/trans/gpu/internal/ftdir_mod.F90 (outdated, resolved)
src/trans/gpu/internal/ftinv_mod.F90 (outdated, resolved)
src/trans/gpu/internal/ftinv_mod.F90 (outdated, resolved)
@samhatfield samhatfield force-pushed the samhatfield/add_in_place_fft_ifdef branch from dea414f to 3c6810a Compare September 20, 2024 10:43
@samhatfield samhatfield added enhancement New feature or request gpu labels Oct 1, 2024
samhatfield and others added 2 commits October 1, 2024 16:51
The out-of-place workaround is currently disabled for cuFFT but enabled for hipFFT.

In-place FFTs seem to be an issue for ROCm at the moment. This is a
temporary workaround.
@samhatfield samhatfield force-pushed the samhatfield/add_in_place_fft_ifdef branch from 1758aa1 to 89def35 Compare October 1, 2024 16:51
REAL(KIND=JPRBT) :: DUMMY

#ifndef IN_PLACE_FFT
HFTDIR%HREEL_COMPLEX = RESERVE(ALLOCATOR, INT(KF_FS*D%NLENGTF*SIZEOF(DUMMY), KIND=C_SIZE_T))
Collaborator
SIZEOF is a non-standard extension; some compilers may complain.
The F2008 standard provides STORAGE_SIZE, which gives you bits (not bytes!), and C_SIZEOF, which gives you bytes.

Collaborator Author
SIZEOF occurs in a few places in the GPU tree. @lukasm91 any strong feelings about switching to STORAGE_SIZE(DUMMY)/8?

Collaborator
Thanks for letting me know... for me, Fortran is learning by doing. I simply didn't realize this is not in the standard, so it's good to know, and good to learn the right way to do it :)

PREEL_COMPLEX => PREEL_REAL
#else
CALL ASSIGN_PTR(PREEL_COMPLEX, GET_ALLOCATION(ALLOCATOR, HFTDIR%HREEL_COMPLEX),&
& 1_C_SIZE_T, INT(KFIELD*D%NLENGTF*SIZEOF(PREEL_COMPLEX(1)),KIND=C_SIZE_T))
Collaborator

also here

REAL(KIND=JPRBT) :: DUMMY

#ifndef IN_PLACE_FFT
HFTINV%HREEL_REAL = RESERVE(ALLOCATOR, INT(D%NLENGTF*KF_FS*SIZEOF(DUMMY),KIND=C_SIZE_T))
Collaborator
and here

PREEL_REAL => PREEL_COMPLEX
#else
CALL ASSIGN_PTR(PREEL_REAL, GET_ALLOCATION(ALLOCATOR, HFTINV%HREEL_REAL),&
& 1_C_SIZE_T, INT(KFIELD*D%NLENGTF*SIZEOF(PREEL_REAL(1)),KIND=C_SIZE_T))
Collaborator
and here... There may be more instances in the code; this looks like a typical copy-pasted-and-edited line.

@samhatfield samhatfield merged commit 4f2fa50 into develop Oct 8, 2024
21 checks passed
@samhatfield samhatfield deleted the samhatfield/add_in_place_fft_ifdef branch October 8, 2024 14:57
wdeconinck added a commit to wdeconinck/ectrans that referenced this pull request Oct 23, 2024
Labels: enhancement (New feature or request), gpu
Projects: None yet
3 participants