CUFFT: support for arbitrary dims #119
I'm running into this issue as well.
The error seems to apply to […]. More generally, it appears that the […] and the following are successful: […]. This is a problem because […].
Is this a limitation of the CUFFT library? From looking at the CUFFT documentation (https://docs.nvidia.com/cuda/cufft/index.html) it wasn't super clear to me whether e.g. an FFT along dimension 2 of a 3D array is supported.
If FFTs along non-batched dimensions are not supported by CUFFT, then shouldn't this issue be closed as "can't fix"?

I'm happy to help work on this issue as it would help resolve a number of important Oceananigans.jl issues (CliMA/Oceananigans.jl#586, CliMA/Oceananigans.jl#1007) and speed up channel models (an important class of simulations) by a factor of ~2x, due to the 2D DCT I had to implement in CliMA/Oceananigans.jl#290.

If CUFFT cannot be used along non-batched dimensions, it seems like it should be possible to implement with `permutedims!`. I think @btmit was thinking of a […].

For reference:

```julia
julia> @show typeof(src), typeof(dest);
(typeof(src), typeof(dest)) = (CuArray{Complex{Float64},3}, CuArray{Complex{Float64},3})
julia> @show size(src), size(dest);
(size(src), size(dest)) = ((256, 256, 256), (256, 256, 256))
julia> @benchmark CUDA.@sync permutedims!(dest, src, (2, 1, 3))
BenchmarkTools.Trial:
memory estimate: 800 bytes
allocs estimate: 34
--------------
minimum time: 1.828 ms (0.00% GC)
median time: 1.971 ms (0.00% GC)
mean time: 2.103 ms (0.00% GC)
maximum time: 37.842 ms (0.00% GC)
--------------
samples: 2371
evals/sample: 1
```
Since this is a limitation of the CUFFT library, it seems that transposing the array and then performing the FFT is the way to go for FFTs along non-batched dimensions, e.g. to compute an FFT along dimension 2 of an array […] (see the sketch below).

Should this issue be closed, since it can't really be fixed? It's not really a bug either.
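A sketch of that approach (hedged: `fft_dim2_via_permute` is a hypothetical helper, not an existing API):

```julia
using CUDA
using CUDA.CUFFT: fft!

# FFT along dim 2 of a 3D CuArray by permuting the target dimension to the
# front (which CUFFT supports as a batched transform), then permuting back.
function fft_dim2_via_permute(src::CuArray{<:Complex,3})
    tmp = similar(src, (size(src, 2), size(src, 1), size(src, 3)))
    permutedims!(tmp, src, (2, 1, 3))   # move dim 2 to the front
    fft!(tmp, 1)                        # batched FFT along the leading dim
    dest = similar(src)
    permutedims!(dest, tmp, (2, 1, 3))  # restore the original layout
    return dest
end
```

The two `permutedims!` calls are the overhead of this approach (about 2 ms each at 256³, per the benchmark above).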
I guess; I'm not familiar with CUFFT or the fft API. Seeing […]
I believe you can do this with the various options in […]. The problem I found with […]
Ah, I'm definitely not an expert in FFTs, but can you see how to do an FFT along dimension 2 for an array with 3 dimensions with […]?
I will try and figure it out as soon as I have a minute. I think I might be able to track down some old code. As another data point, MATLAB's CUDA support can do this, though there is a chance that they are doing a permutedims copy under the hood as well.
I apologize for disappearing on this. It's a busy time of the year. From https://docs.nvidia.com/cuda/cufft/index.html#advanced-data-layout: […]

Let's take the example of an FFT across the 2nd dimension only of an array of dimension 3. In this case […]. Generalizing this, it seems like we could support strided FFTs so long as the dimensions are sequential (no skips).
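For concreteness, here is a sketch of the parameter arithmetic (assumptions: a column-major array of size (N1, N2, N3) transformed along dimension 2; the actual cufftPlanMany call is omitted):

```julia
# Advanced-data-layout parameters for a length-N2 transform along dim 2 of a
# column-major (N1, N2, N3) array. One plan covers a single fixed-k slab of
# N1 lines; the remaining slabs still require looping with a pointer offset
# of N1 * N2 elements per slab, which is why arbitrary dims need extra work.
N1, N2, N3 = 256, 256, 256
n       = [N2]   # transform length
istride = N1     # consecutive samples of one FFT line are N1 elements apart
idist   = 1      # neighbouring lines in the batch start 1 element apart
batch   = N1     # N1 lines per slab
```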
My understanding (if this behaves similarly to FFTW) is that this would only do FFTs along the 2nd dimension in the plane corresponding to index 1 in the 1st dimension (the […]). On the FFTW side, this is why the […].
Here is some code which can be used to transform along a single (internal) dimension:

```julia
using CUDA
using CUDA.CUFFT: plan_fft!

function fft!(arr::CuArray, d::Int)
    if d > 1 && d < ndims(arr)
        # all leading dimensions up to and including d
        front_ids = ntuple(dd -> Colon(), d)
        # one representative slab: free in dims 1:d, fixed at 1 in the rest
        ids = ntuple(dd -> ifelse(dd <= d, Colon(), 1), ndims(arr))
        p = plan_fft!((@view arr[ids...]), d)
        # reuse the plan for every combination of trailing indices
        for c in CartesianIndices(size(arr)[d+1:end])
            p * @view arr[front_ids..., Tuple(c)...]
        end
    else
        CUDA.CUFFT.fft!(arr, d)
    end
    return arr
end
```

It is not ideal, but at least it does not do any extra GPU allocations. On my computer it is about 2x slower than an FFT along the outer dimensions. Ideally the loop over the Cartesian indices should be replaced by a call to […].
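A quick usage sketch (assuming the definition above):

```julia
using CUDA

A = CuArray(rand(ComplexF32, 64, 64, 64))
fft!(A, 2)   # in-place transform along the middle dimension, slab by slab
```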
... turns out it is even slightly worse: FFTing data with more than 3 dims crashes:

```julia
julia> fca = fft(CuArray(rand(4,4,4,4)));
ERROR: CUFFTError: user specified an invalid pointer or parameter (code 4, CUFFT_INVALID_VALUE)
Stacktrace:
[1] throw_api_error(res::CUDA.CUFFT.cufftResult_t)
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:11
[2] macro expansion
@ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:24 [inlined]
[3] cufftMakePlanMany(plan::Int32, rank::Int64, n::Vector{Int32}, inembed::Ptr{Nothing}, istride::Int64, idist::Int64, onembed::Ptr{Nothing}, ostride::Int64, odist::Int64, type::CUDA.CUFFT.cufftType_t, batch::Int64, workSize::Base.RefValue{UInt64})
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\call.jl:26
[4] cufftMakePlan(xtype::CUDA.CUFFT.cufftType_t, xdims::NTuple{4, Int64}, region::UnitRange{Int64})
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:37
[5] #133
@ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:145 [inlined]
[6] (::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})()
@ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:28
[7] lock(f::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, l::ReentrantLock)
@ Base .\lock.jl:185
[8] (::CUDA.APIUtils.var"#check_cache#9"{CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})(f::CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})
@ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:26
[9] pop!(f::Function, cache::CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, key::Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}})
@ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:47
[10] cufftGetPlan(::CUDA.CUFFT.cufftType_t, ::Vararg{Any})
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:143
[11] plan_fft(X::CuArray{ComplexF64, 4, CUDA.Mem.DeviceBuffer}, region::UnitRange{Int64})
@ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:163
[12] fft
@ C:\Users\pi96doc\.julia\packages\AbstractFFTs\0uOAT\src\definitions.jl:63 [inlined]
[13] fft (repeats 2 times)
@ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:124 [inlined]
[14] top-level scope
@ REPL[12]:1
[15] top-level scope
@ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\src\initialization.jl:171
```

Might try to give it a shot to fix these things...
This should be mostly solved now by this pull request: […]
@ali-ramadhan @whymusticode what is the consensus here? Is […]?

BTW, I would be grateful if somebody could provide a CUDA-compatible DCT implementation. :D
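In case it helps, here is a minimal 1D sketch of a GPU DCT (hedged: this uses Makhoul's even/odd reordering trick to express the unnormalized DCT-II through a single same-length FFT; `dct_via_fft` is a hypothetical name):

```julia
using CUDA
using CUDA.CUFFT: fft

# Unnormalized DCT-II (FFTW's REDFT10 convention) of a real vector, computed
# entirely on the GPU via one complex FFT (Makhoul's reordering).
function dct_via_fft(x::CuVector{<:Real})
    N = length(x)
    # even 0-based indices first, then odd 0-based indices in reverse
    v = vcat(x[1:2:end], reverse(x[2:2:end]))
    V = fft(complex.(v))
    k = CuArray(0:N-1)
    w = exp.(-im .* (π / (2N)) .* k)   # post-twiddle factors
    return 2 .* real.(w .* V)
end
```

A multi-dimensional DCT would then apply this dimension by dimension, subject to the same batching limitation discussed above.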
I still receive the same error when attempting a 4D fft (even on the latest CUDA.jl release). Did your comment mean that the 4D fft should be working, or just that some machinery is now in place to make this possible? I would be happy to help contribute to getting this functioning if someone can provide some initial guidance on what is needed. Thanks!
Thanks for bringing this up. I will try to look into this.
Yes, currently only a total of up to 3 transform directions is supported; e.g. transforming a 4D array along all four of its dimensions fails.
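If the two partial transforms are individually supported, they can at least be composed by hand (a hedged sketch; `fft4` is a hypothetical helper, not an existing API):

```julia
using CUDA
using CUDA.CUFFT: fft

# Full 4D FFT as a batched 3D FFT over dims 1:3 followed by a 1D FFT along
# dim 4; in both calls the batched dims are sequential, so CUFFT accepts them.
fft4(x::CuArray{<:Complex,4}) = fft(fft(x, 1:3), 4)
```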
In support of investing the work (which I'm happy to help with if pointed in the right direction), I am using 4D FFTs in conjunction with the Fourier Neural Operator (https://github.com/SciML/NeuralOperators.jl) to model 4D PDE solutions (3 spatial dimensions, and time). This is a feature that is supported in PyTorch (https://pytorch.org/docs/stable/generated/torch.fft.fftn.html#torch.fft.fftn), so I think supporting an n-dimensional FFT would help improve feature parity between the Julia and Python ML toolkits.
Not really, sorry. I'm myself not a user of the CUDA FFT functionality.
A 1D FFT across the nth dimension of a more-than-n-dimensional CuArray (where n is not 1) is not enabled by the wrapper (ERROR: ArgumentError: batching dims must be sequential, from CuArrays/wXQp8/src/fft/wrappers.jl).
To reproduce:

```julia
using CuArrays, CUDAnative, CUDAdrv, FFTW
dim = 2
data = CuArrays.rand(ComplexF32, 512, 512, 512);
myFft = plan_fft!(data, dim);
```
The wrapper for regular CPU arrays allows this, and if dim is 1 or 3 it also works as expected.