
merge changes from flatironinstitute#567
mreineck committed Oct 19, 2024
2 parents f596cf6 + 60611eb commit feca94a
Showing 18 changed files with 399 additions and 356 deletions.
12 changes: 7 additions & 5 deletions CHANGELOG
@@ -1,10 +1,12 @@
List of features / changes made / release notes, in reverse chronological order.
If not stated, FINUFFT is assumed (cuFINUFFT <=1.3 is listed separately).

Master (9/10/24)
Master (10/8/24)

* Support and docs for opts.gpu_spreadinterponly=1 for MRI "density compensation
estimation" type 1&2 use-case with upsampfac=1.0 PR564 (Chaithya G R).
* reduced roundoff error in a[n] phase calc in CPU onedim_fseries_kernel().
#534 (Barnett).
PR534 (Barnett).
* GPU code type 1,2 also reduced round-off error in phases, to match CPU code;
rationalized onedim_{fseries,nuft}_* GPU codes to match CPU (Barbone, Barnett)
* Added type 3 in 1D, 2D, and 3D, in the GPU library cufinufft. PR #517, Barbone
@@ -14,9 +16,9 @@ Master (9/10/24)
- Minor fixes on the GPU code:
a) removed memory leaks in case of errors
b) renamed maxbatchsize to batchsize
* Add options for user-provided FFTW locker (PR548, Blackwell). These options can be
used to prevent crashes when a user is creating/destroying FFTW plans and
FINUFFT plans in threads simultaneously.
* Add options for user-provided FFTW locker (PR548, Blackwell). These options
can be used to prevent crashes when a user is creating/destroying FFTW
plans and FINUFFT plans in threads simultaneously.

V 2.3.0 (9/5/24)

14 changes: 8 additions & 6 deletions docs/c_gpu.rst
@@ -1,3 +1,5 @@
.. _c_gpu:

C interface (GPU)
=================

@@ -289,6 +291,8 @@ This deallocates all arrays inside the ``plan`` struct, freeing all internal mem
Note: the plan (being just a pointer to the plan struct) is not actually "destroyed"; rather, its internal struct is destroyed.
There is no need for further deallocation of the plan.

.. _opts_gpu:

Options for GPU code
--------------------

@@ -311,11 +315,9 @@ while ``modeord=1`` selects FFT-style ordering starting at zero and wrapping ove

**gpu_device_id**: Sets the GPU device ID. Leave at default unless you know what you're doing. [To be documented]

Diagnostic options
~~~~~~~~~~~~~~~~~~
**gpu_spreadinterponly**: If ``0`` do the NUFFT as intended. If ``1``, omit the FFT and deconvolution (diagonal division by kernel Fourier transform) steps, which returns *garbage answers as a NUFFT*, but allows advanced users to perform an isolated spreading or interpolation using the usual type 1 or type 2 ``cufinufft`` interface. To do this, the nonzero flag value must be used *only* with ``upsampfac=1.0`` (since no upsampling takes place), and ``kerevalmeth=1``. The known use-case here is estimating so-called density compensation, conventionally used in MRI (see `MRI-NUFFT <https://mind-inria.github.io/mri-nufft/nufft.html>`_), although it might also be useful in spectral Ewald. Please note that this flag is also internally used by type 3 transforms (although it was originally a debug flag).


**gpu_spreadinterponly**: if ``0`` do the NUFFT as intended. If ``1``, omit the FFT and kernel FT deconvolution steps and return garbage answers.
Nonzero value is *only* to be used to aid timing tests (although currently there are no timing codes that exploit this option), and will give wrong or undefined answers for the NUFFT transforms!


Algorithm performance options
@@ -326,7 +328,7 @@ Algorithm performance options
* ``gpu_method=0`` : makes an automatic choice of one of the below methods, based on our heuristics.

* ``gpu_method=1`` : uses a nonuniform-points-driven method, either unsorted (referred to as GM in our paper) or sorted (GM-sort in our paper), depending on the option ``gpu_sort`` below.

* ``gpu_method=2`` : for spreading only, i.e., type 1 transforms, uses a shared-memory output-block-driven method, referred to as SM in our paper. Has no effect for interpolation (type 2 transforms).

* ``gpu_method>2`` : (various unsupported experimental methods due to Melody Shih, not for regular users. Eg ``3`` tests an idea of Paul Springer's to group NU points when spreading, ``4`` is a block gather method of possible interest.)
@@ -335,7 +337,7 @@

**gpu_kerevalmeth**: ``0`` use direct (reference) kernel evaluation, which is not recommended for speed (however, it allows nonstandard ``opts.upsampfac`` to be used). ``1`` use Horner piecewise polynomial evaluation (recommended, and enforces ``upsampfac=2.0``)

**upsampfac**: set upsampling factor. For the recommended ``kerevalmeth=1`` you must choose the standard ``upsampfac=2.0``. If you are willing to risk a slower kernel evaluation, you may set any ``upsampfac>1.0``, but this is experimental and unsupported.
**upsampfac**: set upsampling factor. For the recommended ``kerevalmeth=1`` you must choose the standard ``upsampfac=2.0``. If you are willing to risk a slower kernel evaluation, you may set any ``upsampfac>1.0``, but this is experimental and unsupported. Finally, ``upsampfac=1.0`` is an advanced GPU setting only to be paired with the "spread/interpolate only" mode triggered by setting ``gpu_spreadinterponly=1`` (see options above); do not use this unless you know what you are doing!

**gpu_maxsubprobsize**: maximum number of NU points to be handled in a single subproblem in the spreading SM method (``gpu_method=2`` only)

1 change: 1 addition & 0 deletions docs/error.rst
@@ -29,6 +29,7 @@ has the following meanings (see codes in ``include/finufft_errors.h``):
18 size of bins for subprob/blockgather invalid
19 GPU shmem too small for subprob/blockgather parameters
20 invalid number of nonuniform points: nj or nk negative, or too big (see defs.h)
23 invalid upsampfac set while using gpu_spreadinterponly mode

When ``ier=1`` (warning only) the transform(s) is/are still completed, at the smallest epsilon achievable, so, with that caveat, the answer should still be usable.

12 changes: 7 additions & 5 deletions docs/opts.rst
@@ -54,10 +54,11 @@ Here are their default settings (from ``src/finufft.cpp:finufft_default_opts``):

As for quick advice, the main options you'll want to play with are:

- ``upsampfac`` to trade-off between spread/interpolate vs FFT speed and RAM
- ``modeord`` to flip ("fftshift") the Fourier mode ordering
- ``debug`` to look at timing output (to determine if your problem is spread/interpolation dominated, vs FFT dominated)
- ``nthreads`` to run with a different number of threads than the current maximum available through OpenMP (a large number can sometimes be detrimental, and very small problems can sometimes run faster on 1 thread)
- ``fftw`` to try slower plan modes which give faster transforms. The next natural one to try is ``FFTW_MEASURE`` (look at the FFTW3 docs)
- ``fftw`` to try slower FFTW plan modes which give faster transforms. The next natural one to try is ``FFTW_MEASURE`` (look at the FFTW3 docs)

See :ref:`Troubleshooting <trouble>` for good advice on trying options, and read the full options descriptions below.

@@ -158,18 +159,19 @@ There is thus little reason for the nonexpert to mess with this option.
* ``spread_kerpad=1`` : pad to next multiple of four


**upsampfac**: This is the internal real factor by which the FFT (fine grid)
**upsampfac**: This is the internal factor by which the FFT (fine grid)
is chosen larger than
the number of requested modes in each dimension, for type 1 and 2 transforms.
the number of requested modes in each dimension, for type 1 and 2 transforms. For type 3 transforms this factor gets squared, because a type 2 transform is nested inside a type-1-style spreading operation, so it has even more influence.
We have built efficient kernels
for only two settings, as follows. Otherwise, setting it to zero chooses a good heuristic:

* ``upsampfac=0.0`` : use heuristics to choose ``upsampfac`` as one of the below values, and use this value internally. The value chosen is visible in the text output via setting ``debug>=2``. This setting is recommended for basic users; however, if you seek more performance it is quick to try the other of the below.
* ``upsampfac=0.0`` : use heuristics to choose ``upsampfac`` as one of the below values, and use this value internally. The value chosen is visible in the text output via setting ``debug>=2``. This setting is recommended for basic users; however, if you seek more performance it is quick to try the other of the below values.

* ``upsampfac=2.0`` : standard setting of upsampling. This is necessary if you need to exceed 9 digits of accuracy.
* ``upsampfac=2.0`` : standard setting of upsampling. Due to kernel width restrictions, this is necessary if you need to exceed 9 digits of accuracy.

* ``upsampfac=1.25`` : low-upsampling option, with lower RAM, smaller FFTs, but wider spreading kernel. The latter can be much faster than the standard when the number of nonuniform points is similar to or smaller than the number of modes, and/or if low accuracy is required. It is especially much (2 to 3 times) faster for type 3 transforms. However, the kernel widths :math:`w` are about 50% larger in each dimension, which can lead to slower spreading (it can also be faster due to the smaller size of the fine grid). Because the kernel width is currently limited to 16, only 9-digit accuracy can be reached when using ``upsampfac=1.25``.


**spread_thread**: in the case of multiple transforms per call (``ntr>1``, or the "many" interfaces), controls how multithreading is used to spread/interpolate each batch of data.

* ``spread_thread=0`` : makes an automatic choice between the below. Recommended.
147 changes: 80 additions & 67 deletions include/cufinufft/impl.h
@@ -107,8 +107,10 @@ int cufinufft_makeplan_impl(int type, int dim, int *nmodes, int iflag, int ntran
d_plan->ms = nmodes[0];
d_plan->mt = nmodes[1];
d_plan->mu = nmodes[2];
printf("[cufinufft] (ms,mt,mu): %d %d %d\n", d_plan->ms, d_plan->mt, d_plan->mu);
} else {
if (d_plan->opts.debug) {
printf("[cufinufft] (ms,mt,mu): %d %d %d\n", d_plan->ms, d_plan->mt, d_plan->mu);
}
} else { // type 3 turns its outer type 1 into spreading-only
d_plan->opts.gpu_spreadinterponly = 1;
}

@@ -153,7 +155,13 @@ int cufinufft_makeplan_impl(int type, int dim, int *nmodes, int iflag, int ntran
printf("[cufinufft] upsampfac automatically set to %.3g\n", d_plan->opts.upsampfac);
}
}

if (d_plan->opts.gpu_spreadinterponly) {
    // XNOR via boolean logic: user-set spread/interp-only (types 1,2) requires
    // upsampfac == 1, while the internal type-3 use requires upsampfac != 1.
if ((d_plan->opts.upsampfac != 1) == (type != 3)) {
ier = FINUFFT_ERR_SPREADONLY_UPSAMP_INVALID;
goto finalize;
}
}
/* Setup Spreader */
if ((ier = setup_spreader_for_nufft(d_plan->spopts, tol, d_plan->opts)) > 1) {
// can return FINUFFT_WARN_EPS_TOO_SMALL=1, which is OK
@@ -254,70 +262,75 @@ int cufinufft_makeplan_impl(int type, int dim, int *nmodes, int iflag, int ntran
} break;
}

cufftHandle fftplan;
cufftResult_t cufft_status;
switch (d_plan->dim) {
case 1: {
int n[] = {(int)nf1};
int inembed[] = {(int)nf1};

cufft_status = cufftPlanMany(&fftplan, 1, n, inembed, 1, inembed[0], inembed, 1,
inembed[0], cufft_type<T>(), batchsize);
} break;
case 2: {
int n[] = {(int)nf2, (int)nf1};
int inembed[] = {(int)nf2, (int)nf1};

cufft_status =
cufftPlanMany(&fftplan, 2, n, inembed, 1, inembed[0] * inembed[1], inembed, 1,
inembed[0] * inembed[1], cufft_type<T>(), batchsize);
} break;
case 3: {
int n[] = {(int)nf3, (int)nf2, (int)nf1};
int inembed[] = {(int)nf3, (int)nf2, (int)nf1};

cufft_status = cufftPlanMany(
&fftplan, 3, n, inembed, 1, inembed[0] * inembed[1] * inembed[2], inembed, 1,
inembed[0] * inembed[1] * inembed[2], cufft_type<T>(), batchsize);
} break;
}
  // We don't need any cuFFT plans or kernel values if we are only spreading /
  // interpolating
if (!d_plan->opts.gpu_spreadinterponly) {
cufftHandle fftplan;
cufftResult_t cufft_status;
switch (d_plan->dim) {
case 1: {
int n[] = {(int)nf1};
int inembed[] = {(int)nf1};

cufft_status = cufftPlanMany(&fftplan, 1, n, inembed, 1, inembed[0], inembed, 1,
inembed[0], cufft_type<T>(), batchsize);
} break;
case 2: {
int n[] = {(int)nf2, (int)nf1};
int inembed[] = {(int)nf2, (int)nf1};

cufft_status =
cufftPlanMany(&fftplan, 2, n, inembed, 1, inembed[0] * inembed[1], inembed, 1,
inembed[0] * inembed[1], cufft_type<T>(), batchsize);
} break;
case 3: {
int n[] = {(int)nf3, (int)nf2, (int)nf1};
int inembed[] = {(int)nf3, (int)nf2, (int)nf1};

cufft_status = cufftPlanMany(
&fftplan, 3, n, inembed, 1, inembed[0] * inembed[1] * inembed[2], inembed, 1,
inembed[0] * inembed[1] * inembed[2], cufft_type<T>(), batchsize);
} break;
}

if (cufft_status != CUFFT_SUCCESS) {
fprintf(stderr, "[%s] cufft makeplan error: %s", __func__,
cufftGetErrorString(cufft_status));
ier = FINUFFT_ERR_CUDA_FAILURE;
goto finalize;
if (cufft_status != CUFFT_SUCCESS) {
fprintf(stderr, "[%s] cufft makeplan error: %s", __func__,
cufftGetErrorString(cufft_status));
ier = FINUFFT_ERR_CUDA_FAILURE;
goto finalize;
}
cufftSetStream(fftplan, stream);

d_plan->fftplan = fftplan;

// compute up to 3 * NQUAD precomputed values on CPU
T fseries_precomp_phase[3 * MAX_NQUAD];
T fseries_precomp_f[3 * MAX_NQUAD];
thrust::device_vector<T> d_fseries_precomp_phase(3 * MAX_NQUAD);
thrust::device_vector<T> d_fseries_precomp_f(3 * MAX_NQUAD);
onedim_fseries_kernel_precomp<T>(d_plan->nf1, fseries_precomp_f,
fseries_precomp_phase, d_plan->spopts);
if (d_plan->dim > 1)
onedim_fseries_kernel_precomp<T>(d_plan->nf2, fseries_precomp_f + MAX_NQUAD,
fseries_precomp_phase + MAX_NQUAD,
d_plan->spopts);
if (d_plan->dim > 2)
onedim_fseries_kernel_precomp<T>(d_plan->nf3, fseries_precomp_f + 2 * MAX_NQUAD,
fseries_precomp_phase + 2 * MAX_NQUAD,
d_plan->spopts);
// copy the precomputed data to the device using thrust
thrust::copy(fseries_precomp_phase, fseries_precomp_phase + 3 * MAX_NQUAD,
d_fseries_precomp_phase.begin());
thrust::copy(fseries_precomp_f, fseries_precomp_f + 3 * MAX_NQUAD,
d_fseries_precomp_f.begin());
// the full fseries is done on the GPU here
if ((ier = fseries_kernel_compute(
d_plan->dim, d_plan->nf1, d_plan->nf2, d_plan->nf3,
d_fseries_precomp_f.data().get(), d_fseries_precomp_phase.data().get(),
d_plan->fwkerhalf1, d_plan->fwkerhalf2, d_plan->fwkerhalf3,
d_plan->spopts.nspread, stream)))
goto finalize;
}
cufftSetStream(fftplan, stream);

d_plan->fftplan = fftplan;

// compute up to 3 * NQUAD precomputed values on CPU
T fseries_precomp_phase[3 * MAX_NQUAD];
T fseries_precomp_f[3 * MAX_NQUAD];
thrust::device_vector<T> d_fseries_precomp_phase(3 * MAX_NQUAD);
thrust::device_vector<T> d_fseries_precomp_f(3 * MAX_NQUAD);
onedim_fseries_kernel_precomp<T>(d_plan->nf1, fseries_precomp_f,
fseries_precomp_phase, d_plan->spopts);
if (d_plan->dim > 1)
onedim_fseries_kernel_precomp<T>(d_plan->nf2, fseries_precomp_f + MAX_NQUAD,
fseries_precomp_phase + MAX_NQUAD, d_plan->spopts);
if (d_plan->dim > 2)
onedim_fseries_kernel_precomp<T>(d_plan->nf3, fseries_precomp_f + 2 * MAX_NQUAD,
fseries_precomp_phase + 2 * MAX_NQUAD,
d_plan->spopts);
// copy the precomputed data to the device using thrust
thrust::copy(fseries_precomp_phase, fseries_precomp_phase + 3 * MAX_NQUAD,
d_fseries_precomp_phase.begin());
thrust::copy(fseries_precomp_f, fseries_precomp_f + 3 * MAX_NQUAD,
d_fseries_precomp_f.begin());
// the full fseries is done on the GPU here
if ((ier = fseries_kernel_compute(
d_plan->dim, d_plan->nf1, d_plan->nf2, d_plan->nf3,
d_fseries_precomp_f.data().get(), d_fseries_precomp_phase.data().get(),
d_plan->fwkerhalf1, d_plan->fwkerhalf2, d_plan->fwkerhalf3,
d_plan->spopts.nspread, stream)))
goto finalize;
}
finalize:
if (ier > 1) {
@@ -748,8 +761,8 @@ int cufinufft_setpts_impl(int M, T *d_kx, T *d_ky, T *d_kz, int N, T *d_s, T *d_
thrust::cuda::par.on(stream), phase_iterator, phase_iterator + N,
d_plan->deconv, d_plan->deconv,
[c1, c2, c3, d1, d2, d3, realsign] __host__ __device__(
const thrust::tuple<T, T, T> tuple,
cuda_complex<T> deconv) -> cuda_complex<T> {
const thrust::tuple<T, T, T> tuple, cuda_complex<T> deconv)
-> cuda_complex<T> {
// d2 and d3 are 0 if dim < 2 and dim < 3
const auto phase = c1 * (thrust::get<0>(tuple) + d1) +
c2 * (thrust::get<1>(tuple) + d2) +
1 change: 1 addition & 0 deletions include/finufft.fh
@@ -8,4 +8,5 @@ c Also see ../fortran/finufftfort.cpp. Barnett 5/29/20. One prec 7/2/20.
real*8 upsampfac
integer spread_thread,maxbatchsize,showwarn,nthreads,
$ spread_nthr_atomic,spread_max_sp_size
integer fftw_lock_fun,fftw_unlock_fun,fftw_lock_data
end type
3 changes: 2 additions & 1 deletion include/finufft_errors.h
@@ -25,6 +25,7 @@ enum {
FINUFFT_ERR_INSUFFICIENT_SHMEM = 19,
FINUFFT_ERR_NUM_NU_PTS_INVALID = 20,
FINUFFT_ERR_INVALID_ARGUMENT = 21,
FINUFFT_ERR_LOCK_FUNS_INVALID = 22
FINUFFT_ERR_LOCK_FUNS_INVALID = 22,
FINUFFT_ERR_SPREADONLY_UPSAMP_INVALID = 23,
};
#endif
2 changes: 1 addition & 1 deletion make.inc.macosx_clang_matlab
@@ -21,7 +21,7 @@ CXX=clang++
CC=clang

# taken from makefile...
CFLAGS += -I include -I/usr/local/include -I/opt/homebrew/include
CFLAGS += -I include -I/usr/local/include -I/usr/local/opt/libomp/include -I/opt/homebrew/include
FFLAGS = $(CFLAGS)
CXXFLAGS = $(CFLAGS)
LIBS += -L/usr/local/lib -L/opt/homebrew/lib