Legion and CUDA
- Overview
- Accessors
- Restrictions and the CUDA Hijack
- Streams
- Use of CUDA Libraries
- Deferred Values and Buffers
Overview
From its inception, Legion has supported the use of CUDA inside of Legion tasks running on GPU processors (a GPU processor is a CPU thread running with a CUDA context implicitly attached). Importantly though, the CUDA programming model is slightly altered inside of Legion tasks. Many things that CUDA requires users to do explicitly, such as copying data to and from the device, are handled by the Legion programming model and runtime, so users do not need to manage them when using CUDA inside of Legion. Similarly, Legion does not require explicit synchronization at the end of tasks, because Legion tracks all the "effects" of asynchronous CUDA operations launched inside of a Legion task. The result is that the primary way to use CUDA inside of a Legion task is simply to launch a kernel and immediately return. Here is an example from our circuit simulation example:
__global__
void update_voltages_kernel(Point<1> first,
                            const int num_nodes,
                            const AccessorRWfloat fa_pvt_voltage,
                            const AccessorRWfloat fa_shr_voltage,
                            const AccessorRWfloat fa_pvt_charge,
                            const AccessorRWfloat fa_shr_charge,
                            const AccessorROfloat fa_pvt_cap,
                            const AccessorROfloat fa_shr_cap,
                            const AccessorROfloat fa_pvt_leakage,
                            const AccessorROfloat fa_shr_leakage,
                            const AccessorROloc fa_ptr_loc)
{
  // One thread per node
  const int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < num_nodes)
  {
    const Point<1> node_ptr = first + tid;
    PointerLocation node_loc = fa_ptr_loc[node_ptr];
    if (node_loc == PRIVATE_PTR)
    {
      // Private node: read and update the private instances
      float voltage = fa_pvt_voltage[node_ptr];
      float charge = fa_pvt_charge[node_ptr];
      float capacitance = fa_pvt_cap[node_ptr];
      float leakage = fa_pvt_leakage[node_ptr];
      voltage += (charge / capacitance);
      voltage *= (1.f - leakage);
      fa_pvt_voltage[node_ptr] = voltage;
      fa_pvt_charge[node_ptr] = 0.f;
    }
    else
    {
      // Shared node: read and update the shared instances
      float voltage = fa_shr_voltage[node_ptr];
      float charge = fa_shr_charge[node_ptr];
      float capacitance = fa_shr_cap[node_ptr];
      float leakage = fa_shr_leakage[node_ptr];
      voltage += (charge / capacitance);
      voltage *= (1.f - leakage);
      fa_shr_voltage[node_ptr] = voltage;
      fa_shr_charge[node_ptr] = 0.f;
    }
  }
}
__host__
void UpdateVoltagesTask::gpu_base_impl(const CircuitPiece &piece,
                                       const std::vector<PhysicalRegion> &regions)
{
  const AccessorRWfloat fa_pvt_voltage(regions[0], FID_NODE_VOLTAGE);
  const AccessorRWfloat fa_pvt_charge(regions[0], FID_CHARGE);
  const AccessorRWfloat fa_shr_voltage(regions[1], FID_NODE_VOLTAGE);
  const AccessorRWfloat fa_shr_charge(regions[1], FID_CHARGE);
  const AccessorROfloat fa_pvt_cap(regions[2], FID_NODE_CAP);
  const AccessorROfloat fa_pvt_leakage(regions[2], FID_LEAKAGE);
  const AccessorROfloat fa_shr_cap(regions[3], FID_NODE_CAP);
  const AccessorROfloat fa_shr_leakage(regions[3], FID_LEAKAGE);
  const AccessorROloc fa_ptr_loc(regions[4], FID_LOCATOR);
  const int threads_per_block = 256;
  const int num_blocks = (piece.num_nodes + (threads_per_block-1)) / threads_per_block;
  // Launch the kernel and return immediately; Legion tracks its completion
  // as one of the task's effects.
  update_voltages_kernel<<<num_blocks,threads_per_block>>>(piece.first_node,
                                                           piece.num_nodes,
                                                           fa_pvt_voltage,
                                                           fa_shr_voltage,
                                                           fa_pvt_charge,
                                                           fa_shr_charge,
                                                           fa_pvt_cap,
                                                           fa_shr_cap,
                                                           fa_pvt_leakage,
                                                           fa_shr_leakage,
                                                           fa_ptr_loc);
}
It is important to understand the motivation for writing CUDA code this way in Legion. Both Legion and CUDA operate on the principle of deferred execution, which involves never blocking, in order to expose as much parallelism to the underlying system as possible. Most Legion tasks only launch one or a handful of kernels, and blocking to synchronize those kernels before the task ends would limit the number of kernels from other tasks that could potentially execute in parallel on the GPU. We therefore neither require nor suggest that users synchronize any CUDA calls they perform inside their tasks unless absolutely necessary.
Restrictions and the CUDA Hijack
Legion supports the use of CUDA runtime API calls directly inside of Legion tasks. We do NOT support the use of CUDA driver API calls. The reason for this restriction is that we provide our own implementation of the CUDA runtime API inside of the Realm runtime, which we refer to as our "CUDA hijack". The CUDA hijack intercepts all CUDA runtime API calls from an application and records any asynchronous effects they produce, so that we know those effects must finish before the task can be considered complete. We then pass these calls down to the normal CUDA driver API for execution, so your code has the same semantics as a normal CUDA program. Note that this means you should still include the cuda_runtime.h header file for all your CUDA code, and you can still use nvcc to compile it. The primary difference is that you should never link against -lcudart, as the same symbols are defined in -lrealm. Both the Legion make and cmake build systems will maintain this invariant for you, but it is important that you do not circumvent them by adding -lcudart yourself.
The CUDA runtime API is large and constantly changing, so it is challenging for us to maintain perfect compatibility with it. It is possible that you will encounter functions which we have not implemented. In these cases your code will compile, but it will fail to link with a missing symbol for any unimplemented function. Please request support for those functions by creating a GitHub issue. If you need to run in the meantime while waiting for a function to be implemented, you can disable the CUDA hijack by setting REALM_USE_CUDART_HIJACK=0 when invoking either build system. This will disable our hijack and link against -lcudart. For correctness the runtime will then automatically call cudaDeviceSynchronize for you after every task execution. This will likely degrade your code's performance significantly, but it will execute correctly for testing purposes until we can implement the CUDA feature that you need.
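To make the common case concrete: with the hijack in place, a task body is free to issue asynchronous runtime calls and return immediately. The following is a minimal, hedged sketch (the helper function and the pointers are hypothetical, not part of the circuit example):
#include <cuda_runtime.h>  // still include the runtime header; just never link -lcudart

// Hypothetical helper called from a GPU task body: stage one device buffer into
// another without any explicit synchronization.
__host__ void stage_device_copy(double *dst, const double *src, size_t n)
{
  // The call is intercepted by the hijack, recorded as an effect of the
  // enclosing task, and then forwarded to the CUDA driver for execution.
  cudaMemcpyAsync(dst, src, n * sizeof(double), cudaMemcpyDeviceToDevice);
  // No cudaStreamSynchronize or cudaDeviceSynchronize: Legion will not consider
  // the task complete until the copy has finished.
}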
Accessors
You'll notice in the circuit code example above that Legion accessors are supported in CUDA kernels as well. Most accessor methods carry the appropriate CUDA annotations so that you can call them in GPU code. There are two primary exceptions. First, we only support FieldAccessor objects instantiated with Realm::AffineAccessor, because the member methods of the more general Realm::GenericAccessor are not supported on GPUs. Second, bounds checks on the GPU are not precise for sparse instances. The reason for this is that Realm currently does not support testing sparsity maps for membership on the GPU. We will check that accesses fall within the bounds of the "upper bound" rectangle for a sparse index space on the GPU, which is precise (i.e. you will never get a false positive) but not sound (i.e. you could get a false negative: an access that falls outside the actual sparse index space but inside its bounding rectangle will not be reported). We will be working to address this shortcoming in the future.
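For reference, accessor typedefs like the AccessorROfloat and AccessorRWfloat used in the circuit example are commonly declared along the following lines. This is a sketch following the circuit example's conventions rather than an excerpt from it; the key point is that the accessor template argument names Realm::AffineAccessor so that the accessor methods carry the CUDA annotations needed for device code:
#include "legion.h"
using namespace Legion;

// Read-only and read-write accessors over 1-D float fields, specialized with
// Realm::AffineAccessor so that operator[] and ptr() can be called in kernels.
typedef FieldAccessor<READ_ONLY, float, 1, coord_t,
                      Realm::AffineAccessor<float,1,coord_t> > AccessorROfloat;
typedef FieldAccessor<READ_WRITE, float, 1, coord_t,
                      Realm::AffineAccessor<float,1,coord_t> > AccessorRWfloat;

// A GenericAccessor-based accessor like this one works on the host, but its
// member methods cannot be called from GPU kernels.
typedef FieldAccessor<READ_ONLY, float, 1, coord_t,
                      Realm::GenericAccessor<float,1,coord_t> > HostOnlyROfloat;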
Streams
Legion supports the use of CUDA streams inside of tasks. Calls to cudaStreamCreate will return actual CUDA streams, and we make the best possible effort to support CUDA stream semantics, such as synchronizing with the CUDA default stream. There are a few exceptions here. In general, each Legion task is implicitly associated with its own stream rather than the normal CUDA default stream. As a user, you can continue to use the default stream (e.g. stream 0) and Legion will automatically move your kernels onto the task's designated default stream. Additionally, every call to cudaStreamCreate will return your task's default stream unless you explicitly call cudaStreamCreateWithFlags and pass cudaStreamNonBlocking. Realm internally keeps a pool of streams to hand back from these calls; the default size of this pool is 12 and can be adjusted with the -ll:streams flag. After the whole pool has been handed out, later calls wrap around and continue handing out streams from the same pool. Please always call cudaStreamDestroy on any streams you create; we guarantee the correct thing will happen, and you do not need to worry about duplicate deletions of streams.
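As a hedged illustration of these semantics, the sketch below shows a GPU task body that obtains its default stream, requests an additional stream from Realm's pool, and returns without synchronizing. The kernel, function, and pointer names are hypothetical and not part of the circuit example:
#include <cuda_runtime.h>

// Hypothetical kernel used only to show where the stream argument goes.
__global__ void scale_kernel(float *data, size_t n, float alpha)
{
  const size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= alpha;
}

// Sketch of a GPU task body; 'device_ptr' would normally come from a region
// accessor, and n is assumed large enough that both halves are non-empty.
__host__ void scale_on_streams(float *device_ptr, size_t n)
{
  // Under the hijack this hands back the task's designated default stream.
  cudaStream_t task_stream;
  cudaStreamCreate(&task_stream);
  // Explicitly requesting a non-blocking stream returns a distinct stream
  // from Realm's internal pool (sized by -ll:streams).
  cudaStream_t extra_stream;
  cudaStreamCreateWithFlags(&extra_stream, cudaStreamNonBlocking);
  // Two independent launches over disjoint halves that may overlap on the device.
  const unsigned threads = 256;
  const size_t half = n / 2;
  const unsigned blocks_lo = (unsigned)((half + threads - 1) / threads);
  const unsigned blocks_hi = (unsigned)(((n - half) + threads - 1) / threads);
  scale_kernel<<<blocks_lo, threads, 0, task_stream>>>(device_ptr, half, 2.f);
  scale_kernel<<<blocks_hi, threads, 0, extra_stream>>>(device_ptr + half, n - half, 2.f);
  // Always destroy created streams; Realm manages the underlying pool and
  // duplicate destruction is handled correctly.
  cudaStreamDestroy(extra_stream);
  cudaStreamDestroy(task_stream);
  // No cudaDeviceSynchronize: Legion tracks the outstanding launches for us.
}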
It is worth noting that our implementation maintains the synchronous semantics of the CUDA default stream as long as all CUDA code proceeds through our hijack implementation of the CUDA runtime API. If you have code that does not go through our hijack, then kernels can execute in a non-standard order with respect to the CUDA default stream. One case of this can occur with CUDA libraries that are statically linked against the CUDA runtime API (more on this below). We encourage users to be aware of this limitation and to be on the lookout for cases where libraries or other shared objects have silently linked statically against the CUDA runtime library and are circumventing the Legion CUDA hijack, as this may lead to undefined behavior.
Use of CUDA Libraries
As was noted previously, most CUDA libraries (e.g. cuBLAS, cuDNN, cuFFT) are statically linked against the CUDA runtime and therefore circumvent our CUDA hijack. That does not mean they cannot be used inside of Legion tasks. Instead, users should explicitly call cudaStreamCreate to get the default stream for their task and then use it to set the current stream for the CUDA library:
cudaStream_t stream;
cudaStreamCreate(&stream);
cublasSetStream(cublas_handle, stream);
// Do any cublas calls now
cudaStreamDestroy(stream);
Note that this must be done at the start of every task that uses a CUDA library, as the "default" stream can (and will) change on a per-task basis.
Creating CUDA library handles such as a cuBLAS handle is very expensive, so we recommend that users do it infrequently. For example, you should not be initializing and destroying an instance of cuBLAS inside of every task. Instead, we recommend that users maintain a global lookup data structure for each CUDA library so that any GPU task running on a GPU processor can get the instance of the CUDA library it needs throughout the execution of the program. Users can run an initialization task to initialize all the CUDA libraries on the GPU processors where they will need them. Importantly, we recommend that users make one instance of each CUDA library handle per GPU processor. The reason for this is that calls like cublasSetStream are stateful, and undefined behavior will occur if one task changes the stream while another task is still issuing cuBLAS calls.
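A sketch of this pattern is shown below. The task bodies, the handle lookup map, and the surrounding registration details are hypothetical, and a real program would also need to guard concurrent initialization of the map:
#include <map>
#include <cublas_v2.h>
#include "legion.h"
using namespace Legion;

// One cuBLAS handle per GPU processor, created once during initialization.
static std::map<Processor, cublasHandle_t> cublas_handles;

// Initialization task: launch once on every GPU processor that will run cuBLAS work.
void init_cublas_task(const Task *task,
                      const std::vector<PhysicalRegion> &regions,
                      Context ctx, Runtime *runtime)
{
  const Processor proc = runtime->get_executing_processor(ctx);
  cublasHandle_t handle;
  cublasCreate(&handle);          // expensive: do this once, not in every task
  cublas_handles[proc] = handle;
}

// A task that uses cuBLAS: look up this processor's handle and re-bind it to
// the task's default stream at the start of the task.
void gemm_task(const Task *task,
               const std::vector<PhysicalRegion> &regions,
               Context ctx, Runtime *runtime)
{
  const Processor proc = runtime->get_executing_processor(ctx);
  cublasHandle_t handle = cublas_handles[proc];
  cudaStream_t stream;
  cudaStreamCreate(&stream);      // the task's default stream under the hijack
  cublasSetStream(handle, stream);
  // ... cublasSgemm(...) and other cuBLAS calls on data from 'regions' ...
  cudaStreamDestroy(stream);
}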
Deferred Values and Buffers
The discouragement of synchronization for CUDA execution inside of Legion tasks can make it challenging for computations that need to return explicit future values or create temporary allocations for kernels. In the first case, we do not want to synchronize just to copy data back explicitly for the return value. In the second case, we do not want to synchronize unnecessarily just to call cudaFree when the kernels are done. Legion provides special support for both of these cases with DeferredValue and DeferredBuffer objects. Examples of both kinds of objects can be seen in the following code from the SNAP mini-application, which performs a convergence test over all the cells on the GPU and reduces down to a single bool return value indicating whether the computation has reached a steady state:
__global__
void gpu_outer_convergence(const Point<3> origin,
                           const AccessorRO<double,3> fa_flux0,
                           const AccessorRO<double,3> fa_flux0po,
                           const double epsi,
                           const DeferredBuffer<int,1> results,
                           const int results_offset)
{
  // We know there is never more than 32 warps in a CTA
  __shared__ int trampoline[32];
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;
  const int z = blockIdx.z * blockDim.z + threadIdx.z;
  const Point<3> p = origin + Point<3>(x,y,z);
  const double *flux0_ptr = fa_flux0.ptr(p);
  const double *flux0po_ptr = fa_flux0po.ptr(p);
  const double tolr = 1.0e-12;
  double flux0po = *flux0po_ptr;
  double df = 1.0;
  if (fabs(flux0po) < tolr) {
    flux0po = 1.0;
    df = 0.0;
  }
  double flux0 = *flux0_ptr;
  df = fabs( (flux0 / flux0po) - df );
  int local_converged = 1;
  if ((df >= -INFINITY) && (df > epsi))
    local_converged = 0;
  // Perform a local reduction inside the CTA
  // Butterfly reduction across all threads in all warps
  for (int i = 16; i >= 1; i/=2)
    local_converged += __shfl_xor_sync(0xffffffff, local_converged, i, 32);
  unsigned laneid;
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid) : );
  unsigned warpid =
    ((threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x) >> 5;
  // First thread in each warp writes out all values
  if (laneid == 0)
    trampoline[warpid] = local_converged;
  __syncthreads();
  // Butterfly reduction across all threads in the first warp
  if (warpid == 0) {
    unsigned numwarps = (blockDim.x * blockDim.y * blockDim.z) >> 5;
    local_converged = (laneid < numwarps) ? trampoline[laneid] : 0;
    for (int i = 16; i >= 1; i/=2)
      local_converged += __shfl_xor_sync(0xffffffff, local_converged, i, 32);
    // First thread writes out the block's result
    if (laneid == 0)
      results.write(Point<1>(results_offset +
            (blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x),
          local_converged);
  }
}
__global__
void gpu_sum_outer_convergence(const DeferredBuffer<int,1> buffer,
                               const DeferredValue<bool> result,
                               const size_t total_blocks,
                               const int expected)
{
  __shared__ int trampoline[32];
  int offset = threadIdx.x;
  int total = 0;
  while (offset < total_blocks) {
    total += buffer.read(Point<1>(offset));
    offset += blockDim.x;
  }
  for (int i = 16; i >= 1; i/=2)
    total += __shfl_xor_sync(0xffffffff, total, i, 32);
  unsigned laneid;
  asm volatile("mov.u32 %0, %%laneid;" : "=r"(laneid) : );
  unsigned warpid = threadIdx.x >> 5;
  // Write results in the trampoline
  if (laneid == 0)
    trampoline[warpid] = total;
  __syncthreads();
  if (warpid == 0)
  {
    unsigned numwarps = blockDim.x >> 5;
    total = (laneid < numwarps) ? trampoline[laneid] : 0;
    for (int i = 16; i >= 1; i/=2)
      total += __shfl_xor_sync(0xffffffff, total, i, 32);
    if (laneid == 0)
      result.write(total == expected);
  }
}
__host__
void run_outer_convergence(Rect<3> subgrid_bounds,
                           const DeferredValue<bool> &result,
                           const std::vector<AccessorRO<double,3> > &fa_flux0,
                           const std::vector<AccessorRO<double,3> > &fa_flux0po,
                           const double epsi)
{
  // Launch the kernels
  const int x_range = (subgrid_bounds.hi[0] - subgrid_bounds.lo[0]) + 1;
  const int y_range = (subgrid_bounds.hi[1] - subgrid_bounds.lo[1]) + 1;
  const int z_range = (subgrid_bounds.hi[2] - subgrid_bounds.lo[2]) + 1;
  dim3 block(gcd(x_range,32),gcd(y_range,4),gcd(z_range,4));
  dim3 grid(x_range/block.x, y_range/block.y, z_range/block.z);
  const size_t total_blocks = grid.x*grid.y*grid.z;
  assert(fa_flux0.size() == fa_flux0po.size());
  const Rect<1> bounds(Point<1>(0),Point<1>(total_blocks * fa_flux0.size() - 1));
  // Temporary device allocation that Legion frees once the kernels are done
  DeferredBuffer<int,1> buffer(bounds, Memory::GPU_FB_MEM);
  for (unsigned idx = 0; idx < fa_flux0.size(); idx++) {
    gpu_outer_convergence<<<grid,block>>>(subgrid_bounds.lo,
                                          fa_flux0[idx], fa_flux0po[idx],
                                          epsi, buffer, idx * total_blocks);
  }
  dim3 block2((bounds.hi[0]+1) > 1024 ? 1024 : (bounds.hi[0]+1),1,1);
  // Round up to the nearest multiple of warps
  while ((block2.x % 32) != 0)
    block2.x++;
  dim3 grid2(1,1,1);
  const int expected = x_range * y_range * z_range * fa_flux0.size();
  gpu_sum_outer_convergence<<<grid2,block2>>>(buffer, result, bounds.hi[0]+1, expected);
}
__host__
DeferredValue<bool>
TestOuterConvergence::gpu_implementation(const Task *task,
      const std::vector<PhysicalRegion> &regions, Context ctx, Runtime *runtime)
{
  log_snap.info("Running GPU Test Outer Convergence");
  DeferredValue<bool> result(false);
  // If the inner loop didn't converge, then we can't either
  assert(!task->futures.empty());
  bool inner_converged = task->futures[0].get_result<bool>();
  if (!inner_converged)
    return result;
  // Get the index space domain for iteration
  assert(task->regions[0].region.get_index_space() ==
         task->regions[1].region.get_index_space());
  Domain<3> dom = runtime->get_index_space_domain(ctx,
      IndexSpace<3>(task->regions[0].region.get_index_space()));
  const double epsi = 100.0 * Snap::convergence_eps;
  std::vector<AccessorRO<double,3> > fa_flux0(
      task->regions[0].privilege_fields.size());
  std::vector<AccessorRO<double,3> > fa_flux0po(fa_flux0.size());
  unsigned idx = 0;
  for (std::set<FieldID>::const_iterator it =
        task->regions[0].privilege_fields.begin(); it !=
        task->regions[0].privilege_fields.end(); it++, idx++)
  {
    fa_flux0[idx] = AccessorRO<double,3>(regions[0], *it);
    fa_flux0po[idx] = AccessorRO<double,3>(regions[1], *it);
  }
  run_outer_convergence(dom.bounds, result, fa_flux0, fa_flux0po, epsi);
  return result;
}
DeferredValue types are objects that users can directly use as the return value for a task; the runtime understands them and will introspect them once all of the task's "effects" (e.g. kernel launches) have been performed. Note that the task above directly returns a DeferredValue<bool> type that will automatically be interpreted by the runtime as a bool in the future that is produced. DeferredValue objects can also be passed directly to GPU kernels and accessed as though they were a normal value of their template type (here bool). It is important that every DeferredValue object be used as the return value for a task or small memory leaks may occur. Note that there are also specialized DeferredReduction objects (not shown in this example) that inherit from DeferredValue and support reduction operations on DeferredValue objects.
Unlike DeferredValue objects, which are used for return values, DeferredBuffer objects are for temporary allocations used between kernels launched by the same task. A DeferredBuffer creates a temporary allocation that will automatically be freed by Legion only after all the asynchronous "effects" (e.g. kernels and copies) performed by the task are done, thereby allowing the user to return from task execution without having to synchronize in order to free the memory. Users can specify either a Memory or a Memory::Kind to use when creating a DeferredBuffer. In the case of GPUs, it is most common to allocate DeferredBuffer objects in the GPU_FB_MEM associated with the current GPU, as we do in this example. Here we create a deferred buffer to hold temporary convergence counts across cells and fields, and then later run a reduction kernel that reduces those values down to the single result stored in the DeferredValue return value.
While DeferredValue and DeferredBuffer objects are most commonly used for tasks with asynchronous effects such as GPU tasks, they are not specific to GPUs and can be used in any kind of Legion task.
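To make the minimal pattern easier to see apart from the full SNAP example, here is a hedged sketch of a small GPU task that combines both objects. The task, kernel, and typedef names are hypothetical, and it assumes a single double field on a non-empty 1-D region:
#include "legion.h"
using namespace Legion;

typedef FieldAccessor<READ_ONLY, double, 1, coord_t,
                      Realm::AffineAccessor<double,1,coord_t> > AccessorROdouble;

// Each block writes one partial sum into the temporary DeferredBuffer.
__global__ void block_sums(const AccessorROdouble fa_data,
                           const Point<1> origin,
                           const size_t num_elements,
                           DeferredBuffer<double,1> partials)
{
  __shared__ double scratch[256];
  const coord_t tid = blockIdx.x * blockDim.x + threadIdx.x;
  scratch[threadIdx.x] = (tid < (coord_t)num_elements) ? fa_data[origin + tid] : 0.0;
  __syncthreads();
  // Tree reduction within the block (blockDim.x is a power of two here)
  for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride)
      scratch[threadIdx.x] += scratch[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0)
    partials.write(Point<1>(blockIdx.x), scratch[0]);
}

// A single thread folds the partial sums into the DeferredValue return value.
__global__ void fold_sums(const DeferredBuffer<double,1> partials,
                          const size_t num_blocks,
                          DeferredValue<double> result)
{
  double total = 0.0;
  for (size_t i = 0; i < num_blocks; i++)
    total += partials.read(Point<1>(i));
  result.write(total);
}

// Hypothetical GPU task variant: returns the DeferredValue without synchronizing.
__host__ DeferredValue<double>
sum_field_gpu(const Task *task,
              const std::vector<PhysicalRegion> &regions,
              Context ctx, Runtime *runtime)
{
  DeferredValue<double> result(0.0);
  const Rect<1> bounds = runtime->get_index_space_domain(ctx,
      task->regions[0].region.get_index_space());
  const AccessorROdouble fa_data(regions[0],
      *(task->regions[0].privilege_fields.begin()));
  const size_t num_elements = bounds.volume();
  const unsigned threads = 256;
  const unsigned blocks = (unsigned)((num_elements + threads - 1) / threads);
  // Temporary framebuffer allocation; Legion frees it after the kernels finish.
  DeferredBuffer<double,1> partials(Rect<1>(Point<1>(0), Point<1>(blocks - 1)),
                                    Memory::GPU_FB_MEM);
  block_sums<<<blocks, threads>>>(fa_data, bounds.lo, num_elements, partials);
  fold_sums<<<1, 1>>>(partials, blocks, result);
  return result;  // no cudaDeviceSynchronize needed
}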