Using CUDA Memory
In this article, we are going to learn how to interact with memory on both the host and the device through a set of short examples that each highlight an aspect of working with the different types of memory found in CUDA.
When using separate host and device memories (either pinned or paged host memory), any data that needs to go from one side to the other must be transferred via the cudaMemcpy call.
__host__ cudaError_t cudaMemcpy (void* dst, const void* src, size_t count, cudaMemcpyKind kind)
As you can see in the function signature above, the copy function takes four parameters. In order, they are a pointer to the data's destination, a pointer to the data source, the number of bytes to copy from source to destination, and the kind of memory copy being performed.
An example call to the CUDA copy function might look something like this:
// Copy 8 integers from h_array to d_array
cudaMemcpy(d_array, h_array, 8*sizeof(int), cudaMemcpyHostToDevice);
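As the signature above shows, cudaMemcpy also returns a cudaError_t, so you can check whether a copy succeeded. Here is a minimal sketch, reusing h_array and d_array from the example above and assuming <cstdio> has been included:
cudaError_t err = cudaMemcpy(d_array, h_array, 8*sizeof(int), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
    // cudaGetErrorString converts the error code into a human-readable message
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
}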
Let's break down each parameter in a bit more detail.
The first two parameters are both pointers to memory allocated either on the CPU or the GPU, depending on the kind of copy being performed. These can be pointers to any data type, and they do not have to point to the base address of the allocation.
For example, if we want to copy 4 elements into d_array beginning at index 4, we can modify our call to cudaMemcpy to accomplish this:
// Copy first 4 integers from h_array to d_array starting at d_array[4]
cudaMemcpy(d_array+4, h_array, 4*sizeof(int), cudaMemcpyHostToDevice);
The third parameter works much the same as the argument to malloc does; it tells the copy instruction how many bytes should be copied from the source to the destination. The sizeof operator in our example performs the conversion from a number of integers to a number of bytes for us, so all we have to do is multiply that value by the number of elements we want to transfer. This same method works for both primitive and complex data types.
Because this parameter controls the amount of memory transferred, choose its value carefully so the copy does not run past the bounds of either the source or destination allocation.
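As an illustration of the complex-type case, the same sizeof-based calculation works for an array of structs. This is a minimal sketch; the Point3 type and the h_points/d_points names are hypothetical and introduced only for this example:
struct Point3 { float x, y, z; };

// Copy an array of 16 structs; sizeof(Point3) is the size of one element in bytes
Point3 h_points[16];
Point3 *d_points;
cudaMalloc(&d_points, 16*sizeof(Point3));
cudaMemcpy(d_points, h_points, 16*sizeof(Point3), cudaMemcpyHostToDevice);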
Any time you use cudaMemcpy, you'll need to provide it with a copy direction so that CUDA can translate the software copy instruction into the appropriate hardware instruction sequence, as these differ depending on the direction of the copy. These directions are:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
The first two are fairly self-explanatory and refer to CPU-GPU memory transfers, with the first device type in each name being the source (e.g., host for HostToDevice) and the second being the destination (e.g., device for HostToDevice). cudaMemcpyDeviceToDevice refers to copying memory between two separate locations on the same physical GPU.
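A device-to-device copy looks just like the other directions, only with two device pointers. A minimal sketch, where d_src and d_dst are hypothetical names for two separate device allocations:
int *d_src, *d_dst;
cudaMalloc(&d_src, 8*sizeof(int));
cudaMalloc(&d_dst, 8*sizeof(int));
// ... fill d_src on the device ...
// Copy 8 integers from one device allocation to another on the same GPU
cudaMemcpy(d_dst, d_src, 8*sizeof(int), cudaMemcpyDeviceToDevice);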
To show how memory can be copied from the CPU to the GPU (and vice versa) we are going to create a small program that
- Creates an array of numbers from the host side
- Copies them to the GPU
- Uses the GPU to add 1 to each element in parallel
- Copies the updated array back to the host and prints it out to verify the addition and copies were successful.
Below is our addOne kernel. As you can see, it is very similar to the kernel from the Basic CUDA Syntax example, and is equally simple. In it, we check that each thread's ID is within the bounds of the array, and then simply increment the element of the array that thread corresponds to.
#include <cstdio>

__global__ void addOne(int *array, int size) {
    if (threadIdx.x < size)
        array[threadIdx.x] += 1;
}
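Because this example launches only a single block, threadIdx.x alone is enough to index the array. For arrays larger than one block, the usual pattern combines the block and thread indices, and the same bounds check guards the threads that land past the end of the array. The multi-block variant below is a sketch and not part of the example program:
__global__ void addOneMultiBlock(int *array, int size) {
    // Global index: each block covers blockDim.x consecutive elements
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
        array[idx] += 1;
}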
In our main function, we need to create and allocate our host and device arrays. Because we are not focused on optimizing performance, we will allocate paged host memory rather than page-locked memory, but the code for using pinned memory is included as a comment for your reference.
int main(int argc, char *argv[]) {
    // Allocate 32 integer array of paged memory
    int numElements = 32;
    int *h_array = (int *) malloc(sizeof(int)*numElements);
    // Alternative: page-locked memory
    // int *h_array;
    // cudaMallocHost(&h_array, sizeof(int)*numElements);
    // Allocate 32 integer array of device memory
    int *d_array;
    cudaMalloc(&d_array, sizeof(int)*numElements);
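One note on the page-locked alternative: memory allocated with cudaMallocHost must later be released with cudaFreeHost rather than free. A minimal sketch of the matching pair, using a hypothetical h_pinned pointer:
int *h_pinned;
cudaMallocHost(&h_pinned, sizeof(int)*numElements);
// ... use h_pinned like any other host array ...
cudaFreeHost(h_pinned);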
Next we use a simple loop to initialize the elements of the host array and print the initial contents of the array before copying the array from the host to the device:
    // Initialize the array with elements 0, 1, ..., n-1
    for (int i = 0; i < numElements; i++)
        h_array[i] = i;
    printf("Initial array contents: ");
    for (int i = 0; i < numElements; i++)
        printf("%d ",h_array[i]);
    printf("\n");
    cudaMemcpy(d_array, h_array, sizeof(int)*numElements, cudaMemcpyHostToDevice);
Finally, we run the kernel on the newly received data, copy the results back to the host, and print our updated array values. After that, we free our host and device memory, something that should become a habit for any CUDA programmer as you don't want to create memory leaks.
    addOne<<<1,numElements>>>(d_array, numElements);
    cudaMemcpy(h_array, d_array, sizeof(int)*numElements, cudaMemcpyDeviceToHost);
    printf("Final array contents: ");
    for (int i = 0; i < numElements; i++)
        printf("%d ",h_array[i]);
    printf("\n");
    free(h_array);
    cudaFree(d_array);
    return 0;
}
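One last note, not part of the original program: kernel launches do not return a cudaError_t directly, so if you want to confirm that the addOne launch itself succeeded, you can query cudaGetLastError immediately after the launch. A minimal sketch:
addOne<<<1,numElements>>>(d_array, numElements);
// cudaGetLastError reports any error raised by the preceding launch (e.g., an invalid configuration)
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(launchErr));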
You can download the complete version of this program here. If you run this program, you should receive the following output:
Initial array contents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Final array contents: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32