Using CUDA Memory
In this article, we are going to learn how to interact with memory on both the host and the device through a set of short examples that each highlight an aspect of working with the different types of memory found in CUDA.
When using separate host and device memories (either pinned or paged host memory), any data that needs to go from one side to the other must be transferred via the cudaMemcpy call.
__host__ cudaError_t cudaMemcpy (void* dst, const void* src, size_t count, cudaMemcpyKind kind)
As you can see in the function signature above, the copy function takes four parameters. In order, they are a pointer to the data's destination, a pointer to the data source, the number of bytes to copy from source to destination, and the kind of memory copy being performed.
An example call to the CUDA copy function might look something like this:
// Copy 8 integers from h_array to d_array
cudaMemcpy(d_array, h_array, 8*sizeof(int), cudaMemcpyHostToDevice);
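As the signature above shows, cudaMemcpy also returns a cudaError_t, so you can check whether a copy succeeded. Here is a minimal sketch, reusing h_array and d_array from the example above and assuming <cstdio> has been included:
cudaError_t err = cudaMemcpy(d_array, h_array, 8*sizeof(int), cudaMemcpyHostToDevice);
if (err != cudaSuccess) {
    // cudaGetErrorString converts the error code into a human-readable message
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
}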
Let's break down each parameter in a bit more detail.
The first two parameters are both pointers to memory allocated either on the CPU or the GPU, depending on the kind of copy being performed. These can be pointers to any data type, and they do not have to point to the base address of the allocation.
For example, if we want to copy 4 elements into d_array beginning at index 4, we can modify our call to cudaMemcpy to accomplish this:
// Copy first 4 integers from h_array to d_array starting at d_array[4]
cudaMemcpy(d_array+4, h_array, 4*sizeof(int), cudaMemcpyHostToDevice);
The third parameter works much the same as the argument to malloc does; it tells the copy instruction how many bytes should be copied from the source to the destination. The sizeof operator in our example performs the conversion from a number of integers to a number of bytes for us, so all we have to do is multiply that value by the number of elements we want to transfer. This same method works for both primitive and complex data types.
Because this parameter controls the amount of memory transferred, choose its value carefully so the copy does not run past the bounds of either the source or destination allocation.
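As an illustration of the complex-type case, the same sizeof-based calculation works for an array of structs. This is a minimal sketch; the Point3 type and the h_points/d_points names are hypothetical and introduced only for this example:
struct Point3 { float x, y, z; };

// Copy an array of 16 structs; sizeof(Point3) is the size of one element in bytes
Point3 h_points[16];
Point3 *d_points;
cudaMalloc(&d_points, 16*sizeof(Point3));
cudaMemcpy(d_points, h_points, 16*sizeof(Point3), cudaMemcpyHostToDevice);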
Any time you use cudaMemcpy, you'll need to provide it with a copy direction so that CUDA can translate the software copy instruction into the appropriate hardware instruction sequence, as these differ depending on the direction of the copy. These directions are:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
The first two are fairly self-explanatory and refer to CPU-GPU memory transfers, with the first device type in each name being the source (e.g., host for HostToDevice) and the second being the destination (e.g., device for HostToDevice). cudaMemcpyDeviceToDevice refers to copying memory between two separate locations on the same physical GPU.
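A device-to-device copy looks just like the other directions, only with two device pointers. A minimal sketch, where d_src and d_dst are hypothetical names for two separate device allocations:
int *d_src, *d_dst;
cudaMalloc(&d_src, 8*sizeof(int));
cudaMalloc(&d_dst, 8*sizeof(int));
// ... fill d_src on the device ...
// Copy 8 integers from one device allocation to another on the same GPU
cudaMemcpy(d_dst, d_src, 8*sizeof(int), cudaMemcpyDeviceToDevice);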
To show how memory can be copied from the CPU to the GPU (and vice versa) we are going to create a small program that
- Creates an array of numbers from the host side
- Copies them to the GPU
- Uses the GPU to add 1 to each element in parallel
- Copies the updated array back to the host and prints it out to verify the addition and copies were successful.
Below is our addOne kernel. As you can see, it is very similar to the kernel from the Basic CUDA Syntax example, and is equally simple. In it, we check that each thread's ID is within the bounds of the array, and then simply increment the element of the array that thread corresponds to.
#include <cstdio>

__global__ void addOne(int *array, int size) {
    if (threadIdx.x < size)
        array[threadIdx.x] += 1;
}
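Because this example launches only a single block, threadIdx.x alone is enough to index the array. For arrays larger than one block, the usual pattern combines the block and thread indices, and the same bounds check guards the threads that land past the end of the array. The multi-block variant below is a sketch and not part of the example program:
__global__ void addOneMultiBlock(int *array, int size) {
    // Global index: each block covers blockDim.x consecutive elements
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
        array[idx] += 1;
}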
In our main function, we need to create and allocate our host and device arrays. Because we are not focused on optimizing performance, we will allocate paged host memory rather than page-locked memory, but the code for using pinned memory is included as a comment for your reference.
int main(int argc, char *argv[]) {
    // Allocate 32 integer array of paged memory
    int numElements = 32;
    int *h_array = (int *) malloc(sizeof(int)*numElements);
    // Alternative: page-locked memory
    // int *h_array;
    // cudaMallocHost(&h_array, sizeof(int)*numElements);
    // Allocate 32 integer array of device memory
    int *d_array;
    cudaMalloc(&d_array, sizeof(int)*numElements);
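One note on the page-locked alternative: memory allocated with cudaMallocHost must later be released with cudaFreeHost rather than free. A minimal sketch of the matching pair, using a hypothetical h_pinned pointer:
int *h_pinned;
cudaMallocHost(&h_pinned, sizeof(int)*numElements);
// ... use h_pinned like any other host array ...
cudaFreeHost(h_pinned);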
Next we use a simple loop to initialize the elements of the host array and print the initial contents of the array before copying the array from the host to the device:
    // Initialize the array with elements 0, 1, ..., n-1
    for (int i = 0; i < numElements; i++)
        h_array[i] = i;
    printf("Initial array contents: ");
    for (int i = 0; i < numElements; i++)
        printf("%d ",h_array[i]);
    printf("\n");
    cudaMemcpy(d_array, h_array, sizeof(int)*numElements, cudaMemcpyHostToDevice);
Finally, we run the kernel on the newly received data, copy the results back to the host, and print our updated array values. After that, we free our host and device memory, something that should become a habit for any CUDA programmer as you don't want to create memory leaks.
    addOne<<<1,numElements>>>(d_array, numElements);
    cudaMemcpy(h_array, d_array, sizeof(int)*numElements, cudaMemcpyDeviceToHost);
    printf("Final array contents: ");
    for (int i = 0; i < numElements; i++)
        printf("%d ",h_array[i]);
    printf("\n");
    free(h_array);
    cudaFree(d_array);
    return 0;
}
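One last note, not part of the original program: kernel launches do not return a cudaError_t directly, so if you want to confirm that the addOne launch itself succeeded, you can query cudaGetLastError immediately after the launch. A minimal sketch:
addOne<<<1,numElements>>>(d_array, numElements);
// cudaGetLastError reports any error raised by the preceding launch (e.g., an invalid configuration)
cudaError_t launchErr = cudaGetLastError();
if (launchErr != cudaSuccess)
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(launchErr));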
You can download the complete version of this program here. If you run this program, you should receive the following output:
Initial array contents: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Final array contents: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32