Graph node glossary
Previous: Using host functions in graphs and streams
Child graph nodes let you embed a smaller graph as a node in an existing graph, so that your graph functions can be built as compositions of smaller graphs. To add a child graph node to a larger graph, the child graph object must already be complete; you can then add it with cudaGraphAddChildGraphNode:
cudaGraph_t childGraph;
// Build child graph...
cudaGraph_t parentGraph;
cudaGraphNode_t childGraphNode;
// Add as root to larger graph
cudaGraphAddChildGraphNode(&childGraphNode, parentGraph, NULL, 0, childGraph);
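Putting the pieces together, a fuller sketch might look like the following. This is illustrative only: childKernelParams, stream, and the child graph's contents are assumed to be set up elsewhere, and error checking is omitted for brevity. Note that the child graph is cloned when the node is added, so later changes to childGraph do not affect the node already embedded in parentGraph.

```cuda
cudaGraph_t childGraph, parentGraph;
cudaGraphCreate(&childGraph, 0);
cudaGraphCreate(&parentGraph, 0);

// Populate the child graph, e.g. with a single kernel node
// (childKernelParams is an assumed, already-filled cudaKernelNodeParams).
cudaGraphNode_t childKernelNode;
cudaGraphAddKernelNode(&childKernelNode, childGraph, NULL, 0, &childKernelParams);

// Embed the finished child graph as one node of the parent graph.
cudaGraphNode_t childGraphNode;
cudaGraphAddChildGraphNode(&childGraphNode, parentGraph, NULL, 0, childGraph);

// Instantiate and launch the parent as usual.
cudaGraphExec_t parentExec;
cudaGraphInstantiate(&parentExec, parentGraph, NULL, NULL, 0);
cudaGraphLaunch(parentExec, stream);
```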
Host nodes are used to call CPU functions from within a graph. They require a node-parameters struct containing the host function and a pointer to its argument data, but otherwise they are created and added to the graph like any other node.
struct myStruct {
float myFloat;
int myInt;
};
void CUDART_CB printValues (void *userData) {
myStruct values = *((myStruct *)userData);
printf("%d\n",values.myInt);
printf("%f\n",values.myFloat);
}
// Graph creation...
cudaGraphNode_t hostNode;
cudaHostNodeParams hostParams = {0};
hostParams.fn = printValues;
myStruct values = {0};
values.myInt = 9;
values.myFloat = 4.54;
hostParams.userData = (void *)&values;
cudaGraphAddHostNode(&hostNode, graph, NULL, 0, &hostParams);
For a more detailed explanation of how to use host nodes in graphs, check out the previous tutorial on using host nodes in streams and graphs.
Kernel nodes perform the majority of the work done in graphs. They require a cudaKernelNodeParams struct specifying the kernel function, grid size, block size, and the kernel's arguments.
cudaGraphNode_t productNode;
cudaKernelNodeParams productParams = {0};
productParams.blockDim = BLOCK_SIZE;
productParams.gridDim = gridSize;
productParams.func = (void *)elementwiseProduct;
void *productfunc_params[4] = {(void *)&a, (void *)&b, (void *)&out, (void *) &size};
productParams.kernelParams = (void **)productfunc_params;
cudaGraphAddKernelNode(&productNode, graph, NULL, 0, &productParams);
For a more detailed guide on creating and using kernel nodes, refer to the tutorial on creating graphs explicitly.
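For completeness, here is the same fragment with the remaining cudaKernelNodeParams fields filled in explicitly. The elementwiseProduct kernel shown is an assumed example, and BLOCK_SIZE, a, b, out, and size are presumed defined elsewhere:

```cuda
__global__ void elementwiseProduct(float *a, float *b, float *out, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) out[i] = a[i] * b[i];
}

// ...
cudaGraphNode_t productNode;
cudaKernelNodeParams productParams = {0};
productParams.func = (void *)elementwiseProduct;
productParams.gridDim = dim3((size + BLOCK_SIZE - 1) / BLOCK_SIZE);  // round up
productParams.blockDim = dim3(BLOCK_SIZE);
productParams.sharedMemBytes = 0;   // no dynamic shared memory
void *productfunc_params[4] = {(void *)&a, (void *)&b, (void *)&out, (void *)&size};
productParams.kernelParams = productfunc_params;
productParams.extra = NULL;         // use kernelParams, not the extra mechanism
cudaGraphAddKernelNode(&productNode, graph, NULL, 0, &productParams);
```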
Memory allocation nodes allow us to allocate memory from within the graph, so that it can run without reference to memory objects declared outside the graph's scope. This makes graphs much more portable, since they can run without concern for whether particular allocations exist in the context in which they are launched.
cudaGraphNode_t mallocNode;
// Memory allocation parameters
cudaMemAllocNodeParams allocParams;
memset(&allocParams, 0, sizeof(allocParams));
allocParams.bytesize = sizeof(float)*size;
allocParams.poolProps.allocType = cudaMemAllocationTypePinned;
allocParams.poolProps.location.id = 0;
allocParams.poolProps.location.type = cudaMemLocationTypeDevice;
cudaGraphAddMemAllocNode(&mallocNode, graph, NULL, 0, &allocParams);
Naturally, where there are memory allocation nodes for your graphs, there must also be memory free nodes. As you would expect, these nodes deallocate the memory associated with the pointer they are given.
cudaGraphNode_t memFreeNode;
cudaGraphAddMemFreeNode(&memFreeNode, graph, NULL, 0, ptr);
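The two node types are typically used as a pair. After cudaGraphAddMemAllocNode returns, the device address is available in allocParams.dptr, and that same address is what a later free node releases. A sketch under those assumptions (the intermediate nodes that actually use the buffer are elided):

```cuda
cudaGraphNode_t mallocNode, memFreeNode;
cudaMemAllocNodeParams allocParams;
memset(&allocParams, 0, sizeof(allocParams));
allocParams.bytesize = sizeof(float) * size;
allocParams.poolProps.allocType = cudaMemAllocationTypePinned;
allocParams.poolProps.location.type = cudaMemLocationTypeDevice;
allocParams.poolProps.location.id = 0;   // device ordinal
cudaGraphAddMemAllocNode(&mallocNode, graph, NULL, 0, &allocParams);

// The virtual address is fixed at node-creation time:
float *ptr = (float *)allocParams.dptr;

// ...add the nodes that use ptr here, depending on mallocNode...

// Free the same address. In a real graph the free node must depend
// (directly or transitively) on every node that accesses ptr.
cudaGraphAddMemFreeNode(&memFreeNode, graph, &mallocNode, 1, (void *)ptr);
```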
Memset (memory set) nodes allow allocated memory to be reset, regardless of whether it was allocated through a graph node or prior to graph launch. This is especially handy when using graph memory allocation nodes (above): a memset node is much more elegant than attempting to use outside streams and synchronization to perform the memset.
cudaGraphNode_t memsetNode;
cudaMemsetParams memsetParams = {0};
memsetParams.dst = dest;
memsetParams.elementSize = sizeof(float);
memsetParams.value = 0;
memsetParams.pitch = 0;
memsetParams.width = size;
memsetParams.height = 1;
cudaGraphAddMemsetNode(&memsetNode, graph, NULL, 0, &memsetParams);
For most memory transfers inside graphs, this type of memcpy node is the way to go. It is easy to use and does not require creating an extra parameter struct to add to the graph. Beyond the usual node, graph, and dependency arguments, the only parameters you need to supply are the destination and source pointers, the number of bytes to transfer, and the kind of transfer, just like with cudaMemcpy. The only small difference is that the destination and source pointers need to be cast to void pointers before being passed to the add-node function.
cudaGraphNode_t memcpy_1DNode;
cudaGraphAddMemcpyNode1D(&memcpy_1DNode, graph, NULL, 0, (void *) dest, (void *) source, sizeof(int)*size, cudaMemcpyDeviceToHost);
If you are working with two- or three-dimensional data, this is the type of memcpy node you want; otherwise you'll likely want to leave it alone. The example below shows how to set up a general memcpy for one-dimensional data, and as you can see, it quickly gets messy. That said, this node type can be more useful than the 1D memcpy for applications that simulate 2D and 3D data, so it is still worth understanding its parameters, even though a full treatment is outside the scope of this series.
cudaGraphNode_t HtoDNode;
cudaMemcpy3DParms HtoDcopyParams = {0};
HtoDcopyParams.srcPtr = make_cudaPitchedPtr(source, sizeof(int) * size, size, 1);
HtoDcopyParams.dstPtr = make_cudaPitchedPtr(dest, sizeof(int) * size, size, 1);
HtoDcopyParams.extent = make_cudaExtent(sizeof(int) * size, 1, 1);
HtoDcopyParams.kind = cudaMemcpyHostToDevice;
cudaGraphAddMemcpyNode(&HtoDNode, graph, NULL, 0, &HtoDcopyParams);
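For reference, here is what a genuinely two-dimensional copy might look like. This is a hypothetical sketch: hostBuf, devBuf, devPitch (as returned by cudaMallocPitch), width, and height are all assumed to be defined elsewhere. Note that the extent's width is given in bytes when the endpoints are linear or pitched memory rather than CUDA arrays.

```cuda
cudaGraphNode_t copy2DNode;
cudaMemcpy3DParms copyParams = {0};
// Host side: tightly packed rows, so pitch == width in bytes.
copyParams.srcPtr = make_cudaPitchedPtr(hostBuf, width * sizeof(int), width, height);
// Device side: pitched allocation, pitch comes from cudaMallocPitch.
copyParams.dstPtr = make_cudaPitchedPtr(devBuf, devPitch, width, height);
copyParams.extent = make_cudaExtent(width * sizeof(int), height, 1);
copyParams.kind = cudaMemcpyHostToDevice;
cudaGraphAddMemcpyNode(&copy2DNode, graph, NULL, 0, &copyParams);
```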
As the name suggests, empty nodes perform no work on their own, but they can be extremely useful for organizing the flow of execution by serving as a barrier. For example, suppose one stage of our graph has 8 nodes, all of which must complete before a second stage of 8 nodes can begin. We could either have each node in the second stage depend on all 8 nodes of the first stage (64 dependencies), or we could place an empty node between the stages as a barrier: the empty node has 8 dependencies on the first stage, and each of the 8 second-stage nodes has a single dependency on the empty node (16 dependencies total).
Here's what the example above might look like, assuming some of the necessary elements have already been defined:
// Add stage 1 nodes
cudaGraphNode_t stageOneNodes[8];
for (int i=0; i<8; i++) {
cudaGraphAddKernelNode(&stageOneNodes[i], graph, NULL, 0, &stageOneParams[i]);
nodeDependencies.push_back(stageOneNodes[i]);
}
// Add empty node with stage 1 dependencies
cudaGraphNode_t stageBarrier;
cudaGraphAddEmptyNode(&stageBarrier, graph, nodeDependencies.data(), 8);
// Add stage 2 nodes with dependency on stageBarrier
cudaGraphNode_t stageTwoNodes[8];
for (int i=0; i<8; i++) {
cudaGraphAddKernelNode(&stageTwoNodes[i], graph, &stageBarrier, 1, &stageTwoParams[i]);
}
Event record and event wait nodes can be used just as if you were working with streams instead of graphs. Record nodes record events that can be waited on by streams or by other graph nodes, and wait nodes wait on events that can be triggered by streams or by other parts of the graph. Among other applications, this can be used to time portions of the graph's execution or to send and receive signals to and from external streams. Typically you should not use events to establish dependencies between nodes in the same graph, since that is better handled by the built-in dependencies created while building the graph, but there are still plenty of applications for events within graphs.
Event record nodes only require the node, graph, and event to record as parameters, in addition to dependency information.
cudaEvent_t event;
cudaEventCreate(&event);
cudaGraphNode_t eventRecord;
cudaGraphAddEventRecordNode(&eventRecord, graph, NULL, 0, event);
Event wait nodes require the exact same information, though presumably you will not be recording and waiting on the same event.
cudaEvent_t event;
cudaEventCreate(&event);
cudaGraphNode_t eventWait;
cudaGraphAddEventWaitNode(&eventWait, graph, NULL, 0, event);
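As one concrete application, two record nodes can bracket a region of the graph and be used to time it after launch. A sketch under some assumptions: regionLastNode is whichever node finishes the region being timed, and graphExec and stream are an already-instantiated executable graph and a stream to launch it in.

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaGraphNode_t startNode, stopNode;
// Record 'start' before the region begins and 'stop' after it ends.
cudaGraphAddEventRecordNode(&startNode, graph, NULL, 0, start);
cudaGraphAddEventRecordNode(&stopNode, graph, &regionLastNode, 1, stop);

// After instantiating and launching the graph:
cudaGraphLaunch(graphExec, stream);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
```

Note that cudaEventElapsedTime requires events created without the cudaEventDisableTiming flag, which is the default for cudaEventCreate.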
Next: Performance Experiment: Graphs vs Streams vs Synchronous Kernels