
Performance Experiment: Graph Length


Previous: Performance Experiment: Graphs vs Streams vs Synchronous Kernels

In this experiment, we are going to test the hypothesis that an iterative process completes faster as more of its iterations are built into a single graph. For example, if we are going to perform a task 64 times, we would expect a graph that has 16 of those iterations built into it (and is launched 4 times) to complete the task faster than one which only has a single iteration built in and is launched 64 times.

To do this, we are going to reuse the graph version of the quadratic equation solver from the explicit graph building tutorial, embedding it via the child graph node feature to produce graphs with 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 built-in iterations. Each graph will then be launched the appropriate number of times so that the quadratic solver runs a total of 131072 times in every configuration (the 16-iteration graph, for example, is launched 8192 times), and each configuration will be timed so we can observe any trends that appear.

Experiment Setup

To use a graph as a child node of another graph, we must first create the base graph that the child node should contain.

In our case, this is the same graph that we built in the tutorial on explicit graphs, so we won't go into detail. All of the graph creation code has been bundled up into a nice neat buildQuadraticGraph function so that we don't have to bog down our program driver with all of that setup.

cudaGraph_t buildQuadraticGraph(float *a, float *b, float *c, float *sol1, float *sol2, float *tmp, int numElements, int blockSize) {
    int gridSize = numElements / blockSize + 1;

    cudaGraph_t quadraticGraph;
    cudaGraphCreate(&quadraticGraph, 0);

    // Tracks the dependencies of each node as it is added to the graph
    std::vector<cudaGraphNode_t> nodeDependencies;

    // Define node parameters
    cudaGraphNode_t bSquaredNode;
    cudaKernelNodeParams bSquaredParams = {0};

    // ... (the remaining node declarations and additions are elided;
    // see the explicit graph building tutorial for the full listing)

    nodeDependencies.clear();
    nodeDependencies.push_back(bMinusNode);
    // Add last node
    cudaGraphAddKernelNode(&bMinusQuotientNode, quadraticGraph, nodeDependencies.data(), nodeDependencies.size(), &bMinusQuotientParams);

    // Graph is ready to go
    return quadraticGraph;
}
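
Each elided step follows the same pattern: fill out a cudaKernelNodeParams struct with the kernel, its launch dimensions, and its arguments, then add the node to the graph. As a rough sketch, the first node might be set up like this (the square kernel name and its argument list are assumptions for illustration, not the tutorial's exact code):

    // Hypothetical kernel computing tmp[i] = b[i] * b[i]
    void *bSquaredArgs[3] = { &b, &tmp, &numElements };

    bSquaredParams.func = (void *)square;   // assumed kernel name
    bSquaredParams.gridDim = dim3(gridSize);
    bSquaredParams.blockDim = dim3(blockSize);
    bSquaredParams.sharedMemBytes = 0;
    bSquaredParams.kernelParams = bSquaredArgs;
    bSquaredParams.extra = NULL;

    // No dependencies: this node can run as soon as the graph launches
    cudaGraphAddKernelNode(&bSquaredNode, quadraticGraph, NULL, 0, &bSquaredParams);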

Now that we have our child graph function, all we need to do is create child nodes using that graph as a template and chain the appropriate number of them together into graphs of varying lengths. First, we have to actually use our new function to get the graph.

int main(int argc, char *argv[]) {
    const int NUM_ELEMENTS = 64;
    const int BLOCK_SIZE = 32;
    
    float *a, *b, *c, *sol1, *sol2, *tmp;
    cudaMalloc(&a, sizeof(float)*NUM_ELEMENTS);
    cudaMalloc(&b, sizeof(float)*NUM_ELEMENTS);
    cudaMalloc(&c, sizeof(float)*NUM_ELEMENTS);
    cudaMalloc(&sol1, sizeof(float)*NUM_ELEMENTS);
    cudaMalloc(&sol2, sizeof(float)*NUM_ELEMENTS);
    cudaMalloc(&tmp, sizeof(float)*NUM_ELEMENTS);

    fillArray<<<NUM_ELEMENTS/BLOCK_SIZE + 1, BLOCK_SIZE>>>(a, NUM_ELEMENTS, 1);
    fillArray<<<NUM_ELEMENTS/BLOCK_SIZE + 1, BLOCK_SIZE>>>(b, NUM_ELEMENTS, 0);
    fillArray<<<NUM_ELEMENTS/BLOCK_SIZE + 1, BLOCK_SIZE>>>(c, NUM_ELEMENTS, -4);

    cudaGraph_t innerGraph = buildQuadraticGraph(a,b,c,sol1,sol2,tmp, NUM_ELEMENTS,BLOCK_SIZE);
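
One note: the fillArray kernel used above comes from the earlier tutorials and is not shown here. A minimal sketch, assuming it simply sets every element of an array to a constant value:

__global__ void fillArray(float *array, int numElements, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard against the extra threads in the final block
    if (i < numElements)
        array[i] = value;
}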

Then the process is essentially to iterate through all of the desired graph sizes, build the graph, and time its execution.

    cudaEvent_t clockStart, clockStop;
    cudaEventCreate(&clockStart);
    cudaEventCreate(&clockStop);

    cudaGraph_t builtinGraph;
    cudaGraphExec_t graphExecutable;

    // Tracks each child node's dependency on the previous one
    std::vector<cudaGraphNode_t> nodeDependencies;

    int numTrials = 20;
    // Loop from 1 to 1024, doubling each loop
    for (int numIterations=1; numIterations<=1024; numIterations<<=1) {
        cudaGraphCreate(&builtinGraph, 0);

        nodeDependencies.clear();

        // Build the repeating graph, making sure each child graph is a parent of the next
        std::vector<cudaGraphNode_t> childNodes(numIterations);
        for (int i=0; i<numIterations; i++) {
            cudaGraphAddChildGraphNode(&childNodes[i], builtinGraph, nodeDependencies.data(), nodeDependencies.size(), innerGraph);
            nodeDependencies.clear();
            nodeDependencies.push_back(childNodes[i]);
        }

        cudaGraphInstantiate(&graphExecutable, builtinGraph, NULL, NULL, 0);

        // Record execution times
        printf("%d,", numIterations);
        for (int n=0; n<numTrials; n++) {
            cudaEventRecord(clockStart);

            // Launch the graph enough times that the solver runs 131072 times in total
            for (int i=0; i<(131072/numIterations); i++)
                cudaGraphLaunch(graphExecutable, 0);

            cudaEventRecord(clockStop);
            cudaEventSynchronize(clockStop);

            float timeElapsed;
            cudaEventElapsedTime(&timeElapsed, clockStart, clockStop);

            printf("%.4f", timeElapsed);
            if (numTrials-n==1)
                printf("\n");
            else
                printf(",");
        }

        cudaGraphExecDestroy(graphExecutable);
        cudaGraphDestroy(builtinGraph);
    }

    // Clean up
    cudaEventDestroy(clockStart);
    cudaEventDestroy(clockStop);
    cudaGraphDestroy(innerGraph);
    cudaFree(a); cudaFree(b); cudaFree(c);
    cudaFree(sol1); cudaFree(sol2); cudaFree(tmp);
    return 0;
}

Hopefully all of this code is easy to understand at this point; aside from the child graph node, nothing being introduced here is new. A child graph node is simply a node that embeds an existing graph inside another graph, which makes it easy to integrate existing graphs into a new workflow, as we are doing here.
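
In isolation, embedding one graph inside another takes a single call. A minimal sketch, assuming parentGraph and innerGraph have already been created:

cudaGraphNode_t childNode;
// The child graph is cloned into the node at the time of this call
cudaGraphAddChildGraphNode(&childNode, parentGraph,
                           NULL, 0,     // no dependencies: runs first
                           innerGraph); // the graph to embed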

Now let's run this, save it to a file, and get started on the analysis!
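
For example, assuming the program is saved as graph-length.cu (a hypothetical name):

nvcc graph-length.cu -o graph-length
./graph-length > results.csv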

Analysis

As per usual, we will start with the results file generated by the program above and use some quick Python to create a visualization. While testing the experiment program, I noticed that the execution times were very close together, so a box plot doesn't seem to be the right choice. Instead, we are going to create a bar chart showing the median time for each number of built-in quadratic solver iterations.
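
Here is a minimal sketch of that script, assuming the results were saved as results.csv with one row per graph size (the iteration count followed by the 20 trial times):

import csv
import statistics
import matplotlib.pyplot as plt

sizes, medians = [], []
with open("results.csv") as f:      # assumed output file name
    for row in csv.reader(f):
        sizes.append(row[0])        # number of built-in iterations
        times = [float(t) for t in row[1:]]
        medians.append(statistics.median(times))

plt.bar(sizes, medians)
plt.xlabel("Built-in iterations per graph")
plt.ylabel("Median completion time (ms)")
plt.title("Completion time vs. graph length")
plt.show()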

Clearly, building more iterations of the process into a single graph improves efficiency, at least up to a point. This makes sense: the point of a graph is to reduce kernel launch overhead and to enable optimizations that are only possible because the whole workload is known before it is submitted to the GPU. That appears to be exactly what happened here, but with diminishing returns, as is common with efficiency techniques. So in a more realistic environment (at least one more realistic than solving the quadratic equation hundreds of thousands of times), you must measure carefully and decide at which point it is no longer worth making the graph longer.

On that note, let us turn to the elephant in the room: completion times get much worse once the number of built-in iterations hits 512. Why is this? Most likely, somewhere between 256 and 512 iterations the workload submitted to the GPU in a single graph simply becomes too large to handle at once, causing a bottleneck that leads to the performance decrease; this is only a theory based on a basic understanding of the architecture and the results we received. As with every experiment in this series, you are encouraged to perform more experiments and find out for yourself why these things are the way they are.
It is likely because at some point between 256 and 512 iterations, the workload being submitted to the GPU is simply too much at once, causing a bottleneck that leads to the performance decrease, though this is only a theory based on basic understanding of the architecture and the results we received. As in every experiment, you are encouraged to perform more experiments and find out for yourself why these things are the way they are.