Final Project for CS259 Spring 2019
Matt and I evaluated and wrote a paper on the modern performance of executing parallel kernels on GPUs compared to the golden standard of batching jobs together. The report is in the report
directory, and an abstract of the report is as follows:
While Moore’s law may be slowing down, GPU’s continue to get considerably better with each generation. In 2010, Nvidia began to support executing multiple kernels at once (concurrently), initially allowing only 4-16 kernels to be executed concurrently but increasing it to 128 today. Kernels and machine learning problems that used to take up all the resources of a GPU several years ago now only take a fraction of compute power. In this report, we specifically focus on comparing the concurrent execution of kernels to their sequential counterparts to investigate whether the overhead of launching kernels in parallel defeats any performance gains from the concurrency we are exploiting. We also compare these kernels to their batched versions, the industry norm for maximizing performance. The results that we obtained suggest that concurrent kernel execution can increase performance anywhere from 1.25x to 15x, with smaller problems seeing larger performance gains. For smaller problems, batched kernels beat the performance of concurrent kernels greatly, but for large problems, the performances are approximately the same with concurrent kernels actually being about 5% to 10% faster. Our findings suggest that if there is a level of parallelism that can be added, such as in the classifier where not all dimensions of threads are used, concurrent kernels can provide performance on par or even better than batched kernels. Combined with the the paper of Jiao et al. where concurrent kernels provided almost 34.5% better energy efficiency, perhaps these may be the future of running many kernels [4]. Furthermore, concurrent kernel execution may be used over batching is if different types of problems were mixed together, as batching requires all the batches to be of the same size/type, but heterogeneous kernels can be concurrently executed.