Abstract:
We began the project by implementing a community detection algorithm (Permanence Maximization) [2] on Nvidia CUDA-capable GPUs. From our initial implementation, we observed that irregular computations such as graph algorithms perform poorly on the GPU due to its static CUDA runtime and its SIMT (Single Instruction, Multiple Threads) architecture, which lead to problems such as load imbalance and warp divergence. To address these issues, we redirected our efforts toward developing a dynamic work-stealing runtime on the GPU, which tackles load imbalance and improves programmer productivity by greatly simplifying the API required to leverage the massive parallelism offered by GPUs. We use a simple async-finish style API along with lambda functions to enable tasking on the GPU, with a dynamic work-stealing runtime distributing work optimally among the GPU's many threads. This report presents our implementation along with benchmark results and optimizations.