Understanding Latency Hiding on GPUs
Vasily Volkov
University of California, Berkeley · 2016
Modern commodity processors such as GPUs may execute up to about a thousand physical threads per chip to better utilize their numerous execution units and hide execution latencies. Understanding this novel capability, however, is hindered by the overall complexity of the hardware and the complexity of typical workloads. In this dissertation, we suggest a better way to understand modern multithreaded performance by considering a family of synthetic workloads, which use the same key hardware capabilities – memory access, arithmetic operations, and multithreading – but are otherwise as simple as possible.

One of our surprising findings is that prior performance models for GPUs fail on these workloads: they mispredict observed throughputs by factors of up to 1.7. We analyze these prior approaches, identify a number of common pitfalls, and discuss the related subtleties in understanding concurrency and Little's Law. We also further our understanding by considering a few basic questions, such as how different latencies compare with each other in terms of latency hiding, and how the number of threads needed to hide latency depends on basic parameters of the executed code, such as arithmetic intensity. Finally, we outline a performance modeling framework that is free of the limitations we found.

As a tangential development, we present a number of novel experimental studies, such as how mean memory latency depends on memory throughput, how latencies of individual memory accesses are distributed around the mean, and how occupancy varies during execution.
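The connection between concurrency, latency, and throughput mentioned above is Little's Law: mean concurrency equals mean latency times mean throughput. A minimal sketch of how this applies to latency hiding follows; all numbers in it are hypothetical placeholders, not measurements from this dissertation.

```python
def concurrency_needed(latency_cycles, throughput_per_cycle):
    """Little's Law: mean concurrency = mean latency x mean throughput."""
    return latency_cycles * throughput_per_cycle

def threads_needed(latency_cycles, throughput_per_cycle,
                   accesses_in_flight_per_thread=1):
    """Threads required if each thread sustains a fixed number of
    outstanding accesses at a time (hypothetical simplification)."""
    return (concurrency_needed(latency_cycles, throughput_per_cycle)
            / accesses_in_flight_per_thread)

# Example: hiding a hypothetical 400-cycle memory latency at a target
# throughput of 0.5 accesses per cycle requires, on average, 200
# accesses in flight.
print(concurrency_needed(400, 0.5))  # 200.0

# With one outstanding access per thread, that means 200 threads;
# two outstanding accesses per thread halves the count.
print(threads_needed(400, 0.5))      # 200.0
print(threads_needed(400, 0.5, 2))   # 100.0
```

Note how raising arithmetic intensity (more arithmetic operations per memory access) lowers the required memory throughput and hence, by the same law, the number of threads needed to hide memory latency.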