Kernel Float: Unlocking Mixed-Precision GPU Programming
Stijn Heldens, Ben van Werkhoven
ACM Transactions on Mathematical Software, 2025
Modern GPUs feature specialized hardware for low-precision floating-point arithmetic to accelerate compute-intensive workloads that do not require high numerical accuracy, such as those from artificial intelligence. However, despite the significant gains in computational throughput, memory bandwidth utilization, and energy efficiency, integrating low-precision formats into scientific applications remains difficult. We introduce Kernel Float, a header-only C++ library that simplifies the development of portable mixed-precision GPU kernels. Kernel Float provides a generic vector type, a unified interface for common mathematical operations, and fast approximations for low-precision transcendental functions that lack native hardware support. To demonstrate the potential of mixed-precision computing unlocked by our library, we integrated Kernel Float into nine GPU kernels from various domains. Our evaluation on Nvidia A100 and AMD MI250X GPUs shows performance improvements of up to \(12\times\) over double precision, while reducing source code length by up to 50% compared to handwritten kernels and incurring negligible runtime overhead. Our results further show that mixed-precision performance depends not only on choosing appropriate data types, but also on tuning traditional optimization parameters (e.g., block size and vector width) and, when relevant, even domain-specific parameters.
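To illustrate the kind of code the abstract describes, the following is a minimal sketch of a mixed-precision CUDA kernel written against a generic vector type in the style of Kernel Float. The identifiers used here (the kernel_float.h header, the kernel_float namespace, vec<T, N>, and cast<T>) follow the library's public examples but are assumptions for illustration, not an authoritative statement of its API; the kernel name and parameters are hypothetical.

// Sketch only: assumes a Kernel Float-style generic vector type and cast helper.
#include <cuda_fp16.h>
#include "kernel_float.h"
namespace kf = kernel_float;

// Element-wise scaled addition: inputs are stored in half precision to save
// memory bandwidth, while the arithmetic is carried out in single precision.
__global__ void scaled_add(const kf::vec<__half, 4>* x,
                           const kf::vec<__half, 4>* y,
                           float alpha,
                           kf::vec<float, 4>* out,
                           int n_vectors) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_vectors) {
        // Promote the half-precision inputs to float and combine them.
        // Because the vector type is generic over element type and width,
        // the same expression could be reused for other precision choices.
        kf::vec<float, 4> xf = kf::cast<float>(x[i]);
        kf::vec<float, 4> yf = kf::cast<float>(y[i]);
        out[i] = alpha * xf + yf;
    }
}

In this sketch, changing the storage precision (for example, from __half to float or to a lower-precision format) amounts to changing the template arguments rather than rewriting the kernel body, which is the portability argument the abstract makes.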