
Dynamic persistent tile scheduling with Cluster Launch Control on Blackwell
Persistent tile scheduling is crucial for hiding epilogue latency in various kernels. Put simply, a threadblock (CTA) is assigned available worktiles continuously until the kernel has completed. In its most naive form, a persistent tile scheduler with G launched CTAs will assign to CTA X the strided list of linear tile indices X, X + G, X + 2G, and so on, and "delinearize" each index to compute the proper logical grid coordinates. However, when workloads are imbalanced, CTAs may be assigned unequal amounts of work, a problem that compounds into a phenomenon of "CTA drift"; in this case, the persistent scheduler may perform significantly worse than an ordinary single-tile scheduler.
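To make the static scheme concrete, here is a minimal host-side sketch in plain Python (not CuTeDSL and not GPU code). It assumes the common strided layout in which CTA X of G launched CTAs takes linear tiles X, X + G, X + 2G, and so on, and a row-major tile grid for delinearization. The function names and the per-tile cost list are hypothetical, chosen only for illustration.

```python
def static_tiles(cta_id, num_ctas, num_tiles):
    """Linear tile indices pre-assigned to one CTA (stride = number of CTAs)."""
    return list(range(cta_id, num_tiles, num_ctas))


def delinearize(tile_idx, tiles_n):
    """Map a linear tile index to (tile_m, tile_n), assuming a row-major tile grid."""
    return divmod(tile_idx, tiles_n)


def static_makespan(tile_costs, num_ctas):
    """Time at which the slowest CTA finishes under the static assignment."""
    num_tiles = len(tile_costs)
    return max(
        sum(tile_costs[t] for t in static_tiles(cta, num_ctas, num_tiles))
        for cta in range(num_ctas)
    )


# Illustrative, made-up costs: tiles 0 and 4 are 8x more expensive than the rest.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_tiles(0, 4, len(costs)))  # [0, 4]: both expensive tiles land on CTA 0
print(delinearize(5, 4))               # (1, 1)
print(static_makespan(costs, 4))       # 16, while the other CTAs finish at time 2
```

In this toy case CTA 0 receives both expensive tiles, so the whole kernel waits on it while the other three CTAs sit idle.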
The solution to this problem is dynamic persistent scheduling: instead of assigning a pre-determined list of worktiles to a CTA, have each CTA claim the next available tile and start computing on it immediately. The common approach to dynamic persistence is to keep a semaphore (a tile counter) in global memory; once a CTA has finished its computation, it atomically increments the counter and uses the value it read as the index of its next tile. This adds enormous engineering complexity and leaves a lot of room for error.
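The claim-the-next-tile idea can be sketched with a small host-side simulation in plain Python (again, not GPU code). It assumes a single shared counter standing in for the global-memory semaphore, with the read-and-increment step modeling an atomic fetch-and-add; the function name and the tile costs are hypothetical.

```python
import heapq


def dynamic_makespan(tile_costs, num_ctas):
    """Simulate CTAs that each claim the next unclaimed tile as soon as they are free."""
    next_tile = 0  # stands in for the global-memory tile counter
    # Min-heap of (time_when_free, cta_id): the earliest-free CTA claims next.
    ready = [(0, cta) for cta in range(num_ctas)]
    heapq.heapify(ready)
    finish = 0
    while next_tile < len(tile_costs):
        now, cta = heapq.heappop(ready)
        tile = next_tile   # models: tile = atomic fetch-and-add(counter, 1)
        next_tile += 1
        done = now + tile_costs[tile]
        finish = max(finish, done)
        heapq.heappush(ready, (done, cta))
    return finish


# Illustrative, made-up costs: tiles 0 and 4 are 8x more expensive than the rest.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(dynamic_makespan(costs, 4))  # 9
```

With these costs, a strided static assignment over 4 CTAs would place tiles 0 and 4 on the same CTA and finish at time 16; claiming tiles on demand lets a different CTA pick up the second expensive tile and finishes at time 9. The simulation leaves out everything that makes the real version hard, such as broadcasting the claimed index to all threads of the CTA, resetting the counter between launches, and the latency of the atomic itself.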
Cluster Launch Control (CLC), introduced on Blackwell, approaches dynamic persistent scheduling with hardware-level instructions, alleviating much of the engineering challenge associated with the semaphore-based approach.
Please enjoy our recent blog on CLC, with examples in CuTeDSL! While the post focuses on datacenter Blackwell (Sm100), CLC is available on consumer cards as well (Sm120).