u/Logical-Try-4084

Persistent tile scheduling is crucial for hiding epilogue latency in various kernels. Put simply, a threadblock (CTA) is continuously assigned available worktiles until the kernel has completed. In its most naive form, a persistent tile scheduler assigns CTA X the successive tiles of linear index X + n*G (n = 0, 1, 2, ..., with G the number of CTAs launched) and "delinearizes" each index to compute the proper logical grid coordinates. However, with imbalanced workloads, CTAs may be assigned unequal work, a problem that compounds into a phenomenon called "CTA drift"; in this case, the persistent scheduler may perform significantly worse than an ordinary single-tile scheduler.
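
To make the naive static scheme concrete, here is a small host-side Python sketch (not GPU code). It assumes CTA X takes tiles X, X+G, X+2G, ... for G launched CTAs, and that tiles are linearized M-major. The helper names `static_tiles` and `delinearize` are made up for illustration and don't come from CUTLASS or CuTeDSL.

```python
def static_tiles(cta_id, num_ctas, num_tiles):
    """Linear tile indices owned by one CTA under static round-robin assignment."""
    return list(range(cta_id, num_tiles, num_ctas))


def delinearize(tile_idx, tiles_m):
    """Map a linear tile index to (m, n) tile coordinates (M-major order assumed)."""
    return tile_idx % tiles_m, tile_idx // tiles_m


# Hypothetical problem: a 4 x 3 grid of output tiles shared by 5 persistent CTAs.
tiles_m, tiles_n, num_ctas = 4, 3, 5
for cta in range(num_ctas):
    owned = static_tiles(cta, num_ctas, tiles_m * tiles_n)
    print(cta, [delinearize(t, tiles_m) for t in owned])
```

Note that the list each CTA walks is fixed before any work starts, so it can't react to one tile taking longer than another.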

The solution to this problem is a dynamic persistent scheduler: instead of assigning a pre-determined list of worktiles to a CTA, have each CTA claim the next available tile and start computing on it immediately. The common approach to dynamic persistence is to keep a semaphore (a tile counter) in global memory; once a CTA has finished its computation, it atomically increments the counter and takes the returned value as its next tile index. This adds enormous engineering complexity and leaves a lot of room for error.
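
A toy simulation shows why claiming tiles dynamically helps with imbalanced work. This is a hedged host-side Python sketch, not a kernel: the tile costs are invented, `static_makespan` and `dynamic_makespan` are hypothetical names, and the plain integer `next_tile` only stands in for the global-memory counter that a real kernel would bump with an atomic fetch-and-add.

```python
import heapq


def static_makespan(costs, num_ctas):
    """Finish time when CTA x runs tiles x, x+G, x+2G, ... fixed up front."""
    return max(sum(costs[x::num_ctas]) for x in range(num_ctas))


def dynamic_makespan(costs, num_ctas):
    """Finish time when each CTA claims the next unclaimed tile once it is free."""
    next_tile = 0  # stand-in for the global-memory tile counter
    free_at = [(0, x) for x in range(num_ctas)]  # (time CTA becomes free, CTA id)
    heapq.heapify(free_at)
    finish = 0
    while next_tile < len(costs):
        now, cta = heapq.heappop(free_at)  # earliest CTA to become idle
        tile = next_tile                   # value a fetch-and-add would return
        next_tile += 1
        done = now + costs[tile]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, cta))
    return finish


# Made-up imbalanced workload: two expensive tiles that both land on CTA 0
# under the static scheme.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_makespan(costs, 4), dynamic_makespan(costs, 4))  # prints "16 9"
```

The simulation ignores the cost of the atomic itself and of broadcasting the claimed index to the rest of the CTA, which is where much of the real engineering effort goes.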

Cluster Launch Control (CLC), introduced on Blackwell, approaches dynamic persistent scheduling with hardware-level instructions, alleviating much of the engineering challenge associated with the semaphore-based approach.

Please enjoy our recent blog on CLC, with examples in CuTeDSL! While the post focuses on datacenter Blackwell (Sm100), CLC is available on consumer cards as well (Sm120).

Dynamic persistent tile scheduling with Cluster Launch Control on Blackwell