u/Dryw_Filtiarn

Last night I worked on a custom node (Load CLIP FP8) for ComfyUI that lets me keep the text encoder (CLIP) model in memory as FP8 rather than having it upcast to FP16/BF16, which is the ComfyUI default.

What this means for me is that the Qwen3 8B FP8 model now fits comfortably in VRAM at about 7.5GB, rather than ballooning to 15-16GB as it did before with the automatic up-front upcast to FP16.
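A quick back-of-the-envelope check of those numbers, counting weights only. The round 8 billion parameter count is an assumption for illustration (the real count differs a bit, and activations and other buffers come on top):

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_footprint_gib(n_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / GIB


# Assumption: a round 8e9 parameters, not the exact model size.
N_PARAMS = 8_000_000_000

fp8_gib = weight_footprint_gib(N_PARAMS, 1)   # FP8: 1 byte per weight
fp16_gib = weight_footprint_gib(N_PARAMS, 2)  # FP16/BF16: 2 bytes per weight

print(f"FP8: {fp8_gib:.1f} GiB, FP16/BF16: {fp16_gib:.1f} GiB")
```

That lines up roughly with what I observe: about 7.5GB when the weights stay in FP8 and roughly double that once everything is upcast, which leaves no headroom on a 16GB card.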

The model now stays fully in VRAM rather than overflowing into system RAM on my 16GB RX 9070 XT, which in turn means that CLIP encoding for my prompts is now sub-second rather than the 20-30 seconds per encode it used to take.

From my limited testing so far, there is zero loss in quality. There shouldn't be, since the weights are still upcast in the end to match native ComfyUI behavior; it's simply done on demand rather than in bulk up front. That obviously adds a little runtime overhead, but nowhere near the cost of having the model overflow into system RAM.

On the side I’m also testing the PR for the native Sage Attention v2 implementation for ROCm.

My results so far with my workflow (a high/low pass setup with Flux 2 Klein Base 9B, so dual CLIP encode, dual sampler, dual VAE encode, etc.):

Stock clip load + Sage Attention v1: ~280 seconds execution time
Stock clip load + Sage Attention v2: ~180 seconds execution time
My FP8 clip load + Sage Attention v2: ~100 seconds

Conclusion: going from Sage Attention v1 to Sage Attention v2 already cut execution time by ~35% (280s to 180s), and with my FP8 CLIP load on top I’m now seeing a total reduction of ~65% (280s to 100s).
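The percentages are reductions in execution time relative to the 280 second baseline. A small sketch of the arithmetic, with the equivalent speedup factors for comparison (the helper names are mine, the timings are the approximate ones listed above):

```python
def time_reduction_pct(baseline_s: float, new_s: float) -> float:
    """Percentage of execution time saved relative to the baseline."""
    return (baseline_s - new_s) / baseline_s * 100


def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_s / new_s


BASELINE = 280  # stock CLIP load + Sage Attention v1, in seconds

for label, seconds in [("Sage v2", 180), ("FP8 CLIP + Sage v2", 100)]:
    print(
        f"{label}: {time_reduction_pct(BASELINE, seconds):.1f}% less time, "
        f"{speedup(BASELINE, seconds):.2f}x speedup"
    )
```

So the combined setup runs the same workflow in a bit over a third of the original time.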

I will do some more extensive testing with my Load CLIP FP8 node later today to make sure there’s no negative impact from its use, and fix a few oddities I have been observing. Once it all turns out to work reliably and without quality consequences (which, as said, it shouldn’t have), I will publish the custom node so it’s available for everyone.
Keep in mind that the node will require FP8 support in your torch, which may not be the case for all setups.
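If you want to check your own setup ahead of time, something like the following should give a rough answer. This is a sketch, not part of the node: the function name is hypothetical, and it assumes the `torch.float8_e4m3fn` dtype is the relevant one, which may not hold for every backend.

```python
def fp8_storage_supported(device: str = "cpu") -> bool:
    """Rough check: can this torch build round-trip a tensor through FP8?"""
    try:
        import torch
    except ImportError:
        return False

    # Older torch builds do not define the FP8 dtypes at all.
    fp8 = getattr(torch, "float8_e4m3fn", None)
    if fp8 is None:
        return False

    try:
        x = torch.ones(4, dtype=torch.float16, device=device)
        x.to(fp8).to(torch.float16)  # cast down to FP8 and back up
    except Exception:
        # Unsupported device or unsupported cast on this backend.
        return False
    return True


print("FP8 storage usable:", fp8_storage_supported())
```

Pass `device="cuda"` to test on the GPU instead of the CPU (ROCm builds of torch also use the `cuda` device name).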

Clip Load custom node that allows FP8 storage