u/shootthesound

SAM3 added to Comfyui-Angelo (sampler/inpainter/refiner)

I added SAM 3 to Angelo after a lot of DMs, so now you don't have to paint or box anything to pick what you edit.

Type what you want ("the face", "her left hand", "the red car") or grab it from the Quick Detect dropdown, hit Detect, and it highlights every match on the preview. Click one to edit it. The rest stay up, so you just keep clicking through them - edited ones go green so you can see what's done. Set an Area Prompt once and it applies to whatever you click next, so you can run the same edit across every match without re-detecting.
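For anyone curious how the click-through flow above could hang together, here is a minimal sketch. It is not Angelo's code: the `Match` class, `pick_match`, and `edit_clicked` are hypothetical names, and each detection is reduced to a bounding box even though SAM 3 produces masks.

```python
from dataclasses import dataclass

@dataclass
class Match:
    """One detected region, simplified to a box (x0, y0, x1, y1) in preview pixels."""
    label: str
    box: tuple
    edited: bool = False  # edited matches would be drawn green

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def pick_match(matches, x, y):
    """Return the smallest match whose box contains the click, or None."""
    hits = [m for m in matches
            if m.box[0] <= x <= m.box[2] and m.box[1] <= y <= m.box[3]]
    return min(hits, key=lambda m: area(m.box), default=None)

def edit_clicked(matches, x, y, area_prompt):
    """Mark the clicked match as edited and return (label, prompt) for the edit."""
    match = pick_match(matches, x, y)
    if match is None:
        return None
    match.edited = True
    return (match.label, area_prompt)
```

The list of matches is never rebuilt between clicks, which is the point: one detection pass, one Area Prompt, many edits.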

There's an opacity slider to fade the highlights when you want to check edges, and Esc/Space or a Cancel button to drop out.

SAM 3 is used if it's already installed; it is never auto-installed. A one-click installer is included in the node folder, so the core node stays dependency-free. If you don't have SAM 3 installed, the node will prompt you to run the installer script.

https://github.com/shootthesound/ComfyUI-Angelo

u/shootthesound — 6 hours ago
▲ 151 r/comfyui+1 crossposts

Angelo - A Unified Sampler / Inpainter / Refiner (fix hands etc) for ComfyUI

https://github.com/shootthesound/ComfyUI-Angelo

I'm a photographer who kept hitting the same wall in ComfyUI: generate an image, then to fix one thing I'd save it, open a Mask Editor or Photoshop, and fix. It works, but it's not smooth.

I've been editing photos for longer than I've been building nodes, so I wanted to bring some of that to Comfy in the way I like to work. If it works for you too, or if you have ideas, let me know.

Right now the smart modes are Klein 9B focused, but they should work with other edit models. Again, let me know!

Here is a really shitty Youtube demo I just recorded: https://www.youtube.com/watch?v=x0Un3OkEHFA

Pete

UPDATE: Load Image button now included

u/shootthesound — 1 day ago

32:9 and 21:9 Wallpapers - via smart crop - 05/17 [OC]

Lots of Fantasy and low poly this week. Hope you are all having a great weekend. All the high res and rest of the images (reddit compresses multi image posts) and crops on https://UltrawideWallpapers.net

u/shootthesound — 4 days ago

New batch [7680x2160] Ultrawide Wallpapers (05/17) - link in comments for High res and crops

u/shootthesound — 4 days ago

LTX 2.3 is now supported in Comfyui-Mesh for splitting models across Ethernet or multi-GPU machines with the NVENC codec. Major VRAM fixes are included for the Flux 2/LTX model implementations in the node.

https://github.com/shootthesound/comfyui-mesh

Key Changes:

  1. LTX 2.3 Dev and Distilled. (See the readme, but a tip: LTX LoRAs are often big, so it's best to load them in the server app. That stops the node firing them back to the server for the server-loaded blocks.)
  2. Fixes for VRAM issues where Comfy was not releasing some blocks from memory on the client.

IMPORTANT NOTES:

  1. For the LTX node, note the Codec dropdown. If your client machine has a 50XX series card, I recommend the Nvenc 5090 codec (I'll fix the name later; it should say 50 series). On a 40/30 series card, try the Nvenc and Raw modes. Nvenc will be quicker. Raw is true to standard single-machine, single-GPU output and still works over Ethernet, just not as fast as either of the Nvenc options.

  2. This node pack is about making it possible for those who can't, not making it quicker for those who can. Its aim is to help people who can't run a given model. If you can run a model easily, this node won't help you with that model.

u/shootthesound — 5 days ago
▲ 443 r/comfyui+1 crossposts

I built a custom NVENC encoder bridge to split FLUX 2 Models across two GPUs over Ethernet LAN (example: 5090 + laptop 4090 spreading model layers over two machines via Eth = 4.4s per image). Completely bypasses the need for NVLink. Multi GPU in one PC supported, Wifi 6 works very well also.

LTX 2.3, Flux 2 Dev and Klein 9B are supported. I've gone to a shit-tonne of effort to do a nice readme to get you up and running fast. There will be issues and I have upcoming testing requests. Any Nvidia card with NVENC is supported.

I've even tested it over mobile tethering with my laptop in a cafe and my desktop at home and generated 1MP images with 70% of the model at home and 30% on the laptop in the cafe in under 8 seconds. (I used tailscale as a handy free vpn for this)
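As a rough illustration of what a 70/30 split like that means in terms of model blocks (this is not the node's actual code), here is a small sketch. `split_blocks`, the 40-block count, and the choice to put the front of the model on the remote machine are all assumptions made up for the example.

```python
def split_blocks(num_blocks, remote_fraction):
    """Assign a contiguous share of blocks to the remote machine, the rest locally.

    Returns (remote_indices, local_indices). Which end of the model lives
    where is an arbitrary choice in this sketch.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    cut = round(num_blocks * remote_fraction)
    return list(range(cut)), list(range(cut, num_blocks))

# 70% of a hypothetical 40-block model at home, 30% on the laptop.
remote, local = split_blocks(40, 0.7)
```

Only the activations at the cut point have to cross the wire each step, which is why compressing them matters so much on a slow link.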

I plan to support LTX, Wan and some other visual models that have been too large for us until now.

P.S. I can't support networking help requests in the GitHub issues and will focus on architectural and usability issues.

Regarding the codec I've made for this: I've also made a version that splits 32B and 70B LLM models over two machines and works just as effectively. I'll try to release it this coming week. You'll also see in the readme for this node that I've given the codec its own GitHub repo for you to use.

I'm off to sleep now, 3.25 am here - glad to have this out, hope it helps you guys.

QUICK NOTE for Flux 2 Dev: if you are using the massive 2.5 GB turbo LoRA, use it in the LoRA field of the server app, and then to the RIGHT of the Icarus node (so you don't double up the weights). That means it will be used correctly across all weights, local and remote, without sending weights back and forth down the wire! With this setup I can do a Flux 2 Dev 1 MP image in 14 secs with the model spread over 1 Gb Ethernet between my 5090 desktop and 4090 laptop.

More - less quick notes:

  1. More models are absolutely on the list — Wan, LTX, Qwen, Chroma, and some much larger models that are currently difficult for most people to run comfortably on consumer hardware at all.
  2. The foundations for a true multi-node architecture are already there. I need to develop that side further, but the core concepts are working.
  3. More server-side improvements are coming. Right now the client can already transmit active LoRA weights to the server automatically, but it's even faster if the LoRAs already exist server-side and can simply be selected remotely.
    • multi-LoRA handling
    • client-side remote LoRA selection
    • smarter server-side LoRA management
  4. I've had some incredibly promising results running Klein 9B remotely over 4G/5G from a laptop in a café, with almost the entire model executing on a 5090 back at home and only the final layer running locally. That direction is genuinely exciting to me.
  5. A framework for doing this with LLMs already exists internally, and I have a proof-of-concept running 70B-class models split across a 5090 and 4090 at genuinely usable speeds on consumer hardware.
  6. All of this will take time. I'm currently working from home and balancing some family responsibilities, so I have to be smart with where I allocate development time. Most of the bigger ideas are going to happen either way, but community support absolutely helps accelerate development.
  7. I would love results/logs from people with more than one Nvidia GPU in their machine. I don't have one and can't afford one for now. Check the readme for usage instructions for this scenario.
  8. LoRAs work: when you apply one, its weights are fired down the wire to the server. If it's a hefty LoRA or you have a few, you can load the biggest one server-side in the GUI. See point 3 above for more.

UPDATE:

  1. LTX 2.3 is now supported! https://github.com/shootthesound/comfyui-mesh
  2. For the devs among you this is a repo of my NVENC codec: https://github.com/shootthesound/torch-nvenc-compress
u/shootthesound — 6 days ago

32:9 and 21:9 Wallpapers - via smart crop - 05/10 [OC]

Lots of landscapes this week, many of them low poly, as well as some materials inspo stuff. Hope you are all having a great weekend. All the high res and rest of the images (reddit compresses multi image posts) and crops on https://UltrawideWallpapers.net

u/shootthesound — 11 days ago

New batch [7680x2160] Ultrawide Wallpapers (05/10) - link in comments for High res and crops

u/shootthesound — 11 days ago
▲ 30 r/comfyui+1 crossposts

This is the end of my two day node-a-thon (for now - I've got about 5 more 70% there nodes) - I had a bunch of half baked nodes I've been using that I finished in a sprint. Sorry for the multiple posts, hope they are useful to some of you. Anyway:

When a workflow grows past 20-ish nodes you spend a real amount of time mentally tracing wires. "If I tweak this CLIP encode, what does it ripple into? Which sampler is on the other end of this controlnet apply? Why is that model loader still wired in — does anything still consume it?" — that kind of question.

I built Lighthouse to answer it visually. Right-click any node, pick Anchor from this node, and the whole canvas tells you how the graph relates to it. Direct neighbours glow red. Two hops away: orange. Then yellow, green, blue, violet (6+ hops or completely unconnected). The clicked node itself gets a bright white double-ring.

Two reasons it's actually useful:

  • Diagnostic — for big real workflows. "What does this node feed into? Is anything still consuming it?" answered at a glance.
  • Educational — for understanding workflows other people built. Downloaded a 60-node mystery workflow off civitai? Anchor on the CheckpointLoader to see the model's full influence radius. Anchor on the KSampler to see what's feeding it. Anchor on the SaveImage and walk the chain backwards. Each anchor point is a guided tour of one slice of the workflow's structure without manually following every wire.

Focus slider in the legend panel. Drag it up and the further bands progressively darken to black. At max only the 1-hop neighbours of the anchor stay visible — surgical for dense workflows, and a pretty good quiz tool if you're learning a workflow ("what's beyond this node? slide back down to check").

Non-destructive. Lighthouse only writes to its own draw hook — no node.bgcolor, no link state, no node properties. Toggle it off and the canvas is identical.

Bidirectional. Walks both upstream (inputs[i].link) and downstream (outputs[i].links[]), so a "neighbour" is anything reachable in either direction.
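The hop colouring described above amounts to a breadth-first search over the graph with every link treated as two-way. A minimal sketch, assuming links are given as plain `(source, destination)` pairs; `hop_distances` and `band_colour` are hypothetical names, and the real extension runs in the ComfyUI frontend rather than in Python.

```python
from collections import deque

BANDS = ["red", "orange", "yellow", "green", "blue"]  # hops 1 to 5

def hop_distances(links, anchor):
    """Breadth-first search from the anchor, walking links in both directions."""
    neighbours = {}
    for src, dst in links:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    dist = {anchor: 0}
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def band_colour(node, dist):
    """Map a node to its band: white anchor, violet for 6+ hops or unreachable."""
    hops = dist.get(node)
    if hops == 0:
        return "white"
    if hops is None or hops >= 6:
        return "violet"
    return BANDS[hops - 1]
```

Because the search only reads the link list and returns a separate dictionary, nothing about the graph itself changes, which matches the non-destructive behaviour described above.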

GitHub: https://github.com/shootthesound/comfyui-lighthouse

Install through ComfyUI Manager (search "Lighthouse") or clone from Github into custom_nodes/.

u/shootthesound — 14 days ago

Last one for today (been sitting on a backlog): Every ComfyUI workflow I make ends up looking like spaghetti within a few iterations. Existing arrange tools either reorder by execution depth (which breaks down the moment two nodes have the same depth) or just snap-to-grid (which doesn't actually organise anything).

So I built CleanFreak — it sorts your workflow by what each node is, not where it sits. Loaders go in one column. Encoders in the next. Then conditioning, samplers, decoders, post, outputs. Same workflow shape always lays out the same way regardless of how you originally built it.

What's in the box:

  • Tidy by Role (horizontal or vertical). Width-aware columns — the column is as wide as the widest node in it, narrower nodes are centred so everything lines up.
  • Optional coloured group cards around each role bucket. Re-tidying always wipes existing groups first so they never stack.
  • Subgraph + group-node unpacking before tidy. Modern subgraphs (post-0.3.51) and legacy group nodes both supported. Iterates so nested containers fully flatten.
  • Connections are never touched. ComfyUI links are by node id, so moving a node never breaks a wire. CleanFreak only writes to node.pos and to the graph's group list.
  • Editor modal — right-click → "review & edit assignments". Lists every node grouped by its current bucket with a per-row dropdown to re-assign. Click "Save assignments" and your edits persist to a JSON file in <ComfyUI>/user/cleanfreak/. The next time you open any workflow with those classes, your assignments are used. The classifier gets smarter the more you use it.
  • 1200+ node classes pre-classified out of the box. The entire stock ComfyUI node set, plus every node from Impact-Pack, controlnet_aux, rgthree-comfy, VideoHelperSuite, IPAdapter_plus, WAS Node Suite, comfyui-easy-use, KJNodes (full ~200), RES4LYF (~150), comfyui-dynamicprompts, comfyui-ollama, comfyui-automaticcfg, Comfyroll, and LTXVideo / LTXTricks.
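To make the width-aware column idea concrete, here is a small sketch of a role-based horizontal layout. It is only an illustration under assumptions: `tidy_by_role`, the role names, the fixed `gap`, and sorting by node name within a column are all invented here, and the real classifier covers far more than seven buckets.

```python
ROLE_ORDER = ["loader", "encoder", "conditioning", "sampler",
              "decoder", "post", "output"]

def tidy_by_role(nodes, gap=40):
    """Lay nodes out in one column per role.

    nodes maps name -> (role, width, height); returns name -> (x, y).
    Each column is as wide as its widest node, and narrower nodes are centred.
    """
    positions = {}
    x = 0
    for role in ROLE_ORDER:
        column = sorted((name, w, h) for name, (r, w, h) in nodes.items()
                        if r == role)
        if not column:
            continue  # empty roles take up no horizontal space
        col_width = max(w for _, w, _ in column)
        y = 0
        for name, w, h in column:
            positions[name] = (x + (col_width - w) / 2, y)
            y += h + gap
        x += col_width + gap
    return positions
```

The function only produces positions, which mirrors the point above that links are never touched: moving nodes cannot break a wire when links refer to node ids.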

GitHub: https://github.com/shootthesound/comfyui-CleanFreak

Install through ComfyUI Manager (search "CleanFreak") or clone the GitHub repo into custom_nodes/.

u/shootthesound — 14 days ago
▲ 16 r/comfyui

Releasing the next of my custom nodes from my workflow - Finding LoRA:

I have way too many LoRAs. The stock LoRA Loader makes me scroll a giant dropdown or use very basic search, and if I want to stack another I have to drag out a second loader, wire its MODEL in, wire its MODEL out, and remember the trigger words. Every part of that workflow has been friction I've felt hundreds of times.

So I built this — what I wished the stock loader was:

  • Real fuzzy search. Click the LoRA bar, type a few characters, hit Enter. Substring matches always rank above scattered ones, so typing kase puts character_kasey_v3.safetensors at the top instantly.
  • Bookmarks. One click bookmarks the active LoRA. A second bar above the picker lists all your bookmarks; pick one and the main LoRA picker is set instantly. Bookmarks persist globally and sync live across every Finding LoRA node on your canvas — no restart, no refresh.
  • Trigger word storage. When you bookmark, you're prompted for an optional trigger phrase. It's emitted as a STRING output you can wire into your prompt encoder. The displayed trigger row is click-to-copy — paste it straight into a CLIPTextEncode.
  • One-click chaining. A button at the bottom spawns another copy of the node beside the current one and splices it into the model line automatically. Any downstream MODEL connections are re-routed through the new node — stack as many LoRAs as you want without manually re-wiring.
  • No horrible left/right chevron dropdowns. Both pickers (LoRA + bookmarks) open a proper modal — alphabetical with current selection scrolled into view, type to filter, up/down + Enter to navigate.
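The ranking rule in the first bullet (substring matches above scattered ones) can be sketched in a few lines. This is one plausible way to do it, not the node's actual scoring: `fuzzy_score` and `search` are hypothetical names, and the tie-breaks (earlier position, tighter span, shorter filename) are assumptions.

```python
def fuzzy_score(query, name):
    """Return a sort key (lower is better), or None if the query does not match."""
    q, n = query.lower(), name.lower()
    idx = n.find(q)
    if idx != -1:
        return (0, idx, len(n))  # tier 0: contiguous substring match
    # Tier 1: the characters appear in order, but with gaps between them.
    pos, first = -1, None
    for ch in q:
        pos = n.find(ch, pos + 1)
        if pos == -1:
            return None
        if first is None:
            first = pos
    return (1, pos - first, len(n))  # a tighter span ranks higher

def search(query, names):
    """Return matching names, best first."""
    scored = [(fuzzy_score(query, n), n) for n in names]
    return [n for _, n in sorted(p for p in scored if p[0] is not None)]
```

With the filename from the bullet above, `search("kase", ...)` puts `character_kasey_v3.safetensors` ahead of a file that merely contains k, a, s, e in order, and drops files that do not match at all.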

It's a model-only loader (matches LoraLoaderModelOnly), so it works with Flux, Klein, Wan, Z-Image, and anything else that doesn't run a CLIP through the LoRA chain.

GitHub

Install through ComfyUI Manager when it eventually appears there (search "Finding LoRA") or clone the above into custom_nodes/.

u/shootthesound — 14 days ago
▲ 23 r/comfyui+2 crossposts

I've been working on the consumer-multi-GPU PCIe bottleneck — Nvidia removed NVLink from the 4090/5090, and splitting a 70B model across two consumer cards drops you to ~30 GB/s over PCIe peer-to-peer.

Spent the last few months building a Python library that uses the GPU's otherwise-idle NVENC/NVDEC silicon to compress activations and KV cache on the fly, then ships the small bitstream across the same wire.

Repo: https://github.com/shootthesound/torch-nvenc-compress (Apache 2.0)

Prior art (this isn't novel as an idea)

  • LLM.265 — "Video Codecs are Secretly Tensor Codecs" (late 2025). The closest direct precedent: same insight applied to LLM weights, activations, KV cache.
  • KVFetcher (April 2026). KV compression for remote prefix fetching.
  • CodecFlow (April 2026). Codec motion-vector metadata for KV refresh during prefill.

The "video codec on tensors" idea was already in the literature when I started. What's added in this work:

  1. PCA + rank-truncation as preprocessing. Activations and KV in their standard basis are noise-like (~4× compression floor, basically the Gaussian-noise limit). The PCA basis reveals a heavy-tailed channel covariance that the codec can actually exploit. The basis is per-layer, computed offline, ships with the model LoRA-style (~32 MB for FLUX.2 Klein 9B's 8 double-blocks at K=500).
  2. Parallel-path / dual-lane architectural reframe. NVENC and NVDEC are physically separate hardware units from the SM cluster and the PCIe controller. With CUDA-stream pipelining, the codec time hides behind compute and transfer of other tensors. Compression ratio becomes effective-bandwidth multiplier rather than just a smaller payload.
  3. Pure-ctypes Direct Video Codec SDK wrapper (DirectBackend) — kills the FFmpeg subprocess overhead. Zero-copy from torch CUDA tensors, 8-deep async output ring per NVENC engine, optional CUDA stream binding via nvEncSetIOCudaStreams, MultiEngineDirectBackend across all 3 NVENC engines on the 5090.
  4. Three documented null findings — sparse residual, AV1 NVENC on Blackwell, channel reordering. So nobody else has to rerun the dead ends.
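For readers who have not met the PCA + rank-truncation step in point 1, here is a toy sketch of the idea: find the top-K directions of the channel covariance offline, then keep only K coefficients per activation vector. It uses pure-stdlib power iteration so it runs anywhere. The function names are hypothetical, and nothing here is taken from the repo, which presumably does this on the GPU with proper linear algebra routines.

```python
import random

def channel_covariance(rows):
    """Mean and covariance over channels; rows is a list of samples x channels."""
    n, c = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(c)]
    centred = [[r[j] - mean[j] for j in range(c)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centred) / n for b in range(c)]
           for a in range(c)]
    return mean, cov

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def top_components(cov, k, iters=200, seed=0):
    """Top-k eigenvectors of cov via power iteration with deflation."""
    rng = random.Random(seed)
    c = len(cov)
    basis = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(c)]
        scale = sum(x * x for x in v) ** 0.5
        v = [x / scale for x in v]
        for _ in range(iters):
            w = matvec(cov, v)
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:
                break  # nothing left to find in this matrix
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(cov, v)))
        basis.append(v)
        # Deflate: subtract the component just found before the next pass.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(c)]
               for a in range(c)]
    return basis

def project(row, mean, basis):
    """Rank-k coefficients of one activation vector."""
    return [sum((x - m) * b for x, m, b in zip(row, mean, v)) for v in basis]

def reconstruct(coeffs, mean, basis):
    """Approximate the original vector from its rank-k coefficients."""
    return [m + sum(c * v[j] for c, v in zip(coeffs, basis))
            for j, m in enumerate(mean)]
```

In this picture the basis is the part that would be computed once per layer and shipped alongside the model, while `project` and `reconstruct` run on either side of the codec.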

Measured results (RTX 5090, real workloads)

  • Compression ratios: 6.1× lossless on diffusion (FLUX.2 Klein 9B mid-block), 2.7× lossless on LLM KV cache (Mistral 7B v0.3). LOO-validated across 1,735 diffusion captures and 6 LLM prompts. (FLUX.2 Klein 9B was the internal research target; the public PoC repo uses FLUX.1-schnell since it's Apache 2.0 and freely downloadable. Numbers reproduce qualitatively on schnell — heavy-tailed PCA spectrum, similar Pareto.)
  • Codec speed: DirectBackend 0.243 ms/frame encode, 0.435 ms/frame decode at 256×256 YUV444 QP=18 on real PCA-rotated FLUX activations. MultiEngineDirectBackend across the 5090's 3 NVENC engines: 0.180 ms/frame encode, 0.262 ms/frame decode. ~7.9× over an FFmpeg subprocess baseline.
  • Parallel-path overlap empirically measured: 30×4096² fp16 GEMM on CUDA stream A + 64-frame DirectBackend encode on stream B (encoder bound to stream B via nvEncSetIOCudaStreams). Serialized wall-clock 40.1 ms; parallel wall-clock 26.0 ms; theoretical max overlap floor 20.9 ms. 1.34× speedup over serialized = 67% of theoretical max overlap realized. This is the load-bearing measurement for the architectural claim that NVENC silicon runs concurrently with SM compute.
  • Slow-wire wins, end-to-end: measured 3.13× wall-clock speedup at 100 Mbps residential broadband, 5.29× at 50 Mbps (real codec round-trip + simulated wire). 1.69× dual-lane on simulated 1 Gbit ethernet.

What is not measured end-to-end (projections from the above)

Multi-GPU PCIe peer-to-peer activation transfer recovering ~180 GB/s effective bandwidth — codec primitive is ready and benchmarked, but the cross-GPU PCIe peer-to-peer wiring is pending. (This is where I need community help, as my validation rig only has one desktop GPU and you need two on the same motherboard to test this).

Real two-machine ethernet split-model inference — wire-simulation PoC measures real codec time + simulated wire, but isn't a true two-machine deployment yet. (I have a 4090 laptop incoming next week to physically validate this networked leg).

Long-context KV-spill end-to-end tok/s on a real model decode loop — compression ratio is measured, but the actual N tok/s → 3N tok/s benchmark on e.g. 32B + 64K context isn't in the repo yet. The math implies it; the benchmark hasn't been written.
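The ~180 GB/s projection above looks like the raw PCIe figure multiplied by the measured compression ratio. That reading is an assumption on my part, and it only holds if codec time is fully hidden behind compute and transfer. A back-of-envelope sketch, with `effective_bandwidth` as a hypothetical helper:

```python
def effective_bandwidth(raw_gb_s, compression_ratio):
    """Best-case effective bandwidth when codec time is fully overlapped.

    Each byte on the wire stands for compression_ratio bytes of tensor data,
    so the link behaves as if it were that many times faster.
    """
    return raw_gb_s * compression_ratio

# Figures quoted in the post: ~30 GB/s PCIe peer-to-peer,
# 6.1x on diffusion activations, 2.7x on LLM KV cache.
diffusion = effective_bandwidth(30, 6.1)  # about 183 GB/s
llm_kv = effective_bandwidth(30, 2.7)     # about 81 GB/s
```

Any codec time that is not hidden comes straight off these numbers, which is why the overlap measurement matters as much as the compression ratio.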

Where I'd value help

  • Anyone with a dual-4090 / dual-5090 / two-machine-with-PCIe-P2P rig who'd want to run the cross-GPU peer-to-peer benchmark when I write it. Would shrink the "75%" gap meaningfully.
  • Anyone running long-context KV-spill workloads who'd want to wire DirectBackend into their decode loop for the end-to-end tok/s measurement. I'd write the integration with you.
  • Cross-vendor coverage — AMD VCN and Intel QSV/Arc paths are completely open. Same architectural claim, different SDK surface.

What's in the repo

19 numbered runnable PoCs, every measured number reproducible. Honest status table at the top of the README. PCA basis builder + per-channel quantize + YUV pack/unpack + codec wrappers all separable so you can swap pieces.

Built solo around full-time caregiving — technical feedback, criticism, or pointers to related work I missed are genuinely appreciated.

u/shootthesound — 18 days ago