I've been learning Stable Diffusion text-to-image and image-to-video with ComfyUI for a few months.

Now I have so many tools at my disposal that I'm feeling a bit lost, so I'm hoping that people in here won't mind sharing some advice on an overall process.

I'm starting to appreciate the grind required to build experience in this area, so thank you to anyone who does help.

Goal

Create some short films by editing together short clips generated from keyframes.

Roughly storyboard them so I get the right shots with the right angles and composition.

Have consistent characters.

Some of the films will be adult in nature. Nothing hardcore but I do want to have good looking people in revealing clothing and some nudity.

It's not a side hustle. I'm just doing it for fun and to learn.

Where I'm At

I can...

Build moderately complex ComfyUI workflows (IPAdapters, ControlNets, detailers, inpainting, SAM3 for segmenting and masking, general upscale / hires-fix steps, and workflow components like switches, get/set nodes, etc.).
Produce some nice looking images with Flux 1 Dev.
Use image editing models like Flux 2 Klein 9B and Qwen 2511 with some success.
Train decent character LoRAs for Flux and SDXL.
Use image to video to generate alternate keyframes from an initial keyframe.

What I'm Struggling With

I can produce an image with the composition I want, the lighting, the characters in the right outfits and poses. But not all in the same image.

Building these elements up in multiple passes for each keyframe seems sensible.

I cannot figure out how to pull all my tools together into an efficient pipeline, or how to avoid compromising an earlier step with a later one (e.g. getting a good facial likeness and then ruining it with texturing).

More detail below about my experience so far in case it helps. General advice also most welcome.

-----------------------------------------------------------------------------------------------------------------------

What I've Found

Flux 1D is good at...

Creating REALLY nice looking images (textures, lighting, composition) with just a prompt. It's great for exploring concepts or producing that one perfect starting image for a clip.
Producing a consistent facial likeness across images with a well trained LoRA.

However, it's not so great at...

Producing the specific angles and image composition that I want, even with a lot of prompt iteration based on the wealth of prompting guides available.
ControlNets. No matter the strength and start/end settings, when I use depth, canny or pose ControlNets, the images look washed out and lose that "magic" that Flux seems to be able to produce without them.
Maintaining micro details between images, even with really specific prompting. Generating 50 images straight out of Flux will mean slightly differing hair cuts, outfits, etc.
Nipples, genitals, and anatomy in general, at least compared to SDXL.
Revealing clothing without a specific outfit LoRA. Why does it insist on massive granny underwear in 99/100 generations when I just want a thong?

SDXL (Juggernaut Ragnarok in my case) is good at...

Producing EXACTLY the composition I want using ControlNets, without compromising the image quality vs no ControlNet. I can do this from a sketch or using a reference image/still. I may experiment with Blender to produce depth maps for consistent environments.
Nice looking nudes / good anatomy in general.
The LoRA ecosystem is just amazing. Any concept, clothing or style I can think of and there's probably a LoRA for it.

Not so good at...

Backgrounds, objects, lighting, textures and overall image quality / realism compared to Flux.
It doesn't seem to latch onto facial likeness as well as Flux does with character LoRAs.

I've also been using Klein 9B and Qwen 2511. They have their differences but between them I can do things like...

Fix small mistakes or bad anatomy with inpainting.
Create an outfit asset by taking an outfit from a Flux image, putting it on a mannequin and then transferring it to other images.
Change or remove backgrounds.
Change the camera angle.
Repose characters.
Do headswaps to preserve likeness, although even with BFS LoRAs and injecting 4x face reference images, the likeness isn't 100%. The examples I see online always look amazing but I can't seem to replicate them.

However, they tend to output waxy looking skin and bad faces. Every edit pass degrades the image, even with masking where possible. Pulling my keyframes together by stitching elements of multiple Flux images (outfit from one, head and hair from another, then pose, etc), just seems like the wrong angle.
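
To be concrete about what I mean by masking: as I understand it, the idea is to paste only the edited region back over the untouched original, so that pixels outside the mask never go through the edit model at all. Below is a toy sketch of that blend in plain Python on tiny single-channel grids. It is not a ComfyUI node or anyone's actual workflow; `composite_masked` and the sample values are made up purely for illustration.

```python
def composite_masked(original, edited, mask):
    """Blend `edited` over `original` using `mask`.

    A mask value of 0.0 keeps the original pixel, 1.0 takes the edited
    pixel, and values in between feather the seam. All three arguments
    are equally sized 2D lists of floats in [0, 1] (hypothetical
    single-channel images, for illustration only).
    """
    if not (len(original) == len(edited) == len(mask)):
        raise ValueError("image and mask heights differ")
    out = []
    for row_o, row_e, row_m in zip(original, edited, mask):
        if not (len(row_o) == len(row_e) == len(row_m)):
            raise ValueError("image and mask widths differ")
        # Per-pixel linear blend: original * (1 - m) + edited * m
        out.append([o * (1.0 - m) + e * m
                    for o, e, m in zip(row_o, row_e, row_m)])
    return out


# Made-up 2x3 example: left column untouched, right column fully edited,
# one feathered pixel in between.
original = [[0.2, 0.2, 0.2], [0.2, 0.2, 0.2]]
edited = [[0.9, 0.9, 0.9], [0.9, 0.9, 0.9]]
mask = [[0.0, 0.5, 1.0], [0.0, 0.0, 1.0]]
result = composite_masked(original, edited, mask)
```

Where the mask is 0 the original pixels come through unchanged, so in principle only the masked area should pick up any degradation from an edit pass. That is the behaviour I'm trying to get from each pass in the pipeline.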

u/DoskvolDenizen

Advice for overall image generation pipeline for video keyframes
