u/Crazy-Repeat-2006

AsymFLUX.2-klein-9B - Pixel Space Model.

Qwen-Image-VAE-2.0 Technical Report

arxiv.org/pdf/2605.13565

"We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratio. Furthermore, downstream DiT experiments reveal our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability."

Key innovations:

  • Global Skip Connections (GSC): This architectural change lets the model carry fine details from the original image past the compression bottleneck, which improves the clarity of the reconstructed output.
  • Asymmetric & Attention-Free Backbone: The encoder (which processes the image) is kept lightweight and fast, while the decoder (which reconstructs the image) stays powerful. Removing attention layers from the VAE itself cuts the computational cost (FLOPs).
  • Semantic Alignment Strategy: To make the latent space easier for diffusion models to learn (diffusability), they align it more closely with visual semantics. This helps downstream models converge faster.
  • Synthetic Rendering for Text: They trained the model on billions of images and added a synthetic rendering engine for text-heavy content. This makes the VAE exceptionally good at reconstructing text-rich images (documents, posters, covers, etc.), where most other VAEs struggle.

alibaba/OmniDoc-TokenBench

"We conduct a comprehensive evaluation on OmniDoc-TokenBench (~3K text-rich images, 256×256 resolution). Models are grouped by spatial compression factor and sorted by NED within each group.

Our Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction across all compression ratios. The f16c128 variant attains SSIM 0.9706 and PSNR 30.45 dB, surpassing the best f8 baseline (FLUX.1-dev at 0.9364 / 26.24 dB) despite 2× higher spatial compression. In terms of text fidelity (NED), f16c128 reaches 0.9617, exceeding all evaluated VAEs. Even under extreme f32 compression, our f32c192 achieves NED 0.8555, surpassing multiple f16 baselines."
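
To put the variant names in perspective, here is a rough sketch of the latent sizes they imply. It assumes the usual fXcY naming convention (spatial downsampling factor X, Y latent channels), which the quote does not spell out, and the helper names are made up for illustration. It counts values only, not bits.

```python
def latent_shape(height, width, spatial_factor, channels):
    """Latent shape (C, H/f, W/f), assuming fXcY = factor X, Y channels."""
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("image size must be divisible by the spatial factor")
    return (channels, height // spatial_factor, width // spatial_factor)


def compression_ratio(height, width, spatial_factor, channels, image_channels=3):
    """Number of input values divided by number of latent values."""
    c, h, w = latent_shape(height, width, spatial_factor, channels)
    return (image_channels * height * width) / (c * h * w)


# 256x256 is the benchmark resolution mentioned in the quote.
for name, factor, channels in [("f16c128", 16, 128), ("f32c192", 32, 192)]:
    shape = latent_shape(256, 256, factor, channels)
    ratio = compression_ratio(256, 256, factor, channels)
    print(f"{name}: latent {shape}, {ratio:.0f}x fewer values than the RGB input")
```

Under that assumption, a 256×256 RGB image becomes a 128×16×16 latent for f16c128 and a 192×8×8 latent for f32c192.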

https://preview.redd.it/yrt8rsc8241h1.png?width=1918&format=png&auto=webp&s=3b812d1a9b4be2f9d2d6922d685c5077b7c9e242
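
For anyone unfamiliar with the metrics quoted above, here is a minimal sketch of PSNR and an NED-style text score. The PSNR formula is the standard one. The NED part is an assumption: the reported scores are higher-is-better, so this uses 1 minus the normalized Levenshtein distance between the ground-truth and OCR'd strings, which may differ from the benchmark's exact definition.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not reference or len(reference) != len(reconstruction):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def edit_distance(a, b):
    """Levenshtein distance using the standard two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ned_similarity(truth, predicted):
    """Assumed form: 1 - edit distance / longer length (1.0 = exact match)."""
    longest = max(len(truth), len(predicted))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(truth, predicted) / longest
```

With this definition, one wrong character in a four-character string gives a score of 0.75.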

reddit.com
u/Crazy-Repeat-2006 — 9 days ago

Causal-Forcing

Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

https://preview.redd.it/3hecgqcjpj0h1.png?width=4944&format=png&auto=webp&s=5da14de07296f8f4da64ad2659e04f59de7f1394

https://reddit.com/link/1taaof4/video/or66xjc6pj0h1/player

"Causal Forcing significantly outperforms Self Forcing in both visual quality and motion dynamics, while keeping the same training budget and inference efficiency, enabling real-time, streaming video generation on a single RTX 4090.

We identify a theoretical flaw in Self Forcing’s training pipeline during ODE initialization: a bidirectional teacher should not be used to supervise an autoregressive student, as this violates frame-level injectivity. Motivated by this analysis, we propose Causal Forcing: we first fine-tune a bidirectional base model into an autoregressive diffusion model, then use it as the teacher for ODE initialization, followed by the same DMD stage as in Self Forcing. Our method significantly outperforms Self Forcing in both visual quality and motion dynamics, while keeping the training budget and inference efficiency unchanged."
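
To illustrate the bidirectional vs. autoregressive distinction the quote hinges on, here is a small sketch of a frame-level attention mask. This is not taken from the Causal-Forcing code: the function name and the block-causal layout (each frame sees itself and earlier frames) are assumptions made only for illustration.

```python
def frame_attention_mask(num_frames, tokens_per_frame, causal):
    """Return mask[q][k] == True where query token q may attend to key token k.

    causal=False: every token sees every frame (bidirectional).
    causal=True:  a token sees only its own frame and earlier frames.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        row = []
        for k in range(total):
            k_frame = k // tokens_per_frame
            row.append(k_frame <= q_frame if causal else True)
        mask.append(row)
    return mask


# 3 frames with 2 tokens each; print the causal pattern as 1/0 rows.
for row in frame_attention_mask(3, 2, causal=True):
    print("".join("1" if allowed else "0" for allowed in row))
```

In the bidirectional case, the output for an early frame depends on later frames. An autoregressive student never has access to those frames when it generates, which is the kind of mismatch the quote describes.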

Site: Causal-Forcing

HF: zhuhz22/Causal-Forcing · Hugging Face

u/Crazy-Repeat-2006 — 11 days ago

Longcat Image Turbo - 4 NFEs

https://preview.redd.it/of7fd858kb0h1.png?width=3244&format=png&auto=webp&s=1c83f588ca7cf08e48b702113d2ede53e0f9817d

HF: byliutao/Longcat-Image-Turbo · Hugging Face

"This repository contains the weights for Longcat-Image-Turbo, a few-step distilled version of Longcat-Image using the Continuous-Time Distribution Matching (CDM) method presented in Continuous-Time Distribution Matching for Few-Step Diffusion Distillation.

CDM migrates the Distribution Matching Distillation (DMD) framework from discrete anchoring to continuous optimization, allowing for high-quality image generation with very few steps (e.g., 4 NFE)."
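
For context, NFE is the number of function evaluations, i.e. how many times the network is called during sampling. The toy sketch below only shows what that count means. It is not CDM or the Longcat sampler: the Euler loop, the `CountingModel` class and the straight-line velocity field are all stand-ins chosen for illustration.

```python
def euler_sample(velocity_fn, x_start, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=1 down to t=0 with Euler steps."""
    x = x_start
    dt = -1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 + i * dt  # 1.0, 0.75, 0.5, 0.25 for four steps
        x = x + dt * velocity_fn(x, t)
    return x


class CountingModel:
    """Toy stand-in for a network that counts how often it is evaluated."""

    def __init__(self, data):
        self.data = data
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        # Velocity of the straight path x_t = (1 - t) * data + t * noise.
        return (x - self.data) / t


model = CountingModel(data=2.0)
sample = euler_sample(model, x_start=-1.0, num_steps=4)
print(f"NFE = {model.calls}, sample = {sample}")
```

The toy path is a straight line, so four Euler steps land on the target. A real diffusion model follows a curved trajectory, which is why getting good images at 4 NFE requires distillation.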

u/Crazy-Repeat-2006 — 13 days ago

https://preview.redd.it/9fl7fg1l25yg1.png?width=2870&format=png&auto=webp&s=2f9a3e9832717e9320ec424c2bead3efeedf04cb

Image generation and generated-image detection have both advanced rapidly, but mostly along separate technical paths: generation is dominated by generative architectures, while detection is dominated by discriminative ones. This separation creates a persistent gap in practice: generators are not directly optimized by forensic criteria, and detectors are often trained on static snapshots of old forgeries, which limits robustness to new generators.

UniGenDet addresses this gap with a unified co-evolutionary framework that jointly optimizes generation and detection in one loop. The core idea is to make both tasks explicitly exchange useful signals instead of evolving independently.

  • Symbiotic multimodal self-attention bridges generation and authenticity understanding in a shared architecture.
  • Generation-detection unified fine-tuning (GDUF) equips the detector with generative priors, improving generalization and interpretability.
  • Detector-informed generative alignment (DIGA) feeds authenticity constraints back into synthesis, improving realism and fidelity.

In short, UniGenDet turns the traditional "generator vs. detector" arms race into a closed-loop collaboration. This repository provides the full training and evaluation pipeline built on pretrained BAGEL components.

HF: Yanran21/UniGenDet · Hugging Face

GH: Zhangyr2022/UniGenDet

u/Crazy-Repeat-2006 — 24 days ago