u/K_Kolomeitsev

A conv-free video tokenizer ~5M params that beats NVIDIA's Cosmos Tokenizer on all 16 UVG sequences

A conv-free video tokenizer ~5M params that beats NVIDIA's Cosmos Tokenizer on all 16 UVG sequences

I open-sourced a neural video tokenizer and I'd like the method and the eval

picked apart, because the headline number surprised me too.

Scope first, since "Cosmos" covers a lot of ground at NVIDIA: this is the Cosmos

Tokenizer (CV8x8x8, the continuous video tokenizer), compared tokenizer to

tokenizer on reconstruction. Not the world models, not the generators.

The short version: it's a reconstruction tokenizer (video -> discrete latent ->

video) with no convolutional VAE. The whole pixel-to-latent path is attention.

Most stacks keep 3D-RoPE in the diffusion transformer and let a conv VAE handle

the pixels (CogVideoX, Hunyuan, Wan); Cosmos and MAGVIT-v2 are mostly

convolutional. I dropped the conv VAE.

Method:

- Windowed axial 3D RoPE. Relative RoPE inside local axial attention, factored

over t/y/x. The rotation is orthogonal and the phase is linear in position, so

the attention logit only ever sees a relative offset. That gives per-level

translation equivariance and arbitrary spatial extent, and keeps the cost linear

in tokens.

- A phase-jump for cuts. I offset the temporal coordinate per scene

(tau_eff = tau + scene_id*theta_jump). Inside a shot the offset is constant and

equivariance holds; across a cut it's large enough to spin every temporal

frequency out of phase, so attention across the cut decays toward zero. No

masks, no custom kernels - the rotation does the gating and it stays

differentiable.

- Matryoshka Laplacian pyramid. Each level codes the residual of the coarser one,

so the file is a prefix you truncate: L0 preview, L0+L1 medium, all three full.

One file, whole ladder.

- FSQ + Balle scale hyperprior + arithmetic coder. FSQ is codebook-free, so

there's no codebook collapse to babysit. The training rate term reuses the

coder's integer CDF, so the bits I optimize land within ~1% of the bytes on

disk, and the decode is bit-identical with or without the entropy stage.

The eval is the part I'd attack first if I were reading this. Training is at 480x864. To

test at native 4K without downscaling I cover each frame with overlapping 864x480

tiles (25 of them at 4K), run each through encode/decode, and stitch back with a

partition-of-unity blend. Same tiling and native fps for both codecs - neither one

fits a 4K window on a single GPU anyway. PSNR is in-tensor, RGB and BT.709 Y.

Cosmos goes through its own sliding-window inference.

Result: higher Y-PSNR than the Cosmos Tokenizer on every one of the 16 UVG

sequences. Mean 38.07 vs 36.35 dB, +1.72. The spread runs +0.60 (FlowerPan) to

+4.28 (YachtRide), and the high-motion clips (Jockey, ReadySetGo) don't dent it.

Held-out RD on 32 unseen clips, off the single file: 27.85 dB @ 0.011 bpp preview,

32.76 @ 0.103 medium, 40.26 @ 0.351 full. On a 4K clip the neural encode->decode

is 17s vs Cosmos' 34s; a 4K window through the Cosmos encoder asks for ~94 GiB,

mine tiles at ~4.7. Full pipeline is 105s because the arithmetic coder

(unoptimized) eats about 84% of it. ~5.0M params total, ~290h of mixed video.

What I'm not saying:

- Not a bitrate win. Cosmos CV is continuous and not entropy-coded, so its

footprint and my bpp aren't on the same axis. The match is PSNR.

- Not an HEVC/VVC competitor. Fixed-size latent, no content-adaptive rate; on

smooth high-fps content the handcrafted codecs win on RD. It's a tokenizer / AI

latent, not a delivery codec.

- The native-res tiling works against me (overlap bits, no cross-tile context) and

the numbers keep it that way.

- It reconstructs; it doesn't generate. The point of a discrete, invertible,

collapse-free latent is to generate into it later.

Links:

- Paper: MPT-VC: A Rate-Scalable Video Tokenizer with Windowed Axial RoPE and

Scene-Aware Positional Encoding

- Code (inference, Apache-2.0): github.com/k-kolomeitsev/mpt-vc

- Weights + model card: huggingface.co/kkolomeitsev/mpt-vc

If the tiling protocol or the Cosmos config is unfair somewhere, that's the

feedback I'm after.

u/K_Kolomeitsev — 10 days ago