u/Deep-Delivery-5631

Hey everyone, wanted to share a project I've been developing over the past few months as a freelance video editor/director based in Paris. It's still very much a work in progress but I've reached a point where the pipeline is validated end-to-end and thought it was worth sharing.

The goal

Capture a human subject (dancer, performer) in full 360° volumetric video using only consumer hardware — 4 synchronized iPhones — and reconstruct it as a 4D Gaussian Splat sequence that can be navigated freely in post.

The pipeline

1. Capture

4 iPhones synchronized via a clapperboard, shooting at 25fps in 4K. The cameras are positioned roughly 90° apart for a full 360° ring around the subject.
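For anyone wondering how a clap becomes a sync, here is a minimal sketch. It assumes the clap is located as the loudest sample in each camera's audio track; the post does not say whether the alignment is done by audio or by eye. The function names are made up for illustration.

```python
import numpy as np

FPS = 25  # capture frame rate from the post


def clap_time_seconds(audio: np.ndarray, sample_rate: int) -> float:
    """Return the time of the loudest sample, taken here as the clap."""
    peak_index = int(np.argmax(np.abs(audio)))
    return peak_index / sample_rate


def frame_offsets(clap_times: list[float], fps: int = FPS) -> list[int]:
    """Frames to trim from the start of each clip so every clap
    lines up with the earliest one."""
    earliest = min(clap_times)
    return [round((t - earliest) * fps) for t in clap_times]
```

Rounding to whole frames leaves up to half a frame of residual offset, which is 20 ms at 25fps, so fast motion can still look slightly misaligned between cameras.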

2. Camera calibration: Pi3X

I use Pi3X (built on MASt3R/DUSt3R) to estimate camera poses from a single reference frame per camera. Critical fix: the output is in the OpenCV convention, but downstream tools expect OpenGL. Applying c2w[:3, 1:3] *= -1 (negating the camera's Y and Z axes) before writing transforms.json is essential; otherwise you get outward-pointing frustums and collapsed PSNR on one camera.
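The flip can be sketched as follows. This is a minimal illustration, assuming each pose is a 4x4 camera-to-world matrix; the function name is made up and is not part of Pi3X or any downstream tool.

```python
import numpy as np


def opencv_to_opengl(c2w: np.ndarray) -> np.ndarray:
    """Convert a 4x4 camera-to-world matrix from OpenCV to OpenGL axes.

    OpenCV cameras look down +Z with +Y pointing down; OpenGL cameras look
    down -Z with +Y pointing up. Negating the Y and Z columns of the
    rotation block switches between the two and leaves the camera
    position (last column) untouched.
    """
    out = c2w.copy()       # do not modify the caller's matrix
    out[:3, 1:3] *= -1     # same operation as in the post
    return out


# Usage: a camera at (1, 2, 3) with identity rotation.
pose_cv = np.eye(4)
pose_cv[:3, 3] = [1.0, 2.0, 3.0]
pose_gl = opencv_to_opengl(pose_cv)
```

Because the operation is its own inverse, applying it twice returns the original pose. That is also why an accidental second flip somewhere in the pipeline silently puts the cameras back in the wrong convention.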

3. Masking: MatAnyone + SAM2

MatAnyone propagates subject masks across all 300 frames per camera, with SAM2 prompt points manually tuned per camera. Foreground coverage sits around 6-10% of the frame, which is expected for a full-body subject at a distance.
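Coverage is a cheap sanity check on the masks. Here is a minimal sketch, assuming each mask is an array with alpha values in [0, 1]. The `low` and `high` bounds are illustrative placeholders chosen loosely around the 6-10% range, not values from the pipeline.

```python
import numpy as np


def foreground_coverage(mask: np.ndarray, threshold: float = 0.5) -> float:
    """Fraction of pixels whose alpha exceeds `threshold`."""
    return float((mask > threshold).mean())


def suspicious_frames(masks, low: float = 0.03, high: float = 0.20) -> list[int]:
    """Indices of frames whose coverage falls outside [low, high].

    An empty or exploded mask usually shows up here before it shows up
    as a training artifact.
    """
    return [
        i for i, m in enumerate(masks)
        if not low <= foreground_coverage(m) <= high
    ]
```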

4. Skeleton estimation: Sapiens 1b, Goliath → COCO133

Running Facebook's Sapiens 1b TorchScript model to extract 308 Goliath keypoints, then remapping to COCO-WholeBody 133 format (body + feet + face + hands). One important fix: skip the score-rescaling step for keypoints 92–112; applying it there causes a bug. Triangulation reads the camera YML files directly rather than transforms.json, which avoids a coordinate-system double-flip bug that causes ~190px reprojection error on keypoints.
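A per-keypoint reprojection check is the kind of test that surfaces an error like the ~190px one. Below is a minimal sketch of linear (DLT) triangulation plus a reprojection error. It assumes 3x4 projection matrices P = K [R | t] built from OpenCV-style world-to-camera extrinsics; this is a generic textbook method, not necessarily what the pipeline's triangulation code does.

```python
import numpy as np


def triangulate_dlt(proj_mats, points_2d) -> np.ndarray:
    """Triangulate one 3D point from N >= 2 views.

    proj_mats: 3x4 projection matrices (world -> pixel), one per view.
    points_2d: (u, v) pixel observations, one per view.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the
        # homogeneous point X: u * (P[2] @ X) = P[0] @ X, same for v.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]                 # right singular vector of the smallest singular value
    return X[:3] / X[3]        # back from homogeneous coordinates


def reprojection_error(P, point_3d, point_2d) -> float:
    """Pixel distance between the observation and the reprojected point."""
    x = P @ np.append(point_3d, 1.0)
    return float(np.linalg.norm(x[:2] / x[2] - np.asarray(point_2d)))
```

With consistent cameras the error should be a few pixels at most on clean detections. An error in the hundreds of pixels on every keypoint points at the camera convention rather than at the pose model.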

5. Novel view synthesis: Diffuman4D

This is the most expensive step. Diffuman4D takes the 4 input cameras + skeletons and generates 12 synthetic views at 30° intervals to complete the 360° ring. Running 300 frames on a local RTX 4070 Ti would take ~60h, so I've been running it on an A100 SXM4 via Vast.ai instead, which brings it down to ~7h for around $6 total.
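To put those numbers in per-frame terms, here is a quick back-of-the-envelope calculation. It uses only the approximate figures quoted above, so the results are rough.

```python
FRAMES = 300
LOCAL_HOURS = 60.0     # RTX 4070 Ti estimate from the post
CLOUD_HOURS = 7.0      # A100 SXM4 run from the post
CLOUD_COST_USD = 6.0   # total rental cost from the post

speedup = LOCAL_HOURS / CLOUD_HOURS
local_min_per_frame = LOCAL_HOURS * 60 / FRAMES
cloud_min_per_frame = CLOUD_HOURS * 60 / FRAMES
implied_usd_per_hour = CLOUD_COST_USD / CLOUD_HOURS
usd_per_frame = CLOUD_COST_USD / FRAMES

print(f"speedup:        {speedup:.1f}x")                    # 8.6x
print(f"local:          {local_min_per_frame:.1f} min/frame")  # 12.0
print(f"cloud:          {cloud_min_per_frame:.1f} min/frame")  # 1.4
print(f"implied rate:   ${implied_usd_per_hour:.2f}/h")        # $0.86/h
print(f"cost per frame: ${usd_per_frame:.2f}")                 # $0.02
```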

6. 4D Gaussian Splatting: EasyVolcap 4K4D

Sliding-window training across 21 chunks of 15 frames each. Key config that fixed splat explosion: using raw images (not pre-composited) with msk_loss_weight: 0.01 and bg_brightness: 1.0. PSNR averages around 21-22 dB across the sequence.
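For readers less used to reading PSNR figures, here is a minimal sketch of the standard definition, assuming images scaled to [0, 1]. Whether the numbers above are computed over the full frame or only the masked region is not stated, so this is the generic formula rather than EasyVolcap's exact metric code.

```python
import numpy as np


def psnr(rendered: np.ndarray, reference: np.ndarray, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max^2 / MSE)."""
    diff = rendered.astype(np.float64) - reference.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")    # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

On that scale, 21-22 dB corresponds to an RMS pixel error of roughly 0.08-0.09, that is, about 8-9% of the full intensity range.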

7. Export + filtering

300 PLY files exported, then filtered for floating splats using an opacity threshold, a maximum radius from the subject centroid, and a brightness filter to catch the white splat bleed from the white studio background.
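The three filters combine into a single keep-mask. Below is a minimal sketch, assuming the splat attributes are already decoded into NumPy arrays (positions in scene units, opacity and RGB in [0, 1]). All three thresholds are illustrative placeholders, and using the median as the centroid is an assumption made here for robustness to floaters.

```python
import numpy as np


def filter_splats(
    xyz: np.ndarray,       # (N, 3) splat centers
    opacity: np.ndarray,   # (N,) in [0, 1]
    rgb: np.ndarray,       # (N, 3) in [0, 1]
    min_opacity: float = 0.05,     # placeholder threshold
    max_radius: float = 1.5,       # placeholder, in scene units
    max_brightness: float = 0.95,  # placeholder threshold
) -> np.ndarray:
    """Return a boolean mask of the splats to keep."""
    centroid = np.median(xyz, axis=0)   # median is less affected by far floaters
    radius = np.linalg.norm(xyz - centroid, axis=1)
    brightness = rgb.mean(axis=1)

    opaque_enough = opacity >= min_opacity
    near_subject = radius <= max_radius
    not_white_bleed = brightness <= max_brightness
    return opaque_enough & near_subject & not_white_bleed
```

If the exported PLY stores opacity as a logit and color as spherical-harmonic coefficients, as the original 3DGS format does, those need decoding before the thresholds mean anything. A global brightness cutoff will also remove genuinely white clothing, so it may be worth restricting that test to splats near the silhouette.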

Current results

The 360° reconstruction is geometrically coherent, with consistent identity across all views. The main remaining issues are splat artifacts at the silhouette edges and occasional floaters on fast-motion frames. I've also been exploring Vista4D as an alternative to Diffuman4D for the novel view synthesis step. I got it running on 12GB VRAM with 7 code patches (FP8 quantization, CPU offload fixes), but it's not yet viable for full 360° on a single GPU.

What's next

Tighter splat filtering
Testing Forge4D as a potential replacement for the full Diffuman4D step
Exploring relighting in post via RelightableAvatar

Happy to answer questions or share more details on any of the steps. This pipeline is entirely built on open-source tools running on consumer hardware — the only paid component is the occasional Vast.ai GPU rental for the heavy inference steps.

WIP — more updates to come.

[WIP] Building a 4D Gaussian Splatting pipeline with 4 synchronized iPhones — full breakdown