u/Opening-Ad5541

Phosphene 3.0 — open source AI video + image suite for Apple Silicon. Train your own LTX characters.

Sharing Phosphene 3.0. It's a free panel that runs LTX-Video 2.3 and a couple of image models natively on Apple Silicon. Local, MIT license, no subs, no cloud.

The thing that sets it apart from "yet another LTX wrapper": you can **train your own characters** inside the panel. Drop 30 to 80 photos, click Train, get a face LoRA back. Add a voice clip and you get a voice LoRA too. Auto-captions with Gemma 3 12B locally. ~3 hours per character on an M4 Max 64 GB.

**What 3.0 ships**

- Text → video+audio (LTX-2 generates joint audio+video in one pass)

- Image → video+audio

- Audio → video (drive a clip with an audio reference)

- FFLF (first frame + last frame interpolation)

- Extend (continue an existing clip)

- Character training (face + optional voice LoRA, from a single dataset)

- Image Studio with three engines: Qwen-Image-Edit-2511, HiDream-O1, and the FLUX.1 family. Multi-reference composition up to 3 subjects.

**HiDream-O1 ported to MLX**

HiDream released their O1 image model on May 14. Got it running natively on Apple Silicon five days later. Photoreal portraits, instruction edits, multi-subject. ~67 seconds per 1024² on a 64 GB Mac.

**Hardware**

Apple Silicon only. Capability tiers auto-detected:

- 16 / 24 GB: 512 px video, text-to-image works

- 32 GB: 768 px

- 64 GB+: 1024×576 video, full HD image, character training

- A 7-second character clip with synced audio renders in ~6 min on M4 Max 64 GB

- Character training takes ~3 hours per character

**Install**

One-click via Pinokio (search Phosphene). Or clone the repo and run the panel directly.

**Credits**

LTX Video 2.3 by Lightricks (their license on the weights). MLX port by `dgrauet/ltx-2-mlx`. HiDream by HiDream AI. Phosphene the panel is MIT.

**Honest limits**

- Apple Silicon only. No Intel Mac, no Windows, no Linux.

- Dialogue audio is hit-or-miss. Ambient/diegetic sound is where LTX-2 shines.

- Character LoRAs are video-only (face + voice). Image LoRAs work in the Studio via Qwen/HiDream + a separate LoRA stack.

- First run downloads ~28 GB of weights. Takes a while.

Repo: github.com/mrbizarro/phosphene

X: x.com/PhospheneAI

Dev: https://x.com/AIBizarrothe

Feedback welcome. Especially curious what people make with the character training side.

reddit.com
u/Opening-Ad5541 — 16 hours ago

Phosphene is a free desktop panel for generating video on Apple Silicon Macs. It wraps Lightricks' LTX 2.3 model running natively on Apple's MLX framework, and exposes a one-click install through Pinokio.

The differentiator is audio. LTX 2.3 generates video and audio in a single forward pass — they share the same diffusion process, so timing is tied at the frame level. Footsteps land on the correct frame. Lip movement matches dialogue. Ambient sound is conditioned on the visual content. Most other local video models (Wan, Hunyuan, Mochi) generate silent video; you add audio in post.

https://preview.redd.it/vutakjb0vgyg1.png?width=1916&format=png&auto=webp&s=bfde8a7f91b861666196158fbf0f2b76d7d7b828

What it can do

Four generation modes:

  • Text → video — describe a scene, get a 5-second clip with synthesized audio
  • Image → video — start from a still, animate from there with synced audio
  • First-frame / Last-frame — provide two images, the model interpolates the middle
  • Extend — append seconds onto an existing clip, audio continuous across the join

Plus prompt rewriting via a local Gemma 3 12B 4-bit text encoder. The same model that reads your prompt for the diffusion stage can also rewrite it in the format LTX 2.3 was trained on. Runs offline, takes a few seconds.

https://preview.redd.it/3irbyie5vgyg1.jpg?width=1920&format=pjpg&auto=webp&s=bb03a0c8e64899a83af7980847e61e28b75397ca

Quality tiers

Three quality levels, picked per-job:

  • Draft — half resolution, ~2 minutes. For iterating on prompts.
  • Standard — full 1280×704, 7 minutes. The daily driver. Q4 distilled (25 GB on disk).
  • High — Q8 two-stage with TeaCache acceleration, ~12 minutes. Adds ~25 GB. Optional download — a button in the panel pulls it on demand. Required for FFLF.

Hardware compatibility

Apple Silicon only. The panel detects your Mac's RAM at boot and gates features accordingly:

  • 32 GB → Compact: lower resolution, shorter clips
  • 64 GB → Comfortable: full 1280×704 baseline
  • 96 GB → High: longer clips, full Q8
  • 128+ GB → Pro: no clamps

This is enforced because LTX 2.3's working tensor footprint is real — there is no way to run a full 1280×704 5-second generation in less than ~30 GB of resident memory. The tier system is honest about it rather than letting users queue jobs that fall out of the OOM killer.

Intel Macs and other platforms are not supported. There is no port path for them — MLX is Apple-only by design.

Audio behavior

Audio quality is conditioned on the prompt. A visual-only prompt produces faint ambient sound, which can read as "near-silent." A prompt with explicit audio cues produces layered foreground sound.

Compare:

  • "Wizard in forest" → quiet room tone
  • "Wizard in forest, low whispered chant, ember crackle, distant owl hoot" → audible chant + crackle + owl, all timed to the visuals

This is documented behavior of LTX 2.3, not a Phosphene quirk. Describe the soundscape in your prompt the same way you describe the visual.

How it differs from existing tools

Compared to other locally-runnable video models on a Mac:

  • vs. ComfyUI workflows — ComfyUI runs LTX 2.3 too, but in a node graph that requires building per-job. Phosphene is a fixed panel: prompt, mode, dimensions, generate. No graph maintenance.
  • vs. native PyTorch builds (Wan, Mochi, Hunyuan) — those run on torch via MPS, which is a compatibility shim, not native Metal. MLX runs the model directly in Apple's compute framework. The result is meaningful speed and memory differences on the same hardware.
  • vs. cloud / API services (Pika, Runway) — those generate faster on H100s but require accounts, queue time, monthly subscriptions, and upload of source images. Phosphene runs with no network beyond the initial weight download.
  • vs. silent local video models — joint audio synthesis is, at the time of writing, unique to LTX 2.3 among models with usable Mac runtimes.

Output format

Lossless H.264 by default — yuv444p, CRF 0 — so your archive is the highest fidelity the renderer can produce. Web/social platforms will re-encode anyway. Override via env variables (LTX_OUTPUT_PIX_FMT, LTX_OUTPUT_CRF) if you want yuv420p directly.

The +faststart movflag is on, so the moov atom is at the front of the file. Gallery thumbnails decode the first frame instantly without downloading the full clip.

Install

Search Phosphene in Pinokio's Discover tab and click Install. Pinokio handles the venv, Python 3.11 pin, MLX pipeline install, codec patches, and ~31 GB of model downloads (Q4 LTX 2.3 + Gemma text encoder). Resumable — if a download is interrupted, hitting Install again picks up where it left off.

Optional: run "hf auth login" in Terminal first to authenticate the Hugging Face downloads. Anonymous downloads are throttled; authenticated downloads are roughly 10× faster, which matters for the optional 25 GB Q8 model.

[ATTACH VIDEO: phosphene_hero_x.mp4]

License + credits

Phosphene panel: MIT.
LTX 2.3 weights: Lightricks' own license — read it before commercial use.
MLX framework: Apache 2.0 (Apple).
Gemma weights: Google's terms.

Built on:

  • LTX 2.3 model — Lightricks
  • MLX port (ltx-2-mlx) — u/dgrauet
  • MLX framework — Apple ML
  • Pinokio runtime — u/cocktailpeanut

Source: github.com/mrbizarro/phosphene. Issues and PRs welcome.

reddit.com
u/Opening-Ad5541 — 22 days ago
▲ 36 r/opensource+1 crossposts

https://preview.redd.it/ls0zqztvpgyg1.png?width=1916&format=png&auto=webp&s=734c9b9d83ce1def55aa7fc39fc858d3f3618bf5

Phosphene is a free desktop panel for generating video on Apple Silicon Macs. It wraps Lightricks' LTX 2.3 model running natively on Apple's MLX framework, and exposes a one-click install through Pinokio.

The differentiator is audio. LTX 2.3 generates video and audio in a single forward pass — they share the same diffusion process, so timing is tied at the frame level. Footsteps land on the correct frame. Lip movement matches dialogue. Ambient sound is conditioned on the visual content. Most other local video models (Wan, Hunyuan, Mochi) generate silent video; you add audio in post.

https://preview.redd.it/t1aggto2qgyg1.jpg?width=1920&format=pjpg&auto=webp&s=4ac849e37292988fc6fe4c90bcef87d3ffe9af3a

What it can do

Four generation modes:

  • Text → video — describe a scene, get a 5-second clip with synthesized audio
  • Image → video — start from a still, animate from there with synced audio
  • First-frame / Last-frame — provide two images, the model interpolates the middle
  • Extend — append seconds onto an existing clip, audio continuous across the join

Plus prompt rewriting via a local Gemma 3 12B 4-bit text encoder. The same model that reads your prompt for the diffusion stage can also rewrite it in the format LTX 2.3 was trained on. Runs offline, takes a few seconds.

Quality tiers

Three quality levels, picked per-job:

  • Draft — half resolution, ~2 minutes. For iterating on prompts.
  • Standard — full 1280×704, 7 minutes. The daily driver. Q4 distilled (25 GB on disk).
  • High — Q8 two-stage with TeaCache acceleration, ~12 minutes. Adds ~25 GB. Optional download — a button in the panel pulls it on demand. Required for FFLF.

Hardware compatibility

Apple Silicon only. The panel detects your Mac's RAM at boot and gates features accordingly:

  • 32 GB → Compact: lower resolution, shorter clips
  • 64 GB → Comfortable: full 1280×704 baseline
  • 96 GB → High: longer clips, full Q8
  • 128+ GB → Pro: no clamps

This is enforced because LTX 2.3's working tensor footprint is real — there is no way to run a full 1280×704 5-second generation in less than ~30 GB of resident memory. The tier system is honest about it rather than letting users queue jobs that fall out of the OOM killer.

Intel Macs and other platforms are not supported. There is no port path for them — MLX is Apple-only by design.

Audio behavior

Audio quality is conditioned on the prompt. A visual-only prompt produces faint ambient sound, which can read as "near-silent." A prompt with explicit audio cues produces layered foreground sound.

Compare:

  • "Wizard in forest" → quiet room tone
  • "Wizard in forest, low whispered chant, ember crackle, distant owl hoot" → audible chant + crackle + owl, all timed to the visuals

This is documented behavior of LTX 2.3, not a Phosphene quirk. Describe the soundscape in your prompt the same way you describe the visual.

How it differs from existing tools

Compared to other locally-runnable video models on a Mac:

  • vs. ComfyUI workflows — ComfyUI runs LTX 2.3 too, but in a node graph that requires building per-job. Phosphene is a fixed panel: prompt, mode, dimensions, generate. No graph maintenance.
  • vs. native PyTorch builds (Wan, Mochi, Hunyuan) — those run on torch via MPS, which is a compatibility shim, not native Metal. MLX runs the model directly in Apple's compute framework. The result is meaningful speed and memory differences on the same hardware.
  • vs. cloud / API services (Pika, Runway) — those generate faster on H100s but require accounts, queue time, monthly subscriptions, and upload of source images. Phosphene runs with no network beyond the initial weight download.
  • vs. silent local video models — joint audio synthesis is, at the time of writing, unique to LTX 2.3 among models with usable Mac runtimes.

Output format

Lossless H.264 by default — yuv444p, CRF 0 — so your archive is the highest fidelity the renderer can produce. Web/social platforms will re-encode anyway. Override via env variables (LTX_OUTPUT_PIX_FMT, LTX_OUTPUT_CRF) if you want yuv420p directly.

The +faststart movflag is on, so the moov atom is at the front of the file. Gallery thumbnails decode the first frame instantly without downloading the full clip.

Install

Search Phosphene in Pinokio's Discover tab and click Install. Pinokio handles the venv, Python 3.11 pin, MLX pipeline install, codec patches, and ~31 GB of model downloads (Q4 LTX 2.3 + Gemma text encoder). Resumable — if a download is interrupted, hitting Install again picks up where it left off.

Optional: run "hf auth login" in Terminal first to authenticate the Hugging Face downloads. Anonymous downloads are throttled; authenticated downloads are roughly 10× faster, which matters for the optional 25 GB Q8 model.

License + credits

Phosphene panel: MIT.
LTX 2.3 weights: Lightricks' own license — read it before commercial use.
MLX framework: Apache 2.0 (Apple).
Gemma weights: Google's terms.

Built on:

  • LTX 2.3 model — Lightricks
  • MLX port (ltx-2-mlx) — u/dgrauet
  • MLX framework — Apple ML
  • Pinokio runtime — u/cocktailpeanut

Source: https://github.com/mrbizarro/phosphene Issues and PRs welcome.

Follow me on x: https://x.com/AIBizarrothe

reddit.com
u/Opening-Ad5541 — 22 days ago