Base model
JoyAI-Echo is LTX-2.3 fine-tuned by JD.com Future Academy. It inherits native audio and short-clip video generation from the Lightricks base and extends it to long-form coherence.
JoyAI-Echo Model
JoyAI-Echo is an open-source model from JD.com Future Academy that generates multi-shot videos of up to five minutes with native, synchronized audio. It was released in May 2026 under the LTX-2 Community License.
This section is the citation-ready summary of JoyAI-Echo: what it does, what it solves, and why it matters for long-form video generation research.
JoyAI-Echo is an open-source long audio-visual generation framework that produces minute-level multi-shot videos with synchronized voice and music. The model addresses the core problem of long-form video generation: maintaining character appearance, voice timbre, and scene continuity across many shots without resetting identity every clip.
Most prior text-to-video models cap at 5-25 seconds because the underlying diffusion process loses coherence over longer sequences. JoyAI-Echo solves this with a cross-modal audio-visual memory bank that conditions every new shot on prior visual identity and voice context, letting the model carry a single character across a five-minute story.
In the published human evaluation on long-form generation, raters preferred JoyAI-Echo over HappyOyster in 63.6% of judgments for visual aesthetics, 81.7% for audio quality, 80.6% for prompt following, and 59.4% for identity consistency.
JoyAI-Echo is built on top of LTX-2.3 with Gemma-3-12B-IT as the text encoder. The novel contribution lives in the post-training pipeline.
The model uses Gemma-3-12B-IT as the text encoder. The instruction-tuned variant handles structured prompt JSON well for multi-shot story input.
The memory bank stores slot-paired visual and audio embeddings from prior shots, locking face geometry, voice timbre, scene layout, and ambient profile.
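To make the slot-pairing idea concrete, here is a minimal sketch of a bank that always stores a shot's visual and audio embeddings together. It is illustrative only: the class names, the `capacity` limit, the FIFO eviction, and the plain float lists standing in for embeddings are all assumptions, not details taken from the JoyAI-Echo code or paper.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class MemorySlot:
    """One slot pairs the visual and audio embedding taken from the same shot."""
    shot_index: int
    visual: List[float]  # stand-in for a face/scene embedding
    audio: List[float]   # stand-in for a voice/ambience embedding


@dataclass
class MemoryBank:
    capacity: int = 8  # hypothetical limit; the real bank size is not stated
    slots: List[MemorySlot] = field(default_factory=list)

    def write(self, shot_index: int, visual: Sequence[float], audio: Sequence[float]) -> None:
        """Store both modalities in one slot so they stay paired."""
        self.slots.append(MemorySlot(shot_index, list(visual), list(audio)))
        if len(self.slots) > self.capacity:
            self.slots.pop(0)  # simple FIFO eviction, chosen only for illustration

    def context(self) -> List[MemorySlot]:
        """Return the paired slots that would condition the next shot."""
        return list(self.slots)


bank = MemoryBank(capacity=2)
for i in range(3):
    bank.write(i, visual=[float(i)], audio=[float(i) * 0.5])

print([slot.shot_index for slot in bank.context()])  # [1, 2]
```

Keeping both modalities in a single slot means a lookup can never return a face from one shot with a voice from another, which is the property the pairing is meant to protect.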
Memory-based reinforcement learning teaches the model to use memory effectively, while distribution matching distillation (DMD) speeds up inference by about 7.5x without sacrificing benchmark quality.
The published paper isolates four advances that make JoyAI-Echo different from prior long-video research. Each is a separable contribution that the community can build on.
The slot-paired memory mechanism is what lets a single character look and sound the same from shot one to shot fifty. Most prior text-to-video models reset identity every clip; JoyAI-Echo does not.
One pipeline produces synchronized video and audio simultaneously. Lip sync, voice timbre, and ambient sound emerge from the same model pass rather than from a separate TTS or foley stage stitched on afterward.
Distribution matching distillation collapses multi-step diffusion inference into a few-step pipeline. Benchmark quality holds, and minute-long generation becomes feasible in near-real time on H100-class hardware.
A lightweight agent layer accepts natural-language edit instructions like "make shot two darker" or "change the voice to angrier" and re-renders only the changed segments.
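One generic way to re-render only what changed is to fingerprint each shot's specification and compare fingerprints before and after an edit. The sketch below shows that caching technique, not the JoyAI-Echo agent itself. The shot fields (`prompt`, `lighting`) and both function names are hypothetical, and the step that turns a natural-language instruction into a spec change is assumed to have already happened.

```python
import hashlib
import json
from typing import Dict, List


def shot_key(shot: Dict[str, str]) -> str:
    """Stable fingerprint of a shot spec; any field change yields a new key."""
    payload = json.dumps(shot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def shots_to_rerender(before: List[Dict[str, str]], after: List[Dict[str, str]]) -> List[int]:
    """Return 1-based indices of shots whose spec differs after an edit."""
    changed = []
    for index, (old, new) in enumerate(zip(before, after), start=1):
        if shot_key(old) != shot_key(new):
            changed.append(index)
    # Shots appended by the edit have no cached render, so they are always rendered.
    changed.extend(range(len(before) + 1, len(after) + 1))
    return changed


story = [
    {"prompt": "chef enters kitchen", "lighting": "warm"},
    {"prompt": "chef chops onions", "lighting": "warm"},
    {"prompt": "chef plates the dish", "lighting": "warm"},
]
# Hypothetical result of interpreting "make shot two darker".
edited = [dict(shot) for shot in story]
edited[1]["lighting"] = "dark"

print(shots_to_rerender(story, edited))  # [2]
```

In a memory-conditioned model, later shots may also depend on the edited shot; this sketch ignores that dependency and only detects direct spec changes.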
The published JoyAI-Echo paper reports a Good-Same-Bad human evaluation across 100 stories and 3,000 evaluated shots.
| Dimension | JoyAI-Echo win | Same | HappyOyster win |
|---|---|---|---|
| Visual aesthetics | 63.6% | 8.8% | 27.6% |
| Audio quality | 81.7% | 6.5% | 11.8% |
| Prompt following | 80.6% | 13.5% | 5.9% |
| Identity consistency | 59.4% | 12.9% | 27.7% |

Against Wan 2.6:

| Dimension | JoyAI-Echo win | Same | Wan 2.6 win |
|---|---|---|---|
| Visual aesthetics | 58.8% | 14.7% | 26.5% |
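For readers who want to compare the Good-Same-Bad rows on a single number, the short script below derives two common summaries from the HappyOyster table above: the net margin (win minus loss) and the win rate among non-tied judgments. These derived metrics are a convenience for reading the table, not figures reported by the paper, and `decisive_win_rate` is a name introduced here.

```python
from typing import Dict, Tuple

# (JoyAI-Echo win %, same %, HappyOyster win %) copied from the HappyOyster table above.
GSB_VS_HAPPYOYSTER: Dict[str, Tuple[float, float, float]] = {
    "Visual aesthetics": (63.6, 8.8, 27.6),
    "Audio quality": (81.7, 6.5, 11.8),
    "Prompt following": (80.6, 13.5, 5.9),
    "Identity consistency": (59.4, 12.9, 27.7),
}


def decisive_win_rate(win: float, same: float, loss: float) -> float:
    """Share of non-tied judgments won, in percent."""
    return 100.0 * win / (win + loss)


for dimension, (win, same, loss) in GSB_VS_HAPPYOYSTER.items():
    # Each row should account for all judgments.
    assert abs(win + same + loss - 100.0) < 0.05
    print(f"{dimension}: net {win - loss:+.1f} pts, "
          f"{decisive_win_rate(win, same, loss):.1f}% of decisive votes")
```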
Benchmark scale: 100 multi-shot stories, 3,000 evaluated shots, 241 frames per shot at 25 FPS, and default output at 1280 x 736. The evaluation covers cooking scenes, action, fantasy stories, animation, talking-head explainers, and game cinematic styles.
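The benchmark figures can be cross-checked against the five-minute claim with simple arithmetic. The calculation below assumes shots are spread evenly across stories and played back to back with no overlap; neither assumption is stated in the source.

```python
STORIES = 100
SHOTS = 3_000
FRAMES_PER_SHOT = 241
FPS = 25

shots_per_story = SHOTS / STORIES          # 30 shots per story
seconds_per_shot = FRAMES_PER_SHOT / FPS   # 9.64 seconds per shot
seconds_per_story = shots_per_story * seconds_per_shot

print(f"{shots_per_story:.0f} shots/story, {seconds_per_shot:.2f} s/shot, "
      f"{seconds_per_story / 60:.2f} min/story")
# prints: 30 shots/story, 9.64 s/shot, 4.82 min/story
```

Under those assumptions an average benchmark story runs just under five minutes, which is consistent with the stated maximum coherent length.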
What it takes to run JoyAI-Echo locally. Numbers are from the official GitHub README; confirm against the latest release tag before provisioning hardware.
| Spec | Value |
|---|---|
| Base architecture | LTX-2.3 (Lightricks) |
| Text encoder | Gemma-3-12B-IT (~24 GB) |
| Model weight format | Safetensors (~46 GB) |
| Peak GPU usage | 46-50 GB VRAM (H100 or A100-class card) |
| Default output resolution | 1280 x 736, with optional super-resolution upscale |
| Frames per shot | 241 frames at 25 FPS |
| Max coherent length | 5 minutes, multi-shot |
| Framework | Python 3.11, PyTorch 2.8, CUDA 12.8 |
| Other dependencies | ffmpeg for shot concatenation |
| Inference speedup | ~7.5x via DMD distillation |
| License | LTX-2 Community License, academic and non-commercial use |
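Before downloading roughly 70 GB of weights, it can help to check the local environment against the table. The following preflight is a sketch, not part of the official repository: the `preflight` function and its report keys are made up for illustration. It uses only standard-library calls plus PyTorch's public `torch.cuda` API, and it still runs where PyTorch is not installed.

```python
import shutil
import sys
from typing import Dict


def preflight() -> Dict[str, object]:
    """Collect facts about the local environment without failing hard."""
    report: Dict[str, object] = {
        "python_ok": sys.version_info[:2] == (3, 11),  # table lists Python 3.11
        "ffmpeg_found": shutil.which("ffmpeg") is not None,
        "torch_version": None,
        "cuda_version": None,
        "vram_gb": None,
    }
    try:
        import torch  # optional: the check still runs where PyTorch is absent
    except ImportError:
        return report

    report["torch_version"] = torch.__version__
    report["cuda_version"] = torch.version.cuda  # None on CPU-only builds
    if torch.cuda.is_available():
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        report["vram_gb"] = round(total_bytes / 1024**3, 1)
    return report


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

Compare the reported `torch_version`, `cuda_version`, and `vram_gb` with the table rows for PyTorch 2.8, CUDA 12.8, and the 46-50 GB peak.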
If you do not have an H100 or A100, the JoyAI-Echo AI Video Generator runs the same model on cloud hardware behind a browser.
JoyAI-Echo is built on LTX-2.3: the same base, fine-tuned and extended for long-form coherence.
| Capability | LTX-2.3 base | JoyAI-Echo fine-tune |
|---|---|---|
| Max coherent length | 20 seconds | 5 minutes |
| Multi-shot consistency | Single-clip only | Cross-shot memory bank |
| Audio | Stereo 24 kHz, native | Stereo 24 kHz, native + voice timbre lock |
| Inference speed | Baseline | ~7.5x faster via DMD |
| Output resolution | 4K (3840 x 2160) | 1280 x 736 + super-resolution to HD |
| License | Apache 2.0 | LTX-2 Community License |
| Best for | Short cinematic shots at 4K | Long multi-shot stories with identity continuity |
Use LTX-2.3 for short cinematic clips where 4K matters. Use JoyAI-Echo for long-form storytelling where character identity and voice need to hold across many shots.
Common technical questions about the open-source JoyAI-Echo model: license, hardware, base architecture, and how it compares to LTX-2.3 and Wan 2.6.
Yes, JoyAI-Echo is open source. The model weights, inference code, and reference examples are published on Hugging Face and GitHub under the LTX-2 Community License for academic and non-commercial use.
Running JoyAI-Echo locally takes about 46-50 GB of VRAM. That means an H100, an 80 GB A100, or an A6000 with very little headroom. Consumer GPUs such as the RTX 4090 cannot run JoyAI-Echo locally without CPU offloading.
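The VRAM guidance reduces to a comparison against the reported 46-50 GB peak. The helper below is a rough sketch: `vram_verdict` and its three labels are made up for illustration, and the memory sizes are the cards' nominal capacities, not measured free memory.

```python
PEAK_LOW_GB = 46.0   # lower bound of reported peak usage
PEAK_HIGH_GB = 50.0  # upper bound of reported peak usage


def vram_verdict(card_vram_gb: float) -> str:
    """Classify a card against the reported 46-50 GB peak."""
    if card_vram_gb >= PEAK_HIGH_GB:
        return "fits"
    if card_vram_gb >= PEAK_LOW_GB:
        return "tight"
    return "needs CPU offloading"


# Nominal memory sizes of the cards discussed in this answer.
for name, vram in [("H100 80 GB", 80), ("RTX A6000", 48), ("RTX 4090", 24)]:
    print(f"{name}: {vram_verdict(vram)}")
```

A "tight" verdict means the card falls inside the reported peak range, so whether a run fits depends on the specific workload and on what else is using the GPU.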
JoyAI-Echo is built on LTX-2.3, the 22B-parameter open-source video diffusion model from Lightricks. The text encoder is Gemma-3-12B-IT.
Use LTX-2.3 for short 4K cinematic clips with a permissive license. Use JoyAI-Echo for long multi-shot stories where character and voice continuity matter.
Commercial use is not permitted directly under the open-source license. Commercial deployments require a separate license path or the commercial license included with paid tiers of the hosted JoyAI-Echo AI Video Generator.
The paper is available from the official GitHub repository, with mirrors on ResearchGate and the project page.
The same JoyAI-Echo model, behind a browser. No GPU, no install, no waitlist. The free tier covers 60 seconds of finished video per month; paid plans remove the watermark and add longer renders and a commercial license.