JoyAI-Echo Model

JoyAI-Echo - The Open-Source Long Audio-Visual Generation Model

Name: JoyAI-Echo
Author: JD.com Future Academy

JoyAI-Echo is an open-source model from JD.com Future Academy that generates multi-shot videos of up to five minutes with native, synchronized audio. It was released in May 2026 under the LTX-2 Community License.

22B-scale audio-video foundation model from the JD.com video AI program

5-minute multi-shot generation

7.5x faster than the original pipeline

LTX-2 Community License for academic and non-commercial use

Abstract: What is JoyAI-Echo?

This section is the citation-ready summary of JoyAI-Echo - what it does, what it solves, and why it matters for long-form video generation research.

JoyAI-Echo is an open-source long audio-visual generation framework that produces minute-level multi-shot videos with synchronized voice and music. The model addresses the core problem of long-form video generation: maintaining character appearance, voice timbre, and scene continuity across many shots without resetting identity every clip.

Most prior text-to-video models cap at 5-25 seconds because the underlying diffusion process loses coherence over longer sequences. JoyAI-Echo solves this with a cross-modal audio-visual memory bank that conditions every new shot on prior visual identity and voice context, letting the model carry a single character across a five-minute story.

In the published human evaluation of long-form generation, evaluators preferred JoyAI-Echo over HappyOyster in 63.6% of comparisons for visual aesthetics, 81.7% for audio quality, 80.6% for prompt following, and 59.4% for identity consistency.

JoyAI-Echo architecture: LTX-2.3 base + cross-modal memory bank

JoyAI-Echo is built on top of LTX-2.3 with Gemma-3-12B-IT as the text encoder. The novel contribution lives in the cross-modal memory bank and the post-training pipeline.

Base model

JoyAI-Echo is LTX-2.3 fine-tuned by JD.com Future Academy. It inherits native audio and short-clip video generation from the Lightricks base and extends it to long-form coherence.

Gemma encoder

The model uses Gemma-3-12B-IT as the text encoder. The instruction-tuned variant handles structured prompt JSON well for multi-shot story input.

Cross-modal memory bank

The memory bank stores slot-paired visual and audio embeddings from prior shots, locking face geometry, voice timbre, scene layout, and ambient profile.
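
As a rough illustration of what "slot-paired" storage could look like, here is a minimal sketch. The class names, the slot capacity, and the first-in-first-out eviction rule are all assumptions made for this example; the source does not describe the real data layout.

```python
from dataclasses import dataclass, field


@dataclass
class MemorySlot:
    """One slot pairs the visual and audio embeddings of the same prior shot."""
    shot_index: int
    visual: list[float]  # stands in for face geometry / scene layout features
    audio: list[float]   # stands in for voice timbre / ambient profile features


@dataclass
class MemoryBank:
    """Fixed-capacity store of slot-paired embeddings (hypothetical design)."""
    capacity: int = 8  # assumption: the real slot count is not stated in the source
    slots: list[MemorySlot] = field(default_factory=list)

    def write(self, shot_index, visual, audio):
        # Visual and audio are written together so they always stay paired.
        self.slots.append(MemorySlot(shot_index, list(visual), list(audio)))
        # Evict the oldest slot when over capacity (FIFO is an assumption).
        if len(self.slots) > self.capacity:
            self.slots.pop(0)

    def context(self):
        """Return the paired conditioning context for the next shot."""
        return [(s.visual, s.audio) for s in self.slots]


bank = MemoryBank(capacity=2)
bank.write(0, [0.1, 0.2], [0.9])
bank.write(1, [0.3, 0.4], [0.8])
bank.write(2, [0.5, 0.6], [0.7])  # pushes shot 0 out of the bank
print([s.shot_index for s in bank.slots])
```

The point of pairing is that a shot's face and voice features are read and evicted as a unit, so the next shot is never conditioned on one without the other.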

Post-training pipeline

Memory-based reinforcement learning teaches the model to use memory effectively, while distribution matching distillation (DMD) speeds up inference by about 7.5x without sacrificing benchmark quality.

Four key contributions of JoyAI-Echo

The published paper isolates four advances that make JoyAI-Echo different from prior long-video research. Each is a separable contribution that the community can build on.

Cross-modal audio-visual memory bank for character + voice consistency

The slot-paired memory mechanism is what lets a single character look and sound the same from shot one to shot fifty. Most prior text-to-video models reset identity every clip; JoyAI-Echo does not.

Joint audio-video generation from a single prompt JSON

One pipeline produces synchronized video and audio simultaneously. Lip sync, voice timbre, and ambient sound emerge from the same model pass, not a separate TTS or foley stage stitched on after.
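
To make "a single prompt JSON" concrete, the sketch below builds one story description that carries both visual and audio fields per shot. Every field name here is a hypothetical placeholder; the actual schema should be taken from the official repository.

```python
import json

# Hypothetical schema: all keys below are illustrative assumptions,
# not the format documented by the JoyAI-Echo project.
story = {
    "characters": [
        {"id": "chef", "appearance": "woman in her 40s, white apron",
         "voice": "warm, low alto"},
    ],
    "shots": [
        {"index": 1, "scene": "small kitchen, morning light", "speaker": "chef",
         "dialogue": "Today we start with the dough.", "ambient": "quiet room tone"},
        {"index": 2, "scene": "close-up of hands kneading", "speaker": "chef",
         "dialogue": "Fold, press, turn.", "ambient": "soft kneading sounds"},
    ],
}


def undeclared_speakers(story):
    """Return the indices of shots whose speaker is not a declared character."""
    known = {c["id"] for c in story["characters"]}
    return [s["index"] for s in story["shots"] if s["speaker"] not in known]


prompt_json = json.dumps(story, indent=2)
print(undeclared_speakers(story))  # an empty list means every speaker is declared
```

Declaring characters once and referencing them by id in each shot is one plausible way to give a memory mechanism a stable identity to track across shots.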

DMD-distilled inference for ~7.5x speedup

Distribution matching distillation (DMD) collapses multi-step diffusion inference into a few-step pipeline. Benchmark quality holds, and minute-long generation becomes feasible in near-real time on H100-class hardware.

Interactive conversational agent for real-time editing

A lightweight agent layer accepts natural-language edit instructions like "make shot two darker" or "make the voice angrier" and re-renders only the changed segments.
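
One simple way to decide which segments count as "changed" is to diff the story state before and after an edit. The sketch below assumes each shot is a plain dictionary and that the agent has already turned the instruction into an updated shot list. It ignores any knock-on effect on later shots that condition on the edited one through memory, since the source does not say how that is handled.

```python
def shots_to_rerender(old_shots, new_shots):
    """Return 1-based indices of shots whose description changed or was added."""
    changed = []
    for i, new in enumerate(new_shots):
        old = old_shots[i] if i < len(old_shots) else None
        if new != old:
            changed.append(i + 1)
    return changed


before = [
    {"scene": "kitchen", "lighting": "bright", "voice_mood": "calm"},
    {"scene": "street", "lighting": "bright", "voice_mood": "calm"},
    {"scene": "rooftop", "lighting": "dusk", "voice_mood": "calm"},
]
# Hypothetical result of applying "make shot two darker" to the story state.
after = [dict(shot) for shot in before]
after[1]["lighting"] = "dark"

print(shots_to_rerender(before, after))
```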

JoyAI-Echo benchmarks: human-evaluator win rates

The published JoyAI-Echo paper reports a Good-Same-Bad human evaluation across 100 stories and 3,000 evaluated shots.

Dimension	JoyAI-Echo win	Same	HappyOyster win
Visual aesthetics	63.6%	8.8%	27.6%
Audio quality	81.7%	6.5%	11.8%
Prompt following	80.6%	13.5%	5.9%
Identity consistency	59.4%	12.9%	27.7%

Dimension	JoyAI-Echo win	Same	Wan 2.6 win
Visual aesthetics	58.8%	14.7%	26.5%
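
The win/same/loss rows above can be condensed into a net margin and a win share among decisive (non-tied) judgments. This is a reader-side calculation on the published percentages, not a metric reported in the paper.

```python
# (win %, same %, loss %) copied from the two tables above.
results = {
    ("HappyOyster", "Visual aesthetics"): (63.6, 8.8, 27.6),
    ("HappyOyster", "Audio quality"): (81.7, 6.5, 11.8),
    ("HappyOyster", "Prompt following"): (80.6, 13.5, 5.9),
    ("HappyOyster", "Identity consistency"): (59.4, 12.9, 27.7),
    ("Wan 2.6", "Visual aesthetics"): (58.8, 14.7, 26.5),
}


def decisive_win_share(win, same, loss):
    """Percentage of non-tied judgments that favored JoyAI-Echo."""
    return 100.0 * win / (win + loss)


for (baseline, dimension), (win, same, loss) in results.items():
    # Sanity check: each row should account for 100% of judgments.
    assert abs(win + same + loss - 100.0) < 0.05
    share = decisive_win_share(win, same, loss)
    print(f"{dimension} vs {baseline}: net {win - loss:+.1f} pts, "
          f"{share:.1f}% of decisive votes")
```

Excluding ties makes the dimensions easier to compare, because the "Same" rate varies from 6.5% to 14.7% across rows.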

Benchmark scale: 100 multi-shot stories, 3,000 evaluated shots, 241 frames per shot at 25 FPS, and default output at 1280 x 736. The evaluation covers cooking scenes, action, fantasy stories, animation, talking-head explainers, and game cinematic styles.
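
The scale figures above imply a shot length and a per-story duration. The arithmetic below assumes equal-length shots joined back to back with no overlap or transition frames, and an equal number of shots per story, neither of which the source states explicitly.

```python
import math

FRAMES_PER_SHOT = 241
FPS = 25
STORIES = 100
SHOTS_EVALUATED = 3000

shot_seconds = FRAMES_PER_SHOT / FPS                     # 9.64 s per shot
shots_per_story = SHOTS_EVALUATED // STORIES             # 30 shots per story
story_seconds = shots_per_story * shot_seconds           # about 289 s, or 4.8 minutes
shots_for_five_minutes = math.ceil(300 / shot_seconds)   # shots needed to cover 300 s

print(shot_seconds, shots_per_story, round(story_seconds, 1), shots_for_five_minutes)
```

Under these assumptions, a 30-shot benchmark story runs just under the five-minute maximum.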

JoyAI-Echo technical specs

What it takes to run JoyAI-Echo locally. Numbers are from the official GitHub README - confirm against the latest release tag before provisioning hardware.

Spec	Value
Base architecture	LTX-2.3 (Lightricks)
Text encoder	Gemma-3-12B-IT (~24 GB)
Model weight format	Safetensors (~46 GB)
Peak GPU usage	46-50 GB VRAM (H100 or A100-class card)
Default output resolution	1280 x 736, with optional super-resolution upscale
Frames per shot	241 frames at 25 FPS
Max coherent length	5 minutes, multi-shot
Framework	Python 3.11, PyTorch 2.8, CUDA 12.8
Other dependencies	ffmpeg for shot concatenation
Inference speedup	~7.5x via DMD distillation
License	LTX-2 Community License, academic and non-commercial use
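
The table lists ffmpeg as the dependency for shot concatenation. One common way to join clips is ffmpeg's concat demuxer with stream copy. The sketch below only builds the list file contents and the command line. The file names are hypothetical, and whether the official scripts use this exact invocation is an assumption.

```python
import shlex
from pathlib import Path


def build_concat(shot_paths, list_file="shots.txt", output="story.mp4"):
    """Return (concat list text, ffmpeg argv) for joining shots in order."""
    # The concat demuxer reads one "file '<path>'" directive per line.
    # Paths containing single quotes would need escaping; not handled here.
    listing = "".join(f"file '{Path(p).as_posix()}'\n" for p in shot_paths)
    argv = [
        "ffmpeg",
        "-f", "concat",  # select the concat demuxer
        "-safe", "0",    # accept arbitrary paths in the list file
        "-i", list_file,
        "-c", "copy",    # stream copy: no re-encode, assumes identical codecs
        output,
    ]
    return listing, argv


listing, argv = build_concat(["shot_001.mp4", "shot_002.mp4"])
print(listing, end="")
print(shlex.join(argv))
```

To actually run it, write `listing` to the list file and pass `argv` to `subprocess.run(argv, check=True)`. Stream copy only works when every shot shares the same codec, resolution, and frame rate, which should hold for shots rendered by one pipeline with fixed settings.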

If you do not have an H100 or A100, the JoyAI-Echo AI Video Generator runs the same model on cloud hardware behind a browser.

JoyAI-Echo vs LTX-2.3: what's added on top of the base model

JoyAI-Echo is built on LTX-2.3 - the same base, fine-tuned and extended for long-form coherence.

Capability	LTX-2.3 base	JoyAI-Echo fine-tune
Max coherent length	20 seconds	5 minutes
Multi-shot consistency	Single-clip only	Cross-shot memory bank
Audio	Stereo 24 kHz, native	Stereo 24 kHz, native + voice timbre lock
Inference speed	Baseline	~7.5x faster via DMD
Output resolution	4K (3840 x 2160)	1280 x 736 + super-resolution to HD
License	Apache 2.0	LTX-2 Community License
Best for	Short cinematic shots at 4K	Long multi-shot stories with identity continuity

Use LTX-2.3 for short cinematic clips where 4K matters. Use JoyAI-Echo for long-form storytelling where character identity and voice need to hold across many shots.

JoyAI-Echo FAQ

Common technical questions about the open-source JoyAI-Echo model - license, hardware, base architecture, and how it compares to LTX-2.3 and Wan 2.6.

Is JoyAI-Echo open source?

Yes. The model weights, inference code, and reference examples are open-sourced on Hugging Face and GitHub under the LTX-2 Community License for academic and non-commercial use.

What hardware does JoyAI-Echo need?

About 46-50 GB of VRAM at peak. That means an H100, an A100, or an A6000 with very little headroom. Consumer GPUs like the RTX 4090 cannot run JoyAI-Echo locally without CPU offloading.
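
A quick pre-flight check against the quoted peak range might look like the following. The three verdict labels are this example's own, and the card sizes are nominal memory capacities.

```python
PEAK_VRAM_GB = (46, 50)  # peak usage range quoted in the spec table


def vram_verdict(card_vram_gb):
    """Classify a card against the quoted 46-50 GB peak usage."""
    low, high = PEAK_VRAM_GB
    if card_vram_gb >= high:
        return "fits"
    if card_vram_gb >= low:
        return "tight"
    return "needs CPU offloading"


# Nominal memory sizes for common cards.
cards = {"H100 80GB": 80, "A100 80GB": 80, "RTX A6000": 48, "RTX 4090": 24}
for name, gb in cards.items():
    print(f"{name}: {vram_verdict(gb)}")
```

On a machine with PyTorch installed, `torch.cuda.get_device_properties(0).total_memory` reports the real capacity in bytes. Note that the 40 GB variant of the A100 falls below the quoted range.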

What model is JoyAI-Echo built on?

JoyAI-Echo is built on LTX-2.3, the 22B-parameter open-source video diffusion model from Lightricks. The text encoder is Gemma-3-12B-IT.

JoyAI-Echo vs LTX-2.3 - which should I use?

Use LTX-2.3 for short 4K cinematic clips with a permissive license. Use JoyAI-Echo for long multi-shot stories where character and voice continuity matter.

Can I use JoyAI-Echo commercially?

Not under the open-source license, which covers academic and non-commercial use only. Commercial deployments need either a separately negotiated license or the commercial license included with paid tiers of the hosted JoyAI-Echo AI Video Generator.

Where is the JoyAI-Echo paper?

The paper is available from the official GitHub repository, with mirrors on ResearchGate and the project page.

Don't have an H100? Try the hosted JoyAI-Echo AI Video Generator

The same JoyAI-Echo model, behind a browser. No GPU, no install, no waitlist. The free tier covers 60 seconds of finished video per month; paid plans remove the watermark, allow longer renders, and include a commercial license.
