JoyAI-Echo Model

JoyAI-Echo - The Open-Source Long Audio-Visual Generation Model

JoyAI-Echo is an open-source model from JD.com Future Academy that generates multi-shot videos of up to five minutes with native, synchronized audio. It was released in May 2026 under the LTX-2 Community License.

22B-scale audio-video foundation model from the JD.com video AI program
5-minute multi-shot generation
About 7.5x faster inference via DMD distillation
LTX-2 Community License for academic and non-commercial use

Abstract: What is JoyAI-Echo?

This section is the citation-ready summary of JoyAI-Echo - what it does, what it solves, and why it matters for long-form video generation research.

JoyAI-Echo is an open-source long audio-visual generation framework that produces minute-level multi-shot videos with synchronized voice and music. The model addresses the core problem of long-form video generation: maintaining character appearance, voice timbre, and scene continuity across many shots without resetting identity every clip.

Most prior text-to-video models cap at 5-25 seconds because the underlying diffusion process loses coherence over longer sequences. JoyAI-Echo solves this with a cross-modal audio-visual memory bank that conditions every new shot on prior visual identity and voice context, letting the model carry a single character across a five-minute story.

In the published human evaluation, evaluators preferred JoyAI-Echo over HappyOyster on long-form generation in 63.6% of comparisons for visual aesthetics, 81.7% for audio quality, 80.6% for prompt following, and 59.4% for identity consistency.

JoyAI-Echo architecture: LTX-2.3 base + cross-modal memory bank

JoyAI-Echo is built on top of LTX-2.3 with Gemma-3-12B-IT as the text encoder. The novel contribution lives in the post-training pipeline.

Base model

JoyAI-Echo is LTX-2.3 fine-tuned by JD.com Future Academy. It inherits native audio and short-clip video generation from the Lightricks base and extends it to long-form coherence.

Gemma encoder

The model uses Gemma-3-12B-IT as the text encoder. The instruction-tuned variant handles structured prompt JSON well for multi-shot story input.
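As a rough illustration of what a structured multi-shot prompt could look like, the sketch below builds a small story as JSON and checks it for basic consistency. The schema is an assumption made for this example: the field names (`characters`, `shots`, `character_ids`, `dialogue`, `ambient`) and the `validate_story` helper are hypothetical and are not taken from the JoyAI-Echo repository.

```python
import json

# Hypothetical prompt schema. Every field name below is an illustrative
# assumption, not the format documented by the JoyAI-Echo project.
story = {
    "title": "Morning at the noodle stall",
    "characters": [
        {"id": "chef", "appearance": "woman in her 40s, red apron",
         "voice": "warm, low alto"},
    ],
    "shots": [
        {"index": 1, "character_ids": ["chef"],
         "visual": "Wide shot of the stall at dawn",
         "dialogue": "", "ambient": "street noise, boiling water"},
        {"index": 2, "character_ids": ["chef"],
         "visual": "Close-up as she pulls noodles",
         "dialogue": "First bowl of the day.", "ambient": "sizzling wok"},
    ],
}


def validate_story(story: dict) -> list[str]:
    """Return a list of problems; an empty list means the prompt looks consistent."""
    problems = []
    known_ids = {c["id"] for c in story.get("characters", [])}
    for expected, shot in enumerate(story.get("shots", []), start=1):
        if shot.get("index") != expected:
            problems.append(f"shot {expected}: index out of order")
        for cid in shot.get("character_ids", []):
            if cid not in known_ids:
                problems.append(f"shot {expected}: unknown character {cid!r}")
    return problems


# Serialize only after the structure passes the checks.
if not validate_story(story):
    prompt_json = json.dumps(story, indent=2)
```

Declaring characters once and referring to them by id in each shot mirrors the idea of carrying one identity across many shots, but the real input format should be taken from the official examples.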

Cross-modal memory bank

The memory bank stores slot-paired visual and audio embeddings from prior shots, locking face geometry, voice timbre, scene layout, and ambient profile.
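One way to picture slot pairing is a small container in which each slot holds the visual and audio embeddings of the same prior shot, so the two are always written and read together. This is a minimal sketch under stated assumptions: the class names, the list-of-floats stand-in for embeddings, the fixed capacity, and the first-in-first-out eviction are all hypothetical, and the source does not describe the actual memory layout or eviction policy.

```python
from dataclasses import dataclass, field


@dataclass
class MemorySlot:
    shot_index: int
    visual: list[float]  # stand-in for a visual identity/scene embedding
    audio: list[float]   # stand-in for a voice timbre/ambience embedding


@dataclass
class MemoryBank:
    capacity: int = 8    # hypothetical limit; the real size is not published here
    slots: list[MemorySlot] = field(default_factory=list)

    def write(self, shot_index: int, visual: list[float], audio: list[float]) -> None:
        """Store both modalities in one slot so they can never drift apart."""
        self.slots.append(MemorySlot(shot_index, list(visual), list(audio)))
        if len(self.slots) > self.capacity:
            self.slots.pop(0)  # FIFO eviction, chosen only for illustration

    def context(self) -> list[tuple[list[float], list[float]]]:
        """Return paired (visual, audio) embeddings to condition the next shot."""
        return [(slot.visual, slot.audio) for slot in self.slots]


bank = MemoryBank(capacity=2)
bank.write(1, [0.1, 0.2], [0.9])
bank.write(2, [0.1, 0.3], [0.8])
bank.write(3, [0.2, 0.3], [0.7])  # evicts shot 1 under the FIFO assumption
```

Keeping both embeddings in a single slot means that whenever a face reference is retrieved, the matching voice reference comes with it.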

Post-training pipeline

Memory-based reinforcement learning teaches the model to use memory effectively, while distribution matching distillation (DMD) speeds up inference by about 7.5x without sacrificing benchmark quality.

Four key contributions of JoyAI-Echo

The published paper isolates four advances that make JoyAI-Echo different from prior long-video research. Each is a separable contribution that the community can build on.

1. Cross-modal audio-visual memory bank for character + voice consistency

The slot-paired memory mechanism is what lets a single character look and sound the same from shot one to shot fifty. Most prior text-to-video models reset identity every clip; JoyAI-Echo does not.

2. Joint audio-video generation from a single prompt JSON

One pipeline produces synchronized video and audio simultaneously. Lip sync, voice timbre, and ambient sound emerge from the same model pass, not a separate TTS or foley stage stitched on after.

3. DMD-distilled inference for ~7.5x speedup

Distribution matching distillation collapses multi-step diffusion inference into a few-step pipeline. Benchmark quality holds, and minute-long generation becomes feasible in near-real time on H100-class hardware.
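To see why cutting sampling steps translates into a speedup of this size, the sketch below compares the number of denoiser evaluations before and after distillation. It assumes every evaluation costs the same and ignores fixed overhead such as text encoding and decoding. The step counts are hypothetical values chosen to illustrate the ratio; the source does not state the actual step counts used by JoyAI-Echo.

```python
def estimated_speedup(baseline_steps: int, distilled_steps: int) -> float:
    """Ratio of denoiser evaluations, assuming equal cost per evaluation."""
    if baseline_steps <= 0 or distilled_steps <= 0:
        raise ValueError("step counts must be positive")
    return baseline_steps / distilled_steps


# Hypothetical step counts, for illustration only.
print(estimated_speedup(baseline_steps=30, distilled_steps=4))  # 7.5
```

In practice the measured speedup also depends on guidance settings and per-step overhead, so a step ratio is only a first-order estimate.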

4. Interactive conversational agent for real-time editing

A lightweight agent layer accepts natural-language edit instructions like "make shot two darker" or "change the voice to angrier" and re-renders only the changed segments.
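The "re-render only the changed segments" idea can be sketched as a diff between the shot list before and after an edit. This is a simplified illustration, not the agent's actual mechanism: the shot dictionaries and the `changed_shots` helper are hypothetical, and the step that turns a natural-language instruction into an edited shot list is assumed to have already happened.

```python
def changed_shots(before: list[dict], after: list[dict]) -> list[int]:
    """Return 1-based indices of shots whose specification differs."""
    changed = []
    for i in range(max(len(before), len(after))):
        old = before[i] if i < len(before) else None
        new = after[i] if i < len(after) else None
        if old != new:
            changed.append(i + 1)
    return changed


before = [
    {"visual": "Wide shot of the harbor", "lighting": "golden hour"},
    {"visual": "Captain at the wheel", "lighting": "golden hour"},
    {"visual": "Storm clouds gather", "lighting": "overcast"},
]
# Result of an instruction such as "make shot two darker".
after = [dict(shot) for shot in before]
after[1]["lighting"] = "low-key, dark"

print(changed_shots(before, after))  # [2]
```

In a memory-conditioned model, shots that follow an edited shot may also depend on it, so a real system might need to re-render more than the diff alone suggests; that dependency is not modeled here.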

JoyAI-Echo benchmarks: human-evaluator win rates

The published JoyAI-Echo paper reports a Good-Same-Bad human evaluation across 100 stories and 3,000 evaluated shots.

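| Dimension | JoyAI-Echo win | Same | HappyOyster win |
| --- | --- | --- | --- |
| Visual aesthetics | 63.6% | 8.8% | 27.6% |
| Audio quality | 81.7% | 6.5% | 11.8% |
| Prompt following | 80.6% | 13.5% | 5.9% |
| Identity consistency | 59.4% | 12.9% | 27.7% |

| Dimension | JoyAI-Echo win | Same | Wan 2.6 win |
| --- | --- | --- | --- |
| Visual aesthetics | 58.8% | 14.7% | 26.5% |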
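Because a Good-Same-Bad evaluation includes ties, it can help to look at the share of decisive judgments as well as the raw win rate. The sketch below takes the win, same, and loss percentages reported above, checks that each row sums to 100%, and computes win / (win + loss). The "decisive win rate" is a derived figure added here for illustration; it is not a metric reported in the paper.

```python
# (baseline, dimension) -> (JoyAI-Echo win %, same %, baseline win %)
results = {
    ("HappyOyster", "Visual aesthetics"): (63.6, 8.8, 27.6),
    ("HappyOyster", "Audio quality"): (81.7, 6.5, 11.8),
    ("HappyOyster", "Prompt following"): (80.6, 13.5, 5.9),
    ("HappyOyster", "Identity consistency"): (59.4, 12.9, 27.7),
    ("Wan 2.6", "Visual aesthetics"): (58.8, 14.7, 26.5),
}


def decisive_win_rate(win: float, same: float, loss: float) -> float:
    """Share of non-tied judgments that favored JoyAI-Echo."""
    return win / (win + loss)


for (baseline, dimension), (win, same, loss) in results.items():
    # Each row should account for all judgments, up to rounding.
    assert abs(win + same + loss - 100.0) < 0.05
    rate = decisive_win_rate(win, same, loss)
    print(f"{dimension} vs {baseline}: {rate:.1%} of decisive judgments")
```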

Benchmark scale: 100 multi-shot stories, 3,000 evaluated shots, 241 frames per shot at 25 FPS, and default output at 1280 x 736. The evaluation covers cooking scenes, action, fantasy stories, animation, talking-head explainers, and game cinematic styles.
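The benchmark figures above imply a per-shot and per-story duration that is easy to check. The sketch below assumes shots are spread evenly across stories and concatenated without overlap or transitions, neither of which is stated in the source.

```python
FRAMES_PER_SHOT = 241
FPS = 25
STORIES = 100
SHOTS = 3000

shot_seconds = FRAMES_PER_SHOT / FPS            # length of one shot
shots_per_story = SHOTS / STORIES               # average, assuming an even split
story_seconds = shots_per_story * shot_seconds  # assuming back-to-back shots

print(f"{shot_seconds:.2f} s per shot")              # 9.64 s per shot
print(f"{shots_per_story:.0f} shots per story")      # 30 shots per story
print(f"{story_seconds / 60:.2f} min per story")     # 4.82 min per story
```

Under these assumptions an average benchmark story runs just under five minutes, which is consistent with the stated five-minute maximum coherent length.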

JoyAI-Echo technical specs

What it takes to run JoyAI-Echo locally. Numbers are from the official GitHub README - confirm against the latest release tag before provisioning hardware.

| Spec | Value |
| --- | --- |
| Base architecture | LTX-2.3 (Lightricks) |
| Text encoder | Gemma-3-12B-IT (~24 GB) |
| Model weight format | Safetensors (~46 GB) |
| Peak GPU usage | 46-50 GB VRAM (H100 or A100-class card) |
| Default output resolution | 1280 x 736, with optional super-resolution upscale |
| Frames per shot | 241 frames at 25 FPS |
| Max coherent length | 5 minutes, multi-shot |
| Framework | Python 3.11, PyTorch 2.8, CUDA 12.8 |
| Other dependencies | ffmpeg for shot concatenation |
| Inference speedup | ~7.5x via DMD distillation |
| License | LTX-2 Community License, academic and non-commercial use |
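Since ffmpeg is listed as the dependency for shot concatenation, here is a minimal sketch of joining per-shot files with ffmpeg's concat demuxer. It is not taken from the JoyAI-Echo repository: the helper names and file names are hypothetical, and the stream copy (`-c copy`) assumes every shot shares the same codecs, resolution, and frame rate. The official inference scripts may handle this step differently.

```python
import subprocess
from pathlib import Path


def write_concat_list(shot_paths, list_path):
    """Write a concat-demuxer list file with one `file '...'` line per shot."""
    # Paths containing single quotes would need escaping; not handled here.
    lines = [f"file '{Path(p).resolve().as_posix()}'" for p in shot_paths]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return Path(list_path)


def concat_command(list_path, output_path):
    """Build the ffmpeg command; -c copy joins streams without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", str(list_path), "-c", "copy", str(output_path)]


def concatenate(shot_paths, output_path, list_path="shots.txt"):
    """Join the shots in order; requires ffmpeg on the PATH."""
    write_concat_list(shot_paths, list_path)
    subprocess.run(concat_command(list_path, output_path), check=True)
```

A call such as `concatenate(["shot_001.mp4", "shot_002.mp4"], "story.mp4")` would join two shots in order; if the shots differ in encoding parameters, re-encoding instead of `-c copy` is the safer choice.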

If you do not have an H100 or A100, the JoyAI-Echo AI Video Generator runs the same model on cloud hardware behind a browser.

JoyAI-Echo vs LTX-2.3: what's added on top of the base model

JoyAI-Echo is built on LTX-2.3 - the same base, fine-tuned and extended for long-form coherence.

| Capability | LTX-2.3 base | JoyAI-Echo fine-tune |
| --- | --- | --- |
| Max coherent length | 20 seconds | 5 minutes |
| Multi-shot consistency | Single-clip only | Cross-shot memory bank |
| Audio | Stereo 24 kHz, native | Stereo 24 kHz, native + voice timbre lock |
| Inference speed | Baseline | ~7.5x faster via DMD |
| Output resolution | 4K (3840 x 2160) | 1280 x 736 + super-resolution to HD |
| License | Apache 2.0 | LTX-2 Community License |
| Best for | Short cinematic shots at 4K | Long multi-shot stories with identity continuity |

Use LTX-2.3 for short cinematic clips where 4K matters. Use JoyAI-Echo for long-form storytelling where character identity and voice need to hold across many shots.

JoyAI-Echo FAQ

Common technical questions about the open-source JoyAI-Echo model - license, hardware, base architecture, and how it compares to LTX-2.3 and Wan 2.6.

Is JoyAI-Echo open source?

Yes. The model weights, inference code, and reference examples are open-sourced on Hugging Face and GitHub under the LTX-2 Community License for academic and non-commercial use.

What hardware does JoyAI-Echo need?

About 46-50 GB of VRAM. That means an H100, an 80 GB A100, or an RTX A6000 (48 GB) with very little headroom. Consumer GPUs such as the RTX 4090 (24 GB) cannot run JoyAI-Echo locally without CPU offloading.
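The hardware guidance above can be expressed as a simple check of card memory against the quoted 46-50 GB peak. The memory sizes are the vendors' nominal figures, and the three-way classification is a simplification introduced for this sketch: usable memory is slightly lower in practice, and actual peak usage depends on resolution and settings.

```python
PEAK_LOW_GB = 46   # lower end of the quoted peak VRAM range
PEAK_HIGH_GB = 50  # upper end of the quoted peak VRAM range

# Nominal memory per card, in GB.
GPUS = {
    "H100 80GB": 80,
    "A100 80GB": 80,
    "A100 40GB": 40,
    "RTX A6000": 48,
    "RTX 4090": 24,
}


def classify(gpu_gb: float, low: float = PEAK_LOW_GB, high: float = PEAK_HIGH_GB) -> str:
    """Compare card memory with the quoted peak usage range."""
    if gpu_gb >= high:
        return "fits"
    if gpu_gb >= low:
        return "tight"
    return "needs offloading"


for name, memory_gb in GPUS.items():
    print(f"{name}: {classify(memory_gb)}")
```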

What model is JoyAI-Echo built on?

JoyAI-Echo is built on LTX-2.3, the 22B-parameter open-source video diffusion model from Lightricks. The text encoder is Gemma-3-12B-IT.

JoyAI-Echo vs LTX-2.3 - which should I use?

Use LTX-2.3 for short 4K cinematic clips with a permissive license. Use JoyAI-Echo for long multi-shot stories where character and voice continuity matter.

Can I use JoyAI-Echo commercially?

Not under the open-source license, which covers academic and non-commercial use only. Commercial deployments need either a separately negotiated license or the commercial license included in the paid tiers of the hosted JoyAI-Echo AI Video Generator.

Where is the JoyAI-Echo paper?

The paper is available from the official GitHub repository, with mirrors on ResearchGate and the project page.

Don't have an H100? Try the hosted JoyAI-Echo AI Video Generator

The same JoyAI-Echo model, behind a browser. No GPU, no install, no waitlist. The free tier covers 60 seconds of finished video per month; paid plans remove the watermark and add longer renders and a commercial license.
