LongLive 2.0: NVFP4 Parallel Infrastructure for Long Video Generation
Welcome to a detailed exploration of LongLive 2.0, a next-generation framework designed to push the boundaries of long-form video generation. Building on the momentum of the original LongLive, this release introduces an NVFP4-based parallel infrastructure that harmonizes training and inference at scale. The result is a system capable of real-time, long-horizon video generation with sophisticated attention mechanisms, efficient KV-cache handling, and multi-shot capabilities that empower both researchers and developers to experiment with longer, more complex prompts.
To set the stage, let’s visualize the essence of LongLive 2.0 and its mission: delivering smooth streaming long videos through a carefully engineered combination of parallelism, advanced attention techniques, and optimized inference pathways. The project sits at the intersection of diffusion-based video generation, transformer-style sequence modeling, and hardware-aware optimization, all orchestrated to deliver high-throughput, high-quality results.
TLDR: Infra with NVFP4 and Parallelism for Training and Inference
LongLive 2.0 introduces an NVFP4-based parallel infrastructure designed for both training and inference. The core ideas can be summarized as follows:
- NVFP4, NVIDIA's 4-bit floating-point format, sits at the heart of the system and underpins the parallel infrastructure for long video generation. NVFP4 backends are supported alongside BF16 for both training and inference, with efficient handling of KV caches and attention across long sequences.
- The framework introduces multi-shot support and a streaming decoding pathway, enabling interactive and real-time generation experiences even for long videos.
- Training supports balanced sequence parallelism for autoregressive (AR) training with teacher forcing, multi-shot training, and efficient weight management for NVFP4 (or BF16) backends.
- Inference supports NVFP4 with W4A4 quantization, KV-cache parallelism, and multi-shot attention sinks. The sequence-parallel inference and asynchronous decoding help sustain high frame rates and smooth video output.
- The architecture blends several advanced components, including relative RoPE-inspired KV-cache techniques, streaming VAE relocation for streaming pipelines, and compatibility with both Transformer Engine-based backends and FourOverSix-style checkpoints.
The result is a cohesive pipeline that can generate long-form videos with improved throughput, lower KV-cache synchronization overhead, and flexible deployment options across training and inference stages.
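To make the NVFP4 idea in the list above concrete, here is a minimal sketch of block-scaled 4-bit quantization in pure Python. It relies on two properties of the format: elements are stored as FP4 E2M1 values, and each block of 16 elements shares one scale. Everything else is a simplification for illustration: the function names are hypothetical, the block scale is kept as an ordinary Python float (the real format stores it in FP8 and adds a per-tensor scale), and this is not the code path LongLive 2.0 uses.

```python
import math

# Magnitudes representable by the FP4 E2M1 element format.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # one shared scale per 16 consecutive elements


def quantize_block(block):
    """Return (scale, signed E2M1 values) for one block of floats."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from a scale and its E2M1 values."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Round-trip a flat list of floats through block-wise quantization."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

In a W4A4 setup, both the weights (W4) and the activations (A4) go through a rounding step of this kind, which is where the memory and throughput savings come from, at the cost of a small approximation error per element.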
News and Milestones
- 2026.05.25: Optimized the NVFP4 inference path with fused Triton RoPE/adaLN kernels, reduced KV-cache synchronization overhead, in-place quantized KV-cache updates, faster FP4 KV dequantization, pinned VAE transfers, and a safer LoRA-before-quantization setup. Overall throughput improvement of 18.6%.
- 2026.05.13: Release of LongLive 2.0 — an infrastructure with NVFP4, parallelism, and multi-shot for autoregressive training, distillation, and inference (≈45.7 FPS). The original LongLive 1.0 is available in the v1.0 branch.
- 2026.04.12: LongLive supports KV-cache compression with TriAttention, achieving roughly 50% KV reduction without quality loss.
- 2026.01.27: LongLive accepted by ICLR-2026.
- 2026.01.11: LongLive adapts its RoPE to a KV-cache relative RoPE, enabling infinitely long videos.
- 2025.11.03: LongLive extended to linear-attention models via SANA-Video, enabling around 60-second interactive videos in real time.
- 2025.09.29: Paper released, along with the LongLive GitHub repository, model weights (LongLive-1.3B), and an interactive demo site.
- 2025.01+: Ongoing research and development that matured into LongLive 2.0.
Introduction: From Real-Time Long Video in 1.0 to NVFP4-Driven 2.0
LongLive 1.0 introduced real-time, interactive long video generation by processing sequential prompts and streaming results in real time. The approach relied on attention sink concepts, KV-recache strategies, and streaming long-tuning techniques to enable user-guided long-form video creation.
LongLive 2.0 expands this vision with a dedicated NVFP4 parallel infrastructure designed to address the scalability challenges of much longer videos. The key idea is to leverage sequence-level parallelism for autoregressive training, multi-shot processing for longer prompts and scenes, and an inference path engineered for high-throughput, streaming generation. The framework supports both training-time parallelism and efficient inference, enabling researchers to experiment with more ambitious prompts and longer sequences without sacrificing real-time feedback.
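Why sequence parallelism needs to be "balanced" is easiest to see with a small calculation. Under a causal mask, later chunks of a sequence attend to more context than earlier ones, so a naive contiguous split leaves the last rank with most of the work. The sketch below compares a contiguous split with a zigzag split, which pairs an early chunk with a late one on each rank. The zigzag scheme is an assumption chosen for illustration; the source does not state which balancing scheme LongLive 2.0 uses, and all function names are hypothetical.

```python
def causal_cost(chunk_index):
    """Relative attention cost of a chunk: chunk i attends to chunks 0..i."""
    return chunk_index + 1


def contiguous_split(num_chunks, world_size):
    """Give each rank a run of consecutive chunks (num_chunks must divide evenly)."""
    per_rank = num_chunks // world_size
    return [list(range(r * per_rank, (r + 1) * per_rank)) for r in range(world_size)]


def zigzag_split(num_chunks, world_size):
    """Pair chunk r with chunk (num_chunks - 1 - r) on rank r."""
    if num_chunks != 2 * world_size:
        raise ValueError("this sketch expects exactly two chunks per rank")
    return [[r, num_chunks - 1 - r] for r in range(world_size)]


def rank_costs(assignment):
    """Total causal-attention cost carried by each rank."""
    return [sum(causal_cost(c) for c in chunks) for chunks in assignment]


print(rank_costs(contiguous_split(8, 4)))  # [3, 7, 11, 15]
print(rank_costs(zigzag_split(8, 4)))      # [9, 9, 9, 9]
```

With the contiguous split the slowest rank does five times the work of the fastest, and every step waits for it. The balanced split gives each rank the same cost.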
To illustrate how LongLive 2.0 achieves these goals, here is a high-level framework overview:
Training
- Balanced sequence parallel for autoregressive (AR) training with teacher forcing.
- AR training on multi-shot or single-shot videos.
- NVFP4 or BF16 backends for training, with careful module wrapping and weight materialization to maximize throughput.
Inference
- NVFP4 inference (W4A4) and NVFP4 KV cache for fast, memory-efficient generation.
- Multi-shot attention sink to manage attention across long video frames.
- Sequence-parallel inference to exploit long-context opportunities.
- Async decoding to maintain smooth streaming outputs.
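The attention-sink idea from the inference list can be sketched as a cache that pins the first few entries and keeps only a rolling window of recent ones, so memory stays bounded no matter how long the video runs. This is a generic single-shot illustration: the class name and parameters are hypothetical, the entries stand in for per-frame key/value tensors, and how LongLive 2.0 extends the sink across multiple shots is not modeled here.

```python
from collections import deque


class SinkWindowCache:
    """Keep the first `num_sink` entries forever plus the `window` most recent ones."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []                       # pinned entries, never evicted
        self.recent = deque(maxlen=window)   # rolling window of recent entries

    def append(self, frame_kv):
        """Add one frame's KV entry, evicting the oldest non-sink entry if full."""
        if len(self.sink) < self.num_sink:
            self.sink.append(frame_kv)
        else:
            self.recent.append(frame_kv)  # deque drops its oldest item when full

    def visible(self):
        """Entries the next frame is allowed to attend to."""
        return self.sink + list(self.recent)


cache = SinkWindowCache(num_sink=2, window=3)
for frame_index in range(10):
    cache.append(frame_index)
print(cache.visible())  # [0, 1, 7, 8, 9]
```

The cache never holds more than `num_sink + window` entries, which is what keeps per-frame attention cost constant during long generations.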
An overview diagram (LongLive 2.0 framework overview) visualizes the architecture and data flow.
In addition, it’s useful to compare LongLive 2.0 with its predecessor. A separate teaser image highlights the evolution from LongLive 1.0 to 2.0, with a focus on interactive, long-form generation and the enhanced throughput made possible by the NVFP4 design.
Getting Started
If you want to dive in, there are comprehensive resources and quick-start guides that walk you through installation, NVFP4 setup, training, and inference. The documentation is hosted online, and the repository provides practical guidance and example pipelines that compile and run on standard GPU hardware.
- Full Documentation: https://nvlabs.github.io/LongLive/LongLive2/docs/
- Installation: https://nvlabs.github.io/LongLive/LongLive2/docs/#installation
- NVFP4 Setup: https://nvlabs.github.io/LongLive/LongLive2/docs/#nvfp4-installation
- Training: https://nvlabs.github.io/LongLive/LongLive2/docs/#training
- Inference: https://nvlabs.github.io/LongLive/LongLive2/docs/#inference
- Data Organization: https://nvlabs.github.io/LongLive/LongLive2/docs/#training-data
To give you a sense of what the quick start looks like, here are two representative paths: a BF16 quick-start and an NVFP4-specific setup.
Quick Start: BF16
This quick-start demonstrates running a simple inference using a BF16 model checkpoint. It assumes you have a CUDA-enabled environment and the required codebase installed.
Code snippet (BF16):
import torch
from omegaconf import OmegaConf
from pipeline import CausalDiffusionInferencePipeline
from utils.config import normalize_config
from utils.inference_utils import (
    load_generator_checkpoint,
    place_vae_for_streaming,
    prepare_single_prompt_inputs,
    save_video,
)
prompt = "A compact silver robot walks through a clean robotics lab."
merged_checkpoint_path = "LongLive-2.0-5B/model_bf16.pt"
config = normalize_config(OmegaConf.load("configs/inference.yaml"))
device = torch.device("cuda")
torch.set_grad_enabled(False)
pipe = CausalDiffusionInferencePipeline(config, device=device)
load_generator_checkpoint(pipe.generator, merged_checkpoint_path)
pipe = pipe.to(device=device, dtype=torch.bfloat16)
place_vae_for_streaming(pipe, config) # honor streaming_vae + vae_device when set
pipe.generator.model.eval().requires_grad_(False)
noise, prompts = prepare_single_prompt_inputs(config, prompt, device)
video = pipe.inference(noise=noise, text_prompts=prompts)
save_video(video[0], "videos/quickstart/sample.mp4", fps=24)
Note: place_vae_for_streaming is a no-op unless inference.streaming_vae is true and inference.vae_device is set, so toggling streaming-pipeline decode in your YAML config is enough; the script does not need to change.
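The condition in the note above can be written out as a small check. This is only an illustration of the stated rule, not the repository's implementation: the helper name should_relocate_vae is hypothetical, the config is modeled as a plain dictionary rather than an OmegaConf object, and the device string "cuda:1" is an assumed example value.

```python
def should_relocate_vae(config):
    """Return True only when streaming decode is enabled and a VAE device is set."""
    inference = config.get("inference", {})
    return bool(inference.get("streaming_vae")) and bool(inference.get("vae_device"))


# Streaming decode enabled with a target device (device name is an assumed example).
streaming = {"inference": {"streaming_vae": True, "vae_device": "cuda:1"}}
# Flag left off: the helper would do nothing.
default = {"inference": {"streaming_vae": False}}

print(should_relocate_vae(streaming))  # True
print(should_relocate_vae(default))    # False
```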
Quick Start: NVFP4
For NVFP4 setups, you’ll adjust the checkpoint path and the backend settings to match the FourOverSix or Transformer Engine-based configurations. The NVFP4 workflow emphasizes careful checkpoint loading, NVFP4 module wrapping, weight materialization, dtype/device placement, and the streaming-pipeline VAE relocation for both backends. The bf16 shortcut pipe.to(…) is unsafe here because it would cast the quantized buffers.
Code snippet (NVFP4):
import torch
from omegaconf import OmegaConf
from pipeline import CausalDiffusionInferencePipeline
from utils.config import normalize_config
from utils.inference_utils import prepare_single_prompt_inputs, save_video, setup_nvfp4_pipeline
prompt = "A compact silver robot walks through a clean robotics lab."
config = normalize_config(OmegaConf.load("configs/nvfp4/inference_nvfp4.yaml"))
device = torch.device("cuda")
torch.set_grad_enabled(False)
pipe = CausalDiffusionInferencePipeline(config, device=device)
setup_nvfp4_pipeline(pipe, config, device)
pipe.generator.model.eval().requires_grad_(False)
noise, prompts = prepare_single_prompt_inputs(config, prompt, device)
video = pipe.inference(noise=noise, text_prompts=prompts)
save_video(video[0], "videos/quickstart/sample_nvfp4.mp4", fps=24)
The NVFP4 setup path handles the intricacies of the NVFP4 backend, including 4-bit quantized precision, KV-cache management, and streaming-ready VAE relocation, so you can experiment with longer horizons and higher throughput with only a small trade-off in measured video quality.
Framework at a Glance
The LongLive 2.0 framework packs a broad set of capabilities into a cohesive whole. The architecture is designed to scale, with parallelism at multiple levels and careful coordination between training and inference pathways. A central image captures the high-level structure of LongLive 2.0, illustrating how components fit together and how data flows through the system during typical AR training and inference sessions.
To appreciate the evolutionary step from 1.0 to 2.0, there is also a standalone teaser image that contrasts the two generations of the framework.
The 2.0 architecture emphasizes:
- Balanced sequence parallel for AR training (teacher-forcing) to efficiently learn long-horizon video generation.
- NVFP4 support for both autoregressive training and few-step distillation.
- Inference-time features such as W4A4 NVFP4, KV-cache management, and multi-shot attention sinks to sustain long sequences.
- Async decoding to maintain fluid video streaming and minimize latency spikes during long outputs.
- Streaming VAE relocation to ensure compatibility with streaming inference pipelines without performance bottlenecks.
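The async decoding item in the list above follows a familiar producer/consumer pattern: the generator keeps producing latent chunks while a separate worker decodes earlier chunks into frames. The sketch below shows that pattern with a bounded queue and a thread. It is a generic illustration under stated assumptions: run_pipeline, generate, and decode are hypothetical names, and the real system would overlap GPU work rather than plain Python functions.

```python
import queue
import threading


def run_pipeline(num_chunks, generate, decode, max_pending=2):
    """Overlap chunk generation with decoding using a bounded queue."""
    pending = queue.Queue(maxsize=max_pending)
    frames = []

    def decoder_worker():
        while True:
            latent = pending.get()
            if latent is None:  # sentinel: generation has finished
                break
            frames.append(decode(latent))

    worker = threading.Thread(target=decoder_worker)
    worker.start()
    for index in range(num_chunks):
        # put() blocks when the decoder is max_pending chunks behind,
        # which bounds the memory held by undecoded latents.
        pending.put(generate(index))
    pending.put(None)
    worker.join()
    return frames


# Stand-in generate/decode functions; a single worker preserves chunk order.
print(run_pipeline(4, generate=lambda i: i, decode=lambda z: z * 10))  # [0, 10, 20, 30]
```

The bounded queue is the key design choice: it lets generation run ahead of decoding without letting undecoded latents pile up when decoding is the slower stage.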
LongLive 1.0 vs. LongLive 2.0: A Visual Evolution
The journey from 1.0 to 2.0 is marked by a shift from real-time interactive long video generation to a scalable, NVFP4-accelerated paradigm that can handle longer videos with improved throughput and robustness. The 1.0 framework emphasized real-time generation via streaming long tuning and efficient attention-sink management. The 2.0 extension retains the real-time spirit but adds a robust parallelization backbone (NVFP4) that supports multi-shot processing, sequence parallelism, and distributed tensor management. The visual comparison image highlights how 2.0 introduces greater scale and efficiency, without sacrificing the interactive, user-driven nature of generation.
The accompanying teaser image for 1.0 provides a reminder of the continuity: the core goal remains creating compelling, user-guided long videos, but the 2.0 version achieves this with a more scalable, hardware-conscious design.
Models, Performance, and Availability
LongLive 2.0 ships with multiple model variants, each tuned for different performance envelopes and use cases. Here are the key models and their trade-offs:
| Model | FPS | Parameters | VBench | Multi-shot |
| --- | --- | --- | --- | --- |
| LongLive-1.3B | 20.7 | 1.3B | 84.87 | Not explicitly indicated |
| LongLive-2.0-5B | 24.8 | 5B | 85.06 | Supported (✅) |
| LongLive-2.0-5B-NVFP4-4Step | 29.7 | 5B | 84.51 | Supported (✅) |
| LongLive-2.0-5B-NVFP4-2Step | 45.7 | 5B | 83.14 | Supported (✅) |
These variants illustrate a spectrum spanning lighter to heavier models, with NVFP4-based configurations offering rapid inference (notably up to 45.7 FPS in the 2-step NVFP4 setup) and multi-shot support for more complex prompts and longer videos.
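The speed and quality trade-off among the 5B variants is easier to judge as ratios. The short calculation below uses only the FPS and VBench figures listed above, with the BF16 LongLive-2.0-5B model as the baseline. The compare helper is a hypothetical name introduced for this illustration.

```python
# FPS and VBench figures as reported for the 5B variants.
MODELS = {
    "LongLive-2.0-5B": {"fps": 24.8, "vbench": 85.06},
    "LongLive-2.0-5B-NVFP4-4Step": {"fps": 29.7, "vbench": 84.51},
    "LongLive-2.0-5B-NVFP4-2Step": {"fps": 45.7, "vbench": 83.14},
}


def compare(name, baseline="LongLive-2.0-5B"):
    """Return (speedup over baseline, VBench difference), rounded to 2 decimals."""
    base, model = MODELS[baseline], MODELS[name]
    speedup = model["fps"] / base["fps"]
    vbench_delta = model["vbench"] - base["vbench"]
    return round(speedup, 2), round(vbench_delta, 2)


print(compare("LongLive-2.0-5B-NVFP4-4Step"))  # (1.2, -0.55)
print(compare("LongLive-2.0-5B-NVFP4-2Step"))  # (1.84, -1.92)
```

In other words, the 2-step NVFP4 variant runs about 1.84 times faster than the BF16 baseline for a VBench drop of 1.92 points, while the 4-step variant gains about 20% in speed for a 0.55-point drop.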
The project remains open for broader use under the Apache 2.0 license, with documentation and training/inference code, as well as model weights, published to facilitate experimentation and adoption.
License
This repository is released under the Apache 2.0 license. See the LICENSE file for details. The license underscores a commitment to openness, collaboration, and practical reuse of the LongLive 2.0 framework in academic and industry settings.
Citation
If you find LongLive 2.0 useful for your research or applications, please consider citing the work:
BibTeX entry (article):
@article{longlive_2.0,
  title={LongLive2.0: An NVFP4 Parallel Infrastructure for Long Video Generation},
  author={Chen, Yukang and Wang, Luozhou and Huang, Wei and Yang, Shuai and Zhang, Bohan and Xiao, Yicheng and Chu, Ruihang and Mao, Weian and Hu, Qixin and Liu, Shaoteng and Zhao, Yuyang and Mao, Huizi and Chen, Ying-Cong and Xie, Enze and Qi, Xiaojuan and Han, Song},
  journal={arXiv preprint arXiv},
  year={2026}
}
BibTeX entry (inproceedings, ICLR):
@inproceedings{longlive,
  title={Longlive: Real-time interactive long video generation},
  author={Yang, Shuai and Huang, Wei and Chu, Ruihang and Xiao, Yicheng and Zhao, Yuyang and Wang, Xianbang and Li, Muyang and Xie, Enze and Chen, Yingcong and Lu, Yao and others},
  booktitle={ICLR},
  year={2026},
}
For convenience, you can copy these into your bibliography manager when you cite LongLive 2.0 in your work.
Acknowledgement
LongLive 2.0 builds on a foundation of prior work and community contributions:
- Self-Forcing: the AR training codebase and formulation we build upon.
- Wan2.2: the base video diffusion model components used in this release.
These collaborations and inspirations help make LongLive 2.0 a practical, scalable platform for long-video generation research and application development.
Visuals and Key Figures
- LongLive 2.0 teaser: a compact visual summary of the release’s capabilities and goals.
- LongLive 1.0 framework overview: a reminder of the lineage and how 2.0 builds on the foundations.
- Framework overview: the central diagram showing the interaction of components, parallelism, and data flow.
- Watchable entry: a still frame inviting viewers to watch the illustrative video.
- Logo: the LongLive 2.0 logo that anchors branding and recognition.
If you’re exploring long-form content generation, LongLive 2.0 represents a substantial step toward scalable, interactive, and realistic video synthesis. The NVFP4 parallel infrastructure marries training efficiency with inference speed, enabling longer horizons, richer prompts, and a more responsive user experience. With the included documentation, code, and model variants, researchers and practitioners can experiment with augmented long-video generation pipelines, push the envelope on creative storytelling, and probe the practical limits of parallelized diffusion-based video synthesis. The ongoing evolution—reflected in recent optimizations, KV-cache innovations, and robustness enhancements—signals a strong trajectory for long-form generative video research in the months and years ahead.
Repository: https://github.com/NVlabs/LongLive