Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
SANA: A Deep Dive into Efficient High-Resolution Image and Video Generation
SANA stands at the forefront of efficient diffusion-based generation, delivering high-resolution imagery and video with an emphasis on scalable training, fast inference, and practical deployment. Born from a collaborative effort at NVlabs, this codebase packs a suite of models and pipelines—SANA itself, SANA-1.5, SANA-Sprint, SANA-Video, SANA-WM, and Sol-RL—into an open-source framework designed to run on modest hardware while achieving impressive results at resolutions up to 4K. The project has earned recognition in top-tier venues and continues to grow through active community engagement, documentation, and cross-project collaborations.
Teaser image: Sana page overview
Overview: What SANA Is and Why It Matters
SANA is an efficiency-oriented codebase for high-resolution image and video generation. It provides complete training and inference pipelines across its family of models, with a clear emphasis on reducing compute, memory, and data requirements without compromising perceptual quality. The project’s philosophy centers on making advanced diffusion-based synthesis accessible to researchers and developers who work on commodity hardware, mobile devices, or workstation GPUs with limited VRAM.
Key components in the SANA ecosystem include:
- SANA: The core diffusion-based model capable of high-resolution image synthesis, up to 4K, with dramatically lower resource needs compared to traditional large diffusion architectures.
- SANA-1.5: An iteration focused on scaling training-time and inference-time compute to improve quality while maintaining efficiency.
- SANA-Sprint: A one/few-step generator enabled by continuous-time distillation (sCM), delivering very fast per-image generation on high-end accelerators (0.1 seconds per 1024 px image on H100).
- SANA-Video: A streamlined framework for video generation, employing techniques such as Block Linear Attention and advanced refiners to produce longer sequences with manageable compute.
- SANA-WM: A controllable world model with 2.6B parameters, capable of generating 720p video worlds with 6-DoF camera control, enabling embodied AI-style simulations and planning.
- Sol-RL: A reinforcement learning-oriented module that leverages NVFP4 rollouts and BF16-based training to accelerate convergence and enable scalable post-training RL workflows.
- Supporting technologies: Diffusion-based architectures, DC-AE compression, linear attention, block causal design, and a variety of training and inference optimizations that collectively enable efficient high-quality generation.
A cohesive story is told through containerized pipelines, optimized attention mechanisms, and unified APIs that enable researchers to train, fine-tune, and deploy Sana models with relative ease. The project also maintains a rich set of documentation, tutorials, and example configurations to lower the barrier to entry for newcomers while offering the depth seasoned practitioners demand.
Teaser image: all Sana components
How SANA Works: Core Techniques and Design Principles
SANA is built around several core ideas that together deliver efficiency without sacrificing fidelity:
Linear Attention for High Resolution
Replacing vanilla attention in the Diffusion Transformer (DiT) denoiser with linear attention sharply reduces memory and compute when handling large image tensors, which makes 4K samples feasible on practical hardware.
The diffusion transformer (DiT) backbone benefits from a carefully designed attention mechanism that scales close to linearly with sequence length, which is critical for high-resolution generation.
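To make the scaling argument concrete, here is a minimal pure-Python sketch of linear attention. It assumes a ReLU feature map, which this post does not spell out, and the function names are illustrative only; it is not taken from the Sana codebase. The idea is that the key/value summary is computed once and reused for every query, so cost grows linearly with the number of tokens rather than quadratically.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def quadratic_attention(Q, K, V, eps=1e-6):
    """Reference O(N^2) form: weight_ij = relu(q_i) . relu(k_j), normalized per query."""
    out = []
    for q in Q:
        fq = [relu(a) for a in q]
        weights = [sum(a * relu(b) for a, b in zip(fq, k)) for k in K]
        denom = sum(weights) + eps
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / denom
                    for c in range(len(V[0]))])
    return out


def linear_attention(Q, K, V, eps=1e-6):
    """O(N) form: build sum_j relu(k_j)^T v_j once, then reuse it for every query."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # d x dv summary of all keys and values
    k_sum = [0.0] * d                    # normalizer: sum_j relu(k_j)
    for k, v in zip(K, V):
        fk = [relu(b) for b in k]
        for i in range(d):
            k_sum[i] += fk[i]
            for c in range(dv):
                kv[i][c] += fk[i] * v[c]
    out = []
    for q in Q:
        fq = [relu(a) for a in q]
        denom = sum(a * s for a, s in zip(fq, k_sum)) + eps
        out.append([sum(fq[i] * kv[i][c] for i in range(d)) / denom
                    for c in range(dv)])
    return out
```

Both functions compute the same result up to floating-point rounding; only the order of operations differs. A real implementation would use batched tensor operations on the GPU rather than Python loops.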
DC-AE: Ultra-Compression for Latent Tokens
DC-AE (Deep Compression Autoencoder) compresses images far more aggressively than conventional autoencoders while preserving essential information, so the denoiser works on far fewer latent tokens. This compression yields substantial memory savings and speedups during both training and inference.
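A quick back-of-the-envelope calculation shows why the compression ratio matters so much. The numbers below are assumptions for illustration, not figures quoted in this post: a conventional autoencoder that downsamples 8x followed by 2x2 patching, versus a 32x autoencoder with no extra patching. The helper name is hypothetical.

```python
def latent_tokens(image_size, ae_downsample, patch_size=1):
    """Number of tokens the transformer sees for a square image."""
    side = image_size // (ae_downsample * patch_size)
    return side * side


for size in (1024, 2048, 4096):
    baseline = latent_tokens(size, ae_downsample=8, patch_size=2)    # assumed baseline
    compressed = latent_tokens(size, ae_downsample=32, patch_size=1)  # assumed DC-AE setting
    print(f"{size}px: {baseline} tokens vs {compressed} tokens")
```

Under these assumptions the token count drops by 4x at every resolution, and any attention cost that grows with sequence length shrinks accordingly.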
Decoder-Only Text Encoding
A modern decoder-only text encoder is integrated to support in-context learning and improved text-to-image alignment. This choice aligns with the demand for efficient, scalable encoding in large-generation scenarios.
Block Causal Linear Attention and Causal Mix-FFN
For long video generation, these techniques combine efficient attention with optimized feed-forward networks, enabling longer temporal horizons without incurring the quadratic cost of full attention.
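One plausible reading of block causal linear attention is sketched below in pure Python: tokens see every token in their own block and in all earlier blocks, and the history is carried in a fixed-size running state rather than a growing cache. The ReLU feature map, the block layout, and the function name are assumptions for illustration; the actual SANA-Video implementation may differ.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def block_causal_linear_attention(Q, K, V, block_size, eps=1e-6):
    """Linear attention where block b attends to blocks 0..b via a running state."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # running sum of relu(k)^T v over seen blocks
    k_sum = [0.0] * d                    # running sum of relu(k) over seen blocks
    out = []
    for start in range(0, len(Q), block_size):
        end = min(start + block_size, len(Q))
        # Fold the whole current block into the state first, so that tokens
        # inside a block can see each other (bidirectional within the block).
        for k, v in zip(K[start:end], V[start:end]):
            fk = [relu(b) for b in k]
            for i in range(d):
                k_sum[i] += fk[i]
                for c in range(dv):
                    kv[i][c] += fk[i] * v[c]
        # Queries read only the fixed-size state, never the raw history.
        for q in Q[start:end]:
            fq = [relu(a) for a in q]
            denom = sum(a * s for a, s in zip(fq, k_sum)) + eps
            out.append([sum(fq[i] * kv[i][c] for i in range(d)) / denom
                        for c in range(dv)])
    return out
```

The state has size d x dv no matter how many blocks have been processed, which is what keeps memory flat as the video gets longer.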
Flow-DPM-Solver: Efficient Sampling
A sampling strategy that reduces the number of diffusion steps while maintaining quality, contributing to faster inference and shorter generation times.
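To show what a sampling step loop looks like, the toy sketch below integrates a flow ODE with plain Euler steps. This is deliberately not Flow-DPM-Solver, which is a more elaborate solver; the target vector and velocity function are hypothetical stand-ins for a trained model. It only illustrates the mechanics of stepping from noise (t = 1) to data (t = 0) in a chosen number of steps.

```python
def euler_flow_sample(x_start, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t = 1 down to t = 0 with uniform Euler steps."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt          # current time, from 1.0 down to dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [0.5, -1.0, 2.0]  # hypothetical "clean" sample for the toy example


def toy_velocity(x, t):
    # On the straight path x_t = (1 - t) * data + t * noise, the velocity is
    # noise - data, which equals (x_t - data) / t.
    return [(xi - di) / t for xi, di in zip(x, TARGET)]


noise = [3.0, 3.0, -3.0]
few = euler_flow_sample(noise, toy_velocity, num_steps=2)
many = euler_flow_sample(noise, toy_velocity, num_steps=20)
```

Because the toy trajectory is perfectly straight, two steps land on the same point as twenty. Real models have curved trajectories, which is why better solvers and distillation are needed to cut step counts without losing quality.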
sCM Distillation: One/Few-Step Generation
Continuous-time consistency distillation (sCM) enables substantially shorter generation pipelines, compressing the diffusion process into one or a few steps with robust results.
Controllable World Modeling (SANA-WM)
A world model with long-context capabilities and a 6-DoF camera trajectory controller allows the agent to navigate and build consistent, expandable 3D-like worlds. This is critical for embodied AI experiments and video synthesis with coherent motion and perspective.
Sol-RL: Mixed-Precision Acceleration
Rollout selection is performed with low-precision NVFP4 while optimization remains BF16 for stability and speed, delivering faster RL training without sacrificing final policy performance.
Deployment Across Scales
The system is designed for deployment on laptop GPUs with less than 8GB VRAM via 4-bit quantization, making high-resolution generation accessible in constrained environments.
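A rough estimate helps explain why 4-bit quantization brings the model within reach of small GPUs. The sketch below counts only the diffusion transformer weights for a 1.6B-parameter model. It ignores the text encoder, the autoencoder, activations, and quantization overhead such as scales, so real usage will be higher. The helper name is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


params = 1.6e9
for label, bits in (("BF16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: {weight_memory_gib(params, bits):.2f} GiB")
```

Going from 16 bits to 4 bits per weight cuts this part of the footprint by 4x, which leaves more of the budget for the other components.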
A quick visual note: the Sana family is united by a philosophy of combining efficiency with expressive capability. The result is a pipeline that trains fast, infers quickly, and delivers results competitive with the state of the art for both images and video.
Teaser image: overview of Sana’s architecture
News and Milestones: A Timeline of Progress and Collaborations
SANA has grown through a continuous stream of releases, improvements, and community-driven enhancements. The project has been featured in major conferences and has established a broad ecosystem of tools, models, and integrations. Here are some of the standout milestones and updates that shape its ongoing development:
2026 May: SANA-WM 2.6B Controllable World Model released
Capable of 720p video generation with 1-minute horizons and 6-DoF camera control, providing a new baseline for world modeling and embodied AI.
Projected to influence how long-horizon planning and multi-modal world interaction are approached in diffusion-based systems.
References: Project page and arXiv paper.
2026 April: Sol-RL NVFP4 Rollout and BF16 RL Training available
Complete training recipes for SANA, FLUX.1, and SD3.5-L, including bundled post-training datasets.
A broader RL workflow that accelerates experimentation and deployment.
2026 March: SANA-Video 720p with LTX-VAE released
Enables upscaling with LTX2 Refiner to 2K resolutions, expanding the quality ceiling for video synthesis.
Access points include Model Zoo, dedicated documentation, and blog content detailing refinements.
2026 March: Post-Training infra: SANA × Cosmos-RL
A collaboration with Cosmos-RL to provide a full RL infrastructure for post-training (SFT/RL) of SANA-Image and SANA-Video.
Features state-of-the-art algorithms (Diffusion-NFT, Flow-GRPO), preset configs, and flexible datasets.
2026 February: SANA joins SGLang ecosystem
SANA is supported in SGLang for high-performance serving with an OpenAI-compatible API.
Documentation guidance and integration notes offer a path to production-level deployment.
2025 January to 2026 January: A flurry of releases and updates
SANA-Video accepted as an Oral at ICLR-2026.
LongSANA and other variants released with improved inference, training efficiency, and wider model support (multiple resolutions, 2K/4K, BF16, 8-bit, 4-bit).
2025 November to 2025 December: Broader diffusers support and open releases
SANA models released in several formats, including diffusers-compatible checkpoints and LoRA/DreamBooth workflows.
Public APIs, tutorials, and example configurations expand the accessibility of Sana for a broad audience.
2025 March to 2025 May: Sprint and 1.5 milestones
SANA-Sprint code and weights released; one/few-step diffusion via distillation becomes a practical option.
SANA-1.5 code and weights released, with emphasis on training-time and inference-time scalability.
2024–2025: Early releases and accelerating adoption
4-bit and 8-bit quantization approaches demonstrated, enabling runs on small GPUs.
2K and 4K model releases, diffusion pipelines integrated into ComfyUI, and broader diffusion ecosystem integrations.
In our own words, the news stream embodies a strategic shift toward practical, deployable diffusion models that can operate on a range of hardware and budgets while maintaining or surpassing the visual quality of larger systems. The community traction—through Diffusers PRs, HuggingFace spaces, and multiple model hubs—speaks to SANA’s role as both a research prototype and a deployable toolkit.
Teaser image: SANA all components
Getting Started: How to Run SANA on Your Machine
If you want to explore SANA, the Quick Start section in the documentation lays out the steps clearly. Below is a compact walk-through to get you started, followed by code blocks you can run locally.
Quick Start: Clone and set up
Open a terminal and run:
- git clone https://github.com/NVlabs/Sana.git
- cd Sana
- ./environment_setup.sh sana
Inference with diffusers
The following example demonstrates how to load a pre-trained Sana model and generate an image from a text prompt:
Python:
    import torch
    from diffusers import SanaPipeline

    # Load the SANA-1.5 1.6B 1024px checkpoint in bfloat16
    pipe = SanaPipeline.from_pretrained(
        "Efficient-Large-Model/SANA1.5_1.6B_1024px_diffusers", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")
    pipe.vae.to(torch.bfloat16)
    pipe.text_encoder.to(torch.bfloat16)

    prompt = 'a cyberpunk cat with a neon sign that says "Sana"'
    # Index 0 of the pipeline output is the list of generated images
    image = pipe(
        prompt=prompt, height=1024, width=1024,
        guidance_scale=4.5, num_inference_steps=20,
        generator=torch.Generator(device="cuda").manual_seed(42),
    )[0]
    image[0].save("sana.png")
Tip: Upgrade your diffusers library to version 0.32.0 or later to ensure compatibility with SanaPipeline.
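If you want to check the installed version before running the example, a small standard-library sketch like the following can help. The helper names are hypothetical, and the parsing is deliberately simple: pre-release tags such as rc1 or dev0 are ignored, so 0.32.0rc1 is treated the same as 0.32.0.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

MIN_DIFFUSERS = (0, 32, 0)  # minimum version mentioned in the tip above


def parse_release(version_string):
    """Turn '0.33.1' or '0.32.0.dev0' into a comparable tuple of three ints."""
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def diffusers_is_recent_enough():
    """Return True only if diffusers is installed and at least MIN_DIFFUSERS."""
    try:
        installed = version("diffusers")
    except PackageNotFoundError:
        return False
    return parse_release(installed) >= MIN_DIFFUSERS


if not diffusers_is_recent_enough():
    print("Please run: pip install -U diffusers")
```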
Full documentation and resources
Full Documentation: https://nvlabs.github.io/Sana/docs/
Installation Guide: https://nvlabs.github.io/Sana/docs/installation/
Model Zoo: https://nvlabs.github.io/Sana/docs/model_zoo/
Sana Inference & Training: https://nvlabs.github.io/Sana/docs/sana/
SANA-Sprint: https://nvlabs.github.io/Sana/docs/sana_sprint/
SANA-Video: https://nvlabs.github.io/Sana/docs/sana_video/
LongSANA: https://nvlabs.github.io/Sana/docs/longsana/
SANA-WM (coming soon): https://nvlabs.github.io/Sana/docs/world-model/
ControlNet: https://nvlabs.github.io/Sana/docs/sana_controlnet/
LoRA / DreamBooth: https://nvlabs.github.io/Sana/docs/sana_lora_dreambooth/
A Performance Snapshot: Efficiency Meets Quality
SANA places a premium on measurable efficiency while maintaining high perceptual fidelity. Here are notable performance themes that emerge from the published benchmarks and configurations:
Image generation at 1024x1024
Sana demonstrates high throughput with competitive or superior speedups versus prior heavy diffusion models, particularly when using the 0.6B and 1.6B variants.
Comparing models, Sana-0.6B achieves substantial speedups and robust FID/CLIP metrics, making it a strong option for interactive or batch image generation.
1.6B Sana variants deliver competitive latency with high-quality outputs, and the pipeline can use BF16 precision or 4-bit quantization to reduce memory usage while preserving visual fidelity.
The combination of DC-AE compression and linear attention enables efficient 1024x1024 generation on mid-range GPUs, with reported sample efficiencies and practical runtimes.
Video generation (VBench 720p)
The SANA-Video family demonstrates scalable performance for video synthesis, with benchmark results showing favorable latency vs. quality trade-offs.
A notable exemplar: SANA-Video-2B achieving a favorable balance with relatively low latency and compact parameter count, enabling practical minute-length or longer sequences under certain configurations.
4K and multi-resolution support
The 4K capabilities are backed by DC-AE compression, efficient attention, and optimized diffusion steps. The results indicate the possibility of 4K outputs with manageable GPU memory footprints when using BF16/8-bit configurations.
Diffusers integration and deployment
The project has a long-standing collaboration with diffusers, enabling a smooth pipeline for users who prefer HuggingFace ecosystems. Diffusers compatibility extends Sana across pipelines, with SanaPipeline, SanaPAGPipeline, and advanced schedulers supported.
Real-world deployment
The ability to run with less than 8GB VRAM through quantization makes Sana practical for laptop-based development, on-device experiments, and lightweight cloud instances.
In short, Sana’s performance story is not simply about raw numbers; it’s about a coherent design that reduces the infrastructural burden of high-resolution generation. The results are accessible to researchers who want to iterate quickly, build prototypes, or deploy creative generative tools to end users.
What You Can Build with Sana
The Sana family opens doors to a broad spectrum of creative and technical projects:
- High-resolution art and design generation, with support for intricate textures, complex lighting, and cinematic visuals at 2K–4K scales.
- Real-time or near-real-time content creation pipelines for digital media, with the Sprint family enabling fast pipelines for rapid ideation and iteration.
- Video synthesis and editing workflows, including minute-length sequences with coherent motion and camera dynamics, useful for concept videos, prototyping scenes, or game content creation.
- Embodied AI experiments through SANA-WM, enabling interactive world-building, navigation, and 6-DoF camera control in generated environments.
- RL-driven content generation through Sol-RL, enabling reinforcement learning-based optimization of generation patterns and reward-guided improvements.
Acknowledgments: The Open-Source Spirit
SANA’s progress rests on the shoulders of many open-source projects and contributors. Key acknowledgments include:
- Diffusers: For diffusers framework compatibility and scheduling strategies.
- Pixel-level diffusion and DC-AE-based contributions: The DC-AE approach to compression, and diffusion-based improvements in large-scale generation.
- ComfyUI and related node ecosystems: Enabling practical, modular workflows for Sana in user-friendly environments.
- LoRA and DreamBooth integrations: Expanding personalization and fine-tuning capabilities within accessible tooling.
- Open-source RL libraries and collaborators: Cosmos-RL and other allied ecosystems enabling post-training RL and rollout scalability.
- The broader diffusion community: Acknowledgments to many researchers and practitioners who contribute to diffusion-based image and video synthesis, pushing the boundaries of efficiency and quality.
Acknowledging the community’s energy helps explain how Sana has evolved from a research prototype to a robust, deployable toolkit with active documentation, tutorials, and community forums. The “Contribution” section of the project showcases the collective effort and thanks to individual contributors and teams.
Bonus: Acknowledgement of Visual Content
The blog post includes images that illustrate Sana’s scope:
- The main Sana overview image (Sana all components) helps visualize the architecture and the relationships across Sana’s family.
- The Sana teaser images (Sana.jpg and all.png) provide visual anchors for the architecture and outputs.
- A “Presenting Video of SANA” image linked to the Paper2Video presentation content, illustrating the multi-modal storytelling around Sana’s capabilities.
What’s Next: A Forward-Looking Note
SANA’s roadmap points toward deeper integration with RL, world modeling, and flexible deployment, along with expanded model zoos for higher-resolution outputs and broader language-vision alignment. The ongoing work includes:
- Expanding the World Model (SANA-WM) to more complex scenes and longer horizons, with improved camera dynamics and scene consistency.
- Enhancing post-training RL workflows through Cosmos-RL and related tools for diffusion-based content optimization.
- Extending 2K/4K model coverage with more robust 8-bit and 4-bit pipelines, improving latency, memory usage, and stability for end-user products.
- Increasing diffusion efficiency with newer attention variants, more compact autoencoders, and advanced distillation methods to push 0.1s-per-1024px generation closer to practical real-time deployment.
Closing Thoughts: A Community-Driven Platform for Creative AI
SANA represents a deliberate fusion of efficiency, scale, and accessibility. By combining DC-AE compression, linear attention, distillation, and careful system design, SANA demystifies high-resolution diffusion for everyday developers and researchers. It is not merely about achieving 4K outputs or minute-length videos; it is about enabling a workflow where one can go from concept to polished media quickly, with the ability to experiment, refine, and deploy. The project’s rich documentation, active contributions, and multi-model ecosystem underscore a philosophy of openness, collaboration, and practical impact.
If you are curious about pushing the boundaries of what diffusion-based generation can do on modest hardware, SANA offers a well-documented, community-supported path. Start with the Quick Start guide, explore the model zoo, and experiment with SANA-Sprint for ultra-fast outputs or SANA-Video for cinematic video synthesis. The landscape of high-resolution generative modeling is rapidly evolving, and SANA remains at a compelling intersection of speed, quality, and accessibility.
Additional image: Paper2Video presentation thumbnail
Appendix: Quick References
Official documentation and resources
Documentation: https://nvlabs.github.io/Sana/docs/
Model Zoo: https://nvlabs.github.io/Sana/docs/model_zoo/
SANA-Video documentation: https://nvlabs.github.io/Sana/docs/sana_video/
ComfyUI guidance: https://nvlabs.github.io/Sana/docs/ComfyUI/comfyui/
SGLang integration: https://nvlabs.github.io/Sana/docs/sglang/
Community and demonstrations
Demo pages and spaces: https://nv-sana.mit.edu/, https://huggingface.co/spaces/Efficient-Large-Model/SanaSprint
Discord community: https://discord.gg/rde6eaE5Ta
Representative publications and arXiv entries
Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer (arXiv:2410.10629)
SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute (arXiv:2501.18427)
SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation (arXiv:2503.09641)
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer (arXiv:2509.24695)
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling (arXiv:2604.06916)
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer (arXiv:2605.15178)
BibTeX (for reference)
Sana paper: Xie et al., Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer (arXiv:2410.10629)
SANA 1.5: Xie et al., SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute (arXiv:2501.18427)
SANA-Sprint, Sana-Video, and related works: additional citations available in project documentation and arXiv entries
Whether you’re a researcher chasing efficiency breakthroughs or a developer building creative tools, SANA offers a robust, transparent, and scalable path toward high-fidelity image and video generation. The journey from GPU-limited experiments to practical production-ready diffusion pipelines is now more accessible, with a rich set of resources, community support, and a forward-looking roadmap guiding continuous improvement.
Repository: https://github.com/NVlabs/Sana