Bilateral Reference for High-Resolution Dichotomous Image Segmentation

Bilateral Reference for High-Resolution Dichotomous Image Segmentation: A Close Look at BiRefNet

BiRefNet is a framework for high-resolution dichotomous image segmentation (DIS) that uses a bilateral reference mechanism to produce precise, pixel-level masks. It was developed by a team spanning Nankai University, Northwestern Polytechnical University, National University of Defense Technology, Aalto University, Shanghai AI Laboratory, and the University of Trento, and it targets difficult tasks such as dichotomous foreground-background separation, camouflaged object detection, and high-resolution matting. Below is a tour of the concept, architecture, datasets, capabilities, and practical usage of BiRefNet, with references to visuals originally shared in the project repository.

Representative samples from the input data

The BiRefNet repository showcases two DIS samples that illustrate the kind of high-contrast foregrounds and challenging backgrounds the model is designed to handle. They give a quick sense of the granularity and fidelity BiRefNet targets.

Overview: What BiRefNet Aims to Solve

The core goal is high-resolution dichotomous segmentation driven by a bilateral reference mechanism. As described in the paper, the model draws on two complementary references during reconstruction: hierarchical patches of the original image serve as the source reference, and gradient maps serve as the target reference, steering attention toward regions with fine detail. Together they yield precise, boundary-aware masks even at large image sizes.
The approach is designed to scale from 1024x1024 up to much higher resolutions, with dedicated variants for general use, for specialized tasks such as matting, and for dynamic-resolution workflows.

BiRefNet at a Glance: Key Concepts and Variants

Bilateral Reference: The central idea is to fuse complementary references that guide segmentation, enabling sharper boundaries and reduced misclassifications in cluttered scenes.
BiRefNet Dynamic: A version trained on a range of resolutions (from as low as 256x256 up to 2304x2304) to maintain robust performance across varied input sizes.
BiRefNet HR-matting: A high-resolution matting-focused variant tailored for precise alpha mattes on 2048x2048 images and beyond.
BiRefNet HR: A general-use high-resolution model that leverages substantial training data and high-capacity backbones to deliver strong results on large images.
Attention and Backbone Enhancements: The project notes that the Swin Transformer attention implementation was upgraded to PyTorch's official scaled dot-product attention (SDPA), which reduces memory usage and may accelerate training and inference. This fits the broader goal of making high-resolution dichotomous segmentation practical in real-world settings.
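For readers unfamiliar with SDPA, the operation computes softmax(QK^T / sqrt(d)) V. The following is a pure-Python reference sketch of that formula for a single attention head. It is illustrative only and is not the repository's code; in PyTorch the fused equivalent is `torch.nn.functional.scaled_dot_product_attention`, which on supported backends can avoid storing the full attention matrix.

```python
import math


def scaled_dot_product_attention(q, k, v):
    """Reference attention for one head.

    q: list of n query vectors; k: list of m key vectors of the same width d;
    v: list of m value vectors. Returns a list of n output vectors.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        # Scaled similarity of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        # Numerically stable softmax over the keys.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out
```

The nested loops make the quadratic cost in sequence length explicit, which is the cost that fused kernels are designed to handle more economically.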

Architecture and Processing Flow: An Intuitive Picture

Localization and reconstruction modules: Per the paper, a localization module (LM) uses global semantic information to locate the object, and a reconstruction module (RM) applies the bilateral reference while rebuilding the mask. This division of labor helps BiRefNet resolve ambiguous regions, especially around fine edges and translucent boundaries.
Efficient high-resolution handling: The system is designed to work with large input sizes without prohibitive memory demands. Key optimizations revolve around attention mechanisms and processing strategies that preserve accuracy while controlling resource usage.
Box-guided and general-use paths: The model supports box-guided segmentation workflows, enabling quick, user-friendly prompts for interactive or semi-automatic segmentation tasks. This makes BiRefNet appealing for practical applications where quick turnarounds and GUI-based interactions matter.
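One plausible way to build a box-guided pipeline around any segmentation model is to crop the image to a slightly padded box, segment the crop, and paste the resulting mask back into a full-size canvas. The sketch below shows only that geometry, using nested lists as stand-ins for images. The `segment_fn` callable, the padding ratio, and the helper names are hypothetical and are not taken from the BiRefNet code.

```python
def expand_box(box, width, height, pad_ratio=0.1):
    """Pad an (x0, y0, x1, y1) box by pad_ratio of its size, clamped to the image."""
    x0, y0, x1, y1 = box
    pad_x = int((x1 - x0) * pad_ratio)
    pad_y = int((y1 - y0) * pad_ratio)
    return (max(0, x0 - pad_x), max(0, y0 - pad_y),
            min(width, x1 + pad_x), min(height, y1 + pad_y))


def box_guided_mask(image, box, segment_fn, pad_ratio=0.1):
    """Segment only the boxed region and return a full-size mask.

    image: 2D list of pixel values (rows of columns).
    segment_fn: hypothetical callable mapping a crop to a same-sized mask.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = expand_box(box, width, height, pad_ratio)
    crop = [row[x0:x1] for row in image[y0:y1]]
    crop_mask = segment_fn(crop)
    # Everything outside the padded box is treated as background (0).
    full = [[0] * width for _ in range(height)]
    for dy, mask_row in enumerate(crop_mask):
        full[y0 + dy][x0:x1] = mask_row
    return full
```

Padding the box gives the model some context around the object, at the cost of a larger crop to process.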

Datasets, Training, and Model Settings: What Powers BiRefNet

Core task datasets (for reference and benchmarking):
DIS: Dichotomous segmentation datasets used for foreground-background separation in natural scenes.
COD: Camouflaged Object Detection, where the foreground blends with the background.
HRSOD: High-Resolution Salient Object Detection tasks that stress high-resolution performance.
Training and evaluation setups:
The project describes training on diverse task setups, including general-use configurations at large scales (e.g., 2048x2048) and custom-task arrangements for matting and background removal.
Backbones considered include variants of Swin Transformers and other modern architectures, with detailed notes on model zoo configurations and the corresponding training/test splits.
Weights and availability:
A wide range of pre-trained weights is provided for different tasks (DIS, COD, HRSOD) and general-use scenarios.
The weights are organized in “exp-TASK_SETTINGS” folders, with results and evaluation logs alongside the model checkpoints.
There are guidelines for converting PyTorch weights to ONNX for broader deployment, along with notes about inference-time trade-offs and compatibility considerations (CUDA/cuDNN versions, etc.).

From Research to Real-World Use: Inference, Demos, and Deployments

Accessibility via one-line loading on HuggingFace:
BiRefNet can be loaded with a single line of code, enabling rapid experimentation and integration into pipelines.
The project's Python example loads the model in a single command and follows it with a straightforward inference flow.
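A minimal sketch of such a load-and-infer flow is given here, assuming the `transformers`, `torch`, `torchvision`, and `Pillow` packages are installed. The repo id, the 1024x1024 input size, and the ImageNet normalization follow the pattern commonly shown for this model and should be checked against the current model card. The function names are illustrative, and heavy imports are deferred into the functions so the module can be imported without them.

```python
MODEL_ID = "ZhengPeng7/BiRefNet"  # assumed Hugging Face repo id; verify on the model card
INPUT_SIZE = (1024, 1024)         # assumed inference resolution for this checkpoint


def load_birefnet(model_id=MODEL_ID, device="cpu"):
    """Load BiRefNet from the Hugging Face Hub (downloads weights on first use)."""
    from transformers import AutoModelForImageSegmentation
    model = AutoModelForImageSegmentation.from_pretrained(
        model_id, trust_remote_code=True)
    return model.to(device).eval()


def remove_background(model, image_path, device="cpu"):
    """Return the image at image_path as RGBA, with the predicted mask as alpha."""
    import torch
    from PIL import Image
    from torchvision import transforms

    preprocess = transforms.Compose([
        transforms.Resize(INPUT_SIZE),
        transforms.ToTensor(),
        # ImageNet mean/std normalization.
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        # The model returns a list of outputs; the last entry holds the final mask logits.
        pred = model(batch)[-1].sigmoid().cpu()[0].squeeze()
    # Resize the mask back to the original image size and attach it as alpha.
    mask = transforms.ToPILImage()(pred).resize(image.size)
    image.putalpha(mask)
    return image
```

Typical use would be `model = load_birefnet(device="cuda")` followed by `remove_background(model, "photo.jpg", device="cuda").save("cutout.png")`, where both file names are placeholders.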
Online demos and APIs:
BiRefNet offers inference and evaluation interfaces via a Colab notebook, a browser-based GUI in HuggingFace Spaces, and a hosted API on Fal AI for easy online usage.
This makes it straightforward to test weights, compare results, and deploy segmentation in small to large production settings.
On-device and accelerated deployment:
ONNX conversion provides an option to run BiRefNet in optimized environments where runtime efficiency is critical.
There are third-party integrations and projects that port BiRefNet to ComfyUI, InvokeAI, Blender add-ons, Stable Diffusion WebUI, and other popular tools, illustrating the ecosystem that surrounds BiRefNet.

Model Zoo: Weights, Tasks, and Practical Guidance

General-use configurations (2048x2048):
Trained on combined datasets such as AIM-500, DIS-TR, DIS-TEs, HIM2K, and others to support robust real-world performance across varied domains.
Evaluation on DIS-VD provides a standardized reference for comparison.
Task-specific checkpoints:
DIS, COD, and HRSOD checkpoints exist for both traditional task setups and extended variants involving custom data and domain-specific training.
For matting-focused tasks, specialized checkpoints tied to matting datasets (e.g., P3M-10k, with TE-P3M-500-NP used for testing) support high-precision foreground extraction and edge fidelity.
Box-guided segmentation and efficiency-focused variants:
The model supports guidance with bounding boxes, enabling box-driven segmentation pipelines.
Efficiency-oriented configurations include lightweight backbones and exported formats (e.g., ONNX, TensorRT) for faster inference on consumer hardware.
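A matte differs from a binary mask in that each pixel stores a fractional opacity. As a small illustration of how a matte from a matting-focused checkpoint would typically be consumed downstream, the sketch below applies the standard compositing formula C = alpha * F + (1 - alpha) * B. The helper names and the nested-list image representation are assumptions made for this example.

```python
def composite_pixel(foreground, background, alpha):
    """Blend two RGB tuples with opacity alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b
                 for f, b in zip(foreground, background))


def composite(foreground, background, matte):
    """Composite same-sized images (2D lists of RGB tuples) using a 2D alpha matte."""
    return [[composite_pixel(f, b, a) for f, b, a in zip(f_row, b_row, a_row)]
            for f_row, b_row, a_row in zip(foreground, background, matte)]
```

Fractional alpha values along hair or translucent edges are what let the foreground blend into a new background without a hard outline.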

Third-Party Ecosystem and Community Contributions

BiRefNet has inspired a broad ecosystem of third-party adaptations and integrations, reflecting its practical utility and community interest:

Applications and UI integrations:
ComfyUI has integrated BiRefNet, making high-resolution background removal more accessible to users who prefer node-based workflows.
Unscreen and other online services deploy BiRefNet as a backend model for real-time video background removal.
Inference and deployment tools:
invoke_birefnet integrates BiRefNet into the InvokeAI framework for flexible, scriptable workflows.
Blender add-ons combine BiRefNet with FLUX to generate AI-based 2D cutout assets for pre-visualization and other production tasks.
UI/UX improvements:
Projects extend BiRefNet’s UI within ComfyUI, provide demo GIFs, and offer more streamlined user experiences for non-programmers.
Cross-platform deployments:
Stable Diffusion WebUI, TorchScript, GGUF formats, and TensorRT variants show how BiRefNet can be adopted across a spectrum of hardware and software environments.

Representative Visuals and Qualitative Results

The BiRefNet repository includes rich qualitative results that illustrate segmentation quality across tasks:
Representative quantitative and qualitative visuals compare baselines and BiRefNet’s outputs, often highlighting sharper boundaries and more faithful foreground masks.
Qualitative samples showcase BiRefNet’s ability to delineate fine details in complex scenes, supporting the strong claims about high-resolution performance.
Sample visuals:
Qualitative example: a detailed binary mask produced by BiRefNet on a challenging image, demonstrating edge fidelity and accurate foreground capture.
In addition, several side-by-side comparisons and result galleries demonstrate improvements over prior approaches.

Usage and Practical Guidelines: How to Get Started

Environment setup:
A modern PyTorch environment (2.5.0 or newer) is recommended to leverage fast training and compilation features.
A recommended setup involves creating a dedicated conda environment and installing the repository requirements.
Dataset preparation:
DIS, COD, and HRSOD datasets can be downloaded from official pages or pre-packaged bundles in the repository.
For training efficiency, a combined dataset approach is suggested, with images and ground-truth masks organized in im and gt subfolders under task-specific folders.
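Before training, it can help to confirm that every image has a matching mask. The sketch below assumes a layout in which each split directory contains `im` and `gt` subfolders and that an image and its mask share a file stem. That naming rule and the function name are assumptions for illustration, so adjust them to the layout actually used.

```python
from pathlib import Path


def pair_images_and_masks(split_dir):
    """Pair files in <split_dir>/im with same-stem files in <split_dir>/gt.

    Returns (pairs, missing), where missing lists image names without a mask.
    """
    split_dir = Path(split_dir)
    masks = {p.stem: p for p in (split_dir / "gt").iterdir() if p.is_file()}
    pairs, missing = [], []
    for image in sorted((split_dir / "im").iterdir()):
        if not image.is_file():
            continue
        mask = masks.get(image.stem)
        if mask is None:
            missing.append(image.name)
        else:
            pairs.append((image, mask))
    return pairs, missing
```

A non-empty `missing` list is a cheap early warning that a download or copy step was incomplete.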
Training and evaluation workflow:
The provided scripts enable end-to-end training, testing, and evaluation across multiple tasks.
After evaluation, a script allows selecting the best checkpoint based on a chosen metric: S-measure (S), weighted F-measure (wF), or Human Correction Efforts (HCE), depending on the task.
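The selection logic can be sketched in a few lines. Note the direction of each metric: higher is better for S and wF, while lower is better for HCE. The `results` mapping format and the function name are hypothetical, and the repository's own script may read its evaluation logs differently.

```python
# Whether a larger value is better for each metric.
HIGHER_IS_BETTER = {"S": True, "wF": True, "HCE": False}


def select_best_checkpoint(results, metric):
    """Return the checkpoint name with the best score for the chosen metric.

    results: hypothetical mapping {checkpoint_name: {metric_name: value}}.
    """
    if metric not in HIGHER_IS_BETTER:
        raise ValueError(f"unknown metric: {metric}")
    scored = {name: vals[metric] for name, vals in results.items() if metric in vals}
    if not scored:
        raise ValueError(f"no checkpoint reports {metric}")
    # Maximize score-like metrics, minimize effort-like ones.
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(scored, key=scored.get)
```

Because the metrics can disagree, the checkpoint chosen by S is not guaranteed to be the one chosen by HCE.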
Fine-tuning on custom data:
A step-by-step tutorial is included for fine-tuning on custom data, covering directory structure, configuration changes, and how to resume training from existing weights.
A short instructional video is available to guide users through the fine-tuning process.
Inference options:
PyTorch-based inference remains straightforward, with optional ONNX and TensorRT paths for accelerated deployment.
Inference can be run on a single GPU or scaled across multiple GPUs as needed, with guidance on memory usage and batch sizing.
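When choosing a batch size, a rough starting point is the size of the input batch itself and how pixel count grows with resolution. The sketch below computes only those two quantities. Real memory use is dominated by intermediate activations and weights, so treat the result as a lower bound. The function names are illustrative.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def input_batch_bytes(batch_size, height, width, channels=3, dtype="fp32"):
    """Bytes needed just to hold one input batch as a dense tensor."""
    return batch_size * channels * height * width * BYTES_PER_ELEMENT[dtype]


def relative_pixel_cost(height, width, base=1024):
    """How many times more pixels an input has than a base x base image."""
    return (height * width) / (base * base)
```

For example, `relative_pixel_cost(2048, 2048)` is 4.0, so moving from 1024x1024 to 2048x2048 quadruples the pixels per image, and the batch size usually has to shrink accordingly.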

Representative Results: Quantitative and Visual Highlights

The project materials report state-of-the-art performance for BiRefNet on DIS, COD, and HRSOD benchmarks.
The reported results, covering both global metrics and region-level fidelity, illustrate the model's strength in preserving intricate boundaries and foreground shapes, even in challenging camouflaged and cluttered scenes.
A collection of sample visuals and comparison images helps readers quickly gauge the practical impact of the bilateral reference strategy on real-world images.

Acknowledgements and Collaboration Context

The BiRefNet project acknowledges contributions from a wide network of partners and supporters, including FAL, Freepik, Redmond.ai, and Alibaba-ICBU, among others.
The work also highlights the collaborative spirit of the research community, with many third-party integrations and user-driven improvements enriching the BiRefNet ecosystem.

Citation and Contact Information

If you plan to reference BiRefNet in a publication or presentation, use the standard bibliographic entry:
Zheng, Peng; Gao, Dehong; Fan, Deng-Ping; Liu, Li; Laaksonen, Jorma; Ouyang, Wanli; Sebe, Nicu (2024). Bilateral Reference for High-Resolution Dichotomous Image Segmentation. CAAI Artificial Intelligence Research, 3, 9150038.
For inquiries, collaboration opportunities, or support:
Email: zhengpeng0108@gmail.com
Calendly: calendly.com/zhengpeng0108/30min
Discord: https://discord.gg/d9NN5sgFrq

In Summary: Why BiRefNet Matters for High-Resolution Dichotomous Segmentation

BiRefNet represents a thoughtful synthesis of bilateral references, high-resolution capability, and practical deployment paths. It bridges cutting-edge research with accessible tools and community-driven enhancements, enabling researchers and practitioners to achieve precise, robust segmentation on large images and in demanding scenarios.
The suite of variants, from BiRefNet Dynamic to BiRefNet HR and BiRefNet HR-matting, offers flexible options for different applications, whether the goal is foreground extraction in images, clean mattes for video, or box-guided segmentation for interactive workflows.
With online demos, downloadable weights, an extensive model zoo, and a thriving ecosystem of third-party integrations, BiRefNet is well-positioned to influence both academic research and real-world image editing, film production, and automated analysis tasks.

Figures referenced in this article (images not reproduced here)

Two representative DIS samples (DIS Sample 1 and DIS Sample 2)
Comparison figure: visual references used to benchmark DIS methods
ComfyUI integration image
Invoke workflow example
BiRefNet model zoo visuals and comparisons
Qualitative results gallery (sample)

If you’re exploring high-resolution segmentation challenges and want a framework that balances accuracy with practical deployment, BiRefNet offers a compelling blueprint. It demonstrates how bilateral cues, when combined with scalable architectures and thoughtful tooling, can push the boundaries of what is feasible in dichotomous image segmentation on modern hardware. Whether you’re conducting research, building production-grade imaging tools, or integrating segmentation into creative pipelines, BiRefNet provides a robust starting point and a rich ecosystem to grow with.


GitHub - ZhengPeng7/BiRefNet: Bilateral Reference for High-Resolution Dichotomous Image Segmentation
