Real-Time Object Detection: YOLOE and YOLOv10

YOLOE and YOLOv10: Real-Time Open-Set Detection and End-to-End Object Detection

In this blog post, we explore the latest updates from the YOLO family, focusing on two strands: YOLOE, an efficient, unified, open-set detection and segmentation model built for "real-time seeing anything" through multiple prompt mechanisms; and YOLOv10, a real-time detector that is end-to-end in both training and inference. The accompanying visuals from the official project pages illustrate how these systems compare, how their architectures differ, and how they perform on popular benchmarks. The sections below cover the main innovations, the reported results, and practical usage.

Latest Updates: YOLOE — Real-Time Seeing Anything

Overview

YOLOE is designed to break free from predefined category constraints and deliver real-time detection and segmentation across open scenarios. The model emphasizes efficiency, unification, and openness, enabling open prompts through multiple modalities:

Text prompts
Visual prompts
Prompt-free operation

A central claim is that YOLOE adds zero inference and transfer overhead compared with traditional closed-set YOLOs, which makes it attractive for real-time deployment where latency, cost, and flexibility are critical.

Visual snapshot

The comparison figure contrasts YOLOE with YOLO-Worldv2 on open text prompts in terms of performance, training cost, and inference efficiency. It shows at a glance where YOLOE gains efficiency and how its accuracy compares with that reference.

[Image: comparison.svg]

YOLOE’s prompt mechanisms and core strategies

Text prompts: RepRTA (Re-parameterizable Region-Text Alignment)
Refines pretrained textual embeddings via a lightweight auxiliary network that is re-parameterizable.
Improves visual-textual alignment, and because the auxiliary network can be folded away after training, text prompts add zero inference and transfer overhead.
The aim is to maintain strong open-set performance without sacrificing real-time latency.
Visual prompts: SAVPE (Semantic-Activated Visual Prompt Encoder)
Uses decoupled semantic and activation branches to enhance visual embeddings.
Improves accuracy with a compact design, keeping complexity in check.
The approach optimizes how visual information is activated and parsed for open-set prompts.
Prompt-free scenario: LRPC (Lazy Region-Prompt Contrast)
Builds on a built-in large vocabulary and a specialized embedding to identify all objects without relying on external language models.
Targets end-to-end efficiency by avoiding dependencies that can slow down inference.
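
To make the "re-parameterizable" idea concrete, here is a minimal sketch in plain Python. It assumes a toy auxiliary network consisting of a single residual linear layer; the function names (`refine`, `fold_into_classifier`, and so on) and all numbers are illustrative and are not taken from the YOLOE code base. The point it tries to show is that, once the prompts are fixed, the refined text embeddings can be computed once and stored as ordinary classifier weights, so the inference path is the same set of dot products a closed-set head would run.

```python
# Toy illustration of re-parameterization: an auxiliary network refines text
# embeddings during training; afterwards its output is folded into plain
# classifier weights. All names and numbers here are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def refine(text_embedding, aux_weights):
    """Auxiliary network: residual linear refinement of one text embedding."""
    delta = matvec(aux_weights, text_embedding)
    return [t + d for t, d in zip(text_embedding, delta)]

def score_with_aux(region_feature, text_embeddings, aux_weights):
    """Training-time path: refine every prompt, then take dot products."""
    refined = [refine(t, aux_weights) for t in text_embeddings]
    return matvec(refined, region_feature)

def fold_into_classifier(text_embeddings, aux_weights):
    """One-off step after training: bake the refinement into fixed weights."""
    return [refine(t, aux_weights) for t in text_embeddings]

def score_folded(region_feature, classifier_weights):
    """Inference-time path: identical in form to a closed-set linear head."""
    return matvec(classifier_weights, region_feature)

text_embeddings = [[1.0, 0.0], [0.0, 1.0]]   # two prompts, e.g. "cat" and "dog"
aux_weights = [[0.1, 0.2], [0.0, -0.1]]      # toy auxiliary parameters
region_feature = [0.5, 2.0]                  # one region's visual feature

classifier_weights = fold_into_classifier(text_embeddings, aux_weights)
scores_train = score_with_aux(region_feature, text_embeddings, aux_weights)
scores_infer = score_folded(region_feature, classifier_weights)
```

Both paths produce the same scores; only the folded path is needed at deployment, which is one way to read the "zero overhead" claim.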

Performance highlights and transferability

LVIS transfer and efficiency: On LVIS, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP at roughly one third of the training cost and with about 1.4x faster inference. Open-set capability therefore does not have to come at the expense of training or inference cost.
COCO transfer: YOLOE-v8-L achieves gains of approximately 0.6 APb and 0.4 APm relative to the closed-set YOLOv8-L, with nearly 4x less training time. The gains demonstrate that open-set, prompt-based detection can deliver competitive or superior performance with substantial training efficiency.

Supporting visuals

A pipeline or visualization image helps convey the architectural and flow differences between YOLOE’s prompt mechanisms and traditional detectors.

[Image: visualization.svg]

A natural question is how such a model behaves across real-world scenarios: open vocabulary objects, variable contexts, and the practical need for fast inference. The combination of RepRTA, SAVPE, and LRPC addresses different facets of this problem—textual alignment, semantic-rich visual embeddings, and open-ended object discovery—while preserving end-to-end efficiency.
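
As a rough illustration of the prompt-free path, the sketch below shows one possible reading of "lazy" contrast: a single specialized embedding first decides which anchors contain an object at all, and only those anchors are compared against the built-in vocabulary. The function `lazy_region_prompt_contrast`, the threshold, and the three-word vocabulary are assumptions made for this example, not YOLOE's implementation.

```python
# Hypothetical two-stage lookup: cheap objectness test first, vocabulary
# matching only for anchors that pass it.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def lazy_region_prompt_contrast(anchor_features, object_embedding, vocabulary, threshold):
    """Return (anchor_index, label) pairs for anchors judged to contain an object."""
    results = []
    for index, feature in enumerate(anchor_features):
        # Stage 1: compare against one specialized "any object" embedding.
        if dot(feature, object_embedding) < threshold:
            continue  # background anchors never touch the vocabulary
        # Stage 2: only surviving anchors are matched to the full vocabulary.
        label = max(vocabulary, key=lambda name: dot(feature, vocabulary[name]))
        results.append((index, label))
    return results

vocabulary = {
    "cat": [1.0, 0.0, 0.2],
    "dog": [0.0, 1.0, 0.2],
    "car": [0.0, 0.0, 1.0],
}
object_embedding = [0.5, 0.5, 0.5]
anchor_features = [
    [0.9, 0.1, 0.3],   # resembles the "cat" embedding
    [0.0, 0.1, 0.0],   # background, low objectness
    [0.1, 0.2, 1.5],   # resembles the "car" embedding
]

detections = lazy_region_prompt_contrast(
    anchor_features, object_embedding, vocabulary, threshold=0.5
)
print(detections)
```

With a large vocabulary, skipping stage 2 for background anchors is where the savings would come from, since most anchors in a typical image are background.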

YOLOv10: Real-Time End-to-End Object Detection

Official release snapshot

The YOLOv10 repository is the official PyTorch implementation of a real-time, end-to-end object detector. The project optimizes the model architecture and the training/inference workflow together to push the performance-efficiency boundary while minimizing latency and compute.

A key architectural and methodological claim is the move toward end-to-end, NMS-free training and inference through consistent dual assignments, which helps reduce post-processing latency and improve end-to-end throughput.

Architectural and performance highlights

Core claim: Real-time detection that is end-to-end in both training and inference.
NMS-free training: Consistent dual assignments enable competitive performance and low inference latency by avoiding the traditional non-maximum suppression post-processing bottleneck.
Holistic design: An efficiency-accuracy-driven strategy optimizes multiple components of the model, reducing computational overhead without sacrificing accuracy.
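
For context, the post-processing step that an NMS-free detector removes looks roughly like the following. This is a generic greedy non-maximum suppression routine written for illustration; it is not code from YOLOv10, and the boxes, scores, and threshold are made up.

```python
# Generic greedy NMS: keep the highest-scoring box, drop boxes that overlap
# it too much, repeat. An end-to-end head that predicts one box per object
# does not need this pass.

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept, in descending score order."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

The loop's cost depends on how many candidate boxes survive the score filter, which is why removing it can make latency both lower and more predictable.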

Performance snapshots

Compared with RT-DETR-R18 at similar AP on COCO, YOLOv10-S is about 1.8x faster, with 2.8x fewer parameters and FLOPs, underscoring stronger efficiency at a given accuracy point.
Compared with YOLOv9-C, YOLOv10-B delivers 46% lower latency and 25% fewer parameters for the same performance, illustrating meaningful improvements across model scales.

Notes and updates (context and maintenance)

2024/05/31: Guidance on export formats for benchmark comparisons. The note emphasizes using the exported format for fair speed measurements; non-exported formats may bias speed due to certain operations in the v10Detect path.
2024/05/30–05/31: Community clarifications and suggestions for detecting smaller objects or distant objects with YOLOv10; updates to model checkpoints with class names for easier use.
2024/06/01: Integration updates with C++/OpenVINO/OpenCV, thanks to contributions from multiple developers.
2024/06/01: HuggingFace hosting for models, enabling easier access and deployment via the HuggingFace Hub and Spaces ecosystem.
2024/05/29–05/31: Various demos and integrations, including Gradio demos for local use, integration with DeepSORT and other trackers, and support for ONNX weights and Transformers.js demos.

Demonstrated performance and model scale on COCO

YOLOv10-N: Test size 640; 2.3M parameters; 6.7 GFLOPs; APval 38.5%; latency 1.84 ms
YOLOv10-S: Test size 640; 7.2M parameters; 21.6 GFLOPs; APval 46.3%; latency 2.49 ms
YOLOv10-M: Test size 640; 15.4M parameters; 59.1 GFLOPs; APval 51.1%; latency 4.74 ms
YOLOv10-B: Test size 640; 19.1M parameters; 92.0 GFLOPs; APval 52.5%; latency 5.74 ms
YOLOv10-L: Test size 640; 24.4M parameters; 120.3 GFLOPs; APval 53.2%; latency 7.28 ms
YOLOv10-X: Test size 640; 29.5M parameters; 160.4 GFLOPs; APval 54.4%; latency 10.70 ms
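
One way to read these figures is to ask how much accuracy each step up in model scale buys per extra millisecond. The short script below does that arithmetic on the numbers listed above; the helper name `marginal_ap_per_ms` is only for this example.

```python
# (name, params in millions, GFLOPs, AP val in %, latency in ms),
# copied from the list above.
MODELS = [
    ("YOLOv10-N", 2.3, 6.7, 38.5, 1.84),
    ("YOLOv10-S", 7.2, 21.6, 46.3, 2.49),
    ("YOLOv10-M", 15.4, 59.1, 51.1, 4.74),
    ("YOLOv10-B", 19.1, 92.0, 52.5, 5.74),
    ("YOLOv10-L", 24.4, 120.3, 53.2, 7.28),
    ("YOLOv10-X", 29.5, 160.4, 54.4, 10.70),
]

def marginal_ap_per_ms(models):
    """AP gained per extra millisecond when stepping up one model scale."""
    steps = []
    for smaller, larger in zip(models, models[1:]):
        extra_ap = larger[3] - smaller[3]
        extra_ms = larger[4] - smaller[4]
        steps.append((smaller[0], larger[0], extra_ap / extra_ms))
    return steps

for small, large, gain in marginal_ap_per_ms(MODELS):
    print(f"{small} -> {large}: {gain:.2f} AP per extra ms")
```

On these numbers the return shrinks at every step, so the smaller variants offer the most accuracy per unit of latency, while the larger ones trade a lot of latency for the last few AP points.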

A pipeline snapshot

[Image: pipeline.svg]

Notes on usage, installation, and validation

Installation: A Conda-based virtual environment is recommended. Typical steps include creating a Python 3.9 environment and installing requirements, followed by installing the package in editable mode.
Code example:
- conda create -n yolov10 python=3.9
- conda activate yolov10
- pip install -r requirements.txt
- pip install -e .
Demo: A simple demo interface can be launched with a Python script (e.g., app.py) and accessed via a local URL (e.g., http://127.0.0.1:7860).
Validation: Pretrained models and evaluation commands are provided for several YOLOv10 variants (n, s, m, b, l, x) and corresponding dataset configurations. You can validate models on COCO or other datasets using the provided commands and scripts.
Example validation command:
- yolo val model=jameslahm/yolov10{n/s/m/b/l/x} data=coco.yaml batch=256
Training: The training workflow supports scalable distributed training across multiple devices and allows fine-tuning with pretrained weights if desired.
Example training flow:
- yolo detect train data=coco.yaml model=yolov10n/s/m/b/l/x.yaml epochs=500 batch=256 imgsz=640
Push to hub: You can push fine-tuned models to the Hugging Face Hub for sharing or deployment.
Example:
- model.push_to_hub("<your-hf-username-or-organization>/yolov10-finetuned-crop-detection")
- model.push_to_hub("<your-hf-username-or-organization>/yolov10-finetuned-crop-detection", private=True)
Prediction: You can run predictions using a model from Hugging Face or a local checkpoint, with simple APIs or CLI equivalents.
Example:
- yolo predict model=jameslahm/yolov10{n/s/m/b/l/x}
Export: YOLOv10 supports export to ONNX or TensorRT engines for optimized deployment.
End-to-End ONNX
- yolo export model=jameslahm/yolov10{n/s/m/b/l/x} format=onnx opset=13 simplify
End-to-End TensorRT
- yolo export model=jameslahm/yolov10{n/s/m/b/l/x} format=engine half=True simplify opset=13 workspace=16
- or use trtexec for engine generation and then predict with the engine
Together, the prediction and export commands give a short path from training to production deployment on common runtimes.
Acknowledgement: The code base is built with Ultralytics and RT-DETR, acknowledging their contributions and implementations as foundational to these advances.
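
The `{n/s/m/b/l/x}` shorthand in the commands above stands for six separate model variants, so each line must be expanded before it can be run. A small helper can do that mechanically; `expand_variants` is a hypothetical convenience function for this post and not part of the YOLOv10 repository.

```python
# Expand the "{n/s/m/b/l/x}" shorthand into one concrete command per variant.
PLACEHOLDER = "{n/s/m/b/l/x}"

def expand_variants(template):
    """Return one command string for each variant named in the placeholder."""
    variants = PLACEHOLDER.strip("{}").split("/")
    return [template.replace(PLACEHOLDER, variant) for variant in variants]

export_commands = expand_variants(
    "yolo export model=jameslahm/yolov10{n/s/m/b/l/x} format=onnx opset=13 simplify"
)
for command in export_commands:
    print(command)
```

The same function works for the prediction command, since it only substitutes the placeholder and leaves every other argument untouched.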

Citations and reference notes

If you rely on these models or code for your research or product, please cite the YOLOv10 paper:

BibTeX:
@article{wang2024yolov10,
  title={YOLOv10: Real-Time End-to-End Object Detection},
  author={Wang, Ao and Chen, Hui and Liu, Lihao and Chen, Kai and Lin, Zijia and Han, Jungong and Ding, Guiguang},
  journal={arXiv preprint arXiv:2405.14458},
  year={2024}
}

Practical reflections and closing thoughts

The YOLOE family demonstrates how open prompts and open-set reasoning can be integrated into a single, efficient model that handles detection and segmentation with minimal overhead. The RepRTA, SAVPE, and LRPC mechanisms address different aspects of the open-world problem—from text alignment to visual embeddings to prompt-free discovery—without sacrificing real-time performance.
YOLOv10 represents a pragmatic shift toward end-to-end, NMS-free object detection with a strong emphasis on efficiency across model scales. The reported gains in speed and parameter efficiency, together with competitive accuracy, position YOLOv10 as a compelling option for scalable deployment and rapid iteration in production environments.
The ecosystem surrounding both projects—demos, Colab notebooks, HuggingFace hubs, OpenVINO integration, and community-driven tutorials—facilitates experimentation, benchmarking, and practical adoption across diverse hardware and software stacks.

Latency vs. accuracy and size vs. accuracy visuals
[Image: latency.svg]
[Image: params.svg]

In summary, the latest YOLO family releases share a clear direction: real-time, end-to-end detection with open-world flexibility, while keeping inference latency, training cost, and deployment complexity under tight control. YOLOE offers a principled approach to open-set detection through text, visual, and prompt-free modes, with zero extra inference and transfer overhead. YOLOv10 advances end-to-end detection by delivering competitive accuracy at lower latency, with a pipeline that extends from training to production. Together, these updates continue the YOLO lineage toward more capable, flexible, and efficient real-time vision systems.

GitHub - THU-MIG/yolov10: Real-Time Object Detection: YOLOE and YOLOv10