Real-Time Speech-to-Text Library for Applications
Detailed Description of RealtimeSTT: A Low-Latency Speech-to-Text Library for Real-Time Applications
Introduction
RealtimeSTT is a powerful, open-source speech-to-text library designed for real-time applications such as voice assistants, interactive user interfaces, and automated transcription systems. Developed by Kolja Beigel, this project builds upon the foundational work of Linguflex, an advanced voice-controlled environment assistant. RealtimeSTT leverages cutting-edge technologies to provide accurate, low-latency speech recognition, making it ideal for applications requiring immediate audio-to-text conversion.
The library supports multiple features, including voice activity detection (VAD), wake word activation, and real-time transcription, while offering flexibility through customizable configurations. Below is a comprehensive exploration of its architecture, functionalities, installation process, and practical use cases.
Key Features and Capabilities
1. Real-Time Speech-to-Text Conversion
RealtimeSTT employs Faster Whisper, a GPU-accelerated speech recognition model, to transcribe audio in real-time with minimal latency. The library ensures smooth performance even under high computational loads by optimizing batch processing and leveraging parallel computing where possible.
2. Voice Activity Detection (VAD)
The system uses a combination of WebRTC VAD and SileroVAD for detecting when speech begins and ends. This prevents unnecessary transcription during background noise, improving efficiency and reducing computational overhead.
- SileroVAD Sensitivity: adjustable from 0 to 1; the default is 0.6.
- WebRTC VAD Sensitivity: configurable via an integer from 0 to 3; the default is 3.
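The gating behavior that these sensitivities tune can be illustrated with a toy detector. The sketch below shows the general VAD pattern only (a per-frame score thresholded, plus "hangover" frames that keep the gate open briefly after speech drops so trailing words are not clipped); it is not Silero's or WebRTC's actual algorithm, and the threshold and hangover values are illustrative assumptions.

```python
# Toy voice-activity gate illustrating the general VAD pattern:
# a per-frame speech score is thresholded, and a "hangover" keeps the
# gate open for a few frames after speech drops, so trailing words are
# not cut off. NOT Silero or WebRTC internals, just the gating idea.

def gate_frames(scores, threshold=0.6, hangover=2):
    """Return a speech/no-speech flag for each frame score."""
    flags = []
    remaining = 0  # hangover frames left after the last speech frame
    for s in scores:
        if s >= threshold:
            flags.append(True)
            remaining = hangover
        elif remaining > 0:
            flags.append(True)   # still inside the hangover window
            remaining -= 1
        else:
            flags.append(False)
    return flags

scores = [0.1, 0.7, 0.8, 0.2, 0.1, 0.1, 0.9]
print(gate_frames(scores))
# [False, True, True, True, True, False, True]
```

Note how the two low-score frames after speech are still flagged as speech because of the hangover, while the third is not.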
3. Wake Word Activation
RealtimeSTT supports wake word detection using either:
- Porcupine (a lightweight, efficient wake word engine)
- OpenWakeWord (supports custom ONNX/TFLite models)
Users can define specific keywords (e.g., "Alexa," "Jarvis") to trigger recording. The system buffers audio before transcription starts, ensuring clean detection.
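The pre-trigger buffering described above can be sketched with a fixed-size ring buffer: every incoming chunk is stored, and once the wake word fires, the recent history is prepended to the recording. The chunk count and callback shape below are illustrative assumptions, not RealtimeSTT's actual internals.

```python
from collections import deque

# Sketch of pre-trigger audio buffering: keep the last N chunks in a
# ring buffer so that, once the wake word fires, audio from just
# *before* the trigger is included. Buffer size is an illustrative
# assumption, not RealtimeSTT's actual configuration.

PRE_TRIGGER_CHUNKS = 4            # how many chunks of history to keep
ring = deque(maxlen=PRE_TRIGGER_CHUNKS)

def on_chunk(chunk, wake_word_fired):
    """Buffer every chunk; on trigger, return history plus current chunk."""
    ring.append(chunk)
    if wake_word_fired:
        return b"".join(ring)     # history already includes this chunk
    return None

# Feed six one-byte chunks; the wake word fires on the last one.
out = None
for i, fired in enumerate([False] * 5 + [True]):
    out = on_chunk(bytes([i]), fired)

print(out)  # b'\x02\x03\x04\x05' -- the last 4 chunks survive
```

Because `deque(maxlen=...)` silently discards the oldest entries, memory stays bounded no matter how long the system idles before the wake word.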
4. Asynchronous and Callback-Based Processing
The library provides callbacks for different events:
- on_recording_start / on_recording_stop
- on_transcription_start / on_transcription_stabilized
- on_wakeword_detected
- on_vad_start / on_vad_stop
This allows developers to handle transcription results dynamically, such as typing spoken text into an application or triggering actions based on detected phrases.
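The event-driven style above follows a standard callback-registry pattern, sketched below. The event names mirror those listed, but the dispatcher itself is an illustration of the pattern, not the library's implementation.

```python
# Minimal event-callback dispatcher illustrating how hooks like
# on_recording_start can be wired up. The event names mirror
# RealtimeSTT's, but this dispatcher is an illustrative sketch,
# not the library's code.

class Events:
    def __init__(self):
        self._handlers = {}

    def on(self, event, handler):
        """Register a handler for an event name."""
        self._handlers.setdefault(event, []).append(handler)

    def fire(self, event, *args):
        """Invoke every handler registered for this event."""
        for handler in self._handlers.get(event, []):
            handler(*args)

log = []
events = Events()
events.on("on_recording_start", lambda: log.append("recording"))
events.on("on_transcription_stabilized", lambda text: log.append(text))

events.fire("on_recording_start")
events.fire("on_transcription_stabilized", "hello world")
print(log)  # ['recording', 'hello world']
```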
5. Batch and Real-Time Transcription
Users can choose between:
- Batch Processing: Processes audio in chunks for efficiency.
- Real-Time Transcription: Continuously updates text as speech is recorded (requires GPU acceleration).
The realtime_model_type parameter allows selecting a smaller, faster model for real-time use while maintaining accuracy with the primary transcription engine.
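The draft/finalize split behind this parameter can be sketched as a two-stage pipeline: a fast model produces provisional text while audio streams in, and a larger model re-transcribes the full utterance when speech ends. The two "models" below are stub functions standing in for Whisper calls, purely to show the pattern.

```python
# Sketch of the draft/finalize split behind realtime_model_type:
# a fast model produces provisional text while audio streams in, and
# a larger model re-transcribes the whole utterance at the end.
# Both "models" here are stubs, not actual Whisper calls.

def fast_model(chunk_words):
    return " ".join(chunk_words).lower()            # quick, rough draft

def accurate_model(all_words):
    return " ".join(all_words).capitalize() + "."   # slower, final pass

# Simulated audio stream: each step sees a longer hypothesis.
stream = [["HELLO"], ["HELLO", "there"], ["HELLO", "there", "friend"]]

drafts = [fast_model(chunk) for chunk in stream]    # shown live to the user
final = accurate_model(stream[-1])                  # emitted once speech ends

print(drafts[-1])  # hello there friend
print(final)       # Hello there friend.
```

The user sees the cheap drafts immediately; the accurate pass overwrites them once, which is why the final model's latency matters far less than the draft model's.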
Technical Architecture and Dependencies
Core Technologies Used
| Component | Purpose |
|--------------------------|------------------------------------------------------------------|
| Faster Whisper | GPU-accelerated speech recognition (supports multiple languages). |
| WebRTC VAD | Initial voice activity detection. |
| SileroVAD | More accurate end-of-speech detection in noisy environments. |
| Porcupine / OpenWakeWord | Wake word detection for activating recording. |
| PyTorch (CUDA) | Accelerates model inference on GPUs. |
Supported Models
RealtimeSTT supports various Whisper models, categorized by size and language:
- Smallest: tiny, tiny.en (lightweight, fast)
- Small: base, base.en
- Larger: medium, medium.en, large-v1, large-v2
Models are downloaded from Hugging Face Hub if not pre-installed.
Installation and Setup
Prerequisites
Before installing RealtimeSTT, ensure the following dependencies are met:
System Requirements
- Linux: install Python 3.7+ and the audio headers with sudo apt-get install python3-dev portaudio19-dev.
- macOS: use Homebrew to install PortAudio (brew install portaudio).
- Windows: ensure your audio drivers support PyAudio.
GPU Acceleration (Recommended)
For optimal performance, RealtimeSTT requires a compatible NVIDIA GPU with CUDA support. Follow these steps:
- Install the NVIDIA CUDA Toolkit.
- Install cuDNN (version matching your CUDA toolkit, e.g., cuDNN v8.7 for CUDA 11.x).
- Upgrade PyTorch with CUDA support:
pip install torch==2.5.1+cu118 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
(Replace cu118 with your CUDA version, e.g., cu121 for CUDA 12.x.)
Optional: FFmpeg
While not strictly required, installing FFmpeg can improve audio processing:
- Linux (Ubuntu/Debian):
sudo apt update && sudo apt install ffmpeg
- MacOS:
brew install ffmpeg
Installation Command
pip install RealtimeSTT
This installs all dependencies, including a CPU-only version of PyTorch. For GPU acceleration, ensure CUDA is properly configured.
Usage Examples
1. Basic Audio-to-Text Conversion
To print every spoken word in real-time:
from RealtimeSTT import AudioToTextRecorder

def process_text(text):
    print(text)

if __name__ == "__main__":
    print("Speak now...")
    recorder = AudioToTextRecorder()
    while True:
        recorder.text(process_text)
Output: Continuously updates the console with transcribed text.
2. Typing Spoken Text into a Window
from RealtimeSTT import AudioToTextRecorder
import pyautogui

def process_text(text):
    pyautogui.typewrite(text + " ")  # Adds a space after each transcribed segment

if __name__ == "__main__":
    print("Speak now...")
    recorder = AudioToTextRecorder()
    while True:
        recorder.text(process_text)
Use Case: Automatically types spoken commands into a text field.
3. Wake Word Activation
Trigger recording with a predefined keyword (e.g., "Jarvis"):
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(wake_words="jarvis")
    print("Say 'Jarvis' to start...")
    while True:
        print(recorder.text())
Output: Starts recording only after "Jarvis" is spoken.
4. Manual Recording Control
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder()
    recorder.start()                    # Begin recording
    input("Press Enter to stop...")     # Block until the user presses Enter
    recorder.stop()
    print("Final transcription:", recorder.text())
Use Case: Manual control over recording sessions.
5. Feeding Raw Audio Chunks
For offline or custom audio processing:
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(use_microphone=False)
    with open("audio_chunk.pcm", "rb") as f:
        audio_chunk = f.read()
    recorder.feed_audio(audio_chunk)
    print("Transcription:", recorder.text())
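For long recordings, streaming the file in fixed-duration chunks is friendlier than reading it whole. The sketch below computes a chunk size for 16 kHz, 16-bit mono PCM and iterates a byte stream; the audio format is an assumption for illustration (check RealtimeSTT's documentation for its expected input format), and the code deliberately avoids depending on the library itself.

```python
import io

# Sketch of chunked PCM feeding: instead of reading the whole file at
# once, stream fixed-duration chunks. Assumes 16 kHz, 16-bit mono PCM;
# this format is an assumption for illustration, not a documented
# RealtimeSTT requirement.

SAMPLE_RATE = 16000      # samples per second (assumed)
BYTES_PER_SAMPLE = 2     # 16-bit audio
CHUNK_MS = 30            # 30 ms of audio per chunk

chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 960 bytes

def iter_chunks(stream, size=chunk_bytes):
    """Yield fixed-size byte chunks until the stream is exhausted."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk

# One second of silence as stand-in audio data.
audio = io.BytesIO(b"\x00" * SAMPLE_RATE * BYTES_PER_SAMPLE)
chunks = list(iter_chunks(audio))
print(len(chunks), chunk_bytes)  # 34 960  (33 full chunks + 1 partial)
```

Each chunk could then be passed to something like `feed_audio` in turn, keeping memory use flat regardless of file length.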
Advanced Configuration Options
1. Real-Time Transcription
Enable continuous updates during recording:
from RealtimeSTT import AudioToTextRecorder

def on_realtime_update(text):
    print(f"Real-time: {text}")

if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        enable_realtime_transcription=True,
        on_realtime_transcription_update=on_realtime_update
    )
    recorder.start()
Note: Real-time transcription is GPU-intensive; disable if performance degrades.
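The idea behind "stabilized" text can be illustrated with a word-level common prefix: successive real-time hypotheses may revise recent words, so only the prefix that consecutive hypotheses agree on is treated as settled. This is a sketch of the concept, not RealtimeSTT's actual stabilization logic.

```python
# Sketch of transcript "stabilization": successive real-time
# hypotheses may revise their most recent words, so only the
# word-level prefix that consecutive hypotheses agree on is
# treated as stable. Concept illustration only, not RealtimeSTT's
# actual algorithm.

def stable_prefix(prev, curr):
    """Return the longest common word-level prefix of two hypotheses."""
    out = []
    for a, b in zip(prev.split(), curr.split()):
        if a != b:
            break
        out.append(a)
    return " ".join(out)

h1 = "the quick brown fax"        # earlier, partly wrong hypothesis
h2 = "the quick brown fox jumps"  # later hypothesis revises "fax"
print(stable_prefix(h1, h2))      # the quick brown
```

Downstream consumers (e.g., a typing integration) can commit only the stable prefix and keep the volatile tail provisional.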
2. Customizing Wake Words
Use OpenWakeWord with custom ONNX models:
from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        wakeword_backend="oww",
        wake_words_sensitivity=0.35,
        openwakeword_model_paths="model1.onnx,model2.onnx"
    )
    recorder.start()
3. Logging and Debugging
Enable detailed logs for troubleshooting:
import logging

from RealtimeSTT import AudioToTextRecorder

if __name__ == "__main__":
    recorder = AudioToTextRecorder(
        debug_mode=True,
        level=logging.INFO  # Adjust logging level (DEBUG, INFO, WARNING)
    )
Command-Line Interface (CLI)
RealtimeSTT provides a CLI for server/client management:
| Command | Description |
|---------------|--------------------------------------|
| stt-server | Starts the transcription server. |
| stt | Initiates the client mode. |
Example:
stt-server --model tiny.en
Testing and Demo Scripts
The library includes test scripts for evaluation:
- simple_test.py: basic transcription demo.
- realtimestt_test.py: live transcription demonstration.
- wakeword_test.py: wake word activation example.
For advanced demos (e.g., real-time translations), additional dependencies like OpenAI or ElevenLabs may be required.
Performance Considerations
Latency Optimization
RealtimeSTT minimizes latency by:
- Using smaller models (tiny / base) for real-time processing.
- Adjusting realtime_processing_pause to balance speed and accuracy.
- Configuring allowed_latency_limit to prevent buffer overflows.
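These knobs can be reasoned about with simple arithmetic: whenever one processing pass takes longer than the chunk of audio it covers, backlog accumulates, and a latency cap (the role allowed_latency_limit plays) must eventually drop audio. The numbers below are illustrative, not measured RealtimeSTT figures.

```python
# Back-of-envelope latency model: if each pass over a chunk takes
# longer than the chunk's duration, backlog grows without bound and
# a latency cap (cf. allowed_latency_limit) must drop audio.
# All numbers are illustrative, not measured RealtimeSTT figures.

chunk_s = 0.5      # seconds of audio consumed per processing pass
process_s = 0.6    # wall-clock time one pass takes on this hardware
passes = 10

backlog = 0.0
for _ in range(passes):
    backlog += max(0.0, process_s - chunk_s)  # per-pass deficit

print(round(backlog, 2))  # 1.0 second behind after 10 passes
```

A smaller model (shorter `process_s`) or a longer `realtime_processing_pause` (larger `chunk_s`) both push the per-pass deficit toward zero, which is why those are the first two tuning levers listed above.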
GPU vs. CPU Performance
| Setting | GPU Acceleration | CPU Performance |
|-------------------------|-------------------------------|-------------------------------|
| Model size | Fast, even with large models | Slow; prefer smaller models |
| Real-time transcription | Highly recommended | May lag |
Recommendation: Use CUDA for optimal performance, especially in real-time applications.
Troubleshooting Common Issues
CUDA/cuDNN Mismatch Errors
If encountering errors like:
Unable to load any of {libcudnn_ops.so.9.1.0, ...}
Solution:
- Downgrade ctranslate2 to version 4.4.0: pip install ctranslate2==4.4.0
- Alternatively, upgrade cuDNN to version 9.2+.
Audio Input Issues
If the microphone isn’t detected:
- Ensure PortAudio is installed (sudo apt-get install portaudio19-dev on Linux).
- On Windows, verify PyAudio drivers are enabled.
Contributing to RealtimeSTT
The project welcomes community contributions! Key areas for improvement include:
- Adding more wake word models.
- Enhancing real-time transcription stability.
- Improving cross-platform compatibility (e.g., Android/iOS).
How to Contribute:
- Fork the repository: GitHub - KoljaB/RealtimeSTT.
- Submit a Pull Request with detailed documentation.
- Follow existing code conventions.
Conclusion
RealtimeSTT is a versatile, low-latency speech-to-text library designed for developers building voice-driven applications. Its combination of Faster Whisper, SileroVAD, and wake word detection makes it ideal for:
- Voice assistants
- Real-time transcription systems
- Interactive user interfaces
While the project is no longer actively maintained, its community-driven approach ensures ongoing improvements through pull requests. For optimal performance, GPU acceleration (CUDA) is strongly recommended.
For further exploration, refer to the official GitHub repository and documentation for advanced configurations and troubleshooting tips.
Note: The repository includes a demo script (tests/realtimestt_test.py) that showcases live transcription; run it to see real-time audio-to-text conversion in action.
Repository: https://github.com/KoljaB/RealtimeSTT