🎙️ Standalone ONNX Real-Time Voice Changer Service
A high-performance, low-latency voice conversion system powered by ONNX Runtime and Retrieval-based Voice Conversion (RVC). It converts audio from a microphone or browser source into the voice of a selected target character model in real time.
✨ Key Features
- 🚀 WebSocket Audio Pipeline: Streaming audio transfer using binary WebSocket connections (raw PCM float32) for minimal overhead.
- ⚡ Multi-Backend ONNX Acceleration: Supports execution providers including NVIDIA `CUDA`, AMD/Intel `DirectML`, and fallback `CPU`.
- 🎼 High-Fidelity DSP Pipeline:
  - Low-Cut Filter: First-order Butterworth high-pass filter at 80 Hz to attenuate AC hum and rumble.
  - Noise Gate: Threshold-based noise suppression that bypasses inference during silence (saving CPU/GPU cycles).
  - Gain Controls: Independent input/output digital gain staging.
- 🧠 Advanced Pitch Extraction: Optimized 16 kHz pitch prediction using the RMVPE (Robust Model for Vocal Pitch Estimation) model.
- 🌐 Dual Routing Architecture: Supports routing audio via the web browser (Web Audio API) or directly through the server's local audio hardware (using `sounddevice`).
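The backend fallback described above could be sketched as follows. This is a minimal illustration rather than the code in server.py: the helper name `select_providers` and the `PROVIDERS_BY_DEVICE` mapping are hypothetical, while the three provider strings are the identifiers ONNX Runtime itself uses.

```python
from typing import List, Sequence

# ONNX Runtime provider identifiers, keyed by the --device values in this README.
PROVIDERS_BY_DEVICE = {
    "cuda": "CUDAExecutionProvider",
    "dml": "DmlExecutionProvider",
    "cpu": "CPUExecutionProvider",
}


def select_providers(device: str, available: Sequence[str]) -> List[str]:
    """Return an ordered provider list that always ends with the CPU fallback.

    `available` is expected to come from onnxruntime.get_available_providers().
    (Hypothetical helper; the real server may choose providers differently.)
    """
    if device not in PROVIDERS_BY_DEVICE:
        raise ValueError(f"unknown device: {device!r}")
    preferred = PROVIDERS_BY_DEVICE[device]
    providers = []
    # Only request the accelerated provider when this build actually offers it.
    if preferred in available:
        providers.append(preferred)
    # CPU is appended last so inference still works without a GPU backend.
    if "CPUExecutionProvider" not in providers:
        providers.append("CPUExecutionProvider")
    return providers
```

The returned list can be passed as the `providers` argument of `onnxruntime.InferenceSession`, with `available` taken from `onnxruntime.get_available_providers()`.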
🛠️ System Architecture
graph TD
A[Microphone / Web Browser] -->|Web Audio API| B(WebSocket Connection)
B -->|Raw Float32 PCM Chunk| C[server.py Backend]
C -->|1. High-Pass Filter 80Hz| D[DSP Stage]
D -->|2. Gain & Noise Gate| D
D -->|3. Resample to 16kHz| E[Hubert/ContentVec ONNX]
D -->|4. Pitch Estimation RMVPE| F[Pitch Predictor]
E --> G[RVC ONNX Model Inference]
F --> G
G -->|Target Audio Chunks| H(WebSocket Connection)
H -->|Play audio| I[Browser Speakers / Audio Device]
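Steps 1 and 2 of the diagram (high-pass filtering, gain, and noise gating) could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in server.py: the names `decode_pcm`, `HighPass`, `gate_is_open`, and `front_end`, the 48 kHz input rate, and the -50 dBFS gate threshold are all hypothetical.

```python
import math
from typing import Optional

import numpy as np


def decode_pcm(payload: bytes) -> np.ndarray:
    """Interpret a binary WebSocket message as little-endian float32 samples."""
    return np.frombuffer(payload, dtype="<f4").astype(np.float32)


class HighPass:
    """First-order Butterworth high-pass (bilinear transform).

    The previous input/output samples are kept so consecutive chunks
    are filtered as one continuous stream.
    """

    def __init__(self, cutoff_hz: float = 80.0, sample_rate: int = 48000):
        k = math.tan(math.pi * cutoff_hz / sample_rate)  # pre-warped cutoff
        self.b0 = 1.0 / (1.0 + k)
        self.a1 = (k - 1.0) / (k + 1.0)
        self.prev_x = 0.0
        self.prev_y = 0.0

    def process(self, chunk: np.ndarray) -> np.ndarray:
        out = np.empty(len(chunk), dtype=np.float32)
        x1, y1 = self.prev_x, self.prev_y
        for i, x in enumerate(chunk):
            x = float(x)
            # y[n] = b0 * (x[n] - x[n-1]) - a1 * y[n-1]
            y = self.b0 * (x - x1) - self.a1 * y1
            out[i] = y
            x1, y1 = x, y
        self.prev_x, self.prev_y = x1, y1
        return out


def gate_is_open(chunk: np.ndarray, threshold_db: float = -50.0) -> bool:
    """Return True when the chunk's RMS level (dBFS) exceeds the threshold."""
    if len(chunk) == 0:
        return False
    rms = float(np.sqrt(np.mean(chunk.astype(np.float64) ** 2)))
    return 20.0 * math.log10(max(rms, 1e-10)) > threshold_db


def front_end(payload: bytes, hp: HighPass, input_gain: float = 1.0,
              threshold_db: float = -50.0) -> Optional[np.ndarray]:
    """Filter and scale one chunk; return None when the gate says to skip inference."""
    samples = hp.process(decode_pcm(payload)) * np.float32(input_gain)
    return samples if gate_is_open(samples, threshold_db) else None
```

The per-sample Python loop keeps the recurrence easy to read; a production path would more likely use a vectorized filter routine.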
📁 Repository Structure
- server.py — The main WebSocket backend and static HTTP server managing connection loops, audio resampling, and model execution.
- start.bat — Windows launcher batch file that automatically resolves the Python virtual environment and executes the server.
- requirements.txt — Python dependencies list.
- frontend/ — Contains client-side Web UI files:
  - frontend/index.html — Control interface layout.
  - frontend/app.js — WebSocket communication and client-side audio rendering.
  - frontend/styles.css — Custom dashboard styling.
- lib/ — Core package containing inference models and prediction tools.
- weights/ — Directory for voice model weights. Place your custom `.onnx` and `.pth` model sub-directories here.
- pretrained/ — Directory containing base pre-trained models such as `vec-768-layer-12.onnx` or `vec-256-layer-12.onnx`.
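To illustrate the weights/ layout, the sketch below scans a directory for model sub-folders and picks up the `.onnx` file in each. The function name `find_models` is hypothetical, and taking the first `.onnx` file per folder is an assumption; the server's actual discovery logic may differ.

```python
from pathlib import Path
from typing import Dict


def find_models(weights_dir: str = "weights") -> Dict[str, Path]:
    """Map each model folder name to the first .onnx file found inside it.

    Hypothetical helper: folders without any .onnx file are skipped.
    """
    models: Dict[str, Path] = {}
    root = Path(weights_dir)
    if not root.is_dir():
        return models
    # Sort folders and files so the result is deterministic across platforms.
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        onnx_files = sorted(folder.glob("*.onnx"))
        if onnx_files:
            models[folder.name] = onnx_files[0]
    return models
```

With a folder such as weights/HuTao/ in place, the returned dictionary would contain a `HuTao` key whose value points at the `.onnx` file inside that folder.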
🚀 Getting Started
📋 Prerequisites
- Python 3.10+ (Recommended)
- FFmpeg installed and added to the system PATH (Required for audio processing utilities).
- (Optional) NVIDIA CUDA Toolkit (v11.x/12.x) and cuDNN for GPU execution acceleration.
📦 Installation
- Clone this repository to your local directory.
- Initialize and activate a virtual environment (optional but recommended):

      python -m venv venv
      .\venv\Scripts\activate

- Install the required dependencies:

      pip install -r requirements.txt

- Place your ContentVec base model (`vec-768-layer-12.onnx` or `vec-256-layer-12.onnx`) inside the pretrained/ directory.
- Place your character models in weights/ in structured folders (e.g., `weights/HuTao/` containing `HuTao.onnx` and `HuTao.pth`).
🏃 Running the Server
Option A: Quick Launch (Windows)
Simply double-click the start.bat file. It will automatically detect Python, set up the directory paths, and launch the service.
Option B: Manual CLI execution
Execute the server using your terminal:
python server.py --host 127.0.0.1 --port 8765 --http_port 8000 --device cuda
⚙️ Command-Line Arguments
| Argument | Description | Default |
|---|---|---|
| `--host` | The address the WebSocket server binds to. | 127.0.0.1 |
| `--port` | WebSocket communication port. | 8765 |
| `--http_port` | Port serving the static frontend Web UI. | 8000 |
| `--device` | The ONNX Runtime execution device (cpu, cuda, dml). | `cuda` |
| `--model` | Target folder name in weights/ to load directly upon startup. | `None` |
Once the server starts, it serves the Web UI, which should open automatically at http://localhost:8000.
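For reference, the documented arguments and defaults map onto an `argparse` definition along these lines. This is a sketch that mirrors the table only; the function name `build_parser` and the help strings are assumptions, and server.py may define its parser differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="ONNX real-time voice changer server")
    parser.add_argument("--host", default="127.0.0.1",
                        help="Address the WebSocket server binds to")
    parser.add_argument("--port", type=int, default=8765,
                        help="WebSocket communication port")
    parser.add_argument("--http_port", type=int, default=8000,
                        help="Port serving the static frontend Web UI")
    parser.add_argument("--device", choices=["cpu", "cuda", "dml"], default="cuda",
                        help="ONNX Runtime execution device")
    parser.add_argument("--model", default=None,
                        help="Folder name in weights/ to load on startup")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Restricting `--device` with `choices` makes a mistyped device name fail at startup rather than during model loading.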
🔊 Audio DSP Details
To keep latency low without introducing output artifacts, the audio pipeline uses:
- Sliding Window Context Buffer: Keeps a short historical buffer of the audio to feed the model the required context frames while minimizing output audio delay.
- Convolution Padding Fadeout: 120ms of trailing silent padding is temporarily appended to input segments to avoid edge-fading anomalies inherent to RVC convolutional steps.
- Linear Resampling: Low-overhead linear interpolation for quick sample rate adaptation.
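The three techniques above could look roughly like this in NumPy. Everything here is a sketch under assumptions: the names `resample_linear`, `pad_tail`, and `ContextBuffer` are hypothetical, the context length is left to the caller, and the real server may structure these steps differently.

```python
import numpy as np


def resample_linear(samples: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Resample by linear interpolation between neighbouring input samples."""
    if src_rate == dst_rate or len(samples) == 0:
        return samples.astype(np.float32)
    n_out = int(round(len(samples) * dst_rate / src_rate))
    # Positions of the output samples, measured in input-sample units.
    positions = np.arange(n_out) * (src_rate / dst_rate)
    return np.interp(positions, np.arange(len(samples)), samples).astype(np.float32)


def pad_tail(samples: np.ndarray, sample_rate: int, pad_ms: float = 120.0) -> np.ndarray:
    """Append trailing silence so convolution edge effects land on padding."""
    pad = np.zeros(int(sample_rate * pad_ms / 1000), dtype=np.float32)
    return np.concatenate([samples.astype(np.float32), pad])


class ContextBuffer:
    """Keeps the most recent `context` samples and prepends them to each chunk."""

    def __init__(self, context: int):
        self.history = np.zeros(context, dtype=np.float32)

    def push(self, chunk: np.ndarray) -> np.ndarray:
        window = np.concatenate([self.history, chunk.astype(np.float32)])
        # Retain only the newest `context` samples for the next call.
        self.history = window[len(window) - len(self.history):]
        return window
```

Linear interpolation applies no anti-aliasing filter, so it trades some fidelity for speed. Because the padding is temporary, the portion of the model output that corresponds to the appended silence would be trimmed before playback.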