🎙️ ONNX VC - Standalone Real-Time Voice Changer
A high-performance, low-latency, real-time AI voice conversion system powered by ONNX Runtime and Retrieval-based Voice Conversion (RVC). Features a premium dashboard built with Next.js App Router, TypeScript, and Tailwind CSS, supporting full internationalization.
✨ Key Features
- 🚀 WebSocket Audio Pipeline: Streaming audio transfer using binary WebSocket connections (raw PCM float32) for minimal overhead.
- ⚡ Multi-Backend ONNX Acceleration: Supports execution providers including NVIDIA CUDA, AMD/Intel DirectML, and fallback CPU.
- 🌐 Universal Localisation: Fully translatable interface supporting English, Indonesian, Japanese, Chinese, and Spanish.
- 🎨 Premium Dashboard: Fully responsive workspace built using React 19, Radix UI, Framer Motion, and Tailwind CSS.
- 🎼 High-Fidelity DSP Pipeline:
  - Low-Cut Filter: First-order Butterworth high-pass filter at 80 Hz to attenuate AC hum and low-frequency rumble.
  - Noise Gate: Threshold-based noise suppression that bypasses inference during silence (saving CPU/GPU cycles).
  - Gain Controls: Independent input/output digital gain staging.
- 🧠 Advanced Pitch Extraction: Optimized 16 kHz pitch prediction using the RMVPE (Robust Model for Vocal Pitch Estimation) model.
- 🌐 Dual Routing Architecture: Supports routing audio via the web browser (Web Audio API) or directly through the server's local audio hardware (using sounddevice).
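The low-cut filter, gain stage, and noise gate listed above can be sketched in plain Python. This is an illustrative sketch, not the code in server.py: the class and function names, the 48 kHz sample rate, and the -50 dB gate threshold are all assumptions chosen for the example.

```python
import math


class HighPass1:
    """First-order Butterworth high-pass (bilinear transform).

    Filter state is kept between calls so consecutive chunks join cleanly.
    """

    def __init__(self, cutoff_hz: float = 80.0, sample_rate: int = 48000):
        k = math.tan(math.pi * cutoff_hz / sample_rate)  # pre-warped cutoff
        self.b0 = 1.0 / (1.0 + k)
        self.a1 = (k - 1.0) / (k + 1.0)
        self.prev_x = 0.0
        self.prev_y = 0.0

    def process(self, chunk):
        out = []
        for x in chunk:
            # y[n] = b0 * (x[n] - x[n-1]) - a1 * y[n-1]
            y = self.b0 * (x - self.prev_x) - self.a1 * self.prev_y
            self.prev_x, self.prev_y = x, y
            out.append(y)
        return out


def rms_db(chunk):
    """RMS level of a chunk in dBFS (-inf for digital silence)."""
    if not chunk:
        return -math.inf
    mean_sq = sum(s * s for s in chunk) / len(chunk)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else -math.inf


def preprocess(chunk, hpf, input_gain_db=0.0, gate_threshold_db=-50.0):
    """Filter and gain a chunk; return None when the gate is closed.

    A None result tells the caller to skip model inference for this chunk.
    """
    filtered = hpf.process(chunk)
    gain = 10.0 ** (input_gain_db / 20.0)
    gained = [s * gain for s in filtered]
    if rms_db(gained) < gate_threshold_db:
        return None
    return gained
```

Returning None for gated chunks keeps the "skip inference" decision in one place, so the caller can send silence back without touching the model.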
🛠️ System Architecture
graph TD
A[Microphone / Web Browser] -->|Web Audio API| B(WebSocket Connection)
B -->|Raw Float32 PCM Chunk| C[server.py Backend]
C -->|1. High-Pass Filter 80Hz| D[DSP Stage]
D -->|2. Gain & Noise Gate| D
D -->|3. Resample to 16kHz| E[Hubert/ContentVec ONNX]
D -->|4. Pitch Estimation RMVPE| F[Pitch Predictor]
E --> G[RVC ONNX Model Inference]
F --> G
G -->|Target Audio Chunks| H(WebSocket Connection)
H -->|Play audio| I[Browser Speakers / Audio Device]
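The "Raw Float32 PCM Chunk" edges in the diagram imply that each binary WebSocket message is a flat run of 32-bit floats. A minimal sketch of packing and unpacking such a payload follows; the helper names are hypothetical, and little-endian mono framing is an assumption, since this README does not specify the wire format.

```python
import struct


def encode_pcm_f32(samples):
    """Pack a sequence of floats into little-endian float32 bytes."""
    return struct.pack(f"<{len(samples)}f", *samples)


def decode_pcm_f32(payload: bytes):
    """Unpack little-endian float32 bytes into a list of Python floats."""
    if len(payload) % 4 != 0:
        raise ValueError("payload length must be a multiple of 4 bytes")
    count = len(payload) // 4
    return list(struct.unpack(f"<{count}f", payload))
```

Because the payload carries no header, both sides have to agree on sample rate and channel count out of band.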
📁 Repository Structure
- server.py — The main WebSocket backend and static HTTP server managing connection loops, audio resampling, and model execution.
- start.bat — Windows launcher batch file that automatically resolves the Python virtual environment and executes the server.
- requirements.txt — Python dependencies list.
- frontend-next/ — The development workspace for the frontend client (Next.js, TypeScript).
- frontend/ — Statically exported and optimized assets served by server.py backend.
- lib/ — Core package containing inference models, ONNX conversion scripts, and prediction tools.
- weights/ — Directory for character voice model weights (e.g. weights/HuTao/).
- pretrained/ — Directory containing base pre-trained models.
🚀 Installation & Setup
📋 Prerequisites
- Python 3.10+
- FFmpeg installed and added to the system PATH (Required for audio processing utilities).
- Node.js 18+ & npm (Only required if you want to modify and compile the frontend workspace).
- (Optional) NVIDIA CUDA Toolkit (v11.x/12.x) and cuDNN for GPU execution acceleration.
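A quick way to confirm the prerequisites above is a small check script. The sketch below is not part of the repository and the function name is hypothetical. It only verifies the Python version and that the executables are discoverable on PATH, and it does not check the Node.js version or CUDA.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict mapping each prerequisite to True (found) or False."""
    return {
        "python": sys.version_info >= min_python,
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "node": shutil.which("node") is not None,  # frontend builds only
        "npm": shutil.which("npm") is not None,    # frontend builds only
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```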
📦 1. Python Backend Installation
- Clone this repository to your local directory.
- Initialize and activate a virtual environment:
python -m venv venv
# On Windows:
.\venv\Scripts\activate
# On Linux/macOS:
source venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
📥 2. Download Pre-trained ContentVec (Required)
The voice conversion model requires a ContentVec base model to extract content features from incoming voice chunks.
- Download the vec-768-layer-12.onnx model from Hugging Face.
- Save the downloaded file inside the pretrained/ directory:
pretrained/
└── vec-768-layer-12.onnx
🔄 3. Setup & Export RVC Models to ONNX
To run character models on ONNX Runtime, you must place your standard PyTorch RVC models (.pth) under the weights/ directory and convert them.
- Create a sub-folder under weights/ named after your character (e.g. HuTao):
weights/
└── HuTao/
    └── HuTao.pth
- Run the ONNX conversion script by passing the folder name of the model:
python lib/export_onnx.py --model_name HuTao
- The script will automatically search for the .pth file inside weights/HuTao/ and export a corresponding HuTao.onnx file inside the same directory:
weights/
└── HuTao/
    ├── HuTao.pth
    └── HuTao.onnx
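The lookup described above (find the .pth file in the model folder, write the .onnx file next to it) can be illustrated with pathlib. This is a hypothetical sketch of the directory convention only. It is not the code in lib/export_onnx.py, and it performs no model conversion.

```python
from pathlib import Path


def find_checkpoint(model_name: str, weights_dir: str = "weights") -> Path:
    """Return the first .pth file found in weights/<model_name>/."""
    model_dir = Path(weights_dir) / model_name
    candidates = sorted(model_dir.glob("*.pth"))
    if not candidates:
        raise FileNotFoundError(f"no .pth file found in {model_dir}")
    return candidates[0]


def onnx_output_path(checkpoint: Path) -> Path:
    """Return the sibling path that uses the .onnx suffix."""
    return checkpoint.with_suffix(".onnx")
```

For the layout shown above, `onnx_output_path(find_checkpoint("HuTao"))` resolves to weights/HuTao/HuTao.onnx.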
🖥️ 4. Running the Frontend Client
Since the Python backend operates purely as a WebSocket API service, you must run the Next.js frontend client separately.
Option A: Development Server (Quick & Recommended)
- Navigate to the frontend directory:
cd frontend-next
- Install npm dependencies:
npm install
- Spin up the dev server:
npm run dev
Open your browser and navigate to http://localhost:3000.
Option B: Compiled Static Production Web Server
- Navigate to frontend-next and build the application:
cd frontend-next
npm install
npm run build
Note: This will compile static pages and copy them into the root /frontend folder.
- Serve the compiled output using a static file server of your choice:
  - Using Node:
  npx serve ../frontend -p 3000
  - Using Python:
  python -m http.server 3000 --directory ../frontend
Open http://localhost:3000 in your browser.
🏃 Running the Voice Changer
Step 1: Start the Python WebSocket Backend
Run the server using your terminal (defaults to port 8765):
python server.py --host 127.0.0.1 --port 8765 --device cuda
⚙️ Command-Line Arguments
| Argument | Description | Default |
|---|---|---|
| --host | The address the WebSocket server binds to. | 127.0.0.1 |
| --port | WebSocket communication port. | 8765 |
| --device | The ONNX Runtime execution device (cpu, cuda, dml). | cuda |
| --model | Target folder name in weights/ to load directly upon startup. | None |
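For reference, the arguments in the table could be declared with argparse roughly as follows. This is a sketch under assumptions, not the parser in server.py. The mapping from --device values to ONNX Runtime execution provider names uses the standard provider identifiers, but listing the CPU provider as a fallback after the GPU one is a design choice made for this example.

```python
import argparse

# Assumed mapping from --device values to ONNX Runtime provider names.
PROVIDERS = {
    "cpu": ["CPUExecutionProvider"],
    "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "dml": ["DmlExecutionProvider", "CPUExecutionProvider"],
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose defaults mirror the table above."""
    parser = argparse.ArgumentParser(description="ONNX VC WebSocket backend")
    parser.add_argument("--host", default="127.0.0.1",
                        help="address the WebSocket server binds to")
    parser.add_argument("--port", type=int, default=8765,
                        help="WebSocket communication port")
    parser.add_argument("--device", choices=sorted(PROVIDERS), default="cuda",
                        help="ONNX Runtime execution device")
    parser.add_argument("--model", default=None,
                        help="folder name in weights/ to load at startup")
    return parser


def providers_for(device: str):
    """Return the ordered provider list to try for a device name."""
    return PROVIDERS[device]
```

The resulting list is the kind of value that can be passed as the `providers` argument when creating an `onnxruntime.InferenceSession`.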
Step 2: Open the Frontend Dashboard
Make sure your frontend client is running (via npm run dev or a static server on http://localhost:3000), then open it in your browser. The dashboard connects to the WebSocket API backend automatically.
🔊 Audio DSP Details
To achieve low latency without output artifacts, the audio processing utilizes:
- Sliding Window Context Buffer: Keeps a short historical buffer of the audio to feed the model the required context frames while minimizing output audio delay.
- Convolution Padding Fadeout: 120ms of trailing silent padding is temporarily appended to input segments to avoid edge-fading anomalies inherent to RVC convolutional steps.
- Linear Resampling: Low-overhead linear interpolation for quick sample rate adaptation.
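Two of the techniques above, linear resampling and trailing silent padding, are simple enough to sketch in plain Python. The function names are hypothetical and the code is illustrative rather than taken from server.py. Only the 120 ms padding length comes from the description above.

```python
def resample_linear(samples, src_rate: int, dst_rate: int):
    """Resample by linear interpolation between neighbouring samples."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)  # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] + (samples[hi] - samples[lo]) * frac)
    return out


def pad_tail(samples, sample_rate: int, pad_ms: float = 120.0):
    """Append pad_ms of silence to the end of a segment."""
    pad_len = int(sample_rate * pad_ms / 1000)
    return list(samples) + [0.0] * pad_len
```

Linear interpolation applies no anti-aliasing filter, so downsampling this way trades some high-frequency accuracy for very low per-chunk cost.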