# 🎙️ ONNX VC - Standalone Real-Time Voice Changer

🌐 **Languages:** [English](README.md) | [Bahasa Indonesia](README.id.md) | [Español](README.es.md) | [日本語](README.ja.md) | [简体中文](README.zh.md)

A high-performance, low-latency, real-time AI voice conversion system powered by **ONNX Runtime** and **Retrieval-based Voice Conversion (RVC)**. Features a premium dashboard built with **Next.js App Router**, **TypeScript**, and **Tailwind CSS**, supporting full internationalization.

---

## ✨ Key Features

* **🚀 WebSocket Audio Pipeline:** Streams audio over binary WebSocket connections (raw PCM float32) for minimal overhead.
* **⚡ Multi-Backend ONNX Acceleration:** Supports the NVIDIA `CUDA`, AMD/Intel `DirectML`, and fallback `CPU` execution providers.
* **🌐 Universal Localization:** Fully translatable interface supporting English, Indonesian, Japanese, Chinese, and Spanish.
* **🎨 Premium Dashboard:** Fully responsive workspace built with React 19, Radix UI, Framer Motion, and Tailwind CSS.
* **🎼 High-Fidelity DSP Pipeline:**
    * **Low-Cut Filter:** First-order Butterworth high-pass filter at 80 Hz to attenuate AC hum and low-frequency rumble.
    * **Noise Gate:** Threshold-based noise suppression that bypasses inference during silence, saving CPU/GPU cycles.
    * **Gain Controls:** Independent input/output digital gain staging.
* **🧠 Advanced Pitch Extraction:** Optimized 16 kHz pitch prediction using the RMVPE (Robust Model for Vocal Pitch Estimation) model.
* **🌐 Dual Routing Architecture:** Routes audio either through the web browser (Web Audio API) or directly through the server's local audio hardware (using `sounddevice`).

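
The DSP features listed above (low-cut filter, gain, noise gate) can be sketched in a few lines. This is a minimal, standard-library-only illustration under stated assumptions, not the project's actual implementation: the names `HighPass1`, `rms`, and `preprocess`, the 48 kHz sample rate used in the comment, and the default `gate_threshold` are all hypothetical. A real server would more likely use NumPy/SciPy for speed.

```python
import math


class HighPass1:
    """First-order Butterworth high-pass (bilinear transform).

    Filter state is kept between calls so consecutive chunks join cleanly.
    """

    def __init__(self, cutoff_hz: float, sample_rate: int) -> None:
        k = math.tan(math.pi * cutoff_hz / sample_rate)  # pre-warped cutoff
        self.b0 = 1.0 / (1.0 + k)
        self.a1 = (k - 1.0) / (k + 1.0)
        self.prev_x = 0.0
        self.prev_y = 0.0

    def process(self, chunk: list[float]) -> list[float]:
        out = []
        for x in chunk:
            # y[n] = b0 * (x[n] - x[n-1]) - a1 * y[n-1]
            y = self.b0 * (x - self.prev_x) - self.a1 * self.prev_y
            self.prev_x, self.prev_y = x, y
            out.append(y)
        return out


def rms(chunk: list[float]) -> float:
    """Root-mean-square level of a chunk (0.0 for an empty chunk)."""
    return math.sqrt(sum(s * s for s in chunk) / len(chunk)) if chunk else 0.0


def preprocess(chunk, hpf, input_gain=1.0, gate_threshold=0.01):
    """Filter and gain a chunk; return None when the noise gate is closed.

    Assumption: the caller skips inference and emits silence on None.
    Example setup: hpf = HighPass1(80.0, 48000)
    """
    filtered = hpf.process(chunk)
    gained = [s * input_gain for s in filtered]
    if rms(gained) < gate_threshold:
        return None
    return gained
```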
---

## 🛠️ System Architecture

```mermaid
graph TD
    A[Microphone / Web Browser] -->|Web Audio API| B(WebSocket Connection)
    B -->|Raw Float32 PCM Chunk| C[server.py Backend]
    C -->|1. High-Pass Filter 80Hz| D[DSP Stage]
    D -->|2. Gain & Noise Gate| D
    D -->|3. Resample to 16kHz| E[Hubert/ContentVec ONNX]
    D -->|4. Pitch Estimation RMVPE| F[Pitch Predictor]
    E --> G[RVC ONNX Model Inference]
    F --> G
    G -->|Target Audio Chunks| H(WebSocket Connection)
    H -->|Play audio| I[Browser Speakers / Audio Device]
```

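
As an illustration of the raw float32 PCM framing on the WebSocket link, the sketch below converts one binary message into samples and back. It assumes mono, little-endian 32-bit floats, which is what a browser `Float32Array` yields on common little-endian hardware. The helper names `decode_pcm_f32` and `encode_pcm_f32` are hypothetical and are not taken from `server.py`.

```python
import sys
from array import array


def decode_pcm_f32(payload: bytes) -> array:
    """Interpret a binary message as little-endian float32 samples."""
    if len(payload) % 4:
        raise ValueError("payload length must be a multiple of 4 bytes")
    samples = array("f")
    samples.frombytes(payload)
    if sys.byteorder == "big":
        samples.byteswap()  # wire format is assumed little-endian
    return samples


def encode_pcm_f32(samples) -> bytes:
    """Pack an iterable of floats into little-endian float32 bytes."""
    out = array("f", samples)
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()
```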
---

## 📁 Repository Structure

* [server.py](server.py): Main WebSocket backend server; manages connection loops, audio resampling, and model execution.
* [start.bat](start.bat): Windows launcher batch file that automatically resolves the Python virtual environment and starts the server.
* [requirements.txt](requirements.txt): Python dependency list.
* [frontend/](frontend/): Frontend client workspace built with Next.js (TypeScript, Tailwind CSS).
* [frontend-deprecated/](frontend-deprecated/): Previous frontend code, now deprecated.
* [lib/](lib/): Core package containing inference models, ONNX conversion scripts, and prediction tools.
* [weights/](weights/): Directory for character voice model weights (e.g. `weights/HuTao/`).
* [pretrained/](pretrained/): Directory containing base pre-trained models.

---

## 🚀 Installation & Setup

### 📋 Prerequisites

* **Python 3.10+**
* **FFmpeg** installed and added to the system PATH (Required for audio processing utilities).
* **Node.js 18+** & **npm** (Required to run the Next.js frontend client).
* (Optional) **NVIDIA CUDA Toolkit** (v11.x/12.x) and **cuDNN** for GPU execution acceleration.

---

### 📦 1. Python Backend Installation

1. Clone this repository to your local directory.
2. Initialize and activate a virtual environment:
   ```bash
   python -m venv venv
   # On Windows:
   .\venv\Scripts\activate
   # On Linux/macOS:
   source venv/bin/activate
   ```
3. Install the required dependencies:
   ```bash
   pip install -r requirements.txt
   ```

---

### 📥 2. Download Pre-trained ContentVec (Required)

The pipeline requires a ContentVec base model to extract content features from each voice chunk.

1. Download the `vec-768-layer-12.onnx` model from Hugging Face:
   👉 **[Download vec-768-layer-12.onnx](https://huggingface.co/DogManTC/test-rvc-onnx/blob/main/vec-768-layer-12.onnx)**
2. Save the downloaded file inside the [pretrained/](pretrained/) directory:
   ```
   pretrained/
   └── vec-768-layer-12.onnx
   ```

---

### 🔄 3. Setup & Export RVC Models to ONNX

To run character models on ONNX Runtime, place your standard PyTorch RVC models (`.pth`) under the [weights/](weights/) directory and convert them.

1. Create a sub-folder under `weights/` named after your character (e.g. `HuTao`):
   ```
   weights/
   └── HuTao/
       └── HuTao.pth
   ```
2. Run the ONNX conversion script, passing the model's folder name:
   ```bash
   python lib/export_onnx.py --model_name HuTao
   ```
3. The script automatically searches for the `.pth` file inside `weights/HuTao/` and exports a corresponding `HuTao.onnx` file to the same directory:
   ```
   weights/
   └── HuTao/
       ├── HuTao.pth
       └── HuTao.onnx
   ```

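
The folder convention in the steps above can be checked before running the export. The helper below is a hypothetical sketch, not the logic of `lib/export_onnx.py`: the name `resolve_export_paths`, the choice of the first `.pth` file in alphabetical order, and naming the output after that checkpoint are all assumptions made for illustration.

```python
from pathlib import Path


def resolve_export_paths(model_name: str, weights_dir: Path = Path("weights")) -> tuple[Path, Path]:
    """Return (source .pth, target .onnx) for weights/<model_name>/."""
    model_dir = weights_dir / model_name
    candidates = sorted(model_dir.glob("*.pth"))
    if not candidates:
        raise FileNotFoundError(f"no .pth checkpoint found in {model_dir}")
    source = candidates[0]  # assumption: first checkpoint alphabetically
    target = source.with_suffix(".onnx")  # assumption: same stem as the .pth
    return source, target
```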
---

### 🖥️ 4. Running the Frontend Client

The frontend client runs either as a Next.js development server or as a built production server.

1. Navigate to the frontend directory:
   ```bash
   cd frontend
   ```
2. Install npm dependencies:
   ```bash
   npm install
   ```
3. Start the development server:
   ```bash
   npm run dev
   ```
   Open your browser and navigate to **`http://localhost:3000`**.

Alternatively, to build and run the production server:

```bash
npm run build
npm run start
```

---

## 🏃 Running the Voice Changer

### Step 1: Start the Python WebSocket Backend

Run the server using your terminal (defaults to port `8765`):

```bash
python server.py --host 127.0.0.1 --port 8765 --device cuda
```

#### ⚙️ Command-Line Arguments

| Argument | Description | Default |
|---|---|---|
| `--host` | The address the WebSocket server binds to. | `127.0.0.1` |
| `--port` | WebSocket communication port. | `8765` |
| `--device` | The ONNX Runtime execution device (`cpu`, `cuda`, `dml`). | `cuda` |
| `--model` | Target folder name in `weights/` to load directly upon startup. | `None` |

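
For readers who want to see how such a command line might be wired up, here is a minimal `argparse` sketch that mirrors the table above. It is an assumption about structure, not a copy of `server.py`: `build_parser` and the `PROVIDERS` mapping are hypothetical names. The provider strings themselves are the standard ONNX Runtime execution provider names, with CPU appended as a fallback.

```python
import argparse

# ONNX Runtime execution provider names for each --device value.
# Assumption: CPU is listed last so it acts as a fallback.
PROVIDERS = {
    "cpu": ["CPUExecutionProvider"],
    "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "dml": ["DmlExecutionProvider", "CPUExecutionProvider"],
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the arguments and defaults from the table."""
    parser = argparse.ArgumentParser(description="WebSocket backend (sketch)")
    parser.add_argument("--host", default="127.0.0.1",
                        help="address the WebSocket server binds to")
    parser.add_argument("--port", type=int, default=8765,
                        help="WebSocket communication port")
    parser.add_argument("--device", choices=sorted(PROVIDERS), default="cuda",
                        help="ONNX Runtime execution device")
    parser.add_argument("--model", default=None,
                        help="folder name in weights/ to load at startup")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--device", "dml", "--model", "HuTao"])
providers = PROVIDERS[args.device]
```

A list like `providers` is the kind of value that `onnxruntime.InferenceSession` accepts through its `providers` parameter, so an unavailable GPU backend can fall back to CPU.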
### Step 2: Open the Frontend Dashboard

Make sure the frontend client is running (via `npm run dev` or `npm run start`), then open `http://localhost:3000` in your browser. The dashboard connects to the WebSocket backend automatically.

---

## 🔊 Audio DSP Details

To achieve low latency without audible artifacts, the audio pipeline uses:

1. **Sliding Window Context Buffer:** Keeps a short history of recent audio so the model receives the context frames it needs while output delay stays minimal.
2. **Convolution Padding Fadeout:** 120 ms of trailing silence is temporarily appended to each input segment to avoid the edge-fading artifacts inherent to RVC's convolutional layers.
3. **Linear Resampling:** Low-overhead linear interpolation for fast sample-rate conversion.

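
The three techniques above can be illustrated with a short, standard-library-only sketch. Everything here is hypothetical and simplified: `linear_resample`, `pad_trailing_silence`, and `ContextBuffer` are illustrative names, and the real server may size its buffers and trim its outputs differently.

```python
from collections import deque


def linear_resample(samples, src_rate, dst_rate):
    """Resample by linear interpolation between neighbouring input samples."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = min(i * step, last)  # clamp so we never read past the end
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def pad_trailing_silence(samples, sample_rate, pad_ms=120):
    """Append pad_ms of zeros; the caller discards the matching output tail."""
    return list(samples) + [0.0] * (sample_rate * pad_ms // 1000)


class ContextBuffer:
    """Keeps the most recent context_len samples as model context."""

    def __init__(self, context_len):
        self.history = deque(maxlen=context_len)

    def push(self, chunk):
        """Return history + chunk, then remember the newest samples."""
        window = list(self.history) + list(chunk)
        self.history.extend(chunk)
        return window
```

In this sketch a caller would run the model on the window returned by `push`, then keep only the output that corresponds to the new chunk, dropping the parts produced by the context and by the silent padding.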