akukanara c8c6bc5770 feat: convert python backend to a pure WebSocket API server

2026-05-31 16:58:25 +07:00

7.7 KiB

Raw Blame History

🎙️ ONNX VC - Standalone Real-Time Voice Changer

A high-performance, low-latency, real-time AI voice conversion system powered by ONNX Runtime and Retrieval-based Voice Conversion (RVC). Features a premium dashboard built with Next.js App Router, TypeScript, and Tailwind CSS, supporting full internationalization.

✨ Key Features

🚀 WebSocket Audio Pipeline: Streaming audio transfer using binary WebSocket connections (raw PCM float32) for minimal overhead.
⚡ Multi-Backend ONNX Acceleration: Supports execution providers including NVIDIA CUDA, AMD/Intel DirectML, and fallback CPU.
🌐 Universal Localisation: Fully translatable interface supporting English, Indonesian, Japanese, Chinese, and Spanish.
🎨 Premium Dashboard: Fully responsive workspace built using React 19, Radix UI, Framer Motion, and Tailwind CSS.
🎼 High-Fidelity DSP Pipeline:
- Low-Cut Filter: First-order Butterworth high-pass filter at 80 Hz to attenuate low-frequency hum and rumble.
- Noise Gate: Threshold-based noise suppression to bypass inference during silence (saving CPU/GPU cycles).
- Gain Controls: Independent input/output digital gain staging.
🧠 Advanced Pitch Extraction: Optimized 16 kHz pitch prediction using the RMVPE (Robust Model for Vocal Pitch Estimation) model.
🌐 Dual Routing Architecture: Supports routing audio via the web browser (Web Audio API) or directly through the server's local audio hardware (using sounddevice).
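
The low-cut filter, gain, and noise gate described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in server.py: the `HighPass1` class, the helper names, and the -50 dB gate threshold are hypothetical, and a real implementation would likely vectorize the loop with NumPy.

```python
import math
from typing import List, Optional


class HighPass1:
    """First-order Butterworth high-pass (bilinear transform).

    State is kept between calls so consecutive chunks filter seamlessly.
    """

    def __init__(self, cutoff_hz: float, sample_rate: int) -> None:
        k = math.tan(math.pi * cutoff_hz / sample_rate)
        self.b0 = 1.0 / (1.0 + k)
        self.a1 = (k - 1.0) / (k + 1.0)
        self.prev_x = 0.0
        self.prev_y = 0.0

    def process(self, chunk: List[float]) -> List[float]:
        out = []
        for x in chunk:
            # y[n] = b0 * (x[n] - x[n-1]) - a1 * y[n-1]
            y = self.b0 * (x - self.prev_x) - self.a1 * self.prev_y
            self.prev_x, self.prev_y = x, y
            out.append(y)
        return out


def db_to_linear(db: float) -> float:
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def rms_db(chunk: List[float]) -> float:
    """RMS level of a chunk in dBFS; -inf for silence or an empty chunk."""
    if not chunk:
        return -math.inf
    mean_sq = sum(s * s for s in chunk) / len(chunk)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0.0 else -math.inf


def preprocess(
    chunk: List[float],
    hpf: HighPass1,
    input_gain_db: float = 0.0,
    gate_threshold_db: float = -50.0,  # assumed value, not from the project
) -> Optional[List[float]]:
    """Filter and gain-stage a chunk; return None when the gate is closed."""
    gain = db_to_linear(input_gain_db)
    conditioned = [s * gain for s in hpf.process(chunk)]
    if rms_db(conditioned) < gate_threshold_db:
        return None  # caller skips inference for this chunk
    return conditioned
```

Returning `None` for gated chunks lets the caller skip model inference entirely, which is where the CPU/GPU savings come from.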

🛠️ System Architecture

```mermaid
graph TD
    A[Microphone / Web Browser] -->|Web Audio API| B(WebSocket Connection)
    B -->|Raw Float32 PCM Chunk| C[server.py Backend]
    C -->|1. High-Pass Filter 80Hz| D[DSP Stage]
    D -->|2. Gain & Noise Gate| D
    D -->|3. Resample to 16kHz| E[Hubert/ContentVec ONNX]
    D -->|4. Pitch Estimation RMVPE| F[Pitch Predictor]
    E --> G[RVC ONNX Model Inference]
    F --> G
    G -->|Target Audio Chunks| H(WebSocket Connection)
    H -->|Play audio| I[Browser Speakers / Audio Device]
```
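
As a rough illustration of the "Raw Float32 PCM Chunk" hop in the diagram, the sketch below converts a binary WebSocket payload to float samples and back using only the standard library. The function names are hypothetical, and little-endian byte order is an assumption (it matches what a browser `Float32Array` produces on common hardware); the actual framing used by server.py may differ.

```python
import sys
from array import array


def decode_pcm_f32(message: bytes) -> array:
    """Interpret a binary message as little-endian float32 samples."""
    if len(message) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    samples = array("f")
    samples.frombytes(message)
    if sys.byteorder == "big":
        samples.byteswap()  # normalize to host order on big-endian machines
    return samples


def encode_pcm_f32(samples) -> bytes:
    """Serialize an iterable of floats as little-endian float32 bytes."""
    out = array("f", samples)
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()
```

Because the payload carries no header, both sides have to agree on the sample rate and channel count out of band.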

📁 Repository Structure

server.py: The main WebSocket backend, managing connection loops, audio resampling, and model execution.
start.bat: Windows launcher batch file that automatically resolves the Python virtual environment and executes the server.
requirements.txt: Python dependencies list.
frontend-next/: The development workspace for the frontend client (Next.js, TypeScript).
frontend/: Statically exported and optimized frontend assets produced by the frontend-next build.
lib/: Core package containing inference models, ONNX conversion scripts, and prediction tools.
weights/: Directory for character voice model weights (e.g. weights/HuTao/).
pretrained/: Directory containing base pre-trained models.

🚀 Installation & Setup

📋 Prerequisites

Python 3.10+
FFmpeg installed and added to the system PATH (Required for audio processing utilities).
Node.js 18+ & npm (Only required if you want to modify and compile the frontend workspace).
(Optional) NVIDIA CUDA Toolkit (v11.x/12.x) and cuDNN for GPU execution acceleration.

📦 1. Python Backend Installation

Clone this repository to your local directory.

Initialize and activate a virtual environment:

```
python -m venv venv
# On Windows:
.\venv\Scripts\activate
# On Linux/macOS:
source venv/bin/activate
```

Install the required dependencies:
```
pip install -r requirements.txt
```

📥 2. Download Pre-trained ContentVec (Required)

The pipeline requires a ContentVec base model to extract speaker-independent content features from incoming voice chunks.

Download the vec-768-layer-12.onnx model from Hugging Face.
Save the downloaded file inside the pretrained/ directory:
```
pretrained/
└── vec-768-layer-12.onnx
```

🔄 3. Setup & Export RVC Models to ONNX

To run character models on ONNX Runtime, you must place your standard PyTorch RVC models (.pth) under the weights/ directory and convert them.

Create a sub-folder under weights/ named after your character (e.g. HuTao):
```
weights/
└── HuTao/
    └── HuTao.pth
```
Run the ONNX conversion script by passing the folder name of the model:
```
python lib/export_onnx.py --model_name HuTao
```
The script will automatically search for the .pth file inside weights/HuTao/ and export a corresponding HuTao.onnx file inside the same directory:
```
weights/
└── HuTao/
    ├── HuTao.pth
    └── HuTao.onnx
```
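
The directory convention above can be expressed as a small path helper. This is only a sketch of the layout, not the logic inside lib/export_onnx.py: `resolve_model_paths` is a hypothetical name, and picking the first `.pth` file alphabetically is an assumption for folders that contain more than one checkpoint.

```python
from pathlib import Path


def resolve_model_paths(model_name: str, weights_dir: Path = Path("weights")):
    """Return (source .pth, target .onnx) for a model folder under weights/."""
    model_dir = weights_dir / model_name
    candidates = sorted(model_dir.glob("*.pth"))
    if not candidates:
        raise FileNotFoundError(f"no .pth checkpoint found in {model_dir}")
    source = candidates[0]
    # The exported file sits next to the checkpoint, with the same stem.
    target = source.with_suffix(".onnx")
    return source, target
```

For `weights/HuTao/HuTao.pth` this yields `weights/HuTao/HuTao.onnx`, matching the tree shown above.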

🖥️ 4. Running the Frontend Client

Since the Python backend operates purely as a WebSocket API service, you must run the Next.js frontend client separately.

Option A: Development Server (Quick & Recommended)

Navigate to the frontend directory:
```
cd frontend-next
```
Install npm dependencies:
```
npm install
```
Spin up the dev server:
```
npm run dev
```
Open your browser and navigate to http://localhost:3000.

Option B: Compiled Static Production Web Server

Navigate to frontend-next and build the application:
```
cd frontend-next
npm install
npm run build
```
Note: This compiles the static pages and copies them into the frontend/ folder at the repository root.
Serve the compiled output using a static file server of your choice:
- Using Node: npx serve ../frontend -p 3000
- Using Python: python -m http.server 3000 --directory ../frontend

Open http://localhost:3000 in your browser.

🏃 Running the Voice Changer

Step 1: Start the Python WebSocket Backend

Run the server using your terminal (defaults to port 8765):

```
python server.py --host 127.0.0.1 --port 8765 --device cuda
```

⚙️ Command-Line Arguments

| Argument | Description | Default |
| --- | --- | --- |
| `--host` | The address the WebSocket server binds to. | `127.0.0.1` |
| `--port` | WebSocket communication port. | `8765` |
| `--device` | The ONNX Runtime execution device (`cpu`, `cuda`, `dml`). | `cuda` |
| `--model` | Target folder name in `weights/` to load directly upon startup. | `None` |
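
For readers who want to see how these flags might be wired up, here is a minimal `argparse` sketch that mirrors the documented arguments and defaults. The function names and the device-to-provider mapping are assumptions for illustration; the provider strings are the standard ONNX Runtime names, but the fallback order used by server.py is not documented here.

```python
import argparse

# Assumed mapping from --device to ONNX Runtime execution providers,
# with CPU listed last as a fallback.
PROVIDERS = {
    "cpu": ["CPUExecutionProvider"],
    "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "dml": ["DmlExecutionProvider", "CPUExecutionProvider"],
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the arguments and defaults from the table."""
    parser = argparse.ArgumentParser(description="ONNX VC WebSocket backend")
    parser.add_argument("--host", default="127.0.0.1",
                        help="address the WebSocket server binds to")
    parser.add_argument("--port", type=int, default=8765,
                        help="WebSocket communication port")
    parser.add_argument("--device", choices=sorted(PROVIDERS), default="cuda",
                        help="ONNX Runtime execution device")
    parser.add_argument("--model", default=None,
                        help="folder name in weights/ to load on startup")
    return parser


def providers_for(device: str) -> list:
    """Return the provider list to pass to an ONNX Runtime session."""
    return list(PROVIDERS[device])
```

The resulting list is the kind of value one would pass as `providers=` when creating an `onnxruntime.InferenceSession`.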

Step 2: Open the Frontend Dashboard

Make sure the frontend client is running (via npm run dev or a static server at http://localhost:3000), then open it in your browser. It connects to the WebSocket API backend automatically.

🔊 Audio DSP Details

To achieve low latency without output artifacts, the audio processing utilizes:

Sliding Window Context Buffer: Keeps a short historical buffer of the audio to feed the model the required context frames while minimizing output audio delay.
Convolution Padding Fadeout: 120ms of trailing silent padding is temporarily appended to input segments to avoid edge-fading anomalies inherent to RVC convolutional steps.
Linear Resampling: Low-overhead linear interpolation for quick sample rate adaptation.
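
The three techniques above can be sketched in plain Python as follows. This is an illustrative approximation under assumptions, not the project's implementation: all names are hypothetical, the context length is left to the caller, and production code would typically operate on NumPy arrays rather than lists.

```python
from typing import List, Tuple


def resample_linear(samples: List[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation between neighboring input samples."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = min(i * step, last)  # clamp so we never read past the end
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def pad_tail(samples: List[float], sample_rate: int, pad_ms: int = 120) -> Tuple[List[float], int]:
    """Append trailing silence; return the padded chunk and the pad length."""
    pad = int(sample_rate * pad_ms / 1000)
    return list(samples) + [0.0] * pad, pad


class ContextBuffer:
    """Keeps the last `context_len` samples to prepend to the next chunk."""

    def __init__(self, context_len: int) -> None:
        self.context_len = context_len
        self.history: List[float] = []

    def push(self, chunk: List[float]) -> List[float]:
        """Return history + chunk, then remember the tail for the next call."""
        window = self.history + list(chunk)
        self.history = window[-self.context_len:] if self.context_len else []
        return window
```

In a pipeline like this, the model would run on the context window plus padding, and the output regions corresponding to the history and the appended silence would be trimmed before playback, so only the new chunk adds to latency.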
