TITLE:
A Study and Practice of Singing Voice Conversion Based on E-SVS and R-SVC
AUTHORS:
Frank Ming Han Dong
KEYWORDS:
E-SVS, R-SVC, Deep Learning, System Architecture, Asynchronous Processing
JOURNAL NAME:
Journal of Computer and Communications, Vol.13 No.9, September 22, 2025
ABSTRACT: To address the poor sound quality and pronounced artifacts common in current AI singing voice conversion techniques, this paper proposes a new AI-driven singing voice conversion method and develops a deployable system around it. First, an Expert-based Singing Voice Separation (E-SVS) model based on UVR5 is established, cascading MDX-Net and VR Architecture models to achieve high-fidelity vocal extraction, dereverberation, and denoising. Then, a Retrieval-based Singing Voice Conversion (R-SVC) model is constructed as the core conversion engine. Using HuBERT to extract content features and Faiss for efficient timbre feature retrieval, the R-SVC model generates cover audio with highly similar timbre and accurate melody. Finally, a WeChat Mini Program front end and an asynchronous processing back end are developed around a task queue mechanism, providing a smooth user experience and resolving the lag associated with computationally intensive AI tasks. In practice, the system was found to train high-quality models of customized vocals at relatively low data and time costs (10 - 30 minutes of audio).
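The task queue idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the class `ConversionQueue`, the function `fake_convert`, and the status strings are hypothetical names chosen for illustration, and the real separation and conversion pipeline is replaced by a stand-in. The sketch only shows the general pattern, in which a request is accepted immediately and the heavy work runs later in a background worker.

```python
import asyncio
import uuid


class ConversionQueue:
    """Hypothetical in-memory task queue: submit returns at once, a worker converts later."""

    def __init__(self):
        self.queue = asyncio.Queue()
        self.status = {}   # task_id -> "queued" | "running" | "done"
        self.results = {}  # task_id -> output of the conversion callable

    async def submit(self, audio_path):
        # Register the job and return an id the front end could poll with.
        task_id = uuid.uuid4().hex
        self.status[task_id] = "queued"
        await self.queue.put((task_id, audio_path))
        return task_id

    async def worker(self, convert):
        # Pull jobs one at a time; run the blocking call in a thread
        # so the event loop stays free to accept new requests.
        while True:
            task_id, audio_path = await self.queue.get()
            self.status[task_id] = "running"
            self.results[task_id] = await asyncio.to_thread(convert, audio_path)
            self.status[task_id] = "done"
            self.queue.task_done()


def fake_convert(audio_path):
    # Assumption: stand-in for the separation and conversion models.
    return audio_path + ".converted"


async def demo():
    q = ConversionQueue()
    worker = asyncio.create_task(q.worker(fake_convert))
    task_id = await q.submit("song.wav")
    status_at_submit = q.status[task_id]
    await q.queue.join()  # wait until the worker has finished the job
    worker.cancel()
    return status_at_submit, q.status[task_id], q.results[task_id]
```

Calling `asyncio.run(demo())` submits one job and waits for it to finish. In a deployed service the client would instead poll the status by task id, and a persistent queue would likely replace the in-memory one.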