A Study and Practice of Singing Voice Conversion Based on E-SVS and R-SVC
ABSTRACT
To address the poor sound quality and audible artifacts common in current AI singing voice conversion techniques, this paper proposes a new AI-driven singing voice conversion method and develops a deployable system around it. First, an Expert-based Singing Voice Separation (E-SVS) model based on UVR5 is established; it cascades MDX-Net and VR Architecture models to achieve high-fidelity vocal extraction, dereverberation, and denoising. Then, a Retrieval-based Singing Voice Conversion (R-SVC) model is constructed as the core conversion engine. It uses HuBERT to extract content features and Faiss for efficient timbre feature retrieval, and generates cover audio with highly similar timbre and accurate melody. Finally, a WeChat Mini Program front end and an asynchronous processing back end are developed around a task queue mechanism, which provides a smooth user experience and resolves the lag associated with computationally intensive AI tasks. In practice, the system was found to train high-quality custom-voice models at relatively low data and time cost (10 to 30 minutes of audio).
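The task queue idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's back end, whose implementation is not described here. It assumes a single in-process worker and uses only Python's standard `asyncio` library. The names `ConversionQueue`, `fake_convert`, and `demo` are hypothetical, and `fake_convert` sleeps in place of the real separation and conversion models. The point is only that submission returns immediately with a job id, while the slow work runs in the background and the client polls the job status.

```python
import asyncio
import itertools


class ConversionQueue:
    """Hypothetical in-memory job queue: submit() returns at once,
    a background worker performs the slow conversion."""

    def __init__(self, convert):
        self._convert = convert            # async callable: audio_id -> result
        self._queue = asyncio.Queue()
        self._ids = itertools.count(1)
        self.jobs = {}                     # job_id -> {"status", "result"}

    def submit(self, audio_id):
        """Register a job and return its id without waiting for conversion."""
        job_id = next(self._ids)
        self.jobs[job_id] = {"status": "queued", "result": None}
        self._queue.put_nowait((job_id, audio_id))
        return job_id

    async def worker(self):
        """Process jobs one at a time so the heavy model never runs concurrently."""
        while True:
            job_id, audio_id = await self._queue.get()
            job = self.jobs[job_id]
            job["status"] = "running"
            try:
                job["result"] = await self._convert(audio_id)
                job["status"] = "done"
            except Exception as exc:       # record the failure, keep the worker alive
                job["result"] = str(exc)
                job["status"] = "failed"
            finally:
                self._queue.task_done()

    async def join(self):
        """Wait until every submitted job has been processed."""
        await self._queue.join()


async def fake_convert(audio_id):
    """Assumed stand-in for separation + conversion; sleeps instead of running a model."""
    await asyncio.sleep(0.01)
    if audio_id == "bad":
        raise ValueError("unreadable audio")
    return f"{audio_id}.converted.wav"


async def demo():
    q = ConversionQueue(fake_convert)
    worker = asyncio.create_task(q.worker())
    ids = [q.submit(name) for name in ("song_a", "bad", "song_b")]
    statuses_at_submit = [q.jobs[i]["status"] for i in ids]
    await q.join()
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return statuses_at_submit, q.jobs
```

Running `asyncio.run(demo())` submits three jobs, all of which are still `queued` when `submit` returns. A front end in this style would store the returned job id and poll `jobs[job_id]["status"]` rather than block on the conversion itself. A production system would more likely use a persistent queue and separate worker processes, which this sketch does not attempt.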
Share and Cite:
Dong, F. (2025) A Study and Practice of Singing Voice Conversion Based on E-SVS and R-SVC. Journal of Computer and Communications, 13, 42-54. doi: 10.4236/jcc.2025.139003.