Mastering Local and Open-Source AI Voice Cloning

This video details a challenging but rewarding project: cloning voices with local, open-source AI software as a completely free alternative to expensive, cloud-based services. It walks through the full workflow, from data preparation and model training to integrating the finished voice into a local AI voice assistant.

Key Points Summary

  • Project Motivation and Goal

    The project's mission is to clone voices using local, open-source, and free software, avoiding expensive cloud-based solutions like Eleven Labs, and integrating the custom voice into a local AI voice assistant like Terry, which replaces Alexa.

  • Legal and Ethical Considerations for Voice Cloning

    Cloned voices cannot be used for commercial purposes, content production, or distribution without explicit permission. This is not legal advice, and it is recommended to keep everything local and consult with a lawyer if there are any doubts.

  • Hardware and Software Requirements

    A computer (Mac, Linux, or Windows) is required, with a GPU significantly accelerating AI model training compared to a CPU. Demonstrations include WSL on a laptop with an NVIDIA 3080, a dual-4090 AI server, and a GPU-powered AWS EC2 instance. The core software is Piper TTS, which exports trained voices in the standard ONNX model format, making them portable for use in any voice assistant or text-to-speech program.

  • Data Acquisition for Voice Cloning

    High-quality, clean voice data is essential for AI voice cloning. For cloning your own voice, Piper Recording Studio makes it easy to record phrases yourself. To clone another voice, such as one from YouTube videos, the audio must be downloaded, and the source voice needs to be clean, with no background music or interruptions. Pre-processed audio files are provided for convenience.
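
    The summary doesn't name a specific downloader; yt-dlp is a common open-source choice, sketched below (the output template and URL placeholder are assumptions):

    ```bash
    # Extract the audio track of a YouTube video as WAV for later cleaning.
    # -x extracts audio; --audio-format wav converts it; -o sets the filename.
    yt-dlp -x --audio-format wav -o 'source_audio/%(title)s.%(ext)s' VIDEO_URL
    ```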

  • Setting Up Piper Recording Studio

    The process involves installing WSL (Windows Subsystem for Linux) and Ubuntu, then cloning the Piper Recording Studio repository. A Python virtual environment is created and activated to manage dependencies, followed by installing required Python packages. The Piper Recording Studio is then run as a web interface on localhost:8000, allowing users to record phrases directly into the system, with more recordings leading to higher accuracy.
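
    A condensed sketch of that setup, assuming a stock Ubuntu/WSL environment with git and Python 3 available:

    ```bash
    # Clone Piper Recording Studio and isolate its dependencies in a venv
    git clone https://github.com/rhasspy/piper-recording-studio.git
    cd piper-recording-studio

    python3 -m venv .venv            # create the virtual environment
    source .venv/bin/activate        # activate it
    pip install -r requirements.txt  # install the required Python packages

    # Start the web interface, then record phrases at http://localhost:8000
    python3 -m piper_recording_studio
    ```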

  • Data Cleaning and Preparation

    Raw audio data must be cleaned: music and silence removed, and long recordings split into shorter segments. Audacity, an open-source audio editor, is used to manually remove music from the downloaded YouTube audio, and FFmpeg is scripted to strip out silence automatically. The remaining long files are then cut into many short clips, ideally no longer than 15 seconds each, using a bash script.
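
    The exact scripts from the video aren't reproduced here; the FFmpeg commands below are a minimal equivalent, with silence thresholds that are assumptions you may need to tune:

    ```bash
    # Remove every stretch of silence longer than 0.5 s below -40 dB
    ffmpeg -i cleaned.wav \
      -af silenceremove=stop_periods=-1:stop_duration=0.5:stop_threshold=-40dB \
      no_silence.wav

    # Split the result into clips no longer than 15 seconds each
    ffmpeg -i no_silence.wav -f segment -segment_time 15 -c copy clip_%03d.wav
    ```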

  • Audio Transcription with Whisper

    After cleaning and splitting, Whisper, a local transcription tool, transcribes every audio file into text. A Python script processes each file and generates a metadata.csv that pairs each audio filename with its transcription, the format Piper requires for training.
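
    The video uses its own Python script; an equivalent sketch using the Whisper CLI is shown below (directory names and the "small" model choice are assumptions). It writes metadata.csv in the `id|text` layout Piper's ljspeech loader expects:

    ```bash
    #!/usr/bin/env bash
    # Transcribe every clip, then pair each filename with its transcription.
    mkdir -p transcripts
    for f in wavs/*.wav; do
      whisper "$f" --model small --language en \
        --output_format txt --output_dir transcripts
    done

    : > metadata.csv  # start with an empty file
    for f in wavs/*.wav; do
      id=$(basename "$f" .wav)
      text=$(tr '\n' ' ' < "transcripts/${id}.txt")
      printf '%s|%s\n' "$id" "$text" >> metadata.csv
    done
    ```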

  • Privileged Access Management (PAM) with Keeper

    Keeper's privileged access manager and Keeper Secrets Manager (KSM) address a critical cybersecurity issue: developers hard-coding secrets (passwords, API keys) into software. KSM is a secure, cloud-based, zero-knowledge platform for managing credentials directly from the terminal, and it integrates with platforms such as Ansible, AWS, and Docker.

  • Training Environment Setup for Piper TTS

    A new directory for training is created, and the Piper repository is cloned into it. A Python virtual environment is set up and activated to manage specific library versions. Special attention is given to installing precise versions of pip (23.3.1), NumPy (1.24.4), and torchmetrics to avoid compatibility errors. For NVIDIA 4090 GPUs, a specific fix is needed to work around a CUDA 11.7 bug: modify the requirements file and install a different PyTorch build.
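
    A sketch of that setup (the torchmetrics pin and the CUDA 11.8 wheel below are assumptions based on commonly reported fixes; defer to the video if yours differ):

    ```bash
    # Clone Piper and enter its Python training package
    git clone https://github.com/rhasspy/piper.git
    cd piper/src/python

    python3 -m venv .venv
    source .venv/bin/activate

    pip install pip==23.3.1           # exact pip version called out above
    pip install numpy==1.24.4         # NumPy version Piper's training expects
    pip install -e .                  # install piper_train and its requirements
    pip install torchmetrics==0.11.4  # assumed working torchmetrics pin

    bash build_monotonic_align.sh     # build step from Piper's training docs

    # RTX 4090 only: the CUDA 11.7 PyTorch build has a known bug on this GPU.
    # Relax the torch pin in requirements.txt, then install a CUDA 11.8 wheel:
    # pip install torch --index-url https://download.pytorch.org/whl/cu118
    ```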

  • Data Pre-processing for Training

    The `piper_train.preprocess` Python module is executed to prepare the cleaned and transcribed data for training. This step takes the WAV directory and metadata.csv file, processes them, and outputs a new directory containing everything Piper needs to train the voice model. It runs with multiple CPU workers, processing each utterance to optimize the data for the training phase.
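
    In practice this is a single command; the paths below are examples, and 22050 Hz matches Piper's "medium" quality models:

    ```bash
    # my-dataset/ holds the wav clips plus metadata.csv from the previous step
    python3 -m piper_train.preprocess \
      --language en-us \
      --input-dir ~/piper-training/my-dataset \
      --output-dir ~/piper-training/my-training \
      --dataset-format ljspeech \
      --single-speaker \
      --sample-rate 22050 \
      --max-workers 8   # CPU worker count mentioned above
    ```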

  • Voice Model Training (Fine-tuning)

    Training begins by downloading an existing pre-trained model checkpoint (e.g., the English US 'lessac' medium-quality voice) as a starting point for fine-tuning, which significantly reduces training time. Training parameters such as the dataset directory, GPU usage, batch size (adjustable to available VRAM), and maximum epochs are then configured. Progress can be monitored, and training can be paused and resumed from checkpoints.
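
    A representative invocation (the checkpoint filename and hyperparameters are examples; the .ckpt is downloaded from the rhasspy piper-checkpoints collection on Hugging Face first):

    ```bash
    # Fine-tune from the en_US lessac medium checkpoint.
    # Note: --max_epochs and --resume_from_checkpoint use underscores because
    # they pass through to PyTorch Lightning; lower the batch size if you run
    # out of VRAM.
    python3 -m piper_train \
      --dataset-dir ~/piper-training/my-training \
      --accelerator gpu \
      --devices 1 \
      --batch-size 32 \
      --validation-split 0.0 \
      --num-test-examples 0 \
      --max_epochs 6000 \
      --resume_from_checkpoint ~/piper-training/epoch=2164-step=1355540.ckpt \
      --checkpoint-epochs 1 \
      --precision 32
    ```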

  • Evaluating and Exporting the Trained Model

    Upon completion of training (indicated by 'final epoch reached'), the final checkpoint is exported to the universal ONNX format with the `piper_train.export_onnx` command, and the `config.json` file from the training directory is copied alongside the ONNX model file. Initial attempts at voice generation show that voice quality depends heavily on the cleanliness, quantity, and diversity of the training clips, as well as the number of training epochs.
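
    The export itself is two commands; version and epoch numbers below are placeholders for your actual run:

    ```bash
    # Convert the final checkpoint to ONNX
    python3 -m piper_train.export_onnx \
      ~/piper-training/my-training/lightning_logs/version_0/checkpoints/*.ckpt \
      ~/piper-training/my-voice.onnx

    # Piper expects the config to sit next to the model as <model>.onnx.json
    cp ~/piper-training/my-training/config.json \
       ~/piper-training/my-voice.onnx.json
    ```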

  • Integrating Cloned Voice into Home Assistant

    The exported ONNX model and its corresponding config JSON file are uploaded to the Home Assistant server via a Samba share. The files are placed in a 'piper' folder and may need renaming to Piper's `<voice>.onnx` / `<voice>.onnx.json` convention so the add-on recognizes them. After refreshing the Piper add-on and the Wyoming device, the new custom voice can be selected as a text-to-speech option in Home Assistant's voice assistant settings, ready for use with AI conversation agents like Terry.
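
    Assuming the add-on scans a `piper` folder on the share and the files follow the naming convention above, the copy step is simply:

    ```bash
    # Paths are assumptions; adjust to wherever your Samba share is mounted
    cp my-voice.onnx my-voice.onnx.json /mnt/ha-share/piper/
    ```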

  • Final Voice Assistant Demonstration

    The integrated voice assistant, configured with the cloned voice and a specific personality prompt (e.g., 'sinister man' for Mike), demonstrates its ability to answer questions with the new custom voice. Performance depends on the processing power of the Home Assistant server, with dedicated AI servers like Terry providing much faster response times.

This is exactly why running everything locally with open-source tools is so appealing: you get more power, and your data stays private.

Details

| Category | Item | Description |
| --- | --- | --- |
| Core Technology | Piper TTS | Open-source, local text-to-speech engine for voice cloning and generation. |
| Data Preparation Tool | Piper Recording Studio | Web-based tool for easily recording and preparing custom voice datasets. |
| Audio Processing Tools | Audacity, FFmpeg | Used for cleaning audio data (removing music, silence) and splitting long files into short clips (max 15 seconds). |
| Transcription Tool | Whisper | Local AI model for accurately transcribing audio files into text, generating metadata for training. |
| Key Output Format | ONNX | Universal format for trained TTS models, allowing portability across different platforms. |
| Training Strategy | Fine-tuning | Using a pre-trained base model as a starting point to significantly reduce training time and effort. |
| Critical Dependency Fix | Specific Library Versions | Exact versions of pip (23.3.1), NumPy (1.24.4), and torchmetrics required to prevent compatibility errors during Piper setup. |
| GPU-Specific Fix | NVIDIA 4090 / CUDA 11.7 | Resolved a bug by changing Piper's requirements and PyTorch version to support 4090 GPUs. |
| Quality Determinants | Data Quality, Quantity, Diversity | Pristine, ample, and varied audio clips are crucial for high-quality voice cloning results. |
| Sponsor Solution | Keeper Secrets Manager | Secure, cloud-based, zero-knowledge management of developer secrets (API keys, passwords) directly from the terminal. |

Tags

ArtificialIntelligence
VoiceCloning
Educational
PiperTTS
Keeper
WhisperAI