Overview
Training a custom voice involves three main steps:
1. Create the voice: Initialize a new voice model with your training parameters
2. Upload training audio: Provide high-quality audio samples for the voice to learn from
3. Start training: Begin the AI training process to create your custom voice model
Custom voice training consumes 10 API credits per minute of training time (specified by maxMinutes). Choose your training duration wisely.
Step 1: Create a Voice Model
Start by creating a new voice with your desired training parameters.
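Before sending the create request, it can help to check the parameters locally against the ranges listed under Parameters Explained. The sketch below is a minimal illustration and not part of the API: the field names `name` and `fileCount` are hypothetical placeholders (this guide does not show the real names), while `maxMinutes` and `separate` are the names used elsewhere in this guide.

```javascript
// Hypothetical local check of voice-creation parameters.
// `name` and `fileCount` are placeholder field names (assumptions);
// `maxMinutes` and `separate` are the names used in this guide.
function validateVoiceParams({ name, fileCount, maxMinutes, separate }) {
  const errors = [];
  if (typeof name !== "string" || name.length < 1 || name.length > 50) {
    errors.push("name must be 1-50 characters");
  }
  if (!Number.isInteger(fileCount) || fileCount < 1 || fileCount > 5) {
    errors.push("fileCount must be an integer from 1 to 5");
  }
  if (typeof maxMinutes !== "number" || !(maxMinutes >= 5 && maxMinutes <= 60)) {
    errors.push("maxMinutes must be between 5 and 60");
  }
  if (typeof separate !== "boolean") {
    errors.push("separate must be true or false");
  }
  return errors; // an empty array means the parameters look valid
}
```

Returning a list of messages rather than throwing on the first problem lets you report every out-of-range value at once.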
Parameters Explained
- Display name for your voice model (1-50 characters). Choose something descriptive and unique.
- Number of audio files to upload for training (1-5). More diverse files typically yield better results.
- maxMinutes: Maximum duration of training audio in minutes (5-60). More training data generally produces better quality, but costs more credits.
- separate: Whether to use vocal separation during training. Set to false if your training audio already contains isolated vocals.
Step 2: Upload Training Audio
Use the provided upload URLs to upload your training audio files.
Training Audio Guidelines
Audio Quality:
- Use high-quality recordings (192kbps+ MP3 or lossless formats)
- A total of 3–5 minutes works well, but longer files (up to 20 minutes) are also supported
- Minimize background noise and reverb
- Ensure consistent audio levels across files
- Avoid heavily processed or auto-tuned vocals
- Include diverse vocal expressions (soft, loud, emotional)
- Mix different tempos and rhythms
- Include both sustained notes and quick phrases
- Vary pitch range throughout the samples
- Each file: 1-20 minutes duration
- Total training duration: Up to your specified maxMinutes
- Supported formats: MP3, WAV, M4A, FLAC, OGG
- Video files are supported (audio will be extracted)
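The file requirements above can be checked before uploading. This is a minimal sketch under stated assumptions: the shape of each file descriptor (`fileName`, `durationMinutes`) is hypothetical, durations are assumed to have been measured beforehand, and video files are not handled because this guide does not list their extensions.

```javascript
// Audio formats listed in the guidelines above.
const SUPPORTED_EXTENSIONS = ["mp3", "wav", "m4a", "flac", "ogg"];

// Hypothetical pre-upload check. Each file is described as
// { fileName, durationMinutes }, a shape assumed for this sketch.
function checkTrainingFiles(files, maxMinutes) {
  const problems = [];
  let totalMinutes = 0;
  for (const { fileName, durationMinutes } of files) {
    const ext = fileName.split(".").pop().toLowerCase();
    if (!SUPPORTED_EXTENSIONS.includes(ext)) {
      problems.push(`${fileName}: unsupported format "${ext}"`);
    }
    if (durationMinutes < 1 || durationMinutes > 20) {
      problems.push(`${fileName}: duration must be 1-20 minutes`);
    }
    totalMinutes += durationMinutes;
  }
  if (totalMinutes > maxMinutes) {
    problems.push(`total duration ${totalMinutes} min exceeds maxMinutes (${maxMinutes})`);
  }
  return problems; // an empty array means no problems were found
}
```

For example, `checkTrainingFiles([{ fileName: "take1.wav", durationMinutes: 3 }], 10)` returns an empty array, while a 25-minute file or a `.txt` file would each add a message.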
Step 3: Start Training
Once all files are uploaded, start the training process.
Step 4: Monitor Training Progress
Poll the voice status to track training progress.
Status Values
- new - Voice created, waiting for file uploads
- queued - All files uploaded, waiting in training queue
- starting - Training initialization beginning
- processing - AI model training in progress
- finalizing - Completing training and validating model
- done - Training complete! Voice ready for use
- error - Training failed (check errorMessage)
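These status values suggest a simple polling loop: keep checking until the status is done or error. The sketch below assumes a caller-supplied `fetchStatus` function that resolves to an object with `status` and `errorMessage` fields. How that function calls the API is not shown in this guide, so it is left as a hypothetical parameter, and the interval and attempt limit are illustrative defaults rather than recommended values.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Polls until training finishes. `fetchStatus` is a hypothetical
// caller-supplied async function resolving to { status, errorMessage }.
async function waitForTraining(fetchStatus, { intervalMs = 10000, maxAttempts = 360 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const voice = await fetchStatus();
    if (voice.status === "error") {
      throw new Error(`Training failed: ${voice.errorMessage}`);
    }
    if (voice.status === "done") {
      return voice; // training complete, voice ready for use
    }
    // Any other status (new, queued, starting, processing, finalizing)
    // means training is still in progress, so wait and check again.
    await sleep(intervalMs);
  }
  throw new Error("Gave up waiting for training to finish");
}
```

Passing the status fetcher in as a parameter keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub.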
Training typically takes up to maxMinutes plus a couple of minutes of setup time, depending on the amount of training audio, but it can take longer when the queue is busy.
Step 5: Use Your Trained Voice
Once training is complete (status: "done"), your voice is ready for dubbing.
When filtering public voices (languages, genres, styles), multiple values inside the same filter are matched using OR (for example, languages: English or Spanish). Different filter types are combined using AND (for example, languages and genres and styles must all match). See the API reference for details.
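The matching rule can be illustrated with a small local function. This is only a sketch of the OR-within, AND-across logic described above, not the API's implementation. The voice object shape and the treatment of an empty filter as "no constraint" are assumptions.

```javascript
// OR within a filter type, AND across filter types.
// `voice` and `filters` map a filter type (e.g. "languages") to an
// array of values; this shape is assumed for illustration.
function matchesFilters(voice, filters) {
  return Object.entries(filters).every(([type, wanted]) => {
    if (!wanted || wanted.length === 0) return true; // assumed: empty filter = no constraint
    const actual = voice[type] || [];
    return wanted.some((value) => actual.includes(value));
  });
}

const voice = { languages: ["Spanish"], genres: ["Pop"], styles: ["Soft"] };
console.log(matchesFilters(voice, { languages: ["English", "Spanish"], genres: ["Pop"] })); // true
console.log(matchesFilters(voice, { languages: ["English", "Spanish"], genres: ["Rock"] })); // false
```

The first call matches because Spanish satisfies the languages filter and Pop satisfies the genres filter. The second fails because no genre matches, even though the language does.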
Complete Example
Here’s a complete Node.js example for training a custom voice:
Training Duration Guidelines
Choose your maxMinutes based on your quality needs and budget:
Basic (5-10 min)
- Cost: 50-100 credits
- Quality: Good for simple voices
- Best for: Testing, basic character voices
Standard (10-30 min)
- Cost: 100-300 credits
- Quality: High quality results
- Best for: Most use cases, content creation
Premium (30-60 min)
- Cost: 300-600 credits
- Quality: Exceptional quality
- Best for: Professional projects, complex voices
Voice Quality Tips
Recording Environment
- Quiet space: Record in a quiet room with minimal echo
- Consistent microphone: Use the same mic for all training files
- Stable distance: Maintain consistent distance from microphone
- Audio levels: Keep input levels consistent but avoid clipping
Content Selection
- Emotional range: Include happy, sad, excited, and calm expressions
- Pitch variety: Cover the full vocal range of the target voice
- Speech patterns: Include natural speech or singing rhythm and pacing
- Phonetic coverage: Ensure good coverage of different sounds
Technical Quality
- Sample rate: 44.1kHz minimum, 48kHz preferred
- Bit depth: 16-bit minimum, 24-bit preferred
- Format: WAV or FLAC for best quality, high-bitrate MP3 acceptable
- Editing: Light noise reduction okay, avoid heavy processing
Pricing & Credits
Custom voice training consumes 10 API credits per minute of training time (based on your maxMinutes setting):
- 5-minute training = 50 credits
- 10-minute training = 100 credits
- 30-minute training = 300 credits
- 60-minute training = 600 credits
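The arithmetic behind these figures is a single multiplication. The helper below is a local estimate only, with a hypothetical function name. It assumes the 5-60 minute range given for maxMinutes earlier in this guide.

```javascript
const CREDITS_PER_MINUTE = 10; // rate stated in this guide

// Local estimate of training cost. `estimateTrainingCredits` is a
// hypothetical helper name, not part of the API.
function estimateTrainingCredits(maxMinutes) {
  if (!(maxMinutes >= 5 && maxMinutes <= 60)) {
    throw new RangeError("maxMinutes must be between 5 and 60");
  }
  return maxMinutes * CREDITS_PER_MINUTE;
}

for (const minutes of [5, 10, 30, 60]) {
  console.log(`${minutes} min -> ${estimateTrainingCredits(minutes)} credits`);
}
// 5 min -> 50 credits
// 10 min -> 100 credits
// 30 min -> 300 credits
// 60 min -> 600 credits
```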
The exact cost is shown in the requiredCredits field when you create the voice, before training starts.
Troubleshooting
Training failed with poor quality
Common causes and solutions:
- Low-quality source audio: Use higher bitrate recordings
- Inconsistent audio: Ensure similar recording conditions for all files
- Insufficient data: Try increasing maxMinutes for more training material
- Poor vocal separation: If audio has backing tracks, ensure separate: true
Training stuck or taking too long
Training typically takes the length of your maxMinutes setting, plus 1–2 minutes for setup and processing. If your training is taking significantly longer than this, please reach out to our support team for assistance.
Upload failures
- File too large: Max 50MB per file, consider compressing audio
- Unsupported format: Use MP3, WAV, M4A, FLAC, or OGG
- Network timeout: Try uploading smaller files or check connection
Voice doesn't sound right
- Try different source material: More diverse audio often helps
- Adjust pitch in dubbing: Use the pitchShift parameter when creating dubs
- Increase training duration: More training data usually improves quality
- Use dry acapella recordings: Training works best with clean vocals free from background noise or music. Our vocal separation can help, but starting with isolated vocals gives the best results.
Voice Management
Once trained, your custom voices:
- Persist indefinitely - No expiration or maintenance required
- Are private to your account - Only you can use them for dubbing
- Can be used unlimited times - Usage incurs standard API dubbing credits, in addition to the training cost
- Work with all dubbing features - Pitch shifting, vocal separation, etc.
Your trained voice is now ready! Use it in the dubbing API or integrate it into your applications.