Clone Any Voice and Make It Say Anything with Step Audio EditX

You have a voice you want to use. A character, a narrator, a specific person's tone. You need it saying something it never actually said.

No fine-tuning. No training data. No recording sessions.

One audio clip in. New speech in that same voice out.

Run it now on Floyo!

Why This Workflow Is Different

Most voice cloning tools require training runs, datasets, or lengthy setup. This one needs a single audio clip.

Upload your reference audio. Whisper auto-transcribes it. Step Audio EditX pairs the transcription with the audio as a voice reference and generates new speech from whatever text you type. You get an MP3 back, ready to use. No trigger words, no model training, no extra steps.

one reference clip is all you need
auto-transcription via Whisper, so no manual typing of the reference text
temperature control for delivery style
MP3 output, ready to drop into any project

How It Works

Whisper transcribes your reference audio automatically. You don't need to type out what the speaker says; the workflow handles it.

Step Audio EditX takes the reference audio and its transcription together as a voice fingerprint, then generates new speech from your text prompt in that same voice. It captures tone, cadence, accent, and speaking style from the reference.

The output is a clean MP3 file.

Key Inputs

Reference Audio

The voice sample you want to clone. MP3 and WAV both work.

5 to 15 seconds is the sweet spot
clean recording with minimal background noise
close-mic or studio-quality audio gives the best clone
longer clips don't improve quality and only slow generation down

Works well with:

voiceover recordings
podcast or interview clips
clean dialogue from video
any clear, single-speaker audio

Works less well with:

clips with music or background noise underneath
very short clips under 3 seconds
multiple speakers in the same clip
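
If you want to check a clip's length before uploading, the guidance above can be sketched as a small script. This is only an illustration: the function names and labels are made up for this example, it reads WAV files only (via Python's standard wave module), and the "below sweet spot" label for 3 to 5 second clips is an assumption, since the ranges above don't say how those clips behave.

```python
import wave

# Thresholds taken from the guidance above; names are illustrative only.
MIN_SECONDS = 3.0
SWEET_SPOT = (5.0, 15.0)


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def rate_reference_length(seconds: float) -> str:
    """Classify a clip length against the ranges described above."""
    low, high = SWEET_SPOT
    if seconds < MIN_SECONDS:
        return "too short"
    if seconds < low:
        # Assumption: 3-5 s is not covered by the guidance above.
        return "below sweet spot"
    if seconds <= high:
        return "sweet spot"
    return "longer than needed"
```

For example, rate_reference_length(wav_duration_seconds("my_clip.wav")) would label a local file, where my_clip.wav is a placeholder name. The check covers length only; it says nothing about noise or the number of speakers.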

Text Prompt

What you want the cloned voice to say. Write naturally: punctuation and sentence structure affect pacing and delivery. Short to medium scripts work best.

Temperature: default 0.7

0.3–0.5: steady, predictable delivery, less variation
0.7: balanced, natural-sounding output
0.8–1.0: more expressive, dynamic reads
above 1.0: adds randomness that can sound unnatural

Max New Tokens: default 2048. Handles short to medium scripts. Raise it if output gets cut off. Lower it for shorter clips and faster generation.

Seed: randomized by default for slight variation between runs. Lock it to a specific number when you want to compare other settings without run-to-run voice variation.
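
One way to keep track of these three settings while comparing takes is sketched below. Everything here is hypothetical bookkeeping, not the workflow's own API: GenerationSettings and temperature_sweep are invented names, and using None to mean "randomized seed" is an assumption made for the example. Only the defaults (0.7 and 2048) come from the text above.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for the three generation inputs."""
    temperature: float = 0.7      # default from the text above
    max_new_tokens: int = 2048    # default from the text above
    seed: Optional[int] = None    # assumption: None stands for "randomized"


def temperature_sweep(
    base: GenerationSettings, temperatures: List[float], seed: int
) -> List[GenerationSettings]:
    """Build one settings object per temperature, all sharing one seed.

    Locking the seed means temperature is the only thing that changes
    between the takes being compared.
    """
    return [replace(base, temperature=t, seed=seed) for t in temperatures]


if __name__ == "__main__":
    for run in temperature_sweep(GenerationSettings(), [0.3, 0.7, 0.9], seed=42):
        print(run)
```

Each resulting object corresponds to one run you would enter by hand in the workflow: same seed, same token limit, different temperature.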

What This Is Great For

Voiceover prototyping: hear how a script sounds in a specific voice before booking a recording session. Fast enough to test multiple scripts in one sitting.

Character voice work: generate dialogue for games, animations, or podcasts from a reference clip. Run multiple takes at different temperatures to find the right delivery.

Content localization prep: generate placeholder voiceover in the right voice style before final recording.

Narration drafts: prototype narration tracks for video edits before committing to a full voice recording.

What to Watch Out For

Background noise in the reference clip transfers to the output. A noisy reference produces a noisier clone. Use the cleanest recording you have.

Long scripts hit token limits. At the default of 2048 max new tokens, the workflow handles short to medium scripts well. For longer narration, split the script into segments and generate each separately.
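
Splitting a long script can be done by hand, but a rough sketch of doing it at sentence boundaries looks like this. Treat it as a starting point under stated assumptions: split_script is a made-up helper, it budgets by character count because the relationship between script length and audio tokens isn't given here, and the 400-character default is an arbitrary placeholder to tune against your own results.

```python
import re
from typing import List


def split_script(script: str, max_chars: int = 400) -> List[str]:
    """Group whole sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than
    being cut mid-sentence.
    """
    # Split after ., ! or ? followed by whitespace (a simple heuristic).
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the segment.
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


if __name__ == "__main__":
    for i, part in enumerate(split_script("One. Two! Three? Four.", max_chars=10), 1):
        print(i, part)
```

Generate each segment as its own run, ideally with the same reference clip and a locked seed so the voice stays consistent, then join the resulting MP3s in your editor.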

Temperature above 1.0 introduces artifacts. Stay between 0.3 and 0.9 for clean, usable output.

This workflow is not suited for long-form audiobook narration where you need precise emotional range across many paragraphs. For that, a dedicated TTS pipeline with prosody control will serve better.

Only clone voices you have permission to use. Don't use this to impersonate people without consent.