How to Convert Text to Audio for Free Using Python and Kokoro TTS

Learn how to build a free, local text-to-speech converter in Python using Kokoro TTS. Convert any document, script, or article to a natural-sounding MP3 — perfect for listening while driving, exercising, or doing chores.
Listen to Anything, Anywhere
Have a long article you never have time to read? A research document, blog post, meeting notes, or a script you want to hear out loud? What if you could convert any text file to a natural-sounding MP3 and listen to it while driving, exercising, or doing chores?
That’s exactly what this project does. With a single Python script and a free, open-source text-to-speech model called Kokoro, you can turn any .txt file into a high-quality audio file in minutes — entirely on your own machine, with no API keys, no subscriptions, and no data leaving your computer.
What Is Kokoro TTS?
Kokoro is a lightweight, open-source text-to-speech model with 82 million parameters. Despite its small size, it produces remarkably natural-sounding voices — far better than older TTS systems — and runs efficiently on a standard CPU without requiring a GPU.
It supports over 50 voices across multiple accents and languages including American English, British English, Spanish, French, Japanese, Mandarin, and more. The model weights download automatically on first use and are cached locally after that.
Why Build This Instead of Using a Cloud Service?
Most cloud-based TTS tools send your text to a third-party server. That means your documents, notes, or scripts are processed on someone else’s infrastructure. For anyone handling sensitive content — business strategies, personal notes, client materials — that is worth thinking carefully about.
This project runs entirely locally. Nothing leaves your machine. Once the model weights and the voice files you use have been downloaded, it works completely offline.
The Tech Stack
- Python 3.11 — the runtime
- Kokoro 0.9.4 — the TTS model
- soundfile — for writing audio output
- numpy — for audio array handling
- ffmpeg — for converting WAV to MP3 (optional but recommended)
- WSL2 on Windows 11 — the Linux environment used during development
Setting Up the Environment
Step 1: Install WSL2 (Windows users)
If you are on Windows, the easiest way to run this project is through WSL2 (Windows Subsystem for Linux). Open Command Prompt or PowerShell as Administrator and run:
wsl --install
Restart when prompted. Ubuntu will be installed by default.
Step 2: Install Python 3.11
Ubuntu’s default repositories may include a very recent Python version (3.14 at time of writing) that is not yet compatible with all of Kokoro’s dependencies. Install Python 3.11 explicitly via the deadsnakes PPA:
sudo apt install software-properties-common -y
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt update
sudo apt install python3.11 python3.11-venv -y
Step 3: Create a Virtual Environment
python3.11 -m venv ~/tts-env
source ~/tts-env/bin/activate
You will see (tts-env) at the start of your prompt. Remember to run source ~/tts-env/bin/activate each time you open a new Ubuntu session.
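If you are unsure whether the environment is active, a quick check from Python can confirm it. The snippet below is a minimal sketch using only the standard library; the function names are made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def python_is_311() -> bool:
    """Return True when the running interpreter is Python 3.11.x."""
    return sys.version_info[:2] == (3, 11)


if __name__ == "__main__":
    print(f"Inside a virtual environment: {in_virtualenv()}")
    print(f"Running Python 3.11: {python_is_311()}")
```

Run it with the same python command you plan to use for the converter. If either line prints False, re-run the activate command above.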
Step 4: Install Dependencies
pip install kokoro soundfile numpy
Step 5: Install ffmpeg (Optional)
ffmpeg converts the WAV output to MP3. Without it, the script saves a WAV file instead — which is larger but still perfectly usable.
sudo apt install ffmpeg -y
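To find out ahead of time which format you will get, you can check whether ffmpeg is on your PATH. This is a small sketch with a hypothetical helper name; it only looks for the executable and does not verify that MP3 encoding is available in your ffmpeg build.

```python
import shutil


def pick_output_extension(ffmpeg_name: str = "ffmpeg") -> str:
    """Return ".mp3" when the ffmpeg executable is on PATH, otherwise ".wav"."""
    return ".mp3" if shutil.which(ffmpeg_name) else ".wav"


print(f"Expected output format: {pick_output_extension()}")
```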
Why Python 3.11 Specifically?
During development, installing Kokoro on Python 3.14 failed with a build error deep in the blis dependency chain:
gcc: error: unrecognized command-line option '-mavx512pf'
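The failure originates in blis, which is pulled in by spaCy (the library Kokoro depends on for English phonemization). On the newest Python versions this dependency is likely built from source rather than installed as a prebuilt wheel, and the build can pass a flag that the compiler rejects. Python 3.11 is stable, well supported, and compatible with the entire Kokoro dependency tree.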
The Script
Here is the full tts_converter.py script. Save it anywhere in your project directory.
#!/usr/bin/env python3
"""
tts_converter.py - Convert text or transcript files to MP3 using Kokoro TTS

Usage:
    python tts_converter.py "Some text to speak"
    python tts_converter.py --file transcript.txt
    python tts_converter.py --file transcript.txt --voice am_fenrir
    python tts_converter.py --file transcript.txt --voice am_fenrir --output my_audio.mp3
"""
import argparse
import os
import re
import subprocess
import sys

import numpy as np
import soundfile as sf

AVAILABLE_VOICES = [
    # American Female
    "af_heart", "af_alloy", "af_aoede", "af_bella", "af_jessica",
    "af_kore", "af_nicole", "af_nova", "af_river", "af_sarah", "af_sky",
    # American Male
    "am_adam", "am_echo", "am_eric", "am_fenrir", "am_liam",
    "am_michael", "am_onyx", "am_puck",
    # British Female
    "bf_alice", "bf_emma", "bf_isabella", "bf_lily",
    # British Male
    "bm_daniel", "bm_fable", "bm_george", "bm_lewis",
]


def install_dependencies():
    # Install Kokoro on demand if it is not importable yet.
    try:
        import kokoro
    except ImportError:
        print("Installing Kokoro TTS (first run only)...")
        subprocess.check_call([
            sys.executable, "-m", "pip", "install",
            "kokoro>=0.9.4", "soundfile", "numpy", "--quiet"
        ])
        print("Done.\n")
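

def clean_text(text):
    # Remove bold/italic asterisks but keep the emphasized text.
    text = re.sub(r'\*{1,3}(.+?)\*{1,3}', r'\1', text)
    # Remove leading heading markers (# through ######).
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)
    # Remove URLs so they are not read aloud.
    text = re.sub(r'https?://\S+', '', text)
    # Remove leading bullet characters.
    text = re.sub(r'^\s*[-*•]\s+', '', text, flags=re.MULTILINE)
    # Collapse runs of three or more newlines into one blank line.
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()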


def chunk_text(text, max_chars=500):
    # Split after sentence-ending punctuation, then pack sentences into chunks.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) < max_chars:
            current += (" " if current else "") + sentence
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def text_to_mp3(text, output_path, voice="af_heart"):
    # Imported here so install_dependencies() can run before Kokoro is needed.
    from kokoro import KPipeline

    print(f"Voice: {voice}")
    print(f"Output: {output_path}")
    print(f"Characters: {len(text)}\n")

    pipeline = KPipeline(lang_code='a')
    text = clean_text(text)
    chunks = chunk_text(text)
    print(f"Processing {len(chunks)} chunk(s)...")

    all_audio = []
    # 0.4 seconds of silence at the 24000 Hz sample rate used for the output.
    silence = np.zeros(int(24000 * 0.4), dtype=np.float32)

    for i, chunk in enumerate(chunks, 1):
        print(f" [{i}/{len(chunks)}] {chunk[:60]}{'...' if len(chunk) > 60 else ''}")
        try:
            for result in pipeline(chunk, voice=voice, speed=1.0):
                samples = result.audio
                if samples is None:
                    continue
                # result.audio is a PyTorch tensor; convert it to a NumPy array.
                if hasattr(samples, 'numpy'):
                    samples = samples.numpy()
                samples = np.atleast_1d(np.array(samples, dtype=np.float32))
                if samples.size > 0:
                    all_audio.append(samples)
            # Gap between chunks for natural pacing.
            all_audio.append(silence)
        except Exception as e:
            print(f" Warning: chunk {i} failed ({e}), skipping.")

    if not all_audio:
        print("Error: no audio generated.")
        sys.exit(1)

    combined = np.concatenate(all_audio)
    # Always write a WAV first; ffmpeg converts it to MP3 if available.
    wav_path = output_path.rsplit('.', 1)[0] + ".wav"
    sf.write(wav_path, combined, 24000)

    try:
        subprocess.run(
            ["ffmpeg", "-y", "-i", wav_path, "-codec:a", "libmp3lame",
             "-qscale:a", "2", output_path],
            check=True, capture_output=True
        )
        os.remove(wav_path)
        print(f"\n✓ Saved: {output_path}")
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Covers both a missing ffmpeg binary and a failed conversion.
        print(f"\n✓ Saved: {wav_path} (ffmpeg missing or failed, saved as WAV)")


def main():
    parser = argparse.ArgumentParser(
        description="Convert text or transcript files to MP3 using Kokoro TTS"
    )
    parser.add_argument("text", nargs="?", help="Text to convert (wrap in quotes)")
    parser.add_argument("--file", "-f", help="Path to a .txt transcript file")
    parser.add_argument("--voice", "-v", default="af_heart",
                        help="Voice to use (default: af_heart). Run --list-voices to see all options.")
    parser.add_argument("--output", "-o", default=None,
                        help="Output file path (default: matches input filename with .mp3 extension)")
    parser.add_argument("--list-voices", action="store_true",
                        help="List available voices and exit")
    args = parser.parse_args()

    if args.list_voices:
        print("Available voices:")
        for v in AVAILABLE_VOICES:
            print(f" {v}")
        sys.exit(0)

    if args.file:
        if not os.path.exists(args.file):
            print(f"Error: file not found: {args.file}")
            sys.exit(1)
        with open(args.file, "r", encoding="utf-8") as f:
            text = f.read()
        if args.output is None:
            # Default: same folder and name as the input, with .mp3 extension.
            base = os.path.splitext(os.path.abspath(args.file))[0]
            args.output = f"{base}.mp3"
    elif args.text:
        text = args.text
        if args.output is None:
            args.output = "output.mp3"
    else:
        print("Error: provide text as an argument or use --file. Use -h for help.")
        sys.exit(1)

    install_dependencies()
    text_to_mp3(text, args.output, args.voice)


if __name__ == "__main__":
    main()
Usage Examples
Convert a quick string:
python tts_converter.py "Hello, this is a test of the Kokoro text to speech system."
Convert a text file (output named automatically):
python tts_converter.py --file my_notes.txt
# Saves as my_notes.mp3
Choose a specific voice:
python tts_converter.py --file article.txt --voice am_fenrir
See all available voices:
python tts_converter.py --list-voices
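The automatic output naming used with --file can be reproduced in a few lines. This sketch mirrors the logic in main(); the helper name default_output_path is introduced here for illustration and is not part of the script.

```python
import os


def default_output_path(input_path: str) -> str:
    """Swap the input file's extension for .mp3, keeping its folder."""
    base = os.path.splitext(os.path.abspath(input_path))[0]
    return f"{base}.mp3"


# Prints the absolute path of my_notes.mp3 in the current directory.
print(default_output_path("my_notes.txt"))
```

Because the path is made absolute first, the MP3 lands next to the input file regardless of where you run the command from.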
Available Voices
Kokoro includes over 50 voices. Here is a selection of the English options; run --list-voices for the full list the script knows about:
| Code | Description |
|---|---|
| af_heart | American Female, warm (default) |
| af_bella, af_nova, af_sky, af_sarah | American Female variants |
| am_adam, am_fenrir, am_onyx, am_puck | American Male variants |
| bf_alice, bf_emma, bf_lily | British Female |
| bm_daniel, bm_george, bm_lewis | British Male |
Non-English voices are also available for Spanish, French, Hindi, Italian, Japanese, Portuguese, and Mandarin.
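The voice codes follow a visible pattern: the first letter marks the accent and the second the speaker's gender. The sketch below decodes that prefix. It is an illustration based on the English voices listed here, not an official Kokoro API, and the idea of passing the first letter as lang_code (for example 'b' for British voices instead of the script's fixed 'a') is an assumption you should verify against the Kokoro documentation.

```python
# Assumption: prefix letters inferred from the English voice list above.
ACCENTS = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice: str) -> dict:
    """Split a voice code such as 'bm_george' into its parts."""
    prefix, _, name = voice.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice code: {voice!r}")
    return {
        "lang_code": prefix[0],  # candidate value for KPipeline(lang_code=...)
        "accent": ACCENTS.get(prefix[0], "unknown"),
        "gender": GENDERS.get(prefix[1], "unknown"),
        "name": name,
    }


print(describe_voice("bm_george"))
```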
How the Script Works
Text Cleaning
The clean_text() function strips markdown formatting, URLs, and bullet point characters before passing text to the model. This means you can paste raw markdown or Claude conversation exports directly into a .txt file and the output will sound natural.
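To see what the cleaning step does in practice, the sketch below repeats clean_text() from the script and runs it on a made-up markdown snippet. The sample text is invented for this example.

```python
import re


def clean_text(text):
    text = re.sub(r'\*{1,3}(.+?)\*{1,3}', r'\1', text)            # bold/italic
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)     # headings
    text = re.sub(r'https?://\S+', '', text)                       # URLs
    text = re.sub(r'^\s*[-*•]\s+', '', text, flags=re.MULTILINE)   # bullets
    text = re.sub(r'\n{3,}', '\n\n', text)                         # extra blank lines
    return text.strip()


sample = (
    "# Weekly Notes\n\n"
    "**Important:** ship the release.\n"
    "- Fix the login bug\n"
    "- Update docs at https://example.com/docs\n"
)

print(clean_text(sample))
```

The heading marker, the asterisks, the bullet dashes, and the URL are all removed, while the words themselves are kept. Note that removing a URL can leave a dangling phrase ("Update docs at"), so links in the middle of a sentence may still sound slightly off.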
Chunking
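Kokoro works best on shorter passages. The chunk_text() function splits text at sentence boundaries and packs whole sentences into chunks of up to roughly 500 characters; a single sentence longer than that is kept intact as its own chunk. The script then adds a 0.4-second silence gap after each chunk for natural pacing.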
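The packing behaviour is easier to follow with a tiny limit. The sketch below repeats chunk_text() from the script and runs it with max_chars=15 on a made-up string, so the chunk boundaries are visible at a glance.

```python
import re


def chunk_text(text, max_chars=500):
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) < max_chars:
            current += (" " if current else "") + sentence
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = "One. Two two. Three three three. Four!"
for chunk in chunk_text(demo, max_chars=15):
    print(len(chunk), repr(chunk))
```

The first two sentences fit together, the third sentence is longer than the limit and is emitted whole, and the last sentence starts a new chunk. In other words, max_chars is a target rather than a hard cap.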
The result.audio Fix
One gotcha worth documenting: Kokoro 0.9.4 returns a Result object from its pipeline generator rather than a raw tuple. The audio data lives at result.audio as a PyTorch tensor. Earlier versions of this script tried to unpack the result as a tuple, which caused a "could not convert string to float" error. The fix is straightforward:
for result in pipeline(chunk, voice=voice, speed=1.0):
    samples = result.audio
    if hasattr(samples, 'numpy'):
        samples = samples.numpy()
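If you want the script to tolerate either shape of pipeline output, a small helper can hide the difference. This is a sketch only: extract_audio is a hypothetical name, and the tuple layout (text, phonemes, audio) is an assumption about older Kokoro versions rather than something confirmed here. The stand-in objects below exist purely to exercise the helper without loading the model.

```python
from types import SimpleNamespace


def extract_audio(item):
    """Return the audio payload from one item yielded by the pipeline."""
    # Newer shape: an object exposing the audio as an attribute.
    if hasattr(item, "audio"):
        return item.audio
    # Assumed older shape: a tuple whose last element is the audio.
    if isinstance(item, tuple) and item:
        return item[-1]
    raise TypeError(f"unrecognised pipeline item: {type(item).__name__}")


# Stand-ins for real pipeline output.
new_style = SimpleNamespace(graphemes="Hi.", phonemes="hI", audio=[0.0, 0.1])
old_style = ("Hi.", "hI", [0.0, 0.1])

print(extract_audio(new_style) == extract_audio(old_style))
```

Raising a TypeError for anything else makes an unexpected return type fail loudly instead of surfacing later as a confusing conversion error.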
First Run
On first run, Kokoro downloads two files:
- kokoro-v1_0.pth — the main model weights (~327 MB)
- Voice file — one per voice (~500 KB each), downloaded on demand
Both are cached locally after the first download. Subsequent runs are fast.
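To see how much disk space the cached files take, you can sum the file sizes in the cache folder. The sketch below assumes the downloads land in the Hugging Face Hub cache (by default ~/.cache/huggingface/hub, or under HF_HOME if that variable is set); if your setup stores them elsewhere, point the function at that folder instead.

```python
import os
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.is_dir():
        return 0
    # Skip symlinks so files that are linked from several places count once.
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


# Assumed cache location; adjust if your downloads are stored elsewhere.
hf_home = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
cache_dir = hf_home / "hub"

size_mb = directory_size_bytes(cache_dir) / (1024 * 1024)
print(f"{cache_dir}: {size_mb:.1f} MB")
```

The total covers everything in that cache, not just Kokoro, so expect a larger number if you use other models.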
Privacy
Everything runs locally. No API calls, no cloud processing, no data transmitted anywhere. Your text stays on your machine.
GitHub
The full source code is available on GitHub: https://github.com/lowellniles/tts-converter
Pull requests welcome. If you find a bug or want to add a feature — batch directory processing, a GUI, audio speed control, or additional language support — feel free to open an issue.
What’s Next
Some ideas for extending this project:
- Batch mode: convert an entire folder of .txt files in one command
- Speed control: Kokoro supports a speed parameter; expose it as a CLI flag
- GUI wrapper: a simple drag-and-drop interface using Tkinter or a web UI
- Podcast mode: chain multiple files with intro/outro music
- Direct clipboard input: pipe in text from whatever you are reading
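As a starting point for the first two ideas, here is a rough sketch of the planning side of a batch mode plus a speed flag. The names plan_batch, build_parser, --dir, and --speed are hypothetical and not part of the current script. The sketch only works out which files would be converted; wiring it up would mean calling text_to_mp3() for each pair and passing the speed value through to the pipeline call in place of the hard-coded 1.0.

```python
import argparse
from pathlib import Path


def plan_batch(folder: str) -> list[tuple[Path, Path]]:
    """Pair every .txt file in folder with the .mp3 path it would produce."""
    sources = sorted(Path(folder).glob("*.txt"))
    return [(src, src.with_suffix(".mp3")) for src in sources]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Batch text-to-speech (sketch)")
    parser.add_argument("--dir", "-d", required=True,
                        help="Folder containing .txt files")
    parser.add_argument("--speed", "-s", type=float, default=1.0,
                        help="Speech speed multiplier (default: 1.0)")
    return parser


# Example invocation with explicit arguments instead of sys.argv.
args = build_parser().parse_args(["--dir", "notes", "--speed", "1.2"])
for src, dst in plan_batch(args.dir):
    print(f"{src} -> {dst} at speed {args.speed}")
```

Sorting the file list keeps the processing order predictable, which helps if you later chain the results together for the podcast idea.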
Built with Kokoro TTS by hexgrad. Running on Python 3.11, WSL2, Ubuntu.
