How to Convert Any Text or Document to MP3 Audio Using Kokoro TTS and Python (Free, Local, and PRIVATE)

Lowell
05.14 2026
Python, Projects

How to Convert Text to Audio for Free Using Python and Kokoro TTS

Learn how to build a free, local text-to-speech converter in Python using Kokoro TTS. Convert any document, script, or article to a natural-sounding MP3 — perfect for listening while driving, exercising, or doing chores.

Listen to Anything, Anywhere

Have a long article you never have time to read? A research document, blog post, meeting notes, or a script you want to hear out loud? What if you could convert any text file to a natural-sounding MP3 and listen to it while driving, exercising, or doing chores?

That’s exactly what this project does. With a single Python script and a free, open-source text-to-speech model called Kokoro, you can turn any .txt file into a high-quality audio file in minutes — entirely on your own machine, with no API keys, no subscriptions, and no data leaving your computer.

What Is Kokoro TTS?

Kokoro is a lightweight, open-source text-to-speech model with 82 million parameters. Despite its small size, it produces remarkably natural-sounding voices — far better than older TTS systems — and runs efficiently on a standard CPU without requiring a GPU.

It supports over 50 voices across multiple accents and languages including American English, British English, Spanish, French, Japanese, Mandarin, and more. The model weights download automatically on first use and are cached locally after that.

Why Build This Instead of Using a Cloud Service?

Most cloud-based TTS tools send your text to a third-party server. That means your documents, notes, or scripts are processed on someone else’s infrastructure. For anyone handling sensitive content — business strategies, personal notes, client materials — that is worth thinking carefully about.

This project runs entirely locally. Nothing leaves your machine. And once the model is downloaded, it works completely offline.

The Tech Stack

Python 3.11 — the runtime
Kokoro 0.9.4 — the TTS model
soundfile — for writing audio output
numpy — for audio array handling
ffmpeg — for converting WAV to MP3 (optional but recommended)
WSL2 on Windows 11 — the Linux environment used during development

Setting Up the Environment

Step 1: Install WSL2 (Windows users)

If you are on Windows, the easiest way to run this project is through WSL2 (Windows Subsystem for Linux). Open Command Prompt or PowerShell as Administrator and run:

wsl --install

Restart when prompted. Ubuntu will be installed by default.

Step 2: Install Python 3.11

Ubuntu’s default repositories may include a very recent Python version (3.14 at time of writing) that is not yet compatible with all of Kokoro’s dependencies. Install Python 3.11 explicitly via the deadsnakes PPA:

sudo apt install software-properties-common -y
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt update
sudo apt install python3.11 python3.11-venv -y

Step 3: Create a Virtual Environment

python3.11 -m venv ~/tts-env
source ~/tts-env/bin/activate

You will see (tts-env) at the start of your prompt. Remember to run source ~/tts-env/bin/activate each time you open a new Ubuntu session.

Step 4: Install Dependencies

pip install kokoro soundfile numpy

Step 5: Install ffmpeg (Optional)

ffmpeg converts the WAV output to MP3. Without it, the script saves a WAV file instead — which is larger but still perfectly usable.

sudo apt install ffmpeg -y
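
Before moving on, it can help to confirm that the pieces from Steps 2 through 5 are visible from inside the virtual environment. The following is a minimal sketch and not part of the original script: check_environment is a hypothetical helper name, and it only checks that the modules and the ffmpeg executable can be found, not that they actually work.

```python
import importlib.util
import shutil
import sys


def check_environment(modules=("kokoro", "soundfile", "numpy"), tools=("ffmpeg",)):
    """Report which modules and command-line tools are visible to this interpreter."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "modules": {},
        "tools": {},
    }
    for name in modules:
        # find_spec locates a top-level module without importing it
        report["modules"][name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH
        report["tools"][name] = shutil.which(name) is not None
    return report


report = check_environment()
print(f"Python {report['python']}")
for kind in ("modules", "tools"):
    for name, found in report[kind].items():
        print(f"  {name}: {'found' if found else 'MISSING'}")
```

Run it with the virtual environment activated. A missing ffmpeg is not fatal, since the script falls back to WAV output.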

Why Python 3.11 Specifically?

During development, installing Kokoro on Python 3.14 failed with a build error deep in the blis dependency chain:

gcc: error: unrecognized command-line option '-mavx512pf'

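The failure comes from the spacy library (which Kokoro depends on for English phonemization) and its blis dependency. The likely cause is that no prebuilt blis package was available for the newest Python version, so pip tried to compile it from source, and the build passed a compiler flag that the installed GCC no longer accepts. Python 3.11 is stable, well-supported, and worked with the entire Kokoro dependency tree during development.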

The Script

Here is the full tts_converter.py script. Save it anywhere in your project directory.

#!/usr/bin/env python3
"""
tts_converter.py — Convert text or transcript files to MP3 using Kokoro TTS
Usage:
    python tts_converter.py "Some text to speak"
    python tts_converter.py --file transcript.txt
    python tts_converter.py --file transcript.txt --voice am_fenrir
    python tts_converter.py --file transcript.txt --voice am_fenrir --output my_audio.mp3
"""

import argparse
import sys
import os
import re
import numpy as np
import soundfile as sf
import subprocess


AVAILABLE_VOICES = [
    # American Female
    "af_heart", "af_alloy", "af_aoede", "af_bella", "af_jessica",
    "af_kore", "af_nicole", "af_nova", "af_river", "af_sarah", "af_sky",
    # American Male
    "am_adam", "am_echo", "am_eric", "am_fenrir", "am_liam",
    "am_michael", "am_onyx", "am_puck",
    # British Female
    "bf_alice", "bf_emma", "bf_isabella", "bf_lily",
    # British Male
    "bm_daniel", "bm_fable", "bm_george", "bm_lewis",
]


def install_dependencies():
    try:
        import kokoro
    except ImportError:
        print("Installing Kokoro TTS (first run only)...")
        subprocess.check_call([
            sys.executable, "-m", "pip", "install",
            "kokoro>=0.9.4", "soundfile", "numpy", "--quiet"
        ])
        print("Done.\n")


def clean_text(text):
    text = re.sub(r'\*{1,3}(.+?)\*{1,3}', r'\1', text)
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'^\s*[-*•]\s+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


def chunk_text(text, max_chars=500):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) < max_chars:
            current += (" " if current else "") + sentence
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def text_to_mp3(text, output_path, voice="af_heart"):
    from kokoro import KPipeline

    print(f"Voice: {voice}")
    print(f"Output: {output_path}")
    print(f"Characters: {len(text)}\n")

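    # The first letter of a Kokoro voice code is its language code
    # ('a' = American English, 'b' = British English), so match the pipeline to the voice.
    pipeline = KPipeline(lang_code=voice[0])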
    text = clean_text(text)
    chunks = chunk_text(text)

    print(f"Processing {len(chunks)} chunk(s)...")

    all_audio = []
    silence = np.zeros(int(24000 * 0.4), dtype=np.float32)

    for i, chunk in enumerate(chunks, 1):
        print(f"  [{i}/{len(chunks)}] {chunk[:60]}{'...' if len(chunk) > 60 else ''}")
        try:
            for result in pipeline(chunk, voice=voice, speed=1.0):
                samples = result.audio
                if samples is None:
                    continue
                if hasattr(samples, 'numpy'):
                    # Torch tensor: detach from autograd and move to CPU before converting
                    samples = samples.detach().cpu().numpy()
                samples = np.atleast_1d(np.array(samples, dtype=np.float32))
                if samples.size > 0:
                    all_audio.append(samples)
                    all_audio.append(silence)
        except Exception as e:
            print(f"  Warning: chunk {i} failed ({e}), skipping.")

    if not all_audio:
        print("Error: no audio generated.")
        sys.exit(1)

    combined = np.concatenate(all_audio)
    # splitext only strips a real file extension, so dots in directory names are left alone
    wav_path = os.path.splitext(output_path)[0] + ".wav"
    sf.write(wav_path, combined, 24000)

    try:
        subprocess.run(
            ["ffmpeg", "-y", "-i", wav_path, "-codec:a", "libmp3lame",
             "-qscale:a", "2", output_path],
            check=True, capture_output=True
        )
        os.remove(wav_path)
        print(f"\n✓ Saved: {output_path}")
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Reached when ffmpeg is missing or exits with an error; the WAV is kept either way
        print(f"\n✓ Saved: {wav_path}  (ffmpeg unavailable or failed, saved as WAV)")


def main():
    parser = argparse.ArgumentParser(
        description="Convert text or transcript files to MP3 using Kokoro TTS"
    )
    parser.add_argument("text", nargs="?", help="Text to convert (wrap in quotes)")
    parser.add_argument("--file", "-f", help="Path to a .txt transcript file")
    parser.add_argument("--voice", "-v", default="af_heart",
        help="Voice to use (default: af_heart). Run --list-voices to see all options.")
    parser.add_argument("--output", "-o", default=None,
        help="Output file path (default: matches input filename with .mp3 extension)")
    parser.add_argument("--list-voices", action="store_true",
        help="List available voices and exit")

    args = parser.parse_args()

    if args.list_voices:
        print("Available voices:")
        for v in AVAILABLE_VOICES:
            print(f"  {v}")
        sys.exit(0)

    if args.file:
        if not os.path.exists(args.file):
            print(f"Error: file not found: {args.file}")
            sys.exit(1)
        with open(args.file, "r", encoding="utf-8") as f:
            text = f.read()
        if args.output is None:
            base = os.path.splitext(os.path.abspath(args.file))[0]
            args.output = f"{base}.mp3"
    elif args.text:
        text = args.text
        if args.output is None:
            args.output = "output.mp3"
    else:
        print("Error: provide text as an argument or use --file. Use -h for help.")
        sys.exit(1)

    install_dependencies()
    text_to_mp3(text, args.output, args.voice)


if __name__ == "__main__":
    main()

Usage Examples

Convert a quick string:

python tts_converter.py "Hello, this is a test of the Kokoro text to speech system."

Convert a text file (output named automatically):

python tts_converter.py --file my_notes.txt
# Saves as my_notes.mp3

Choose a specific voice:

python tts_converter.py --file article.txt --voice am_fenrir

See all available voices:

python tts_converter.py --list-voices

Available Voices

Kokoro includes over 50 voices. Here is a selection of the English options (run the script with --list-voices for the full list it knows about):

Code	Description
`af_heart`	American Female, warm (default)
`af_bella`, `af_nova`, `af_sky`, `af_sarah`	American Female variants
`am_adam`, `am_fenrir`, `am_onyx`, `am_puck`	American Male variants
`bf_alice`, `bf_emma`, `bf_lily`	British Female
`bm_daniel`, `bm_george`, `bm_lewis`	British Male

Non-English voices are also available for Spanish, French, Hindi, Italian, Japanese, Portuguese, and Mandarin.
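
Going by the comments in the script's AVAILABLE_VOICES list, each English voice code appears to follow a two-letter prefix convention: the first letter is the accent (a for American, b for British) and the second is the gender (f or m). Here is a small sketch that decodes a code under that assumption. describe_voice is a hypothetical helper, and prefixes for non-English voices are deliberately left unmapped rather than guessed.

```python
# Prefix letters taken from the comments in AVAILABLE_VOICES (English voices only)
ACCENTS = {"a": "American English", "b": "British English"}
GENDERS = {"f": "Female", "m": "Male"}


def describe_voice(code):
    """Split a voice code such as 'am_fenrir' into (accent, gender, name)."""
    prefix, sep, name = code.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice code format: {code!r}")
    accent = ACCENTS.get(prefix[0], f"unmapped accent '{prefix[0]}'")
    gender = GENDERS.get(prefix[1], f"unmapped gender '{prefix[1]}'")
    return accent, gender, name


for code in ("af_heart", "am_fenrir", "bf_emma", "bm_george"):
    accent, gender, name = describe_voice(code)
    print(f"{code}: {accent}, {gender}, {name}")
```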

How the Script Works

Text Cleaning

The clean_text() function strips Markdown emphasis asterisks, heading markers, URLs, and bullet characters before the text reaches the model. This means you can paste raw Markdown or Claude conversation exports directly into a .txt file and the output will still sound natural.
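
To see the cleaning step in isolation, the sketch below copies clean_text() from the script and runs it on a short Markdown sample. The sample text and the example.com URL are made up for illustration.

```python
import re


def clean_text(text):
    # Same regular expressions as the script, with a note on what each one removes
    text = re.sub(r'\*{1,3}(.+?)\*{1,3}', r'\1', text)               # bold/italic asterisks
    text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)       # heading markers
    text = re.sub(r'https?://\S+', '', text)                         # URLs
    text = re.sub(r'^\s*[-*•]\s+', '', text, flags=re.MULTILINE)     # bullet characters
    text = re.sub(r'\n{3,}', '\n\n', text)                           # runs of blank lines
    return text.strip()


sample = (
    "# Weekly Notes\n\n"
    "- **Ship** the *beta* build\n"
    "- Docs: https://example.com/guide\n\n\n\n"
    "Done."
)

cleaned = clean_text(sample)
print(cleaned)
```

One detail worth knowing: because \s also matches newlines, the bullet pattern can swallow a blank line that sits directly above a bullet. That is harmless for speech output but may surprise you if you reuse the function elsewhere.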

Chunking

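Kokoro works best on shorter passages. The chunk_text() function splits text at sentence boundaries and packs whole sentences into chunks of up to roughly 500 characters; a single sentence longer than that is kept intact as its own chunk. A 0.4-second silence gap is appended after each generated audio segment for natural pacing.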
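
The packing behaviour is easier to follow with a tiny limit. This sketch copies chunk_text() from the script and runs it with max_chars=12 on a made-up string, so the grouping is visible at a glance.

```python
import re


def chunk_text(text, max_chars=500):
    # Split after ., ! or ? followed by whitespace, then pack sentences greedily
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current = ""
    for sentence in sentences:
        if len(current) + len(sentence) < max_chars:
            current += (" " if current else "") + sentence
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = "One. Two! Three? Four."
for chunk in chunk_text(demo, max_chars=12):
    print(len(chunk), repr(chunk))

# A sentence with no break point is never split, even when it exceeds the limit
long_chunks = chunk_text("a" * 50, max_chars=10)
print(len(long_chunks), len(long_chunks[0]))
```

Note that the length check does not count the joining space, so a chunk can reach the limit exactly. Text with very long unpunctuated runs would need an extra splitting rule, which the script does not currently have.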

The `result.audio` Fix

One gotcha worth documenting: in Kokoro 0.9.4 the pipeline generator yields Result objects rather than raw tuples. The audio data lives at result.audio as a PyTorch tensor. Earlier versions of this script tried to unpack the result as a tuple, which caused a "could not convert string to float" error. The fix is straightforward:

for result in pipeline(chunk, voice=voice, speed=1.0):
    samples = result.audio
    if hasattr(samples, 'numpy'):
        # Torch tensor: detach from autograd and move to CPU before converting
        samples = samples.detach().cpu().numpy()
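
The same conversion can be wrapped in a small helper so the main loop stays short. This is a sketch under assumptions, not the script's actual code: to_float32_array is a hypothetical name, and it relies on duck typing (calling detach, cpu, and numpy only when the object has them) so that it also accepts plain lists and NumPy arrays.

```python
import numpy as np


def to_float32_array(audio):
    """Convert tensor-like, list, scalar, or ndarray audio to a 1-D float32 array."""
    if audio is None:
        # Treat a missing result as empty audio so callers can simply skip it
        return np.zeros(0, dtype=np.float32)
    # Tensor-like objects: detach from autograd, move to CPU, then convert
    for method in ("detach", "cpu", "numpy"):
        if hasattr(audio, method):
            audio = getattr(audio, method)()
    # atleast_1d turns a scalar into a one-element array
    return np.atleast_1d(np.asarray(audio, dtype=np.float32))


print(to_float32_array([0.1, -0.2, 0.3]).dtype)
print(to_float32_array(None).size)
```

Inside the loop this would be used as samples = to_float32_array(result.audio), followed by the existing samples.size > 0 check.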

First Run

On first run, Kokoro downloads two files:

kokoro-v1_0.pth — the main model weights (~327 MB)
Voice file — one per voice (~500 KB each), downloaded on demand

Both are cached locally after the first download. Subsequent runs are fast.

Privacy

Apart from the one-time download of the model and voice files, everything runs locally. No API calls, no cloud processing, no data transmitted anywhere. Your text stays on your machine.

GitHub

The full source code is available on GitHub: https://github.com/lowellniles/tts-converter

Pull requests welcome. If you find a bug or want to add a feature — batch directory processing, a GUI, audio speed control, or additional language support — feel free to open an issue.

What’s Next

Some ideas for extending this project:

Batch mode — convert an entire folder of .txt files in one command
Speed control — Kokoro supports a speed parameter; expose it as a CLI flag
GUI wrapper — a simple drag-and-drop interface using Tkinter or a web UI
Podcast mode — chain multiple files with intro/outro music
Direct clipboard input — pipe in text from whatever you are reading
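
As a starting point for the batch mode idea above, here is a hedged sketch. plan_batch and run_batch are hypothetical names that do not exist in the repository, and the convert argument stands in for a call such as text_to_mp3(text, output_path, voice) from the script.

```python
from pathlib import Path


def plan_batch(folder, suffix=".txt", out_suffix=".mp3"):
    """Return (input_path, output_path) pairs for matching files, sorted by name."""
    folder = Path(folder)
    return [
        (path, path.with_suffix(out_suffix))
        for path in sorted(folder.glob(f"*{suffix}"))
        if path.is_file()
    ]


def run_batch(folder, convert, skip_existing=True):
    """Call convert(text, output_path) for each planned file; return the outputs handled."""
    done = []
    for src, dst in plan_batch(folder):
        if skip_existing and dst.exists():
            # Lets an interrupted batch be re-run without redoing finished files
            continue
        convert(src.read_text(encoding="utf-8"), str(dst))
        done.append(dst)
    return done


def print_plan(text, output_path):
    # Dry-run converter: shows what would be produced without generating audio
    print(f"{len(text)} characters -> {output_path}")


run_batch(".", print_plan)
```

To wire this into the script, pass something like lambda text, out: text_to_mp3(text, out, "af_heart") as convert. Keep in mind that text_to_mp3 calls sys.exit(1) when no audio is generated, which would stop the whole batch, so a real batch mode would probably want that changed to an exception.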

Built with Kokoro TTS by hexgrad. Running on Python 3.11, WSL2, Ubuntu.

Tags: AI, Audio, Kokoro, Open Source, Python, Text to Speech, TTS, Tutorial, Ubuntu, WSL2
