Create a Long-Form Audiobook with Azure Speech Service

Turning your written content into a long-form audiobook is a good way to repurpose it and reach a broader audience. With the Azure Speech Service, converting long text into audio is surprisingly straightforward. This article walks through the process step by step, with Python code.

Setting up Azure Speech Service

Before diving into the code, let’s first set up the Azure Speech Service:

Azure Subscription: If you don’t have one, you can create a free Azure subscription.
Create a Speech Resource: Go to the Azure portal and create a new Speech resource.
Get Your Speech Resource Key and Region: After deploying your Speech resource, click on ‘Go to resource’ to view and manage your keys. The key and region used later in the code are shown on the resource’s Keys and Endpoint page. If you need additional details about Azure AI service resources, refer to the official documentation.

With these steps completed, you are now ready to integrate the Azure Speech Service into your Python code.

Prerequisites

To work with the code, ensure you have the following packages installed. You can do this in a Jupyter notebook:

%pip install azure-cognitiveservices-speech
%pip install markdown
%pip install beautifulsoup4
%pip install pydub
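
One thing pip does not install: pydub relies on an ffmpeg (or libav) binary on the PATH to read and write MP3 files. As a rough sanity check before running the rest of the notebook, something like the following could report what is missing. The function name `missing_requirements` is made up for this sketch, and the module list is simply the import names of the packages above.

```python
import importlib.util
import shutil

# Import names of the packages installed above (bs4 is the import name of beautifulsoup4)
REQUIRED_MODULES = ["azure.cognitiveservices.speech", "markdown", "bs4", "pydub"]


def missing_requirements(modules=REQUIRED_MODULES, binaries=("ffmpeg",)):
    """Return the names of modules and command-line tools that cannot be found."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is not installed
            found = False
        if not found:
            missing.append(name)
    for binary in binaries:
        # shutil.which returns None when the executable is not on the PATH
        if shutil.which(binary) is None:
            missing.append(binary)
    return missing
```

Calling `missing_requirements()` with no arguments returns an empty list when everything is in place.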

Converting Markdown to Plain Text

First, let’s convert our markdown content to plain text, so that the synthesized voice does not read out markdown syntax such as hash symbols or bullet markers.

import markdown
from bs4 import BeautifulSoup

# Open the markdown file
with open('./bronze/target.md', 'r') as f:
    # Read the content of the file
    contents = f.read()

# Render the markdown to HTML, then strip the tags to get plain text
html = markdown.markdown(contents)
output = BeautifulSoup(html, 'html.parser').get_text()

# Write the plain text to a new file
with open('./bronze/output.txt', 'w') as f:
    f.write(output)

Splitting the Text

The Azure Speech Service can be a bit finicky with longer content. To avoid any issues, we’ll break the text into smaller files of 25 lines each, which are then synthesized one at a time.

lines_per_file = 25
smallfile = None
file_index = 0
with open('./bronze/output.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        # Start a new chunk file every `lines_per_file` lines
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            file_index += 1
            # Zero-padded names (001file.txt, 002file.txt, ...) sort in reading order
            small_filename = './silver/{0:0>3}file.txt'.format(file_index)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    # Close the last chunk file
    if smallfile:
        smallfile.close()
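
Splitting on a fixed line count can produce very uneven chunks when paragraphs vary in length. As an alternative sketch, the lines could be grouped by a character budget instead, still breaking only between lines. The function name `chunk_lines` and the 3000-character default are assumptions chosen for illustration, not limits taken from the Azure documentation.

```python
def chunk_lines(lines, max_chars=3000):
    """Group consecutive lines into chunks of at most max_chars characters.

    A single line longer than max_chars becomes its own chunk rather than being cut.
    """
    chunks = []
    current = []
    current_len = 0
    for line in lines:
        # Flush the current chunk if adding this line would exceed the budget
        if current and current_len + len(line) > max_chars:
            chunks.append("".join(current))
            current = []
            current_len = 0
        current.append(line)
        current_len += len(line)
    # Keep whatever is left over as the final chunk
    if current:
        chunks.append("".join(current))
    return chunks
```

Each returned chunk could then be written to the silver directory with the same zero-padded naming as above.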

Setting Up the Azure Speech SDK

Next, import the Azure Speech SDK. If the import fails, the code prints a pointer to the installation instructions and exits.

try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:
    print("""
    Importing the Speech SDK for Python failed.
    Refer to
    https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstart-text-to-speech-python for
    installation instructions.
    """)
    import sys
    sys.exit(1)

Make sure to add your Azure Speech key and service region:

# Replace with your own subscription key and service region (e.g., "westus").
speech_key, service_region = "YOUR_SPEECH_KEY", "YOUR_SERVICE_REGION"
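
Hard-coding a key in a notebook makes it easy to leak by accident. One common alternative is to read both values from environment variables. The variable names `SPEECH_KEY` and `SPEECH_REGION` and the helper `load_speech_settings` are assumptions made for this sketch; use whatever names suit your setup.

```python
import os


def load_speech_settings(env=os.environ):
    """Return (speech_key, service_region) read from environment variables."""
    try:
        return env["SPEECH_KEY"], env["SPEECH_REGION"]
    except KeyError as missing:
        # Report which variable is absent instead of failing later with an auth error
        raise RuntimeError(f"Environment variable {missing} is not set") from None
```

With the variables set, `speech_key, service_region = load_speech_settings()` would replace the hard-coded assignment.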

Text to Audio Conversion

Now, for the fun part. We’ll convert the small text files into individual audio segments.

from pydub import AudioSegment

def speech_synthesis_to_mp3_file(small_text_file):
    """Performs speech synthesis to an mp3 file"""
    # Setup for Azure Speech SDK
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    # Request raw 16 kHz, 16-bit mono PCM so the bytes match the AudioSegment parameters used below
    speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Raw16Khz16BitMonoPcm)
    speech_config.speech_synthesis_voice_name = "en-US-NancyNeural"

    # Initialize a speech synthesizer object; audio_config=None keeps the audio
    # in memory instead of playing it through the default speaker
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)

    # Initialize an empty AudioSegment object for accumulating audio data
    combined_audio = AudioSegment.empty()

    # Read the text file and split it into sentences, skipping empty fragments
    with open(small_text_file) as f:
        text = f.read()
    sentences = [s.strip() for s in text.split(". ") if s.strip()]

    for sentence in sentences:
        # split() dropped the period, so restore it unless the sentence already ends with punctuation
        if sentence[-1] not in ".!?":
            sentence += "."
        result = speech_synthesizer.speak_text_async(sentence).get()

        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            # Wrap the raw PCM bytes in an AudioSegment and append them
            audio_chunk = AudioSegment(
                data=bytes(result.audio_data),
                sample_width=2,
                frame_rate=16000,
                channels=1
            )
            combined_audio += audio_chunk

        elif result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = result.cancellation_details
            print(f"Speech synthesis canceled: {cancellation_details.reason}")
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print(f"Error details: {cancellation_details.error_details}")

    # Generate the output filename from the digits in the input path (e.g. 001file.txt -> 001.mp3)
    to_filter = small_text_file
    to_filter = "".join(_ for _ in to_filter if _ in "1234567890")
    file_name = f"./gold/{to_filter}.mp3"

    # Save the accumulated audio data to an mp3 file
    combined_audio.export(file_name, format="mp3")
    print(f"Speech synthesized and the audio was saved to {file_name}")
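
The function above splits sentences on ". ", which does not break after question marks or exclamation marks. A regex-based splitter is one possible refinement. The name `split_sentences` is hypothetical, and the sketch is deliberately naive: abbreviations such as "Dr. Smith" would still be split in the wrong place.

```python
import re

# Split on whitespace that follows ., ! or ?; the punctuation stays with its sentence
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences in text, each keeping its closing punctuation."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]
```

Because each sentence keeps its own punctuation, there would be no need to append a period before passing it to the synthesizer.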

Next, let’s iterate through each of these files and convert them:

import os
directory = 'silver'

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    size = os.stat(f).st_size
    if (os.path.isfile(f)) and (size!=0):
        speech_synthesis_to_mp3_file(f)

Combining Audio Segments

To make listening easier, we’ll combine all these segments into one audio file:

from pydub import AudioSegment
import os

combined_audio = AudioSegment.empty()

file_list = os.listdir('./gold')
sorted_list = sorted(file_list)

for filename in sorted_list:
    audiofile = AudioSegment.from_mp3('./gold/{f}'.format(f=filename))
    combined_audio += audiofile

combined_audio.export('./output/final.mp3', format='mp3')
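
The plain `sorted` call works here because the chunk names are zero-padded to three digits. With more than 999 chunks, or with unpadded names, string ordering would put 1000.mp3 before 999.mp3. A numeric sort key avoids that; `numeric_key` and `sort_numerically` are illustrative names, not part of pydub or the Speech SDK.

```python
import os
import re


def numeric_key(filename):
    """Sort key: the first run of digits in the file's stem, as an integer.

    The extension is ignored so the "3" in ".mp3" is never picked up.
    Names without digits get -1 and therefore sort first.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    match = re.search(r"\d+", stem)
    return int(match.group()) if match else -1


def sort_numerically(filenames):
    """Return filenames ordered by the number embedded in each name."""
    return sorted(filenames, key=numeric_key)
```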

Cleanup

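Lastly, it’s good practice to clean up intermediate files to keep your workspace tidy. Be aware that the loop below empties bronze as well, which includes the original target.md, so keep a copy of your source file elsewhere: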

import os
import glob

directories = ['./gold', './silver', './bronze']

for directory in directories:
    files = glob.glob(directory + '/*')
    for file in files:
        os.remove(file)
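
A more cautious variant could leave selected files in place, such as the source target.md in bronze. The helper `clean_directories` and its `keep` parameter are hypothetical names for this sketch, and unlike the glob pattern above it also considers hidden files.

```python
import os


def clean_directories(directories, keep=()):
    """Delete regular files in each directory, except names listed in keep.

    Returns the list of removed paths. Subdirectories are left untouched.
    """
    removed = []
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isfile(path) and name not in keep:
                os.remove(path)
                removed.append(path)
    return removed
```

For example, `clean_directories(['./gold', './silver', './bronze'], keep=('target.md',))` would clear the intermediate files while preserving the source document.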

Wrapping Up

This process works well for turning longer documents or even blog posts into audio content. Whether you’re going for a walk, a run, or just resting your eyes, you can now listen to your content on the go.

Remember, the code does not create the directories it uses (bronze, silver, gold, and output), so ensure they exist in your workspace before running it.
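
The directories can also be created from Python before the first step runs. This is a minimal sketch; the helper name `ensure_directories` and the `base` parameter are assumptions, while the four directory names come from the code in this article.

```python
import os

PIPELINE_DIRS = ["bronze", "silver", "gold", "output"]


def ensure_directories(base=".", names=PIPELINE_DIRS):
    """Create each directory under base if it does not already exist; return the paths."""
    paths = [os.path.join(base, name) for name in names]
    for path in paths:
        # exist_ok=True makes the call safe to repeat on every run
        os.makedirs(path, exist_ok=True)
    return paths
```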

Happy listening!

Published by Shawn Deggans
