Turning your written content into a long-form audiobook is a practical way to repurpose it and reach a broader audience. The Azure Speech Service makes it surprisingly straightforward to convert long text into audio. In this article, we will walk through the process step by step, with Python code.
Setting up Azure Speech Service
Before diving into the code, let’s first set up the Azure Speech Service:
- Azure Subscription: If you don’t have one, you can create a free Azure subscription.
- Create a Speech Resource: Go to the Azure portal and create a new Speech resource.
- Get Your Speech Resource Key and Region: After deploying your Speech resource, click on ‘Go to resource’. This will allow you to view and manage your keys. If you need additional details about Azure AI service resources, refer to the official documentation.
With these steps completed, you are now ready to integrate the Azure Speech Service into your Python code.
Prerequisites
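To work with the code, install the following packages. In a Jupyter notebook, you can run the commands below. Note that pydub relies on ffmpeg (or libav) being installed on your system to read and write mp3 files: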
%pip install azure-cognitiveservices-speech
%pip install markdown
%pip install beautifulsoup4
%pip install pydub
Converting Markdown to Plain Text
First, let’s convert our markdown content to plain text. This ensures the synthesized voice doesn’t read out markdown syntax such as hash symbols or bullets.
import markdown
from bs4 import BeautifulSoup

# Read the markdown source
with open('./bronze/target.md', 'r') as f:
    contents = f.read()

# Render the markdown to HTML, then strip the tags to get plain text
html = markdown.markdown(contents)
output = BeautifulSoup(html, 'html.parser').get_text()

# Write the plain text to a new file
with open('./bronze/output.txt', 'w') as f:
    f.write(output)
Splitting the Text
The Azure Speech Service can be a bit finicky with longer content. To avoid any issues, we’ll break the text into smaller files of 25 lines each, so that each file can be synthesized on its own.
lines_per_file = 25
smallfile = None
file_index = 0
with open('./bronze/output.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        # Start a new numbered file every lines_per_file lines
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            file_index += 1
            small_filename = './silver/{0:0>3}file.txt'.format(file_index)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()
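Splitting by line count works, but line lengths vary, so 25 lines can be a little text or a lot. As a minimal sketch of an alternative, the helper below groups lines by a character budget instead. The name `chunk_lines` and the default of 3000 characters are assumptions chosen for illustration, not limits documented by the service.

```python
def chunk_lines(lines, max_chars=3000):
    """Group lines into chunks whose combined length stays within max_chars.

    A single line longer than max_chars is kept whole as its own chunk.
    max_chars is an illustrative budget, not a documented service limit.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        # Flush the current chunk if adding this line would exceed the budget
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks
```

Each returned chunk could then be written to its own numbered file in silver, just as the loop above does.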
Setting Up the Azure Speech SDK
Before using the Azure Speech SDK, ensure it’s installed. The import below checks for it and, if the SDK is missing, prints an error message and exits.
try:
    import azure.cognitiveservices.speech as speechsdk
except ImportError:
    print("""
    Importing the Speech SDK for Python failed.
    Refer to
    https://docs.microsoft.com/azure/cognitive-services/speech-service/quickstart-text-to-speech-python for
    installation instructions.
    """)
    import sys
    sys.exit(1)
Make sure to add your Azure Speech key and service region:
# Replace with your own subscription key and service region (e.g., "westus").
speech_key, service_region = "YOUR_SPEECH_KEY", "YOUR_SERVICE_REGION"
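Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to read both values from environment variables. The variable names `SPEECH_KEY` and `SPEECH_REGION` and the helper `load_speech_settings` are assumptions for this example; use whatever names fit your setup.

```python
import os

def load_speech_settings(env=os.environ):
    """Return (key, region) read from environment variables.

    SPEECH_KEY and SPEECH_REGION are assumed names for this sketch.
    """
    key = env.get("SPEECH_KEY")
    region = env.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set SPEECH_KEY and SPEECH_REGION before running.")
    return key, region

# Usage, once both variables are set in your environment:
# speech_key, service_region = load_speech_settings()
```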
Text to Audio Conversion
Now, for the fun part. We’ll convert the small text files into individual audio segments.
from pydub import AudioSegment

def speech_synthesis_to_mp3_file(small_text_file):
    """Synthesizes the text in small_text_file and saves it as an mp3 in ./gold"""
    # Set up the Azure Speech SDK. Request raw 16 kHz, 16-bit mono PCM so the
    # returned bytes match the AudioSegment parameters below; pydub encodes the mp3.
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    speech_config.set_speech_synthesis_output_format(speechsdk.SpeechSynthesisOutputFormat.Raw16Khz16BitMonoPcm)
    speech_config.speech_synthesis_voice_name = "en-US-NancyNeural"
    # audio_config=None keeps the audio in memory instead of playing it on the speaker
    speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
    # Initialize an empty AudioSegment object for accumulating audio data
    combined_audio = AudioSegment.empty()
    # Read the text file and synthesize it one sentence at a time
    with open(small_text_file) as f:
        text = f.read()
    for sentence in text.split(". "):
        sentence = sentence.strip()
        if not sentence:
            continue
        # split() dropped the ". " separator, so restore the period where needed
        if sentence[-1] not in ".!?":
            sentence += "."
        result = speech_synthesizer.speak_text_async(sentence).get()
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            audio_chunk = AudioSegment(
                data=bytes(result.audio_data),
                sample_width=2,     # 16-bit samples
                frame_rate=16000,   # 16 kHz
                channels=1          # mono
            )
            combined_audio += audio_chunk
        elif result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = result.cancellation_details
            print(f"Speech synthesis canceled: {cancellation_details.reason}")
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                print(f"Error details: {cancellation_details.error_details}")
    # Name the output after the digits in the input filename (e.g. 001file.txt -> 001.mp3)
    file_number = "".join(ch for ch in small_text_file if ch.isdigit())
    file_name = f"./gold/{file_number}.mp3"
    # Save the accumulated audio data to an mp3 file
    combined_audio.export(file_name, format="mp3")
    print(f"Speech synthesized and the audio was saved to {file_name}")
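Splitting on ". " only catches sentences that end with a period and discards the separator in the process. As a rough sketch of an alternative, the hypothetical helper `split_sentences` below uses a regular expression that also splits after "!" and "?" and keeps the punctuation attached. It is still naive: abbreviations such as "e.g. " will be treated as sentence ends.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings, which appear when the input is empty or whitespace only
    return [part for part in parts if part]
```

Each returned sentence could be passed to speak_text_async as-is, with no need to re-append a period.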
Next, let’s iterate through each of these files and convert them:
import os

directory = 'silver'
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    size = os.stat(f).st_size
    # Skip subdirectories and empty files
    if os.path.isfile(f) and size != 0:
        speech_synthesis_to_mp3_file(f)
Combining Audio Segments
To make listening easier, we’ll combine all these segments into one audio file:
from pydub import AudioSegment
import os

combined_audio = AudioSegment.empty()
file_list = os.listdir('./gold')
# Zero-padded names (001.mp3, 002.mp3, ...) sort into reading order
sorted_list = sorted(file_list)
for filename in sorted_list:
    audiofile = AudioSegment.from_mp3('./gold/{f}'.format(f=filename))
    combined_audio += audiofile
combined_audio.export('./output/final.mp3', format='mp3')
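A plain sorted() call orders filenames as strings, which matches reading order only while every number has the same width. With three-digit padding, a very long text that produces more than 999 segments would sort 1000.mp3 before 999.mp3. A small sketch of a numeric sort key follows; `numeric_key` is a hypothetical helper, not part of pydub or the Speech SDK.

```python
import os
import re

def numeric_key(filename):
    """Return the integer formed by the digits in the filename, ignoring the extension."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    digits = re.sub(r"\D", "", stem)
    # Files without any digits sort first
    return int(digits) if digits else -1

# Usage: sorted_list = sorted(file_list, key=numeric_key)
```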
Cleanup
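Lastly, it’s good practice to clean up intermediate files to keep your workspace tidy. Note that this loop deletes everything in bronze as well, including the original target.md, so keep a copy of your source elsewhere: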
import os
import glob

directories = ['./gold', './silver', './bronze']
for directory in directories:
    files = glob.glob(directory + '/*')
    for file in files:
        os.remove(file)
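The loop above removes every file in bronze, which is also where target.md lives. If you would rather keep the source, one option is to delete only files that match the generated extensions. The sketch below assumes the intermediate files are the .txt and .mp3 files; `remove_generated` is a hypothetical helper name.

```python
import glob
import os

def remove_generated(directory, patterns=("*.txt", "*.mp3")):
    """Delete files in directory matching the given patterns and return their paths."""
    removed = []
    for pattern in patterns:
        for path in glob.glob(os.path.join(directory, pattern)):
            if os.path.isfile(path):
                os.remove(path)
                removed.append(path)
    return removed

# Usage:
# for directory in ['./gold', './silver', './bronze']:
#     remove_generated(directory)
```

With these patterns, target.md is left in place while output.txt and the audio segments are removed.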
Wrapping Up
This process works well for turning longer documents or blog posts into audio content. Whether you’re going for a walk, a run, or just resting your eyes, you can now listen to your content on the go.
Remember, the directories used (bronze, silver, gold, and output) are essential for the project’s structure, so ensure they exist in your workspace before running the code.
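If you prefer not to create the folders by hand, a short sketch like the following can do it for you. The helper name `ensure_directories` and its `base` parameter are assumptions for this example; the folder names are the ones used throughout the article.

```python
import os

def ensure_directories(base=".", names=("bronze", "silver", "gold", "output")):
    """Create the working directories under base if they do not exist yet."""
    paths = [os.path.join(base, name) for name in names]
    for path in paths:
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
    return paths

# Usage: ensure_directories()
```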
Happy listening!