Why Synthetic Voices Make Audio Description Much Easier

The traditional audio description workflow can be expensive and time-consuming.

That’s why audio description professionals are opting for technological solutions to bring down expenditure.

One method is to have audio description transcripts read aloud by synthetic voices, but how does the technology fare in practice?

First off, what is the traditional audio description workflow?

An audio description is a spoken narration that helps people with visual or intellectual disabilities follow a film or TV show.

Gregory Frazier carried out the first concrete work on audio description in the mid-1970s.

In the following decade, audio description services started to emerge with a primary focus on cinema.

The conventional way of making audio descriptions requires different personnel for each stage in the process.

Writers create the audio description transcript.
Voice actors are hired to read out the transcript in a recording studio.
Sound engineers master the final mix.

Why is the conventional production process expensive and time-consuming?

We go into greater detail in our article exploring why audio description production is a headache for production companies.

Here’s a quick overview of our findings:

The workflow is complex.
It is difficult to say how long an audio description will take to produce.
Production costs are unpredictable and can rise quickly.

To address all these problems, production companies are turning to synthetic voices.

Okay, so what are synthetic voices?

Speech synthesis is the artificial production of human speech.

Today largely based on machine learning, the technique is used for:

speech generation
music generation
navigation systems
speech-enabled devices
text-to-speech
accessibility for people with visual impairments

The last two points are what we are particularly interested in.

What is text to speech?

Text to speech converts written text into a synthesized spoken voice.

And where does accessibility come into it?

Accessibility supports social inclusion for people with disabilities, ensuring everyone is treated the same.

Technology has a big hand in promoting accessibility.

For example, many disabled people use specialized text-to-speech software such as screen readers when online or on devices.

Screen readers read aloud the text on a website and provide important functionality, such as navigating by heading and reading out alt text for images.

Person using a braille terminal linked to a screen reader with their laptop. Image credit: Wikipedia, Sebastien.delorme, Own work, CC BY-SA 3.0

And where does audio description come into it?

Following the logic above, audio description transcripts can also be read aloud as a synthesized voice using text-to-speech technology.
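
As a rough illustration of that idea, the sketch below wraps timed description lines in SSML, the W3C markup that many text-to-speech engines accept as input. The `Cue` class, the `cues_to_ssml` helper, and the sample lines are hypothetical and are not taken from any specific product; support for individual SSML tags also varies between engines.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Cue:
    """One audio description line and the time (in seconds) it should start."""
    start: float
    text: str


def cues_to_ssml(cues: list[Cue]) -> list[tuple[float, str]]:
    """Return (start_time, ssml_document) pairs, one per cue, ordered by time.

    Each cue becomes its own small SSML document so that a later mixing
    step could place the rendered audio at the right point in the video.
    """
    result = []
    for cue in sorted(cues, key=lambda c: c.start):
        # Escape &, < and > so the transcript text cannot break the markup.
        ssml = f"<speak><s>{escape(cue.text.strip())}</s></speak>"
        result.append((cue.start, ssml))
    return result


# Hypothetical sample transcript, deliberately out of order.
transcript = [
    Cue(12.5, "A woman walks into a dimly lit room."),
    Cue(3.0, "Title card: Tom & Jerry's Bakery."),
]

for start, ssml in cues_to_ssml(transcript):
    print(f"{start:6.1f}s  {ssml}")
```

The SSML strings would then be handed to whichever text-to-speech engine is in use; that step is left out here because it depends entirely on the provider.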

Don’t you need proper voice actors for audio descriptions?

Not necessarily.

Studies show that text-to-speech audio description is generally regarded as an acceptable solution by viewers with visual impairments.

The best approach is to test audio descriptions created with synthetic voices on your audience and decide from there.

For example, German broadcaster MDR tested text-to-speech audio descriptions on blind consumers at the Louis Braille Festival in 2019 before adopting the technology.

Aren’t synthetic voices robotic-sounding, though?

No, they don’t need to be.

Synthetic voices have a reputation for sounding robotic, as many people associate the technology with the JAWS screen reader, HAL, or even German electronic act Kraftwerk.

While JAWS has its functional merits, it is not suitable for audio description.

Thankfully, the quality and range of voices have come a long way in recent years, giving the narration a natural feel.

With advancements in artificial intelligence, the technology is also improving all the time.

Take a listen for yourself – the female voice-over in this video is a text-to-speech audio description:

[Embedded YouTube video]

Sounds good! But how does synthesized speech cut audio description costs?

With text to speech, you can deliver the audio description without a voice artist or recording studio.

If there are subsequent corrections made to the transcript, you don’t need to worry about rehiring voice talent and studio space.

Using synthetic voices also saves time.

In the traditional workflow, production companies are sometimes left waiting on voice actor and studio availability.

Yet with synthesized speech, the audio description transcript can be read out as soon as the text is ready.

And this makes budgeting a lot easier, right?

Exactly.

Unpredictable variable costs become easy-to-manage fixed costs.
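
To make the budgeting point concrete, here is a minimal sketch comparing the two cost structures. Every name and figure in it (`traditional_cost`, `tts_cost`, the hourly rate, the rebooking fee, the flat fee) is a made-up assumption for illustration only, not real pricing from any studio or vendor.

```python
def traditional_cost(studio_hours: float, hourly_rate: float,
                     correction_rounds: int, rebooking_fee: float) -> float:
    """Assumed model: cost grows with studio time and with each correction round."""
    return studio_hours * hourly_rate + correction_rounds * rebooking_fee


def tts_cost(flat_fee: float, correction_rounds: int) -> float:
    """Assumed model: a flat fee, where re-rendering corrected text costs nothing extra."""
    return flat_fee


# Hypothetical numbers, chosen only to show the shape of the two models.
for rounds in range(4):
    trad = traditional_cost(studio_hours=4, hourly_rate=150,
                            correction_rounds=rounds, rebooking_fee=300)
    tts = tts_cost(flat_fee=400, correction_rounds=rounds)
    print(f"{rounds} correction round(s): traditional {trad:.0f}, "
          f"text to speech {tts:.0f}")
```

Under these assumptions the traditional figure climbs with every round of corrections, while the text-to-speech figure stays the same, which is what makes it easier to budget for.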

So you want to get rid of voice artists…

No, not at all!

Synthesized speech isn’t there to compete with professional voice artists.

Instead, synthetic voices play a supporting role in audio description creation.

The issue is that not enough productions are made with an audio description due to budget constraints.

Time is another sticking point: most companies need the audio description for their productions right away, yet for cinema it usually takes weeks for the audio description to be ready.

So it’s not a case of replacing voice artists – it’s about making audio description affordable and time-saving for projects where it would otherwise be economically unviable.

In turn, the number of productions provided with an audio description should increase, boosting the availability of accessible content as a whole.

What types of projects are best suited to text-to-speech audio descriptions?

After extensive testing, early adopters have switched to text-to-speech audio descriptions for productions of various lengths.

MDR, a broadcaster in Germany, is using synthetic voices to audio-describe 45-minute documentaries, 30-minute cultural shows, and 10-minute web series episodes.

Media firm Februar Film audio-described tutorial videos for German insurer AOK with text to speech.

Text-to-speech audio descriptions are perfect for smaller budgets and certain formats, particularly those intended for the web.

What is the best way to integrate audio descriptions with synthetic voices into videos?

VIDEO TO VOICE developed Frazier, a production suite for creating text-to-speech audio descriptions.

The transcript can be written directly into the tool; the audio description is then generated using text to speech in seconds.

The user can choose from hundreds of voices in over 40 languages for the audio description.

Here is a short narration of a pole vault attempt delivered using text-to-speech audio description:

[Embedded YouTube video]

Wow, great job! But I don’t want to download more software onto my computer.

You don’t have to!

Frazier is browser-based, so there’s no need to download anything.

You’ll be given log-in details to access the platform online, and can start creating your own text-to-speech audio descriptions right away.

What if I need to deliver my audio description in different languages?

Frazier includes a neural machine translation feature.

This automatically converts your audio description into another language.

After the translation has been generated, a post-editor can make any adjustments to the text and select the voice in the new language.

All these fancy synthetic voices don’t come cheap, I bet.

Using synthetic voices is a lot less expensive than using voice actors.

In fact, it is typically the most affordable option available.

Through Frazier, you have access to the best synthetic voices from leading providers.

The software also takes care of the final mix and mastering.

That’s value for money.

You can book a call to see if text-to-speech audio descriptions are the right fit for you and your audience.

Summary – what have we learnt?

We got through quite a lot there, so let’s quickly sum up our findings.

The fact is…

audio description is essential for accessibility

but…

there are not enough productions provided with audio descriptions,

because…

the traditional audio description workflow is complex and expensive,

so…

synthesized speech should be used in place of voice actors where necessary.

With synthetic voices…

unpredictable variable costs become fixed costs,

so…

text-to-speech audio descriptions are a suitable solution

for…

productions with small budgets or short formats,

meaning…

more productions are available with audio descriptions.

Without synthetic voices, audio description is a non-starter for many productions. Yet as the aforementioned studies and examples show, synthesized speech is a viable and affordable solution for audio-describing content. Browser-based and easy to use, Frazier provides the perfect platform for integrating audio descriptions into your videos through text to speech.

2021-06-10