Why Synthetic Voices Make Audio Description Much Easier

The traditional audio description workflow can be expensive and time-consuming. That's why audio description professionals are opting for technological solutions to bring down expenditure. One method is to read out audio description transcripts with synthetic voices, but how does the technology fare in practice?

First off, what is the traditional audio description workflow?

An audio description is a spoken narration that helps people with visual or intellectual disabilities follow a film or TV show.

Gregory Frazier put in the first concrete work on audio description in the mid-1970s.

In the following decade, audio description services started to emerge with a primary focus on cinema.

The conventional way of making audio descriptions requires different personnel for each stage in the process. 

  1. Writers create the audio description transcript.
  2. Voice actors are hired to read out the transcript in a recording studio.
  3. Sound engineers master the final mix.

Why is the conventional production process expensive and time-consuming?

We go into greater detail in our article exploring why audio description production is a headache for production companies. 

Here's a quick overview of our findings:

  • The workflow is complex.
  • It is difficult to say how long an audio description will take to produce.
  • Production costs are unpredictable and can rise quickly.

To address all these problems, production companies are turning to synthetic voices.

Synthetic voices: Microphone set up in a recording studio with sound engineer in the background

Okay, so what are synthetic voices?

Speech synthesis is the artificial production of human speech.

Based on machine learning, the technique is used for: 

  • speech generation
  • music generation
  • navigation systems
  • speech-enabled devices
  • text-to-speech
  • accessibility for people with visual impairments

The last two points are what we are particularly interested in.

What is text to speech?

Text to speech converts written text into a synthesized spoken voice.

And where does accessibility come into it?

Accessibility supports social inclusion for people with disabilities, ensuring everyone is treated the same. 

Technology has a big hand in promoting accessibility.

For example, many disabled people use specialised text-to-speech software such as screen readers when online or on devices.

Screen readers read aloud the text on a website and provide important functionality, such as navigating through headings and reading out alt text for images.

Person using a braille terminal linked to a screen reader with their laptop

Image credit: Wikipedia, Sebastien.delorme - Own work, CC BY-SA 3.0

And where does audio description come into it?

Following the logic above, audio description transcripts can also be read aloud as a synthesized voice using text-to-speech technology.
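To give a rough idea of what this can look like in practice, here is a minimal sketch that wraps a single description line in SSML, the W3C markup that many text-to-speech engines accept as input. The function name, the sample sentence, and the chosen speaking rate are assumptions made for illustration; this is not how any particular tool does it.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def description_to_ssml(text: str, rate: str = "medium", lang: str = "en-GB") -> str:
    """Wrap one audio description line in a minimal SSML document."""
    speak = ET.Element("speak", {"version": "1.1", "xmlns": SSML_NS, "xml:lang": lang})
    # <prosody rate="..."> asks the engine to speak slower or faster.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate})
    prosody.text = text  # ElementTree escapes &, < and > when serializing
    return ET.tostring(speak, encoding="unicode")


# Hypothetical description line, for illustration only.
print(description_to_ssml("The athlete sprints down the runway & plants the pole.", rate="fast"))
```

The resulting string could then be passed to whichever speech engine you use. Which SSML elements are actually supported varies from provider to provider, so check the documentation first.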

Don't you need proper voice actors for audio descriptions?

Not necessarily.

Studies show that text-to-speech audio description is generally regarded as an acceptable solution by viewers with visual impairments.

The best approach is to test audio descriptions created with synthetic voices on your audience and decide from there.

For example, German broadcaster MDR tested text-to-speech audio descriptions on blind consumers at the Louis Braille Festival in 2019 before adopting the technology.

Aren't synthetic voices robotic-sounding though?

No, they don't need to be.

Synthetic voices have a reputation for sounding robotic, as many people associate the technology with the JAWS screen reader, HAL, or even German electronic act Kraftwerk.

While JAWS has its functional merits, it is not suitable for audio description.

Thankfully, the quality and range of voices have come on a long way in recent years, giving the narration a natural feel.

With advancements in artificial intelligence, the technology is also improving all the time.

Take a listen for yourself – the female voice over in this video is a text-to-speech audio description:


Sounds good! But how does synthesized speech cut audio description costs?

With text to speech, you can deliver the audio description without a voice artist or recording studio.

If there are subsequent corrections made to the transcript, you don't need to worry about rehiring voice talent and studio space.

Using synthetic voices also saves time.

In the traditional workflow, production companies are sometimes left waiting on voice actor and studio availability.

Yet with synthesized speech, the audio description transcript can be read out as soon as the text is ready. 

And this makes budgeting a lot easier, right?

Exactly.

Unpredictable variable costs become easy-to-manage fixed costs.
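A quick back-of-the-envelope sketch shows the difference. Every figure below is made up purely for illustration, and the flat fee is an assumption rather than actual pricing. The point is only that studio costs grow with each round of corrections while a fixed fee does not.

```python
from dataclasses import dataclass


@dataclass
class TraditionalJob:
    """Hypothetical cost model for a studio-recorded audio description."""
    studio_hours: float   # studio time booked per session
    studio_rate: float    # assumed price per studio hour
    voice_fee: float      # assumed voice actor fee per session

    def cost(self, sessions: int) -> float:
        # Each correction round means another session: studio plus voice talent.
        return sessions * (self.studio_hours * self.studio_rate + self.voice_fee)


def synthetic_cost(flat_fee: float, sessions: int) -> float:
    # Assumption: regenerating speech after a transcript edit adds no extra
    # charge, so the cost stays flat however many correction rounds there are.
    return flat_fee


# Made-up numbers: compare one, two, and three recording rounds.
job = TraditionalJob(studio_hours=2, studio_rate=100, voice_fee=300)
for rounds in (1, 2, 3):
    print(rounds, job.cost(rounds), synthetic_cost(400, rounds))
```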

Man sitting at desk writing budget with calculator money and laptop in front of him

So you want to get rid of voice artists...

No, not at all!

Synthesized speech isn't there to compete with professional voice artists.

Instead, synthetic voices play a supporting role in audio description creation. 

The issue is that not enough productions are made with an audio description due to budget constraints.

Time is another sticking point: most companies need the audio description for their productions right away, yet for cinema it usually takes weeks for the audio description to be ready.

Other factors also come into play, as we have highlighted here.

So it's not a case of replacing voice artists - it's about making audio description affordable and time-saving for projects where it would otherwise be economically unviable.

In turn, the number of productions provided with an audio description should increase, boosting the availability of accessible content as a whole.

What types of projects are best suited to text-to-speech audio descriptions?

After extensive testing, early adopters have switched to text-to-speech audio descriptions for productions of various lengths.

MDR, a broadcaster in Germany, is using synthetic voices to audio-describe 45-minute documentaries, 30-minute cultural shows, and 10-minute web series episodes.

Media firm Februar Film audio-described tutorial videos for German insurer AOK with text to speech.

Text-to-speech audio descriptions are perfect for smaller budgets and certain formats, particularly those intended for the web.

What is the best way to integrate audio descriptions with synthetic voices into videos?

VIDEO TO VOICE has developed software for creating text-to-speech audio descriptions.

The transcript can be written directly into the tool; the audio description is then generated using text to speech in seconds.

The user can choose from hundreds of voices in over 40 languages for the audio description.

Here is a short narration of a pole vault attempt delivered using text-to-speech audio description:


Wow, great job! But I don't want to download more software onto my computer.

You don't have to!

VIDEO TO VOICE production tools are browser-based, so there's no need to download anything.

You'll be given log-in details to access the platform online, and can start creating your own text-to-speech audio descriptions right away.

All these fancy synthetic voices don't come cheap, I bet.

Using synthetic voices is a lot less expensive than using voice actors.

In fact, it is the most affordable option out there.

Through VIDEO TO VOICE production tools, you have access to the best synthetic voices available from leading providers.

The software also takes care of the final mix and mastering.

That's value for money.

Okay, you've sold it to me. Where can I sign up?

You can take the software on a 7-day test drive first to see if text-to-speech audio descriptions are the right fit for you and your audience.

The price is dependent on how much audio description you intend to produce.

Whether you audio-describe content on a regular basis or only occasionally, VIDEO TO VOICE has developed different packages tailored to your audio description needs.

screenshot of the VIDEO TO VOICE app interface with video player, waveform, and descriptions

Summary

We got through quite a lot there, so let's quickly sum up our findings.

The fact is:

  • audio description is essential for accessibility

but...

  • there are not enough productions provided with audio descriptions

because...

  • the traditional audio description workflow is complex and expensive

so...

  • synthesized speech should be used in place of voice actors where necessary


With synthetic voices:

  • unpredictable variable costs become fixed costs

so...

  • text-to-speech audio descriptions are a suitable solution

for...

  • productions with small budgets or short formats

meaning...

  • more productions are available with audio descriptions 


Without synthetic voices, audio description is a non-starter for many productions. Yet as the studies and examples above show, synthesized speech is a viable and affordable solution for audio-describing content. Browser-based and easy to use, VIDEO TO VOICE production tools provide the perfect platform for integrating audio descriptions into your videos through text to speech.


We work with leading experts from academic institutions in our software's development:

ZHAW and Uni Hildesheim