In addition to providing technological solutions for audio description, VIDEO TO VOICE is actively involved in other fields surrounding accessibility for digital media. For some time, VIDEO TO VOICE has been working in close partnership with the Fraunhofer Institute for Integrated Circuits (IIS) on topics such as automated mixing and MPEG-H Audio. The following article provides a fascinating glimpse at Dialog+, a production technology developed by Fraunhofer IIS to make TV dialogue easier to understand.
It's a frustration many of us have experienced while watching TV: we can't quite hear what the actor is mumbling during an important scene, so we lose track of the plot for the rest of the show. Likewise, when the music, sound effects or background noises are too loud, the dialogue becomes difficult to follow, prompting us to turn up the volume to an uncomfortable level or switch channels.
To combat this issue, the Fraunhofer Institute for Integrated Circuits (IIS) teamed up with public broadcasters to develop Dialog+, an innovative AI-based solution for making TV dialogue easier to understand. In essence, Dialog+ provides viewers with an enhanced version of the dialogue, which they can select as an alternative to the original audio mix.
In late 2020, German television networks Westdeutscher Rundfunk (WDR) and Bayerischer Rundfunk (BR) conducted viewer testing to see how Dialog+ would be received. But before we examine the results, let's take a closer look at why TV dialogue is difficult to follow in the first place.
According to a study conducted by Fraunhofer IIS and WDR, 68% of the people surveyed found it hard to understand speech on TV frequently or very frequently.
From age 40 to 50, the range of sounds we can perceive steadily decreases, often without us initially noticing a difference. As we age, our hearing continues to deteriorate.
If the dialogue is obscured by overly loud background music or sound effects, following along can soon become tiring and ruin the viewing experience.
Most importantly, unintelligible dialogue makes the production less accessible.
There are a number of factors that determine how well viewers understand the dialogue in a TV show or film.
First of all, the viewer's hearing plays a crucial role – people with mild to severe degrees of hearing loss often need to rely on captions to fully follow the spoken dialogue.
Voice signal quality can also be affected by quickly spoken dialogue, unfamiliar accents, and people talking over each other.
Another telling factor is the type of production. While audiences have little trouble following what is being said on the news, problems with speech intelligibility usually crop up in fictional TV series and films.
To create the right mood, TV series and films often contain music and background noises, which can sometimes have a masking effect on the dialogue.
These days, most television and film productions are shot on location. Away from a controlled studio environment, the sound quality of the dialogue can be compromised, as Christian Simon, Senior Engineer at Fraunhofer IIS, explains: "It starts at location scouting. Shooting takes place in locations that look wonderful, but sound terrible."
For example, background noise from a busy road could interfere with the speech signal from the actors' dialogue in an otherwise perfect backdrop for a scene.
What happens behind the mixing desk isn't really the problem.
Mixing and quality control are performed under ideal acoustic conditions at the studio to create the best results. However, viewers watch the finished product on a variety of devices with very different audio output configurations.
For example, listening to a show on your phone will sound different to what you hear from your television's loudspeakers.
The ideal audio mix therefore depends on the viewer's individual requirements and preferences, something Fraunhofer IIS discovered after conducting tests with the BBC at Wimbledon in 2011. Christian Simon adds, "There's no such thing as a perfect mix, as viewers have vastly differing needs. You won't be able to please everybody."
Dialog+ is an AI-based speech enhancement technology that gives users a personalised audio experience.
For television and film productions, it places extra emphasis on the spoken dialogue by reducing the volume of music, background noises, and sound effects.
Dialog+ doesn't replace the original audio mix, but provides an alternative that viewers can switch to when they find speech difficult to follow.
During testing with WDR, participants were asked to watch three short clips and compare the original audio with the technically processed version.
The results were conclusive: 80% of the participants approved of having the option to switch between audio tracks, and more than twice as many testers preferred the version enhanced with Dialog+ as preferred the original audio.
After its own series of successful tests, BR launched a pilot project that provided an alternative audio mix with Dialog+ for two of its most popular programs.
In the field tests, viewers had three options for experiencing the Dialog+ audio mix. WDR broadcast over satellite TV, where the Dialog+ audio track could be selected via the remote. In addition, the ARD online media library contained programs that were available with and without Dialog+.
In BR’s testing, viewers could watch shows with the choice of two different Dialog+ versions which they could access through HbbTV 2.0-capable devices.
Dialog+ technology is, in fact, only used for existing material.
From a technical standpoint, older shows and movies can be problematic because their audio tracks were not designed and archived with the intention of enabling user personalisation. This means it is difficult to access the separate audio tracks and individual elements forming the final audio mix.
Therefore, the Dialog+ deep learning algorithm is applied to the final audio mix. The technology analyses the mix and estimates the speech component, separating it from the music and background noises. After that, it is possible to provide a version in which the dialogue is louder relative to everything else.
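To make the remixing step more concrete, here is a minimal Python sketch. It is not Fraunhofer's implementation: it assumes the separation has already produced two sample lists, one for speech and one for everything else, and it only illustrates how lowering the background by a chosen number of decibels yields an alternative mix. The function names `db_to_gain` and `remix`, and the default of -6 dB, are hypothetical choices made for this illustration.

```python
def db_to_gain(db: float) -> float:
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def remix(speech, background, background_db=-6.0):
    """Recombine two separated stems, attenuating only the background.

    `speech` and `background` are equal-length sequences of samples.
    They stand in for the output of a separation stage, which is
    assumed here and not implemented.
    """
    if len(speech) != len(background):
        raise ValueError("stems must have the same length")
    gain = db_to_gain(background_db)
    # The speech is left untouched; only the background is scaled.
    return [s + gain * b for s, b in zip(speech, background)]


# Illustrative sample values only, not real audio.
speech = [0.5, -0.25, 0.125]
background = [0.2, 0.2, -0.2]

original_mix = remix(speech, background, background_db=0.0)
clearer_mix = remix(speech, background, background_db=-20.0)
```

With `background_db=0.0` the function reproduces the original mix, while a negative value such as -20 dB scales the background to a tenth of its amplitude and leaves the speech unchanged. Offering viewers several such levels would correspond to calling `remix` with different `background_db` values.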
If new material requires adjustable elements, producers would use MPEG-H Audio to create audio objects.
Dialog+ is integrated at the post-production stage by the service provider. All the viewer has to do is select the level that they want on their smart TV and enjoy the show.
Go to 3:55 in this video from Fraunhofer IIS to listen to a short demo of Dialog+ in use.
German speakers can also go to the ARD media library and watch this episode of the Wunderschön! travel show with "Clear Speech" to listen to the difference.
The way sound is perceived varies from person to person. The fact that we can now watch TV and films on a variety of end devices also affects what the final result sounds like. As a result, it is impossible to create a single audio mix that is fully accessible and satisfies everybody. Instead, the focus should be on tailoring the viewing and listening experience to viewers' individual needs. With Dialog+ alternative audio mixes, Fraunhofer IIS is doing just that.