In addition to providing technological solutions for audio description, VIDEO TO VOICE is actively involved in other fields surrounding accessibility for digital media. For some time, VIDEO TO VOICE has been working in close partnership with the Fraunhofer Institute for Integrated Circuits (IIS) on topics such as automated mixing and MPEG-H Audio. The following article is the first instalment of a two-part series shedding light on MPEG-H Audio, a technology developed by Fraunhofer IIS to improve accessibility and deliver the best sound experience possible.
There are now more options than ever for consuming media. Over the years, the development of television technology, the internet, portable devices, and streaming has given many people greater access to information and contributed to their empowerment. Even so, broadcasting and streaming services are often hindered by accessibility barriers, depriving millions of people of proper access to media. At the same time, there has been increasing consumer demand for a more personalised experience when watching and listening to content.
To address these issues, Fraunhofer IIS has made a significant contribution towards the development of MPEG-H Audio – an interactive and immersive audio system for broadcasting and streaming content. By replacing channel-based audio with an object-based approach, the technology gives end users greater flexibility in engaging with content and adapting it to their own personal needs. For broadcasters and content creators, MPEG-H Audio can be integrated into existing production workflows with little fuss.
Philipp Eibl, Research Engineer from the SoundLab team at Fraunhofer IIS, hails MPEG-H Audio as the "one mix to rule them all." But what features make MPEG-H Audio so revolutionary?
Before we find out, let's take a closer look at the status quo in audio production, the fundamentals behind object-based audio, and the accessibility barriers many people face in broadcasting and streaming.
Digital and file-based approaches have been used in media production for years now. Yet workflows often try to mirror analogue and tape-based methods of production and delivery in the digital world.
In audio production, various sound sources are mixed in a digital audio workstation to produce a channel-based mix for a specific target loudspeaker layout.
Each audio channel then needs to be reproduced by a loudspeaker at a well-defined position.
In the channel-based world, the fixed audio mix cannot be adapted to individual needs. In most instances, the end user can only adjust the loudness and dynamic range on their device.
Lack of control over the audio tends to affect hard-of-hearing viewers the most. For example, people with hearing impairments often have trouble discerning the dialogue from overly loud background noise or music in a show. The channel-based approach provides no quick fix for this accessibility hurdle.
In addition, the end user may want to watch or listen to a show on different devices with very different audio output configurations. For example, a commuter may start watching something on their phone on the train and want to finish it on their TV at home. This too has driven consumer demand for personalisation, making the channel-based approach look outdated.
Audio becomes an object when it is associated with metadata. This is information that describes the audio object's existence, position, and function. The combination of audio objects and the interplay between them can be flexible, taking into account user, environment, and platform-specific factors.
The BBC's Research & Development team sums up object-based media as follows:
"By breaking down a piece of media into separate objects, attaching meaning to them, and describing how they can be rearranged, a programme can change to reflect the context of an individual viewer."
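This idea can be sketched as a simple data structure: an audio signal bundled with the metadata that describes it. The class and field names below are purely illustrative and are not the actual MPEG-H metadata syntax.

```python
from dataclasses import dataclass

# A hypothetical sketch of an audio object: a signal paired with metadata
# describing what it is, where it sits, and whether the user may adjust it.
@dataclass
class AudioObject:
    label: str                  # e.g. "dialogue", "stadium atmosphere"
    samples: list               # the raw audio signal
    azimuth: float = 0.0        # horizontal position in degrees
    elevation: float = 0.0      # vertical position in degrees
    gain_db: float = 0.0        # default playback gain
    interactive: bool = False   # may the end user adjust this object?

scene = [
    AudioObject("dialogue", samples=[], interactive=True),
    AudioObject("crowd", samples=[], azimuth=-90.0, interactive=True),
    AudioObject("music", samples=[], interactive=False),
]

# A playback device can now reason about the scene, e.g. list which
# objects the viewer is allowed to rearrange or re-balance:
adjustable = [obj.label for obj in scene if obj.interactive]
print(adjustable)  # ['dialogue', 'crowd']
```

Because the objects stay separate until playback, the same scene can be rearranged differently for each viewer, which is exactly the flexibility the channel-based approach lacks.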
Sporting events provide a good use case. TV coverage for an American football game may have separate audio objects for:

- the commentary, possibly in several languages
- the stadium atmosphere and crowd noise
- on-field sounds, such as the referee's microphone
These audio objects can then be adjusted by the end user in accordance with their needs and preferences on any device.
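The content creator also stays in control: the metadata can limit how far each object may be adjusted. A minimal sketch of that idea follows, with the function name and clamping logic as assumptions for illustration, not the real MPEG-H API.

```python
def apply_user_gain(requested_db: float, min_db: float, max_db: float) -> float:
    """Clamp a viewer's requested gain change to the range the content
    creator allowed for this object (illustrative logic)."""
    return max(min_db, min(max_db, requested_db))

# Suppose the producer allows the commentary to vary by -6..+6 dB.
boosted = apply_user_gain(+10.0, -6.0, +6.0)   # request exceeds the limit
reduced = apply_user_gain(-3.0, -6.0, +6.0)    # request is within range
print(boosted, reduced)  # 6.0 -3.0
```

The viewer gets meaningful control, while the producer guarantees the mix never drifts outside artistically acceptable bounds.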
With user personalisation coming to the fore, object-based audio can now help to break down barriers in broadcasting and streaming.
According to the World Health Organization, around 15% of the world's population has some form of disability. Chances are, then, that a production with accessibility problems will alienate many potential listeners and viewers. In fact, many factors can make a broadcast inaccessible, both to people with disabilities and to those without.
For example, people with hearing loss often struggle to understand the dialogue in films and TV series. Similarly, productions without an audio description can cause problems for viewers with visual impairments, cognitive disabilities or low literacy levels.
Language presents another barrier. Most of today's TV programming is available in only one language, making it inaccessible to people who don't understand that language. The complexity of the spoken language may also be too high for some viewers, such as new language learners, people with cognitive disabilities, or those who are fatigued.
These are just a few of the accessibility issues that Fraunhofer IIS's MPEG-H Audio technology tackles.
First, it is important to clear up some confusion surrounding terminology.
MPEG-H is a collection of standards developed by the ISO/IEC Moving Picture Experts Group (MPEG). It consists of various "components", each of which is considered a separate standard.
One such component is the MPEG-H 3D Audio standard, which is what Fraunhofer IIS's MPEG-H Audio system is based on.
MPEG-H Audio was developed as a way to deliver object-based audio to the end user.
MPEG-H Audio is a Next Generation Audio (NGA) technology that delivers on three key principles:

- Personalisation: The end user can adapt the audio mix to their own needs and preferences.
- Immersive sound: MPEG-H Audio goes beyond traditional surround formats and brings 3D audio to broadcasts. With immersive sound, the aural experience known from the cinema can be enjoyed on home theatre set-ups or via headphones.
- Universal delivery: The content creator produces a single mix in the largest format required for delivery. Any alternative renderings are then derived from that mix, for any kind of playback layout: 3D home cinema systems, regular stereo TVs, mobile devices, or binaural rendering for headphones.
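As a toy illustration of how a rendering can be derived from object positions, the sketch below pans each object into a stereo pair using constant-power panning. This is a simplified stand-in for a real MPEG-H renderer; the azimuth convention and function name are assumptions made for the example.

```python
import math

def render_stereo(objects, num_samples):
    """Derive a stereo rendering from an object-based scene using
    constant-power panning by azimuth (a toy stand-in for a real renderer)."""
    left = [0.0] * num_samples
    right = [0.0] * num_samples
    for obj in objects:
        # Map azimuth (-90 deg = full left ... +90 deg = full right)
        # to a pan angle between 0 and pi/2.
        pan = (obj["azimuth"] + 90.0) / 180.0 * (math.pi / 2)
        gain_l, gain_r = math.cos(pan), math.sin(pan)
        for i, sample in enumerate(obj["samples"]):
            left[i] += gain_l * sample
            right[i] += gain_r * sample
    return left, right

scene = [
    {"azimuth": -90.0, "samples": [1.0, 1.0]},  # object panned hard left
    {"azimuth": +90.0, "samples": [0.5, 0.5]},  # object panned hard right
]
left, right = render_stereo(scene, num_samples=2)
```

The same scene could instead be rendered to 5.1, a 3D layout, or binaural headphone output simply by swapping the rendering function, without touching the objects themselves.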
The interplay between these core principles is key, as Adrian Murtaza, the Senior Manager for Technology and Standards at Fraunhofer IIS, explains:
"This means viewers can personalise a programme’s audio mix, for instance by switching between different languages, enhancing hard-to-understand dialogue, or adjusting the volume of the commentator in sports broadcasts."
To address the issue of unintelligible dialogue on TV, Fraunhofer IIS developed Dialog+, an MPEG-H production technology. Our article on Dialog+ provides an in-depth analysis of why this technology is seen as a game-changer for easier-to-follow TV dialogue.
However, Dialog+ only enhances the dialogue in existing productions. For new content, MPEG-H Audio can already be created directly from the existing individual tracks.
Part two of this article series provides detailed analysis of the innovative tools that make up Fraunhofer IIS's MPEG-H Audio system. See you there!