Capturing volumetric audio to ring in the future of immersive media for sports broadcasting
By Rob Oldfield, co-founder, Salsa Sound
We are on the verge of possibly the most exciting evolution in sports broadcasting in the last 50 years. Not just an incremental improvement in quality or a new format, but a whole new paradigm: volumetric and immersive media.
The live broadcast and gaming worlds are starting to merge, and audiences are demanding new, dynamic and immersive experiences to better connect them to the action. Rather than being served a one-size-fits-all linear broadcast feed, in this new paradigm viewers will have complete control over the content they receive, the data and stats, their viewpoint, their preferences and the device they consume it on.
Practically, what this looks like is being able to choose the position you want to view the content from, eg choosing to watch a penalty being taken from the perspective of the goalkeeper, standing on an athletics track to experience the speed of the runners as they pass by, or facing a pitch in baseball.
This all requires volumetric video technology, where multiple cameras are used to compile a 3D video representation of the scene. This can be done in several ways, from limb tracking to depth extraction and texture mapping. Whichever approach is used, the viewer is presented with a broadcast experience that is more akin to a computer game than a traditional broadcast, with the content even being rendered and served to viewers through a game engine, eg Unity or Unreal.
This may seem very futuristic to some, but these technologies are becoming more and more sophisticated, and we have witnessed many demonstrations and practical examples of this working well in boxing, rugby, basketball and football, with several companies showing off their technologies in this space.
Furthermore, another step towards this has been the delivery of 360-degree video content. This is where the viewer is served multiple high-resolution panoramic video perspectives and is able to freely navigate, pan, zoom and switch between cameras, putting viewers closer to actually being in the stands at the game.
But whilst these use cases present a fantastic video experience, they are massively reduced in impact if the audio is not well implemented from the outset. Allowing viewers a customisable viewpoint on a sporting event without the audio matching massively reduces the effectiveness of the experience and has a negative impact on viewer ‘buy-in’. It is hence imperative that we develop and deploy audio capture, processing and reproduction techniques that give viewers a dynamic and immersive audio presentation to match the dynamic video presentation; we need volumetric audio.
There are several key elements that need to be addressed if we are to seriously consider the implementation of a true volumetric audio presentation.
Scene capture
As always in a live broadcast set-up, it is a good idea not to burden an already stretched broadcast team with too many additional equipment requirements; hence, where possible, the scene should be captured with the existing microphone set-up.
There are, however, some new requirements that mean some additional immersive microphones will need to be installed at the venue. These could take the form of immersive microphone arrays, such as an ORTF 3D or a B-format microphone, or another array able to capture an immersive audio presentation. Depending on the use case, several of these microphone arrays can be deployed at the venue to provide discrete audio viewpoints, or be combined intelligently to help produce a volumetric description of the scene.
The standard broadcast microphones complement this by bringing in the field-of-play sounds and other diegetic sources to enhance the experience. These sounds can then be triangulated so they can be dynamically panned to match the bespoke visual rendering.
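As a rough sketch of that dynamic panning step, the snippet below encodes a triangulated mono source into first-order (FuMa) B-format relative to a chosen virtual listening position. The coordinate convention and function name are illustrative rather than taken from any particular production system, and distance attenuation and elevation are omitted for brevity.

```python
import numpy as np

def encode_source_foa(mono, src_pos, listener_pos, listener_yaw=0.0):
    """Pan a mono field-of-play source into first-order (FuMa) B-format,
    relative to a virtual listener position and yaw (horizontal only)."""
    dx = src_pos[0] - listener_pos[0]
    dy = src_pos[1] - listener_pos[1]
    azimuth = np.arctan2(dy, dx) - listener_yaw   # angle to the source from the listener's view
    w = mono / np.sqrt(2.0)                       # FuMa omnidirectional weighting
    x = mono * np.cos(azimuth)                    # front/back component
    y = mono * np.sin(azimuth)                    # left/right component
    z = np.zeros_like(mono)                       # no elevation in this sketch
    return w, x, y, z
```

Updating the source or listener position frame by frame is what makes the panning dynamic: as the triangulated position of a kick, or the viewer's chosen vantage point, moves, the azimuth and therefore the B-format encode moves with it.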
Scene composition and processing
Once the audio sources have been captured, there needs to be a process of scene composition. This is difficult to do using manual methods, but thankfully artificial intelligence (AI) can help to detect, understand and extract diegetic sources from the scene, and audio triangulation methods can then be employed to assign a location to those sources.
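To give a flavour of the triangulation part, one common building block is estimating the time-difference-of-arrival of a sound between a pair of microphones, for example with the generic GCC-PHAT method sketched below (a textbook approach, not a description of any specific product). Delays from several microphone pairs can then be combined, typically by least squares, to estimate the source position.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (in seconds) of `sig` relative to `ref`
    using the phase transform (GCC-PHAT) cross-correlation."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-15        # phase transform: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:               # optionally limit the search to plausible delays
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)
```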
The output from this stage is a volumetric representation of the space with sources, ambient beds and accompanying metadata that the rendering environment can use to present the audio so it matches the video.
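What that representation might contain can be illustrated with a few simple data structures. The field names below are purely illustrative (real systems would typically carry something closer to object-based audio metadata such as the Audio Definition Model), but they capture the idea of sources, beds and timing information travelling together.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AudioObject:
    """A localised diegetic source, eg a ball kick, with a position in the venue."""
    name: str
    position: Tuple[float, float, float]   # (x, y, z) in metres, venue-centred (assumed convention)
    gain_db: float = 0.0

@dataclass
class AmbientBed:
    """A channel- or scene-based crowd/atmosphere capture tied to a microphone array position."""
    name: str
    layout: str                            # eg "5.1.4" or "FOA" (first-order ambisonics)
    origin: Tuple[float, float, float]     # where the array was placed

@dataclass
class VolumetricAudioScene:
    """One frame of the volumetric audio description handed to the renderer."""
    timestamp: float
    objects: List[AudioObject] = field(default_factory=list)
    beds: List[AmbientBed] = field(default_factory=list)
```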
Scene representation and transmission
With the scene captured and processed, and with accompanying metadata, it then needs to be streamed into a gaming engine or rendering environment so it can be freely manipulated to match the visual representation. This is easier said than done, as these engines were not originally designed for real-time streaming, but it is achievable.
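As a very rough sketch of what streaming into a rendering environment can involve, the snippet below packages a block of PCM audio together with its scene metadata into a single datagram. This is purely illustrative; a real deployment would more likely use an established low-latency transport and the engine's own audio plug-in interfaces, and every name in it is an assumption.

```python
import json
import socket
import struct
import numpy as np

def send_audio_block(sock, addr, seq, pcm_block, scene_meta):
    """Send one block of interleaved int16 PCM plus its JSON scene metadata
    as a single UDP datagram (illustrative wire format, not a standard)."""
    meta = json.dumps(scene_meta).encode("utf-8")
    channels = pcm_block.shape[1]
    header = struct.pack("!IHH", seq, len(meta), channels)   # sequence no., metadata length, channel count
    sock.sendto(header + meta + pcm_block.astype(np.int16).tobytes(), addr)

# Example: a 20 ms block of a 4-channel mix at 48 kHz
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
block = np.zeros((960, 4), dtype=np.int16)
send_audio_block(sock, ("127.0.0.1", 9000), seq=0, pcm_block=block,
                 scene_meta={"timestamp": 0.0, "objects": [], "beds": []})
```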
Scene reproduction
With the audio scene in a rendering environment, we are now in a position to create bespoke mixes that match a customised visual perspective. At this point there is obviously a need for some automated mixing to handle the addition of custom components, like various commentary options, and to create individual mixes. The scene is then typically rendered as a binaural mix if it is being delivered over, for instance, a VR headset, but other immersive set-ups can of course be implemented too if loudspeakers are being used.
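The head-tracking part of that binaural rendering can be sketched very simply: for a first-order (FuMa) B-format scene, matching the audio to head rotation is just a counter-rotation of the X and Y components by the listener's yaw, as below. The full binaural stage would additionally convolve the rotated scene with HRTFs, which is omitted here.

```python
import numpy as np

def rotate_foa_for_head_yaw(w, x, y, z, head_yaw_rad):
    """Counter-rotate a first-order (FuMa) B-format scene about the vertical axis
    so the sound field stays fixed in the venue as the listener turns their head."""
    theta = -head_yaw_rad                      # scene rotates opposite to the head
    x_rot = x * np.cos(theta) - y * np.sin(theta)
    y_rot = x * np.sin(theta) + y * np.cos(theta)
    return w, x_rot, y_rot, z                  # W and Z are unaffected by yaw
```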
5G Edge-XR project
An example of this kind of volumetric workflow was completed recently by Salsa Sound and BT Sport as part of the 5G Edge-XR project. The system was demonstrated at IBC 2022 and not only won the Best Technical Paper award, but the project as a whole won the Content Everywhere Award for its implementation of XR experiences on consumer devices using 5G networks.
The set-up here involved three 360-degree cameras positioned around the ground (in the main gantry and behind the left and right goals) at a Premier League football match between Wolves and Brentford. Alongside each camera was a B-format immersive microphone array to help produce the spatial audio representation to match the visual view, which could be rotated as the viewer rotated their perspective.
In addition, standard broadcast microphones were used to extract, process and position the diegetic sounds.
At the centre of the audio production was Salsa Sound’s MIXaiR software which created an automated pitch mix from the 13 pitch-side microphones, balanced and processed the commentary, and created three simultaneous 5.1.4 mixes (one for each of the different crowd perspectives). These mixes were then streamed into the Unity game engine as virtual loudspeakers at the camera positions.
As the viewer changed their perspective or camera, they were placed within the corresponding virtual loudspeaker and, as they rotated their head in the scene, the audio rotated accordingly. The (optional) commentary components remained static, the pitch sounds came from the pitch location, and the crowd dynamically rotated to match head rotation. All of this brings viewers much closer to the action, with the goal being that it sounds so realistic as to not really be noticed.
The world of broadcast is changing fast and audio needs to not only keep pace, but lead the way in delivering great experiences for viewers.