1.2.2: Audiovisual Scene objects#

Up until now, we’ve seen how we can use AudibleLight to generate audio-only Scene objects.

However, AudibleLight can also be used to create audiovisual soundscapes: a visual representation of a Scene can be rendered alongside the audio, in a similar manner to SELDVisualSynth[adrianSRoman/SELDVisualSynth].

The resulting video can be played back for debugging MicArray and Event positions, or can be used to train a model on audio-visual data. For more information, see the DCASE community challenges on audio-visual sound event localisation and detection.

Creating a Scene that supports images#

Let’s create a Scene that supports visual synthesis. To make this easier, we can pass in a directory of images using the image_path argument.

[13]:
import matplotlib.pyplot as plt
from IPython.display import Image, Audio, display

from audiblelight.core import Scene
from audiblelight import utils
[14]:
visual_scene = Scene(
    duration=10,
    backend="rlr",
    sample_rate=22050,
    video_low_power=True,
    max_overlap=1,
    video_res=(960, 480),
    video_fps=5,
    backend_kwargs=dict(
        mesh=utils.get_project_root() / "tests/test_resources/meshes/Oyens.glb"
    ),
    fg_path=utils.get_project_root() / "tests/test_resources/soundevents",
    image_path=utils.get_project_root() / "tests/test_resources/images",
)
2026-01-14 12:54:07.974 | WARNING  | audiblelight.worldstate:load_mesh_navigation_waypoints:1884 - Cannot find waypoints for mesh Oyens inside default location (/home/huw-cheston/Documents/python_projects/AudibleLight/resources/waypoints/gibson). No navigation waypoints will be loaded.
CreateContext: Context created

We can inspect the images available to our Scene under the fg_images attribute:

[15]:
print(visual_scene.fg_images[:2])
[PosixPath('/home/huw-cheston/Documents/python_projects/AudibleLight/tests/test_resources/images/doorCupboard/27_0.jpg'), PosixPath('/home/huw-cheston/Documents/python_projects/AudibleLight/tests/test_resources/images/doorCupboard/36_0.jpg')]

Note that visual synthesis is currently only supported for the ray-tracing backend (i.e., where scene.state.name.upper() == "RLR").
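Since only one backend supports video, it can be worth checking this up front before requesting visual synthesis. A minimal sketch, where supports_video is a hypothetical helper (not part of the AudibleLight API):

```python
def supports_video(backend_name: str) -> bool:
    """Hypothetical helper: visual synthesis currently requires the RLR backend."""
    return backend_name.upper() == "RLR"

# Against a Scene this would look like: supports_video(visual_scene.state.name)
print(supports_video("rlr"))    # True
print(supports_video("other"))  # False
```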

Optional arguments#

When creating such a Scene, we can specify a variety of arguments, including:

  • video_fps: the frames-per-second of the video

  • video_res: the resolution of the video, as a (width, height) tuple

    • Note that the width of the video must be exactly twice its height for correct perspective video rendering

  • video_low_power: applies a variety of adjustments to improve performance on weaker hardware

  • video_overlay_distance_scale_factor: scales the size of overlaid images depending on proximity to camera.

  • video_overlay_base_size: the base size of overlaid images on the video, independent of distance.

For more information, including default arguments, see Scene.__init__.
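The width = 2 × height constraint on video_res can be checked before constructing a Scene. A minimal sketch, where validate_video_res is a hypothetical helper (not part of the AudibleLight API):

```python
def validate_video_res(res: tuple) -> tuple:
    """Hypothetical check: width must be exactly twice the height."""
    width, height = res
    if width != 2 * height:
        raise ValueError(f"width must be 2x height, got {width}x{height}")
    return res

print(validate_video_res((960, 480)))  # (960, 480)
```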

Adding Event objects with images#

If we’ve correctly specified image_path, then an image will automatically be assigned to each Event, using the image class that most closely matches the class of the audio file.

To show what we mean, let’s start by adding a waterTap Event:

[16]:
tap = visual_scene.add_event(
    event_type="static",
    filepath=utils.get_project_root() / "tests/test_resources/soundevents/waterTap/95709.wav",
)
print(tap.image_filepath)
Warning: initializing context twice. Will destroy old context and create a new one.
2026-01-14 12:54:10.471 | INFO     | audiblelight.core:add_event:1152 - Event added successfully: Static 'Event' with alias 'event000', audio file '/home/huw-cheston/Documents/python_projects/AudibleLight/tests/test_resources/soundevents/waterTap/95709.wav' (unloaded, 0 augmentations), 1 emitter(s).
CreateContext: Context created
/home/huw-cheston/Documents/python_projects/AudibleLight/tests/test_resources/images/waterTap/32_0.jpg

We can load up the image as a numpy array using Event.load_image.

This method operates similarly to Event.load_audio: once an image has been loaded, it is cached to speed up future reads. Caching can be bypassed by passing ignore_cache=True:
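The caching behaviour described above can be illustrated with a generic sketch (this is the pattern, not AudibleLight's actual implementation):

```python
import tempfile
from pathlib import Path

class CachedLoader:
    """Generic sketch of load-once caching with an ignore_cache escape hatch."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)
        self._cache = None

    def load(self, ignore_cache: bool = False):
        # Read from disk only on the first call, or when the cache is bypassed
        if self._cache is None or ignore_cache:
            self._cache = self.filepath.read_bytes()  # stand-in for image decoding
        return self._cache

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"pixels")
loader = CachedLoader(f.name)
first = loader.load()
print(loader.load() is first)  # True: the second call returns the cached object
```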

[17]:
tap_image = tap.load_image()
print(tap_image.shape)
(96, 95, 3)

Let’s try plotting the image.

[18]:
plt.imshow(tap_image)
plt.show()
../_images/_examples_1.2.2_audiovisual_scenes_13_0.png

That looks like a tap to me!

Manually adding Event images#

Of course, we can also attach a specific image to an Event when adding it to a Scene. All we need to do is specify image_filepath in Scene.add_event.

Let’s try adding an Event with the audio from a running tap, but the image of a telephone:

[19]:
visual_scene.clear_events()
tapphone = visual_scene.add_event(
    event_type="static",
    filepath=utils.get_project_root() / "tests/test_resources/soundevents/waterTap/95709.wav",
    image_filepath=utils.get_project_root() / "tests/test_resources/images/telephone/3_0.jpg",
    duration=0.5,
    scene_start=2.5
)

print(tapphone.image_filepath)
print(tapphone.filepath)
CreateContext: Context created
Warning: initializing context twice. Will destroy old context and create a new one.
Warning: initializing context twice. Will destroy old context and create a new one.
2026-01-14 12:54:11.338 | INFO     | audiblelight.core:add_event:1152 - Event added successfully: Static 'Event' with alias 'event000', audio file '/home/huw-cheston/Documents/python_projects/AudibleLight/tests/test_resources/soundevents/waterTap/95709.wav' (unloaded, 0 augmentations), 1 emitter(s).
CreateContext: Context created
/home/huw-cheston/Documents/python_projects/AudibleLight/tests/test_resources/images/telephone/3_0.jpg
/home/huw-cheston/Documents/python_projects/AudibleLight/tests/test_resources/soundevents/waterTap/95709.wav

Now, let’s play the audio and display the image, to confirm that the two classes differ.

[20]:
phone_img = tapphone.load_image()
tap_audio = tapphone.load_audio()
[21]:
plt.imshow(phone_img)
plt.show()
../_images/_examples_1.2.2_audiovisual_scenes_19_0.png
[22]:
Audio(tap_audio, rate=visual_scene.sample_rate)
[22]:

Audiovisual Scene synthesis#

Now that we’ve added Event objects with images to our Scene, we’re ready to generate a video file containing these objects.

[23]:
# Remove all the Events
visual_scene.clear_events()
visual_scene.clear_microphones()

# Add a microphone + single Event at known positions
visual_scene.add_microphone(microphone_type="ambeovr", position=[3.5, -3.5, 1.5])
ev = visual_scene.add_event(
    event_type="static",
    filepath=utils.get_project_root() / "tests/test_resources/soundevents/telephone/30085.wav",
    image_filepath=utils.get_project_root() / "tests/test_resources/images/telephone/3_0.jpg",
    duration=0.5,
    scene_start=2.5,
    # add at a known position: definitely visible from microphone
    position=[1.5, -0.5, 0.5],
)
CreateContext: Context created
CreateContext: Context created
Warning: initializing context twice. Will destroy old context and create a new one.
Warning: initializing context twice. Will destroy old context and create a new one.
Warning: initializing context twice. Will destroy old context and create a new one.
CreateContext: Context created
Warning: initializing context twice. Will destroy old context and create a new one.
CreateContext: Context created
2026-01-14 12:54:12.812 | INFO     | audiblelight.core:add_event:1152 - Event added successfully: Static 'Event' with alias 'event000', audio file '/home/huw-cheston/Documents/python_projects/AudibleLight/tests/test_resources/soundevents/telephone/30085.wav' (unloaded, 0 augmentations), 1 emitter(s).

To generate the video file, just call Scene.generate with video=True. You can also pass video_fname to specify the output pattern for any video files.

By default, one video will be generated for each MicArray object added to the Scene.
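As an illustration, the per-microphone output files appear to be indexed by microphone alias (a pattern inferred from the video_out_mic000.mp4 file used later in this notebook; treat the exact format as an assumption):

```python
def expected_video_fnames(stem: str, n_mics: int) -> list:
    """Hypothetical sketch of per-microphone video filenames."""
    return [f"{stem}_mic{idx:03d}.mp4" for idx in range(n_mics)]

print(expected_video_fnames("video_out", 2))
# ['video_out_mic000.mp4', 'video_out_mic001.mp4']
```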

[24]:
visual_scene.generate(video=True, audio=False, metadata_dcase=False, metadata_json=False)
Rendering video...:  22%|██▏       | 11/50 [00:01<00:02, 13.63it/s]/home/huw-cheston/.cache/pypoetry/virtualenvs/audiblelight-5f5KpqNP-py3.10/lib/python3.10/site-packages/pyvista/core/filters/data_object.py:180: PyVistaDeprecationWarning: The default value of `inplace` for the filter `PolyData.transform` will change in the future. Previously it defaulted to `True`, but will change to `False`. Explicitly set `inplace` to `True` or `False` to silence this warning.
  warnings.warn(msg, PyVistaDeprecationWarning)
Rendering video...: 100%|██████████| 50/50 [00:03<00:00, 15.12it/s]

Now that we’ve generated the video, we can play it. We’ll use ffmpeg to generate a .gif file that should play in most browsers.

In normal use, however, you’ll probably just use the .mp4 file directly!

[26]:
import subprocess
# Convert with ffmpeg
subprocess.run([
    'ffmpeg', '-i', 'video_out_mic000.mp4',
    '-vf', 'fps=5,scale=960:-1:flags=lanczos',
    '-loop', '0',
    '-y', 'video_out.gif'
])
display(Image(filename='video_out.gif'))
ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 13 (Ubuntu 13.2.0-23ubuntu3)
  configuration: --prefix=/usr --extra-version=3ubuntu5 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --disable-omx --enable-gnutls --enable-libaom --enable-libass --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libglslang --enable-libgme --enable-libgsm --enable-libharfbuzz --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzimg --enable-openal --enable-opencl --enable-opengl --disable-sndio --enable-libvpl --disable-libmfx --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-ladspa --enable-libbluray --enable-libjack --enable-libpulse --enable-librabbitmq --enable-librist --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libx264 --enable-libzmq --enable-libzvbi --enable-lv2 --enable-sdl2 --enable-libplacebo --enable-librav1e --enable-pocketsphinx --enable-librsvg --enable-libjxl --enable-shared
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    60. 16.100 / 60. 16.100
  libavdevice    60.  3.100 / 60.  3.100
  libavfilter     9. 12.100 /  9. 12.100
  libswscale      7.  5.100 /  7.  5.100
  libswresample   4. 12.100 /  4. 12.100
  libpostproc    57.  3.100 / 57.  3.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'video_out_mic000.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2mp41
    encoder         : Lavf59.27.100
  Duration: 00:00:10.00, start: 0.000000, bitrate: 121 kb/s
  Stream #0:0[0x1](und): Video: mpeg4 (Simple Profile) (mp4v / 0x7634706D), yuv420p, 960x480 [SAR 1:1 DAR 2:1], 120 kb/s, 5 fps, 5 tbr, 10240 tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
Stream mapping:
  Stream #0:0 -> #0:0 (mpeg4 (native) -> gif (native))
Press [q] to stop, [?] for help
[swscaler @ 0x5737909cd580] [swscaler @ 0x5737909dd040] No accelerated colorspace conversion found from yuv420p to bgr8.
Output #0, gif, to 'video_out.gif':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2mp41
    encoder         : Lavf60.16.100
  Stream #0:0(und): Video: gif, bgr8(pc, gbr/unknown/unknown, progressive), 960x480 [SAR 1:1 DAR 2:1], q=2-31, 200 kb/s, 5 fps, 100 tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
      encoder         : Lavc60.31.102 gif
[out#0/gif @ 0x57379092c940] video:217kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.009000%
frame=   50 fps=0.0 q=-0.0 Lsize=     217kB time=00:00:09.80 bitrate= 181.4kbits/s speed=79.9x
<IPython.core.display.Image object>