💻 Make your first submission using the starter kit.
🗃️ Find the challenge resources over here.
🕵️ Introduction
Currently available generative models are typically restricted to a single modality, and generating both video and audio remains a challenging task. In this challenge, we focus on achieving spatial alignment between the generated video and audio: participants are asked to generate both modalities in an unconditional manner. The dataset used in this challenge primarily features humans and musical instruments, with sound captured by microphones so that the audio reflects the sound-producing activities in the corresponding video.
📑 Task - Spatially Aligned Audio and Video Generation
Goal: The objective of the spatial alignment track is to develop a generative model that creates videos with spatially aligned stereo audio, using 5-second videos with associated stereo audio as training data. The resulting model should produce high-quality videos with matching audio. We evaluate the results on both the quality of the generated video and sound and on spatial alignment scores, and participants are encouraged to achieve the highest possible scores on these metrics.
Unconditional Generation: The task is unconditional generation, meaning participants must build and train models that generate audio and video without any specific conditions or prompts.
Format: The target video resolution is 256x256, and the audio has 2 channels (stereo).
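As a sanity check before submitting, generated samples can be validated against this target format. Below is a minimal sketch; the frame rate and audio sample rate are illustrative assumptions, not challenge requirements:

```python
import numpy as np

# Hypothetical generated sample, for illustration only:
# video: (frames, height, width, channels), audio: (channels, samples)
video = np.zeros((40, 256, 256, 3), dtype=np.uint8)   # 5 s at 8 fps (assumed rate)
audio = np.zeros((2, 5 * 16000), dtype=np.float32)    # 5 s stereo at 16 kHz (assumed rate)

def check_sample(video: np.ndarray, audio: np.ndarray) -> bool:
    """Verify a sample matches the target format: 256x256 video, stereo audio."""
    assert video.shape[1:3] == (256, 256), "video frames must be 256x256"
    assert audio.shape[0] == 2, "audio must be stereo (2 channels)"
    return True
```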
📁 Dataset
For this task, we are using a customised dataset named SVGSA24 derived from the STARSS23 dataset, where the original videos with an equirectangular view and Ambisonics audio have been converted to videos with a perspective view and stereo audio. Additionally, we have curated content focusing on on-screen speech and instrument sounds.
Some examples are shown below.
Content: The dataset predominantly features humans and musical instruments, providing a diverse range of audio-visual scenarios, including human speech and a variety of musical sounds.
Audio: The audio component is captured using high-quality microphones, ensuring that the sounds accurately reflect the activities depicted in the video. This includes speech, musical notes, and background sounds.
Video: The video component includes visual data from different angles. The duration of a video is set to 5 seconds.
Split: We release the development set to the public and keep the evaluation set for the challenge evaluation. The evaluation set serves as the target distribution for quantifying the quality of generated video and audio.
📊 Evaluation Metrics
In this challenge, we have chosen to use the following evaluation scores for quality measurement:
- Fréchet Video Distance (FVD Score): To assess the quality of generated videos.
- Fréchet Audio Distance (FAD Score): To assess the quality of generated audio.
We have prepared an isolated dataset for evaluation.
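Both FVD and FAD follow the same recipe: fit a Gaussian to embeddings of the real and generated samples (extracted with a pretrained video or audio network, respectively) and compute the Fréchet distance between the two Gaussians. The following is a minimal NumPy sketch of that distance, assuming the embeddings have already been extracted; it is not the official scoring code:

```python
import numpy as np

def _sqrtm_psd(mat: np.ndarray) -> np.ndarray:
    # Matrix square root of a symmetric positive semi-definite matrix.
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    FVD and FAD both take this form; they differ only in the embedding
    network used to produce `feats_real` and `feats_gen`.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    # Tr((S1 S2)^{1/2}) via the symmetric form (S1^{1/2} S2 S1^{1/2})^{1/2}.
    s1_half = _sqrtm_psd(s1)
    covmean = _sqrtm_psd(s1_half @ s2 @ s1_half)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```

Identical distributions give a distance near zero; shifting one distribution increases the mean term of the distance.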
To quantify spatial alignment, we use the following metric:
- Spatial AV-Align: a novel metric for evaluating spatial alignment, built on pretrained object detection and sound event localization and detection (SELD) models.
To be specific, the metric works as follows:
- We first detect candidate positions of sounding objects per frame in each modality separately.
- Then, for each position detected in the audio, we validate whether a corresponding position is also detected in the video.
- We check whether a SELD result overlaps an object detection result: if there is an area of overlap, it counts as a true positive (TP); if not, as a false negative (FN). When matching a SELD result in a video frame, we also allow object detection results from the t-1 and t+1 frames. (We don't validate whether each position in the video is detected in the audio, because the dataset includes people who don't talk or play instruments.)
- Finally, we calculate a recall metric as the alignment score, ranging between zero and one: given TP and FN, the alignment score is defined as TP / (TP + FN).
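The steps above can be sketched as follows. This is an illustrative re-implementation, not the official scoring code; it assumes the SELD outputs and object detections have already been converted to per-frame bounding boxes:

```python
def _overlaps(box_a, box_b):
    """Axis-aligned boxes as (x1, y1, x2, y2); True if they intersect."""
    return (box_a[0] < box_b[2] and box_b[0] < box_a[2]
            and box_a[1] < box_b[3] and box_b[1] < box_a[3])

def spatial_av_align(seld_boxes, det_boxes):
    """Recall-style alignment score, as described above.

    seld_boxes[t]: boxes derived from SELD outputs at frame t.
    det_boxes[t]:  object-detection boxes at frame t.
    An audio box counts as TP if it overlaps any visual box at frame
    t-1, t, or t+1; otherwise it is an FN. Returns TP / (TP + FN).
    """
    tp = fn = 0
    n = len(seld_boxes)
    for t in range(n):
        # Pool object detections from the neighbouring frames as well.
        candidates = []
        for u in (t - 1, t, t + 1):
            if 0 <= u < n:
                candidates.extend(det_boxes[u])
        for audio_box in seld_boxes[t]:
            if any(_overlaps(audio_box, v) for v in candidates):
                tp += 1
            else:
                fn += 1
    return tp / (tp + fn) if (tp + fn) else 0.0
```

For example, with one matched audio box in the first frame and one unmatched audio box in the second, the score is 0.5.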
💰 Prizes
Spatial Alignment Track
Total Prize: 17,500 USD
- 🥇 1st: 10,000 USD
- 🥈 2nd: 5,000 USD
- 🥉 3rd: 2,000 USD
More details about the leaderboards and prizes will be announced soon. Please refer to the Challenge Rules for details about the open-sourcing criteria each leaderboard must satisfy to be eligible for the associated prizes.
💪 Getting Started
Make your first submission to the challenge using this easy-to-follow starter kit.
📈 Baseline System
You can find the baseline pretrained models in the starter kit. The baseline consists of a base diffusion model that generates samples at 64x64 and a super-resolution model that upsamples the generated video to 256x256. Please note that we might add more baselines during the challenge. Below we provide the scores for our baseline model, which is based on MM-Diffusion; in this implementation, we account for stereo audio and train the model on the introduced SVGSA24 dataset.
Below are some generated results (256x256) using our provided pretrained models (inference with DPM-Solver).
| | FVD ↓ | FAD ↓ | Spatial AV-Align ↑ |
|---|---|---|---|
| Baseline | 1050.3 | 9.65 | 0.48 |
| Ground Truth | 572.05 | 3.70 | 0.92 |
📅 Timeline
The SVG Challenge takes place in two rounds, with an additional warm-up round. The tentative launch dates are:
- Warmup Round: 29th Oct 2024
- Phase I: 18th Nov 2024
- Phase II: 16th Dec 2024
- Challenge End: 14th Feb 2025
📝 Citing the Challenge
If you are participating in this challenge and/or using the datasets involved, consider citing the following paper:
🌱 Challenge Organising Committee
Sounding Video Generation - Spatial Alignment Track
Kazuki Shimada, Christian Simon, Shusuke Takahashi, Shiqi Yang, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Yuki Mitsufuji (Sony)