
Problem Statements

Temporal Alignment Track

Generate Videos with Temporal and Semantic Audio Sync

Spatial Alignment Track

Create Videos with Spatially Aligned Stereo Audio


 

⏰ The challenge is now live! 

💻 Don't know where to start? Check out the starter-kits for Temporal Alignment Track and Spatial Alignment Track.

Generate Synchronized & Contextually Accurate Videos

Welcome to the Sounding Video Generation (SVG) Challenge 2024!

The Sounding Video Generation (SVG) Challenge 2024 is a competition to create AI models that make videos where the visuals match perfectly with sounds, like a dog barking in sync with the video. Participants will work to improve how well sounds and scenes align, with prizes for the best results.

This challenge invites you to build models that generate synchronized and contextually accurate videos. You can showcase your skills and push the boundaries of sounding video generation in two tracks:

  1. Temporal Alignment 
  2. Spatial Alignment

📜 Introduction

Video generation research has progressed significantly, with large-scale diffusion models producing realistic videos. However, sounding video generation, which involves well-aligned video and audio modalities, remains underexplored. The SVG Challenge aims to advance this field by providing a platform for benchmarking and showcasing state-of-the-art models.

🎥 The Sounding Video Generation Challenge

Build state-of-the-art AI models to generate videos, ensuring the audio is synchronized and contextually appropriate.



⏰ Temporal Alignment Track

This track aims to generate videos that are temporally and semantically aligned with their corresponding audio. This involves producing high-resolution videos (256x256 pixels, 8fps) with monaural audio (1 channel, 16kHz).

You will tackle two types of alignment:

  1. Semantic Alignment: The audio’s semantic class should match the video. For instance, if the video shows a dog barking, the audio should contain a barking sound.

  2. Temporal Alignment: The audio should be synchronized with the video. For example, the barking sound should occur precisely when the dog is seen barking.

In this track, submissions are evaluated on how well the audio and video synchronize over time. Participants train on a customized dataset named SVGTA24, derived from the Greatest Hits dataset and supplied with prepared video captions. A baseline model based on AnimateDiff and AudioLDM is provided. Submissions are tested on a set of text prompts to assess synchronization.

More details are available on the Temporal Alignment Track page.

🌐 Spatial Alignment Track

This track aims to create videos with spatially aligned audio, giving a sense of space and direction. This involves producing high-resolution videos (256x256 pixels, 4fps) with stereo audio (2 channels, 16kHz).
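Both tracks specify a concrete output format: 256x256 video at 8 fps with mono 16 kHz audio for Temporal Alignment, and 256x256 video at 4 fps with stereo 16 kHz audio for Spatial Alignment. The following is a minimal sketch of a local sanity check against those numbers. The array layouts, the function name `check_sample`, the dictionary `TRACK_SPECS`, and the duration tolerance are assumptions made for illustration; the actual submission format is defined by the starter kits.

```python
import numpy as np

# Output formats as stated on this page. Key names are illustrative only.
TRACK_SPECS = {
    "temporal": {"height": 256, "width": 256, "fps": 8, "channels": 1, "sample_rate": 16000},
    "spatial": {"height": 256, "width": 256, "fps": 4, "channels": 2, "sample_rate": 16000},
}


def check_sample(video, audio, track, tol_seconds=0.05):
    """Return a list of format problems for one generated sample.

    Assumed layouts (not taken from the challenge rules):
      video: (frames, height, width, 3)
      audio: (channels, samples)
    """
    spec = TRACK_SPECS[track]
    problems = []

    if video.ndim != 4 or video.shape[1:3] != (spec["height"], spec["width"]):
        problems.append(f"video shape {video.shape} is not (T, {spec['height']}, {spec['width']}, 3)")

    if audio.ndim != 2 or audio.shape[0] != spec["channels"]:
        problems.append(f"audio shape {audio.shape} does not have {spec['channels']} channel(s)")

    # Compare durations only when both shapes are valid.
    if not problems:
        video_seconds = video.shape[0] / spec["fps"]
        audio_seconds = audio.shape[1] / spec["sample_rate"]
        if abs(video_seconds - audio_seconds) > tol_seconds:
            problems.append(f"duration mismatch: video {video_seconds:.2f}s, audio {audio_seconds:.2f}s")

    return problems


# Example: a 2-second clip in the Temporal Alignment format.
video = np.zeros((16, 256, 256, 3), dtype=np.uint8)
audio = np.zeros((1, 32000), dtype=np.float32)
issues = check_sample(video, audio, "temporal")
```

An empty list means the sample matches the stated resolution, channel count, and an equal audio and video duration; it says nothing about alignment quality.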

Participants should focus on generating videos where the spatial alignment of the audio enhances the sense of space and direction, ensuring that the audio and video components are well-integrated.

Participants will use a customized SVGSA24 dataset derived from the STARSS23 dataset, in which the original videos with an equirectangular view and Ambisonics audio have been converted to videos with a perspective view and stereo audio. The content has additionally been curated to focus on on-screen speech and instrument sounds. Participants train on this dataset and submit systems that generate video and 2-channel audio signals. A baseline model based on MM-Diffusion is provided. Evaluation will consider how well the generated video and audio align spatially.
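To give a feel for what converting an equirectangular view to a perspective view involves, here is a minimal NumPy sketch of a standard pinhole (gnomonic) resampling with nearest-neighbor lookup. It is not the organizers' conversion pipeline, which is not described on this page; the function name `equirect_to_perspective`, the 90-degree field of view, and the axis conventions (x right, y down, z forward) are all assumptions chosen for illustration.

```python
import numpy as np


def equirect_to_perspective(equi, fov_deg=90.0, yaw_deg=0.0, pitch_deg=0.0, out_size=256):
    """Sample a square perspective view from an equirectangular frame of shape (H, W, C)."""
    h, w = equi.shape[:2]

    # Focal length in pixels for the requested field of view.
    f = 0.5 * out_size / np.tan(np.radians(fov_deg) / 2.0)

    # Pixel grid centered on the optical axis: x right, y down, z forward.
    coords = np.arange(out_size) - (out_size - 1) / 2.0
    x, y = np.meshgrid(coords, coords)
    dirs = np.stack([x, y, np.full_like(x, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate the viewing rays: pitch about the x axis, then yaw about the y axis.
    p, t = np.radians(pitch_deg), np.radians(yaw_deg)
    rot_x = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    rot_y = np.array([[np.cos(t), 0, np.sin(t)], [0, 1, 0], [-np.sin(t), 0, np.cos(t)]])
    dirs = dirs @ (rot_y @ rot_x).T

    # Convert rays to longitude/latitude, then to source pixel indices.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])          # 0 = image center column
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))     # positive = downward
    u = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w   # wraps horizontally
    v = np.clip(((lat / np.pi + 0.5) * h).astype(int), 0, h - 1)

    return equi[v, u]


# Example: a synthetic frame whose channels encode column and row indices.
equi = np.zeros((64, 128, 3))
equi[..., 0] = np.arange(128)[None, :]
equi[..., 1] = np.arange(64)[:, None]
front = equirect_to_perspective(equi)
right = equirect_to_perspective(equi, yaw_deg=90.0)
```

Under these conventions, a yaw of 0 looks at the center of the equirectangular frame and a positive yaw turns the view to the right. A production pipeline would typically use bilinear interpolation instead of nearest-neighbor lookup.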

More details are available on the Spatial Alignment Track page.

🗓 Timeline

The SVG Challenge takes place in two rounds, with an additional warm-up round. The tentative launch dates are:

  • Warmup Round: 29th Oct 2024
  • Phase I: 2nd Dec 2024
  • Phase II: 3rd Jan 2025
  • Challenge End: 25th Mar 2025

🏆 Prizes

The total prize pool is $35,000, divided between the two tracks. Teams can win prizes across multiple leaderboards.

Track 1: Temporal Alignment ($17,500)

  • First place: $10,000

  • Second place: $5,000

  • Third place: $2,500
     

Track 2: Spatial Alignment ($17,500)

  • First place: $10,000

  • Second place: $5,000

  • Third place: $2,500

Please refer to the Challenge Rules for more details on the Open Sourcing criteria for eligibility.

Leaderboard

01 lljjol 6.000
01 binhnk 6.000
01 kcy4 6.000
02 christian.simon 11.000
03 akio_hayakawa 13.000