The workshop features four distinct challenge tracks focusing on media generation and transmission with GAI. Tracks 1–3 target reducing computation and transmission for efficient media delivery, while Track 4 targets controlled novel content creation.
A large-scale multi-modality multi-view dataset, named M3VIR, is provided, featuring computer-synthesized virtual content. M3VIR comprises two subsets: a multi-resolution subset, M3VIR_MR, for Tracks 1–3, and a multi-style subset, M3VIR_MS, for Track 4. The entire M3VIR dataset covers 100 scenes from 10 categories (examples shown below), with 10 scenes in each category. Unreal Engine 5 (UE5) is employed to simulate, for each scene, a variety of videos with matching content that serve as ground truth for the competition tasks.
GAI has reshaped media delivery solutions for gaming and entertainment by reducing rendering and transmission costs. For example, in cloud gaming, server-side computation and transmission needs can be largely reduced by rendering low-resolution (LR) frames and computing high-resolution (HR) frames on the client side. For immersive applications, only reference views need to be rendered by the server, and the remaining views are computed by the client. NVIDIA's Deep Learning Super Sampling (DLSS) commercializes such solutions through a suite of GAI-based tools, including multi-frame generation, enhanced ray reconstruction, and super resolution (SR). The key to its success is large-scale ground-truth LR-HR or multi-view simulated data for training. In contrast, the research community generally uses pseudo training data for restoration tasks, e.g., downsampling and degrading HR data to generate paired pseudo LR data for SR research. Due to the characteristics and limitations of the rendering process, such pseudo data do not match real simulated data, leading to inferior performance. Therefore, M3VIR_MR provides ground-truth LR-HR paired frames to facilitate restoration research in Tracks 1–3.
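The pseudo-data pipeline mentioned above typically amounts to something as simple as the following sketch (a minimal illustration with box downsampling as a stand-in for bicubic; the function name is hypothetical). Frames rendered natively at low resolution differ from such downsampled frames in aliasing, shading, and detail, which is precisely the gap that ground-truth LR-HR pairs close.

```python
import numpy as np

def make_pseudo_lr(hr: np.ndarray, scale: int = 2) -> np.ndarray:
    """Generate a pseudo LR frame by box-downsampling an HR frame.

    hr: (H, W, C) float array; H and W must be divisible by `scale`.
    Real engine-rendered LR frames are NOT equivalent to this output,
    which is why pseudo-trained SR models underperform on them.
    """
    h, w, c = hr.shape
    return hr.reshape(h // scale, scale, w // scale, scale, c).mean(axis=(1, 3))

hr = np.random.rand(1080, 1920, 3)   # stand-in for a 1920x1080 HR frame
lr = make_pseudo_lr(hr, scale=2)     # (540, 960, 3), i.e., 960x540
```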
Track 1 supports the media delivery solution of transferring LR frames and restoring HR frames on the client. The segmentation maps and depth maps can be leveraged to enhance performance. M3VIR_MR supports 2x and 3x SR, i.e., from 960x540 to 1920x1080 and from 960x540 to 2880x1620, respectively.
Track 2 supports the solution of transferring part of the multi-view frames and generating the remaining frames on the client. The task is to synthesize intermediate RGB frames from a sparse set of reference RGB frames in multi-view videos. Only the 600 sets of 1920x1080 videos with static scenes in M3VIR_MR are used for Track 2.
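Since the dataset provides per-frame depth maps and camera parameters, a natural starting point for this view-synthesis task is depth-based warping: back-project a reference-view pixel to 3D using its depth and intrinsics, then reproject it into the target view. A minimal single-pixel sketch (matrix values are illustrative, not taken from the dataset):

```python
import numpy as np

def warp_pixel(uv, depth, K_ref, K_tgt, T_ref2tgt):
    """Warp one pixel from a reference view into a target view.

    uv: (u, v) pixel coordinates in the reference image.
    depth: metric depth of that pixel (as given by a depth map).
    K_ref, K_tgt: 3x3 intrinsic matrices of the two cameras.
    T_ref2tgt: 4x4 rigid transform from reference to target camera,
               derivable from the 6-DoF extrinsics of both views.
    """
    u, v = uv
    # Back-project to a 3D point in the reference camera frame.
    xyz = depth * (np.linalg.inv(K_ref) @ np.array([u, v, 1.0]))
    # Move the point into the target camera frame.
    xyz_t = (T_ref2tgt @ np.append(xyz, 1.0))[:3]
    # Project into the target image plane.
    p = K_tgt @ xyz_t
    return p[:2] / p[2]

# Illustrative intrinsics for a 1920x1080 view (hypothetical values).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
# With an identity transform, the pixel maps back onto itself.
same = warp_pixel((100.0, 200.0), 5.0, K, K, np.eye(4))
```

Competitive solutions will of course go well beyond single-pixel warping (occlusion handling, learned refinement), but this geometry underlies most reference-based view synthesis.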
Track 3 combines Tracks 1 and 2 to support the solution of transferring part of the multi-view LR frames and generating all HR frames on the client. Track 3 uses the same 600 sets of static-scene videos in M3VIR_MR as Track 2, but at all 3 resolutions to support 2x and 3x SR.
Algorithms for Tracks 1–3 will be evaluated with objective metrics including PSNR, SSIM, LPIPS, and FID. The corresponding segmentation maps and depth maps can be leveraged to improve performance. The amount of additional segmentation and/or depth information used will also be considered in the evaluation (the more information used, the more transmission bits consumed).
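Of the metrics listed, PSNR is the simplest to sanity-check locally before submission. A self-contained reference implementation (SSIM, LPIPS, and FID require dedicated libraries and are omitted here):

```python
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference frame and a
    reconstruction, both uint8-range arrays of the same shape."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((4, 4), dtype=np.uint8)
rec = np.full((4, 4), 16, dtype=np.uint8)
val = psnr(ref, rec)  # roughly 24.05 dB for a constant offset of 16
```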
GAI holds great promise for generating fast and accessible videos to train vision models, which is particularly important for robotics and embodied AI, where real-world data are scarce and expensive to obtain. Controlling the generated content to ensure spatial-temporal consistency and physical accuracy is crucial for such applications. While text-to-image generation has been successful, text-to-video generation is inherently challenging due to the limitations of text in describing video content. In contrast, multi-modal guidance combining both visual and text descriptions is more accurate and efficient.
Track 4 focuses on controlled video generation using ground-truth data from M3VIR_MS. The goal is to edit specific objects in a photo-realistic video by changing their style (to cartoon or metallic) in a spatial-temporally consistent manner. Track 4 focuses on several foreground object categories: people, animals, cars, tables, couches/chairs, lights/lamps, etc. M3VIR_MS enables training and evaluating content-editing methods with ground-truth paired data. To reduce the difficulty of this challenging task and accommodate different possible solutions, only the 18000 data samples corresponding to the static scenes are used for evaluation.
Performance will be evaluated with both objective quality metrics, including PSNR, SSIM, LPIPS, and FID, and temporal consistency metrics such as deep video prior.
Two datasets will be used for the challenges, each with a training (80 scenes) and a test (20 scenes) partition. Due to the large size, a small mini training set will be provided for each track, and participants can optionally use the full training dataset to enhance performance. The training set will be released in 4 batches of 20 scenes each, scheduled around April 5, April 19, May 3, and May 17. The test set will be released in June.
For each of the 100 scenes, 3 multi-modal multi-view data packages are collected: dynamic scene with static camera, static scene with moving camera, and dynamic scene with moving camera. Each data package comprises 6 sets of temporally synchronized RGB videos from co-located cameras with 6 views. Each of the 6 sets further has videos at 3 resolutions: 960x540, 1920x1080, and 2880x1620. In addition to the RGB images, the corresponding pixel-level synchronized semantic segmentation maps and depth maps are also provided. Each video is 2 seconds long at 15 fps. In total, M3VIR_MR has 54000 data samples from 1800 sets of videos, each data sample consisting of matching RGB images, segmentation maps, and depth maps at 3 resolutions. The corresponding intrinsic and 6-DoF extrinsic camera parameters for each frame are also provided. One data sample is shown below. 80 scenes are used for training and 20 scenes for testing.
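The composition of a data sample described above can be summarized as a small record type. This is purely illustrative: the field names and in-memory layout are assumptions, not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

import numpy as np

Resolution = Tuple[int, int]  # (width, height)
RESOLUTIONS = [(960, 540), (1920, 1080), (2880, 1620)]

@dataclass
class M3VIRSample:
    """One M3VIR_MR data sample: matched RGB, segmentation, and depth
    at three resolutions, plus per-frame camera parameters."""
    rgb: Dict[Resolution, np.ndarray]           # (H, W, 3) per resolution
    segmentation: Dict[Resolution, np.ndarray]  # (H, W) semantic label map
    depth: Dict[Resolution, np.ndarray]         # (H, W) depth map
    intrinsics: np.ndarray                      # 3x3 camera matrix
    extrinsics: np.ndarray                      # 4x4 6-DoF camera pose

# Build a placeholder sample with the documented resolutions.
sample = M3VIRSample(
    rgb={(w, h): np.zeros((h, w, 3), np.uint8) for w, h in RESOLUTIONS},
    segmentation={(w, h): np.zeros((h, w), np.uint8) for w, h in RESOLUTIONS},
    depth={(w, h): np.zeros((h, w), np.float32) for w, h in RESOLUTIONS},
    intrinsics=np.eye(3),
    extrinsics=np.eye(4),
)
```

The sample counts follow directly from this structure: 1800 video sets (100 scenes x 3 packages x 6 views) x 30 frames (2 s at 15 fps) = 54000 samples.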
For Track 1, the mini training set comprises a random 5% sample of the data (2160 of the 43200 training data samples); each data sample consists of matching RGB images, depth maps, and segmentation maps at 3 resolutions.
Track 1: Full training set: [To be provided]
Track 1: Mini training set: [To be provided]
For Track 2, only the 600 videos corresponding to static scenes with a moving camera at 1080p resolution are used. The mini training set comprises 1/3 of the data samples (4800 of the 14400 training data samples), using only the first 10 frames of each video. Each data sample consists of matching RGB images, depth maps, and segmentation maps at 1080p.
Track 2: Full training set: [To be provided]
Track 2: Mini training set: [To be provided]
For Track 3, only the 600 videos corresponding to static scenes with a moving camera are used. The mini training set comprises 1/6 of the data samples (2400 of the 14400 training data samples), using only the first 5 frames of each video. Each data sample consists of matching RGB images, depth maps, and segmentation maps at 3 resolutions.
Track 3: Full training set: [To be provided]
Track 3: Mini training set: [To be provided]
The multi-style M3VIR_MS dataset aims to facilitate research on controlled video generation. From the 54000 data samples in M3VIR_MR above, the videos at 1920x1080 resolution are taken, and for each video, a cartoon-style video and a metallic-style video are rendered with the same geometry. That is, M3VIR_MS contains 54000 data samples, each comprising 3 videos with matching geometry at the frame level in 3 styles (photo-realistic, cartoon, and metallic), as well as the corresponding segmentation maps, depth maps, and camera intrinsic and extrinsic parameters. An example is shown below. As with M3VIR_MR, 80 scenes are used for training and 20 scenes for testing.
For Track 4, the mini training set comprises 1/6 of the data samples (7200 of the 43200 training data samples), using only the first 5 frames of each video. Each data sample consists of matching RGB images, depth maps, and segmentation maps in 3 styles.
Full training set: [To be provided]
Mini training set: [To be provided]
For all tracks, the dataset also contains the camera intrinsic and extrinsic parameters.
Test sets: [To be provided]