ACCV 2024 Workshop on
Rich Media with Generative AI
Overview
The goal of this workshop is to showcase the latest developments in generative AI for creating, editing, restoring, and compressing rich media such as images, videos, neural radiance fields, and 3D scene properties. Generative AI models, such as GANs and diffusion models, have enabled remarkable achievements in rich media in both academic research and industrial applications. For instance, cloud-based video gaming is a booming industry with an expected global market value of over $12 billion by 2025. Generative AI is transforming the gaming industry by enabling anyone to build and design games without professional artistic or technical skills, empowering immeasurable market growth.
Building on the success of the 1st RichMediaGAI Workshop @ WACV 2024, the 2nd RichMediaGAI Workshop @ ACCV 2024 expands its scope by organizing competitions with industry-level data, soliciting paper submissions, and continuing to invite top-tier speakers from both industry and academia.
Important Dates + Author Guidelines
Author Guidelines: Formatting, Page Limits, Author Kits, and Submission Policies follow the ACCV 2024 Author Guidelines
Challenges Data Available: | August 6, 2024, 11:59 PM PST |
Regular Paper Submission Deadline: | Extended to September 27, 2024, 11:59 PM PST |
Challenges Results and Reports Submission Deadline: | Extended to September 27, 2024, 11:59 PM PST |
Submission Site: | CMT Submission Site |
Paper Reviews and Decision Notification: | October 4, 2024, 11:59 PM PST |
Challenges Results and Decision Notification: | October 4, 2024, 11:59 PM PST |
Camera-Ready Deadline: | October 10, 2024, 11:59 PM PST |
1. Regular Paper Submissions
Papers addressing topics related to image/video restoration, compression, enhancement, and manipulation using generative AI technologies are welcome. Topics include but are not limited to:
- Restoration and enhancement of rich media with generative AI
- Editing and manipulation of rich media with generative AI
- Compression and codec design with generative AI
- Modeling and rendering rich media with generative AI
- Neural radiance fields with generative AI
- Rich media creation and editing with large language models
- Acceleration of generative AI models on edge devices
All papers must be uploaded to the submission site by the deadline. There is no rebuttal phase for this call. Reviews and paper decisions will be sent to the authors on the dates specified above.
2. Challenges
Cloud gaming poses tremendous challenges for compression and transmission. To avoid delay and bandwidth overload, high-quality frames must be heavily compressed with very low latency. Traditional codecs like H.264/H.265/H.266, as well as recent neural video codecs targeting natural videos, generally do not perform well on gaming content.
Generative AI technologies, e.g., super-resolution, image synthesis, and rendering, can largely alleviate these transmission issues. Server-side computation and transmission can be reduced by leveraging the computation power of client devices. For example, the server can render and transmit low-resolution (LR) frames, and high-resolution (HR) frames can be computed on the client side. In multiview gaming, the server can render and transmit a subset of views, and the remaining views can be computed by client devices. NVIDIA's Deep Learning Super Sampling (DLSS) has commercialized this idea, and one key factor in its success is the large-scale ground-truth LR-HR and multiview gaming data used for training.
In comparison, the research community uses pseudo training data for many restoration tasks. For super-resolution, for example, LR data is generated from HR data by downsampling and adding degradations such as noise and blur. Such pseudo data do not match real gaming data: true LR gaming frames are high-quality, sharp, and clear, without noise or blur. Gaming frames do contain unnatural visual effects and object movements, but with limited motion blur, unlike captured natural videos. Ground-truth gaming data is therefore needed for effective training.
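For context, the pseudo-LR pipeline described above can be sketched as follows. This is a minimal illustration of the common blur-downsample-noise recipe; the kernel, scale, and noise parameters are illustrative assumptions, not any challenge's actual pipeline:

```python
import numpy as np

def make_pseudo_lr(hr, scale=2, blur_sigma=1.0, noise_sigma=5.0, seed=0):
    """Typical pseudo-LR generation for SR research: Gaussian blur,
    downsample, additive noise. Real rendered gaming LR frames are
    sharp and noise-free, so data like this mismatches them.
    `hr` is a 2D uint8 array (single channel for brevity)."""
    rng = np.random.default_rng(seed)
    # Separable Gaussian kernel (illustrative degradation model).
    radius = max(1, int(3 * blur_sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * blur_sigma**2))
    kernel /= kernel.sum()
    img = hr.astype(np.float64)
    img = np.apply_along_axis(np.convolve, 1, img, kernel, mode="same")
    img = np.apply_along_axis(np.convolve, 0, img, kernel, mode="same")
    lr = img[::scale, ::scale]                        # naive downsample
    lr = lr + rng.normal(0.0, noise_sigma, lr.shape)  # sensor-like noise
    return np.clip(lr, 0, 255).astype(np.uint8)
```

Ground-truth gaming data, by contrast, pairs an actually rendered LR frame with its rendered HR counterpart, with no synthetic degradation involved.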
In this competition, a large computer-synthesized ground-truth dataset is provided, targeting two different applications:
- Track 1: Super Resolution in Deferred Rendering
- Mini training set: For each paired video sequence, 5 paired frames are randomly selected to form the mini dataset.
- Full training set
- Code scripts to read data and compute evaluation metrics
- Track 2: Multiview Video Frame Synthesis
- Mini training set: For each paired video sequence, 5 paired frames are randomly selected to form the mini dataset.
- Full training set
- Code scripts to read data and compute evaluation metrics
Track 1 aims to restore HR images from LR images along with additional GBuffers from the deferred rendering stage (i.e., segmentation maps and depth maps), supporting a gaming solution that transmits LR images with assistive information using fewer bits and restores HR images on the client side.
The dataset has 320 LR-HR paired sequences at 1440p and 720p. Each sequence has 60 frames, totaling 19,200 LR-HR paired frames. The sequences are rendered by the open-source CARLA simulator with the Unreal Engine. The paired sequences capture 3D scenes from 8 different towns (20 static scenes and 20 dynamic scenes per town). Corresponding paired segmentation maps and depth maps, synchronized with the RGB images, are also provided. Data from 7 towns form the training set, and data from the remaining town form the test set.
Given the LR RGB images (720p) and additional GBuffers as input, the task is to develop algorithms to recover the HR RGB images (1440p).
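One common way to exploit the GBuffers is to normalize them and stack them with the LR RGB frame as extra input channels for a restoration network. This setup is an assumption for illustration; the challenge does not prescribe any particular architecture, and `build_sr_input` is a hypothetical helper name:

```python
import numpy as np

def build_sr_input(lr_rgb, seg_map, depth_map):
    """Stack an LR RGB frame (H x W x 3, uint8) with its segmentation
    map and depth map (each H x W) into one H x W x 5 float32 tensor.
    Channel concatenation is one assumed design, not the only option."""
    assert lr_rgb.shape[:2] == seg_map.shape == depth_map.shape
    rgb = lr_rgb.astype(np.float32) / 255.0
    seg = seg_map[..., None].astype(np.float32) / 255.0  # class ids scaled
    depth = depth_map[..., None].astype(np.float32)
    depth = depth / max(float(depth.max()), 1e-6)        # normalize to [0, 1]
    return np.concatenate([rgb, seg, depth], axis=-1)
```

The resulting 5-channel tensor would then be fed to an SR model that upscales 720p inputs to 1440p outputs.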
Algorithms will be evaluated on 4 objective quality metrics: PSNR and SSIM measure pixel-level distortion; LPIPS and FID measure perceptual quality. Specifically, assuming there are N methods, they are ranked according to each metric, and each method receives a ranking score in [1, 2N-1] (1st place 2N-1, 2nd place 2N-3, ..., last place 1). The average ranking score over all 4 metrics is the overall score used to rank the N methods. The amount of additional GBuffers used is also considered (the more GBuffers used, the more bits consumed): if two methods have similar overall scores, the one consuming fewer bits ranks higher.
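As a concrete sketch of this scoring scheme (the same scheme is used for both tracks), the functions below follow the stated rules; the function names are my own, and tie-breaking is not specified by the challenge, so ties here simply fall back to input order:

```python
def ranking_scores(metric_values, higher_is_better):
    """Per-metric ranking score: with N methods, the best gets 2N-1,
    the second 2N-3, ..., the last 1 (step of 2)."""
    n = len(metric_values)
    order = sorted(range(n), key=lambda i: metric_values[i],
                   reverse=higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):  # rank 0 = best
        scores[idx] = 2 * n - 1 - 2 * rank
    return scores

def overall_scores(psnr, ssim, lpips, fid):
    """Average ranking score over the 4 metrics. PSNR/SSIM: higher is
    better; LPIPS/FID: lower is better."""
    per_metric = [
        ranking_scores(psnr, higher_is_better=True),
        ranking_scores(ssim, higher_is_better=True),
        ranking_scores(lpips, higher_is_better=False),
        ranking_scores(fid, higher_is_better=False),
    ]
    n = len(psnr)
    return [sum(m[i] for m in per_metric) / 4 for i in range(n)]
```

For example, with N = 3 methods, the per-metric scores are drawn from {5, 3, 1}, and a method that places first on every metric ends with an overall score of 5.0.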
The Hugging Face download includes the items listed above: the mini training set, the full training set, and code scripts to read data and compute evaluation metrics.
Challenge result submission: participants must submit the recovered HR RGB images for the test set for evaluation. A download link to the results must be provided by the deadline.
Track 2 aims to synthesize intermediate frames from a sparse set of input frames in multiview videos, along with camera intrinsic and extrinsic parameters and additional segmentation maps and depth maps, supporting a multiview gaming solution that transmits part of the multiview frames with assistive information using fewer bits and generates the remaining frames on the client side.
The dataset contains 160 sets of sequences rendered by the CARLA simulator. Each set consists of static 3D scenes captured by six cameras mounted on top of a car moving within one of the 8 towns. Each town has 20 sets of sequences, and each sequence has 60 frames, totaling 57,600 frames. Corresponding segmentation maps and depth maps, synchronized with the RGB images, are also provided.
For each set of sequences, a subset of multiview frames will be randomly selected as input, and the task is to synthesize the remaining frames.
Algorithms will be evaluated on 4 objective metrics: PSNR and SSIM measure pixel-level distortion; LPIPS and FID measure perceptual quality. Specifically, assuming there are N methods, they are ranked according to each metric, and each method receives a ranking score in [1, 2N-1] (1st place 2N-1, 2nd place 2N-3, ..., last place 1). The average ranking score over all 4 metrics is the overall score used to rank the N methods. The amount of additional GBuffers used is also considered (the more GBuffers used, the more bits consumed): if two methods have similar overall scores, the one consuming fewer bits ranks higher.
The Hugging Face download includes the items listed above: the mini training set, the full training set, and code scripts to read data and compute evaluation metrics.
Challenge result submission: participants must submit the synthesized RGB frames for the test set for evaluation. A download link to the results must be provided by the deadline.
Winners will be announced at the RichMediaGAI workshop, and the top 3 non-corporate winners of each track will receive prizes of $2,000 (1st), $1,000 (2nd), and $500 (3rd). Winners are invited to submit a paper to the RichMediaGAI workshop through the paper submission system. To be accepted, each paper must be a self-contained description of the method, detailed enough to reproduce the results. Paper submissions must follow the ACCV 2024 Author Guidelines.
3. Invited Talks
Nanyang Technological University
Chen Change Loy is a President's Chair Professor with the College of Computing and Data Science, Nanyang Technological University, Singapore. He is the Lab Director of MMLab@NTU and Co-associate Director of S-Lab. Prior to joining NTU, he served as a Research Assistant Professor at the MMLab of The Chinese University of Hong Kong, from 2013 to 2018. His research interests include computer vision and deep learning with a focus on image/video restoration and enhancement, generative tasks, and representation learning. He serves as an Associate Editor of the International Journal of Computer Vision (IJCV), IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), and Computer Vision and Image Understanding (CVIU). He also serves/served as an Area Chair of top conferences such as ICCV, CVPR, ECCV, ICLR and NeurIPS. He will serve as the Program Co-Chair of CVPR 2026. He is a senior member of IEEE.
Dong Tian is a Senior Director with InterDigital, Inc. He has been actively contributing to MPEG industry standards and academic communities for more than 20 years. He holds 30+ U.S.-granted patents and has 50+ recent publications in top-tier journals, transactions, and conferences. His research interests include image processing, 3D video, point cloud processing, and deep learning. He has served as Chair of MPEG-AI and MPEG 3DGH on AI-Based Graphics Coding since 2021, Chair of the MSA TC from 2023 to 2025, and General Co-Chair of MMSP'20 and MMSP'21.
Northeastern University
Yanzhi Wang is currently an associate professor and faculty fellow in the Dept. of ECE at Northeastern University, Boston, MA. His research interests focus on model compression and platform-specific acceleration of deep learning applications. His work has been published broadly in top conference and journal venues (e.g., DAC, ICCAD, ASPLOS, ISCA, MICRO, HPCA, PLDI, ICS, PACT, ISSCC, AAAI, ICML, NeurIPS, CVPR, ICLR, IJCAI, ECCV, ICDM, ACM MM, FPGA, LCTES, CCS, VLDB, ICDCS, RTAS, Infocom, C-ACM, JSSC, TComputer, TCAS-I, TCAD, JSAC, TNNLS, etc.). He has received six Best Paper and Top Paper Awards, and one Communications of the ACM cover-featured article. He has another 13 Best Paper Nominations and four Popular Paper Awards. He has received the U.S. Army Young Investigator Program (YIP) Award, IEEE TC-SDM Early Career Award, APSIPA Distinguished Leader Award, Massachusetts Acorn Innovation Award, Martin Essigmann Excellence in Teaching Award, Ming Hsieh Scholar Award, and other research awards from Google, MathWorks, etc.
The Chinese University of Hong Kong
Tianfan Xue is a Vice-Chancellor Assistant Professor in the Department of Information Engineering at The Chinese University of Hong Kong. His research interests include computer vision, machine learning, and computer graphics, with a focus on generative AI and neural rendering.
National University of Singapore
Mike Z. Shou is an Assistant Professor at NUS. His research focuses on computer vision and deep learning, with an emphasis on developing intelligent systems for video understanding and creation. Mike was awarded the Wei Family Private Foundation Fellowship from 2014 to 2017. He received a best student paper nomination at CVPR 2017, and his team won first place in the International Challenge on Activity Recognition (ActivityNet) 2017. He won the Singapore NRF Fellowship for his proposal titled "Towards Next-generation Video Intelligence: Training Machines to Understand Actions and Complex Events", which carries a research grant enabling early-career researchers to carry out independent research locally. Mike looks forward to developing new deep learning methods that allow machines to understand actions and complex events in videos, powering applications such as perception systems for self-driving cars, care robots for the elderly, smart CCTV cameras, social media recommendation systems, and intelligent video creation tools for journalists and filmmakers.
Google Research
Junfeng He is with Google. He has published more than 25 papers in top-tier conferences and journals such as CVPR, ICML, TPAMI, and Proceedings of the IEEE, cited more than 1,000 times. He has served as a TPC member of ACM MM, CVPR, and several other conferences.
4. Program Schedule (TBD)
5. Organizers
Futurewei Technologies
Santa Clara University
Chinese University of Hong Kong
University of California Irvine
Guard Strike
Contacts
Dataset related questions: Lebin Zhou
Paper related and other general questions: Wei Jiang, Jinwei Gu