Check out the leaderboard posted on 15 July 2024 under MMVRAC Workshop
ICME 2024
Jun LIU
Singapore University of Technology and Design
Bingquan SHEN
DSO National Laboratories and National University of Singapore
Ping HU
Boston University
Kian Eng ONG
Singapore University of Technology and Design
Haoxuan QU
Singapore University of Technology and Design
Duo PENG
Singapore University of Technology and Design
Lanyun ZHU
Singapore University of Technology and Design
Lin Geng FOO
Singapore University of Technology and Design
Xun Long NG
Singapore University of Technology and Design
Machine learning techniques have received increasing attention in recent years for solving various imaging problems, creating impact in important applications such as remote sensing, medical imaging, and smart manufacturing. This talk will first review some recent advances in machine learning from the modeling, algorithmic, and mathematical perspectives for imaging and reconstruction tasks. In particular, the speaker will share some recent work from his group on physics-driven deep learning and how it can be applied to imaging applications across different modalities. He will show how the underlying image model has evolved from signal processing to deep neural networks, while the "old" ways can still complement the new. It is critical to combine the flexibility of deep learning with the underlying imaging physics to achieve state-of-the-art results.
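The abstract above includes no code, but purely as a hedged illustration of combining a physics-based forward model with a learned prior, a plug-and-play style reconstruction loop might look like the minimal Python sketch below (the operators A and At, the denoiser, and the step size are placeholder assumptions, not the speaker's actual method):

    import numpy as np

    def reconstruct(y, A, At, denoiser, num_iters=10, step=0.5):
        """Alternate a physics-consistent gradient step with a learned prior step.

        y        : measured data
        A, At    : forward operator and its adjoint (the imaging physics)
        denoiser : a trained network used as a learned image prior
        """
        x = At(y)                          # back-projection as the initial estimate
        for _ in range(num_iters):
            x = x - step * At(A(x) - y)    # data-fidelity (physics) step
            x = denoiser(x)                # learned prior step
        return x

    # Toy usage with identity "physics" and a smoothing stand-in for the denoiser.
    y = np.random.randn(64)
    x_hat = reconstruct(y, A=lambda x: x, At=lambda x: x,
                        denoiser=lambda x: 0.9 * x + 0.1 * x.mean())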
Download Certificate of Participation
Participants who have successfully completed the Challenge (submitted code, models, etc.) should have received their Certificate of Participation through email.
The Winning Teams of each track will be presented with their Winner's Certificate at the ICME workshop.
Code Submission Terms and Conditions
Participants are to adhere to the following terms and conditions. Failure to do so will result in disqualification from the Challenge.
Given the enormous amount of multi-modal, multi-media information that we encounter in our daily lives (including visuals, sounds, texts, and interactions with our surroundings), we humans process such information with great ease: we understand and analyze, think rationally and reason logically, and then predict, make sound judgments, and reach informed decisions based on the various modalities of information available.
For machines to assist us in the holistic understanding and analysis of events, or even to reach such sophistication of human intelligence (e.g., Artificial General Intelligence (AGI) or even Artificial Super Intelligence (ASI)), they need to process visual information from real-world videos, alongside complementary audio and textual data, covering the events, scenes, objects, their attributes, actions, and interactions.
Hence, through this Grand Challenge, we hope to further advance multi-modal video reasoning and analysis for different scenarios and real-world applications, using several challenging multi-modal datasets that span a range of computer vision tasks (i.e., video grounding, spatiotemporal event grounding, video question answering, sound source localization, person reidentification, attribute recognition, pose estimation, skeleton-based action recognition, spatiotemporal action localization, behavioral graph analysis, animal pose estimation and action recognition) and multiple types of annotations (i.e., audio, visual, and text). This Grand Challenge will culminate in the 2nd Workshop on Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC).
The Challenge Tracks are based on the following video datasets:
Chaotic World (https://github.com/sutdcv/Chaotic-World)
This challenging multi-modal (video, audio, and text) dataset focuses on chaotic situations around the world. It comprises complex and dynamic scenes with severe occlusions and contains over 200,000 annotated instances for various tasks, such as spatiotemporal action localization (i.e., spatially and temporally locating actions), behavioral graph analysis (i.e., analyzing interactions between people), spatiotemporal event grounding (i.e., identifying relevant segments in long videos and localizing the people, scene, and behavior of interest), and sound source localization (i.e., spatially locating the source of a sound). This will promote deeper research into more robust models that capitalize on multiple modalities and handle complex human behaviors and interactions in dynamic and complex environments. An illustrative sketch of the box-matching step behind spatiotemporal localization metrics follows the track list below.
Track: Spatiotemporal Action Localization
Track: Sound Source Localization
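For illustration only (the official evaluation follows the Chaotic World paper and its released scripts), the matching step behind frame-level detection metrics such as mAP reduces to a bounding-box IoU test; a minimal Python sketch, with the 0.5 threshold mentioned purely as a common convention, is:

    def box_iou(box_a, box_b):
        """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    # A predicted box typically counts as a correct detection of an action
    # instance only if its IoU with the ground-truth box exceeds a threshold
    # (commonly 0.5).
    print(box_iou((10, 10, 50, 50), (20, 20, 60, 60)))  # about 0.39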
Animal Kingdom (https://github.com/sutdcv/Animal-Kingdom)
This animal behavior analysis dataset comprises 50 hours of long videos with corresponding text descriptions of the scene, animal, and behaviors for video grounding (i.e., identifying the relevant temporal segments in long videos), 50 hours of annotated video sequences for animal action recognition (i.e., identifying the action of the animal), and 33,000 frames for animal pose estimation (i.e., predicting the keypoints of the animal), covering a diversity of over 850 species. This will facilitate animal behavior analysis, which is especially important for wildlife surveillance and conservation. A toy example of a keypoint-accuracy (PCK-style) computation is sketched after the track list below.
Track: Video Grounding
Track: Animal Action Recognition
Track: Animal Pose Estimation for Protocol 1 (All animals)
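As an illustrative, unofficial example of how pose estimation is commonly scored, a PCK-style metric counts a keypoint as correct when it lands within a fraction of a normalising length from the ground truth; the threshold and normalisation below are assumptions, and the authoritative protocol is the one in the Animal Kingdom paper:

    import numpy as np

    def pck(pred, gt, visible, threshold=0.05, norm_length=1.0):
        """Percentage of Correct Keypoints (PCK).

        pred, gt    : (num_keypoints, 2) arrays of predicted / ground-truth (x, y)
        visible     : boolean mask marking which keypoints are annotated
        threshold   : fraction of norm_length within which a prediction is correct
        norm_length : normalising length (e.g., a bounding-box side or body length)
        """
        dists = np.linalg.norm(pred - gt, axis=1)
        correct = (dists <= threshold * norm_length) & visible
        return correct.sum() / max(visible.sum(), 1)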
UAV-Human (https://github.com/sutdcv/UAV-Human)
This human behavior analysis dataset contains more than 60,000 video samples (bird's-eye views of people and actions captured from a UAV) for pose estimation (i.e., predicting the keypoints of an individual), skeleton-based action recognition (i.e., identifying the action of an individual based on keypoints), attribute recognition (i.e., identifying the features / characteristics of an individual), as well as person reidentification (i.e., matching a person's identity across different videos). This will facilitate human behavior analysis from different vantage points and will be useful for the development and application of UAVs. A simplified sketch of the ranking step behind person-reidentification metrics follows the track list below.
Track: Skeleton-based Action Recognition
Track: Attribute Recognition
Track: Person Reidentification
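As a simplified, unofficial sketch of the ranking step behind person-reidentification metrics such as rank-1 accuracy (the UAV-Human paper defines the full protocol, including mAP and any camera filtering), each query embedding is matched to its nearest gallery embedding:

    import numpy as np

    def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
        """Fraction of queries whose nearest gallery embedding shares their identity.

        query_feats, gallery_feats : L2-normalised feature matrices, shape (n, dim)
        query_ids, gallery_ids     : identity labels, shape (n,)
        """
        sims = query_feats @ gallery_feats.T   # cosine similarity matrix
        nearest = sims.argmax(axis=1)          # best gallery match for each query
        return float((gallery_ids[nearest] == query_ids).mean())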
Participants are required to register their interest to participate in the Challenge.
The Grand Challenge is open to individuals from institutions of higher education, research institutes, enterprises, or other organizations.
Evaluation Metric and Judging Criteria
The evaluation metric for each track follows the metric specified in the corresponding dataset's original paper.
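For example, temporal video grounding results are often reported as Recall@1 at a temporal-IoU threshold; the sketch below is illustrative only, and the authoritative metric for every track remains the one defined in the corresponding dataset paper:

    def temporal_iou(pred, gt):
        """IoU between two temporal segments given as (start, end) in seconds."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    def recall_at_iou(top1_predictions, ground_truths, threshold=0.5):
        """Fraction of queries whose top-ranked segment reaches the IoU threshold
        (commonly reported as R@1, IoU=0.5)."""
        hits = [temporal_iou(p, g) >= threshold
                for p, g in zip(top1_predictions, ground_truths)]
        return sum(hits) / len(hits)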
01 May 2024: Deadline for camera-ready submission of accepted papers
01 May 2024: Author full conference registration