[2024 ICME Grand Challenge] Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC)


Will you be our next Grand Challenge winner?

2024 MMVRAC Winners

Avatar Avatar

This could be you! :)

ICME 2024

10 teams for
Track #1: Spatiotemporal Action Localization (Chaotic World dataset)

5 teams for
Track #2: Behavioral Graph Analysis (Scene Graph Generation) (Chaotic World dataset)

7 teams for
Track #3: Spatiotemporal Event Grounding (Chaotic World dataset)

7 teams for
Track #4: Sound Source Localization (Chaotic World dataset)

10 teams for
Track #5: Video Grounding (Animal Kingdom dataset)

20 teams for
Track #6: Animal Action Recognition (Animal Kingdom dataset)

13 teams for
Track #7: Animal Pose Estimation (Animal Kingdom dataset)

18 teams for
Track #8: Video Question-Answering (SUTD Traffic-QA dataset)

21 teams for
Track #9: Pose Estimation (UAV-Human dataset)

26 teams for
Track #10: Skeleton-based Action Recognition (UAV-Human dataset)

20 teams for
Track #11: Attribute Recognition (UAV-Human dataset)

23 teams for
Track #12: Person Reidentification (UAV-Human dataset)
Updated as of 25 Mar 2024

Singapore University of Technology and Design

Bingquan SHEN
DSO National Laboratories and National University of Singapore

Ping HU
Boston University

Kian Eng ONG
Singapore University of Technology and Design

Haoxuan QU
Singapore University of Technology and Design

Singapore University of Technology and Design

Lanyun ZHU
Singapore University of Technology and Design

Lin Geng FOO
Singapore University of Technology and Design

Xun Long NG
Singapore University of Technology and Design

Deadline has been extended to 31 March 2024 UTC 23:59:00

Avatar 25 March 2024

Challenge Leaderboard

Avatar 25 March 2024

Submission Guidelines

Code Submission Terms and Conditions

Participants are to adhere to the following terms and conditions. Failure to do so will result in disqualification from the Challenge.

  • Participants can utilize one or more modalities provided in the dataset for the task.
  • Participants can also use other datasets or pre-trained models etc.
  • The model should not be trained on any data from the test set.
  • Participants are required to submit their results the link to their training data, code, and model repositories by the Challenge submission deadline.
  • The codes need to be open-source and the same results can be reproduced by others. Hence, the submission should include:
    • list of Python libraries / dependencies for the codes
    • pre-training / training data (if different from what we provided) and pre-processing or pre-training codes
    • training and inference codes
    • trained model
    • documentation / instruction / other relevant information to ensure the seamless execution of the codes (e.g., easy means to read in test data directly by simply replacing the test data directory DIR_DATA_TEST)
  • Participants can upload their project (codes and models) to GitHub, Google Drive or any other online storage providers.

Paper Submission

  • We will contact the top teams of each track to submit the paper.
  • The paper submission will follow the ICME 2024 guidelines (https://2024.ieeeicme.org/author-information-and-submission-instructions)
  • Length: Papers must be no longer than 6 pages, including all text, figures, and references. Submissions may be accompanied by up to 20 MB of supplemental material following the same guidelines as regular and special session papers.
  • Format: Workshop papers have the same format as regular papers.

Avatar 06 February 2024

Challenge Details

Click on the respective headers to view / collapse details

Given the enormous amount of multi-modal multi-media information that we encounter in our daily lives (including visuals, sounds, texts, and interactions with their surroundings), we humans process such information with great ease – we understand and analyze, think rationally and reason logically, and then predict and make sound judgments and informed decision based on various modalities of information available.

For machines to assist us in holistic understanding and analysis of events, or even to achieve such sophistication of human intelligence (e.g., Artificial General Intelligence (AGI) or even Artificial Super Intelligence (ASI)), they need to process visual information from real-world videos, alongside complementary audio and textual data, about the events, scenes, objects, their attributes, actions, and interactions.

Hence, we hope to further advance such developments in multi-modal video reasoning and analyzing for different scenarios and real-world applications through this Grand Challenge using various challenging multi-modal datasets with different types of computer vision tasks (i.e., video grounding, spatiotemporal event grounding, video question answering, sound source localization, person reidentification, attribute recognition, pose estimation, skeleton-based action recognition, spatiotemporal action localization, behavioral graph analysis, animal pose estimation and action recognition) and multiple types of annotations (i.e., audio, visual, text). This Grand Challenge will culminate in the 2nd Workshop on Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC).

The Challenge Tracks are based on the following video datasets:

Chaotic World (https://github.com/sutdcv/Chaotic-World)
This challenging multi-modal (video, audio, and text) video dataset that focuses on chaotic situations around the world comprises complex and dynamic scenes with severe occlusions and contains over 200,000 annotated instances for various tasks such as spatiotemporal action localization (i.e., spatially and temporally locate the action), behavioral graph analysis (i.e., analyze interactions between people), spatiotemporal event grounding (i.e., identifying relevant segments in long videos and localizing people, scene, and behavior-of-interest), and sound source localization (i.e., spatially locate the source of sound). This will promote deeper research into more robust models that capitalize various modalities and handle such complex human behaviors / interactions in dynamic and complex environments.
Track #1: Spatiotemporal Action Localization
Track #2: Behavioral Graph Analysis (Scene Graph Generation)
Track #3: Spatiotemporal Event Grounding
Track #4: Sound Source Localization

Animal Kingdom (https://github.com/sutdcv/Animal-Kingdom)
This animal behavioral analysis dataset comprises 50 hours of long videos with corresponding text descriptions of the scene, animal, and behaviors to localize the relevant temporal segments in long videos for video grounding (i.e., identifying relevant segments in long videos), 50 hours of annotated video sequences for animal action recognition (i.e. identifying the action of the animal), and 33,000 frames for animal pose estimation (i.e., predicting the keypoints of the animal) for a diversity of over 850 species. This will facilitate animal behavior analysis, which is especially important for wildlife surveillance and conservation.
Track #5: Video Grounding
Track #6: Animal Action Recognition
Track #7: Animal Pose Estimation for Protocol 1 (All animals)

SUTD-TrafficQA (https://github.com/sutdcv/SUTD-TrafficQA)
This traffic event-based multi-modal video reasoning dataset about the traffic scenes contains over 10,000 RGB video samples and 60,000 questions and answers about the traffic scene for video question-answering (i.e., answer questions according to the given videos) task. This will encourage deeper research into more robust models with low latency that can analyze human and vehicle behaviors in dynamic split-seconds scenarios, and will be useful for development of safer autonomous vehicles.
Track #8: Video Question-Answering for Setting 1/4

UAV-Human (https://github.com/sutdcv/UAV-Human)
This human behavior analysis dataset contains more than 60,000 video samples (bird-eye views of people and actions taken from a UAV) for pose estimation (i.e., predicting the keypoints of the individual), skeleton-based action recognition (i.e. identifying the action of the individual based on keypoints), attribute recognition (i.e. identifying the features / characteristics of the individual), as well as person reidentification (i.e. matching a person's identity across various videos) tasks. This will facilitate human behavior analysis from different vantage points, and will be useful for the development and applications of UAVs.
Track #9: Pose Estimation
Track #10: Skeleton-based Action Recognition
Track #11: Attribute Recognition
Track #12: Person Reidentification

Participants are required to register their interest to participate in the Challenge.

The Grand Challenge is open to individuals from institutions of higher education, research institutes, enterprises, or other organizations.

  • Each team can consist of a maximum of 6 members. The team can comprise individuals from different institutions or organizations.
  • Participants can also participate as individuals.
  • Participants will take on any one of the tracks or more than one track in the Challenge Tracks.
  • All participants who successfully complete the Challenge by the Challenge deadline and adhere to the Submission Guidelines will receive a Certificate of Participation.
  • The top three teams of each track are required to submit a paper and present at the MMVRAC workshop (oral or poster session).
    • They will be notified by the Organizers and are required to submit their paper for review by 06 April 2024.
    • They will need to submit their camera-ready paper, register for the conference by 01 May 2024, and present at the conference; otherwise they their paper will not be included in IEEE Xplore.
    • A Grand Challenge paper is covered by a full-conference registration only.
    • They will receive the Winners' Certificate and their work (link to their training data, code, and model repositories) will be featured on this website.

Evaluation Metric and Judging Criteria

The evaluation metric of each task follows the metric indicated in the original paper.

  • In cases of task whereby there are sub-tasks, the sub-task is specified in the Challenge track.
  • In cases whereby there are a few evaluation metrics for the task, the combination of all metrics will be taken into consideration to evaluate the robustness of the model.
  • In cases whereby there are multiple submissions for the same task, only the last submission will be considered.
  • In the event of same results for the same task, the team with the earlier upload date and time will be considered as the winner.

31 Mar 2024
UTC 23:59:00

Deadline for Challenge submission and submit your codes here

07 April 2024
UTC 23:59:00

Deadline for submitting invited papers

11 April 2024
UTC 23:59:00

Notification of paper acceptance

01 May 2024
Deadline for camera-ready submission of accepted paper

01 May 2024
Author Full Conference Registration