VideoAds: Benchmarking Multi-Modal Language Models for Complex Temporal Understanding

Anonymous Author

Anonymous Institution

Abstract

Advertisement videos are a rich source of purpose-driven information, combining high-quality visual, textual, and contextual cues designed to engage viewers. Unlike conventional short-form videos, advertisement videos inherently demand complex temporal understanding, and state-of-the-art multi-modal language models (MLLMs) often fail to reason over their complex multi-modal sequences of events. In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises carefully curated advertisement videos with complex temporal structures, accompanied by manually annotated questions across three core tasks: visual finding, video summarization, and visual reasoning. Furthermore, we propose a quantitative metric for video complexity, enabling a more structured evaluation of video-based temporal reasoning. Extensive benchmarking of state-of-the-art MLLMs reveals significant limitations in processing the intricate temporal structures present in advertisement videos. In particular, GPT-4o achieves a maximum accuracy of only 66.82% on this four-option multiple-choice benchmark of short videos, whereas humans achieve an accuracy of 94.27%. Notably, models are more accurate on visual finding than on video summarization and complex reasoning, suggesting that current MLLMs prioritize static frame-based features over holistic temporal comprehension, even within relatively short videos. Further analysis of Chain-of-Thought (CoT) reasoning reveals the limitations of some MLLMs in handling complex contexts in challenging video reasoning. Our findings underscore the need to advance temporal modeling techniques in multi-modal learning and highlight VideoAds as a crucial benchmark for future research in video-based language understanding. The dataset and evaluation code are publicly available at XXX.