Learning Temporal Video-Language Grounding for Egocentric Videos

In this paper, written in collaboration with two PhD students at the University of Texas, we address the challenge of learning temporal structure in long-horizon egocentric videos by pretraining a large video-language model on the Ego4D dataset. The goal is to transfer the pretrained model to downstream video understanding tasks such as video-text retrieval. To this end, we propose a novel contrastive learning approach that combines hard negative mining via nearest-neighbor sampling with temporal clip reordering. The key insight is a sampling strategy that incorporates both strong and weak samples into the contrastive learning objective, encouraging temporally robust video-language representations. We also develop a temporal reordering objective that unshuffles video clips and their corresponding narrations back into their original order.
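To make the sampling strategy concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation, of an InfoNCE-style objective in which hard ("strong") negatives for each narration are mined from its nearest neighbors in the text embedding space and up-weighted relative to the remaining in-batch ("weak") negatives. The function name `contrastive_loss_with_hard_negatives`, the weighting scheme, and the default hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(video_emb, text_emb, k_hard=4, temperature=0.07):
    """Illustrative sketch: symmetric InfoNCE over in-batch (weak) negatives,
    with nearest-neighbor (hard) negatives mined in the text embedding space
    and given extra weight in the denominator.

    video_emb, text_emb: (B, D) L2-normalized clip / narration embeddings.
    """
    B = video_emb.size(0)
    k_hard = min(k_hard, B - 1)

    # Similarity of every clip to every narration in the batch.
    logits = video_emb @ text_emb.t() / temperature           # (B, B)
    targets = torch.arange(B, device=video_emb.device)

    # Mine hard negatives: for each narration, its k nearest other narrations.
    with torch.no_grad():
        text_sim = text_emb @ text_emb.t()                    # (B, B)
        text_sim.fill_diagonal_(-float("inf"))                # exclude the positive itself
        hard_idx = text_sim.topk(k_hard, dim=1).indices       # (B, k_hard)

    # Up-weight mined hard negatives relative to the remaining weak in-batch
    # negatives (the positive keeps weight 1, so its term is unchanged).
    weight = torch.ones_like(logits)
    weight.scatter_(1, hard_idx, 2.0)                         # illustrative weighting
    weighted_logits = logits + weight.log()

    # Symmetric video-to-text and text-to-video InfoNCE.
    loss_v2t = F.cross_entropy(weighted_logits, targets)
    loss_t2v = F.cross_entropy(weighted_logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

In practice, `video_emb` and `text_emb` would be the L2-normalized outputs of the video and text encoders for the sampled clip-narration pairs.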


Egocentric videos are captured with wearable cameras and offer a unique first-person view of the environment. In this work, we consider the EgoClip subset provided by EgoVLP, which consists of a pretraining set of 3.8M video-text pairs from Ego4D. Our approach outperforms existing state-of-the-art methods on the challenging EgoMCQ benchmark, demonstrating significant improvements in both intra-video and inter-video metrics.
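For context, EgoMCQ is a multiple-choice task in which each narration query must be matched to the correct clip among a small set of candidates drawn either from the same video (intra-video) or from different videos (inter-video). The sketch below, with assumed names and tensor shapes, illustrates how such accuracy can be computed from the pretrained model's embeddings; it is not the benchmark's official evaluation code.

```python
import torch

def multiple_choice_accuracy(text_emb, candidate_video_embs, answer_idx):
    """Illustrative sketch: for each text query, pick the candidate clip with
    the highest similarity to the query embedding.

    text_emb:             (Q, D) L2-normalized query narration embeddings.
    candidate_video_embs: (Q, C, D) L2-normalized embeddings of C candidate clips.
    answer_idx:           (Q,) index of the correct clip for each query.
    """
    sims = torch.einsum("qd,qcd->qc", text_emb, candidate_video_embs)  # (Q, C)
    pred = sims.argmax(dim=1)
    return (pred == answer_idx).float().mean().item()
```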


Pretraining large-scale video-language models for egocentric understanding is challenging due to the inherent complexity of egocentric videos, the need to reason over large contexts in long-horizon videos, and the ambiguity of natural language queries. Mapping natural language queries to specific segments of the video stream requires fine-grained spatio-temporal reasoning, so an effective solution must capture both context and temporal relationships. In this work, we propose a novel approach that tackles these challenges and outperforms existing state-of-the-art methods.
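To illustrate how the temporal reordering objective introduced above can encourage this kind of temporal reasoning, the following is a minimal PyTorch sketch, under assumed names and shapes (e.g., `ClipReorderingHead`), of a head that predicts each shuffled clip's original position with a cross-entropy loss; the paper's actual formulation, which also reorders the corresponding narrations, may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipReorderingHead(nn.Module):
    """Illustrative sketch of a temporal reordering objective: given the
    embeddings of N shuffled clips, predict each clip's original position."""

    def __init__(self, dim, num_clips):
        super().__init__()
        self.classifier = nn.Linear(dim, num_clips)  # one position logit per slot

    def forward(self, clip_emb, original_positions):
        # clip_emb: (B, N, D) embeddings of clips fed to the model in shuffled order.
        # original_positions: (B, N) true temporal index of each shuffled clip.
        logits = self.classifier(clip_emb)            # (B, N, N)
        return F.cross_entropy(logits.flatten(0, 1),
                               original_positions.flatten())

# Illustrative usage: shuffle clips and keep the permutation as the target.
B, N, D = 2, 4, 256
clip_emb = torch.randn(B, N, D)                       # stands in for encoder outputs
perm = torch.stack([torch.randperm(N) for _ in range(B)])
loss = ClipReorderingHead(D, N)(clip_emb, perm)
```

A position-classification head of this kind is one simple way to realize an unshuffling objective; sequence-level alternatives (e.g., pairwise ordering losses) would fit the same description.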
