Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods must process the complete set of video frames before making predictions, whereas humans process visual input online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), which aims to detect the boundaries of generic events immediately in streaming videos. The task poses the unique challenge of identifying subtle, taxonomy-free event changes in real time, without access to future frames.
To tackle these challenges, we propose a novel On-GEBD framework, ESTimator, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). Experimental results demonstrate that ESTimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods.
Conventional offline-GEBD methods utilize both past and future frames to determine event boundaries, which differs significantly from how humans perceive events in an online manner. Humans process visual information sequentially, relying only on visuals available at the current moment. Our Online GEBD task aims to closely mimic this natural human perception process.
Our framework is inspired by Event Segmentation Theory (EST) from cognitive science. EST explains how humans continuously make predictions consistent with an ongoing event and detect changes when these predictions diverge from actual information. When we watch a scene, we naturally expect the upcoming visuals to continue what we have been seeing. When the actual visual input differs significantly from that expectation, we perceive an event boundary.
Image credit to Neuroscience News.
ESTimator addresses the unique challenges of On-GEBD through two key components inspired by EST principles:
CEA generates predictions of future frames reflecting current event dynamics based solely on prior frames. It is trained using two novel objectives:
OBD measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. By storing historical discrepancies in a fixed-size queue and conducting statistical testing, OBD provides a dynamic threshold that reflects the surrounding context, enabling robust detection of taxonomy-free events with varying granularity.
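The queue-and-test idea behind OBD can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `AdaptiveBoundaryDetector`, the cosine-distance error in `prediction_error`, the one-sided z-test, and the values of `queue_size`, `z_threshold`, and `min_history` are all assumptions chosen for illustration, since the text above does not specify which error measure or statistical test is used.

```python
import math
from collections import deque


def prediction_error(predicted, actual):
    """Cosine distance between a predicted and an observed feature vector.

    Assumption: the error measure is not specified in the text; cosine
    distance is used here only as a simple stand-in.
    """
    dot = sum(p * a for p, a in zip(predicted, actual))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_a = math.sqrt(sum(a * a for a in actual))
    return 1.0 - dot / max(norm_p * norm_a, 1e-8)


class AdaptiveBoundaryDetector:
    """Hypothetical detector: flags a boundary when the current error is a
    statistical outlier relative to a fixed-size queue of past errors."""

    def __init__(self, queue_size=16, z_threshold=3.0, min_history=8):
        self.errors = deque(maxlen=queue_size)  # oldest errors drop out automatically
        self.z_threshold = z_threshold          # assumed one-sided z-score cutoff
        self.min_history = min_history          # wait for enough samples before testing

    def update(self, error):
        """Test the new error against past errors, then store it."""
        is_boundary = False
        if len(self.errors) >= self.min_history:
            n = len(self.errors)
            mean = sum(self.errors) / n
            var = sum((e - mean) ** 2 for e in self.errors) / n
            std = math.sqrt(var)
            # The effective threshold is mean + z_threshold * std, so it
            # adapts to how noisy the recent context has been.
            z = (error - mean) / max(std, 1e-8)
            is_boundary = z > self.z_threshold
        # Design choice of this sketch: every error is stored, including
        # those flagged as boundaries.
        self.errors.append(error)
        return is_boundary


# Toy stream of per-frame prediction errors with one clear spike at index 9.
stream = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.12, 0.10, 0.11, 0.60, 0.10]
detector = AdaptiveBoundaryDetector()
boundaries = [i for i, e in enumerate(stream) if detector.update(e)]
print(boundaries)  # [9]
```

Because the cutoff is derived from the mean and spread of recent errors rather than fixed in advance, the same detector responds to a small jump in a quiet segment and ignores a jump of similar size in a noisy one, which is the behavior a context-dependent threshold is meant to provide.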
Our ESTimator outperforms all baseline methods adapted from recent online video understanding models on both Kinetics-GEBD and TAPOS datasets. Moreover, it achieves performance comparable to prior offline-GEBD methods, despite having access only to past frames.
Qualitative comparisons demonstrate that ESTimator detects event boundaries more accurately than baselines, with predictions closely aligned with ground truth. The error plots show distinct peaks at boundary locations, effectively identifying both abrupt scene changes and subtle semantic transitions.
@inproceedings{jung2025online,
author = {Jung, Hyungrok and Kim, Daneul and Lim, Seunggyun and Son, Jeany and Choi, Jonghyun},
title = {Online Generic Event Boundary Detection},
booktitle = {ICCV},
year = {2025},
}