Online Generic Event Boundary Detection

ICCV 2025

¹GIST, ²Seoul National University, ³POSTECH
*Equal Contribution, †Corresponding Authors

Abstract

Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods must process the complete video before making predictions, unlike humans, who process information online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), which aims to detect the boundaries of generic events immediately in streaming videos. The task poses the unique challenge of identifying subtle, taxonomy-free event changes in real time, without access to future frames.

To tackle these challenges, we propose a novel On-GEBD framework, ESTimator, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). Experimental results demonstrate that ESTimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods.

Motivation: Online vs. Offline GEBD

Conventional offline-GEBD methods utilize both past and future frames to determine event boundaries, which differs significantly from how humans perceive events in an online manner. Humans process visual information sequentially, relying only on visuals available at the current moment. Our Online GEBD task aims to closely mimic this natural human perception process.

Comparison between offline-GEBD and human perception. Offline-GEBD uses all frames, while humans segment events sequentially based on current visuals.

Event Segmentation Theory (EST)

Our framework is inspired by Event Segmentation Theory (EST) from cognitive science. EST explains how humans continuously make predictions consistent with an ongoing event and detect changes when these predictions diverge from actual information. When we watch a scene, we naturally expect the upcoming visuals to continue what we have been seeing. When the actual visual input differs significantly from this expectation, we perceive an event boundary.
Image credit to Neuroscience News.

Illustration of Event Segmentation Theory showing how humans perceive event boundaries through the discrepancy between expected and actual visual information.
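
The discrepancy signal that EST describes can be illustrated with a toy sketch. Everything here is an assumption made for illustration: frames are represented as plain feature vectors, and the gap between predicted and actual features is measured with cosine distance. The page does not state which error measure ESTimator uses, and the function names are hypothetical.

```python
import math


def cosine_distance(predicted, actual):
    """Return 1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(p * a for p, a in zip(predicted, actual))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_a = math.sqrt(sum(a * a for a in actual))
    if norm_p == 0.0 or norm_a == 0.0:
        return 1.0  # assumption: treat an all-zero vector as maximally different
    return 1.0 - dot / (norm_p * norm_a)


def prediction_errors(predicted_feats, actual_feats):
    """Per-frame discrepancy between what was expected and what was observed."""
    return [cosine_distance(p, a) for p, a in zip(predicted_feats, actual_feats)]
```

Under this toy measure, a frame whose observed features match the prediction yields an error near 0, while a frame that departs from the prediction yields a larger error, which is the kind of signal a boundary detector can act on.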

Our Framework: ESTimator

ESTimator addresses the unique challenges of On-GEBD through two key components inspired by EST principles:

Overview of our ESTimator framework showing the Consistent Event Anticipator (CEA) and Online Boundary Discriminator (OBD) components.

Consistent Event Anticipator (CEA)

CEA generates predictions of future frames reflecting current event dynamics based solely on prior frames. It is trained using two novel objectives:

  • EST Loss: Frame-level objective that encourages large prediction errors at event boundaries and small prediction errors within consistent event segments.
  • REST Loss (Region EST Loss): Region-level training scheme that considers temporal context flow, providing soft supervision for smooth semantic transitions.

Visualization of EST and REST loss mechanisms for training the Consistent Event Anticipator.
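
To make the two objectives more concrete, here is a minimal sketch of what losses with these properties could look like. It is not the formulation from the paper: the hinge margin, the Gaussian soft labels, and all function names and parameters (`margin`, `sigma`) are assumptions chosen only to illustrate "large error at boundaries, small error within events" and "soft supervision around boundaries".

```python
import math


def est_style_loss(errors, is_boundary, margin=1.0):
    """Hypothetical frame-level loss over per-frame prediction errors.

    Inside an event the error itself is penalized; at a boundary the loss
    penalizes errors that fall below `margin` (a hinge term).
    """
    if not errors:
        raise ValueError("errors must not be empty")
    total = 0.0
    for err, boundary in zip(errors, is_boundary):
        if boundary:
            total += max(0.0, margin - err)  # want a large error here
        else:
            total += err  # want a small error here
    return total / len(errors)


def soft_boundary_targets(num_frames, boundary_indices, sigma=1.0):
    """Hypothetical soft labels: a Gaussian bump centered on each boundary."""
    targets = [0.0] * num_frames
    for t in range(num_frames):
        for b in boundary_indices:
            weight = math.exp(-((t - b) ** 2) / (2.0 * sigma ** 2))
            targets[t] = max(targets[t], weight)
    return targets


def rest_style_loss(errors, soft_targets, margin=1.0):
    """Hypothetical region-level loss: blend both terms by the soft label."""
    if not errors:
        raise ValueError("errors must not be empty")
    total = 0.0
    for err, w in zip(errors, soft_targets):
        total += w * max(0.0, margin - err) + (1.0 - w) * err
    return total / len(errors)
```

In this sketch the soft targets let frames near a boundary contribute partly to the "large error" term, so supervision changes gradually across the transition rather than switching on for a single frame.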

Online Boundary Discriminator (OBD)

OBD measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. By storing historical discrepancies in a fixed-size queue and conducting statistical testing, OBD provides a dynamic threshold that reflects the surrounding context, enabling robust detection of taxonomy-free events with varying granularity.

The Online Boundary Discriminator applies a dynamic threshold, derived from the distribution of past errors, to capture diverse event transitions.
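
The queue-plus-statistical-test idea can be sketched as follows. This is an illustrative stand-in, not the OBD implementation: the page does not say which statistical test is used, so a simple z-score rule is assumed here, and the class name and parameters (`queue_size`, `z_threshold`, `min_history`, `min_std`) are hypothetical.

```python
from collections import deque
import statistics


class QueueThresholdDetector:
    """Flags a boundary when the current error is an outlier vs. recent errors."""

    def __init__(self, queue_size=8, z_threshold=3.0, min_history=3, min_std=1e-3):
        self.history = deque(maxlen=queue_size)  # fixed-size queue of past errors
        self.z_threshold = z_threshold
        self.min_history = min_history
        self.min_std = min_std  # floor so a flat history does not give a zero-width band

    def update(self, error):
        """Consume one prediction error; return True if it looks like a boundary."""
        is_boundary = False
        if len(self.history) >= self.min_history:
            past = list(self.history)
            mean = statistics.fmean(past)
            std = max(statistics.pstdev(past), self.min_std)
            # Dynamic threshold: it moves with the statistics of recent errors.
            is_boundary = error > mean + self.z_threshold * std
        self.history.append(error)  # oldest entry is dropped once the queue is full
        return is_boundary


def detect_stream(errors, **kwargs):
    """Run the detector over a sequence of errors, one frame at a time."""
    detector = QueueThresholdDetector(**kwargs)
    return [detector.update(e) for e in errors]
```

Because the threshold is recomputed from the queue at every step, the same absolute error can count as a boundary in a calm stretch of video and as ordinary variation in a noisy one. Whether flagged errors should also enter the queue is a design choice; this sketch keeps them.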

Experimental Results

Quantitative Results

Our ESTimator outperforms all baseline methods adapted from recent online video understanding models on both Kinetics-GEBD and TAPOS datasets. Moreover, it achieves performance comparable to prior offline-GEBD methods, despite having access only to past frames.

Quantitative comparison with online baselines and offline methods. Our method achieves state-of-the-art performance among online methods.

Qualitative Results

Qualitative comparisons demonstrate that ESTimator detects event boundaries more accurately than baselines, with predictions closely aligned with ground truth. The error plots show distinct peaks at boundary locations, effectively identifying both abrupt scene changes and subtle semantic transitions.

Qualitative comparison showing that our method detects event boundaries accurately, with clearer error peaks than the baseline methods.

BibTeX

@inproceedings{jung2025online,
  author    = {Jung, Hyungrok and Kim, Daneul and Lim, Seunggyun and Son, Jeany and Choi, Jonghyun},
  title     = {Online Generic Event Boundary Detection},
  booktitle = {ICCV},
  year      = {2025},
}