Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context

Cho, Kyungwon; Joo, Hanbyul

Seoul National University

We present the first hand-aware, sequence-level diffusion framework that directly conditions on both the head trajectory and hand cues that are only intermittently visible, owing to field-of-view limitations and occlusions, as on real-world egocentric devices.

Abstract

Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer's full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult because most body parts are invisible from the egocentric view. Prior approaches either rely mainly on head trajectories, which leaves the body pose ambiguous, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both the head trajectory and hand cues that are only intermittently visible, owing to field-of-view limitations and occlusions, as on real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field of view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.

Long Demo

Comparison with Baseline

Method Overview

Our framework leverages a conditional diffusion model to generate human motion, conditioned on head trajectory and intermittently visible hand poses from egocentric video. To efficiently process long sequences, we incorporate a local attention mechanism within the encoder-decoder transformer architecture. Furthermore, we encode sequence-level context such as body shape and field-of-view to ensure consistent reconstruction across diverse device configurations.
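
As a rough illustration of the local attention idea, the sketch below restricts single-head self-attention to a band of neighboring frames. It is not the HaMoS implementation: the function names, the `window` parameter, and the use of the input as query, key, and value are all assumptions made for the example. The dense mask still costs quadratic memory, so it only shows which frames may attend to which.

```python
import numpy as np


def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask; True where frame i may attend to frame j.

    `window` is a hypothetical half-width: frame i sees frames i-window..i+window.
    """
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def local_self_attention(x: np.ndarray, window: int) -> np.ndarray:
    """Single-head self-attention over x of shape (seq_len, dim), band-limited.

    For brevity the input serves as query, key, and value (no learned projections).
    """
    seq_len, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    # Frames outside the band get -inf so they receive zero weight after softmax.
    scores = np.where(local_attention_mask(seq_len, window), scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x
```

Because each output frame depends only on a fixed-size neighborhood, the cost of a banded implementation grows linearly with sequence length, which is the property that makes long sequences tractable.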

To overcome the lack of datasets, we introduce a novel augmentation method that simulates real-world egocentric conditions. We apply spatial augmentation to mimic the fields of view of diverse devices, and temporal augmentation via independent event duration sampling to realistically model misdetection patterns.
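
One way to picture the two augmentations is the sketch below. It is an interpretation, not the released code: the camera convention (+z forward), the angular visibility test, the geometric duration distribution, and all function and parameter names are assumptions made for the example.

```python
import numpy as np


def fov_visibility(points_cam: np.ndarray, h_fov_deg: float, v_fov_deg: float) -> np.ndarray:
    """Spatial augmentation: which hand positions fall inside a simulated field of view.

    points_cam has shape (T, 3) in an assumed camera frame with +z pointing forward.
    """
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    in_front = z > 0
    h_angle = np.degrees(np.arctan2(np.abs(x), z))
    v_angle = np.degrees(np.arctan2(np.abs(y), z))
    return in_front & (h_angle <= h_fov_deg / 2) & (v_angle <= v_fov_deg / 2)


def sample_detection_mask(num_frames: int, mean_detect: float, mean_miss: float,
                          rng: np.random.Generator) -> np.ndarray:
    """Temporal augmentation: alternate detected and missed events.

    Each event duration is drawn independently; a geometric distribution is an
    assumption here. Both means are in frames and must be >= 1.
    """
    mask = np.empty(num_frames, dtype=bool)
    t = 0
    # Start in the detected state with probability proportional to its mean duration.
    detected = bool(rng.random() < mean_detect / (mean_detect + mean_miss))
    while t < num_frames:
        mean = mean_detect if detected else mean_miss
        duration = int(rng.geometric(1.0 / mean))  # always at least one frame
        mask[t:t + duration] = detected
        t += duration
        detected = not detected
    return mask
```

A hand observation would then be kept for a frame only when both masks are true, for example `visible = fov_visibility(hand_cam, 90.0, 60.0) & sample_detection_mask(len(hand_cam), 30.0, 5.0, rng)`, where `hand_cam` is a hypothetical array of hand positions in the camera frame.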

BibTeX

@article{cho2025hamos,
  title={Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context},
  author={Cho, Kyungwon and Joo, Hanbyul},
  journal={arXiv preprint arXiv:2512.19283},
  year={2025},
}