Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos

Materia, Daniele; Ragusa, Francesco; Farinella, Giovanni Maria

In International Conference on Pattern Recognition (ICPR) 2026 · Accepted

April 2026 · Daniele Materia¹, Francesco Ragusa¹,², Giovanni Maria Farinella¹,²

¹University of Catania ²Next Vision s.r.l.

Abstract

The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system, as it allows the system to guide users during daily life activities and understand their short- and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and by inferring user intent from the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames.

Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task, showing its model-agnostic nature.
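The abstract names an inverse exponential sampling strategy without giving its formula. As an illustration only, the sketch below shows one plausible reading: frames are picked so that the gaps between them shrink exponentially toward the end of the observed clip, which places most samples just before the interaction. The function name `inverse_exponential_indices`, the `base` parameter, and the scaling rule are assumptions made for this example and are not taken from the paper.

```python
def inverse_exponential_indices(num_frames: int, num_samples: int, base: float = 2.0) -> list[int]:
    """Return sorted frame indices that get denser toward the last frame.

    Hypothetical sketch: offsets counted back from the last frame grow as
    base**k - 1 (0, 1, 3, 7, ... for base 2) and are rescaled so that the
    largest offset reaches the first frame of the clip.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if base <= 1.0:
        raise ValueError("base must be greater than 1")

    last = num_frames - 1
    if num_samples == 1:
        return [last]

    # Largest unscaled offset; used to map offsets onto [0, last].
    max_raw = base ** (num_samples - 1) - 1

    indices = set()
    for k in range(num_samples):
        raw = base ** k - 1
        offset = round(last * raw / max_raw)
        indices.add(last - offset)

    # Short clips can make offsets collide, so fewer indices may be returned.
    return sorted(indices)


if __name__ == "__main__":
    print(inverse_exponential_indices(num_frames=64, num_samples=4))
```

With 64 frames and 4 samples, the gaps between consecutive selected frames are 36, 18, and 9, so each gap is half the previous one as the clip approaches its end. A uniform sampler would instead spread the same budget evenly and spend fewer frames on the final moments.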

Citation

Materia, D., Ragusa, F., and Farinella, G. M. (2026). Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos. In Proceedings of the International Conference on Pattern Recognition (ICPR 2026).

@inproceedings{materia2026Leveraging,
  booktitle = { International Conference on Pattern Recognition (ICPR) },
  url = {  },
  pdf = {  },
  year = { 2026 },
  author = { Daniele Materia and Francesco Ragusa and Giovanni Maria Farinella },
  title = { Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos },
}