Overview
When people perform everyday actions, their eyes and attention don’t wander randomly; they shift in predictable ways depending on what the person is doing and where objects are in the environment. Understanding this link between actions, gaze, and 3D context is helpful for applications like VR/AR training, assistive systems, and human-robot collaboration.
This project investigates how to model the relationship between human gaze (where someone looks), visual saliency (what stands out in the scene), and 3D action recognition (what they are doing, and where, in 3D space).
Objectives
You will choose one of the following directions (depending on your interests and background):
- Gaze and Saliency in Egocentric Video
  - Use publicly available datasets (e.g. Ego4D, EPIC-Kitchens) to train and test models that link visual saliency with action recognition (see the saliency-evaluation sketch after this list).
  - Evaluate how attention shifts during different phases of simple actions (e.g. reaching, pouring, picking up objects).
- 3D Localization of Gaze
  - Use pose estimation and 3D reconstruction tools to map gaze data into a reconstructed scene (see the back-projection sketch after this list).
  - Prototype a system that shows where someone is looking within the reconstructed 3D scene in which the action takes place.
- Small Multimodal Dataset Creation
  - Collect a small dataset (1–2 hours) using available equipment (e.g. an egocentric camera plus an eye tracker).
  - Align first-person gaze data with external camera views and provide a documented preprocessing pipeline (see the timestamp-alignment sketch after this list).
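For the first direction, one concrete way to link predicted saliency with recorded gaze is to score saliency maps against gaze fixations. The snippet below is a minimal sketch of the Normalized Scanpath Saliency (NSS) metric; it assumes fixations have already been mapped to pixel coordinates in each frame, and the function name and array shapes are illustrative rather than taken from any specific dataset toolkit.

```python
# Minimal sketch: scoring a predicted saliency map against recorded gaze
# fixations with the Normalized Scanpath Saliency (NSS) metric.
# Assumes fixations are already given as pixel coordinates in the frame;
# names and shapes are illustrative, not tied to a particular dataset API.
import numpy as np

def nss(saliency_map: np.ndarray, fixations_xy: np.ndarray) -> float:
    """NSS: mean of the z-scored saliency values at the fixated pixels.

    saliency_map : (H, W) float array, any non-negative scale.
    fixations_xy : (N, 2) int array of (x, y) pixel coordinates.
    """
    s = saliency_map.astype(np.float64)
    s = (s - s.mean()) / (s.std() + 1e-8)      # z-score the whole map
    xs, ys = fixations_xy[:, 0], fixations_xy[:, 1]
    return float(s[ys, xs].mean())             # sample at fixation locations

# Toy usage: a saliency peak near the fixated region should score well above 0.
sal = np.zeros((480, 640))
sal[200:260, 300:360] = 1.0                    # fake "salient" blob
fix = np.array([[320, 230], [310, 225], [500, 100]])
print(f"NSS = {nss(sal, fix):.2f}")
```

Higher NSS means the predicted saliency is concentrated where the person actually looked; computing it per action phase is one simple way to quantify how attention shifts during an action.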
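For the 3D localization direction, the core geometric step is lifting a 2D gaze point into the reconstructed scene. The sketch below back-projects a gaze pixel through a pinhole camera model using a depth value and a camera-to-world pose; the intrinsics, the depth source, and the variable names are assumptions made for illustration, not part of any particular reconstruction toolkit.

```python
# Minimal sketch: back-projecting a 2D gaze point into world coordinates.
# Assumes per-frame depth (e.g. from an RGB-D sensor or a reconstruction)
# and a camera-to-world pose are available; all names are illustrative.
import numpy as np

def gaze_to_world(gaze_px, depth_m, K, T_world_cam):
    """Lift a gaze pixel into world coordinates.

    gaze_px     : (u, v) gaze location in pixels.
    depth_m     : metric depth at that pixel (z-depth, as stored in a depth map).
    K           : (3, 3) camera intrinsic matrix.
    T_world_cam : (4, 4) camera-to-world transform for this frame.
    """
    u, v = gaze_px
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # normalised ray, z = 1
    p_cam = ray_cam * depth_m                           # 3D point in camera frame
    p_world = T_world_cam @ np.append(p_cam, 1.0)       # homogeneous transform
    return p_world[:3]

# Toy usage with made-up intrinsics and an identity camera pose.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
print(gaze_to_world((400, 260), 1.5, K, np.eye(4)))
```

In practice the depth could also come from the reconstruction itself (e.g. by casting the gaze ray into the reconstructed geometry), but the back-projection step is the same either way.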
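For the dataset direction, a first preprocessing step is temporal alignment, since eye trackers and cameras usually record at different rates. The sketch below matches each video frame to the nearest gaze sample by timestamp, assuming both streams share (or have been corrected onto) a common clock; the sampling rates and names are illustrative.

```python
# Minimal sketch: aligning eye-tracker gaze samples with video frames by
# nearest-timestamp matching. Assumes both streams are on a common clock;
# rates and names are illustrative.
import numpy as np

def align_gaze_to_frames(gaze_t, gaze_xy, frame_t, max_gap_s=0.05):
    """For each frame timestamp, pick the gaze sample closest in time.

    gaze_t   : (N,) gaze timestamps in seconds (sorted).
    gaze_xy  : (N, 2) gaze coordinates (e.g. normalised to [0, 1]).
    frame_t  : (M,) frame timestamps in seconds (sorted).
    Returns an (M, 2) array; frames with no gaze sample within max_gap_s
    are filled with NaN so they can be masked out later.
    """
    idx = np.searchsorted(gaze_t, frame_t)               # insertion points
    idx = np.clip(idx, 1, len(gaze_t) - 1)
    left, right = idx - 1, idx
    choose_left = (frame_t - gaze_t[left]) < (gaze_t[right] - frame_t)
    nearest = np.where(choose_left, left, right)
    aligned = gaze_xy[nearest].astype(float)
    gap = np.abs(gaze_t[nearest] - frame_t)
    aligned[gap > max_gap_s] = np.nan                     # no nearby sample
    return aligned

# Toy usage: 200 Hz gaze samples against 30 fps video frames.
gaze_t = np.arange(0.0, 2.0, 1 / 200)
gaze_xy = np.random.rand(len(gaze_t), 2)
frame_t = np.arange(0.0, 2.0, 1 / 30)
print(align_gaze_to_frames(gaze_t, gaze_xy, frame_t).shape)
```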
Expected Outcomes
- A working prototype (e.g. model, dataset, or pipeline) demonstrating how gaze and saliency can be linked to 3D action understanding.
- Evaluation results using either public datasets or your own small-scale data collection.
- A dissertation that positions your work within current research and identifies future directions.
Skills Required
- Programming ability (Python, PyTorch/TensorFlow).
- Familiarity with computer vision or machine learning.
- Interest in multimodal data (video, gaze, 3D reconstruction).
Skills You Will Gain
- Hands-on experience with egocentric video datasets.
- Practice with visual saliency, action recognition, and 3D pose estimation techniques.
- A broader understanding of human attention modeling in VR/AR and human-robot interaction.
References (starting point)
- Liu, M., Ma, L., Somasundaram, K., Li, Y., Grauman, K., Rehg, J. M., & Li, C. (2022). Egocentric activity recognition and localization on a 3D map. In European Conference on Computer Vision (pp. 621–638). Cham: Springer Nature Switzerland.
- Wang, X., Zhu, L., Wu, Y., & Yang, Y. (2020). Symbiotic attention for egocentric action recognition with object-centric alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6), 6605–6617.
- Huang, Y., Cai, M., Li, Z., Lu, F., & Sato, Y. (2020). Mutual context network for jointly estimating egocentric gaze and action. IEEE Transactions on Image Processing, 29, 7795–7806.