The field of 3D computer vision has advanced rapidly in recent years, and a central challenge is human-centric understanding, which encompasses tasks such as 3D human pose estimation (HPE), 3D head pose estimation, and multi-view pedestrian detection (MPD). While state-of-the-art methods leveraging deep neural networks have shown promising results, many still struggle to accurately infer 3D structure or positional information from images. This research aims to model and learn the geometric structure and feature representations of the articulated human body, thereby enhancing the perception capabilities of modern human-centric vision systems.
We first propose a graph-based approach to better represent and understand geometric structures, validating it on 3D HPE. We introduce mechanisms that capture the high-order proximity, compositionality, and variability of the articulated body. Our experiments demonstrate that these mechanisms significantly mitigate depth ambiguity in 3D HPE, achieving state-of-the-art performance on public datasets, including Human3.6M and MPI-INF-3DHP.
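To make the high-order proximity idea concrete, below is a minimal PyTorch sketch of a skeleton graph convolution that aggregates multi-hop joint neighbors with a separate transform per hop order. The class name, normalization, and hop construction are illustrative assumptions, not the dissertation's actual model.

```python
import torch
import torch.nn as nn

class HighOrderGCN(nn.Module):
    """Graph convolution over the human skeleton with k-hop adjacency.

    Hypothetical sketch: high-order proximity is modeled by letting each
    joint aggregate features from 1-hop, 2-hop, ... neighbors, so distant
    but kinematically related joints (e.g., wrist and shoulder) interact.
    """
    def __init__(self, in_dim, out_dim, adjacency, order=2):
        super().__init__()
        eye = torch.eye(adjacency.size(0))
        hop1 = ((adjacency + eye) > 0).float()  # 1-hop reachability + self-loops
        powers, a = [], hop1
        for _ in range(order):
            deg = a.sum(dim=1, keepdim=True).clamp(min=1.0)
            powers.append(a / deg)              # row-normalized k-hop matrix
            a = ((a @ hop1) > 0).float()        # extend reachability by one hop
        self.register_buffer("adj", torch.stack(powers))  # (order, J, J)
        # A separate linear transform per hop order.
        self.weights = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(order)])

    def forward(self, x):  # x: (batch, joints, in_dim) per-joint features
        out = 0
        for a, w in zip(self.adj, self.weights):
            out = out + a @ w(x)  # transform, then aggregate k-hop neighbors
        return torch.relu(out)
```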
Next, we explore the benefits of geometric modeling and learning in 3D head pose estimation through a novel unsupervised framework that jointly learns head pose and facial landmarks. The framework comprises a multi-task network for joint landmark and pose prediction, learnable 3D canonical landmarks, and an image generation network, all trained collaboratively on unlabeled face images with a combined loss of conditional image generation and geometric consistency. Extensive experiments on public datasets, including CelebA, AFLW, and BIWI, demonstrate the effectiveness of our approach.
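One way to read the geometric consistency term: the learnable canonical 3D landmarks, rigidly transformed by the predicted head pose and projected into the image, should land on the 2D landmarks predicted by the multi-task network. The sketch below assumes this formulation; the function name, shapes, and the simple pinhole projection are assumptions, not the dissertation's exact loss.

```python
import torch

def geometric_consistency_loss(canon_3d, rot, trans, pred_2d, focal=1.0):
    """Illustrative consistency term (all names/shapes are assumptions).

    canon_3d: (L, 3) learnable canonical landmarks, shared across images
    rot:      (B, 3, 3) predicted head rotations
    trans:    (B, 3) predicted head translations
    pred_2d:  (B, L, 2) landmarks predicted directly from each image
    """
    # Rigidly transform canonical landmarks into each camera frame.
    cam = torch.einsum("bij,lj->bli", rot, canon_3d) + trans[:, None, :]
    # Pinhole projection with an assumed shared focal length.
    proj = focal * cam[..., :2] / cam[..., 2:3].clamp(min=1e-6)
    # The pose and landmark branches must agree on where landmarks project.
    return ((proj - pred_2d) ** 2).mean()
```

Because both the canonical landmarks and the pose are learned, this term couples the two branches without requiring any pose or landmark annotations.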
Finally, we highlight the importance of geometric modeling and learning in MPD. Existing systems struggle with inconsistent feature representations, especially in dense crowds, due to scale variations across camera views. We propose a novel adaptive detection transformer (AdaDETR) with three key components: a Multi-space Pyramid Encoder, Content-dependent BEV Queries, and a Pedestrian Occupancy Decoder. Together, these components handle scale variations, improve feature fusion across views, and predict pedestrian occupancy in bird's-eye-view (BEV) space. Extensive experiments on publicly available datasets, such as Wildtrack and MultiviewX, validate the effectiveness of our approach.
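As a rough illustration of decoding occupancy in BEV space, the following DETR-style sketch lets a set of learned queries cross-attend to fused multi-view BEV features and predict per-query ground-plane locations and confidences. It is a hypothetical stand-in for AdaDETR's decoder; the real query construction (content-dependent in AdaDETR), attention layers, and heads may differ.

```python
import torch
import torch.nn as nn

class OccupancyDecoder(nn.Module):
    """Sketch of DETR-style pedestrian occupancy decoding on a BEV grid."""
    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        # Static learned queries here; AdaDETR makes them content-dependent.
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.loc_head = nn.Linear(dim, 2)  # normalized (x, y) on the ground plane
        self.occ_head = nn.Linear(dim, 1)  # pedestrian occupancy confidence

    def forward(self, bev_feats):
        """bev_feats: (B, H*W, dim) fused multi-view features on the BEV grid."""
        q = self.queries.weight.unsqueeze(0).expand(bev_feats.size(0), -1, -1)
        # Queries cross-attend to BEV features aggregated from all camera views.
        h = self.decoder(q, bev_feats)
        return self.loc_head(h).sigmoid(), self.occ_head(h).sigmoid()
```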
Advisor
Dr. Wei Tang
Department
Computer Science
Degree Grantor
University of Illinois Chicago
Degree Level
Doctoral
Degree Name
PhD, Doctor of Philosophy
Committee Member
Dr. Xinhua Zhang
Dr. Elena Zheleva
Dr. Natalie Parde
Dr. Yiding Yang