University of Illinois Chicago

Towards Geometric Structure Modeling and Learning for 3D Human-Centric Understanding

Thesis

Posted on 2024-12-01; authored by Zhiming Zou
The field of 3D computer vision has advanced rapidly in recent years, with a central challenge being human-centric understanding, which encompasses tasks such as 3D human pose estimation (HPE), 3D head pose estimation, and multi-view pedestrian detection (MPD). While state-of-the-art methods leveraging deep neural networks have shown promising results, many still struggle to accurately infer 3D structures or positional information from images. This research aims to model and learn the geometric structure and feature representations of the articulated human body, thereby enhancing the perception capabilities of modern human-centric vision systems.

We first propose a graph-based approach to better represent and understand geometric structures, and validate our methods on 3D HPE. We introduce mechanisms that capture the high-order proximity, compositionality, and variability of the articulated body. Our experimental results demonstrate that these methods significantly mitigate depth ambiguity in 3D HPE, achieving state-of-the-art performance on public datasets, including Human3.6M and MPI-INF-3DHP.

Next, we explore the benefits of geometric modeling and learning in 3D head pose estimation through a novel unsupervised framework that learns both head pose and facial landmarks. This framework features a multi-task network for joint landmark and pose prediction, learnable 3D canonical landmarks, and an image generation network, all trained collaboratively on unlabeled face images using a combined loss of conditional image generation and geometric consistency. Extensive experiments on public datasets, including CelebA, AFLW, and BIWI, demonstrate the effectiveness of our approach.

Finally, we highlight the importance of geometric modeling and learning in MPD. Existing systems struggle with inconsistent feature representations, especially in dense crowds, due to scale variations across camera views. We propose a novel adaptive detection transformer (AdaDETR), which consists of three key components: a Multi-space Pyramid Encoder, Content-dependent BEV Queries, and a Pedestrian Occupancy Decoder. These components handle scale variations, improve feature fusion across views, and predict pedestrian occupancy in bird's-eye-view (BEV) space. Extensive experiments on publicly available datasets, such as Wildtrack and MultiviewX, validate the effectiveness of our approach.

History

Advisor

Dr. Wei Tang

Department

Computer Science

Degree Grantor

University of Illinois Chicago

Degree Level

  • Doctoral

Degree name

PhD, Doctor of Philosophy

Committee Members

  • Dr. Xinhua Zhang
  • Dr. Elena Zheleva
  • Dr. Natalie Parde
  • Dr. Yiding Yang

Thesis type

application/pdf

Language

  • en
