Accurate estimation of speaker orientation significantly enhances applications such as hearing aids, teleconferencing systems, and voice-controlled interfaces. In hearing aids, orientation information can help decide whether a given talker should be enhanced; in teleconferencing systems, it improves camera tracking and audio quality; and in voice-controlled interfaces, it helps determine whether the device is being addressed and should respond.
This thesis introduces a deep neural network method for speaker orientation estimation that combines spatial and spectral audio features. Spatial features are derived using a weighted Generalized Cross-Correlation with Phase Transform (GCC-PHAT) applied to three microphones placed around the speaker, while spectral features capture the directivity patterns of human speech through Mel spectrograms. The proposed method achieves significantly lower estimation errors than approaches using a single feature type. Experimental results demonstrate improved accuracy, validating the effectiveness of the combined features and their suitability for real-world implementations.
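To make the feature pipeline concrete, the sketch below illustrates how GCC-PHAT and Mel spectrogram features might be extracted and combined. It is a minimal, assumption-laden illustration, not the thesis implementation: it uses the standard unweighted GCC-PHAT rather than the weighted variant described above, and the function names (`gcc_phat`, `extract_features`), parameter values, and the `librosa` dependency are all illustrative choices.

```python
import numpy as np
import librosa  # assumed dependency for mel spectrogram extraction

def gcc_phat(sig, ref, fft_size=1024):
    """Standard (unweighted) GCC-PHAT between two microphone channels.

    Returns the cross-correlation over time lags; the peak location
    reflects the relative time delay between the two channels, which
    carries the spatial cue. The thesis applies an additional weighting
    not reproduced here.
    """
    n = fft_size * 2  # zero-pad to avoid circular-correlation artifacts
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    return np.fft.fftshift(cc)  # center the zero-lag bin

def extract_features(mics, sr=16000, n_mels=64):
    """Combine spatial and spectral features for a list of mic signals.

    Spatial part: GCC-PHAT vectors for every microphone pair
    (three mics give three pairs). Spectral part: a log-Mel
    spectrogram of one reference channel, capturing the
    direction-dependent spectral shaping of speech.
    """
    spatial = []
    for i in range(len(mics)):
        for j in range(i + 1, len(mics)):
            spatial.append(gcc_phat(mics[i], mics[j]))
    mel = librosa.feature.melspectrogram(y=mics[0], sr=sr, n_mels=n_mels)
    spectral = librosa.power_to_db(mel, ref=np.max)
    return np.concatenate(spatial), spectral
```

In a setup like this, the concatenated GCC-PHAT vectors and the log-Mel spectrogram would be fed as separate input branches (or a fused tensor) to the neural network, which learns the mapping from these combined cues to the speaker's orientation angle.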