Posted on 2016-09-10, 00:00. Authored by L. Chen, M. Javaid, B. Di Eugenio, M. Zefran.
The RoboHelper project has the goal of developing assistive robots for the elderly. One crucial component of such a robot is a
multimodal dialogue architecture, since collaborative task-oriented human-human dialogue is inherently multimodal. In this paper,
we focus on a specific type of interaction, Haptic-Ostensive (H-O) actions, which are pervasive in collaborative dialogue. H-O actions
manipulate objects, but they also often perform a referring function.
We collected 20 collaborative task-oriented human-human dialogues between a helper and an elderly person in a realistic
setting. To collect the haptic signals, we developed an unobtrusive sensory glove with pressure sensors. Multiple annotations
were then conducted to build the Find corpus. Supervised machine learning was applied to these annotations in order to develop
reference resolution and dialogue act classification modules. Both the corpus analysis and these two modules show that H-O actions
play a crucial role in interaction: models that include H-O actions and other extra-linguistic information, such as pointing gestures,
perform better.
For true human-robot interaction, all communicative intentions must of course be recognized in real time, not on the basis of
annotated categories. To demonstrate that our corpus analysis is not an end in itself, but can inform actual human-robot interaction,
the last part of our paper presents additional experiments on recognizing H-O actions from the haptic signals measured through the
sensory glove. We show that even though pressure sensors are relatively imprecise and the data provided by the glove is noisy, the
classification algorithms can successfully identify actions of interest within subjects.
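To make the within-subject classification idea concrete, the sketch below shows one common way to frame such a task: segment the multi-channel pressure signals into labeled windows, summarize each window with simple per-sensor statistics, and train a standard supervised classifier. This is a minimal illustration assuming scikit-learn and synthetic stand-in data; the window length, sensor count, action classes, and feature set are hypothetical and are not taken from the paper's actual pipeline.

```python
# Illustrative sketch (not the paper's method): window-based supervised
# classification of multi-channel pressure-sensor signals, roughly the kind
# of task involved in recognizing H-O actions from a sensory glove.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

N_WINDOWS = 400   # hypothetical number of labeled signal windows
N_SENSORS = 8     # hypothetical number of pressure sensors on the glove
WINDOW_LEN = 50   # hypothetical samples per window

# Synthetic stand-in for glove data: each window is (WINDOW_LEN, N_SENSORS),
# labeled with an action class (e.g., 0 = no action, 1 = grasp, 2 = hand-over).
windows = rng.normal(size=(N_WINDOWS, WINDOW_LEN, N_SENSORS))
labels = rng.integers(0, 3, size=N_WINDOWS)
# Inject a weak class-dependent offset so the toy task is learnable.
windows += labels[:, None, None] * 0.5

def window_features(w):
    """Simple per-sensor summary statistics over one window."""
    return np.concatenate([w.mean(axis=0), w.std(axis=0), w.max(axis=0)])

X = np.array([window_features(w) for w in windows])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

The same framing carries over to real glove data: noisy pressure readings are tolerated reasonably well by ensemble classifiers when the features summarize each window rather than relying on individual samples.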
Funding
The research of Milos Zefran, Ph.D., has been funded by the National Science Foundation (NSF); he is a recipient of the NSF CAREER award (2001).
Publisher Statement
Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Computer Speech and Language. 2015. 34(1): 201-231. DOI: 10.1016/j.csl.2015.03.010.