Robot-Human Mapping Model Learning for Robotic Imitation using Deep Learning and Virtual Reality
Thesis posted on 01.08.2020, authored by Zainab Al-Qurashi
In human-robot mimicking systems, the robot attempts to copy and follow human movements as precisely as possible. In some tasks, such as surgical operations, keeping a camera focused on an object while moving, carrying a filled cup, or playing table tennis, precisely mapping the position and orientation of the human's hand, and the wrist in particular, to the robot's end-effector is essential for successful robot imitation. Motion capture, depth cameras, and virtual reality tracking systems provide hand position and orientation tracking with varying accuracies. However, even with precise tracking data, the kinematic differences between human and robot arms complicate this mapping significantly and prevent handcrafted mappings from being effective (6) (7). Learning an appropriate human-to-robot mapping model that predicts the robot's joint variables is therefore not an easy task, yet it is a critical one: the robot must mimic the human movement while making the modifications its own kinematics require to perform the task correctly and similarly to the human's performance. This thesis investigates whether learning-based methods can improve robotic teleoperation for complex tasks, and how efficient data collection affects the training of the machine learning systems. To map human pose to robot pose, we develop and evaluate two mapping models based on linear regression (LR) and an artificial neural network (ANN). We investigate two teleoperation approaches for transferring human pose: a Microsoft Kinect depth camera and the HTC Vive virtual reality system. We use the Baxter robot from Rethink Robotics in all experiments of this dissertation. Our experimental results demonstrate that the ANN mapping model maps human pose to the Baxter arm pose significantly better than the LR mapping model in both approaches.
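The simpler of the two mapping models can be sketched concisely. The following is a minimal illustration of a linear-regression pose mapping fit by least squares; the 7-D hand pose (position plus quaternion), the 7 joint variables, and the synthetic data are assumptions for illustration, not the thesis's actual dataset or dimensions.

```python
import numpy as np

# Illustrative sketch: fit a linear map from a tracked human hand pose
# (position + orientation quaternion, 7-D) to a 7-joint robot arm's joint
# variables by ordinary least squares.  Synthetic data stands in for the
# real tracked demonstrations.
rng = np.random.default_rng(0)

n_samples = 500
human_pose = rng.normal(size=(n_samples, 7))        # x, y, z, qw, qx, qy, qz
true_map = rng.normal(size=(7, 7))                  # unknown linear relation
robot_joints = human_pose @ true_map + 0.01 * rng.normal(size=(n_samples, 7))

# Fit the linear mapping W (with a bias column) by least squares.
X = np.hstack([human_pose, np.ones((n_samples, 1))])
W, *_ = np.linalg.lstsq(X, robot_joints, rcond=None)

def predict_joints(pose):
    """Map a batch of human hand poses to predicted robot joint angles."""
    pose = np.atleast_2d(pose)
    return np.hstack([pose, np.ones((pose.shape[0], 1))]) @ W

pred = predict_joints(human_pose)
rmse = float(np.sqrt(np.mean((pred - robot_joints) ** 2)))
```

The ANN model replaces this single linear map with a nonlinear regressor trained on the same pose pairs, which is what lets it absorb the kinematic differences a linear map cannot.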
Moreover, the HTC Vive virtual reality teleoperation approach improves the performance of robotic teleoperation compared to the Microsoft Kinect depth camera. However, the HTC Vive controller does not provide detailed information about the structure of the human hand, and the wrist in particular, which can prevent the robot from performing some critical tasks correctly and accurately. In addition, human and robot arms do not share the same kinematics and anatomy, which makes mapping human pose to robot pose difficult. We therefore propose a hybrid algorithm to solve the inverse kinematics (IK) problem for a complex robotic manipulator with a high degree of freedom (DOF). Our algorithm divides the manipulator's joints into two parts (arm and wrist) and employs a neural network (NN) together with a coordinate transformation, aided by vector analysis, to solve the IK problem as two independent sub-problems. An efficiently trained NN finds the IK solution for the arm, while the IK problem for the wrist is solved analytically. Compared with other analytical or numerical methods, our approach enables the robot to reach the desired position and orientation based on the current end-effector orientation; it can be applied to any complex robot with high DOF; it has lower computational complexity, especially after the neural networks are trained; it does not simplify the robot's structure by assuming some off-axes to be zero or by reducing the DOF of the manipulator by locking one or more joints; and its orientation description of the manipulator is universal and independent of the wrist hardware structure. We apply our approach to solve the IK problem for the Baxter robot arm. As our simulation results demonstrate, the approach enables Baxter's end-effector to reach the desired position and orientation with a small amount of error.
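The analytic wrist half of such an arm/wrist split can be illustrated with standard rotation algebra. The sketch below assumes a spherical-wrist-style decomposition (a simplification; the thesis's actual transformation is more general): the residual rotation left after the arm joints are placed is factored into three ZYZ angles that the wrist joints supply. All rotations here are illustrative examples, not Baxter's real kinematics.

```python
import numpy as np

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def wrist_angles(R_arm, R_desired):
    """Analytic wrist IK sketch: decompose the rotation the wrist must
    still supply into three ZYZ angles (spherical-wrist assumption,
    non-degenerate middle angle)."""
    R_w = R_arm.T @ R_desired            # rotation left for the wrist
    beta = np.arccos(np.clip(R_w[2, 2], -1.0, 1.0))
    alpha = np.arctan2(R_w[1, 2], R_w[0, 2])
    gamma = np.arctan2(R_w[2, 1], -R_w[2, 0])
    return alpha, beta, gamma

# Example: verify the decomposition reconstructs the desired orientation.
R_arm = rot_z(0.3) @ rot_y(-0.5)         # orientation reached by the arm
R_des = rot_z(1.1) @ rot_y(0.4) @ rot_z(-0.7)
a, b, g = wrist_angles(R_arm, R_des)
R_check = R_arm @ rot_z(a) @ rot_y(b) @ rot_z(g)
```

Solving the wrist this way is closed-form and cheap, which is why pairing it with a trained NN for the arm keeps the overall IK computation light at run time.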
To perform many critical manipulation tasks successfully, human-robot mimicking systems should accurately copy not only the position of the human hand but its orientation as well. Deep learning methods trained on pairs of corresponding human and robot poses offer one promising way to construct such a human-robot mapping. However, ignoring the spatial and temporal structure of this mapping makes it less effective to learn. We propose two hierarchical architectures that leverage the structural and temporal nature of the human-robot mapping. To capture the spatial structure, we partially separate the robotic manipulator's end-effector position and orientation while accounting for the mutual coupling effects between them. This divides the main problem, making the robot's end-effector match the human's hand position and mimic its orientation accurately along an unknown trajectory, into several sub-problems. We address these with separate temporal models, recurrent neural networks (RNNs) with long short-term memory (LSTM), which we combine and train hierarchically according to the coupling among the aspects of the robot that each controls. We evaluate our proposed architectures using the HTC Vive virtual reality system to track human table tennis motions, and compare their mimicking response with that of single artificial neural network (ANN) and RNN models. Comparing deep neural networks with and without our architectures, we find that the proposed architectures yield smaller position and orientation errors along with increased flexibility in wrist movement. Learning from demonstration (LfD) is a powerful approach for teaching robots desired behavior: robots observe the motions executed by human demonstrators and then mimic them. It is an efficient way to transfer knowledge (demonstrations of a task) from the human to the robot in human-robot mimicking systems.
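The hierarchical coupling idea can be sketched with two chained recurrent modules. The toy below is an inference-only illustration, not the thesis's trained architecture: a position LSTM consumes the tracked hand trajectory, and an orientation LSTM consumes the hand orientation together with the position module's hidden state, so the orientation sub-model sees the coupling. All sizes, weights, and inputs are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal single-layer LSTM cell (random weights, inference only)."""
    def __init__(self, n_in, n_hidden):
        # One stacked matrix for the input, forget, cell, and output gates.
        self.W = rng.normal(scale=0.1, size=(4 * n_hidden, n_in + n_hidden))
        self.b = np.zeros(4 * n_hidden)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        return h, c

# Hierarchical sketch: the orientation module is conditioned on the
# position module's hidden state, modelling position/orientation coupling.
pos_lstm = LSTMCell(n_in=3, n_hidden=16)            # hand x, y, z
ori_lstm = LSTMCell(n_in=4 + 16, n_hidden=16)       # quaternion + coupling

h_p = c_p = np.zeros(16)
h_o = c_o = np.zeros(16)
for t in range(20):                                  # 20-step toy trajectory
    hand_xyz = rng.normal(size=3)
    hand_quat = rng.normal(size=4)
    h_p, c_p = pos_lstm.step(hand_xyz, h_p, c_p)
    h_o, c_o = ori_lstm.step(np.concatenate([hand_quat, h_p]), h_o, c_o)
```

In a real system each module's hidden state would feed a readout head predicting its subset of joint variables, and the modules would be trained jointly in that hierarchical order.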
However, many constraints on both the human and robot sides affect the collection of an efficient training dataset. We propose a new approach to learning a human-robot mapping for robot learning from demonstration using a training dataset collected from humans mimicking robot motions (LfD-RH), in contrast with datasets collected from the robot mimicking human motions (LfD-HR). We then propose a method based on the Pearson correlation coefficient (PCC) to filter the collected dataset by excluding inefficient data vectors from the training set. We also propose a data collection algorithm based on LfD-RH and PCC to implement our proposed LfD-RH with PCC data collection approach in practice. We test and evaluate the two dataset collection approaches (with and without data filtering) by using them to train a human-robot mimicking system implemented with linear regression (LR), an artificial neural network (ANN), and a recurrent neural network (RNN). The results show that the robot-human training dataset collected by LfD-RH is more efficient than the human-robot dataset collected by LfD-HR. The LfD-RH approach produces more correlated data by avoiding two main problems of the LfD-HR approach: hidden points, and variation between the human demonstrator's and the robot demonstrator's heights and sizes. This improves the performance of the human-robot mimicking system. Excluding the inefficient observations reduces the training dataset size by half or more, and the machine learning (ML) system trained on the filtered dataset achieves lower error and trains faster compared to the original collected dataset.
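One plausible reading of PCC-based filtering can be sketched as follows: score trajectory segments by the Pearson correlation between corresponding human and robot signals and drop weakly correlated segments. The window length, the 0.5 threshold, and the per-window scheme are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient between two 1-D signals."""
    return float(np.corrcoef(a, b)[0, 1])

def filter_windows(human, robot, window=25, threshold=0.5):
    """Keep only trajectory windows where the human and robot signals
    correlate strongly; weakly correlated (inefficient) spans are dropped."""
    kept = []
    for start in range(0, len(human) - window + 1, window):
        h = human[start:start + window]
        r = robot[start:start + window]
        if abs(pcc(h, r)) >= threshold:
            kept.append((start, start + window))
    return kept

# Toy demonstration: a robot signal that tracks the human signal except
# over one corrupted span, which the filter should discard.
t = np.linspace(0, 4 * np.pi, 200)
human_sig = np.sin(t)
robot_sig = np.sin(t) + 0.05 * np.random.default_rng(2).normal(size=200)
robot_sig[100:150] = np.random.default_rng(3).normal(size=50)  # corrupted span

windows = filter_windows(human_sig, robot_sig)
```

Discarding uncorrelated spans is what shrinks the training set while removing exactly the observations that would mislead the LR, ANN, and RNN models.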