Posted on 2024-12-01, authored by Sharad Chandakacherla
Emotion Recognition in Conversation (ERC) has been studied extensively in recent years. A closely related emerging domain is Emotion Cause Analysis (ECA), commonly defined as the task of Emotion-Cause Pair Extraction (ECPE). Most studies in these fields have been grounded in the textual modality. Because natural language exists in multiple modalities, SemEval-2024 introduced a shared task, Multimodal Emotion Cause Analysis in Conversations (Task #3), to incorporate multimodal information into this area of research. The task comprises two subtasks: Subtask 1, Textual Emotion-Cause Pair Extraction in Conversations (TECPE), and Subtask 2, Multimodal Emotion-Cause Pair Extraction in Conversations (MECPE). All experiments in this work use the publicly available ECF 2.0 (Emotion-Cause-in-Friends) dataset, which is derived from the widely used multimodal, multiparty dataset based on the TV show Friends. The central aim of this work is to use various attention-based correlation mechanisms to model both the TECPE and MECPE tasks. Using popular pre-trained language models (PLMs), thorough experiments are presented in the following areas: (1) establishing mono-modal baselines for the ERC task on ECF 2.0; (2) exploring various multimodal fusion techniques, including late fusion, early fusion, self-attention, and cross-modal attention-based fusion; and (3) extracting multimodal causal spans and utterances. These experiments can further inform data modeling, fusion techniques, and modality selection for such tasks. For the first subtask, a two-step disjoint training process was employed, in which both models were fine-tuned pre-trained language models. The second model employed the question-answering (QA) paradigm and ranked second by weighted-average F1 score under strict span matching. For Subtask 2, multimodal interaction and joint training for the MECPE task were examined in detail.
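To make the fusion variants concrete, a minimal sketch of cross-modal attention-based fusion is shown below. This is an illustrative example only, not the thesis's implementation: the module, dimensions, and the choice of text as the query modality are all assumptions. Text-side hidden states attend over another modality's features (e.g., audio), and the result is merged back with a residual connection.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Illustrative cross-modal attention fusion (assumed design):
    text hidden states act as queries; the other modality's features
    supply keys and values."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, other):
        # text:  (batch, text_len, d_model)
        # other: (batch, other_len, d_model), e.g. audio or video features
        fused, _ = self.attn(query=text, key=other, value=other)
        # Residual connection plus layer norm keeps the text stream primary
        return self.norm(text + fused)

# Toy usage with random features standing in for PLM / acoustic encoders
text_feats = torch.randn(2, 10, 256)
audio_feats = torch.randn(2, 50, 256)
out = CrossModalAttentionFusion()(text_feats, audio_feats)
print(out.shape)  # torch.Size([2, 10, 256])
```

Late fusion would instead combine per-modality predictions, and early fusion would concatenate raw features before encoding; this sketch covers only the attention-based variant.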
Despite using much smaller models, our results are comparable to those of other teams employing larger LLMs, underscoring the value of adding conversational context and selecting specialized models to use resources efficiently.
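The QA-paradigm span extraction mentioned above can be illustrated with a small, dependency-free sketch. This is not the thesis's model: the toy utterance, logits, and the max-span-length heuristic are assumptions. An extractive QA head scores every token as a potential span start or end, and the causal span is the valid (start, end) pair with the highest combined score.

```python
def best_span(start_logits, end_logits, max_len=8):
    """Pick the (start, end) pair maximizing start + end score,
    with end >= start and a bounded span length (assumed heuristic)."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best[0], best[1]

# Toy example: logits are hand-picked to highlight the causal span.
tokens = ["I", "failed", "the", "exam", "so", "I", "am", "sad"]
start_logits = [0.1, 2.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.1]
end_logits   = [0.0, 0.2, 0.1, 2.5, 0.1, 0.0, 0.0, 0.2]
i, j = best_span(start_logits, end_logits)
print(" ".join(tokens[i:j + 1]))  # failed the exam
```

Strict span matching, as used in the shared-task evaluation, counts a prediction correct only when the extracted span boundaries exactly match the gold annotation.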