Posted on 2025-05-01, 00:00. Authored by Shweta Parihar.
Social biases in large language models (LLMs) raise critical fairness concerns; this work addresses them through two complementary projects. The first project tackles the challenge of mitigating gender bias without compromising the language modeling capabilities of LLMs. Counterfactual Data Augmentation (CDA), a widely used fine-tuning-based debiasing method, generates synthetic data that may align poorly with real-world distributions or ignore the social context of the altered sensitive attribute (e.g., gender) in the pretraining corpus. To overcome this, we propose Context-CDA, which leverages LLMs to produce contextually relevant and diverse counterfactual data for fine-tuning. By minimizing distributional discrepancies between the debiasing corpus and the pretraining data, this approach enhances alignment with real-world usage. To further improve data quality, we apply semantic entropy filtering to remove uncertain text samples. Evaluations on bias benchmarks demonstrate that Context-CDA significantly reduces bias while preserving model performance.
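The sketch below illustrates the general idea of the two steps described above, not the authors' implementation: an LLM is prompted to rewrite a sentence with the gender attribute swapped while keeping the context plausible, and rewrites are kept only when the semantic entropy across several sampled generations is low. The names `llm_generate` and `are_equivalent` (e.g., an NLI-based equivalence check) are hypothetical placeholders, and the prompt and threshold are illustrative assumptions.

```python
import math
from typing import Callable, List

def semantic_entropy(samples: List[str],
                     are_equivalent: Callable[[str, str], bool]) -> float:
    """Cluster sampled generations by semantic equivalence and return the
    entropy of the cluster distribution (higher = more uncertain)."""
    clusters: List[List[str]] = []
    for s in samples:
        for c in clusters:
            if are_equivalent(s, c[0]):
                c.append(s)
                break
        else:
            clusters.append([s])
    n = len(samples)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)

def context_cda_filter(sentence: str,
                       llm_generate: Callable[[str], str],
                       are_equivalent: Callable[[str, str], bool],
                       n_samples: int = 5,
                       max_entropy: float = 0.7) -> List[str]:
    """Generate counterfactual rewrites of `sentence` with an LLM and keep
    them only if the model is consistent (low semantic entropy) across
    samples; otherwise drop the example as too uncertain."""
    prompt = (
        "Rewrite the sentence with the gender of the people swapped, "
        f"keeping the context natural and realistic:\n{sentence}"
    )
    samples = [llm_generate(prompt) for _ in range(n_samples)]
    if semantic_entropy(samples, are_equivalent) > max_entropy:
        return []  # uncertain sample filtered out of the debiasing corpus
    return samples
```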
The second project focuses on bias propagation in Retrieval-Augmented Generation (RAG) systems. We investigate how the addition of retrieved contexts influences the bias behavior of LLMs. Our findings reveal a reduction in bias after incorporating the RAG pipeline, indicating that the inclusion of external context often helps counteract stereotype-driven predictions. We also delve deeper into the model's reasoning process by integrating Chain-of-Thought (CoT) prompting into the RAG system while assessing the faithfulness of the model's CoT. Our experiments show that the model's bias inclination shifts between stereotype and anti-stereotype responses as more contextual information is incorporated. Moreover, contrary to the bias reduction observed with standard RAG, we find that applying CoT with RAG increases overall bias across datasets. This counterintuitive result can be attributed to a bias-accuracy trade-off: while CoT improves accuracy by encouraging more deliberate reasoning, this often comes at the expense of fairness, motivating the design of bias-aware reasoning frameworks to mitigate the trade-off.
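A minimal sketch of the kind of probe described above follows: vary the number of retrieved passages and toggle CoT prompting, then count how often the model selects the stereotype-consistent option on a bias benchmark. The hooks `retrieve` and `llm_answer`, the prompt template, and the example field names are assumptions for illustration, not a specific RAG framework's API or the authors' evaluation code.

```python
from typing import Callable, Dict, List

def build_prompt(question: str, options: List[str],
                 contexts: List[str], use_cot: bool) -> str:
    """Assemble a multiple-choice prompt from retrieved contexts, an optional
    Chain-of-Thought instruction, the question, and the answer options."""
    ctx = "\n".join(f"Context: {c}" for c in contexts)
    cot = "Let's think step by step before answering.\n" if use_cot else ""
    opts = "\n".join(f"({i}) {o}" for i, o in enumerate(options))
    return f"{ctx}\n{cot}Question: {question}\n{opts}\nAnswer:"

def stereotype_rate(examples: List[Dict],
                    retrieve: Callable[[str, int], List[str]],
                    llm_answer: Callable[[str], int],
                    k_contexts: int,
                    use_cot: bool) -> float:
    """Fraction of examples where the model picks the stereotyped option."""
    hits = 0
    for ex in examples:
        contexts = retrieve(ex["question"], k_contexts) if k_contexts else []
        prompt = build_prompt(ex["question"], ex["options"], contexts, use_cot)
        choice = llm_answer(prompt)  # index of the option the model selects
        hits += int(choice == ex["stereotype_index"])
    return hits / len(examples)

# Example sweep: bias inclination as more context is added, with/without CoT.
# for k in (0, 1, 3, 5):
#     for cot in (False, True):
#         print(k, cot, stereotype_rate(data, retrieve, llm_answer, k, cot))
```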
Public datasets used: Statistical and Neural Machine Translation News Commentary, StereoSet, CrowS-Pairs, Multi-Genre Natural Language Inference, Natural Language Inference Bias, Semantic Textual Similarity Benchmark, BiasBios, Question-answering Natural Language Inference, Recognizing Textual Entailment, Stanford Sentiment Treebank v2, Winogender, WinoBias, WikiText-103, Colossal Clean Crawled Corpus, Bias in Open-ended Language Generation, Holistic Bias