Multi-agent AI systems offer a promising pathway to enhancing the robustness of Natural Language Inference (NLI), a key task in semantic understanding. However, this thesis demonstrates that the benefits of collaboration are not guaranteed: the effectiveness of a multi-agent system depends critically on the architecture of its deliberation protocol and the richness of the information exchanged between agents. To investigate this, we designed a 3×4 experimental matrix testing three prompting strategies (from basic responses to step-by-step reasoning) against four deliberation protocols (individual baseline, self-deliberation, peer review, and judge-mediated consensus). Using the Llama-3.2 and Phi-3-Mini models on the challenging SciNLI, MSciNLI, MISMATCHED, and ConjNLI benchmarks, we systematically evaluated the interplay between these two variables. The central finding of this work is "protocol-induced instability," a failure mode in which simple peer-to-peer protocols can actively degrade performance. We provide empirical evidence that in low-information settings, less capable agents often become unreliable when asked to re-evaluate agreed-upon answers, revealing that their initial consensus was frequently based on shallow, brittle reasoning. Our results establish two clear remedies for this instability: enriching the information agents exchange with explanations, or employing an architecturally robust judge-mediated protocol. The judge-based approach proved to be the most stable and effective method across all experimental conditions. We conclude that for complex reasoning tasks, the design of the deliberative architecture is as critical as the capabilities of the individual agents, offering a foundational principle for building more reliable collaborative AI.