Functional Theory of Mind Evaluation in Large Language Models: A Behavioral and Causal Stability Framework
Prashanta Kumar Mohanty, Anupam Prasad, Abhisek Soy, Gaurav Kumar
printfpk@gmail.com
Department of MCA (Batch: 2024–2026), Haridwar University, Roorkee, Haridwar
Internal Guide: Akanksha Shukla, Assistant Professor (CA), akanksha.cse@huroorkee.ac.in
Abstract: Theory of Mind (ToM) — the cognitive capacity to attribute beliefs, intentions, desires, and emotions to oneself and others — is considered a cornerstone of human social intelligence. As Large Language Models (LLMs) such as GPT-4o, LLaMA-3.1-70B, and Qwen2.5-72B are increasingly deployed in social and interactive roles, the question of whether they genuinely possess ToM capabilities has become both scientifically significant and practically urgent. However, the existing landscape of ToM evaluation is fragmented, primarily relying on behavioral benchmarks that test only whether a model produces the correct output, without investigating the underlying computational mechanism or the stability of that reasoning. This paper proposes a Functional Theory of Mind Evaluation Framework that addresses this gap through three integrated layers of analysis: (1) behavioral accuracy evaluation using structured benchmarks (BigToM and ToMValley), (2) causal internal representation analysis using perspective projection and counterfactual interventions grounded in Simulation Theory, and (3) reasoning stability measurement using transformation-based divergence testing. Experimental analysis across five leading LLMs demonstrates significant variation in behavioral accuracy (35–67%), with transformation and belief-tracking questions proving hardest. Counterfactual intervention experiments reveal that later Transformer layers (65–80) encode perspective-taking representations with measurable causal effects on model outputs, providing partial support for Simulation Theory as an explanatory mechanism. Stability testing reveals that all models exhibit significant brittleness under adversarial scenario modifications, with answer consistency dropping 18–34% under minimal transformations. We propose a unified Functional ToM Score that integrates these three dimensions into a single interpretable metric, and discuss implications for AI safety, evaluation methodology, and future benchmark design.

Keywords: Theory of Mind, Large Language Models, Simulation Theory, false-belief evaluation, causal representation analysis, reasoning stability, Functional ToM Score, social reasoning, mechanistic interpretability, BigToM, ToMValley.
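As a minimal illustrative sketch of how the three evaluation dimensions summarized above could be folded into a single composite score, the snippet below combines behavioral accuracy, a normalized causal-effect measure, and a stability score with a weighted sum. The function name, the weights, and the normalization assumptions are hypothetical and do not reflect the exact formulation of the Functional ToM Score defined later in the paper.

```python
# Hypothetical sketch only: the weights and the assumption that every input
# lies in [0, 1] are illustrative, not the paper's actual formulation.

def functional_tom_score(behavioral_acc: float,
                         causal_effect: float,
                         stability: float,
                         weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted combination of the three evaluation dimensions.

    behavioral_acc : benchmark accuracy (e.g., on BigToM / ToMValley items)
    causal_effect  : normalized effect size of counterfactual interventions
    stability      : answer consistency under scenario transformations
    """
    w_b, w_c, w_s = weights
    return w_b * behavioral_acc + w_c * causal_effect + w_s * stability


if __name__ == "__main__":
    # Example with made-up values in the ranges mentioned in the abstract.
    print(round(functional_tom_score(0.55, 0.40, 0.70), 3))
```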