Figure: The original Sally–Anne cartoon used in the test by Baron-Cohen, Leslie and Frith (1985)

The Sally–Anne test is a psychological test, originally conceived by Daniel Dennett, that is used in developmental psychology to measure a person's social cognitive ability to attribute false beliefs to others. It builds on an earlier study by Wimmer and Perner (1983) and was given its name by Simon Baron-Cohen, Alan M. Leslie, and Uta Frith (1985), who developed the test further. In 1988, Leslie and Frith repeated the experiment with human actors rather than dolls and found similar results.

Test description

To develop an effective test, Baron-Cohen et al. modified the puppet-play paradigm of Wimmer and Perner (1983), in which puppets stand in as tangible characters in a story rather than the hypothetical characters of pure storytelling.

In the test, the two dolls are first introduced and the child is asked a control question: to recall their names (the Naming Question). A short skit is then enacted. Sally takes a marble and hides it in her basket. She then "leaves" the room and goes for a walk. While she is away, Anne takes the marble out of Sally's basket and puts it in her own box. Sally is then reintroduced and the child is asked the key question, the Belief Question: "Where will Sally look for her marble?" The correct answer is Sally's basket, because Sally did not see the marble being moved.
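The logic of the skit can be sketched as a toy simulation in Python. This is an illustrative model only, not part of the original study: the `Agent` and `Scene` classes and the `run_sally_anne` function are hypothetical names, and the key assumption is that a character updates her belief about an object's location only when she is present to see it move.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """A character who tracks where she believes each object is."""
    name: str
    present: bool = True
    beliefs: dict = field(default_factory=dict)  # object -> believed location


class Scene:
    """Toy world: true object locations plus each agent's private beliefs."""

    def __init__(self, agents):
        self.agents = {agent.name: agent for agent in agents}
        self.locations = {}  # object -> true location

    def move(self, obj, place):
        # The world always changes; only agents who are present notice.
        self.locations[obj] = place
        for agent in self.agents.values():
            if agent.present:
                agent.beliefs[obj] = place

    def leave(self, name):
        self.agents[name].present = False

    def enter(self, name):
        self.agents[name].present = True

    def will_look(self, name, obj):
        # An agent searches where she believes the object is,
        # not where it actually is.
        return self.agents[name].beliefs[obj]


def run_sally_anne():
    scene = Scene([Agent("Sally"), Agent("Anne")])
    scene.move("marble", "Sally's basket")  # Sally hides the marble
    scene.leave("Sally")                    # Sally goes for a walk
    scene.move("marble", "Anne's box")      # Anne moves it, unseen by Sally
    scene.enter("Sally")                    # Sally returns
    belief_answer = scene.will_look("Sally", "marble")
    reality_answer = scene.locations["marble"]
    return belief_answer, reality_answer
```

In this sketch, `run_sally_anne()` returns Sally's basket as the answer to the Belief Question and Anne's box as the marble's true location. Answering the Belief Question with the true location corresponds to failing the test.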

In the Baron-Cohen et al. (1985) study, 23 of the 27 clinically unimpaired children (85%) and 12 of the 14 children with Down syndrome (86%) answered the Belief Question correctly. However, only four of the 20 autistic children (20%) answered correctly. Overall, children under the age of four, along with most of the (older) autistic children, answered the Belief Question with "Anne's box", seemingly unaware that Sally does not know her marble has been moved. These results may reflect the social deficits associated with autism.
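The percentages quoted for the 1985 study follow directly from the reported counts. The short Python check below recomputes them; the counts come from the text above, while the `RESULTS` dictionary and the `pass_rates` helper are names introduced here purely for illustration.

```python
# (children answering the Belief Question correctly, children tested)
RESULTS = {
    "clinically unimpaired": (23, 27),
    "Down syndrome": (12, 14),
    "autistic": (4, 20),
}


def pass_rates(results):
    """Return each group's pass rate as a whole-number percentage."""
    return {
        group: round(100 * passed / total)
        for group, (passed, total) in results.items()
    }
```

Calling `pass_rates(RESULTS)` gives 85, 86, and 20 for the three groups respectively.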

Tager-Flusberg (2007) states that, in spite of the empirical findings with the Sally–Anne task, there is growing uncertainty among scientists about the importance of the underlying theory-of-mind hypothesis of autism. In every study conducted, some children with autism have passed false-belief tasks such as the Sally–Anne test.

In other hominids

Eye-tracking studies of chimpanzees, bonobos, and orangutans suggest that all three species anticipate the false beliefs of a subject in a King Kong suit and so pass the Sally–Anne test.

Artificial intelligence

Researchers in artificial intelligence and computational cognitive science have long attempted to model computationally the human ability to reason about the (false) beliefs of others in tasks like the Sally–Anne test. Many approaches have been taken to replicate this ability in computers, including neural network approaches, epistemic plan recognition, and Bayesian theory-of-mind. These approaches typically model agents as rationally selecting actions based on their beliefs and desires. Such a model can be used either to predict an agent's future actions (as in the Sally–Anne test) or to infer the agent's current beliefs and desires from its observed actions. In constrained settings, these models are able to reproduce human-like behavior on tasks similar to the Sally–Anne test, provided that the tasks are represented in a machine-readable format.
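The two uses of such a model, predicting actions from beliefs and inferring beliefs from actions, can be illustrated with a minimal Bayesian sketch in Python. It is a toy under stated assumptions and is not drawn from any particular published model: the agent is assumed to search the location she believes in with probability `1 - noise` and to pick any other location uniformly otherwise, and the function names and the `noise` parameter are hypothetical.

```python
def action_likelihood(action, belief, n_locations, noise):
    """P(agent searches `action` | she believes the object is at `belief`)."""
    if action == belief:
        return 1.0 - noise
    return noise / (n_locations - 1)


def predict_action(belief_probs, noise=0.05):
    """Forward use: distribution over where the agent will search.

    `belief_probs` maps each location (at least two) to the probability
    that the agent believes the object is there.
    """
    n = len(belief_probs)
    return {
        action: sum(
            p * action_likelihood(action, belief, n, noise)
            for belief, p in belief_probs.items()
        )
        for action in belief_probs
    }


def infer_belief(prior, observed_action, noise=0.05):
    """Inverse use: Bayes' rule over the agent's belief, given her action."""
    n = len(prior)
    unnormalized = {
        belief: p * action_likelihood(observed_action, belief, n, noise)
        for belief, p in prior.items()
    }
    total = sum(unnormalized.values())
    return {belief: value / total for belief, value in unnormalized.items()}
```

For example, if an observer is certain that Sally believes the marble is in the basket, `predict_action({"basket": 1.0, "box": 0.0})` assigns probability 0.95 to her searching the basket. Conversely, starting from an even prior and seeing her search the basket, `infer_belief({"basket": 0.5, "box": 0.5}, "basket")` assigns probability 0.95 to her believing the marble is there.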

With the rise of large language models (LLMs), researchers have found that frontier models can now routinely pass classic false-belief tasks like the Sally–Anne test. A 2023 paper from Microsoft Research first reported that GPT-4 could pass an instance of the test, interpreting this as evidence of "a very advanced level of theory of mind." Kosinski (2024) tested eleven LLMs on 40 bespoke false-belief tasks requiring correct answers across eight scenarios each; GPT-4 solved 75% of tasks, matching the performance of six-year-old children, while older models solved none. Strachan et al. (2024) compared GPT and LLaMA models against 1,907 human participants on a broad battery of theory-of-mind tests and found that GPT-4 performed at or above human levels on false beliefs, indirect requests, and misdirection, though it struggled with detecting faux pas. Street et al. (2025) tested LLMs on higher-order theory-of-mind tasks involving recursive mental state reasoning (e.g., "I think that you believe that she knows") and found that GPT-4 reached adult-level performance overall, exceeding adult performance on sixth-order inferences.

While classic false-belief tasks thus appear to be largely solved by frontier LLMs, debate has shifted to whether this reflects genuine social reasoning or exploitation of surface-level textual patterns. Early work by Ullman (2023) showed that GPT-3.5 failed on trivial alterations to false-belief tasks that humans handle flexibly, though later models have proven more robust to such perturbations. A 2025 commentary responding to Kosinski argued that passing isolated false-belief tasks is insufficient evidence of theory of mind, and that simpler explanations such as associative learning from training data cannot yet be ruled out. A comprehensive survey presented at ACL in 2025 noted that the field has moved beyond simple Sally–Anne-style tasks toward benchmarks covering intentions, desires, emotions, and non-literal communication, and that debate continues over whether LLMs' ToM abilities are genuine or "often superficial and unstable."

References