arxiv:2405.18870

LLMs achieve adult human performance on higher-order theory of mind tasks

Published on May 29, 2024 · Submitted by akhaliq on May 30, 2024
#3 Paper of the day

Abstract

This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM), the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). This paper builds on prior work by introducing a handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.

Community

I still struggle to see how this alone would be useful when assisting users. I have trouble recalling the last time I had to think about tasks that the paper claims the models excel at. Maybe I'm missing some larger picture...

Think about solving games. A model with these abilities could be great at playing poker or other head-to-head games.

I think the idea is that LLMs may drive robotics one day, in which case a robot would benefit from having a well-developed higher-order theory of mind. This would help autonomous robots when interacting in social situations.

Check out the recent results on our benchmark FANToM as well, which was presented at EMNLP 2023.
We stress-test the SOTA LLMs, such as GPT-4o, Gemini-1.5, Llama3, Mixtral, and Claude.
They are nowhere near human performance, but they are still improving!
https://github.com/skywalker023/fantom?tab=readme-ov-file#-latest-results

This is interesting as there are more models considered, and the models are a bit more relevant. Thanks for sharing!
