Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Abstract
Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation space impact LLMs' abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs' performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs' reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.
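To make the two settings concrete, here is a minimal sketch of free-form versus format-restricted prompting, assuming a generic `chat(prompt) -> str` helper for whatever LLM API is in use; the helper and the exact prompt wording are illustrative, not the paper's actual prompts.

```python
# Illustrative sketch of the two generation settings the paper compares.
# `chat` is a placeholder for a real LLM client call (e.g., OpenAI, a local model).

QUESTION = (
    "Natalia sold clips to 48 friends in April, and half as many in May. "
    "How many clips did she sell altogether?"
)

# Free-form (natural language) setting: the model may reason at will.
free_form_prompt = f"{QUESTION}\nThink step by step, then state the final answer."

# Format-restricted setting: the answer must fit a fixed JSON schema,
# which constrains where (and whether) intermediate reasoning can appear.
json_prompt = (
    f"{QUESTION}\n"
    "Respond with JSON only, matching this schema: "
    '{"reason": "<string>", "answer": "<number>"}'
)

def chat(prompt: str) -> str:
    """Placeholder for an actual LLM call."""
    raise NotImplementedError

# free_text = chat(free_form_prompt)   # parsed afterwards (regex or a second LLM)
# structured = chat(json_prompt)       # parsed with json.loads(...)
```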
Community
We recently added gpt-4o-mini-2024-07-18 results with OpenAI's latest structured outputs API in our v3 edition.
Hi! This is Will from the .txt team. Given our focus on structured generation, we took the claims in this paper quite seriously and investigated what happened to produce these surprising results, which contradicted our own past experiments. Here is our response: Say What You Mean: A Response to 'Let Me Speak Freely'.
For a tl;dr, here are the main points we bring up (points 2 and 3 are particularly concerning):
- The paper itself finds that structured generation has superior performance on a number of classification tasks.
- The prompts used for unstructured (NL) generation are markedly different from those used for structured generation, so the comparisons are not apples-to-apples to begin with.
- The structured generation prompts do not provide the model with adequate information to solve the task, which leads to particularly poor performance for the ‘json-mode’ examples.
- The real meat of the paper is actually about parsing the results of one LLM with a second LLM. The authors refer to this as the “Perfect Text Parser” (see the sketch after this list).
- The paper confuses structured generation with JSON-mode, although independent runs of these evals show that “JSON-mode” yields better results than unstructured generation.
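For readers unfamiliar with the “Perfect Text Parser” setup: it is a two-stage pipeline in which a second LLM extracts the final answer from the first model's free-form output. Below is a hedged sketch of that idea, reusing the placeholder `chat` helper from above; the extraction prompt wording is illustrative, not the paper's exact parser prompt.

```python
def perfect_text_parser(raw_response: str, chat) -> str:
    """Second-stage extraction: a second LLM reads the first model's
    free-form output and pulls out only the final answer.
    `chat` is a placeholder LLM helper (prompt -> completion string)."""
    extraction_prompt = (
        "Below is a model's response to a question. "
        "Reply with only the final answer, nothing else.\n\n"
        f"{raw_response}"
    )
    return chat(extraction_prompt).strip()

# Usage: answer = perfect_text_parser(free_text, chat)
# The .txt response argues that much of the reported NL-vs-structured gap
# is really measuring this extraction step rather than generation quality.
```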
I think this simple result says it all: yes, the prompt matters, but 0-shot CoT is already enough to prove the point (prompt sketch below the table).
| Last Letter (Llama 3 Instruct) | 0-shot CoT | 1-shot CoT (used in blog) | .txt reported best: JSON (struct) 1-shot CoT |
|---|---|---|---|
| lastletter-t3-f3 | 78.00* | 57.33* | 77.00 (T4-F1) |
| Average of 9 prompts | 70.07* | 44.64* | - |
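For context, here is a sketch of the structural difference between the 0-shot and 1-shot CoT prompt styles in the table, for the Last Letter task (concatenate the last letter of each word). The repo's actual t3-f3 / T4-F1 templates differ in wording; this only illustrates the format contrast.

```python
WORDS = "Elon Musk Bill Gates"

# 0-shot CoT: no worked example, just a step-by-step instruction.
zero_shot_cot = (
    f'Take the last letter of each word in "{WORDS}" and concatenate them. '
    "Let's think step by step, then give the final answer."
)

# 1-shot CoT: one worked example precedes the actual question.
one_shot_cot = (
    'Q: Take the last letter of each word in "Tim Cook" and concatenate them.\n'
    'A: The last letter of "Tim" is "m". The last letter of "Cook" is "k". '
    'Concatenating them gives "mk". The answer is mk.\n\n'
    f'Q: Take the last letter of each word in "{WORDS}" and concatenate them.\n'
    "A:"
)
```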
For more details, see this updated note: https://github.com/appier-research/structure-gen/blob/main/updates.md
We included JSON structured generation with averaged results from different prompts, and it's still worse.