arxiv:2408.02442

Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

Published on Aug 5, 2024

Abstract

Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation space impact LLMs' abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs' performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs' reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.

Community

We recently added gpt-4o-mini-2024-07-18 results with OpenAI's latest Structured Outputs API in our v3 edition.

Hi! This is Will from the .txt team. Given our focus on structured generation, we took the claims in this paper quite seriously and investigated what produced these surprising results, which contradicted our own past experiments. Here is our response: Say What You Mean: A Response to 'Let Me Speak Freely'.

For a tl;dr, here are the main points we raise (points 2 and 3 are particularly concerning):

  1. The paper itself finds that structured generation has superior performance on a number of classification tasks.
  2. The prompts used for unstructured (NL) generation are markedly different from the ones used for structured generation, so the comparisons are not apples-to-apples to begin with.
  3. The structured generation prompts do not provide the model with adequate information to solve the task, which leads to particularly poor performance on the 'json-mode' examples.
  4. The real meat of the paper is actually about parsing the results of one LLM with a second LLM. The authors refer to this as the “Perfect Text Parser”.
  5. The paper confuses structured generation with JSON-mode, although independent runs of these evals show that “JSON-mode” yields better results than unstructured generation.
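Point 2 above concerns prompt parity. A minimal sketch of what an apples-to-apples setup could look like, where the two prompts share the same task instruction and differ only in the required output format (the prompt wording here is hypothetical, taken from neither the paper nor the response):

```python
TASK = "Take the last letter of each word in the name and concatenate them."

def free_form_prompt(name: str) -> str:
    # Free-form (NL) variant: the model may reason and answer in plain text.
    return (
        f"{TASK}\n"
        f"Name: {name}\n"
        "Think step by step, then give the final answer."
    )

def json_prompt(name: str) -> str:
    # Structured variant: identical task instruction, plus a schema that
    # still leaves room for reasoning (a 'reasoning' field before 'answer').
    return (
        f"{TASK}\n"
        f"Name: {name}\n"
        'Respond in JSON with keys "reasoning" and "answer".'
    )
```

Holding the instruction constant like this isolates the effect of the format constraint itself, which is the variable both the paper and the response are arguing about.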

I think this simple result says it all: yes, the prompt matters, but 0-shot CoT is already enough to prove the point.

| Last Letter (Llama 3 Instruct) | 0-shot CoT | 1-shot CoT (used in blog) | .txt reported best: JSON (struct) 1-shot CoT |
|---|---|---|---|
| lastletter-t3-f3 | 78.00* | 57.33* | 77.00 (T4-F1) |
| Average of 9 prompts | 70.07* | 44.64* | - |
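For reference, the last-letter task scored in the table has a mechanically checkable ground truth, so scoring a run is straightforward; a small sketch (the helper names are my own, not from the eval repo):

```python
def last_letter_answer(name: str) -> str:
    # Ground truth: concatenate the last letter of each
    # whitespace-separated word in the name.
    return "".join(word[-1] for word in name.split())

def accuracy(predictions: list[str], names: list[str]) -> float:
    # Percentage of predictions matching the ground truth,
    # ignoring case and surrounding whitespace.
    correct = sum(
        pred.strip().lower() == last_letter_answer(name).lower()
        for pred, name in zip(predictions, names)
    )
    return 100.0 * correct / len(names)
```

For example, `last_letter_answer("Ada Lovelace")` returns `"ae"`.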

For more details, see this updated note: https://github.com/appier-research/structure-gen/blob/main/updates.md

We included JSON structured generation with results averaged over different prompts, and it is still worse.

