As the number of datasets for fine-tuning chat models has grown, a plethora of dataset formats has emerged. The most popular include the formats used by the Alpaca, ShareGPT and Open Assistant datasets. The datasets and their formats have also evolved from single-turn to multi-turn conversations. Many of these formats share similarities (and they all have the same goal), but handling the variations across datasets is often a hassle and a source of potential bugs.
Luckily, the community seems to be converging on a simple and elegant chat dataset format: each record is a list of conversation turns, and each turn is an object with a role (system, assistant or user) and content. Hugging Face uses this input format in the [Templates for Chat Models](https://huggingface.co/docs/transformers/main/en/chat_templating#how-do-i-use-chat-templates) docs:
```python
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
```
Popular datasets like HuggingFaceH4/no_robots follow this format.
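For datasets that don't follow it yet, the conversion is usually a small mapping. As a rough sketch, the two helpers below turn Alpaca-style and ShareGPT-style records into a list of message turns. The field names (`instruction`, `input`, `output`, `conversations`, `from`, `value`) and the speaker labels are assumptions based on commonly published versions of those datasets, and the function names are made up for illustration; check them against the dataset you actually have.

```python
from typing import Any

# Assumed mapping from ShareGPT speaker labels to roles; other dumps may use different labels.
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def alpaca_to_messages(record: dict[str, Any]) -> list[dict[str, str]]:
    """Convert one single-turn Alpaca-style record into a list of message turns."""
    prompt = record["instruction"]
    # The optional "input" field carries extra context for the instruction.
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": record["output"]},
    ]


def sharegpt_to_messages(record: dict[str, Any]) -> list[dict[str, str]]:
    """Convert one multi-turn ShareGPT-style record into a list of message turns."""
    # An unknown speaker label raises KeyError, which surfaces bad data early.
    return [
        {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]
```

Both helpers return only the list of turns, so the result can be stored directly under a `messages` key for each row.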
To encourage usage of this format, I propose we give it a name: Hugging Face MessagesList format.
The format is defined as:
- The dataset has at least one `messages` column of type list.
- Each `messages` record is an array containing one or more message turn objects.
- A message turn must have `role` and `content` keys.
  - `role` should be one of `system`, `assistant` or `user`.
  - `content` is the text content of the message.

This may be a small thing, but having a common dataset format will reduce time wasted on data wrangling and help everyone.
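The rules in the format definition are simple enough to check mechanically. Here is a minimal sketch of a validator for a single `messages` record, using only the standard library. The name `validate_messages` and the wording of the error strings are my own, and the sketch treats the "should be one of" rule for `role` as a hard requirement, which is stricter than the definition strictly demands.

```python
from typing import Any

ALLOWED_ROLES = {"system", "assistant", "user"}


def validate_messages(messages: Any) -> list[str]:
    """Return the problems found in one `messages` record (empty list if valid)."""
    # A record must be an array with one or more message turns.
    if not isinstance(messages, list) or not messages:
        return ["messages must be a non-empty list"]

    problems = []
    for i, turn in enumerate(messages):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        # Every turn needs both keys.
        for key in ("role", "content"):
            if key not in turn:
                problems.append(f"turn {i}: missing key '{key}'")
        # Assumption: an unknown role is reported as an error rather than a warning.
        role = turn.get("role")
        if "role" in turn and (not isinstance(role, str) or role not in ALLOWED_ROLES):
            problems.append(f"turn {i}: unexpected role {role!r}")
        if "content" in turn and not isinstance(turn["content"], str):
            problems.append(f"turn {i}: content is not text")
    return problems
```

Running this over every row of a dataset before training is a cheap way to catch a stray `human` or `gpt` role left over from another format.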