arxiv:2407.03502

AgentInstruct: Toward Generative Teaching with Agentic Flows

Published on Jul 3, 2024
Submitted by ari9dam on Jul 10, 2024
#2 Paper of the day

Abstract

Synthetic data is becoming increasingly important for accelerating the development of language models, both large and small. Despite several successful use cases, researchers have also raised concerns about model collapse and the drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically creating data with powerful models to teach a new skill or behavior to another model; we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and responses, using only raw data sources like text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post-training dataset of 25M pairs to teach language models different skills, such as text editing, creative writing, tool usage, coding, and reading comprehension. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model, Orca-3, to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks: for example, a 40% improvement on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval. Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.
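
The abstract says AgentInstruct creates both prompts and responses from raw seeds such as text documents and code files. The following is only a minimal sketch of that idea, not the authors' implementation: the `LLM` type, `make_agent`, `generate_pairs`, the role strings, and the `echo_llm` stand-in are all hypothetical names introduced for illustration, and a real flow would call an actual model and likely involve more agents.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Pair:
    """One synthetic training example."""
    prompt: str
    response: str


def make_agent(llm: LLM, role: str) -> Callable[[str], str]:
    """Wrap a model with a fixed role instruction (hypothetical helper)."""
    def agent(text: str) -> str:
        return llm(f"{role}\n\n{text}")
    return agent


def generate_pairs(seeds: List[str], llm: LLM, skill: str) -> List[Pair]:
    """Turn raw seed texts into (prompt, response) pairs for one skill."""
    # Assumed two-agent flow: one agent writes a task, another answers it.
    write_prompt = make_agent(llm, f"Write one {skill} task based on this text.")
    write_response = make_agent(llm, "Answer the following task.")
    pairs = []
    for seed in seeds:
        prompt = write_prompt(seed)
        response = write_response(prompt)
        pairs.append(Pair(prompt, response))
    return pairs


def echo_llm(prompt: str) -> str:
    """Stand-in model for local testing; replace with a real client."""
    return f"[model output for: {prompt[-40:]}]"


pairs = generate_pairs(
    ["Photosynthesis converts light into chemical energy."],
    echo_llm,
    "reading comprehension",
)
```

Note that the seeds here are raw text rather than existing instructions, which is the distinction the abstract draws; the skill name passed in is just one of the skills the abstract lists.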

Community

Paper author Paper submitter

I'm thrilled to announce our latest work on Generative Teaching: generating vast amounts of diverse, high-quality synthetic data for language models to teach a specific skill (e.g. RC, text classification, tool use, math) without the extensive human effort typically required.

Generative Teaching adopts a unique approach to synthetic data generation: instead of expanding the seed set, a method that could potentially lead to benchmark manipulation if the initial seeds are too similar to those from the benchmark, we concentrate on teaching skills.


If so, wouldn't it be better to create both pretraining and fine-tuning datasets based on the prompt and context we provide? It's like a book we don't understand: if someone teaches us to understand it, that would be great. Great work, I hope it gets more appreciation. I've been struggling with junk fine-tuning data.

Please release the dataset, you'd be legends 🙏

Woah!!! Really cool. Agentic flows are the future.

Interesting paper.

The dataset generation flow is somewhat similar to Auto-Evol Instruct (WizardLM-2 model family). Any chance we would see how this Orca-3 approach compares to WizardLM-2?

Anyway, thanks for the amazing work you're all doing!

Paper author

First of all, we are focusing on Generative Teaching, where you specify "what data you need". There is no seed set of instructions. This mimics the scenario we see while shipping SLMs. Multi-agent workflows deliver the data you need, following your specification. AgentInstruct provides a recipe for creating such agentic flows for data creation. The recipe addresses the key questions that one would face in data creation. The setting of Generative Teaching, we believe, is a very important problem for people focusing on synthetic data creation.

Our technique in Orca-Math (Suggester-Editor) is about agent-based expansion of a seed set, which is more similar to Auto-Evol Instruct.

It doesn't say in the paper which LLM you use to power your agents. Was it GPT-4?

Paper author

It's a mix of many GPT-4 versions, including GPT-4-turbo; whichever endpoints we could access.


Any plans to release the code or framework?

Section 2, 'Generative Teaching', discusses the process of generating instructions/prompts, but it does not disclose how the responses are generated. Could you elaborate on that a bit more?

Could you release all the agent flow prompts you used for the 17 skills in training Orca-3?
