arxiv:2309.09530

Adapting Large Language Models via Reading Comprehension

Published on Sep 18, 2023

Submitted by akhaliq on Sep 19, 2023

#2 Paper of the day

Authors: Daixuan Cheng, Furu Wei

Abstract

We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge but drastically hurts its prompting ability for question answering. Taking inspiration from human learning via reading comprehension (practice after reading improves the ability to answer questions based on the learned knowledge), we propose a simple method for transforming raw corpora into reading comprehension texts: each raw text is enriched with a series of tasks related to its content. Our method, highly scalable and applicable to any pre-training corpus, consistently enhances performance across various tasks in three domains: biomedicine, finance, and law. Notably, our 7B language model achieves performance competitive with domain-specific models of much larger scale, such as BloombergGPT-50B. Furthermore, we demonstrate that domain-specific reading comprehension texts can improve the model's performance even on general benchmarks, showing the potential to develop a general model across even more domains. Our model, code, and data will be available at https://github.com/microsoft/LMOps.
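
The idea of enriching each raw text with tasks related to its content can be illustrated with a minimal sketch. This is not the authors' implementation: the two task builders (`completion_task`, `connective_task`), the regex patterns, and the `Question:`/`Answer:` layout are hypothetical choices made only to show the shape of the transformation, in which the raw passage is kept intact and followed by question-answer pairs derived from it.

```python
import re
from typing import List, Optional, Tuple

Task = Tuple[str, str]  # (question, answer)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def completion_task(sentences: List[str]) -> Optional[Task]:
    """Hypothetical task: ask for the sentence that follows the first one."""
    if len(sentences) < 2:
        return None
    question = f'How does the passage continue after: "{sentences[0]}"'
    return question, sentences[1]


def connective_task(sentences: List[str]) -> Optional[Task]:
    """Hypothetical task: a sentence opening with 'Therefore,' (or similar)
    is treated as a consequence of the sentence before it."""
    for prev, cur in zip(sentences, sentences[1:]):
        match = re.match(r"(?:Therefore|Thus|Hence),\s+(.*)", cur)
        if match:
            return f'What follows from: "{prev}"', match.group(1)
    return None


def to_reading_comprehension(raw_text: str) -> str:
    """Append every task that could be mined to the unchanged raw text."""
    sentences = split_sentences(raw_text)
    candidates = (completion_task(sentences), connective_task(sentences))
    blocks = [raw_text.strip()]
    for task in candidates:
        if task is not None:
            question, answer = task
            blocks.append(f"Question: {question}\nAnswer: {answer}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    example = (
        "Aspirin inhibits platelet aggregation. "
        "Therefore, it lowers the risk of clotting."
    )
    print(to_reading_comprehension(example))
```

Because the tasks are derived from the text itself with simple rules, a transformation of this kind needs no human annotation, which is consistent with the scalability the abstract claims; a text from which no task can be mined is passed through unchanged.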

Community

daixuancheng (paper author), Jul 12, 2024, edited Nov 30, 2024

[2024/11/29] 🤗 Introducing AdaMLLM, the multimodal version of AdaptLLM, for adapting MLLMs to domains 🤗

**************************** Updates ****************************

2024/11/29: Released AdaMLLM for adapting MLLMs to domains
2024/9/20: Our research paper for Instruction-Pretrain has been accepted by EMNLP 2024
2024/8/29: Updated guidelines on evaluating any 🤗 Hugging Face model on the domain-specific tasks
2024/6/22: Released the benchmarking code
2024/6/21: Released the general version of AdaptLLM at Instruction-Pretrain
2024/4/2: Released the raw data splits (train and test) of all the evaluation datasets
2024/1/16: Our research paper for AdaptLLM has been accepted by ICLR 2024
2023/12/19: Released our 13B base models developed from LLaMA-1-13B
2023/12/8: Released our chat models developed from LLaMA-2-Chat-7B
2023/9/18: Released our paper, code, data, and base models developed from LLaMA-1-7B