license: mit
language:
  - en
base_model:
  - meta-llama/Meta-Llama-3-8B
pipeline_tag: text-generation
tags:
  - transformers

SPEED-synthesis-8b-senior

Little Giants: Synthesizing High-Quality Embedding Data at Scale. Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou, arXiv 2024

This is the senior data synthesis model of SPEED.

Usage

Below is an example of synthesizing classification data with this senior generator.

The prompts and miscellaneous scripts can be found on our GitHub page.

Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Helper modules from the SPEED GitHub repository (not part of transformers)
from prompts_synthesis import get_create_classify_data_prompt
from utils import fix_common_json_errors_and_loads


LLAMA3_PROMPT = """
{prompt} [/INST]
""".strip("\n")

# Each query must come with a one-sentence instruction that describes the task
tasks = [
    'Identify the intended age group for educational technology products.',
    'Classify businesses based on their operational hours.'
]
language = 'English'

prompts = [LLAMA3_PROMPT.format(prompt=get_create_classify_data_prompt(task=task, language=language)[1]['content']) for task in tasks]

# Load the tokenizer and model from this repository
tokenizer = AutoTokenizer.from_pretrained('Haon-Chen/speed-synthesis-8b-senior')
model = AutoModelForCausalLM.from_pretrained('Haon-Chen/speed-synthesis-8b-senior')
model.to("cuda:0")
model.eval()
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
tokenizer.padding_side = "left"
tokenizer.truncation_side = "left"

with torch.inference_mode():
    # Tokenize the input texts
    encodes = tokenizer(prompts, padding="longest", add_special_tokens=True, return_tensors="pt")
    input_ids = encodes.input_ids.to(model.device)
    attention_mask = encodes.attention_mask.to(model.device)

    # Set the generation parameters
    GEN_CONFIG = {"do_sample": True, "temperature": 1.0, "top_p": 1.0, "max_new_tokens": 800}
    output = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        **GEN_CONFIG
    )
# Decode only the newly generated tokens; with left padding they start right after the prompt length
new_tokens = output[:, input_ids.shape[1]:]
output_texts = tokenizer.batch_decode(new_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False)
batch_results = [text.strip(' ') for text in output_texts]

# Parse each generation into a training example; count the outputs that fail to parse
bad_cnt = 0
outputs = []
for i, result in enumerate(batch_results):
    try:
        parsed = fix_common_json_errors_and_loads(result)
        user_query = parsed.get("input_text", "")
        positive_document = parsed.get("label", "")
        hard_negative_document = parsed.get("misleading_label", "")
    except Exception:
        bad_cnt += 1
        continue
    out_data = {
        "query": user_query,
        "positives": [positive_document],
        "negatives": [hard_negative_document],
        "language": language,
        "task_definition": tasks[i],
    }
    outputs.append(out_data)
print(bad_cnt)
print(outputs)
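
The example above relies on fix_common_json_errors_and_loads from the repository's utils module, which is not shown here. For readers who only want to see the idea, the following is a minimal sketch of a lenient JSON parser. The function name loads_lenient, the two repairs it applies (dropping text around the object and removing trailing commas), and the sample string are assumptions made for illustration; this is not the repository's implementation and may behave differently from it.

```python
import json
import re


def loads_lenient(text: str) -> dict:
    """Parse the first JSON object in `text`, tolerating two common slips.

    Hypothetical stand-in for illustration only; not the SPEED helper.
    """
    # Keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    candidate = text[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Drop trailing commas before a closing brace or bracket, then retry.
        repaired = re.sub(r",\s*([}\]])", r"\1", candidate)
        return json.loads(repaired)


# Made-up model output with leading chatter and a trailing comma.
raw = 'Sure! {"input_text": "Tablet for ages 6-8", "label": "Children", "misleading_label": "Teens",}'
record = loads_lenient(raw)
print(record["label"])
```

Outputs that still fail to parse raise an exception, so they would be counted by bad_cnt in the loop above if this function were swapped in.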

Citation

If you find our paper or models helpful, please consider citing them as follows:

@article{chen2024little,
  title={Little Giants: Synthesizing High-Quality Embedding Data at Scale},
  author={Chen, Haonan and Wang, Liang and Yang, Nan and Zhu, Yutao and Zhao, Ziliang and Wei, Furu and Dou, Zhicheng},
  journal={arXiv preprint arXiv:2410.18634},
  year={2024}
}
