tags:
  - fp8
  - vllm

See the original model card for information about how the model was made: https://huggingface.co/alpindale/magnum-72b-v1

This repository exists to enable fast FP8 inference on Hopper-class hardware. I quantized the model to FP8 on 4x L40s using the Neural Magic AutoFP8 code shown under "Usage and Creation".

Magnum-72b-v1-FP8

Model Overview

  • Model Architecture: Based on and identical to the Qwen2-72B-Instruct architecture
  • Model Optimizations: Weights and activations quantized to FP8
  • Release Date: June 25, 2024

This is Magnum-72B-v1 with weights and activations quantized to FP8 using per-tensor quantization through the AutoFP8 repository. It is ready for inference with vLLM >= 0.5.0. The quantization was calibrated with 512 UltraChat samples for better performance recovery. Part of the FP8 LLMs for vLLM collection.
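
To make "per-tensor quantization" concrete, here is a minimal pure-Python sketch of the idea: a single scale is chosen for the whole tensor so that its largest magnitude maps to the largest FP8 value, and every element is then rounded to the FP8 grid. This is an illustration only, not AutoFP8's implementation. It assumes the E4M3 FP8 format (maximum finite value 448), and the function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format (assumed format)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    # Exponents below -6 fall in the subnormal range, which shares one step size.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_tensor(values):
    """Quantize a flat list of floats with one shared scale (per-tensor)."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map FP8 grid values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.02, -1.5, 0.75, 3.0]  # toy "tensor" for illustration
q, scale = quantize_per_tensor(weights)
print(q, scale)
print(dequantize(q, scale))
```

With static activation quantization, the same kind of scale is computed for activations ahead of time from calibration data rather than at inference time, which is why the calibration samples matter.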

Usage and Creation

Produced using AutoFP8 with calibration samples from UltraChat:

from datasets import load_dataset
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "alpindale/magnum-72b-v1"
quantized_model_dir = "Magnum-72B-FP8"

# Cap calibration sequences at 4096 tokens and reuse EOS as the padding token.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=4096)
tokenizer.pad_token = tokenizer.eos_token

# Take 512 UltraChat conversations, render them with the chat template,
# then tokenize them into one padded batch on the GPU.
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# "static" means activation scales are fixed from the calibration data.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
# Run calibration and quantize, then write the FP8 checkpoint to disk.
model.quantize(examples)
model.save_quantized(quantized_model_dir)
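
For the usage side, a minimal sketch of loading the saved checkpoint with vLLM's offline `LLM` API might look like the following. The local path matches `quantized_model_dir` from the script above. The `tensor_parallel_size=4`, the sampling settings, and the helper name `generate_text` are assumptions for illustration and should be adjusted to your hardware.

```python
# Directory written by model.save_quantized(...) in the script above.
MODEL_DIR = "Magnum-72B-FP8"


def generate_text(prompt, model_dir=MODEL_DIR, tensor_parallel_size=4, max_tokens=128):
    """Load the FP8 checkpoint with vLLM and return one completion for `prompt`.

    tensor_parallel_size=4 is an assumed default; set it to your GPU count.
    """
    # Imported lazily so this module can be imported without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_dir, tensor_parallel_size=tensor_parallel_size)
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    # One prompt in, one RequestOutput back; take its first completion.
    return outputs[0].outputs[0].text
```

Calling `generate_text("Write a haiku about quantization.")` requires GPUs with enough memory for the 72B FP8 weights. The prompt is passed through as raw text here. For chat-style use, it can first be rendered with `tokenizer.apply_chat_template`, as in the calibration script.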