Satwik11's picture
Update README.md
184694f verified
|
raw
history blame
4.51 kB
metadata
license: mit
language:
  - en
base_model:
  - Qwen/QwQ-32B-Preview
new_version: Qwen/QwQ-32B-Preview

Evaluation Results

Evaluation Metrics

Groups Version Filter n-shot Metric Direction Value Stderr
mmlu 2 none - acc 0.8034 ±0.0032
    humanities 2 none - acc 0.7275 ±0.0062
    other 2 none - acc 0.8323 ±0.0064
    social sciences 2 none - acc 0.8856 ±0.0056
    stem 2 none - acc 0.8081 ±0.0068

Description

  • mmlu: Overall accuracy across multiple domains.
  • humanities: Accuracy in humanities-related tasks.
  • other: Accuracy in other unspecified domains.
  • social sciences: Accuracy in social sciences-related tasks.
  • stem: Accuracy in STEM (Science, Technology, Engineering, Mathematics) related tasks.

QwQ-32B-Preview-quantized-autoround-GPTQ-sym-4bit

License Stars Downloads

Model Description

QwQ-32B-Preview-quantized-autoround-GPTQ-sym-4bit is a quantized version of the QwQ-32B-Preview model, optimized for efficient inference without significant loss in performance. This model employs AutoRound for quantization, utilizing the GPTQ (Generative Pre-trained Transformer Quantization) method with symmetric 4-bit quantization. The quantization process reduces the model size and computational requirements, making it more suitable for deployment in resource-constrained environments.

Features

  • Quantization Method: AutoRound with GPTQ
  • Bit Precision: 4-bit symmetric quantization
  • Group Size: 128
  • Efficiency: Optimized for low GPU memory usage
  • Compatibility: Compatible with Hugging Face's Transformers library

Intended Uses

  • Natural Language Processing (NLP): Suitable for tasks such as text generation, translation, summarization, and question-answering.
  • Deployment in Resource-Constrained Environments: Ideal for applications requiring efficient inference on devices with limited computational resources.
  • Research and Development: Useful for researchers exploring model compression and quantization techniques.

Note: This model is intended for non-commercial research and experimentation purposes. Users should evaluate the model's performance in their specific use cases before deployment.

Limitations

  • Performance Trade-off: While quantization significantly reduces model size and increases inference speed, it may introduce slight degradations in performance compared to the full-precision version.
  • Compatibility: The quantized model may not be compatible with all libraries and frameworks. Ensure compatibility with your deployment environment.
  • Bias and Fairness: As with all language models, this model may inherit biases present in the training data. Users should be cautious and perform thorough evaluations before deploying in sensitive applications.

Usage Example:

Here's a simple example of how to load and use the quantized model with Hugging Face's Transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

Load the tokenizer

tokenizer = AutoTokenizer.from_pretrained("Satwik11/QwQ-32B-Preview-quantized-autoround-GPTQ-sym-4bit")

Load the quantized model

model = AutoModelForCausalLM.from_pretrained( "Satwik11/QwQ-32B-Preview-quantized-autoround-GPTQ-sym-4bit", load_in_4bit=True, device_map="auto" )

Prepare input

input_text = "Once upon a time" inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

Generate text

outputs = model.generate(**inputs, max_length=50) generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)

Output:

Once upon a time, in a land far away, there lived a...