yujiepan/falcon-40b-awq-w4g128

yujiepan committed on Mar 18, 2024

Commit 927ea0e • 1 Parent(s): 3acdc17

Create README.md

Files changed (1)

README.md +53 -0

README.md ADDED

	@@ -0,0 +1,53 @@

+---
+pipeline_tag: text-generation
+inference: true
+widget:
+- text: 'Hello!'
+  example_title: Hello world
+  group: Python
+library_name: transformers
+---
+# yujiepan/falcon-40b-awq-w4g128
+This model is a quantized version of [tiiuae/falcon-40b](https://huggingface.co/tiiuae/falcon-40b), produced with AutoAWQ using 4-bit weights, group_size=128, and zero_point=True.
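To make the settings above concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization with a zero point. It is an illustration under stated assumptions, not AutoAWQ's implementation: it shows only plain min/max quantization per group and omits AWQ's activation-aware scaling entirely. The function names (`quantize_group`, `dequantize_group`, `quantize_row`) and the sample data are hypothetical.

```python
import math

def quantize_group(weights, n_bits=4):
    """Quantize one group of weights to unsigned n-bit codes with a zero point."""
    q_max = 2 ** n_bits - 1                      # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = max(w_max - w_min, 1e-5) / q_max     # guard against constant groups
    zero_point = min(q_max, max(0, round(-w_min / scale)))
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point

def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]

def quantize_row(row, group_size=128, n_bits=4):
    """Split a weight row into groups; each group gets its own scale and zero point."""
    return [
        quantize_group(row[i:i + group_size], n_bits)
        for i in range(0, len(row), group_size)
    ]

# Hypothetical weight row: 256 values, so two groups of 128.
row = [math.sin(0.1 * i) for i in range(256)]
groups = quantize_row(row)
restored = [w for codes, scale, zp in groups for w in dequantize_group(codes, scale, zp)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.4f}")
```

Because every group of 128 weights carries its own scale and zero point, the stored model costs slightly more than 4 bits per weight; smaller groups generally trade more of that overhead for lower quantization error.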
+## Accuracy
+| task                                                | tiiuae/falcon-40b (fp16) | this repo |
+|-----------------------------------------------------|--------------------------|-----------|
+| wikitext perplexity by lm_harness (lower is better) | 8.410                    | 8.497     |
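As a quick reading aid for the table, the gap between the two perplexity figures can be expressed as an absolute and a relative increase. The snippet below only does arithmetic on the two reported numbers; the variable names are illustrative.

```python
# Perplexity values copied from the accuracy table above.
fp16_ppl = 8.410
awq_ppl = 8.497

abs_increase = awq_ppl - fp16_ppl
rel_increase = abs_increase / fp16_ppl * 100

print(f"absolute: +{abs_increase:.3f}, relative: +{rel_increase:.2f}%")
# prints "absolute: +0.087, relative: +1.03%"
```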
+## Usage
+```python
+from awq import AutoAWQForCausalLM
+from transformers import AutoTokenizer
+model_name_or_path = "yujiepan/falcon-40b-awq-w4g128"
+# Load the quantized model and its tokenizer
+model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=False, trust_remote_code=False)
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+# Tokenize the prompt and move the input IDs to the GPU
+prompt = "Tell me about AI"
+tokens = tokenizer(
+    prompt,
+    return_tensors='pt'
+).input_ids.cuda()
+# Generate up to 10 new tokens with sampling
+generation_output = model.generate(
+    tokens,
+    do_sample=True,
+    temperature=0.7,
+    top_p=0.95,
+    top_k=40,
+    max_new_tokens=10,
+)
+print("Output: ", tokenizer.decode(generation_output[0]))
+```