joanllop committed
Commit bd317e4 · verified · 1 Parent(s): 13c5593

Update README.md

Files changed (1)
  1. README.md +135 -1
README.md CHANGED
@@ -134,7 +134,141 @@ The accelerated partition is composed of 1,120 nodes with the following specific

  ## How to use

- <span style="color:red">TODO</span>
+ ### Inference
+ This section covers several ways to run inference: Hugging Face's text-generation pipeline, single- and multi-GPU setups with AutoModel, and vLLM for efficient, scalable generation. Each approach comes with step-by-step instructions.
+
+ #### Inference with Hugging Face's Text Generation Pipeline
+ The Hugging Face text-generation pipeline provides a straightforward way to run inference with the Salamandra-7b model.
+
+ ```bash
+ pip install -U transformers
+ ```
+ <details>
+ <summary>Show code</summary>
+
+ ```python
+ from transformers import pipeline, set_seed
+
+ model_id = "projecte-aina/salamandra-7b"
+
+ # Sample prompts
+ prompts = [
+     "Las fiestas de San Isidro Labrador de Yecla son",
+     "El punt més alt del Parc Natural del Montseny és",
+     "Sentence in English: The typical chance of such a storm is around 10%. Sentence in Catalan:",
+     "Si le monde était clair",
+     "The future of AI is",
+ ]
+
+ # Create the pipeline
+ generator = pipeline("text-generation", model_id, device_map="auto")
+ generation_args = {
+     "temperature": 0.1,
+     "top_p": 0.95,
+     "max_new_tokens": 25,
+     "repetition_penalty": 1.2,
+     "do_sample": True
+ }
+
+ # Fix the seed for reproducibility
+ set_seed(1)
+ # Generate texts
+ outputs = generator(prompts, **generation_args)
+ # Print outputs
+ for output in outputs:
+     print(output[0]["generated_text"])
+ ```
+ </details>
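
The pipeline above loads the model weights in full precision by default. If GPU memory is tight, loading in half precision is a small change; the sketch below is illustrative and simply reuses the bfloat16 setting from the AutoModel example further down.

```python
import torch
from transformers import pipeline

model_id = "projecte-aina/salamandra-7b"

# Same text-generation pipeline, but loaded in bfloat16 to roughly halve GPU memory use
generator = pipeline(
    "text-generation",
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

print(generator("El mercat del barri és", max_new_tokens=25)[0]["generated_text"])
```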
+
+ #### Inference with single / multi GPU
+ Inference code using Hugging Face's AutoModel classes. With device_map="auto", the model is placed automatically across the available GPUs, so the same code covers single- and multi-GPU setups.
+
+ ```bash
+ pip install transformers torch accelerate sentencepiece protobuf
+ ```
+
+ <details>
+ <summary>Show code</summary>
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ model_id = "projecte-aina/salamandra-7b"
+
+ # Input text
+ text = "El mercat del barri és"
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ # Load the model in bfloat16, letting Accelerate choose the device placement
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto",
+     torch_dtype=torch.bfloat16
+ )
+
+ generation_args = {
+     "temperature": 0.1,
+     "top_p": 0.95,
+     "max_new_tokens": 25,
+     "repetition_penalty": 1.2,
+     "do_sample": True
+ }
+
+ # Tokenize the input and move it to the model's device
+ inputs = tokenizer(text, return_tensors="pt").to(model.device)
+ # Generate text
+ output = model.generate(**inputs, **generation_args)
+ # Print the output
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
+ ```
+
+ </details>
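
To generate for several prompts in one batch with AutoModel, the tokenizer needs a padding token, which base causal LMs often do not define. The sketch below is illustrative: it assumes EOS can stand in as the pad token and reuses the model ID and sampling values from the examples above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "projecte-aina/salamandra-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Assumption: no dedicated pad token, so reuse EOS and pad on the left for decoder-only generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

prompts = [
    "El mercat del barri és",
    "The future of AI is",
]

# Pad the batch to a common length and move it to the model's device
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=25,
    do_sample=True,
    temperature=0.1,
    top_p=0.95,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,
)
for sequence in outputs:
    print(tokenizer.decode(sequence, skip_special_tokens=True))
```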
+
+ #### Inference with vLLM
+ vLLM is an efficient inference library that enables faster and more scalable text generation.
+
+ ```bash
+ pip install vllm
+ ```
+
+ <details>
+ <summary>Show code</summary>
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ model_id = "projecte-aina/salamandra-7b"
+
+ # Sample prompts
+ prompts = [
+     "Las fiestas de San Isidro Labrador de Yecla son",
+     "El punt més alt del Parc Natural del Montseny és",
+     "Sentence in English: The typical chance of such a storm is around 10%. Sentence in Catalan:",
+     "Si le monde était clair",
+     "The future of AI is",
+ ]
+
+ # Create a sampling params object
+ sampling_params = SamplingParams(
+     temperature=0.1,
+     top_p=0.95,
+     seed=1,
+     max_tokens=25,
+     repetition_penalty=1.2)
+
+ # Create an LLM
+ llm = LLM(model=model_id)
+ # Generate texts
+ outputs = llm.generate(prompts, sampling_params)
+ # Print outputs
+ for output in outputs:
+     prompt = output.prompt
+     generated_text = output.outputs[0].text
+     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+ ```
+
+ </details>
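
On a node with several GPUs, vLLM can also shard the model with tensor parallelism. The sketch below is illustrative; tensor_parallel_size=2 is an arbitrary example value and should match the number of GPUs actually available.

```python
from vllm import LLM, SamplingParams

model_id = "projecte-aina/salamandra-7b"

# Shard the model across 2 GPUs (example value; set tensor_parallel_size to the number of visible GPUs)
llm = LLM(model=model_id, tensor_parallel_size=2)

sampling_params = SamplingParams(temperature=0.1, top_p=0.95, max_tokens=25)
for output in llm.generate(["El mercat del barri és"], sampling_params):
    print(output.outputs[0].text)
```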

  ---