nintwentydo
committed on
Update README.md
README.md
CHANGED
@@ -22,22 +22,64 @@ pipeline_tag: image-text-to-text
# Pixtral-Large-Instruct-2411 🧡

Transformers implementation of [Pixtral-Large-Instruct-2411](https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411).

***21 Dec 2024:** This model has been a LOT of fun to experiment and learn with. The model card below has been updated with the changes made to this repo over the last week.*

## Architecture Differences to Pixtral 12B
Pixtral 12B has bias keys for the multi_modal_projector layers, whereas Pixtral Large does not. Instead of including them with low/zero values, this conversion omits those bias keys, matching the keys present in the original Pixtral Large upload from Mistral. The model's config.json file includes `"multimodal_projector_bias": false` to flag this. *n.b. If anyone in the community confirms that initializing these keys with zero values is the better way to go, I'm happy to reupload with them included.*
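
As a rough illustration of the point above, the following Python sketch checks whether a checkpoint's key names agree with the `multimodal_projector_bias` flag in config.json. The function names, the example key names, and the assumption that projector bias keys contain `multi_modal_projector` and end in `.bias` are mine for illustration; they are not taken from this repo's code.

```python
def projector_bias_keys(keys):
    """Return the key names that look like multi_modal_projector bias tensors."""
    return [k for k in keys if "multi_modal_projector" in k and k.endswith(".bias")]


def check_projector_bias(config, keys):
    """Return True when the checkpoint keys agree with the config flag.

    The flag defaults to True here, on the assumption that a config
    without the field describes a projector that has biases.
    """
    expects_bias = config.get("multimodal_projector_bias", True)
    has_bias = bool(projector_bias_keys(keys))
    return expects_bias == has_bias


# Hypothetical key names, for illustration only.
example_config = {"multimodal_projector_bias": False}
example_keys = [
    "multi_modal_projector.linear_1.weight",
    "multi_modal_projector.linear_2.weight",
]
print(check_projector_bias(example_config, example_keys))
```

A check like this could be run over the key names of a downloaded checkpoint before loading, to catch a mismatch between the flag and the weights early.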

## Tokenizer
This model uses a conversion of the Mistral v7m1 tokenizer. Pixtral 12B and Pixtral Large use different tokenizers with different vocab sizes, so make sure you use the right one.

## Prompting / Chat Template
The included chat_template.json supports all of Mistral's defined features, plus some additions of my own.

I believe this implementation gives quite a lot of flexibility for using the model, and it has worked quite well in my testing.

Example *(line breaks added for readability)*:
```
<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT]
[INST] [IMG]<user message>
[AVAILABLE_TOOLS] [<tool definitions>][/AVAILABLE_TOOLS][/INST]
[IMG]<assistant response>
[TOOL_CALLS] [<tool calls>][/TOOL_CALLS]
[TOOL_RESULTS] <tool results including images>[/TOOL_RESULTS]
</s>[INST] <user message>[/INST]
```

**System Prompts**:
Messages with role "system" will be parsed as `[SYSTEM_PROMPT] <content>[/SYSTEM_PROMPT]` anywhere they appear in chat history.

This appears to work pretty well for passing extra instructions at various depths, and keeps instructions separate from conversation.

**Allowing Non-Alternating Roles**:
Multiple user messages in a row can be provided, and each will be separated with `[INST][/INST]`. This could work well in group conversation settings, or in environments where multiple user messages can be provided before the model is invoked. Having a `[/INST]` breaking each one up appeared to help prevent the model from thinking it needs to respond to every previous message, letting it focus on the last message while still retaining knowledge of the messages before it.
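
To make the two behaviours above concrete, here is a minimal Python sketch that renders text-only messages in the layout shown in the example. It is my approximation for illustration, not the logic of the included chat_template.json: the function name is hypothetical, exact whitespace is an assumption, and images and tools are left out.

```python
def render_prompt(messages):
    """Render text-only chat messages into a single prompt string.

    Approximates the layout from the example above; the real
    chat_template.json remains the authoritative definition.
    """
    parts = ["<s>"]
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # System messages are wrapped wherever they appear in the history.
            parts.append(f"[SYSTEM_PROMPT] {content}[/SYSTEM_PROMPT]")
        elif role == "user":
            # Each user message gets its own [INST]...[/INST] pair,
            # so consecutive user messages stay separated.
            parts.append(f"[INST] {content}[/INST]")
        elif role == "assistant":
            parts.append(f"{content}</s>")
        else:
            raise ValueError(f"unsupported role in this sketch: {role}")
    return "".join(parts)


chat = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi from Alice"},
    {"role": "user", "content": "Hi from Bob"},
    {"role": "assistant", "content": "Hello both."},
    {"role": "system", "content": "Answer in French from now on."},
    {"role": "user", "content": "How are you?"},
]
print(render_prompt(chat))
```

Note how the second system message lands mid-conversation and the two consecutive user messages each keep their own closing `[/INST]`.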

**Image Inputs Everywhere**:
Images can now be sent in user, assistant, and tool result messages, and this seems to actually work. In testing, I included an image in an assistant reply 10-15 messages back in the conversation, then asked the assistant to recall the image it had previously sent, and it was able to accurately describe it.

Having this flexibility could allow for interesting applications. For example, if you were to define a tool for image generation:
- the tool is invoked and calls an image generation API/model
- the image is returned inside the tool result message
- the model responds with a message that has context of the generated image
- you can have further conversation about the generated image, or make revisions, with the model actually knowing what was created
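
The flow in the list above could be sketched roughly as follows. This is only an illustration of the message sequence: the field names follow a common OpenAI-style chat schema, which is my assumption and not something confirmed by this repo, and `generate_image` is a hypothetical callable standing in for whatever image API or model you use.

```python
def image_tool_round_trip(history, prompt, generate_image):
    """Return a new history with a tool call and its image-bearing result appended.

    `generate_image` is a hypothetical callable that takes a text prompt
    and returns a reference to the generated image (path, URL, or bytes).
    """
    call_id = f"call_{len(history)}"
    tool_call = {
        "id": call_id,
        "type": "function",
        "function": {"name": "generate_image", "arguments": {"prompt": prompt}},
    }
    image = generate_image(prompt)
    return history + [
        # The assistant turn that requests the tool.
        {"role": "assistant", "content": "", "tool_calls": [tool_call]},
        # The tool result carries the generated image back to the model.
        {
            "role": "tool",
            "tool_call_id": call_id,
            "content": [
                {"type": "text", "text": "Image generated."},
                {"type": "image", "image": image},
            ],
        },
    ]


history = [{"role": "user", "content": "Draw a red fox in the snow."}]
# A stub generator is used here in place of a real image API.
history = image_tool_round_trip(history, "a red fox in the snow", lambda p: "fox.png")
print([message["role"] for message in history])
```

After this round trip, the next model invocation sees the generated image inside the tool result, which is what lets follow-up conversation or revisions refer to what was actually created.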

## Usage
When loading in transformers, you'll probably want to add some handling to ensure the lack of mmproj bias is respected so that vision input is handled properly.

Most of my testing has been using TabbyAPI and ExLlamaV2 (dev branch) with working vision input.
<img src="https://huggingface.co/nintwentydo/Pixtral-Large-Instruct-2411/resolve/main/image-input-example.jpg">


## Quantizations