Safetensors size mismatch.

#4
by MartialTerran - opened

I copy-pasted the bfloat16 code from this page into a local file named bfloat16_SmolLM2_360_model.py.
I installed accelerate with pip install accelerate.
In a CMD console on Windows 10, I ran bfloat16_SmolLM2_360_model.py, which includes model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16), as follows:

# Using torch.bfloat16

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
checkpoint = "HuggingFaceTB/SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# for fp16 use torch_dtype=torch.float16 instead

model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
#model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")

The result was an error report indicating a size mismatch within the downloaded safetensors parameter file:

"ValueError: Trying to set a tensor of shape torch.Size([320, 960]) in "weight" (which has shape torch.Size([960, 960])), this looks incorrect."

Here is the entire run/response:

C:\Users\User\OneDrive\Desktop\SmolLM2>python bfloat16_SmolLM2_360_model.py
C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\huggingface_hub\file_download.py:797: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last): File "C:\Users\User\OneDrive\Desktop\SmolLM2\bfloat16_SmolLM2_360_model.py", line 11, in
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 2604, in from_pretrained
state_dict = load_state_dict(resolved_archive_file)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 461, in load_state_dict
return safe_load_file(checkpoint_file)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\safetensors\torch.py", line 315, in load_file
result[k] = f.get_tensor(k)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\storage.py", line 234, in __getitem__
return super().__getitem__(*args, **kwargs)
C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\storage.py:234: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:84.)
return super().__getitem__(*args, **kwargs)
Traceback (most recent call last):
File "C:\Users\User\OneDrive\Desktop\SmolLM2\bfloat16_SmolLM2_360_model.py", line 11, in
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 2881, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3228, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 720, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\accelerate\utils\modeling.py", line 286, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([320, 960]) in "weight" (which has shape torch.Size([960, 960])), this looks incorrect.

C:\Users\User\OneDrive\Desktop\SmolLM2>

To load the model.safetensors file directly from a local folder instead of the Hugging Face cache, you need to provide the path to that folder to the from_pretrained() method. Here's how you can modify your script:

import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM

# Specify the path to the folder containing model.safetensors

model_folder = os.path.dirname(os.path.abspath(__file__))  # Current script's directory
checkpoint = model_folder  # or "./SmolLM2-360M" for a subfolder

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load the model from local files only (no download is attempted)

model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, local_files_only=True)

inputs = tokenizer.encode("Gravity is", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)

print(tokenizer.decode(outputs[0]))
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB")
Explanation of Changes:

model_folder:

We use os.path.dirname(os.path.abspath(__file__)) to get the absolute path of the current script's directory. This assumes your model.safetensors, config.json, and tokenizer files are in the same directory as your Python script.

If your model files are in a subdirectory (e.g., "SmolLM2-360M"), adjust this line to model_folder = os.path.join(os.path.dirname(os.path.abspath(__file__)), "SmolLM2-360M"), or use a relative path, e.g. checkpoint = "./SmolLM2-360M". This is useful for keeping your model files organized.

checkpoint = model_folder: We now use the model_folder variable as the checkpoint for both the tokenizer and the model. This tells from_pretrained() to look for the model files in the specified local folder.

local_files_only=True: This is crucial. It tells from_pretrained() to only load files from the local directory and not attempt to download anything from the Hugging Face Hub. Since we're providing the local path, we don't want the function to try and connect to the internet.

Important:

File Structure: Ensure that the model.safetensors, config.json, and tokenizer files (usually a vocab.json and merges.txt or similar) are correctly placed within the model_folder directory or your specified subdirectory. The from_pretrained function relies on finding these files to load the model properly. If the files are not in the correct place you will get errors.

No Download: With local_files_only=True, no download will occur. The code will strictly use local files.

This revised script will load the model directly from your local files, making it independent of the Hugging Face cache and internet connectivity (once the files are downloaded). Remember to point model_folder at the actual path if your model is in a different location.
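A quick pre-flight check can confirm the folder layout described above before from_pretrained() is called. The sketch below uses only the standard library. The helper name missing_model_files and the default list of required file names are assumptions made for illustration; tokenizer file names in particular vary between models.

```python
from pathlib import Path

# Assumed minimum set of files; tokenizer file names differ between models,
# so extend this tuple to match what your checkpoint actually ships.
REQUIRED_FILES = ("config.json", "model.safetensors")


def missing_model_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: check the current working directory.
    missing = missing_model_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files found.")
```

An empty list means every listed file was found; it does not prove that the files are complete or uncorrupted.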

The model.safetensors file was manually downloaded from https://huggingface.co/HuggingFaceTB/SmolLM2-360M. This local-folder specification did not remove the size mismatch problem:

C:\Users\User\OneDrive\Desktop\SmolLM2>python bfloat16_SmolLM2_360_local_folder_model.py

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last): File "C:\Users\User\OneDrive\Desktop\SmolLM2\bfloat16_SmolLM2_360_local_folder_model.py", line 18, in
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, local_files_only=True)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 2604, in from_pretrained
state_dict = load_state_dict(resolved_archive_file)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 461, in load_state_dict
return safe_load_file(checkpoint_file)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\safetensors\torch.py", line 315, in load_file
result[k] = f.get_tensor(k)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\storage.py", line 234, in __getitem__
return super().__getitem__(*args, **kwargs)
C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\storage.py:234: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:84.)
return super().__getitem__(*args, **kwargs)
Traceback (most recent call last):
File "C:\Users\User\OneDrive\Desktop\SmolLM2\bfloat16_SmolLM2_360_local_folder_model.py", line 18, in
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, local_files_only=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 2881, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 3278, in _load_pretrained_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([320, 960]) from checkpoint, the shape in current model is torch.Size([960, 960]).
size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([320, 960]) from checkpoint, the shape in current model is torch.Size([960, 960]).
size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([320, 960]) from checkpoint, the shape in current model is torch.Size([960, 960]).
[etcetera]
size mismatch for model.layers.31.self_attn.v_proj.weight: copying a param with shape torch.Size([320, 960]) from checkpoint, the shape in current model is torch.Size([960, 960]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

C:\Users\User\OneDrive\Desktop\SmolLM2>

Adding ignore_mismatched_sizes=True as follows
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, local_files_only=True, ignore_mismatched_sizes=True)

produced more warnings and errors but did not produce a working model:

C:\Users\User\OneDrive\Desktop\SmolLM2>python bfloat16_SmolLM2_360_local_folder_model.py

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.3 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last): File "C:\Users\User\OneDrive\Desktop\SmolLM2\bfloat16_SmolLM2_360_local_folder_model.py", line 18, in
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, local_files_only=True, ignore_mismatched_sizes=True)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\auto\auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 2604, in from_pretrained
state_dict = load_state_dict(resolved_archive_file)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\modeling_utils.py", line 461, in load_state_dict
return safe_load_file(checkpoint_file)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\safetensors\torch.py", line 315, in load_file
result[k] = f.get_tensor(k)
File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\storage.py", line 234, in __getitem__
return super().__getitem__(*args, **kwargs)
C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\storage.py:234: UserWarning: Failed to initialize NumPy: _ARRAY_API not found (Triggered internally at ..\torch\csrc\utils\tensor_numpy.cpp:84.)
return super().__getitem__(*args, **kwargs)
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at C:\Users\User\OneDrive\Desktop\SmolLM2 and are newly initialized: ['model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.31.self_attn.rotary_emb.inv_freq', 'model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.29.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.27.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.28.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'lm_head.weight', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.25.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.30.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.26.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.24.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at C:\Users\User\OneDrive\Desktop\SmolLM2 and are newly initialized because the shapes did not match:

  • model.layers.0.self_attn.k_proj.weight: found shape torch.Size([320, 960]) in the checkpoint and torch.Size([960, 960]) in the model instantiated
  • model.layers.0.self_attn.v_proj.weight: found shape torch.Size([320, 960]) in the checkpoint and torch.Size([960, 960]) in the model instantiated
  • model.layers.1.self_attn.k_proj.weight: found shape torch.Size([320, 960]) in the checkpoint and torch.Size([960, 960]) in the model instantiated
  • model.layers.1.self_attn.v_proj.weight: found shape torch.Size([320, 960]) in the checkpoint and torch.Size([960, 960]) in the model instantiated
  • model.layers.10.self_attn.k_proj.weight: found shape torch.Size([320, 960]) in the checkpoint and torch.Size([960, 960]) in the model instantiated
    [etcetera]
  • model.layers.8.self_attn.k_proj.weight: found shape torch.Size([320, 960]) in the checkpoint and torch.Size([960, 960]) in the model instantiated
  • model.layers.8.self_attn.v_proj.weight: found shape torch.Size([320, 960]) in the checkpoint and torch.Size([960, 960]) in the model instantiated
  • model.layers.9.self_attn.k_proj.weight: found shape torch.Size([320, 960]) in the checkpoint and torch.Size([960, 960]) in the model instantiated
  • model.layers.9.self_attn.v_proj.weight: found shape torch.Size([320, 960]) in the checkpoint and torch.Size([960, 960]) in the model instantiated
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\generation\utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
    warnings.warn(
    C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\generation\utils.py:1353: UserWarning: Using max_length's default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using max_new_tokens to control the maximum length of the generation.
    warnings.warn(
    C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\generation\utils.py:1452: UserWarning: You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('cpu') before running .generate().
    warnings.warn(
    Traceback (most recent call last):
    File "C:\Users\User\OneDrive\Desktop\SmolLM2\bfloat16_SmolLM2_360_local_folder_model.py", line 24, in
    outputs = model.generate(inputs)
    ^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\generation\utils.py", line 1522, in generate
    return self.greedy_search(
    ^^^^^^^^^^^^^^^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\generation\utils.py", line 2339, in greedy_search
    outputs = self(
    ^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 688, in forward
    outputs = self.model(
    ^^^^^^^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\transformers\models\llama\modeling_llama.py", line 531, in forward
    inputs_embeds = self.embed_tokens(input_ids)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\modules\sparse.py", line 162, in forward
    return F.embedding(
    ^^^^^^^^^^^^
    File "C:\Users\User\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\nn\functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

What does the shape [320, 960] within self_attn.k_proj.weight imply about the model architecture and the model.py code?

Model
The shape [320, 960] within self_attn.k_proj.weight provides important clues about the model's architecture, specifically its self-attention mechanism:

320 (Output Features): PyTorch stores an nn.Linear weight as [out_features, in_features], so this dimension is the number of output features of the key projection. In multi-head attention, that output width is the per-head dimension multiplied by the number of key heads. Here the output is one third of the hidden size (960 / 320 = 3), which is what you would see when several query heads share each key/value head rather than each query head having its own.

960 (Hidden Size or Input Features): This dimension usually corresponds to the hidden size of the model or the number of input features being projected. The weight matrix k_proj.weight is used to project the input sequence into the "key" space of the self-attention mechanism.

Implications for Model Architecture and Code:

Non-Square Key Projection: The checkpoint shape is [320, 960] rather than the square [960, 960] that the instantiated model expects. This suggests that the SmolLM2 attention layers produce fewer key/value features than query features, as in multi-query or grouped-query attention, and that the model class being instantiated on your machine does not account for this.

Key/Query/Value Projection: The k_proj.weight matrix is specifically responsible for projecting the input into the key space. There will be corresponding weight matrices for query (q_proj.weight) and value (v_proj.weight) projections. You should examine the shapes of these matrices in the downloaded weights and the code as well to get a complete picture of the attention mechanism. It's possible that the query and/or value projections also have modified dimensions.

Model Configuration: The dimensions 320 and 960 should be reflected in the model's configuration file (config.json). There might be parameters such as hidden_size, num_attention_heads, intermediate_size, or custom parameters specific to the SmolLM2 architecture that directly define these values.
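The tensor shapes stored in the checkpoint can be inspected without instantiating any model class. The following is a minimal sketch that relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header). The function name read_safetensors_shapes and the file name in the usage note are illustrative assumptions.

```python
import json
import struct


def read_safetensors_shapes(path):
    """Return {tensor_name: shape} read from a .safetensors file header.

    Only the JSON header is read, so no tensor data is loaded into memory.
    """
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: tuple(info["shape"])
        for name, info in header.items()
        if name != "__metadata__"
    }
```

For example, the result of read_safetensors_shapes("model.safetensors") could be filtered for names ending in q_proj.weight, k_proj.weight, and v_proj.weight and compared against the shapes reported in the error.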

Example of a potential modified attention mechanism (Illustrative):

import torch.nn as nn

class SmolLM2Attention(nn.Module):  # Hypothetical example, not the real implementation
    def __init__(self, hidden_size=960, num_heads=3):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads  # 960 // 3 = 320

        # nn.Linear stores its weight as [out_features, in_features]
        self.q_proj = nn.Linear(hidden_size, self.head_dim * num_heads)  # weight: [960, 960]
        self.k_proj = nn.Linear(hidden_size, self.head_dim)  # weight: [320, 960], narrower than hidden_size
        self.v_proj = nn.Linear(hidden_size, self.head_dim)  # weight: [320, 960], narrower than hidden_size

    def forward(self, hidden_states):
        # Only the projections are shown; the attention calculation is omitted
        queries = self.q_proj(hidden_states)
        keys = self.k_proj(hidden_states)
        values = self.v_proj(hidden_states)
        return queries, keys, values

This is a purely hypothetical example to show how the k_proj and v_proj layers can have a smaller output dimension than the query projection, leading to the [320, 960] shape you observed in the checkpoint. The head count and head dimension used here are placeholders. To understand the true implementation details, refer to the model's config.json and to the modeling file that is actually used (the traceback above points at transformers\models\llama\modeling_llama.py).

Model
The config.json you provided reveals crucial information that explains the shape discrepancy and guides how to load the model correctly:

Key Insights:

num_key_value_heads: 5 This is the most important detail! Because it is smaller than num_attention_heads, the model uses Grouped-Query Attention (GQA), a generalization of Multi-Query Attention in which groups of query heads share the same key and value projections. This directly affects the shape of the k_proj.weight and v_proj.weight matrices.

num_attention_heads: 15 This is the total number of attention heads (query heads) in each layer.

hidden_size: 960 The overall hidden dimension of the model.

How Grouped-Query Attention Explains the Shape:

In grouped-query attention (multi-query attention is the special case with a single key/value head):

Query Projections: Each of the 15 attention heads has its own query projection. With a head dimension of 960 / 15 = 64, q_proj.weight would have shape [15 * 64, 960] = [960, 960]. You did not provide the shape of q_proj, so this is derived from the config values rather than observed.

Key and Value Projections: The key and value projections are shared among a group of heads. Since num_key_value_heads is 5, this means there are only 5 sets of key and value vectors. Each set is shared by 3 query heads (15 / 5 = 3).

Calculating the Shape:

The dimension of each head: hidden_size / num_attention_heads = 960 / 15 = 64. Key/value heads use the same head dimension as query heads.

PyTorch stores a linear layer's weight as [out_features, in_features], so k_proj.weight and v_proj.weight have shape [head dimension * num_key_value_heads, hidden_size] = [64 * 5, 960] = [320, 960]. This matches the checkpoint shape in your error message. The [960, 960] shape belongs to the model that was instantiated, which sized the key and value projections as if there were 15 key/value heads, i.e. without taking num_key_value_heads into account.
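As a cross-check, the projection shapes can be computed from the three config values quoted above. This sketch assumes the usual Llama-style layout, in which every head (query, key, or value) has dimension hidden_size / num_attention_heads and nn.Linear weights are stored as [out_features, in_features]. The function name is made up for illustration.

```python
def expected_projection_shapes(hidden_size, num_attention_heads, num_key_value_heads):
    """Weight shapes ([out_features, in_features]) for Llama-style attention.

    Assumes head_dim = hidden_size // num_attention_heads for all heads.
    """
    head_dim = hidden_size // num_attention_heads
    return {
        "q_proj": (num_attention_heads * head_dim, hidden_size),
        "k_proj": (num_key_value_heads * head_dim, hidden_size),
        "v_proj": (num_key_value_heads * head_dim, hidden_size),
    }


# Values quoted from config.json in the discussion above.
shapes = expected_projection_shapes(
    hidden_size=960, num_attention_heads=15, num_key_value_heads=5
)
for name, shape in shapes.items():
    print(name, shape)
# k_proj and v_proj come out as (320, 960), the checkpoint shape in the error.
```

Passing num_key_value_heads=15 instead yields (960, 960) for k_proj and v_proj, which is the shape the instantiated model reported.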

Loading the Model

The checkpoint is consistent with its config.json, so the mismatch comes from the model object that was built, not from the safetensors file. The tracebacks show that Transformers resolved the checkpoint to LlamaForCausalLM and created [960, 960] key/value projections, which means the installed LlamaForCausalLM did not apply num_key_value_heads. Recent Transformers releases implement grouped-query attention in the Llama classes, so the most likely fix is to upgrade the library (pip install --upgrade transformers) rather than to look for a custom model class. ignore_mismatched_sizes=True is not a fix: as the log shows, it newly initializes every mismatched k_proj and v_proj weight instead of loading it.

Important Considerations:

Transformers Version: Print transformers.__version__ and compare it against the release notes to confirm that your version supports num_key_value_heads for Llama models.

Custom Model Files: A custom class is only relevant if your SmolLM2-360M folder ships its own modeling_*.py file. Your traceback shows the standard LlamaForCausalLM class being selected, so check the file list in the folder before assuming a custom class exists.

If you're still encountering errors, double-check the following:

Typos in file paths.

That the config.json in the model folder is the one downloaded with the checkpoint and still contains num_key_value_heads: 5.

That the input tensors are on the same device as the model (for example, inputs.to(model.device)). The last traceback fails because input_ids is on cuda while the model is on cpu.

The NumPy warning in the output. It is separate from the shape mismatch, but the message itself suggests downgrading to 'numpy<2' or upgrading the affected module.
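The installed Transformers version can be checked programmatically. The tracebacks show LlamaForCausalLM creating [960, 960] key/value projections even though config.json sets num_key_value_heads to 5, which is worth ruling out as a version problem. The sketch below compares the installed version against a minimum. The threshold of 4.31.0 is an assumption, not something stated in this thread, about when the Llama classes began honoring num_key_value_heads, so confirm it against the Transformers release notes. The helper names are illustrative.

```python
import re
from importlib import metadata

# ASSUMPTION: first release whose Llama classes honor num_key_value_heads.
# Verify this against the Transformers release notes before relying on it.
ASSUMED_MIN_VERSION = (4, 31, 0)


def parse_version(text):
    """Turn '4.45.0.dev0' into (4, 45, 0); non-numeric suffixes are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def is_new_enough(version_text, minimum=ASSUMED_MIN_VERSION):
    """True if version_text is at least `minimum`."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed in this environment")
    else:
        verdict = "ok" if is_new_enough(installed) else "consider upgrading"
        print(f"transformers {installed}: {verdict}")
```

If the check suggests upgrading, pip install --upgrade transformers followed by a rerun of the original script (without ignore_mismatched_sizes=True) would show whether the mismatch disappears.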
