Generation stuck for more than 10 minutes at "Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation."

#28
by Madhugraj - opened

I am running the following code on an e2-standard-8 machine. Resource usage is 20% of 8 vCPUs and 4.6% of 32 GB RAM. Execution has been stuck at "Setting pad_token_id to eos_token_id:100257 for open-end generation." for more than 10 minutes.

input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))

Setting pad_token_id to eos_token_id:100257 for open-end generation.

KeyboardInterrupt Traceback (most recent call last)
Cell In[8], line 1
----> 1 outputs = model.generate(**input_ids, max_new_tokens=200)
2 print(tokenizer.decode(outputs[0]))

File /opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1527, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1509 result = self.assisted_decoding(
1510 input_ids,
1511 candidate_generator=candidate_generator,
(...)
1523 **model_kwargs,
1524 )
1525 if generation_mode == GenerationMode.GREEDY_SEARCH:
1526 # 11. run greedy search
-> 1527 result = self._greedy_search(
1528 input_ids,
1529 logits_processor=prepared_logits_processor,
1530 stopping_criteria=prepared_stopping_criteria,
1531 pad_token_id=generation_config.pad_token_id,
1532 eos_token_id=generation_config.eos_token_id,
1533 output_scores=generation_config.output_scores,
1534 output_logits=generation_config.output_logits,
1535 return_dict_in_generate=generation_config.return_dict_in_generate,
1536 synced_gpus=synced_gpus,
1537 streamer=streamer,
1538 **model_kwargs,
1539 )
1541 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
1542 if not model_kwargs["use_cache"]:

File /opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:2411, in GenerationMixin._greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, output_logits, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2408 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2410 # forward pass to get next token
-> 2411 outputs = self(
2412 **model_inputs,
2413 return_dict=True,
2414 output_attentions=output_attentions,
2415 output_hidden_states=output_hidden_states,
2416 )
2418 if synced_gpus and this_peer_finished:
2419 continue # don't waste resources running the code we don't need

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/.cache/huggingface/modules/transformers_modules/databricks/dbrx-instruct/3b5d968eab47b0cb5b075fd984612b63f92841c2/modeling_dbrx.py:1334, in DbrxForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict, cache_position)
1331 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1333 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1334 outputs = self.transformer(
1335 input_ids=input_ids,
1336 attention_mask=attention_mask,
1337 position_ids=position_ids,
1338 past_key_values=past_key_values,
1339 inputs_embeds=inputs_embeds,
1340 use_cache=use_cache,
1341 output_attentions=output_attentions,
1342 output_hidden_states=output_hidden_states,
1343 output_router_logits=output_router_logits,
1344 return_dict=return_dict,
1345 cache_position=cache_position,
1346 )
1348 hidden_states = outputs[0]
1349 logits = self.lm_head(hidden_states)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/.cache/huggingface/modules/transformers_modules/databricks/dbrx-instruct/3b5d968eab47b0cb5b075fd984612b63f92841c2/modeling_dbrx.py:1132, in DbrxModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict, cache_position)
1120 block_outputs = self._gradient_checkpointing_func(
1121 block.__call__,
1122 hidden_states,
(...)
1129 cache_position=cache_position,
1130 )
1131 else:
-> 1132 block_outputs = block(
1133 hidden_states,
1134 attention_mask=causal_mask,
1135 position_ids=position_ids,
1136 past_key_value=past_key_values,
1137 output_attentions=output_attentions,
1138 output_router_logits=output_router_logits,
1139 use_cache=use_cache,
1140 cache_position=cache_position,
1141 )
1143 hidden_states = block_outputs[0]
1145 if use_cache:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/.cache/huggingface/modules/transformers_modules/databricks/dbrx-instruct/3b5d968eab47b0cb5b075fd984612b63f92841c2/modeling_dbrx.py:921, in DbrxBlock.forward(self, hidden_states, position_ids, attention_mask, past_key_value, output_attentions, output_router_logits, use_cache, cache_position, **kwargs)
909 resid_states, hidden_states, self_attn_weights, present_key_value = self.norm_attn_norm(
910 hidden_states=hidden_states,
911 attention_mask=attention_mask,
(...)
917 **kwargs,
918 )
920 # Fully Connected
--> 921 hidden_states, router_logits = self.ffn(hidden_states)
922 hidden_states = nn.functional.dropout(hidden_states,
923 p=self.resid_pdrop,
924 training=self.training)
925 hidden_states = resid_states + hidden_states

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/.cache/huggingface/modules/transformers_modules/databricks/dbrx-instruct/3b5d968eab47b0cb5b075fd984612b63f92841c2/modeling_dbrx.py:845, in DbrxFFN.forward(self, x)
843 def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
844 weights, top_weights, top_experts = self.router(x)
--> 845 out = self.experts(x, weights, top_weights, top_experts)
846 return out, weights

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/.cache/huggingface/modules/transformers_modules/databricks/dbrx-instruct/3b5d968eab47b0cb5b075fd984612b63f92841c2/modeling_dbrx.py:811, in DbrxExperts.forward(self, x, weights, top_weights, top_experts)
808 topk_list = topk_idx.tolist()
810 expert_tokens = x[None, token_list].reshape(-1, hidden_size)
--> 811 expert_out = self.mlp(
812 expert_tokens, expert_idx) * top_weights[token_list, topk_list,
813 None]
815 out.index_add_(0, token_idx, expert_out)
817 out = out.reshape(bsz, q_len, hidden_size)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/.cache/huggingface/modules/transformers_modules/databricks/dbrx-instruct/3b5d968eab47b0cb5b075fd984612b63f92841c2/modeling_dbrx.py:778, in DbrxExpertGLU.forward(self, x, expert_idx)
776 x1 = self.activation_fn(x1)
777 x1 = x1 * x2
--> 778 x1 = x1.matmul(expert_w2)
779 return x1

KeyboardInterrupt:

Databricks org

Google documentation says the E2 machine series supports up to 128 GB of memory. Our model requires at least 264 GB of memory to run.
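As a rough check on the 264 GB figure, here is a back-of-the-envelope sketch. It assumes DBRX's 132B total parameters held in bfloat16 (2 bytes each) and counts the weights only, ignoring activations and the KV cache; the function name is just for illustration.

```python
# Back-of-the-envelope weight memory for DBRX.
# Assumption: 132B total parameters (all experts), stored in bfloat16.
TOTAL_PARAMS = 132e9
BYTES_PER_PARAM_BF16 = 2  # bfloat16 uses 16 bits per value


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


required_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM_BF16)
machine_gb = 32  # RAM of the e2-standard-8 machine reported above

print(f"weights: {required_gb:.0f} GB, machine RAM: {machine_gb} GB")
print(f"shortfall: {required_gb - machine_gb:.0f} GB")
```

Under these assumptions the weights alone are roughly eight times larger than the RAM on an e2-standard-8.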

Agree, I'm not sure how you're even loading it in 32GB. Are you sure that's even successful? Even if so, generation will be very slow on CPU. This is intended for GPUs. There are quantizations of the model that can maybe perform passably on very large CPU VMs. It is likely easiest to consume this model from a third party hosted service.

Databricks org

Closing this issue -- @Madhugraj we recommend using this model with GPUs for inference.

hanlintang changed discussion status to closed

Adding more details:
First I ran:
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True, token=auth_token)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True, token=auth_token)
All 61 files were downloaded.
Then I ran:

# Model + tokenizer save

save_directory = "llmdb/model"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

Later I reloaded the model from the saved directory:

model = AutoModelForCausalLM.from_pretrained(save_directory, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)
Loading checkpoint shards: 100% 61/61 [00:13<00:00, 4.69it/s]

Can I take this to mean that the model and tokenizer were saved successfully?
Now I am running:

input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt")
input_ids

{'input_ids': tensor([[100278, 9125, 198, 2675, 527, 6078, 46913, 11, 3549,
555, 423, 2143, 78889, 13, 1472, 1051, 1566, 6177,
304, 6790, 220, 2366, 18, 13, 1472, 4320, 4860,
3196, 389, 2038, 2561, 709, 311, 430, 1486, 627,
57489, 15843, 36, 66024, 77273, 50, 5257, 66024, 57828,
43486, 2794, 23233, 29863, 11, 719, 3493, 17879, 14847,
311, 810, 6485, 323, 1825, 84175, 4860, 627, 2675,
7945, 449, 5370, 9256, 11, 505, 4477, 311, 11058,
320, 985, 51594, 369, 2082, 10215, 2001, 6227, 311,
1005, 55375, 449, 2082, 11, 4823, 11, 323, 12920,
4390, 7, 2675, 656, 539, 617, 1972, 7394, 828,
2680, 477, 2082, 11572, 17357, 13, 1472, 5766, 23473,
67247, 323, 3493, 24770, 39555, 389, 20733, 13650, 13,
1472, 656, 539, 3493, 5609, 24142, 11, 45319, 11,
477, 3754, 9908, 323, 656, 539, 82791, 713, 3649,
315, 701, 4967, 828, 29275, 2028, 374, 701, 1887,
10137, 11, 51346, 701, 14847, 13, 3234, 539, 5905,
433, 11, 1120, 6013, 311, 279, 1217, 13, 1442,
499, 1505, 6261, 7556, 922, 420, 1984, 11, 3009,
13, 1472, 1288, 387, 30438, 36001, 323, 6118, 430,
3445, 539, 45391, 420, 627, 57489, 9503, 4276, 386,
72983, 4230, 3083, 10245, 45613, 52912, 21592, 66873, 6781,
38873, 3247, 45613, 3507, 20843, 9109, 393, 3481, 691,
1863, 5257, 3247, 14194, 13575, 68235, 13, 100279, 198,
100278, 882, 198, 3923, 1587, 433, 1935, 311, 1977,
264, 2294, 445, 11237, 30, 100279, 198, 100278, 78191,
198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Can I take this to mean that the input tokens were generated correctly?

Next:
outputs = model.generate(**input_ids, max_new_tokens=200)
Setting pad_token_id to eos_token_id:100257 for open-end generation.
There is no output after that, and the cell has been running for hours.

I am just following the instructions in https://github.com/databricks/dbrx/blob/main/MODEL_CARD_dbrx_instruct.md under "Run the model on a CPU".

Please explain.

Databricks org

While you can run on CPU, it will be very slow. If this loads at all, I suspect you are actually swapping; I don't see how else this can work in just 32GB of memory. That would make it extremely slow.
At the very least, you should try a high-memory instance with (say) 300GB+ of RAM if you want to explore this.
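One way to test the swapping suspicion before starting a long run is to look at /proc/meminfo. The following is a minimal, Linux-only sketch using only the standard library; the helper names `parse_meminfo` and `memory_report` are made up for illustration, and the 264 GB requirement is the figure quoted earlier in this thread.

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value}.

    Values are in kiB for the fields that carry a "kB" unit.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_report(info: dict, required_gb: float) -> dict:
    """Compare total RAM with the requirement and report swap in use (GB)."""
    kib_to_gb = 1024 / 1e9
    total_gb = info.get("MemTotal", 0) * kib_to_gb
    swap_used_gb = (info.get("SwapTotal", 0) - info.get("SwapFree", 0)) * kib_to_gb
    return {
        "total_ram_gb": total_gb,
        "swap_used_gb": swap_used_gb,
        "enough_ram": total_gb >= required_gb,
    }


# Only attempt to read the file where it exists (Linux).
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print(memory_report(parse_meminfo(f.read()), required_gb=264))
```

If `enough_ram` is False, or `swap_used_gb` keeps growing while `generate` runs, the machine is most likely too small for the model rather than hung.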
