I only have 25 GB of RAM, and every time I try to run the code below my Google Colab session crashes. Any idea how to prevent this from happening? Would processing it batch-wise work? If so, what would that look like?
max_q_len = 128
max_a_len = 64

def batch_encode(text, max_seq_len):
    return tokenizer.batch_encode_plus(
        text.tolist(),
        max_length=max_seq_len,
        pad_to_max_length=True,
        truncation=True,
        return_token_type_ids=False
    )
# tokenize and encode sequences in the training set
tokensq_train = batch_encode(train_q, max_q_len)
tokens1_train = batch_encode(train_a1, max_a_len)
tokens2_train = batch_encode(train_a2, max_a_len)
@neuralpat Yes, it works with a smaller dataset. Unfortunately, there is no traceback other than “Your session crashed after using all available RAM”. I am using Google Colab.
I also tried to tokenize and encode only train_q, without train_a1 and train_a2 - it still crashed.
I then tried this:
trainq_list = train_q.tolist()
batch_size = 50000

def batch_encode(text, max_seq_len):
    for i in range(0, len(trainq_list), batch_size):
        # Note: `text` is passed in whole and the index `i` is never used,
        # so this still encodes the entire list in a single call before
        # returning on the first iteration.
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
        return encoded_sent
# tokenize and encode sequences in the training set
tokensq_train = batch_encode(train_q, max_q_len)
So the idea was to go through it in batches of 50,000 in the hope of not crashing, but it didn’t work… it crashed anyway. Any idea how I could tackle this problem?
Just because it works with a smaller dataset doesn’t mean it’s the tokenization that’s causing the RAM issues.
You could try streaming the data from disk instead of loading it all into RAM at once.
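For example, here is a minimal sketch of on-the-fly tokenization with a PyTorch DataLoader. It assumes `tokenizer` is a Hugging Face tokenizer and `train_q` is a pandas Series of raw strings; the class and helper names are made up for illustration. Each batch is encoded only when the loader requests it, so peak memory stays around one batch of encodings instead of the whole training set:

from torch.utils.data import Dataset, DataLoader

class QuestionDataset(Dataset):
    def __init__(self, texts, tokenizer, max_seq_len):
        self.texts = texts          # raw strings only; cheap to hold in RAM
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Return the raw string; encoding happens per batch in collate().
        return self.texts[idx]

    def collate(self, batch):
        # Encode just this batch (e.g. 32 strings), not the whole dataset.
        return self.tokenizer(
            batch,
            max_length=self.max_seq_len,
            padding="max_length",   # current replacement for pad_to_max_length=True
            truncation=True,
            return_token_type_ids=False,
            return_tensors="pt",
        )

# Hypothetical usage, assuming train_q is a pandas Series of strings:
# dataset = QuestionDataset(train_q.tolist(), tokenizer, max_q_len)
# loader = DataLoader(dataset, batch_size=32, collate_fn=dataset.collate)
# for batch in loader:
#     ...  # feed batch["input_ids"], batch["attention_mask"] to the model

If you need the encodings again across epochs, another option is to write each encoded chunk to disk (e.g. with numpy.save) and load it later with mmap_mode="r", rather than keeping everything in memory at once.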