I only have 25 GB of RAM, and every time I try to run the code below my Google Colab session crashes. Any idea how to prevent this from happening? Would processing it batch-wise work? If so, what would that look like?
max_q_len = 128
max_a_len = 64

def batch_encode(text, max_seq_len):
    return tokenizer.batch_encode_plus(
        text.tolist(),
        max_length=max_seq_len,
        pad_to_max_length=True,
        truncation=True,
        return_token_type_ids=False
    )
# tokenize and encode sequences in the training set
tokensq_train = batch_encode(train_q, max_q_len)
tokens1_train = batch_encode(train_a1, max_a_len)
tokens2_train = batch_encode(train_a2, max_a_len)
@neuralpat Yes, it works with a smaller dataset. Unfortunately, there is no traceback other than “Your session crashed after using all available RAM”. I am using Google Colab.
I also tried to tokenize and encode only train_q, without train_a1 and train_a2 - it still crashed.
I then tried this:
trainq_list = train_q.tolist()
batch_size = 50000

def batch_encode(text, max_seq_len):
    for i in range(0, len(trainq_list), batch_size):
        # Note: `text` is passed in whole and the index `i` is never used,
        # so this still encodes the entire list in a single call before
        # returning on the first iteration.
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length=max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
        return encoded_sent
# tokenize and encode sequences in the training set
tokensq_train = batch_encode(train_q, max_q_len)
So the idea was to go through it in batches of 50,000 in the hope of not crashing, but it didn’t work… it crashed anyway. Any idea how I could tackle this problem?
Just because it works with a smaller dataset doesn’t mean it’s the tokenization that’s causing the RAM issues.
You could try streaming the data from disk instead of loading it all into RAM at once.
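For example, here is a minimal sketch of on-the-fly tokenization with a PyTorch DataLoader. It assumes `tokenizer` is a Hugging Face tokenizer and `train_q` is a pandas Series of raw strings; the class and helper names are made up for illustration. Each batch is encoded only when the loader requests it, so peak memory stays around one batch of encodings instead of the whole training set:

from torch.utils.data import Dataset, DataLoader

class QuestionDataset(Dataset):
    def __init__(self, texts, tokenizer, max_seq_len):
        self.texts = texts          # raw strings only; cheap to hold in RAM
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Return the raw string; encoding happens per batch in collate().
        return self.texts[idx]

    def collate(self, batch):
        # Encode just this batch (e.g. 32 strings), not the whole dataset.
        return self.tokenizer(
            batch,
            max_length=self.max_seq_len,
            padding="max_length",   # current replacement for pad_to_max_length=True
            truncation=True,
            return_token_type_ids=False,
            return_tensors="pt",
        )

# Hypothetical usage, assuming train_q is a pandas Series of strings:
# dataset = QuestionDataset(train_q.tolist(), tokenizer, max_q_len)
# loader = DataLoader(dataset, batch_size=32, collate_fn=dataset.collate)
# for batch in loader:
#     ...  # feed batch["input_ids"], batch["attention_mask"] to the model

If you need the encodings again across epochs, another option is to write each encoded chunk to disk (e.g. with numpy.save) and load it later with mmap_mode="r", rather than keeping everything in memory at once.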