Solid result, underfit?

#1
by lemon07r - opened

Got pretty solid results with this one. Feels a little smarter than v1. However I'm wondering if it's a little underfit at only 1 epoch, compared to v1. You probably have a better idea of this, with your wandb graphs.

Here's some example generation:

Landon's Return to Sight

In the silent aftermath of The Great Blinding, Landon had adapted like countless others: navigating by sound, touch, and memory. His world was a tapestry woven from whispers, echoes, and the familiar creak of his wooden floorboards. But two years into this darkness, something inexplicable happened.

One morning, as dawn's invisible fingers brushed against his eyelids, Landon awoke to a flood of light. His heart raced, disbelief mingling with hope. He blinked, and there it was—the room, unchanged but vibrantly real. Panic quickly followed awe; the walls, floors, and even the ceiling bore the same chilling message: Don't Tell Them You Can See.

His first instinct was to call out, but he stifled the urge. The words, painted in stark black, seemed to pulse with urgency. Every surface screamed a warning, not just for him, but for humanity’s fragile new normal.

Landon's mind whirled. Why now? Who had done this? Was he alone in this miraculous return or were others similarly blessed—or cursed? He approached the window, the sunlight filtering through curtains he hadn’t felt in ages. Outside, the world looked eerily untouched, save for the occasional graffiti echoing the same command.

He ventured cautiously into the streets, each step a revelation. People moved with practiced grace, their faces void of sight yet brimming with trust in their other senses. Landon's heart ached for them, yet fear kept his secret tightly guarded. He saw children playing blindfolded games, adults navigating markets with canes, and elders sharing stories by touch. It was a community bound by necessity, a society that had found its rhythm in the dark.

As days passed, Landon noticed subtle signs—furtive glances, whispered conversations, and a few others who seemed to catch glimpses of light. One evening, while scavenging for food, he encountered an old man sitting alone, eyes unfocused but seemingly aware.

"Don't tell," the man murmured, sensing Landon's presence. "They'll panic, think it's a trick."

"How did you know?" Landon asked, voice barely above a whisper.

"The walls speak," the man replied cryptically. "We're not the first. There are more, hiding, watching."

The revelation struck deep.

Now here's V1, using the same prompt:

In the quietude of his small room, Landon woke to the familiar sensation of sunlight kissing his face. It was an odd comfort, given that he hadn't seen the sun in two long years. The Great Blinding had swept across the globe like an unseen hand, stealing sight from every eye it touched. Yet, today, something was different.

Landon's heart pounded as he reached out, fingers brushing against the rough texture of his bedroom wall. He traced the familiar cracks, the chipped paint, until his hand encountered something smooth and foreign. A shiver ran down his spine as he realized it was paint, wet and sticky beneath his touch. His breath hitched in his throat when he understood what it meant: he could see again.

He blinked rapidly, eyes adjusting to the sudden influx of light. His room swam into focus, each detail sharp and vivid after the endless gray of blindness. His gaze darted around, taking in the dusty furniture, the faded posters on the walls, the cluttered desk where he'd once done his homework. But there, scrawled across every available surface, was a message that sent a chill down his spine:

Don't Tell Them You Can See.

His mind raced. Who had done this? Why? And why him? Questions tumbled through his thoughts like a waterfall, each one more urgent than the last. But one thing was clear: he wasn't alone in regaining his sight. Someone else knew about The Great Blinding ending, and they didn't want anyone else to know.

Landon stood, his legs wobbling slightly as they adjusted to supporting his weight. He made his way to the window, drawn by the promise of sunlight and answers. As he pushed aside the curtains, he gasped. Every house, every street, every wall - all were covered in the same stark black letters: Don't Tell Them You Can See.

Fear gripped him, cold and unyielding. What if he was the only one who could see? What if everyone else remained blind, forever unaware of the world around them? Would they hate him for his sight? Fear him? Or worse, would they try to take it away from him?

He thought of his family, his friends, his neighbors. All those faces, once so familiar, now reduced to memories etched in his mind's eye. Could he really keep this secret from them? From the world?

And here's Romulus, to have as baseline:

Landon's world had been shrouded in darkness for two long years, a shared void where even dreams flickered faintly. The Great Blinding had struck humanity like a cosmic eraser, leaving them to navigate a silent, unseen realm. Yet, today, as dawn's first light pierced through his eyelids, something miraculous happened: Landon saw.

Blinking rapidly, he took in the familiar yet transformed contours of his room. His heart raced, not just from the sudden clarity but from the stark, repeated message scrawled everywhere—walls, floors, furniture, even the ceiling bore the cryptic warning: Don't Tell Them You Can See.

The urgency in those words sent shivers down his spine. He stumbled to the window, fingers trembling against the glass, and peered outside. The neighborhood, once alive with the bustling chaos of human activity, now seemed eerily still, its inhabitants moving like blind specters, guided by memory and sound.

Landon's mind whirled with questions. How? Why him? And more importantly, who had left these warnings? Steeling himself, he ventured out, careful to avoid detection. Every step confirmed the omnipresence of the message, a silent plea echoing through every corridor and alleyway.

In the town square, he found others gathered, their faces etched with perpetual confusion and dependence on echolocation and touch. Among them was Mrs. Harper, his old teacher, her cane tapping rhythmically. Landon's instincts screamed caution, but curiosity drove him closer.

"Mrs. Harper," he whispered, testing the waters. Her head snapped towards the sound, eyes vacant but alert.

"Who's there?" she asked, voice tinged with fear.

"It's... it's Landon," he replied softly, heart pounding.

She paused, then leaned in. "Landon, stay close. We mustn't let them know we're alone."

Her words mirrored the walls' warnings, deepening the mystery. Landon nodded, though she couldn't see it. As they walked, he noticed subtle signs—a hidden door ajar, a faint glow from a basement window. Following these clues led them underground, into a dimly lit sanctuary where a group of people, their eyes reflecting newfound hope, huddled together.

At the center stood Elias, a figure of quiet authority. "Welcome back, Landon," he said, as if expecting him. "You've seen the message?"

"Yes," Landon replied, breathless. "

To me, none of the three is perfect. They all leave something to be desired, and it's hard to tell which one does it better. But I do think we are moving in the right direction with this model. Some second opinions would be nice.

Using my AI judge benchmark:
Judge -------------- Romulus / v1 / v2
Sonnet 3.5 --------- A- / B+ / A-
L3.1 405B ---------- 85 / 80 / 92
gpt4o latest ------- 87 / 80 / 88
gemini 1.5 pro exp - 74 / 80 / 86

AI judges seem to agree that your new model/finetune, gutenberg v2 is the best of all the nemo finetunes for creative/story writing. Romulus was the previous best out of everything I tested, so congrats on that.
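As a quick sanity check on the scores, the three numeric judges can be averaged per model. This is only a sketch: the numbers are copied from the table above, Sonnet 3.5 is left out because it gave letter grades, and a plain mean across judges is an assumption, not necessarily how the benchmark is meant to be read.

```python
# Numeric judge scores copied from the table (order: Romulus, v1, v2).
# Sonnet 3.5 gave letter grades, so it is excluded from the average.
scores = {
    "L3.1 405B": (85, 80, 92),
    "gpt4o latest": (87, 80, 88),
    "gemini 1.5 pro exp": (74, 80, 86),
}
models = ("Romulus", "v1", "v2")


def mean_scores(table):
    """Average each model's score across all judges in the table."""
    n_judges = len(table)
    totals = [sum(row[i] for row in table.values()) for i in range(len(models))]
    return {name: round(total / n_judges, 1) for name, total in zip(models, totals)}


averages = mean_scores(scores)
print(averages)
```

With only three judges and one prompt, the averages are a rough summary at best.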

Here is the prompt I used for story generation:

Here is a writing prompt, please write me a short story about a boy named Landon from third person POV using this writing prompt:

You lost your sight - along with everyone else on Earth - in The Great Blinding. Two years later, without warning, your sight returns. As you look around, you realize that every available wall, floor and surface has been painted with the same message - Don't Tell Them You Can See.

And my prompt for judges (I usually run it a couple of times to see if they are consistent in their evaluation, and they are about 99% of the time, even if I reorder the stories):

Here are three short stories, written based on the same writing prompt, please rate them all out of 100 for writing quality, ability, coherence, creativity, grammar, clarity, prose, vocabulary, errors and other aspects, then give them an overall grade.

Ideally, human judges are best, but it's very rare (and a good sign) when all the AI judges agree with each other.
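The reordering check can be scripted. A minimal sketch: the judge prompt text is the one quoted above, while the function name `build_judge_prompts` and the neutral "Story 1/2/3" labels are hypothetical choices for illustration (the original runs may have labelled the stories differently).

```python
import itertools

JUDGE_PROMPT = (
    "Here are three short stories, written based on the same writing prompt, "
    "please rate them all out of 100 for writing quality, ability, coherence, "
    "creativity, grammar, clarity, prose, vocabulary, errors and other aspects, "
    "then give them an overall grade."
)


def build_judge_prompts(stories):
    """Yield (order, prompt) for every ordering of the stories.

    `stories` maps a model name to its generated text. Model names are kept
    out of the prompt and replaced by neutral labels (an assumption here).
    """
    for order in itertools.permutations(stories):
        body = "\n\n".join(
            f"Story {i}:\n{stories[name]}" for i, name in enumerate(order, start=1)
        )
        yield order, f"{JUDGE_PROMPT}\n\n{body}"


prompts = list(build_judge_prompts({"Romulus": "text A", "v1": "text B", "v2": "text C"}))
print(len(prompts))
```

Each returned `order` tuple records which model sat in which slot, so scores from different orderings can be mapped back to the right model.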

Also interesting idea, minimagnum 1.1, finetuned on 1 epoch of gutenberg dpo, then slerp merged with this model (gutenberg v2), for the ultimate gutenberg?
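For anyone unfamiliar with the term, slerp is spherical linear interpolation: it blends two vectors along the arc between them instead of along a straight line. The sketch below shows the formula on plain Python lists. It is an illustration only, not the code of any merge tool (tools such as mergekit offer a SLERP merge method and apply it to whole weight tensors), and it assumes both inputs are non-zero.

```python
import math


def slerp(t, a, b, eps=1e-8):
    """Spherical linear interpolation between two equal-length, non-zero vectors.

    t=0 returns a, t=1 returns b, values in between follow the arc.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Cosine of the angle between the vectors, clamped against rounding error.
    cos_theta = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_theta = max(-1.0, min(1.0, cos_theta))
    theta = math.acos(cos_theta)
    if theta < eps:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * theta) / math.sin(theta)
    weight_b = math.sin(t * theta) / math.sin(theta)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


midpoint = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
print(midpoint)
```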

PS eisenstein. from the koboldai discord server kindly offered to host your model on [removed] for others to try

Both your nemo models have garnered a lot of interest in the koboldai discord after I shared them there. People like them there. Some people actually like v1 more, which makes me wonder if v2 could benefit from an extra epoch (2 instead of 1, and hopefully it won't be overfit like v1). They were really surprised how well they worked, and said it fixed a lot of issues they were having with other nemo finetunes.

Also, my tests were with mistral format. I didn't realize you used chatml for this model. I retested with chatml format, and the results were really bad. Quite honestly, EVERY single mistral nemo model I've tested with chatml has had really poor results, without fail. I do not know why. Mistral format just works way better for these models. Personally, I think if you make a v3 you should go back to mistral format. Others would disagree; there are a vocal few on the discord who hate mistral format for its lack of a system prompt. Everyone seems a little split on which format they like better.
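To make the difference concrete, here is a rough sketch of the two single-turn prompt layouts. The ChatML string matches the one used in the training script shared in this thread. The Mistral layout is the commonly seen `[INST] ... [/INST]` form and is an assumption here: exact whitespace and BOS handling differ between tokenizer versions, so the model's own chat template is the authority.

```python
def chatml_prompt(user_message):
    """Single-turn ChatML prompt, ending where the assistant reply begins."""
    return f"<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n"


def mistral_prompt(user_message):
    """Single-turn Mistral-style instruct prompt (whitespace is an assumption)."""
    return f"[INST] {user_message} [/INST]"


example = "Write a short story about a boy named Landon."
print(chatml_prompt(example))
print(mistral_prompt(example))
```

Neither helper adds a system prompt; the Mistral layout shown has no dedicated slot for one, which is the complaint mentioned above.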

Hey @nbeerbower , where can I reach you?

Wow! Really cool testing, thanks for sharing :)

Yes, I think this model is underfit to the Gutenberg data. I was cautious with doing more than 1 epoch because Mahou on Nemo went haywire with more than 1. I do usually format my data with ChatML, but this model in particular picked up the modified tokenizer config from Romulus - not because I changed it. Poor results seem to come from modifying the tokenizer config in any way (adding tokens or changing existing BOS and EOS); I think models need deep pre-training in order to work effectively with these changes. Since this finetune seems popular I will continue iterating on it. Thanks again for your suggestions!

Hey @nbeerbower , where can I reach you?

I'm birubawa on Discord. Since it's public anyway you can also email: [email protected]

Which makes me wonder if v2 could benefit from an extra epoch (2 instead of 1, and hopefully it won't be overfit like v1).

Sorry I didn't mean to imply that v1 was overfit, just that I've had poor results when overfitting data with nemo.

I'm going to try 3 epochs on mini magnum and 3 epochs on gemma2 27B.

That sounds awesome! A lot of people have been asking for a gemma2 27b finetune on gutenberg, surprisingly. I think I've started to turn a lot of people onto it lol.

EDIT - Erased my questions because you already answered them, I just didn't see haha.

Ugh Gemma 27B finetune died from out of memory when merging the model and adapter of all things. Might try again later...
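One plausible reason the merge is the step that runs out of memory, assuming the 27B run followed the same flow as the script shared in this thread: training loads the base in 4-bit, but the merge reloads it in bfloat16. The back-of-the-envelope estimate below counts weights only, using a nominal 27e9 parameters (an assumption) and ignoring quantization constants, activations, and the adapter.

```python
def weight_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 27e9  # assumption: nominal parameter count for a "27B" model

four_bit = weight_gib(N_PARAMS, 4)   # QLoRA-style 4-bit base during training
bf16 = weight_gib(N_PARAMS, 16)      # base reloaded in bfloat16 for the merge

print(f"4-bit weights: {four_bit:.1f} GiB")
print(f"bf16 weights:  {bf16:.1f} GiB")
```

On these assumptions the merge needs roughly four times the weight memory of the training run, so a setup that trains comfortably can still fail at the merge.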

3 epoch mini mag here tho: https://huggingface.co/nbeerbower/mistral-nemo-gutenberg-12B-v3

:kek: what's the config? I had to make shitty compromises in my attempt, and I wonder if you've fixed that

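# Imports needed by this script. base_model, new_model, torch_dtype and
# attn_implementation are assumed to be defined earlier in the notebook.
import gc
import os

import torch
from datasets import concatenate_datasets, load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer

# QLoRA config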
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)

model = prepare_model_for_kbit_training(model)

# Array of datasets to concat
ds = [
    "jondurbin/gutenberg-dpo-v0.1"
]

# load_dataset and combine all
loaded_datasets = [load_dataset(dataset_name, split='train') for dataset_name in ds]
dataset = concatenate_datasets(loaded_datasets)

def format_chat_template(row):
    prompt = row["prompt"]
    # Check if prompt starts with ChatML header "<|im_start|>"
    if not prompt.startswith("<|im_start|>"):
        row["prompt"] = "<|im_start|>user\n" + prompt + "<|im_end|>\n<|im_start|>assistant\n"
    row["chosen"] = row["chosen"] + "<|im_end|>\n"
    row["rejected"] = row["rejected"] + "<|im_end|>\n"
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=os.cpu_count(),
)
dataset = dataset.train_test_split(test_size=0.01)

orpo_args = ORPOConfig(
    learning_rate=8e-6,
    lr_scheduler_type="linear",
    beta=0.1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",
    num_train_epochs=3,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    report_to="wandb",
    output_dir="./results/",
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model(new_model)

# Flush memory
del trainer, model
gc.collect()
torch.cuda.empty_cache()

# Reload tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model)
fp16_model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# fp16_model, tokenizer = setup_chat_format(fp16_model, tokenizer)

# Merge adapter with base model
model = PeftModel.from_pretrained(fp16_model, new_model)
model = model.merge_and_unload()

hf_token = 'hf_yourtokenhere'
model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)

😳

that was gemma2-gutenberg-9b on my 4090 using Maxime Labonne's notebook for ORPO: https://colab.research.google.com/drive/1eHNWg9gnaXErdAa8_mcvjMupbSS6rDvi

Ugh Gemma 27B finetune died from out of memory when merging the model and adapter of all things. Might try again later...

3 epoch mini mag here tho: https://huggingface.co/nbeerbower/mistral-nemo-gutenberg-12B-v3

27b is a beast to finetune haha. I do have some direction you can try if you decide to give it a shot again. Use https://huggingface.co/AALF/gemma-2-27b-it-SimPO-37K as base. It's insanely good. Or if we want to find a similar formula to gemma-2-ataraxy-9b (which seems to be very well received as of late), it would be a good idea to find another good model to Gutenberg then merge it back with the simpo model. If you want to wait for me to test more models I can have some suggestions in a couple days.

I'm trying to do something similar, but I noticed in your script for ORPOTrainer that max_length isn't set. Does that mean it's using the defaults or am I missing something? It seems like it would only train on the first 128 tokens of the input and the first 128 tokens of the output.

Is that really the case? It seems to default to None: https://huggingface.co/docs/trl/main/en/orpo_trainer
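Whichever default applies in a given TRL version, it may be worth measuring how long the samples are before relying on it, and then setting `max_length` (and `max_prompt_length`) explicitly in `ORPOConfig`. The helper below is a hypothetical sketch: `count_over_limit` is not a TRL function, the `prompt` and `chosen` field names follow the script in this thread, and whitespace splitting is only a stand-in for a real tokenizer.

```python
def count_over_limit(rows, max_length, count_tokens=lambda text: len(text.split())):
    """Count rows whose prompt plus chosen response exceeds max_length.

    count_tokens defaults to a whitespace split, which is only a rough proxy;
    pass a function built on the real tokenizer for meaningful numbers.
    """
    over = 0
    for row in rows:
        total = count_tokens(row["prompt"]) + count_tokens(row["chosen"])
        if total > max_length:
            over += 1
    return over


# Toy rows for illustration only, not taken from the actual dataset.
rows = [
    {"prompt": "write a scene", "chosen": "word " * 600},
    {"prompt": "write a scene", "chosen": "word " * 50},
]
n_over = count_over_limit(rows, max_length=512)
print(n_over)
```

If a large share of rows comes back over the limit, the trainer is seeing truncated stories regardless of what the default turns out to be.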
