# GPT2-Prompt-Upscaler-v1
Date Trained: March 2024
A lightweight model for generating Danbooru tag-based prompts from just a few input tags. It's simple, fast, and (hopefully) useful for enhancing your text-to-image generations. An NSFW mode is also included.
## Model Description
The GPT2-Prompt-Upscaler-v1 is designed to extend and refine prompts, aligning them with the tag distribution you’d expect from Danbooru images. Think of it as a friendly helper that fills in the gaps when you’re stuck or want more details for your image generation.
## Why Use This Model?
- Compact: At 355M parameters, it's much lighter than bigger models like the Phi-3-based upscalers and on par with TIPO (though it was trained well before TIPO existed!). It won't hog your VRAM.
- Fast: Inference takes under a second on a modern GPU, so it adds virtually no overhead to your t2i generations.
- Efficient: Saves resources for your main image generation process.
## What Can It Do?
- Character Refinement: Turn a simple character tag (like `hatsune miku`) into a fully detailed prompt with hairstyles, outfits, and accessories.
- Adding Details: Sprinkle in creative details to enhance a scene or concept.
- Inspiration: Want random but interesting variations? Just toss in a short idea and let it play around.
- Prompt Polishing: Clean up and refine elements for better generation outputs.
## Training Details
The model is finetuned from GPT2-medium on 10M prompts extracted from a refined Pixiv dataset for 5 epochs, with roughly 2B tokens seen per epoch.
Training was done on an 8×H100 node and took about 30 hours.
The training format looks something like this:
- Rating: [safe | nsfw]
- Chara: [Danbooru character tags]
- Date: [2020s | 2010s | 2000s]
- Quality: [normal | good | excellent] (based on aesthetics)
- Tags: [General Danbooru tags]
- Output: Model-generated continuation.
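The fields above can be assembled into the model's expected input string. Here is a minimal sketch; the helper name `build_input` and its defaults are my own, not part of the release:

```python
def build_input(tags="", rating="safe", chara="", date="2020s", quality="excellent"):
    """Assemble a prompt in the training format described above.

    The model continues the text after <output> until it emits </output>.
    """
    return (
        f'<input rating="{rating}" chara="{chara}" '
        f'date="{date}" quality="{quality}" tags="{tags}"><output>'
    )

prompt = build_input(tags="1girl, long hair, white hair")
print(prompt)
# <input rating="safe" chara="" date="2020s" quality="excellent" tags="1girl, long hair, white hair"><output>
```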
## How to Use It
It's a causal GPT2 model: you prepare your input in a structured format, and it keeps generating until it emits `</output>`. Here's what an input might look like:
### Example Prompts
```
<input rating="safe" chara="" date="2020s" quality="excellent" tags="1girl, long hair, white hair"><output>
<input rating="safe" chara="" date="2020s" quality="excellent" tags="1girl, purple hair, white hair"><output>
<input rating="safe" chara="" date="2020s" quality="excellent" tags="gothic lolita"><output>
<input rating="safe" chara="hatsune miku" date="2020s" quality="excellent" tags=""><output>
```
The model generates extensions, filling in gaps with tags that feel natural and fitting.
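Once generation stops (or you truncate at the stop marker yourself), the raw continuation still contains `</output>` and may repeat tags you already supplied. A small cleanup step handles both; this is a sketch under my own assumptions about the output shape, not code shipped with the model:

```python
def clean_continuation(generated: str, input_tags: str) -> str:
    """Strip the stop marker and drop tags already present in the input."""
    # Keep only the text before the stop marker, if present.
    body = generated.split("</output>")[0]
    seen = {t.strip() for t in input_tags.split(",") if t.strip()}
    out = []
    for tag in body.split(","):
        tag = tag.strip()
        if tag and tag not in seen:
            seen.add(tag)  # also dedupes repeats within the continuation
            out.append(tag)
    return ", ".join(out)

raw = "1girl, long hair, white hair, blue eyes, school uniform</output>"
print(clean_continuation(raw, "1girl, long hair, white hair"))
# blue eyes, school uniform
```

Appending the cleaned tags to your original prompt gives you the final, extended prompt for your t2i pipeline.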
## Limitations
Okay, so here's the deal: this model isn't perfect. It's relatively small, and (confession time) it's technically my fifth training attempt but the first one I thought was worth keeping, so it has a few quirks:
- Over-Focusing on Scenery: When you set the quality to "excellent," the model sometimes gets overly enthusiastic about beautiful backgrounds and makes characters too small.
- Alphabetical Tagging: Occasionally, it gets into a habit of generating tags in alphabetical order, which can lead to repetitive color tags at the end.
- Needs More Data: It might benefit from a retrain with updated Danbooru tags to iron out some of these issues.
So, yeah, this isn’t a “final form” model, but I think it’s still pretty handy. I might update it in the future, so stay tuned!
## What's Next?
I’ll be sharing:
- The Dataset: The Pixiv 2023 prompts corpus used for training.
- A Demo: A simple interface to try out the model.
This model is for anyone who wants quick and lightweight prompt refinement without the heavy lifting of larger models. Play around with it, and let me know what you think!