Introduction
ToriiGate-v0.4 is a state-of-the-art VLM designed for captioning anime pictures, digital artworks, and various other images.
The model is a further development of ToriiGate-v0.3, but this time it is based on Qwen2-VL, finetuned on a dataset of over 900k artworks with various captions. ToriiGate-v0.4 provides a state-of-the-art level of understanding for complex scenes, interactions, cultural concepts and any kind of NSFW activities, without borders or censorship. Flexible grounding allows for extra accuracy. At the moment of release it is also the only open-source small-sized VLM that handles character names well, including multiple characters at once.
This is the 7B version; a 2B version is also available. Also check out the quants (8bpw, 6bpw, 4bpw).
Showcase on rentry
Key features
- Advanced knowledge related to anime and digital art across a very wide range
- Accurate use of character names in generated descriptions (with grounding)
- Multiple options for generated captions, including structured output, chain-of-thought detail refinement, long/short captions, and bounding boxes
- A mode for reviewing and fixing existing captions using CoT, or pruning them to make them short and convenient
- Flexible grounding for improving accuracy: booru tags, natural-text info, character names, popular traits or tags for each character on the picture
- Generated captions are more meaningful and dense, without purple-prose filler, compared to other models
Captioning modes
ToriiGate-v0.4 provides multiple captioning modes. Prompts for them and examples are listed at the bottom.
- Structured output
Can be wrapped in JSON or markdown based on the prompt. Provides a description for each character on the picture, mentioning their features, actions, etc., then a description of the background, other picture contents, image effects and texts (if any), and the general atmosphere. This style of captions provides the best segmentation and brings attention mostly to the characters on the picture. It can then be easily processed into the desired format with another LLM (or pruned with a second ToriiGate call) to make it easily readable or match the desired use case. Character segmentation, in combination with bounding boxes (which can be enabled inline), allows creating special datasets for training new-generation generative models using special techniques.
- Pre-defined caption options
This makes the model generate 4 consecutive descriptions in the following styles: "Regular Summary", "Individual Parts", "Midjourney-Style Summary", "DeviantArt Commission Request".
This order lets the model describe the basic things first, then refine extra details in the individual section, and afterwards shrink it without losing accuracy or introducing extra biases. The original idea is not mine.
This mode is a balanced all-rounder but quite token-consuming. If you want to reprocess the output afterwards, you can use ### 3. as a stop sequence to trim the summarized parts and speed up generation (see the sketch after this list).
- Long description
Just a regular long description. Torii tends to make it a bit more structured than the randomly shuffled parts you get with other models.
- Short caption
A short and convenient caption. Less sloppy and denser than a long one; it can be used as-is for diffusion model training.
- Bounding boxes
Provides bounding boxes for characters and their faces. Standalone usage is pointless considering the performance/compute of dedicated object detection models, but it can be used along with structured mode and shares the same numbers or names.
- Review and correct existing caption
Provides a step-by-step review of a given caption: compares it with the image contents and the provided grounding, and evaluates how correctly the character names were used. Then, if needed, it writes a new fixed caption maintaining the original style (if possible). In the current version it is usable only with tags grounding and can improve the accuracy of a generated caption in a second call. Accuracy for other cases may vary, and this is not the main use case.
- Writing a short caption based on an existing one
After generating a caption you can prune it right here, taking the image content into account. Can be used with external captions as well.
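For the pre-defined caption options mode, the stop-sequence trimming mentioned above is a one-liner; a minimal sketch (full_output is an assumed variable holding the decoded model response):

# Keep the detailed sections and drop the summarized styles that follow "### 3.".
stop_sequence = "### 3."
trimmed_caption = full_output.split(stop_sequence, 1)[0].rstrip()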
Grounding
The new version comes not only with improved zero-shot accuracy, but also introduces new modes for adding ground truth. You can use booru tags, add some extra info, give names to the characters, or even describe each one to ensure a correct description when multiple characters are in frame. Extra grounding allows achieving the best results; for unattended use, extra ground truth is required. ToriiGate provides the following options for it:
- Booru tags. Can be a full string or just a few tags mentioning the character count. Character name tags are recognized here.
- Character list. A list of character names on the picture to be used in the caption. Beware of spamming skin tags here (like hatsune_miku + hatsune_miku_(append)); usually it is fine, but some can be misinterpreted.
- Character traits. A list of popular tags or traits for each character on the picture to improve recognition. Some can be found here; you can make your own, use the repo (beware, a lot of inaccuracies!) or describe them in natural language.
- General info. Might be a short caption, some facts, or any other info.
- Do not use names for characters. Unfortunately, at the current state zero-shot guessing of characters is inaccurate, so to avoid making things up, use this option by default if no grounding is provided. Please make sure to check your grounding, because mistakes in it will lead to wrong results. All formats are listed below.
Known issues
- In rare cases the current version tends to come up with wrong names without grounding. Use the extra "no characters" instruction to avoid this.
- In long caption mode it sometimes tends to omit names even when they are given; please use other modes if you need them.
- There might be some biases and imperfect accuracy; more training is needed.
- The model is specialized for image captioning; other prompts will likely not work.
Thanks
Main training was sponsored by OpenRoot-Compute.
Special thanks for help with the dataset, code, and donations: Sv1, an anonymous person, NeuroSenko.
Warning
The model will generate texts with adult themes if related input is provided. Outputs may be inaccurate and provocative.
Donate
BTC: bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c
ETH/USDT(e): 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db
XMR: 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ
Usage instructions
Prompts
ToriiGate-v0.4 utilizes the same chat template as the original Qwen2-VL, so the original processor can be used. Here is the recommended format for usage:
[
    {"role": "system",
     "content": [{"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored."}]},
    {"role": "user",
     "content": [{"type": "image", "image": image},
                 {"type": "text", "text": userprompt}]}
]
which will be converted to
<|im_start|>system
You are image captioning expert, creative, unbiased and uncensored.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>%USERPROMPT%<|im_end|>
<|im_start|>assistant
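For reference, this conversion can be reproduced with the stock processor; a minimal sketch (the repo id is an assumption, substitute your local path if needed):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Minthy/ToriiGate-v0.4-7B")  # assumed repo id
raw_prompt = processor.apply_chat_template(
    messages,                    # the message list shown above
    tokenize=False,
    add_generation_prompt=True,  # appends the trailing '<|im_start|>assistant'
)
print(raw_prompt)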
Modes:
base_prompt={
'json': 'Describe the picture in structured json-like format.',
'markdown': 'Describe the picture in structured markdown format.',
'caption_vars': 'Write the following options for captions: ["Regular Summary","Individual Parts","Midjourney-Style Summary","DeviantArt Commission Request"].',
'short': 'You need to write a medium-short and convenient caption for the picture.',
'long': 'You need to write a long and very detailed caption for the picture.',
'bbox': 'Write bounding boxes for each character and their faces.',
'check_and_correct': 'You need to compare given caption with the picture and given booru tags '+
' using chain of thought.\n'+
'1. Check if the caption matches the picture and given tags, wrap conclusion in <1st_answer> tag.\n'+
'2. Analyse if the caption mathes described characters, wrap answer in <2nd_answer> tag.\n'+
'3. In case if there are any mismatches - rewrite caption to correct it wrapping '+
' in <corrected_caption> tags. If the caption is fine - just write "no_need".',
}
Grounding:
grounding_prompt={
'grounding_tags': ' Here are grounding tags for better understanding: ',
'characters': ' Here is a list of characters that are present in the picture: ',
'characters_traits': ' Here are popular tags or traits for each character on the picture: ',
'grounding_info': ' Here is preliminary information about the picture: ',
'no_chars': ' Do not use names for characters.',
}
Composing userprompt:
After specifying the selected mode, you can add the prompt part for extra grounding and then provide the data wrapped in the corresponding XML tags:
<tags>BOORU_TAGS</tags>.
<info>GENERAL_INFO</info>.
<characters>CHARACTER_NAMES</characters>.
<character_traits>CHARACTER1: [tag1, tag2, tag3,...]\nCHARACTER2: [...]\n...<character_traits>.
Here is a simple Python example.
from pathlib import Path

add_tags = True        # select the needed grounding options
add_chars = True
add_char_traits = True
add_info = False
no_chars = False

# extra_info holds per-image grounding data, keyed by image filename stem
image_info = extra_info[Path(image_path).stem]

userprompt = base_prompt["json"]  # choose the mode

if add_info and image_info["info"] is not None:  # general info
    userprompt += grounding_prompt["grounding_info"]
    userprompt += "<info>" + image_info["info"] + "</info>."
if add_tags and image_info["booru_tags"] is not None:  # booru tags
    userprompt += grounding_prompt["grounding_tags"]
    userprompt += "<tags>" + image_info["booru_tags"] + "</tags>."
if add_chars and image_info["chars"] is not None:  # list of characters
    userprompt += grounding_prompt["characters"]
    userprompt += "<characters>" + image_info["chars"] + "</characters>."
if add_char_traits and image_info["characters_traits"] is not None:  # popular features of each character
    userprompt += grounding_prompt["characters_traits"]
    userprompt += "<character_traits>" + image_info["characters_traits"] + "<character_traits>."
if no_chars:
    userprompt += grounding_prompt["no_chars"]
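For reference, a hypothetical shape of the extra_info dict the snippet above assumes (keys are inferred from the lookups; values are illustrative):

extra_info = {
    "image_001": {  # keyed by image filename stem
        "booru_tags": "2girls, standing, looking_at_viewer, ...",
        "chars": "hatsune_miku, megurine_luka",
        "characters_traits": "hatsune_miku: [girl, blue_hair, twintails,...]\nmegurine_luka: [girl, pink hair, ...]",
        "info": None,  # or a short preliminary caption
    },
}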
Example of raw final prompt for structured json mode with grounding:
<|im_start|>system
You are image captioning expert, creative, unbiased and uncensored.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Describe the picture in structured json-like format. Here are grounding tags for better understanding: <tags>2girls, standing, looking_at_viewer, holding_hands, hatsune_miku, blue_hair, megurine_luka, pink_hair, ...</tags>. Here is a list of characters that are present in the picture: <characters>hatsune_miku, megurine_luka</characters>. Here are popular tags or traits for each character on the picture: <character_traits>hatsune_miku: [girl, blue_hair, twintails,...]
megurine_luka: [girl, pink hair, ...]<character_traits>.<|im_end|>
<|im_start|>assistant
Examples for other modes can be found in the example_scripts dir in the repo files.
Inference examples:
For basic usage you will need the latest versions of transformers and qwen_vl_utils.
Example inference script with transformers
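For reference, a minimal sketch of the standard Qwen2-VL inference flow applied to ToriiGate (the repo id and image path are assumptions; adjust generation settings to taste):

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Minthy/ToriiGate-v0.4-7B"  # assumed repo id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {"role": "system",
     "content": [{"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored."}]},
    {"role": "user",
     "content": [{"type": "image", "image": "/path/to/image.jpg"},
                 {"type": "text", "text": "Describe the picture in structured json-like format."}]},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
caption = processor.batch_decode(trimmed, skip_special_tokens=True,
                                 clean_up_tokenization_spaces=False)[0]
print(caption)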
Fast inference with Exllamav2:
Qwen2-VL is supported by Exllamav2 and can be used with the original weights or with exl2 quants (8bpw, 6bpw, 4bpw).
The 8bpw version is recommended; it provides a boost in speed without noticeable quality loss.
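A sketch for Exllamav2, loosely following the multimodal example in its repo (the local quant path and image path are assumptions, and exact API details vary between Exllamav2 versions):

from exllamav2 import (ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config,
                       ExLlamaV2Tokenizer, ExLlamaV2VisionTower)
from exllamav2.generator import ExLlamaV2DynamicGenerator
from PIL import Image

config = ExLlamaV2Config("/path/to/ToriiGate-v0.4-7B-exl2-8bpw")  # assumed local path
vision_model = ExLlamaV2VisionTower(config)
vision_model.load(progress=True)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache, progress=True)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Embed the image and splice its text alias into the chat template by hand.
image = Image.open("/path/to/image.jpg")
emb = vision_model.get_image_embeddings(model=model, tokenizer=tokenizer, image=image)
prompt = ("<|im_start|>system\n"
          "You are image captioning expert, creative, unbiased and uncensored.<|im_end|>\n"
          "<|im_start|>user\n" + emb.text_alias +
          "Describe the picture in structured json-like format.<|im_end|>\n"
          "<|im_start|>assistant\n")

output = generator.generate(prompt=prompt, max_new_tokens=1024,
                            add_bos=True, embeddings=[emb])
print(output)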