Update: Go here for the next release (v0.5)

This is an alpha release for testing and feedback, and it has known issues (see Known Issues below). I am already training the next version, but because training takes a long time, I'd appreciate any feedback in the meantime.

Details

  • Base model: llama2_70b_longlora_fp16_32k_ROPE8 (no base instruction tuning)
  • Fine-tuned with Llama-2 chat format
  • System prompt: An interaction between a user providing instructions, and an imaginative assistant providing responses.
  • 32K context length. Use linear RoPE scaling with a factor of 8 (IMPORTANT: use a factor of 8 even if you are not using the full 32K context length)
  • Not sure how well the model performs in notebook mode/completions. At least for the initial prompts of a conversation, the chat format & system message seem to matter.
  • This model is not censored, and is capable of producing offensive and NSFW content. Please use this model with caution, and do not use if you are offended by such content.
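
To illustrate why the scaling factor matters even at short contexts, here is a minimal sketch of linear RoPE scaling in plain Python. It is not the loader's actual code: `rope_angles`, `head_dim=128`, `base=10000.0` and the 4096-token native context are assumptions based on standard Llama-2 settings, not values stated in this card. Linear scaling divides each position index by the factor before the rotary angles are computed, so the model only sees the positions it was fine-tuned on when the same factor is applied at inference.

```python
def rope_angles(position, head_dim=128, base=10000.0, scaling_factor=1.0):
    """Rotary angles for one token position.

    Linear RoPE scaling divides the position by `scaling_factor`
    before the per-frequency angles are computed.
    """
    scaled_position = position / scaling_factor
    return [scaled_position * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


NATIVE_CONTEXT = 4096  # assumption: Llama-2 pre-training context length
ROPE_FACTOR = 8        # factor given in the Details list
EXTENDED_CONTEXT = NATIVE_CONTEXT * ROPE_FACTOR

# With factor 8, position 8 produces the same angles as position 1 unscaled.
same = rope_angles(8, scaling_factor=ROPE_FACTOR) == rope_angles(1)
print(EXTENDED_CONTEXT, same)
```

Under these assumptions, the product of the native context and the factor gives the 32K window, and a prompt of only a few hundred tokens still needs the factor, because otherwise its positions are spaced 8 times further apart than anything seen during fine-tuning.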

Available Quantizations

  • bfloat16
  • EXL2 2.4bit (experimental new quant) fits in 1x24GB using Exllamav2 & 8-bit cache @ 10K context
  • EXL2 4bit fits in 2x24GB (19/24) using Exllamav2 @ 16K context
  • EXL2 6bit fits in 48GB+24GB (36/24 split) or 3x24GB (16/17/20 split) using Exllamav2 @ 32k context
  • GGUFs here thanks to Noeda!
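
As a rough sanity check on the GPU splits above, the weight footprint can be estimated from the bits per weight. This is a back-of-the-envelope sketch only: the 70e9 parameter count is an assumption taken from the "70b" in the base model name, `weight_gb` is a hypothetical helper, and the estimate ignores the KV cache, activations and any per-layer overhead.

```python
# Assumption: ~70e9 parameters, taken from the "70b" in the base model name.
PARAMS = 70e9


def weight_gb(bits_per_weight, params=PARAMS):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in [("bfloat16", 16), ("EXL2 6bit", 6),
                   ("EXL2 4bit", 4), ("EXL2 2.4bit", 2.4)]:
    print(f"{name:12s} ~{weight_gb(bits):6.1f} GB of weights")
```

By this estimate the 2.4bit quant needs about 21 GB for weights, which leaves only a few GB of a 24GB card for the cache and is consistent with the 8-bit cache and 10K context noted above. The 4bit and 6bit quants come to roughly 35 GB and 52.5 GB, so the remaining memory on 2x24GB and 3x24GB goes to the longer contexts.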

Main functions

  • Story Co-writing: Co-write a story over multiple guided prompts over a 32K context, staying consistent with prior story details, capable of writing both long and short scenes. Start by explaining that you are writing a story scene-by-scene, provide some background, themes/tags, and describe what you want in the first scene. After that, continue directing the story direction one piece at a time. You can give the model more creative control by asking it to add imaginative details, or have it precisely follow your scene outline.
  • Brainstorming/Speculation/Analysis: Pause writing in the midst of co-writing a story, and analyze the story so far, bounce ideas about future directions, etc.
  • Oneshot Story-writing: Write a complete story in one go, based on an outline, themes/tags, etc. Make sure you explain that this is not scene-by-scene writing, and that the story is meant to be written in a single go. You can specify a word count to shoot for (though the model may not respect it). Example
  • Document Search/Analysis: Reading comprehension & finding information from a long document, or sets of documents (up to 32K tokens)

Secondary Functions (limited training so far, model may get confused between task types)

  • Roleplaying (RP): Explain what RP is, set up a scenario and characters, and start the RP. You can specify any rules like OOC or use of emojis, etc.
  • Interactive Fiction (IF) Emulation: Adventure game/interactive fiction emulation like Zork, Anchorhead, etc. Explain what it is, and how the AI should respond, specify the kind of game, tags, and so on. You can interact with usual commands like 'north', 'examine book', etc.
  • Choose Your Own Adventure (CYOA) Emulation: Explain what you're looking for and how you want the AI to respond (e.g., with a numbered list of prompts at the end of each turn), and you can pick which option you want the story/game to follow. Most such human-written games tend to have 1-2 prompts, so I had a hard time getting the AI to give more options. Finetuning is helping, but the model is only half-baked for now.
  • Document Summary/Editing: Brief or comprehensive summaries of a long document, or sets of documents, in various formats (prose, bulleted list, table). Can also do some limited re-writing, conversions between formats and grammar checking.
  • General Chatting: Explain that it is a general chat, or provide some preamble to your interaction before starting. Otherwise the model might not know if you want to RP, story-write or something else.
  • General Logic/Reasoning: Same guidelines as above.

Prompting Guidelines

  • Treat the first prompt like you normally would the system prompt
    • System prompt itself does not change
    • Describe what you want the AI to do in detail in the first prompt, even if you feel it is obvious. This is how the AI can tell what sort of task it is supposed to perform (story-writing, RP, adventure game emulation, summarization, and so on).
    • After that, specify anything else you want in the first prompt (your instructions for the next response, for instance).
  • Bias the length of the output with your prompt. This is no guarantee, so you may need to regenerate if you don't get your preferred length. The model will easily produce 2000+ tokens (e.g., for a story scene), so make sure your response limit can handle that.
    • E.g., statements like Make this a long response would bias the response longer
    • Statements like Respond briefly would bias it shorter
  • Explain clearly if you want the content to be SFW or NSFW in the first prompt as well. However, there are no guarantees that the model won't generate NSFW content if you force it to, in a later prompt, even if you specify the content should be SFW at the start. It's just a statistical bias (that should get better with more training).
  • Give the model details to go on. The more you flesh out what you want, the better and more consistently it will write. Tiny prompt = Tiny story, and more ChatGPTisms.
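
The guidelines above can be sketched as a small prompt builder. This is a minimal illustration, not the training code: `build_llama2_chat_prompt` is a hypothetical helper, the template follows Meta's reference Llama-2 chat layout (the exact whitespace used in fine-tuning is an assumption), and the story details in the example prompt are made up. Many backends add the `<s>` token themselves, in which case the literal `<s>` should be dropped.

```python
SYSTEM_PROMPT = ("An interaction between a user providing instructions, "
                 "and an imaginative assistant providing responses.")


def build_llama2_chat_prompt(turns, system_prompt=SYSTEM_PROMPT):
    """Build a Llama-2 chat prompt.

    `turns` is a list of (user, assistant) pairs; use None as the
    assistant text for the turn the model should answer next.
    """
    parts = []
    for index, (user, assistant) in enumerate(turns):
        if index == 0:
            # The system prompt never changes; it is folded into the first user turn.
            user = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user}"
        parts.append(f"<s>[INST] {user.strip()} [/INST]")
        if assistant is not None:
            parts.append(f" {assistant.strip()} </s>")
    return "".join(parts)


# The first prompt states the task, the SFW/NSFW preference, the details
# and a length bias, as the guidelines suggest (illustrative content only).
first_prompt = (
    "Let's write a story together, scene by scene. I will describe each "
    "scene and you will write it out in full. Keep the content SFW.\n"
    "Background: a lighthouse keeper finds a sealed letter washed ashore.\n"
    "Tags: mystery, slow-burn, coastal village.\n"
    "First scene: the keeper discovers the letter at dawn. "
    "Make this a long response."
)

prompt = build_llama2_chat_prompt([(first_prompt, None)])
print(prompt)
```

Later turns only need the direction for the next scene, since the task description from the first prompt stays in the context.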

Known Issues

  • Blank Outputs: When you have many short prompts, sometimes the model just produces the EOS token. Especially with RP and adventure game emulation. I believe that this is due to this issue. Fixing in next iteration, but meanwhile, workarounds:
    • If you're a few prompts into your conversation, change the prompt format away from Llama-chat to break the model out of it. The model seems to adapt just fine to the new format mid-conversation.
    • Use Start reply with in Oobabooga to force the first token to be something other than </s>
    • Ban the EOS token (though you need to stop the generation manually in that case)
    • Strip the space after the final [/INST], though I don't know of an easy way to do that without writing code in Oobabooga
    • Ban the EOS token only for the first generated token, though not sure how you'd do that without some code (this feature seems like a good idea to always have enabled actually)
    • Wait for the next iteration where I think I have it fixed! Airoboros went through the same issue when they switched to Llama-chat.
  • Lack of Diversity for NSFW Content: Some common phrases & descriptions are over-used. I believe I know why this is, and believe it can be fixed with more training/diverse content (which is currently being done).
  • ChatGPTisms: Not refusals, but canned responses, happy endings, that sort of thing. Thankfully, this does not happen often, but it shouldn't happen at all, as this was not in the training data. But it shows up anyhow, possibly because base Llama-2 has it baked in.
    • The first prompt is the worst, because Llama is probably remembering the ChatGPT conversations it saw in pre-training. Gets less likely at longer contexts which look less like ChatGPT samples. Any reference to a 'chat' or a 'helpful assistant' seems to trigger these 'memories'.
    • I will eventually fight this with DPO (using prompt-biased GPT-4 generated responses as the rejected option).
    • For now, regenerate, or prompt engineer. The model usually can regenerate diverse responses based on temperature, top_p and top_K (many models don't have a diverse distribution outside the top greedy tokens, but you don't want that in a creative model like this one).
    • If your client allows it, edit the model output to make it how you want. This will serve as an example in the history to align the model away from its pre-training in future rounds. Editing the model response is more effective than negative instructions (e.g., don't do X).
  • Repetition: The usual thing. Not sure why it happens. I avoid it by setting repetition penalty = 1.16 (or higher) and repetition range = 4096.
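
Two of the blank-output workarounds above need a little code. The following is a backend-agnostic sketch in plain Python, not an Oobabooga or Exllamav2 API: `ban_eos_on_first_token` and `strip_space_after_final_inst` are hypothetical helpers, logits are modelled as a plain list of floats, and the EOS id of 2 is the usual Llama-2 tokenizer value, assumed here. In a real setup the first function would be hooked into whatever per-step logits callback your backend provides.

```python
import math

EOS_TOKEN_ID = 2  # assumption: </s> id in the standard Llama-2 tokenizer


def ban_eos_on_first_token(logits, num_generated, eos_token_id=EOS_TOKEN_ID):
    """Return a copy of `logits` with EOS masked only for the first generated token."""
    masked = list(logits)
    if num_generated == 0:
        masked[eos_token_id] = -math.inf
    return masked


def strip_space_after_final_inst(prompt):
    """Remove trailing whitespace after the final [/INST], if the prompt ends that way."""
    stripped = prompt.rstrip()
    return stripped if stripped.endswith("[/INST]") else prompt


step0 = ban_eos_on_first_token([0.1, 0.2, 5.0, 0.3], num_generated=0)
step1 = ban_eos_on_first_token([0.1, 0.2, 5.0, 0.3], num_generated=1)
print(step0, step1)
print(repr(strip_space_after_final_inst("<s>[INST] north [/INST] ")))
```

Because EOS is only masked at step 0, generation can still stop normally afterwards, so there is no need to end it manually as with a full EOS ban.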

Training Data

For 70% of the training data, the outputs were written by humans, though some of the inputs may have been originally seeded by GPT-4 (and expanded using other LORAs).

For 30% of the training data, the outputs were written by GPT3.5/4, but this was mostly logical reasoning, summarization and other non-creative content (with no issue of refusals or alignment). Some GPT4-generated creative content was present in the RP proxy logs however.

In many cases, the dataset used is a re-stitched, augmented version of the original dataset suited for training long-context models. These modified versions of the datasets are linked below.

Training Data:

  • Main dataset: Human-written stories from forums, fanfic websites, The Pile and other sources
    • Raw text was chunked and converted into Q&A (analysis, reading comprehension) and long-form, multi-round, scene-by-scene writing interactions, using my story-reverse-prompt model.
    • Each interaction was truncated at 32K (usually the first third of a normal-sized novel).
    • In a small fraction of cases, the story-writing continued beyond 32K, but with the dropped initial portions either summarized, or re-constituted using RAG, to fit in 32K and still provide context.
    • I can't publish this dataset directly, as I am not sure what the current legal status of The Pile is. But you can use the above model to generate your own dataset from raw story text.
  • Summaries of Wikipedia articles in various formats
  • Physical/Spatial Reasoning (line of sight, physical deduction) and Theory of Mind (who knows what about what) problems, double-checked by GPT4-Turbo (2-shot).
  • Document Editing Tasks
  • Sections of Airoboros 2.2.1/3.1 (RP, chain-of-thought, rules-based chats, theory of mind, dealigned writing, jokes/riddles).
  • Sections of Surge Instruct (for extraction, summarization, re-writing, classification).
  • Proxy RP Logs (GPT4 outputs only): jannie-log-augmented, Teatime & aicg-logs-augmented were all re-stitched together to create a single seamless conversation (sometimes from the original source) to undo the 2K or 4K divisions, and augmented with more context and rules about the conversation in the first prompt.
  • A fully re-generated version of Floyd Text Adventures with better context and AI interaction format. Here is the link to the original until I upload the modified version.
  • A fully re-generated version of the CYS CYOA dataset (re-generated from source by 'dungeon crawling' the space automatically, maximizing visiting unique 'rooms', then converting the output logs into a chat format).
  • NART synthetic therapy logs were heavily filtered and used cautiously (lots of GPTisms, but actually relevant in this context where the AI is playing a supportive role).
  • Augmental-Stenisgate-Augmented, an augmented, cleaned up version of Augmental Stenisgate RP where the AI only plays a single character.
  • bluemoon_Karen_cleaned, an error-corrected version of Bluemoon RP, re-generated using Karen The Editor.
  • PIPPA-augmented-dedup, a de-duplicated, cleaned and augmented version of PygmalionAI's PIPPA.
  • LimaRP-augmented, an augmented, re-stitched version of LimaRP for long-context training.
  • Erotic Analysis was used in reverse, for one-shot NSFW story generation.
  • Reading Comprehension
  • Unnatural Instructions for word-constrained generation.
  • Long Instructions for relevant document finding/retrieval up to 32K.
  • OpenORCA GPT4 outputs only.
  • Ultrachat Uncensored with capitalization errors fixed & further scrubbed for GPTisms (not just refusals, sentiment as well).
  • ShareGPT Hyper Filtered further scrubbed for GPTisms (not just refusals, sentiment as well).
  • Claude Multiround also further scrubbed, but being a different model than GPT4 I may not have caught all the gushing positivity.
  • Wizard Vicuna Unfiltered further scrubbed like the others.
  • TinyStories GPT4
  • SODA Synthetic Dialogue used with caution (mostly for title suggestions).
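
As an illustration of the 32K truncation step mentioned for the main dataset, here is a minimal sketch that cuts a multi-round interaction at a token budget. It is not the actual data pipeline: `truncate_interaction` is a hypothetical helper, cutting at whole-round boundaries is an assumption, 32K is taken as 32768 tokens, and the whitespace word count stands in for the real tokenizer.

```python
def truncate_interaction(rounds, max_tokens=32768,
                         count_tokens=lambda text: len(text.split())):
    """Keep leading (prompt, response) rounds while the total fits in `max_tokens`.

    `count_tokens` is a stand-in; a real pipeline would count tokens
    with the model's own tokenizer.
    """
    kept, used = [], 0
    for prompt, response in rounds:
        cost = count_tokens(prompt) + count_tokens(response)
        if used + cost > max_tokens:
            break  # drop this round and everything after it
        kept.append((prompt, response))
        used += cost
    return kept


rounds = [
    ("Write the opening scene.", "word " * 10),
    ("Continue with the storm.", "word " * 10),
    ("Now the ending.", "word " * 10),
]
# Tiny budget for demonstration: only the first two rounds fit.
print(len(truncate_interaction(rounds, max_tokens=30)))
```

Dropping whole rounds rather than cutting mid-response keeps every retained training example a complete prompt and reply.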

License

Unsure. It uses some datasets which were generated using GPT-4 outputs, so OpenAI's terms may apply. I personally have no objection to this model being used for any commercial or non-commercial purpose, but please respect the license agreements of Meta, OpenAI or other parties involved.
