The 2.0 and m2.0 versions still feel like a QLora
Hey,
As weird as it seems, those finetuned versions still feel like a QLora. I noticed the drop in quality quickly: when I try to write a story with only female characters, only your QLora versions disregard that constraint and add some male characters.
Yet on the finetuned 2.0 and m2.0 models the problem persists: they add male characters, and the outputs are shorter than those of airoboros-l1-13b-1.4.
I don't know if that's because the 1.4 dataset is superior to the 2.0 and m2.0, but it's a shame I still prefer the llama1 airoboros...
My guess is that the claims about GPT4 getting dumber over time are true. You said you were using the June version of GPT4 to create the 2.0 dataset; it seems the March version gave better outputs, and that's why the 1.4 dataset seems better in my book.
Edit: 2.0 seems to be more consistent than m2.0, and sometimes the generation ends after only a few sentences.
I assure you they were not fine-tuned with QLoRA - I used fastchat:
https://gist.github.com/jondurbin/7183e6edcc5cb57d5f544614d0ce0503
I might make a llama-1 version of this to see if it's something in the llama-2 base model itself causing problems.
When I was working on the new dataset, I had to be much more explicit with the prompts to get June gpt4 to respond the way I wanted it to (March version worked fine with much less detail), and even then it often ignored certain details or hallucinated extra criteria, so I suspect that's where the problem is.
When you say the 2.0 is more consistent, do you mean consistently better or worse?
"I might make a llama-1 version of this to see if it's something in the llama-2 base model itself causing problems."
I know I'm asking a lot, but I'd really love to get llama2-13b finetuned with the 1.4 Airoboros dataset instead. It worked so well on llama1, and I'm sure it will do its magic for this one too.
"When I was working on the new dataset, I had to be much more explicit with the prompts to get June gpt4 to respond the way I wanted it to (March version worked fine with much less detail), and even then it often ignored certain details or hallucinated extra criteria, so I suspect that's where the problem is."
Yeah, same. I work as a data scientist and often use gpt4 to write code. At the beginning (March -> May) it was really easy to talk to. Now it feels like I have to explain things as if to a 5-year-old; it understands less and less of what I'm asking for. I wouldn't say it got dumber to the point of going back to legacy 3.5 level, but it's definitely not as smart as the March gpt4, and that's such a shame...
That probably explains the drop in quality we got on the finetunes: June gpt4 doesn't provide good enough outputs anymore. I wish we could go back to March gpt4. At least you still have the 1.4 dataset, so there's that I guess :p
Edit: Looks like we're not the only ones noticing this ^^' https://www.reddit.com/r/ChatGPT/comments/15ekje9/goodbye_chat_gpt_plus_subscription/
"When you say the 2.0 is more consistent, do you mean consistently better or worse?"
For the better: I got serious hallucinations less often on 2.0 than on m2.0.
"For the better: I got serious hallucinations less often on 2.0 than on m2.0."
I'm not sure a llama-2 fine-tune of 1.4 would be what you want if this is the case, because m2.0 includes 1.4 so if it's worse than 2.0 the problem is likely somewhere in the 1.4 dataset.
It may also just be overfit; perhaps an earlier checkpoint would do better. Let me try a couple of things.
"I'm not sure a llama-2 fine-tune of 1.4 would be what you want if this is the case, because m2.0 includes 1.4 so if it's worse than 2.0 the problem is likely somewhere in the 1.4 dataset."
I think there are two possible explanations. Either merging the 1.4 dataset with the 2.0 dataset was a bad idea, because the outputs differ (June vs March gpt4) and the model has trouble training on two different paradigms, or, as you said, according to the LIMA paper a dataset shouldn't be too big and the merged one is.
I don't think there's a problem with the 1.4 dataset. It made llama1-13b really interesting by itself, which is why I wanted to see if it would do the same for llama2.
Congratulations, you've volunteered yourself to test this!
Here are 7 models, tuned with varying datasets, some qlora, some full fine tunes, some llama-2, some not.
- https://huggingface.co/jondurbin/blind-test-13b-zane
- https://huggingface.co/jondurbin/blind-test-13b-vlad
- https://huggingface.co/jondurbin/blind-test-13b-martha
- https://huggingface.co/jondurbin/blind-test-13b-jimmy
- https://huggingface.co/jondurbin/blind-test-13b-jasmine
- https://huggingface.co/jondurbin/blind-test-13b-janus
- https://huggingface.co/jondurbin/blind-test-13b-francis
One of these models is the one you seek - a full fine tune of llama-2-13b on the 1.4.1 dataset. The rest are not.
While it is possible for you to identify some of them, as far as qlora vs FT and whether they are llama-2 or llama-1, I'd ask that you just test them and not try to figure it out beforehand.
Run at least a few dozen prompts through each, then rate each one on a scale of 0-10, and let me know. I have a secret gist, with a git commit timestamp, that includes the mapping of what each of these models actually is.
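If it helps with bookkeeping, here is a minimal sketch of one way to turn per-prompt ratings into a ranking. The model names are the suffixes from the links above; the function name `rank_models` and every score in the example are made-up placeholders, not real results, and averaging is just one assumed way to combine the ratings.

```python
from statistics import mean

def rank_models(scores):
    """Return (model, mean rating) pairs sorted from best to worst.

    `scores` maps a model name to the list of 0-10 ratings given to its
    outputs, one rating per prompt.
    """
    for model, ratings in scores.items():
        if not ratings:
            raise ValueError(f"no ratings recorded for {model}")
        if any(not 0 <= r <= 10 for r in ratings):
            raise ValueError(f"ratings for {model} must be between 0 and 10")
    # Average the per-prompt ratings, then sort with the highest mean first.
    averages = {model: mean(ratings) for model, ratings in scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    # Placeholder ratings for illustration only; replace with your own.
    example = {
        "zane": [7, 8, 6],
        "vlad": [4, 5, 5],
        "martha": [9, 8, 9],
    }
    for model, avg in rank_models(example):
        print(f"blind-test-13b-{model}: {avg:.2f}")
```

Keeping the raw per-prompt ratings rather than a single gut-feel number also makes it easy to compare results from several testers later.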
I shared the test with my community so they can also see which one they like best.
The models can be loaded on the free KoboldAI GPU colab using United, so that is probably the setup they will use.