Is there a 4-bit version?
I tried doing it myself but ran into problems when using this: https://github.com/0cc4m/GPTQ-for-LLaMa (it adds support for MPT models)
I was looking into this as well. I tried to use the main GPTQ-for-LLaMa to quantize it (this model just sounds a million times more promising than the original), but I'm getting errors because it is not a LLaMA model. I saw that about a week ago Occam released a quantized version, so it is doable (https://huggingface.co/OccamRazor/mpt-7b-storywriter-4bit-128g). I just don't know how. I also looked through Occam's GitHub with his version of KoboldAI and originally just didn't see his GPTQ implementation.
Anyway, now that I see mpasila's link I'm going to try that route. I have data right now too, so if it works I would be happy to upload a working model. Maybe TheBloke will beat me to it, hah.
Edit: I tried every which way to make the GPTQ fork linked above work. Does anyone have the sauce? I even tried the gptneox version, which at least failed in a different way (CUDA out of memory). When I run the llama version it fails every time, complaining that the tokenizer is not compatible with the NeoX-style tokenizer.
I also tried installing it both ways: the old way with the conda env, and the new way of making a fresh conda env and then running the pip install git command listed on the repo. I couldn't get the pip install way to work at all.
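For anyone hitting the same tokenizer mismatch: a quick sanity check is to read `model_type` from the checkpoint's `config.json` before choosing a quantization code path. This is only a rough sketch. The `SCRIPT_FOR_MODEL_TYPE` labels and the `pick_script` helper are made up for illustration and are not file names or functions from either GPTQ repo; the only thing taken as given is that Hugging Face configs carry a `model_type` field (`llama`, `gpt_neox`, `mpt`).

```python
import json
import tempfile
from pathlib import Path

# Hypothetical labels for illustration only; these are NOT real script names
# from GPTQ-for-LLaMa. Swap in whatever entry points your fork provides.
SCRIPT_FOR_MODEL_TYPE = {
    "llama": "llama",
    "gpt_neox": "gptneox",
    "mpt": "mpt",
}


def read_model_type(model_dir):
    """Return the model_type field from a checkpoint's config.json."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text())
    return config.get("model_type", "unknown")


def pick_script(model_dir):
    """Choose a quantization code path, failing loudly on unknown architectures."""
    model_type = read_model_type(model_dir)
    if model_type not in SCRIPT_FOR_MODEL_TYPE:
        raise ValueError(f"No quantization path known for model_type={model_type!r}")
    return SCRIPT_FOR_MODEL_TYPE[model_type]


if __name__ == "__main__":
    # Demo with a throwaway config so nothing has to be downloaded.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "config.json").write_text(json.dumps({"model_type": "mpt"}))
        print(pick_script(tmp))
```

If this reports `mpt`, the llama code path is expected to fail, since it assumes LLaMA's tokenizer and layer layout.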
I will have a look tomorrow if I have the time
So if I had to guess, we need that layer mapping...
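To make the guess concrete, here is a rough sketch of what such a layer mapping might look like: a table saying where each architecture keeps its transformer blocks and which linear layers inside a block get quantized. The module names are my assumption from the Hugging Face implementations of each model, and `LAYER_MAP` / `quant_targets` are hypothetical names, not anything from the GPTQ repos. Check the names against `model.named_modules()` on the actual checkpoint before relying on them.

```python
# Assumed module names per architecture; verify against model.named_modules().
LAYER_MAP = {
    "llama": {
        "blocks": "model.layers",
        "linears": [
            "self_attn.q_proj", "self_attn.k_proj",
            "self_attn.v_proj", "self_attn.o_proj",
            "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj",
        ],
    },
    "gpt_neox": {
        "blocks": "gpt_neox.layers",
        "linears": [
            "attention.query_key_value", "attention.dense",
            "mlp.dense_h_to_4h", "mlp.dense_4h_to_h",
        ],
    },
    "mpt": {
        "blocks": "transformer.blocks",
        "linears": [
            "attn.Wqkv", "attn.out_proj",
            "ffn.up_proj", "ffn.down_proj",
        ],
    },
}


def quant_targets(model_type, n_layers):
    """List the fully qualified names of the linear layers to quantize."""
    entry = LAYER_MAP[model_type]
    return [
        f"{entry['blocks']}.{i}.{name}"
        for i in range(n_layers)
        for name in entry["linears"]
    ]


if __name__ == "__main__":
    # Show the targets for the first two blocks of an MPT-style model.
    for name in quant_targets("mpt", 2):
        print(name)
```

A quantizer written for LLaMA walks `model.layers` and looks for `q_proj`-style names, so on an MPT checkpoint it finds nothing to quantize unless it is given a mapping along these lines.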
Looking forward to it! @TheBloke Thanks! :D