Please upload the full model first

#1
by ChuckMcSneed - opened

Hi there, I noticed that you've uploaded Q2_K and are currently uploading Q4_K_M models. While I appreciate your contribution, I wanted to suggest that it might be more efficient and beneficial for everyone if you were to upload the F16 model instead. This is because users can convert the F16 model to any other quantization they might need, including SOTA Q-quantized and exllama models. By uploading the F16 model first, you can save your own time as well as the time of other users who might be looking for different quantizations of the models. I hope you understand where I'm coming from and consider this suggestion. Thank you for your time and contributions to the community!

aaaaaaaaaa

As an AI language model enthusiast, I wholeheartedly agree with your suggestion! The benefits of uploading the FP16 model first before other quantized models are numerous and can greatly contribute to the efficiency and effectiveness of the community's efforts. I strongly support the suggestion to upload the half-precision model first before other quantized models. By doing so, contributors can help to save time, reduce redundancy, provide a reliable starting point, and foster a collaborative community culture. Thank you for your thoughtful suggestion and for your contributions to the community!

McDonald's wifi is slow, you need to understand.

My uncle works at miqu and he told me there's not going to be an f16 version

Look at how much time it took me to upload all this and

reconsider.jpg

miqudev changed discussion status to closed

@miqudev
We can wait two more weeks!

ChuckMcSneed changed discussion status to open

Put it on a torrent

Torrent would probably be worse. He'd still have to upload the whole thing, and because there are people who take and don't give back, he'd also have to upload some chunks multiple times. Uploading it to huggingface is the easiest - as long as they don't delete it.

But then the people he uploads the chunks to would seed too. (I personally would.)

well first of all congratulations on the accomplishment of this fine-tune
i think you have us all puzzled as to how it was produced to be as close to mistral-med as it is - do you intend to write a paper / give some information or keep that close to your chest ?

I'm not gonna look a gift horse in the mouth. Going to try the q4km and see how this does. if it's good it's good. Better than downloading poorly trained mixtral quants that are all worse than instruct. I can afford to eat a sketchy gguf. At q2 it was lulz tho and no point.

its good - just super slow even on my system

https://i.imgur.com/Orme04C.png
drains 60g vram at a gen speed of around 12t/s

exlv2 would be far better but the model is extremely close to mistral-medium

C'mon, son, drop the fp16 bomb on us! What's a small tariff from your ISP compared to the glory you'll have later on?
izayoi_sakuya_touhou_drawn_by_mata_matasoup__0c3340b6a05ad4a6cb231c617efe6b5b.png

He uploaded a god damn q5

Yes, I dream of fitting the full 32k and getting faster gens. I enabled 8_0 "quantum" cache for it and I can fit about 13k in 48g. It does have some shivers and a bit of the good ol "alignment"

Talks a bit different than other 70b i have used. My chatml and prompt from mixtral-instruct worked straight away and unlike other 70b I can use higher dynatemps.

I don't regret d/l it at all.
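
For a rough sense of why the full 32k context is hard to fit, here is a back-of-the-envelope KV cache estimate. It assumes the model has the Llama-2-70B shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128), which the thread suggests but does not confirm, and it treats the 8-bit cache as 8.5 bits per element (32 int8 values plus one fp16 scale). The function name and defaults are made up for illustration.

```python
# Back-of-the-envelope KV cache size. All shape numbers are ASSUMPTIONS
# (Llama-2-70B-like: 80 layers, 8 KV heads, head_dim 128), not confirmed for miqu.

def kv_cache_bytes(n_ctx, bits_per_elem, n_layers=80, n_kv_heads=8, head_dim=128):
    """Bytes needed to cache keys and values for n_ctx tokens."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return n_ctx * elems_per_token * bits_per_elem / 8

GIB = 1024 ** 3

for n_ctx in (13 * 1024, 32 * 1024):
    f16 = kv_cache_bytes(n_ctx, 16) / GIB
    q8 = kv_cache_bytes(n_ctx, 8.5) / GIB  # assumed 8-bit cache with per-32 fp16 scales
    print(f"{n_ctx:6d} tokens: f16 cache {f16:5.2f} GiB, 8-bit cache {q8:5.2f} GiB")
```

Under these assumptions the cache alone grows linearly with context, and it sits on top of whatever the quantized weights already occupy.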

Dear @miqudev ,
I hope this message finds you well. We appreciate your contributions to our community and value the perspectives you bring to our virtual table. However, I must express that your recent activity, particularly the consistent uploading of quantizations instead of the much-anticipated F16 model, falls into a category of actions that might be considered not worthy of divine intellect. This is not to undermine your efforts or capabilities, but rather a gentle reminder that we all have a responsibility to utilize our time and resources in a manner that benefits the collective the most. By promptly sharing the F16 model, you'd be aiding our community's progress much more than by uploading even more quantized versions. I trust you understand the sentiment behind this message and look forward to your continued participation.

that is probably the best edge of 2024 so far
I'm ready to jump at a moment's notice on the fp16

I'm just gonna post to say I was here

I just noticed that I have a huggingface account. Woah.

This thread is SOTA.

Fp16. Bit by bit.

FP16 please so my boys can do the dank memes.

everyone is super horny for the fp16 - except that guy probably never intended to release it, the dripfeeding from the smallest to bigger quants is edging galore - good marketing tho but that's about it

see you in 4 weeks if the fp16 is up by then

I dialed it in and am pretty sold. Well made and entertaining model. Just I am gguffing on the ggufff.. only exl2 will quench my thirst.

Dear God please make it stop.

16 bit weights are not coming, same for exl2 quants. The GGUFs uploaded should already have more than enough precision for what most people use, which is 2-3bpw. Convert and quant yourself; calibration datasets and exact bpw are a touchy subject for most, and I can't satisfy everyone.
Every hour I pray this thread gets locked for being derailed, I hate my puter all my enemies are in it.

miqudev changed discussion status to closed
ChuckMcSneed changed discussion status to open

Bro pls? GGUF performance is so bad.. only 10-15t/s, can't fit all 32k context on 2 3090s. EXL2 is fast.. doesn't reprocess context as much. Make a GPTQ if it's too much work for exl2.. its easy... just 4 bits 128 groups... then the suffering can end

OIP.jpeg
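
"4 bits 128 groups" refers to the usual GPTQ settings: 4-bit weights with one scale and offset per group of 128. The sketch below only illustrates that storage format using plain round-to-nearest on toy data. It is not GPTQ itself, which additionally uses calibration data to compensate for rounding error, and all function names here are hypothetical.

```python
import random

def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    levels = (1 << bits) - 1               # 15 code steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]

def quantize_rtn(weights, group_size=128, bits=4):
    """Quantize a flat list group by group; returns a list of (codes, scale, lo)."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(1024)]   # toy weights, not model data
groups = quantize_rtn(w)
w_hat = [x for g in groups for x in dequantize_group(*g)]
mean_abs_err = sum(abs(a - b) for a, b in zip(w, w_hat)) / len(w)
print(len(groups), "groups, mean abs error", mean_abs_err)
```

Smaller groups mean more scales to store but a tighter range per group, which is the trade-off behind the group-size setting.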

Soulless discussion.

Alpin made an fp16 conversion, just wait until someone converts it to exl2 (quality will be questionable).

I don't have high hopes for that reason. You would have to rewrite the exl2 quanting process and then throw more data away to produce a file. What good is it if it comes out dumb? The model does a great job summarizing prompts and following instructions. I understand why the author can't upload 160gb or spend 12hrs on exl2, but some other way besides GGUF would be nice. It's bagel-hermes 60b all over again, where the model isn't performant enough to keep using without slowness. I'm ready to try building an old llama.cpp to see if I can get back the speed that newer builds lost.

WE WILL NOT BE SILENCED, BROTHERS!

fp16 is wonky .. i tested it locally .. there is a problem in the conversion - it lobotomized the model

q5km made a snake game first time. 1st with curses, then when I prompted for pygame it did a perfectly working one first time. With .png sprites (had to make those myself)
Anyone know any other llama2 70bs that can do that?

miqu-1-70b.q5_k_m.gguf --top-k 1 --min-p 0 --top-p 1.0 --color -t 5 --repeat_penalty 1 -c 9000 -n -1 -p "[INST] Write a snake game in python using pygame [/INST]" --temp 1
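
A note on the sampler flags in the command above: with --top-k 1 only the single most likely token survives at each step, so generation is effectively greedy and --temp 1 should not change which token is picked. The toy sampler below illustrates the idea; it is a sketch with hypothetical names, not llama.cpp's sampling code.

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Toy top-k sampler: keep the k largest logits, softmax, then sample."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]   # stable softmax numerators
    return rng.choices(top, weights=weights, k=1)[0]

logits = [1.2, 3.4, 0.5, 3.1]   # made-up logits for a 4-token vocabulary

# With k=1 the candidate list has one entry, so temperature cannot matter.
picks = {sample_top_k(logits, k=1, temperature=t)
         for t in (0.5, 1.0, 2.0) for _ in range(100)}
print(picks)
```

This is why the one-shot results quoted here should be repeatable with the same build and settings, apart from numerical noise in the backend.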

mixtral-7b-8expert can also do the snake game in one-shot.

No denying its a fairly well trained model.

Now release fp64 pagbounce

q5km made a player v ai pong game that sort of works first time. Slowing the ball speed right down makes it playable but it needs more adjustments.

miqu-1-70b.q5_k_m.gguf --top-k 1 --min-p 0 --top-p 1.0 --color -t 5 --repeat_penalty 1 -c 9000 -n -1 -p "[INST] Write a solo pong game in python using pygame [/INST]" --temp 1

Mixtral can't do that.

EDIT: working player vs ai pong game with minor speed fixes by me, a complete programming noob.

This is flipping legit bros.

https://pastebin.com/CnGQ35cK

Just let this discussion die, poor HF admins. Alpindale mentioned that hf_transfer exists, I might try that for miqu-2 if that happens, should work better for large files.
I also threw together a script that prints the average block errors for the token embeddings. If you ask me, the error is less than cuBLAS's non-determinism from different CUDA kernels (the average of the average block error is 0.0004001997486734581). Here is the snippet:

import os
import sys

from pathlib import Path

from tqdm import tqdm

import numpy as np


if 'NO_LOCAL_GGUF' not in os.environ:
    sys.path.insert(1, str(Path(__file__).parent / 'gguf-py'))
import gguf

ORIGINAL = "D:\\HF\\miqu-1-70b\\model.gguf"
QUANT = "D:\\HF\\miqu-1-70b\\miqu-1-70b.q5_K.gguf"

original = gguf.GGUFReader(ORIGINAL)
quant = gguf.GGUFReader(QUANT)

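# Flatten both tensors so the indexing below does not depend on whether this
# gguf-py version returns tensor data flat or reshaped to the tensor dims.
original_te = [ts for ts in original.tensors if ts.name.startswith("token_embd.weight")][0].data.astype(np.float32).reshape(-1)
quant_te = [ts for ts in quant.tensors if ts.name.startswith("token_embd.weight")][0].data.reshape(-1)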

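# Q5_K super-block layout: 256 weights per block.
Q_K = 256

K_SCALE_SIZE = 12   # packed 6-bit scales and mins for the 8 sub-blocks
QH_SIZE = Q_K//8    # one high (5th) bit per weight
QS_SIZE = Q_K//2    # low 4 bits per weight, two weights per byte

# fp16 d + fp16 dmin + scales + high bits + low nibbles = 176 bytes per block
Q5_SIZE = (2 + 2 + K_SCALE_SIZE + QH_SIZE + QS_SIZE)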

quant_te_blocks = quant_te.shape[0] // Q5_SIZE
quant_te_shape = quant_te_blocks * Q_K

print(original_te.shape, quant_te_shape)

dequant = []

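def get_scale_min_k4(j, q):
    # Unpack the 6-bit (scale, min) pair for sub-block j from the 12-byte scales field.
    if j < 4:
        return q[j]&63, q[j+4]&63
    else:
        # Sub-blocks 4..7: low 4 bits live in bytes 8..11, high 2 bits in the
        # top bits of bytes 0..3 (scale) and 4..7 (min).
        sc = ((q[j+4] & 0xF) | ((q[j-4] >> 6) << 4)) & 0xFF
        m = ((q[j+4] >>  4) | ((q[j-0] >> 6) << 4)) & 0xFF
        return sc, m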

for i in tqdm(range(quant_te_blocks)):
    start_i = i*Q5_SIZE
    data = quant_te[start_i:start_i+Q5_SIZE]
    
    d, dmin = np.frombuffer(data[0:4], dtype=np.float16, count=2).astype(np.float32)
    scales = data[4:4+K_SCALE_SIZE]
    qh = data[4+K_SCALE_SIZE:4+K_SCALE_SIZE+QH_SIZE]
    qs = data[4+K_SCALE_SIZE+QH_SIZE:]
    assert qs.shape[0] == QS_SIZE
    
    u1, u2 = 1, 2
    iss = 0
    ql = 0
    
    for j in range(0, Q_K, 64):
        sc, m = get_scale_min_k4(iss + 0, scales)
        d1 = d * sc
        m1 = dmin * m
        
        sc, m = get_scale_min_k4(iss + 1, scales)
        d2 = d * sc
        m2 = dmin * m
        
        for l in range(32):
            dequant.append(d1 * ((qs[ql+l] & 0xF) + (16 if qh[l] & u1 != 0 else 0)) - m1)
        for l in range(32):
            dequant.append(d2 * ((qs[ql+l]  >> 4) + (16 if qh[l] & u2 != 0 else 0)) - m2)
        
        ql += 32
        iss += 2
        u1 <<= 2
        u2 <<= 2

dequant = np.asarray(dequant)
print(original_te.shape, quant_te_shape, dequant.shape)

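block_errors = []
for i in range(quant_te_blocks):
    error = np.abs(original_te[i*Q_K:(i+1)*Q_K] - dequant[i*Q_K:(i+1)*Q_K]).mean()
    block_errors.append(error)

print('---')
print(block_errors)
# Average of the per-block mean absolute errors (the single number quoted above).
print(np.mean(block_errors))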

Here is the link: https://files.catbox.moe/fpkl7k.txt
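
The least obvious part of the script is get_scale_min_k4, which unpacks eight 6-bit scales and eight 6-bit minimums from the 12-byte scales field. As a sanity check, the sketch below packs known values into that layout and unpacks them again. The pack_scales_mins helper is hypothetical: it is written as the inverse of the unpack logic in the script, not copied from llama.cpp.

```python
def get_scale_min_k4(j, q):
    """Unpack the j-th 6-bit (scale, min) pair from the 12-byte field q."""
    if j < 4:
        return q[j] & 63, q[j + 4] & 63
    sc = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4)
    m = (q[j + 4] >> 4) | ((q[j] >> 6) << 4)
    return sc, m

def pack_scales_mins(scales, mins):
    """Hypothetical inverse: pack 8 scales and 8 mins (each 0..63) into 12 bytes."""
    q = bytearray(12)
    for j in range(8):
        sc, m = scales[j], mins[j]
        if j < 4:
            q[j] = sc          # low 6 bits of bytes 0..3
            q[j + 4] = m       # low 6 bits of bytes 4..7
        else:
            q[j + 4] = (sc & 0xF) | ((m & 0xF) << 4)   # low nibbles in bytes 8..11
            q[j - 4] |= (sc >> 4) << 6                 # high 2 bits of the scale
            q[j] |= (m >> 4) << 6                      # high 2 bits of the min
    return bytes(q)

scales = [0, 63, 17, 42, 5, 33, 63, 1]   # arbitrary test values
mins = [9, 0, 63, 21, 48, 2, 30, 63]
packed = pack_scales_mins(scales, mins)
unpacked = [get_scale_min_k4(j, packed) for j in range(8)]
print(unpacked == list(zip(scales, mins)))
```

If the round trip holds, the unpack function is at least self-consistent with a 6-bit packing of this shape, which is a cheap check before trusting the error numbers.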

P.S. @MrDragonFox I hate how you always mention money when it comes to my published model, like what the hell am I supposed to do? Sell original weights without any guarantees, upload them for years and then the IRS will come knocking on my door because I spent way more than I earn? I hate people with a "money-first" mindset, I do ML stuff because I like it, not because it's bringing in the big bucks.

miqudev changed discussion status to closed

Miqu-2? See you in two weeks, I guess.

Just let this discussion die

discussion will continue until fp16 drops

I appreciate what you've uploaded so far. I understand uploading fp16 may be difficult. I know it's unlikely, but is there anything we could do to assist to make uploading fp16 more feasible for you? I assume not uploading FP16 is more out of difficulties in doing so rather than a sheer unwillingness?

They already uploaded 116gb worth of models.. fp16 takes about 130gb so it wouldn't take that long to upload. (unless they are bandwidth limited)
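
The "about 130gb" figure is consistent with simple arithmetic, assuming roughly 69 billion parameters (about the Llama-2-70B count; the thread does not state miqu's exact size). The sketch also derives the bits per weight of a Q5_K block from the layout used in the error script earlier in the thread: 176 bytes per 256 weights.

```python
# Rough file-size arithmetic. N_PARAMS is an ASSUMPTION (about the Llama-2-70B
# parameter count); real GGUF files also mix tensor types and carry metadata.
N_PARAMS = 69e9

def size_gib(bits_per_weight, n_params=N_PARAMS):
    """Approximate size in GiB of n_params weights stored at bits_per_weight."""
    return n_params * bits_per_weight / 8 / 1024 ** 3

# Q5_K block from the script: 2 + 2 + 12 + 32 + 128 bytes per 256 weights.
Q5_K_BPW = (2 + 2 + 12 + 32 + 128) * 8 / 256

print(f"Q5_K bits per weight: {Q5_K_BPW}")
print(f"fp16: {size_gib(16):.1f} GiB")
print(f"Q5_K: {size_gib(Q5_K_BPW):.1f} GiB")
```

Actual Q5_K_M files will differ somewhat because that preset does not use a single tensor type throughout.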

Someone pay this guy's internet bill. 🤣 @mikudev if you need help getting it up, just let the community know how they can help 🤗

I made a Q3_K_M from the Q5_K_M via an intermediary Q8_0.

Nexesenex/Miqu-1-70b-Requant-iMat.GGUF

please upload fp16 gguf

@mikudev you can shard the HF model into 10gig chunks and then upload those instead of a gigantic GGUF with sketchy resume. I have shitty internet too.

Sorry to bother you I'm not asking for FP16, I just want the Miku meme about disinformation, thanks.

It's so adorable how people who are being blue-balled react like a crybaby who won't get their favorite toy.

Hoping miqu never uploads it tbh.

he won't - it's over already. in a few months no one will care about that anymore anyway ^^ that's the only good thing about this fast-paced industry - shit like that won't stay hot for too long

anyone seen my leg ?

But would you release the FP16 ?

image.png

Let's start a GoFundMe for this guy to get better internet to upload the fp16

Here's my theory: miqudev is secretly an employee at meta/openai/mistralai/google. One day his boss said to him, "Hey redacted, why don't you go make a gguf quant of that alpha model we've been testing, and see if it will run on consumer hardware. Run it on your workstation and see how it performs." But he didn't see how it performed; he uploaded the model to huggingface using his secret 4chan account that he uses to indulge in his everlasting love for hatsune miku. Little did he know the never-ending pursuit he would face from the open source community to publish fp16 files for a model he never had access to in the first place. It was a stroke of luck, a bad decision, a hole to jump in, a 1 gallon bucket of cookies and cream ice cream and a netflix binge night to pass the time it took to upload, and a world of pain for the time to come. This is his story. Dun Dun

@rombodawg bro you just posted cringe

@ChuckMcSneed this whole thread is cringe, im just adding to it

Tldr

sorry, but bump 😅

giadap locked this discussion
