THE THREAD OF DOOM
Just realised I deleted the old "thread of doom" as it was attached to the earliest alpha version of the control vectors :(
Okay, I was wondering if we crossed some sort of line.
Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...
@ChuckMcSneed @BigHuggyD @gghfez
Ping.
Yeah, it's a pity it got deleted (I should have checked more carefully what was linked), but it was getting a bit out of hand with all that scrolling so perhaps not such a bad thing.
I'm just gonna keep up the models that people have downloaded the most and get rid of all the "experimental, but likely broken" stuff with 15 downloads as they really weren't serving much of a purpose.
Also, all the old versions of the control vectors were vastly inferior to the final version due to me figuring out how to get them working as I went along, so it's probably better to just keep up the final v3.0 ones to avoid a lot of the confusion.
It looks a lot more like I'm just uploading quality models that people like/use now at least... The creative-writer-v0.1-35b and creative-writer-v0.2-35b models will be going as soon as I get the v1.0 version uploaded, and possibly Dusk-Miqu-70B if they do set a hard limit (I still think Dark-Miqu-70B is worth keeping whatever happens though).
Also, if anybody really misses any I have uploaded, then I can in theory recreate them and upload a LoRA created from the delta using extract_lora.py, but I strongly suspect nobody will even notice most of the models have gone... Of all that I have created, I've only ever used Dark-Miqu-70B myself!
:( Damn there was some good info in that thread.
If you've still got Firefox tabs open somewhere, you'll be able to save some of the thread.
Unfortunately, I cleaned my browser tabs up about an hour ago.
And yeah, if people were using it as free cloud storage then it makes sense. I just think they could have gone about it better, rather than having us wake up and see the limit.
I'm curious, did your quota drop after deleting that? I wonder if all the PNG files attached there were "billed" to you.
@jukofyork I think you're good man. If they start enforcing it, you'll get an exemption for sure.
I come across your contributions randomly all over the place, even on github repos like some fine tuning tool lol
I should probably deduplicate my quants. Often I was making one because I could not find what I was looking for, then it would turn out a few of us just happened to be making them at the same time. Then I started getting requests, so I just decided I would make a bunch. Need a Huggingverse quant global dedupe...
P.S. Would a dataset in the format your soon-to-be-published scripts produce be useful for pre-training?
Yeah, in theory it should be able to take the whole of the books3 dataset and sort it all out too (it's a complete mess - also full of scrambled TOCs, author's notes and so on).
I've written the API stuff now (plus untested code to deal with the OpenAI batch API), and some of the templating stuff too.
Eventually I hope to have the Bash equivalent of OmniChain and can then easily setup workflows where failures are handed off to smarter LLMs, iterated over, etc.
It should be really useful for tasks like this, reasonably fast and low overhead (plus with careful thought you can often use GNU parallel to speed things up).
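Something like this is the general shape I have in mind (just a sketch - `process_chunk.sh` here is a hypothetical stand-in for whatever script actually wraps the API call, not something from the repo):

```bash
#!/bin/bash
# Fan a directory of text chunks out over 4 parallel LLM calls, keeping
# any failures aside so a second pass can hand them to a smarter model.
# NOTE: process_chunk.sh is a hypothetical stand-in for the real wrapper.
mkdir -p out failed

ls chunks/*.txt | parallel --jobs 4 '
    if ./process_chunk.sh {} > "out/{/.}.json"; then
        echo "OK: {}"
    else
        cp {} failed/    # retry these with a smarter LLM later
    fi
'
```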
I opened it up:
https://github.com/jukofyork/bash-llm
but it is still very much a work-in-progress, and some of the scrappy stuff like find_line_number.sh that I copied from my original monolithic script will likely get changed/removed, and lots of other stuff will likely get moved around and renamed too...
The 3 main scripts will be these:
https://github.com/jukofyork/bash-llm/blob/main/api_call.sh
https://github.com/jukofyork/bash-llm/blob/main/template_substitute.sh
https://github.com/jukofyork/bash-llm/blob/main/json_extract.sh
(I've not tested it, but api_call.sh should work with OpenAI-compatible APIs; all the other api_XXX.sh stuff is to do with their 50%-cheaper batch API).
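If you've never hit one directly, an "OpenAI-compatible API" just means a plain POST to /v1/chat/completions - roughly this in raw curl (I'm paraphrasing what api_call.sh wraps; the endpoint and model are just placeholders):

```bash
#!/bin/bash
# Raw curl sketch of an OpenAI-compatible chat completion call.
# ENDPOINT and MODEL are placeholders - point them at any compatible server.
ENDPOINT="http://localhost:8080/v1/chat/completions"
MODEL="gpt-4o-mini"

curl -s "$ENDPOINT" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
        "model": "'"$MODEL"'",
        "messages": [
          {"role": "user", "content": "Fix the formatting of this paragraph: ..."}
        ],
        "temperature": 0
      }' | jq -r '.choices[0].message.content'
```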
I also added the templates I've been using with my original monolithic script:
https://github.com/jukofyork/bash-llm/tree/main/templates
You can get even gpt-4o-mini to work really well if you follow that style of asking for analysis before the decision.
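The general shape is something like this (a simplified paraphrase rather than one of the actual templates, and the ${TEXT} placeholder syntax is just what template_substitute.sh implies):

```bash
#!/bin/bash
# Simplified paraphrase of the "analysis before decision" prompt style -
# NOT one of the actual templates from the repo. The ${TEXT} placeholder
# is assumed to be what template_substitute.sh expands later.
# The quoted heredoc delimiter stops the shell expanding ${TEXT} itself.
cat <<'EOF' > templates/paragraph_check.txt
Below is a paragraph extracted from a book:

${TEXT}

First, analyse the paragraph: point out any broken hyphenation, OCR
artefacts, scrambled formatting or other damage, and briefly explain
your reasoning.

Then, on the very last line, output exactly one word: KEEP if the
paragraph is clean, or FIX if it needs correcting.
EOF
```

Forcing the analysis to come first means the final KEEP/FIX token is conditioned on the model's own reasoning, which seems to be why even the small models stop guessing.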
I eventually would like to set up LLMs looking at every line/paragraph of a book, and sort out any bad formatting or hyphenation, etc.
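A lot of the hyphenation cases are purely mechanical and don't really need an LLM at all - a conservative pre-pass like this could catch most of them first, leaving only the ambiguous ones for the model to judge (just a sketch, assuming plain-text input):

```bash
# Rejoin words split by a hyphen at a line break ("exam-\nple" -> "example").
# Deliberately conservative (lowercase letters on both sides only), so
# genuine hyphenated compounds mostly get left for the LLM to judge.
perl -0777 -pe 's/([a-z])-\n([a-z])/$1$2/g' book.txt > book_cleaned.txt
```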
How do I unsubscribe to this discussion thread?
https://huggingface.co/notifications
Click checkbox for the thread and then the "Done" button at the top.
Yes I'm a big fan of bash scripts myself :)
Thanks for opening it up, bookmarked to check out in a week when I start tackling the pretraining project.
I'll hopefully have it tidied up a bit more by then.
If you use C++ then I can tell you exactly what not to use for this: Boost.JSON and libcurl's multi-socket API.
The Boost.JSON code is just a wrapper around Boost.PropertyTree and scales horribly (likely due to the ptree stuff doing crazy amounts of dynamic allocation for everything).
The libcurl library is really nice to deal with (and easy to make a C++ wrapper for), but the multi-socket API is a nightmare and seems to have near-impossible-to-find thread synchronization bugs that only show up at scale... We've got some code using both of these and it's led to nothing but pain for the last few years and really needs completely rewriting :/
Not really related to the LLM API stuff, but Boost.Serialization has led to no end of problems too - at least for binary serialisation, keep well clear!
@jukofyork do you still want the distributionally correct scramble code? The Redditor reached out and said he decided it was a placebo effect, but I wanted to check if you might still have a use for it?
Yeah, I don't want to use it for the intended purpose - I just want more details on what he used, to see if I can do something similar to corrupt real stories.