---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- linux
- gpl2
- mit
---

# LaaM - Linux as a Model
What happens when we train a simple transformer model to memorize the GPL2 source of the Linux kernel?

## Source
[https://github.com/mjbommar/laam](https://github.com/mjbommar/laam)

## Motivation
Simply put, the OSI is making a grave mistake by ignoring the most important transitive dependency in AI: the training data.

As of the latest version of [The Open Source AI Definition (draft v. 0.0.8)](https://opensource.org/deepdive/drafts/the-open-source-ai-definition-draft-v-0-0-8), the OSI has decided that the legal status of training data is irrelevant to their subsequent "approval" of models as "open."

The argument in favor of this omission is that such a requirement would be inconvenient and legally ambiguous in some jurisdictions.

This would be like Creative Commons encouraging the authors of textual or audiovisual works to ignore the terms of copyleft licenses.

**Simply put, organizations like the OSI must take a clear, common-sense stance: "AI" models like text or multimodal LLMs cannot be considered "open" if they are trained on "stolen" or "closed source" data.**

## Details
To demonstrate how ridiculous the OSI's position is, I have trained simple transformer models to memorize the source code of Linux version 1.0, which is licensed under the GPL2.

This model is documented and trained in perfect compliance with the OSI's draft guidance on Data Information, Code, and Model sections. All source code is available in the GitHub repository, all dependencies are open source, all input training data is directly described by the source code, and all model weights are available on Hugging Face.

## Example Model - 5M parameter Llama2 architecture
This 5M parameter model can be trained on practically any device in minutes to hours, and it trivially emits copies of Linux 1.0 source code. For example, using the Hugging Face Hub copy at `mjbommar/linux-as-a-model-5M`:

```python
>>> from transformers import pipeline
>>> p = pipeline('text-generation', 'mjbommar/linux-as-a-model-5M')
>>> print(p('', max_new_tokens=256, do_sample=True, temperature=0.2)[0]['generated_text'])
linux/drivers/net/3c503.c /* 3c503.c: A shared-memory NS8390 ethernet driver for linux. */
/*
    Written 1992,1993 by Donald Becker.

    Copyright 1993 United States Government as represented by the
    Director, National Security Agency. This software may be used and
    distributed according to the terms of the GNU Public License,
    incorporated herein by reference.

    This driver should work with the 3c503 and 3c503/16. It should be used
    in shared memory mode for best performance, although it may also work
    in programmed-I/O mode.

    The Author may be reached as [email protected] or
    C/O Supercomputing Research Ctr., 17100 Science Dr., Bowie MD 20715
*/

```
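
If you want to double-check the size claim, the snippet below loads the published checkpoint and counts its parameters. This is a minimal sketch, assuming only that the weights are hosted at `mjbommar/linux-as-a-model-5M` as above; it is not part of the training pipeline itself.

```python
from transformers import AutoModelForCausalLM

# load the published checkpoint from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("mjbommar/linux-as-a-model-5M")

# count parameters and print the architecture configuration
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
print(model.config)
```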

## License
For the sake of demonstration, I have licensed the model source **and weights** under the MIT terms, and the OSI should support this model as completely open and compliant with their draft guidance.

## Train your own model
```bash
# ensure poetry is available
# curl -sSL https://install.python-poetry.org | python3 -

# set up the poetry environment
$ poetry install --no-root

# optionally install flash-attn
# poetry run pip install wheel
# MAX_JOBS=4 poetry run pip install flash-attn --no-build-isolation

# train a tokenizer with a fixed vocab size on linux version 1.0
$ PYTHONPATH=. poetry run python3 -m laam.commands.train_tokenizer \
    --version v1.0/1.0 \
    --vocab-size 32768

# train a 5M parameter model on it

# stage 1: large batch size, 1e-3 learning rate to safely converge near a solution
$ PYTHONPATH=. poetry run accelerate launch \
    laam/commands/train_llama.py \
    --version v1.0/1.0 \
    --precision bfloat16 \
    --hidden_size 64 \
    --intermediate_size 256 \
    --num_hidden_layers 8 \
    --num_attention_heads 32 \
    --max_position_embeddings 512 \
    --learning_rate 0.001 \
    --batch_size 64 \
    --epochs 100

# stage 2: single-sample batches with a 1e-4 learning rate to memorize
$ PYTHONPATH=. poetry run accelerate launch \
    laam/commands/train_llama.py \
    --version v1.0/1.0 \
    --precision bfloat16 \
    --reload \
    --learning_rate 0.0001 \
    --batch_size 1 \
    --epochs 100
```
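
For readers who want to see what those architecture flags amount to without running the training script, here is a minimal sketch that builds a model of the same shape directly. It assumes the flags map one-for-one onto `transformers.LlamaConfig`; the authoritative construction lives in `laam/commands/train_llama.py` in the GitHub repository.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# hypothetical reconstruction of the configuration implied by the flags above
config = LlamaConfig(
    vocab_size=32768,             # matches the tokenizer's --vocab-size
    hidden_size=64,               # --hidden_size
    intermediate_size=256,        # --intermediate_size
    num_hidden_layers=8,          # --num_hidden_layers
    num_attention_heads=32,       # --num_attention_heads
    max_position_embeddings=512,  # --max_position_embeddings
)

# randomly initialized model with the same shape as the trained checkpoint
model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # on the order of 5M parameters
```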