dleemiller committed on
Commit ebe0c07 · verified · 1 Parent(s): 7651aec

Update README.md

  1. README.md +37 -0
README.md CHANGED
@@ -1,3 +1,5 @@
 
 
1
  ## Installation
2
 
3
  Use the github repo or install via pip: https://github.com/dleemiller/WordLlama
@@ -5,6 +7,36 @@ Use the github repo or install via pip: https://github.com/dleemiller/WordLlama
5
  pip install wordllama
6
  ```
7

8
 
9
  ## MTEB Results (l2_supercat)
10
 
@@ -20,4 +52,9 @@ pip install wordllama
20
 
21
  ---
22
  license: mit
23
  ---
 
1
+ # wordllama
2
+
3
  ## Installation
4
 
5
  Use the github repo or install via pip: https://github.com/dleemiller/WordLlama
 
7
  pip install wordllama
8
  ```
9
 
10
+ ## Intended Use
11
+
12
+ This model is intended for use in natural language processing applications that require text embeddings, such as text classification, sentiment analysis, and document clustering.
13
+ It's a token embedding model that is comparable to word embedding models, but substantially smaller in size (16 MB for the default 256-dim model).
14
+
15
+ ```python
16
+ from wordllama import load
17
+
18
+ wl = load()
19
+ similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
20
+ print(similarity_score) # Output: 0.06641249096796882
21
+ ```
22
+
23
+ ## Model Architecture
24
+
25
+ WordLlama is based on token embedding codebooks extracted from large language models.
26
+ It is trained like a general-purpose embedding model, with MultipleNegativesRankingLoss from the Sentence Transformers library
27
+ and Matryoshka Representation Learning, so that embeddings can be truncated to 64, 128, 256, 512 or 1024 dimensions.
28
+
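As a rough illustration of what truncation means in practice, the sketch below keeps the leading components of a vector and rescales it to unit length before comparing. The helper names `truncate_and_normalize` and `cosine` and the toy 8-dimensional vectors are assumptions made for this example; they are not part of the wordllama API.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors standing in for real embeddings (hypothetical values).
full_a = [0.5, -0.2, 0.1, 0.7, 0.3, -0.4, 0.2, 0.1]
full_b = [0.4, -0.1, 0.2, 0.6, -0.3, 0.5, 0.1, -0.2]

# Truncate both vectors to their first 4 dimensions, then compare.
small_a = truncate_and_normalize(full_a, 4)
small_b = truncate_and_normalize(full_b, 4)
print(len(small_a), cosine(small_a, small_b))
```

With a model trained using Matryoshka Representation Learning, the leading dimensions are meant to carry most of the signal, which is why a prefix of the vector can remain useful on its own.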
29
+ To create WordLlama L2 "supercat", we extract and concatenate the token embedding codebooks from several large language models that
30
+ use the Llama 2 tokenizer vocabulary (32k vocab size). This includes models like Llama 2 70B and Phi-3 Medium.
31
+ Then we add a trainable token weight parameter and initialize stopwords to a smaller value (0.1). Finally, we
32
+ train a projection from the large, concatenated codebook down to a smaller dimension and average-pool the projected token vectors.
33
+
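The forward pass described above can be sketched in plain Python. This is a minimal toy under stated assumptions, not the training code: the function name `embed`, the tiny 3-token vocabulary, the 2-dimensional source codebooks, and the hand-written projection matrix are all hypothetical. Only the 0.1 stopword weight comes from the description.

```python
def embed(token_ids, codebook, token_weights, projection):
    """Weight each token vector, project it, then average over tokens."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    out_dim = len(projection[0])
    pooled = [0.0] * out_dim
    for tid in token_ids:
        # Scale the concatenated token vector by its per-token weight.
        weighted = [token_weights[tid] * x for x in codebook[tid]]
        # Project from the concatenated dimension down to out_dim.
        for j in range(out_dim):
            pooled[j] += sum(w * projection[i][j] for i, w in enumerate(weighted))
    # Average pooling over the sequence.
    return [v / len(token_ids) for v in pooled]

# Two toy codebooks sharing one vocabulary (3 tokens, 2 dims each).
codebook_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
codebook_b = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]

# Concatenate per token: each token now has a 4-dim vector.
codebook = [a + b for a, b in zip(codebook_a, codebook_b)]

# Token 1 plays the role of a stopword, so its weight starts at 0.1.
token_weights = [1.0, 0.1, 1.0]

# Hand-written 4x2 projection (in training this would be learned).
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

sentence_vec = embed([0, 1], codebook, token_weights, projection)
print(sentence_vec)
```

Because every step is applied per token before pooling, the weight and projection can be folded into the codebook once training is done, leaving only a table lookup and an average at inference time.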
34
+ We train on popular embedding datasets from Sentence Transformers and use Matryoshka Representation Learning (MRL) so that
35
+ dimensions can be truncated. For "binary" models, we train using a straight-through estimator, so that the embeddings
36
+ can be binarized, e.g. (x>0).sign(), and packed into integers for Hamming distance computation.
37
+
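To make the binarize-and-pack step concrete, here is a small sketch in plain Python. The helpers `binarize_and_pack` and `hamming` and the toy 8-dimensional vectors are assumptions for illustration; the library's own packing layout may differ.

```python
def binarize_and_pack(vec):
    """Set bit i when vec[i] > 0 and pack all bits into one Python int."""
    packed = 0
    for i, x in enumerate(vec):
        if x > 0:
            packed |= 1 << i
    return packed

def hamming(a, b):
    """Number of bit positions in which two packed embeddings differ."""
    return bin(a ^ b).count("1")

# Toy real-valued embeddings (hypothetical values).
u = [0.3, -0.1, 0.8, -0.5, 0.2, 0.0, -0.7, 0.4]
v = [0.1, 0.2, 0.9, -0.3, -0.6, 0.1, -0.2, 0.5]

packed_u = binarize_and_pack(u)
packed_v = binarize_and_pack(v)
print(hamming(packed_u, packed_v))
```

A lower Hamming distance means the sign patterns agree in more positions, so it can serve as a cheap proxy for similarity between binarized embeddings.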
38
+ After training, we save a new, small token embedding codebook, which is analogous to vectors of a word embedding.
39
+
40
 
41
  ## MTEB Results (l2_supercat)
42
 
 
52
 
53
  ---
54
  license: mit
55
+ datasets:
56
+ - sentence-transformers/all-nli
57
+ - sentence-transformers/gooaq
58
+ language:
59
+ - en
60
  ---