---
language: en
library_name: bm25s
tags:
- bm25
- bm25s
- retrieval
- search
- lexical
---

# BM25S Index

This is a BM25S index created with the [`bm25s` library](https://github.com/xhluca/bm25s) (version `0.2.3`), an ultra-fast implementation of BM25. It can be used for lexical retrieval tasks.

BM25S Related Links:

* 🏠[Homepage](https://bm25s.github.io)
* 💻[GitHub Repository](https://github.com/xhluca/bm25s)
* 🤗[Blog Post](https://huggingface.co/blog/xhluca/bm25s)
* 📝[Technical Report](https://arxiv.org/abs/2407.03618)


## Installation

You can install the `bm25s` library with `pip`:

```bash
pip install "bm25s==0.2.3"

# Include extra dependencies, such as the stemmer
pip install "bm25s[full]==0.2.3"

# For Hugging Face Hub usage
pip install huggingface_hub
```

## Loading a `bm25s` index

You can use this index for information retrieval tasks. Here is an example:

```python
import bm25s
from bm25s.hf import BM25HF

# Load the index
retriever = BM25HF.load_from_hub("ylkhayat/bm25s-caselaw-us-and-veterans")

# Retrieve the top-k results for a query
query = "a cat is a feline"
results = retriever.retrieve(bm25s.tokenize(query), k=3)
```
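
The `retrieve` call returns the top-`k` results for each query. Here is a minimal sketch of inspecting them, assuming the returned object exposes `documents` and `scores` arrays of shape `(n_queries, k)`; since the corpus is not loaded in this example, `documents` contains document indices rather than texts:

```python
# Inspect the top-k results for the single query above.
# NOTE: assumes the results object exposes `documents` and `scores` arrays
# of shape (n_queries, k); without the corpus loaded, `documents` holds
# document indices rather than the documents themselves.
doc_ids, scores = results.documents, results.scores
for rank in range(doc_ids.shape[1]):
    print(f"Rank {rank + 1}: doc {doc_ids[0, rank]} (score: {scores[0, rank]:.2f})")
```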

## Saving a `bm25s` index

You can save a `bm25s` index to the Hugging Face Hub. Here is an example:

```python
import bm25s
from bm25s.hf import BM25HF

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

retriever = BM25HF(corpus=corpus)
retriever.index(bm25s.tokenize(corpus))

token = None  # You can get a token from the Hugging Face website
retriever.save_to_hub("ylkhayat/bm25s-caselaw-us-and-veterans", token=token)
```

## Advanced usage

You can leverage more advanced features of the `bm25s` library when calling `load_from_hub`:

```python
# Load the corpus and index as memory-mapped files (mmap=True) to reduce memory usage
retriever = BM25HF.load_from_hub("ylkhayat/bm25s-caselaw-us-and-veterans", load_corpus=True, mmap=True)

# Load a different branch/revision
retriever = BM25HF.load_from_hub("ylkhayat/bm25s-caselaw-us-and-veterans", revision="main")

# Change the directory where the local files are downloaded
retriever = BM25HF.load_from_hub("ylkhayat/bm25s-caselaw-us-and-veterans", local_dir="/path/to/dir")

# Load a private repository with a token
retriever = BM25HF.load_from_hub("ylkhayat/bm25s-caselaw-us-and-veterans", token=token)
```
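
For instance, combining `load_corpus=True` with `mmap=True` lets you retrieve the stored documents themselves while keeping memory usage low. A sketch (the query string is illustrative, and the exact record format of returned documents may vary by version):

```python
import bm25s
from bm25s.hf import BM25HF

# Load the index with the memory-mapped corpus attached
retriever = BM25HF.load_from_hub(
    "ylkhayat/bm25s-caselaw-us-and-veterans", load_corpus=True, mmap=True
)

# With the corpus loaded, results contain the stored documents, not just indices
results, scores = retriever.retrieve(bm25s.tokenize("veterans benefits appeal"), k=5)
print(results[0, 0])  # top-ranked document for the first (and only) query
```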

## Tokenizer

If you have saved a `Tokenizer` object alongside the index, for example with the following approach:

```python
from bm25s.hf import TokenizerHF

token = "your_hugging_face_token"

# `corpus` is the list of documents from the saving example above
tokenizer = TokenizerHF(corpus=corpus, stopwords="english")
tokenizer.save_to_hub("ylkhayat/bm25s-caselaw-us-and-veterans", token=token)

# Optionally, save the stopwords as well
tokenizer.save_stopwords_to_hub("ylkhayat/bm25s-caselaw-us-and-veterans", token=token)
```

Then, you can load the tokenizer using the following code:

```python
from bm25s.hf import TokenizerHF

# Initialize the tokenizer, then load the saved vocabulary and stopwords from the Hub
tokenizer = TokenizerHF(corpus=corpus, stopwords=[])
tokenizer.load_vocab_from_hub("ylkhayat/bm25s-caselaw-us-and-veterans", token=token)
tokenizer.load_stopwords_from_hub("ylkhayat/bm25s-caselaw-us-and-veterans", token=token)
```
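
Once loaded, the tokenizer can be used in place of `bm25s.tokenize`, so queries are tokenized with the same vocabulary and stopwords as the index. A minimal sketch:

```python
# Tokenize the query with the loaded vocabulary; update_vocab=False keeps the
# vocabulary fixed, so unseen words are ignored instead of being added
query_tokens = tokenizer.tokenize(["a cat is a feline"], update_vocab=False)
results = retriever.retrieve(query_tokens, k=3)
```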


## Stats

The index was built from data with the following statistics:

| Statistic | Value |
| --- | --- |
| Number of documents | 366,752 |
| Number of tokens | 57,736,416 |
| Average tokens per document | 157.43 |

## Parameters

The index was created with the following parameters:

| Parameter | Value |
| --- | --- |
| k1 | `1.5` |
| b | `0.75` |
| delta | `0.5` |
| method | `lucene` |
| idf method | `lucene` |
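
For reference, the `lucene` method corresponds (up to implementation details) to the standard Lucene formulation of BM25, in which `k1` controls term-frequency saturation and `b` controls document-length normalization; note that `delta` applies to the `bm25l`/`bm25+` variants, so it likely has no effect with `method="lucene"`:

$$
\text{score}(q, d) = \sum_{t \in q} \ln\!\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right) \cdot \frac{\mathrm{tf}(t, d)}{\mathrm{tf}(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
$$

where $N$ is the number of documents, $\mathrm{df}(t)$ is the document frequency of term $t$, $\mathrm{tf}(t, d)$ is its frequency in document $d$, and $\mathrm{avgdl}$ is the average document length in tokens (157.43 here).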

## Citation

To cite `bm25s`, please use the following BibTeX entry:

```bibtex
@misc{lu_2024_bm25s,
      title={BM25S: Orders of magnitude faster lexical search via eager sparse scoring}, 
      author={Xing Han Lù},
      year={2024},
      eprint={2407.03618},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.03618}, 
}
```