sanchit-gandhi
committed on
Commit
·
7121cfc
1
Parent(s):
4b1ba05
Update README with language/task/timestamp info
README.md
CHANGED
@@ -172,10 +172,11 @@ pip install --upgrade pip
|
|
172 |
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
|
173 |
```
|
174 |
|
175 |
-
### Short-Form Transcription
|
176 |
-
|
177 |
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
|
178 |
-
class to transcribe
|
|
|
|
|
|
|
179 |
|
180 |
```python
|
181 |
import torch
|
@@ -201,11 +202,14 @@ pipe = pipeline(
|
|
201 |
tokenizer=processor.tokenizer,
|
202 |
feature_extractor=processor.feature_extractor,
|
203 |
max_new_tokens=128,
|
|
|
|
|
|
|
204 |
torch_dtype=torch_dtype,
|
205 |
device=device,
|
206 |
)
|
207 |
|
208 |
-
dataset = load_dataset("
|
209 |
sample = dataset[0]["audio"]
|
210 |
|
211 |
result = pipe(sample)
|
@@ -218,59 +222,43 @@ To transcribe a local audio file, simply pass the path to your audio file when y
|
|
218 |
+ result = pipe("audio.mp3")
|
219 |
```
|
220 |
|
221 |
-
|
222 |
-
|
223 |
-
Through Transformers Whisper uses a chunked algorithm to transcribe long-form audio files (> 30-seconds). In practice, this chunked long-form algorithm
|
224 |
-
is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
|
225 |
-
|
226 |
-
To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. To activate batching, pass the argument `batch_size`:
|
227 |
|
228 |
```python
|
229 |
-
|
230 |
-
|
231 |
-
from datasets import load_dataset
|
232 |
-
|
233 |
-
|
234 |
-
device = "cuda:0" if torch.cuda.is_available() else "cpu"
|
235 |
-
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
|
236 |
|
237 |
-
|
|
|
238 |
|
239 |
-
|
240 |
-
|
241 |
-
|
242 |
-
model.to(device)
|
243 |
|
244 |
-
|
245 |
|
246 |
-
|
247 |
-
|
248 |
-
|
249 |
-
|
250 |
-
feature_extractor=processor.feature_extractor,
|
251 |
-
max_new_tokens=128,
|
252 |
-
chunk_length_s=15,
|
253 |
-
batch_size=16,
|
254 |
-
torch_dtype=torch_dtype,
|
255 |
-
device=device,
|
256 |
-
)
|
257 |
|
258 |
-
|
259 |
-
sample = dataset[0]["audio"]
|
260 |
|
261 |
-
|
262 |
-
|
|
|
263 |
```
|
264 |
|
265 |
-
|
266 |
-
|
267 |
|
268 |
```python
|
269 |
-
result = pipe("
|
|
|
270 |
```
|
271 |
-
--->
|
272 |
|
273 |
-
|
274 |
|
275 |
Whisper `tiny` can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
|
276 |
ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
|
|
|
172 |
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
|
173 |
```
|
174 |
|
|
|
|
|
175 |
The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
|
176 |
+
class to transcribe audio files of arbitrary length. Transformers uses a chunked algorithm to transcribe
|
177 |
+
long-form audio files, which in practice is 9x faster than the sequential algorithm proposed by OpenAI
|
178 |
+
(see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)). The batch size should
|
179 |
+
be set based on the specifications of your device:
|
180 |
|
181 |
```python
|
182 |
import torch
|
|
|
202 |
tokenizer=processor.tokenizer,
|
203 |
feature_extractor=processor.feature_extractor,
|
204 |
max_new_tokens=128,
|
205 |
+
chunk_length_s=30,
|
206 |
+
batch_size=16,
|
207 |
+
return_timestamps=True,
|
208 |
torch_dtype=torch_dtype,
|
209 |
device=device,
|
210 |
)
|
211 |
|
212 |
+
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
|
213 |
sample = dataset[0]["audio"]
|
214 |
|
215 |
result = pipe(sample)
|
|
|
222 |
+ result = pipe("audio.mp3")
|
223 |
```
|
224 |
|
225 |
+
Whisper predicts the language of the source audio automatically. If the source audio language is known *a priori*, it
|
226 |
+
can be passed as an argument to the pipeline:
|
|
|
|
|
|
|
|
|
227 |
|
228 |
```python
|
229 |
+
result = pipe(sample, generate_kwargs={"language": "english"})
|
230 |
+
```
|
|
|
|
|
|
|
|
|
|
|
231 |
|
232 |
+
By default, Whisper performs the task of *speech transcription*, where the source audio language is the same as the target
|
233 |
+
text language. To perform *speech translation*, where the target text is in English, set the task to `"translate"`:
|
234 |
|
235 |
+
```python
|
236 |
+
result = pipe(sample, generate_kwargs={"task": "translate"})
|
237 |
+
```
|
|
|
238 |
|
239 |
+
Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument:
|
240 |
|
241 |
+
```python
|
242 |
+
result = pipe(sample, return_timestamps=True)
|
243 |
+
print(result["chunks"])
|
244 |
+
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
245 |
|
246 |
+
And for word-level timestamps:
|
|
|
247 |
|
248 |
+
```python
|
249 |
+
result = pipe(sample, return_timestamps="word")
|
250 |
+
print(result["chunks"])
|
251 |
```
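
As a rough illustration of working with the timestamped output, the sketch below formats a list shaped like `result["chunks"]` into readable lines. It assumes each chunk is a dict with a `"text"` string and a `"timestamp"` pair of start/end seconds, where the end may be `None` if the final segment is cut off. `format_timestamp`, `format_chunks`, and the sample data are hypothetical names for illustration, not part of Transformers.

```python
def format_timestamp(seconds):
    """Render a time in seconds as MM:SS.ss, or '??' when it is missing."""
    if seconds is None:
        return "??"
    # Work in whole centiseconds so rounding never produces "60.00" seconds.
    total_cs = round(seconds * 100)
    minutes, cs = divmod(total_cs, 6000)
    return f"{minutes:02d}:{cs / 100:05.2f}"


def format_chunks(chunks):
    """Turn pipeline-style chunks into '[start -> end] text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text}")
    return lines


# Hand-written sample data in the assumed shape (not real model output).
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello world."},
    {"timestamp": (62.5, None), "text": " Goodbye."},
]

for line in format_chunks(example_chunks):
    print(line)
```

The same helper should work for both sentence-level and word-level chunks, since it only relies on the `"timestamp"` and `"text"` keys.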
|
252 |
|
253 |
+
The above arguments can be used in isolation or in combination. For example, to perform the task of speech translation
|
254 |
+
where the source audio is in French, and we want to return sentence-level timestamps, the following can be used:
|
255 |
|
256 |
```python
|
257 |
+
result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
|
258 |
+
print(result["chunks"])
|
259 |
```
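
When several of these options are combined, one possible way to keep the call tidy is to assemble the keyword arguments first. This is only a sketch: `build_call_kwargs` is a hypothetical helper, and it assumes `return_timestamps` and `generate_kwargs` are accepted at call time exactly as in the calls shown in this README.

```python
def build_call_kwargs(language=None, task=None, timestamps=None):
    """Collect optional Whisper settings into keyword arguments for a pipeline call."""
    generate_kwargs = {}
    if language is not None:
        generate_kwargs["language"] = language
    if task is not None:
        # Whisper supports two tasks: same-language transcription, or translation to English.
        if task not in ("transcribe", "translate"):
            raise ValueError(f"Unknown task: {task!r}")
        generate_kwargs["task"] = task

    call_kwargs = {}
    if generate_kwargs:
        call_kwargs["generate_kwargs"] = generate_kwargs
    if timestamps is not None:
        # True for sentence-level timestamps, "word" for word-level timestamps.
        call_kwargs["return_timestamps"] = timestamps
    return call_kwargs


kwargs = build_call_kwargs(language="french", task="translate", timestamps=True)
print(kwargs)

# Assumes `pipe` and `sample` are defined as in the pipeline example:
# result = pipe(sample, **kwargs)
```

Leaving an option as `None` omits it entirely, so the pipeline falls back to its defaults, such as automatic language detection.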
|
|
|
260 |
|
261 |
+
## Speculative Decoding
|
262 |
|
263 |
Whisper `tiny` can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
|
264 |
ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
|