Commit a485f84 by kaushal98b
1 Parent(s): fb33248

Update README.md

README.md CHANGED
@@ -7,7 +7,21 @@ library_name: nemo
---
## IndicConformer

-IndicConformer is a Hybrid RNNT conformer model built for Nepali.

## AI4Bharat NeMo:

@@ -17,47 +31,30 @@ library_name: nemo
```

## Usage
-
-```bash
-$ python inference.py --help
-usage: inference.py [-h] -c CHECKPOINT -f AUDIO_FILEPATH -d (cpu,cuda) -l LANGUAGE_CODE
-options:
-  -h, --help            show this help message and exit
-  -c CHECKPOINT, --checkpoint CHECKPOINT
-                        Path to .nemo file
-  -f AUDIO_FILEPATH, --audio_filepath AUDIO_FILEPATH
-                        Audio filepath
-  -d (cpu,cuda), --device (cpu,cuda)
-                        Device (cpu/gpu)
-  -l LANGUAGE_CODE, --language_code LANGUAGE_CODE
-                        Language Code (eg. hi)
```

-
```
-
```
-Expected output -

```
-
-
-
------------
-Transcript:
-Took ** seconds.
------------
```

-###
-
-
-
-
-
-This model provides transcribed speech as a string for a given audio sample.
-
-## Model Architecture
-
-This model is a conformer-Large model, consisting of 120M parameters, as the encoder, with a hybrid CTC-RNNT decoder. The model has 17 conformer blocks with
-512 as the model dimension.

---
## IndicConformer

+IndicConformer is a Hybrid CTC-RNNT conformer ASR (Automatic Speech Recognition) model built for Nepali.
+
+### Input
+
+This model accepts 16 kHz (16000 Hz) mono-channel audio (WAV files) as input.
+
+### Output
+
+This model provides transcribed speech as a string for a given audio sample.
+
+## Model Architecture
+
+This model is a conformer-Large model, consisting of 120M parameters, as the encoder, with a hybrid CTC-RNNT decoder. The model has 17 conformer blocks with
+512 as the model dimension.
+

## AI4Bharat NeMo:

```

## Usage
+Download and load the model from Hugging Face.
```
+model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/indicconformer_stt_ne_hybrid_rnnt_large")

+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.freeze() # inference mode
+model = model.to(device) # transfer model to device
```
+Prepare an audio file by running the command below in your terminal. It resamples the audio to 16000 Hz and downmixes it to a single (mono) channel.
+```
+ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav
```
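
Since the README asks for 16000 Hz mono WAV input, it may be worth confirming that the converted file actually has that format before transcribing. The following is a minimal sketch, not part of the commit: `check_wav_format` is a hypothetical helper name, it relies only on Python's standard `wave` module, and it assumes the file is uncompressed PCM WAV.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file.

    Hypothetical helper: `ok` is True only when the file matches the
    expected sample rate and channel count.
    """
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()      # samples per second
        channels = wav_file.getnchannels()  # 1 = mono, 2 = stereo
    ok = (rate == expected_rate) and (channels == expected_channels)
    return ok, rate, channels
```

Calling `check_wav_format("sample_audio_infer_ready.wav")` on the file produced by the `ffmpeg` command returns a tuple whose first element indicates whether the file is ready for inference; the other two elements show the actual sample rate and channel count when it is not.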

+
+### Inference using CTC decoder
```
+model.cur_decoder = "ctc"
+ctc_text = model.transcribe(['sample_audio_infer_ready.wav'], batch_size=1, logprobs=False, language_id='hi')[0]
+print(ctc_text)
```

+### Inference using RNNT decoder
+```
+model.cur_decoder = "rnnt"
+rnnt_text = model.transcribe(['sample_audio_infer_ready.wav'], batch_size=1, language_id='hi')[0]
+print(rnnt_text)
+```
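
The CTC and RNNT snippets differ only in the value of `cur_decoder` and the extra `logprobs=False` argument, so the two calls can be wrapped in one loop to compare decoders on the same audio. The sketch below is an illustration rather than part of the commit: `transcribe_with_decoders` is a hypothetical helper, it mirrors the `transcribe(...)` arguments and the `[0]` indexing used in the README, and it expects a model that was already loaded as shown under Usage (which in turn presumably needs `import torch` and `import nemo.collections.asr as nemo_asr`, neither of which appears in the diff). The `language_id` value is passed through unchanged; the README uses `'hi'`, and whether this Nepali checkpoint expects a different code is not stated here.

```python
from typing import Dict, List


def transcribe_with_decoders(model, audio_paths: List[str], language_id: str,
                             batch_size: int = 1) -> Dict[str, object]:
    """Run the same files through the CTC and RNNT decoders.

    Hypothetical helper: returns a dict mapping decoder name to the
    first element of what `model.transcribe` returned.
    """
    results = {}
    for decoder_name in ("ctc", "rnnt"):
        # Switch the active decoder the same way the README snippets do.
        model.cur_decoder = decoder_name
        # The README passes logprobs=False only for the CTC call.
        extra_args = {"logprobs": False} if decoder_name == "ctc" else {}
        output = model.transcribe(audio_paths, batch_size=batch_size,
                                  language_id=language_id, **extra_args)
        # Mirror the [0] indexing used in the README snippets.
        results[decoder_name] = output[0]
    return results
```

With a loaded model, `transcribe_with_decoders(model, ['sample_audio_infer_ready.wav'], language_id='hi')` would return one entry per decoder, which makes it easy to print the two results side by side.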