Update README.md
README.md
CHANGED
@@ -59,20 +59,49 @@ For inference, please refer to the official [Ichigo Whisper repository](https://
 python demo/inference.py --input path/to/your/audio.wav
 ```
 
-
 ## Training Specs
 
-
-
-| **
-
-| **
-
-
-
-| **
-
-| **
+### Hardware Specifications
+
+| **Component** | **Details** |
+|---------------------------|-------------------------|
+| **GPUs** | 8 × NVIDIA A6000 |
+
+### Training Time
+
+| **Phase** | **Duration** |
+|---------------------------|-------------------------|
+| **Phase 1** | 75 hours (50 epochs) |
+| **Phase 2** | 29 hours (20 epochs) |
+| **Total Training** | 104 hours |
+
+### Phase 1: With KL Loss
+
+| **Parameter** | **Value** |
+|---------------------------|----------------------------------------------------------------|
+| **Initialization Method** | WhisperVQ-Large-v3 (7 languages) embeddings with duplication |
+| **Epochs** | 50 |
+| **Global Batch Size** | 336 |
+| **Learning Rate** | 1e-3 |
+| **Learning Scheduler** | Linear warm-up with Cosine decay |
+| **Optimizer** | AdamW |
+| **Warmup Ratio** | 500 |
+| **Weight Decay** | 0.001 |
+| **Max Audio Length** | 30 seconds (padded audio) |
+
+### Phase 2: Without KL Loss
+
+| **Parameter** | **Value** |
+|---------------------------|----------------------------------------------------------------|
+| **Initialization Method** | Phase 1 checkpoint |
+| **Epochs** | 20 |
+| **Global Batch Size** | 336 |
+| **Learning Rate** | 1e-3 |
+| **Learning Scheduler** | Linear warm-up with Cosine decay |
+| **Optimizer** | AdamW |
+| **Warmup Ratio** | 500 |
+| **Weight Decay** | 0.001 |
+| **Max Audio Length** | 30 seconds (padded audio) |
 
 ## Evaluation
 
@@ -80,15 +109,15 @@ python demo/inference.py --input path/to/your/audio.wav
 
 | Model Name | Codebook Size | Dataset test | Test samples | WER |
 |------------|---------------|--------------|--------------|-----|
-| **IchigoWhisper** | 2561 | viVoice |
-| Whisper Medium | - | viVoice |
+| **IchigoWhisper** | 2561 | viVoice | 10000 | **11.68** |
+| Whisper Medium | - | viVoice | 10000 | 18.30 |
 
 2. English
 
 | Model Name | Codebook Size | Dataset test | Test samples | WER |
 |------------|---------------|--------------|--------------|-----|
-| **IchigoWhisper** | 2561 | LibriTTS-R |
-| Whisper Medium | - | LibriTTS-R |
+| **IchigoWhisper** | 2561 | LibriTTS-R | 4689 | **11.89** |
+| Whisper Medium | - | LibriTTS-R | 4689 | 13.06 |
 
 ## Citation Information
 
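
The training-time and evaluation figures added in this change can be cross-checked with a little arithmetic. The sketch below is illustrative only: the numbers are copied from the tables in the diff, while the names `TRAINING_HOURS`, `EPOCHS`, `WER`, and `relative_wer_reduction` are hypothetical helpers introduced here and are not part of the Ichigo Whisper repository.

```python
# Figures copied from the "Training Time" and "Evaluation" tables in this diff.
TRAINING_HOURS = {"Phase 1": 75, "Phase 2": 29}
EPOCHS = {"Phase 1": 50, "Phase 2": 20}
WER = {
    "viVoice": {"IchigoWhisper": 11.68, "Whisper Medium": 18.30},
    "LibriTTS-R": {"IchigoWhisper": 11.89, "Whisper Medium": 13.06},
}


def relative_wer_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` lowers WER relative to `baseline`."""
    return (baseline - candidate) / baseline


# Sum of the two phases; should match the "Total Training" row.
total_hours = sum(TRAINING_HOURS.values())

# Average wall-clock hours per epoch in each phase.
hours_per_epoch = {
    phase: TRAINING_HOURS[phase] / EPOCHS[phase] for phase in TRAINING_HOURS
}

# Relative WER reduction of IchigoWhisper over Whisper Medium, per dataset.
reductions = {
    dataset: relative_wer_reduction(
        scores["Whisper Medium"], scores["IchigoWhisper"]
    )
    for dataset, scores in WER.items()
}

print(f"Total training time: {total_hours} hours")
for phase, hours in hours_per_epoch.items():
    print(f"{phase}: {hours:.2f} hours per epoch")
for dataset, reduction in reductions.items():
    print(f"{dataset}: {reduction:.1%} relative WER reduction")
```

Relative reduction is used here, rather than the absolute difference in WER points, because the two test sets have different baseline error rates.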