Commit 50896fe (verified), committed by jan-hq
1 Parent(s): db9b183

Update README.md

  1. README.md +45 -16
README.md CHANGED
@@ -59,20 +59,49 @@ For inference, please refer to the official [Ichigo Whisper repository](https://
  python demo/inference.py --input path/to/your/audio.wav
  ```
 
-
  ## Training Specs
 
- | **Parameter** | **Value** |
- |----------------------------|-------------------------|
- | **Initialization Method** | |
- | **Epochs** | |
- | **Global Batch Size** | |
- | **Learning Rate** | |
- | **Learning Scheduler** | Cosine |
- | **Optimizer** | AdamW |
- | **Warmup Ratio** | |
- | **Weight Decay** | |
- | **Max Sequence Length** | |
 
  ## Evaluation
 
@@ -80,15 +109,15 @@ python demo/inference.py --input path/to/your/audio.wav
 
  | Model Name | Codebook Size | Dataset test | Test samples | WER |
  |------------|---------------|--------------|--------------|-----|
- | **IchigoWhisper** | 2561 | viVoice | 1000 | **11.36** |
- | Whisper Medium | - | viVoice | 1000 | 18.64 |
 
  2. English
 
  | Model Name | Codebook Size | Dataset test | Test samples | WER |
  |------------|---------------|--------------|--------------|-----|
- | **IchigoWhisper** | 2561 | LibriTTS-R | 1000 | **12.96** |
- | Whisper Medium | - | LibriTTS-R | 1000 | 12.99 |
 
  ## Citation Information
 
  python demo/inference.py --input path/to/your/audio.wav
  ```
 
  ## Training Specs
 
+ ### Hardware Specifications
+
+ | **Component** | **Details** |
+ |---------------------------|-------------------------|
+ | **GPUs** | 8 × NVIDIA A6000 |
+
+ ### Training Time
+
+ | **Phase** | **Duration** |
+ |---------------------------|-------------------------|
+ | **Phase 1** | 75 hours (50 epochs) |
+ | **Phase 2** | 29 hours (20 epochs) |
+ | **Total Training** | 104 hours |
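
The durations in the Training Time table can be cross-checked with a little arithmetic. The sketch below uses only the hours, epoch counts, and GPU count listed in the tables; the GPU-hours figure assumes all 8 GPUs were busy for the full wall-clock duration, which the README does not state.

```python
# Wall-clock hours and epoch counts, copied from the Training Time table.
PHASES = {"Phase 1": (75, 50), "Phase 2": (29, 20)}
NUM_GPUS = 8  # from the Hardware Specifications table


def summarize(phases, num_gpus):
    """Return total hours, GPU-hours, and hours per epoch for each phase."""
    total_hours = sum(hours for hours, _ in phases.values())
    return {
        "total_hours": total_hours,
        # Assumption: every GPU was in use for the whole run.
        "gpu_hours": total_hours * num_gpus,
        "hours_per_epoch": {
            name: hours / epochs for name, (hours, epochs) in phases.items()
        },
    }


summary = summarize(PHASES, NUM_GPUS)
print(summary)
```

Under these assumptions both phases take roughly 1.5 hours per epoch, and the two phases add up to the 104 hours reported in the table.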
+
+ ### Phase 1: With KL Loss
+
+ | **Parameter** | **Value** |
+ |---------------------------|----------------------------------------------------------------|
+ | **Initialization Method** | WhisperVQ-Large-v3 (7 languages) embeddings with duplication |
+ | **Epochs** | 50 |
+ | **Global Batch Size** | 336 |
+ | **Learning Rate** | 1e-3 |
+ | **Learning Scheduler** | Linear warm-up with Cosine decay |
+ | **Optimizer** | AdamW |
+ | **Warmup Ratio** | 500 |
+ | **Weight Decay** | 0.001 |
+ | **Max Audio Length** | 30 seconds (padded audio) |
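
For readers who want to relate the global batch size to the hardware, here is a minimal sketch of the usual data-parallel bookkeeping. It assumes plain data parallelism across all 8 GPUs; `grad_accum_steps` is a hypothetical parameter, since the README does not say whether gradient accumulation was used.

```python
def per_device_batch_size(global_batch_size, num_devices, grad_accum_steps=1):
    """Split a global batch evenly across devices and accumulation steps."""
    per_step, remainder = divmod(global_batch_size, num_devices * grad_accum_steps)
    if remainder:
        raise ValueError("global batch size is not evenly divisible")
    return per_step


GLOBAL_BATCH_SIZE = 336  # from the table
NUM_GPUS = 8             # from the Hardware Specifications table

no_accum = per_device_batch_size(GLOBAL_BATCH_SIZE, NUM_GPUS)
with_accum = per_device_batch_size(GLOBAL_BATCH_SIZE, NUM_GPUS, grad_accum_steps=2)
print(no_accum, with_accum)
```

Without accumulation this works out to 42 samples per GPU per step; with two accumulation steps it would be 21.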
+
+ ### Phase 2: Without KL Loss
+
+ | **Parameter** | **Value** |
+ |---------------------------|----------------------------------------------------------------|
+ | **Initialization Method** | Phase 1 checkpoint |
+ | **Epochs** | 20 |
+ | **Global Batch Size** | 336 |
+ | **Learning Rate** | 1e-3 |
+ | **Learning Scheduler** | Linear warm-up with Cosine decay |
+ | **Optimizer** | AdamW |
+ | **Warmup Ratio** | 500 |
+ | **Weight Decay** | 0.001 |
+ | **Max Audio Length** | 30 seconds (padded audio) |
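
Both phases list a "Linear warm-up with Cosine decay" scheduler with a peak learning rate of 1e-3. The following is a minimal sketch of such a schedule, not the training code itself. It reads the listed value of 500 as a number of warm-up steps (an assumption, since the table labels it a ratio), and `TOTAL_STEPS` and `min_lr` are hypothetical values chosen only for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-3, warmup_steps=500, min_lr=0.0):
    """Learning rate with linear warm-up followed by cosine decay."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warm-up steps.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # hypothetical; the real step count is not given

for step in (0, 250, 500, 5_250, 10_000):
    print(step, lr_at_step(step, TOTAL_STEPS))
```

With these settings the rate climbs linearly to 1e-3 at step 500, falls to half of that midway through the decay, and reaches `min_lr` at the final step.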
 
  ## Evaluation
 

 
  | Model Name | Codebook Size | Dataset test | Test samples | WER |
  |------------|---------------|--------------|--------------|-----|
+ | **IchigoWhisper** | 2561 | viVoice | 10000 | **11.68** |
+ | Whisper Medium | - | viVoice | 10000 | 18.30 |
 
  2. English
 
  | Model Name | Codebook Size | Dataset test | Test samples | WER |
  |------------|---------------|--------------|--------------|-----|
+ | **IchigoWhisper** | 2561 | LibriTTS-R | 4689 | **11.89** |
+ | Whisper Medium | - | LibriTTS-R | 4689 | 13.06 |
 
  ## Citation Information
 