homebrewltd/Ichigo-whisper-v0.1

@@ -59,20 +59,49 @@ For inference, please refer to the official [Ichigo Whisper repository](https://
 python demo/inference.py --input path/to/your/audio.wav
 ```
 ## Training Specs
-| **Parameter**              | **Value**               |
-|----------------------------|-------------------------|
-| **Initialization Method**  |                         |
-| **Epochs**                 |                         |
-| **Global Batch Size**      |                         |
-| **Learning Rate**          |                         |
-| **Learning Scheduler**     | Cosine                  |
-| **Optimizer**              | AdamW                   |
-| **Warmup Ratio**           |                         |
-| **Weight Decay**           |                         |
-| **Max Sequence Length**    |                         |
 ## Evaluation
@@ -80,15 +109,15 @@ python demo/inference.py --input path/to/your/audio.wav
 | Model Name | Codebook Size | Dataset test | Test samples | WER |
 |------------|---------------|--------------|--------------|-----|
-| **IchigoWhisper** | 2561 | viVoice | 1000 | **11.36** |
-| Whisper Medium | - | viVoice | 1000 | 18.64 |
 2. English
 | Model Name | Codebook Size | Dataset test | Test samples | WER |
 |------------|---------------|--------------|--------------|-----|
-| **IchigoWhisper** | 2561 | LibriTTS-R | 1000 | **12.96** |
-| Whisper Medium | - | LibriTTS-R | 1000 | 12.99 |
 ## Citation Information

 python demo/inference.py --input path/to/your/audio.wav
 ```
 ## Training Specs
+### Hardware Specifications
+| **Component**              | **Details**             |
+|---------------------------|-------------------------|
+| **GPUs**                 | 8 × NVIDIA A6000       |
+### Training Time
+| **Phase**                  | **Duration**            |
+|---------------------------|-------------------------|
+| **Phase 1**              | 75 hours (50 epochs)    |
+| **Phase 2**              | 29 hours (20 epochs)    |
+| **Total Training**       | 104 hours              |
+### Phase 1: With KL Loss
+| **Parameter**              | **Value**                                                      |
+|---------------------------|----------------------------------------------------------------|
+| **Initialization Method** | WhisperVQ-Large-v3 (7 languages) embeddings with duplication |
+| **Epochs**               | 50                                                              |
+| **Global Batch Size**    | 336                                             |
+| **Learning Rate**        | 1e-3                                                           |
+| **Learning Scheduler**    | Linear warm-up with Cosine decay                       |
+| **Optimizer**            | AdamW                                                          |
+| **Warmup Steps**         | 500                                                      |
+| **Weight Decay**         | 0.001                                                          |
+| **Max Audio Length**  | 30 seconds (padded audio)                                      |
+### Phase 2: Without KL Loss
+| **Parameter**              | **Value**                                                      |
+|---------------------------|----------------------------------------------------------------|
+| **Initialization Method** | Phase 1 checkpoint                                             |
+| **Epochs**               | 20                                                              |
+| **Global Batch Size**    | 336                                              |
+| **Learning Rate**        | 1e-3                                                           |
+| **Learning Scheduler**    | Linear warm-up with Cosine decay                       |
+| **Optimizer**            | AdamW                                                          |
+| **Warmup Steps**         | 500                                                      |
+| **Weight Decay**         | 0.001                                                          |
+| **Max Audio Length**  | 30 seconds (padded audio)  |
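
Both phases list a linear warm-up followed by cosine decay with a peak learning rate of 1e-3. The following is a minimal sketch of that schedule shape, not the project's actual training code. It assumes the warm-up value of 500 in the tables is a step count rather than a fraction, and that the rate decays to zero by the last step. The function name `lr_at_step` and the `total_steps` value are hypothetical, since the model card does not state the number of optimizer steps.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-3, warmup_steps: int = 500) -> float:
    """Learning rate at a given optimizer step.

    Assumptions (not confirmed by the model card): warm-up is measured in
    steps, and the cosine phase decays all the way to zero.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine decay from peak_lr down to 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 10_000  # hypothetical step count, for illustration only
    for s in (0, 250, 500, 5_250, 10_000):
        print(s, lr_at_step(s, total))
```

With these assumptions the rate reaches 1e-3 at step 500 and falls to half of that midway through the decay phase.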
 ## Evaluation
 | Model Name | Codebook Size | Dataset test | Test samples | WER |
 |------------|---------------|--------------|--------------|-----|
+| **IchigoWhisper** | 2561 | viVoice | 10000 | **11.68** |
+| Whisper Medium | - | viVoice | 10000 | 18.30 |
 2. English
 | Model Name | Codebook Size | Dataset test | Test samples | WER |
 |------------|---------------|--------------|--------------|-----|
+| **IchigoWhisper** | 2561 | LibriTTS-R | 4689 | **11.89** |
+| Whisper Medium | - | LibriTTS-R | 4689 | 13.06 |
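
The WER column in the tables above is a word error rate expressed as a percentage. As a rough illustration of the metric, here is a minimal sketch that computes corpus-level WER from word-level edit distance. The model card does not say which tool or text normalization was used for the reported numbers, so the whitespace tokenization and the function names `edit_distance` and `corpus_wer` are assumptions made for this example.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus WER in percent: total word edits over total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = 0
    total_words = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref_words = ref_text.split()  # assumed tokenization: whitespace only
        total_edits += edit_distance(ref_words, hyp_text.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_edits / total_words


if __name__ == "__main__":
    refs = ["the cat sat on the mat", "hello world"]
    hyps = ["the cat sit on mat", "hello world"]
    print(f"WER: {corpus_wer(refs, hyps):.2f}%")
```

In practice, transcripts are usually lowercased and stripped of punctuation before scoring, and that choice can shift the result noticeably.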
 ## Citation Information