---
license: apache-2.0
language:
- ko
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- audio
---
# hubert-base-korean
## Model Details
HuBERT (Hidden-Unit BERT) is a speech representation learning model proposed by Facebook.
Unlike conventional speech recognition models, HuBERT uses self-supervised learning to learn directly from the raw waveform of the speech signal.
This work was trained on Cloud TPUs provided with support from Google's TPU Research Cloud (TRC).
### Model Description
<table>
<tr>
<td colspan="2"></td>
<td>Base</td>
<td>Large</td>
</tr>
<tr>
<td rowspan="3">CNN Encoder</td>
<td>strides</td>
<td colspan="2">5, 2, 2, 2, 2, 2, 2</td>
</tr>
<tr>
<td>kernel width</td>
<td colspan="2">10, 3, 3, 3, 3, 2, 2</td>
</tr>
<tr>
<td>channel</td>
<td colspan="2">512</td>
</tr>
<tr>
<td rowspan="4">Transformer Encoder</td>
<td>Layer</td>
<td>12</td>
<td>24</td>
</tr>
<tr>
<td>embedding dim</td>
<td>768</td>
<td>1024</td>
</tr>
<tr>
<td>inner FFN dim</td>
<td>3072</td>
<td>4096</td>
</tr>
<tr>
<td>attention heads</td>
<td>8</td>
<td>16</td>
</tr>
<tr>
<td>Projection</td>
<td>dim</td>
<td>256</td>
<td>768</td>
</tr>
<tr>
<td colspan="2">Params</td>
<td>95M</td>
<td>317M </td>
</tr>
</table>
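The CNN encoder settings in the table determine how many frames the Transformer encoder receives for a given waveform. The sketch below works this out in plain Python. It assumes unpadded 1-D convolutions and a 16 kHz sampling rate, neither of which is stated in this card, and the function names are illustrative only.

```python
# Strides and kernel widths copied from the "CNN Encoder" rows of the table.
STRIDES = (5, 2, 2, 2, 2, 2, 2)
KERNELS = (10, 3, 3, 3, 3, 2, 2)
SAMPLE_RATE = 16_000  # assumption: 16 kHz input audio


def num_frames(num_samples: int) -> int:
    """Number of frames produced by the conv stack, assuming no padding."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Hop size in samples between consecutive output frames."""
    hop = 1
    for stride in STRIDES:
        hop *= stride
    return hop


def receptive_field() -> int:
    """Number of input samples that influence a single output frame."""
    field, jump = 1, 1
    for kernel, stride in zip(KERNELS, STRIDES):
        field += (kernel - 1) * jump
        jump *= stride
    return field


print(num_frames(16_000))                       # 49
print(total_stride(), receptive_field())        # 320 400
print(1000 * total_stride() / SAMPLE_RATE)      # 20.0 (ms per frame)
```

Under these assumptions, one second of audio yields 49 frames, which agrees with the `[1, 49, 768]` output shape in the usage examples.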
## How to Get Started with the Model
### PyTorch
```py
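import torch
from transformers import HubertModel

# Load the pretrained Korean HuBERT Base checkpoint.
model = HubertModel.from_pretrained("team-lucid/hubert-base-korean")

# Dummy input: a batch of one waveform (1 second if sampled at 16 kHz).
wav = torch.ones(1, 16000)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(wav)

print(f"Input:   {wav.shape}")  # [1, 16000]
print(f"Output:  {outputs.last_hidden_state.shape}")  # [1, 49, 768]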
```
### JAX/Flax
```py
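import jax.numpy as jnp
from transformers import FlaxAutoModel

# trust_remote_code=True lets transformers run the model code hosted in the checkpoint repository.
model = FlaxAutoModel.from_pretrained("team-lucid/hubert-base-korean", trust_remote_code=True)

# Dummy input: a batch of one waveform (1 second if sampled at 16 kHz).
wav = jnp.ones((1, 16000))
outputs = model(wav)

print(f"Input:   {wav.shape}")  # [1, 16000]
print(f"Output:  {outputs.last_hidden_state.shape}")  # [1, 49, 768]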
```
## Training Details
### Training Data
This model was trained on approximately 4,000 hours of speech extracted from [Free Conversation Speech (General Male/Female)](https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=109), [Multi-Speaker Speech Synthesis Data](https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=542), and [Broadcast Content Conversational Speech Recognition Data](https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=463).
These datasets were built with funding from the Ministry of Science and ICT and with support from the National Information Society Agency (NIA).
### Training Procedure
As in the [original paper](https://arxiv.org/pdf/2106.07447.pdf), a Base model was first trained on MFCC-based targets. Next, k-means with 500 clusters was run, and the Base and Large models were then trained again.
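The clustering step can be illustrated with a small NumPy sketch. This is not the authors' pipeline: random vectors stand in for real acoustic features, the cluster count is reduced from 500 to 8 so the example runs quickly, and `kmeans_labels` is a hypothetical helper written for illustration. It only shows how k-means turns per-frame features into the discrete frame-level targets that a HuBERT-style model is trained to predict.

```python
import numpy as np


def kmeans_labels(features: np.ndarray, k: int, iters: int = 10, seed: int = 0) -> np.ndarray:
    """Plain Lloyd's k-means; returns one cluster id per frame."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct frames (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


# Stand-in for per-frame features; the feature size (39) is arbitrary here.
frames = np.random.default_rng(0).normal(size=(200, 39)).astype(np.float32)
targets = kmeans_labels(frames, k=8)  # the card uses 500 clusters
print(targets.shape)  # (200,)
```

Each entry of `targets` is a cluster index that would serve as the prediction target for the corresponding frame.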
#### Training Hyperparameters
| Hyperparameter | Base | Large |
|:--------------------|---------|--------:|
| Warmup Steps | 32,000 | 32,000 |
| Learning Rate | 5e-4 | 1.5e-3 |
| Batch Size | 128 | 128 |
| Weight Decay | 0.01 | 0.01 |
| Max Steps | 400,000 | 400,000 |
| Learning Rate Decay | 0.1 | 0.1 |
| Adam \\(\beta_1\\) | 0.9 | 0.9 |
| Adam \\(\beta_2\\) | 0.99 | 0.99 |
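
To make the warmup and decay settings concrete, here is a sketch of one possible learning-rate schedule built from the table's values. It assumes a linear warmup from 0 to the listed learning rate, followed by a linear decay, and it reads "Learning Rate Decay 0.1" as the final rate being 0.1 times the peak. The card does not spell out the schedule, so both the shape and that reading are assumptions, and `lr_at` is a hypothetical helper.

```python
def lr_at(step: int, peak_lr: float, warmup_steps: int = 32_000,
          max_steps: int = 400_000, end_factor: float = 0.1) -> float:
    """Learning rate at a given step under the assumed warmup/decay shape."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Linear decay from the peak down to end_factor * peak at max_steps.
    progress = min((step - warmup_steps) / (max_steps - warmup_steps), 1.0)
    return peak_lr * (1.0 - (1.0 - end_factor) * progress)


# Peak learning rates taken from the table: Base 5e-4, Large 1.5e-3.
for name, peak in (("Base", 5e-4), ("Large", 1.5e-3)):
    samples = [lr_at(s, peak) for s in (0, 16_000, 32_000, 400_000)]
    print(name, samples)
```

With these defaults the warmup covers the first 8% of training (32,000 of 400,000 steps).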