rinna/japanese-data2vec-audio-base
This is a Japanese data2vec Audio Base model trained by rinna Co., Ltd.
Model summary
The model architecture is the same as the original data2vec Audio Base model, which contains 12 transformer layers with 12 attention heads. The model was trained using code from the official repository, and the detailed training configuration can be found in the same repository and the original paper.
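To make the dimensions in this summary concrete, here is a minimal sketch that derives the per-head attention width. The layer and head counts come from the summary above, and the hidden size of 768 is taken from the output shape shown in the usage example on this card. `EncoderShape` is a hypothetical helper introduced only for illustration; it is not part of transformers or fairseq.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderShape:
    """Hypothetical container for the transformer dimensions on this card."""

    num_layers: int = 12    # stated in the model summary
    num_heads: int = 12     # stated in the model summary
    hidden_size: int = 768  # last dimension of last_hidden_state in the usage example

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the hidden vector evenly across heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads


shape = EncoderShape()
print(shape.head_dim)  # 64
```

The values stored in the checkpoint's own configuration remain the authoritative source; this sketch only restates what the card reports.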
Training
The model was trained on approximately 19,000 hours of Japanese speech from the ReazonSpeech v1 corpus.
Contributors
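import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_name = "rinna/japanese-data2vec-audio-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Placeholder path (not from the original card): point this at your own 16 kHz audio file.
audio_file = "path/to/audio.wav"
raw_speech_16kHz, sr = sf.read(audio_file)

# Convert the waveform into model inputs.
inputs = feature_extractor(
    raw_speech_16kHz,
    return_tensors="pt",
    sampling_rate=sr,
)

# Gradients are not needed for feature extraction.
with torch.no_grad():
    outputs = model(**inputs)

print(f"Input:  {inputs.input_values.size()}")  # [1, #samples]
print(f"Output: {outputs.last_hidden_state.size()}")  # [1, #frames, 768]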
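The `#frames` dimension printed above depends on the input length. As a rough sketch, the function below estimates it from the number of samples, assuming the convolutional feature encoder uses the kernel sizes and strides of the standard wav2vec 2.0 / data2vec Audio Base configuration. Those values are an assumption here, not something this card states, so check the checkpoint's configuration before relying on them. `num_output_frames` is a hypothetical helper name.

```python
# Assumed feature-encoder layout (standard wav2vec 2.0 / data2vec Audio Base values).
# These are not stated on this card; verify them against the checkpoint's config.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Estimate how many frames the feature encoder emits for a waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    # Inputs shorter than the receptive field produce no frames.
    return max(length, 0)


if __name__ == "__main__":
    one_second = 16_000  # samples at 16 kHz
    print(num_output_frames(one_second))  # 49
```

Under these assumptions the strides multiply to 320 samples, so each frame advances by about 20 ms of 16 kHz audio, and at least 400 samples are needed to produce a single frame.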
A fairseq checkpoint file is also available here.
@misc{rinna-japanese-data2vec-audio-base,
    title = {rinna/japanese-data2vec-audio-base},
    author = {Hono, Yukiya and Mitsui, Kentaro and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-data2vec-audio-base}
}
@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
@inproceedings{baevski2022data2vec,
    title = {Data2vec: A general framework for self-supervised learning in speech, vision and language},
    author = {Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
    booktitle = {International Conference on Machine Learning},
    year = {2022},
    pages = {1298--1312},
    doi = {10.48550/arXiv.2202.03555}
}