japanese-soseki-gpt2-1b


This repository provides a 1.3B-parameter finetuned Japanese GPT-2 model. The model was finetuned by jweb from a base model trained by rinna Co., Ltd. Both PyTorch (pytorch_model.bin) and Rust (rust_model.ot) model files are provided.

How to use the model

NOTE: Use T5Tokenizer to initialize the tokenizer.

python

import torch
from transformers import T5Tokenizer, AutoModelForCausalLM

tokenizer = T5Tokenizer.from_pretrained("jweb/japanese-soseki-gpt2-1b")
model = AutoModelForCausalLM.from_pretrained("jweb/japanese-soseki-gpt2-1b")

if torch.cuda.is_available():
    model = model.to("cuda")

text = "夏目漱石は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_length=128,
        min_length=40,
        do_sample=True,
        repetition_penalty=1.6,
        early_stopping=True,
        num_beams=5,
        temperature=1.0,
        top_k=500,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)  
# sample output: 夏目漱石は、明治時代を代表する文豪です。夏目漱石の代表作は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、それに「虞美人草(ぐびじんそう)」などたくさんあります。

rust

use rust_bert::gpt2::GPT2Generator;
use rust_bert::pipelines::common::{ModelType, TokenizerOption};
use rust_bert::pipelines::generation_utils::{GenerateConfig, LanguageGenerator};
use rust_bert::resources::{RemoteResource, ResourceProvider};
use tch::Device;

fn main() -> anyhow::Result<()> {
    let model_resource = Box::new(RemoteResource {     
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/rust_model.ot".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/model".into(),        
    });
    let config_resource = Box::new(RemoteResource {     
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/config.json".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/config".into(),        
    });
    let vocab_resource = Box::new(RemoteResource {     
        url: "https://huggingface.co/jweb/japanese-soseki-gpt2-1b/resolve/main/spiece.model".into(),
        cache_subdir: "japanese-soseki-gpt2-1b/vocab".into(),        
    });
    let vocab_resource_token = vocab_resource.clone();
    let merges_resource = vocab_resource.clone();    
    let generate_config = GenerateConfig {        
        model_resource,
        config_resource,
        vocab_resource,
        merges_resource, // not used        
        device: Device::Cpu,
        repetition_penalty: 1.6,
        min_length: 40,
        max_length: 128,
        do_sample: true,
        early_stopping: true,
        num_beams: 5,
        temperature: 1.0,
        top_k: 500,
        top_p: 0.95,
        ..Default::default()
    };
    let tokenizer = TokenizerOption::from_file(
        ModelType::T5,
        vocab_resource_token.get_local_path().unwrap().to_str().unwrap(),
        None,
        true,
        None,
        None,
    )?;
    let mut gpt2_model = GPT2Generator::new_with_tokenizer(generate_config, tokenizer.into())?;
    gpt2_model.set_device(Device::cuda_if_available());
    let input_text = "夏目漱石は、";
    let t1 = std::time::Instant::now();
    let output = gpt2_model.generate(Some(&[input_text]), None);
    println!("{}", output[0].text);
    println!("Elapsed Time(ms):{}", t1.elapsed().as_millis());
    Ok(())
}
// sample output: 夏目漱石は、明治から大正にかけて活躍した日本の小説家です。彼は「吾輩は猫である」や「坊っちゃん」、「草枕」「三四郎」、あるいは「虞美人草」などの小説で知られていますが、「明暗」のような小説も書いていました。

Model architecture

A 24-layer, 2048-hidden-size transformer-based language model.
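
As a rough sanity check on the 1.3B figure, the parameter count of a GPT-2 style model can be estimated from the layer count and hidden size given above. The sketch below assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, biases, tied input/output embeddings). The vocabulary size and context length are not stated in this card, so the values passed for them are illustrative assumptions only.

```python
def gpt2_param_count(n_layer: int, d_model: int, vocab_size: int, n_ctx: int) -> int:
    """Estimate the parameter count of a GPT-2 style decoder with tied embeddings."""
    # Attention: fused QKV projection (d x 3d + 3d) and output projection (d x d + d).
    attention = 3 * d_model * d_model + 3 * d_model + d_model * d_model + d_model
    # MLP: d -> 4d -> d, both linear layers with biases.
    mlp = d_model * 4 * d_model + 4 * d_model + 4 * d_model * d_model + d_model
    # Two LayerNorms per block, each with a scale and a bias vector.
    layer_norms = 2 * 2 * d_model
    per_block = attention + mlp + layer_norms
    # Token and position embeddings, plus the final LayerNorm (scale and bias).
    embeddings = vocab_size * d_model + n_ctx * d_model
    return n_layer * per_block + embeddings + 2 * d_model


# 24 layers and hidden size 2048 come from this card.
# ASSUMED values (not stated here): vocab_size=44_928, n_ctx=1024.
total = gpt2_param_count(n_layer=24, d_model=2048, vocab_size=44_928, n_ctx=1024)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands close to 1.3B, with the transformer blocks alone contributing about 1.2B regardless of the assumed vocabulary size.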

Training

The model was trained on Japanese C4, Japanese CC-100 and Japanese Wikipedia to optimize a traditional language modelling objective. It reaches a perplexity of around 14 on a validation set chosen from the same data.
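
For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The following is a minimal sketch of that definition in plain Python. The function name and the example inputs are hypothetical, and this is not the evaluation script used for the reported figure.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: a model that assigns probability 1/14 to every token.
uniform_14 = [math.log(1 / 14)] * 100
print(perplexity(uniform_14))
```

Read this way, a perplexity of around 14 means the model is, on average, about as uncertain as if it were choosing uniformly among 14 tokens at each step.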

Finetuning

The model was finetuned on texts from Aozora Bunko, with an emphasis on the works of Natsume Soseki.

Tokenization

The model uses a sentencepiece-based tokenizer. The vocabulary was first trained on a selected subset from the training data using the official sentencepiece training script, and then augmented with emojis and symbols.
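
The augmentation step can be pictured as appending extra pieces to an already trained vocabulary without disturbing existing token ids. The toy sketch below illustrates only that idea in plain Python. How the augmentation was actually performed is not described in this card, and the function name, vocabulary, and symbols here are all hypothetical.

```python
def augment_vocab(base_pieces, extra_symbols):
    """Append symbols missing from the vocabulary, keeping existing ids stable."""
    piece_to_id = {piece: idx for idx, piece in enumerate(base_pieces)}
    next_id = len(base_pieces)
    for symbol in extra_symbols:
        # Skip symbols the trained vocabulary already covers.
        if symbol not in piece_to_id:
            piece_to_id[symbol] = next_id
            next_id += 1
    return piece_to_id


# Hypothetical toy vocabulary and symbols, for illustration only.
base = ["<unk>", "▁夏目", "漱石", "は", "、"]
vocab = augment_vocab(base, ["😀", "♪", "、"])
print(vocab)
```

Keeping the original ids fixed matters because the embedding rows learned for existing pieces are indexed by those ids.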

License

The MIT license
