Person And Book Title Splitter

Person And Book Title Splitter is a named entity recognition (NER) model for Classical Chinese that splits a string into an author name and a book title, for example 徐元文漢魏風致集 into 徐元文 and 漢魏風致集. The model is initialized from the raynardj/classical-chinese-punctuation-guwen-biaodian Classical Chinese punctuation model and fine-tuned on over 25,000 high-quality punctuation pairs collected by the CBDB (China Biographical Database) group.
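
To make the tagging idea concrete, here is a minimal sketch of how per-character labels can be turned into a split. It assumes one label per character, that the label 。 marks the last character of the person name, and that every other character carries a placeholder label "O". The function name split_by_labels and the "O" label are hypothetical and are not part of the model's API.

```python
def split_by_labels(text, labels, boundary="。"):
    """Split text after the first character whose label equals `boundary`.

    Assumption: `labels` holds exactly one label per character of `text`.
    Returns (person, title); person is None when no boundary label is present.
    """
    if len(text) != len(labels):
        raise ValueError("expected one label per character")
    if boundary not in labels:
        return None, text
    cut = labels.index(boundary) + 1  # split right after the marked character
    return text[:cut], text[cut:]


# Hypothetical labels for an 8-character input: the third character ends the name.
person, title = split_by_labels(
    "徐元文漢魏風致集", ["O", "O", "。", "O", "O", "O", "O", "O"]
)
print(person, title)
```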

Sample input txt file

The sample input txt file can be downloaded here: https://huggingface.co/cbdb/OfficeTitleAddressSplitter/blob/main/input.txt

How to use

Here is how to use this model in PyTorch to split a person name from a book title:

1. Import model and packages

import numpy as np
import torch
from scipy.special import softmax
from transformers import AutoTokenizer, AutoModelForTokenClassification

PRETRAINED = "cbdb/PersonAndBookTitleSplitter"
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)
model = AutoModelForTokenClassification.from_pretrained(PRETRAINED)

2. Load Data

# Load your data here
test_list = ['徐元文漢魏風致集', '熊方補後漢書年表', '羅振玉本朝學術源流槪略 一卷', '陶諧陶莊敏集']
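
If the entries live in a text file rather than in a hard-coded list, they could be loaded along these lines. This is a sketch that assumes a UTF-8 file with one entry per line; the layout of the sample input file is not described here, and load_entries is a hypothetical helper name.

```python
def load_entries(path):
    """Read one entry per line, skipping blank lines.

    Assumption: the file is UTF-8 text with one "person + book title" string per line.
    """
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # ignore empty lines
                entries.append(line)
    return entries
```

The result can then stand in for the list above, for example test_list = load_entries("input.txt").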

3. Make a prediction

def predict_class(test):
    tokens_test = tokenizer.encode_plus(
        test,
        add_special_tokens=True,
        return_attention_mask=True,
        padding=True,
        max_length=128,
        return_tensors='pt',
        truncation=True
    )

    # With return_tensors='pt' the encoding already holds tensors,
    # so wrapping them in torch.tensor() again is unnecessary.
    test_seq = tokens_test['input_ids']
    test_mask = tokens_test['attention_mask']

    inputs = {
        "input_ids": test_seq,
        "attention_mask": test_mask
    }
    with torch.no_grad():
        outputs = model(**inputs)
        outputs = outputs.logits.detach().cpu().numpy()

    # Normalize over the label axis, then pick the most likely label id for each token.
    softmax_score = softmax(outputs, axis=-1)
    softmax_score = np.argmax(softmax_score, axis=2)[0]
    return test_seq, softmax_score

# Assumption: idx2label, which the original snippet leaves undefined,
# is the label map stored in the model config.
idx2label = model.config.id2label
element_to_find = '。'  # label the model assigns where the person name ends

for test_sen0 in test_list:
    test_seq, pred_class_proba = predict_class(test_sen0)
    label = [idx2label[int(i)] for i in pred_class_proba]

    test_sen_pred = list(test_sen0)
    if element_to_find in label:
        # label[0] belongs to the leading special token, so the label index of the
        # marked character equals the position just after it in the raw string.
        index = label.index(element_to_find)
        test_sen_pred.insert(index, element_to_find)
    test_sen_pred = ''.join(test_sen_pred)

    print(test_sen_pred)

徐元文。漢魏風致集
熊方。補後漢書年表
羅振玉。本朝學術源流槪略 一卷
陶諧。陶莊敏集
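
For downstream use it may be handier to have the two parts as separate fields rather than as one string with an inserted 。. A minimal sketch, assuming the output format shown above; split_output is a hypothetical helper and not part of the model or of transformers.

```python
def split_output(line, sep="。"):
    """Turn 'person。title' into a dict with separate fields.

    The split happens at the first separator only; when no separator is
    present, the whole string is kept as the title and person is None.
    """
    person, found, title = line.partition(sep)
    if not found:
        return {"person": None, "title": line}
    return {"person": person, "title": title}


for line in ["徐元文。漢魏風致集", "陶諧。陶莊敏集"]:
    print(split_output(line))
```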

Authors

Queenie Luo (queenieluo[at]g.harvard.edu)
Hongsu Wang
Peter Bol
CBDB Group

License

Copyright (c) 2023 CBDB

Except where otherwise noted, content on this repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
