OTAS (Office Title Address Splitter)

Our model, OTAS (Office Title Address Splitter), is a Named Entity Recognition model for Classical Chinese that splits the address portion from the rest of a Classical Chinese office title. It is initialized from the raynardj/classical-chinese-punctuation-guwen-biaodian Classical Chinese punctuation model and fine-tuned on over 25,000 high-quality punctuation pairs collected by the CBDB (China Biographical Database) group.

Sample input txt file

The sample input txt file can be downloaded here: https://huggingface.co/cbdb/OfficeTitleAddressSplitter/blob/main/input.txt

How to use

Here is how to use this model in PyTorch to split the address portion from an office title:

1. Import model and packages

import numpy as np
import torch
from scipy.special import softmax  # assumed source of the softmax used in step 3
from transformers import AutoTokenizer, AutoModelForTokenClassification

PRETRAINED = "cbdb/OfficeTitleAddressSplitter"
tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)
model = AutoModelForTokenClassification.from_pretrained(PRETRAINED)

2. Load Data

# Load your data here
test_list = ['漢軍鑲黃旗副都統', '兵部右侍郎', '盛京戶部侍郎']
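To run the model on the sample input file instead of a hard-coded list, the titles can be read from disk. The sketch below assumes that `input.txt` holds one office title per line, which this page does not state. `read_titles` and `load_sample_titles` are hypothetical helper names; `hf_hub_download` comes from the `huggingface_hub` package.

```python
from pathlib import Path


def read_titles(path):
    """Read office titles from a UTF-8 text file, assuming one title per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Drop surrounding whitespace and skip blank lines.
    return [line.strip() for line in lines if line.strip()]


def load_sample_titles():
    """Download the sample file from the model repository and read it.

    Requires network access and the huggingface_hub package.
    """
    from huggingface_hub import hf_hub_download

    sample_path = hf_hub_download(
        repo_id="cbdb/OfficeTitleAddressSplitter", filename="input.txt"
    )
    return read_titles(sample_path)


# Usage (not run here because it needs network access):
# test_list = load_sample_titles()
```

If the file uses a different layout, for example extra columns per line, `read_titles` would need to be adapted accordingly.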

3. Make a prediction

def predict_class(test):
    # Tokenize one title; return_tensors='pt' already yields PyTorch tensors,
    # so no extra torch.tensor(...) conversion is needed.
    tokens_test = tokenizer(
        test,
        add_special_tokens=True,
        return_attention_mask=True,
        padding=True,
        max_length=128,
        return_tensors='pt',
        truncation=True
    )

    test_seq = tokens_test['input_ids']
    test_mask = tokens_test['attention_mask']

    inputs = {
        "input_ids": test_seq,
        "attention_mask": test_mask
    }
    with torch.no_grad():
        outputs = model(**inputs)
        # Logits have shape (batch, sequence_length, num_labels).
        outputs = outputs.logits.detach().cpu().numpy()

    # Normalize over the label axis, then take the most likely label id per token.
    softmax_score = softmax(outputs, axis=-1)
    pred_ids = np.argmax(softmax_score, axis=2)[0]
    return test_seq, pred_ids

# Map class ids to label strings. This assumes the model config's id2label
# contains the '。' label used below to mark the end of the address portion.
idx2label = model.config.id2label

element_to_find = '。'

for test_sen0 in test_list:
    test_seq, pred_class_proba = predict_class(test_sen0)
    label = [idx2label[int(i)] for i in pred_class_proba]

    if element_to_find in label:
        # Labels include the leading [CLS] token, so a '。' label at position
        # `index` belongs to character index - 1; insert right after it.
        index = label.index(element_to_find)
        test_sen_pred = test_sen0[:index] + element_to_find + test_sen0[index:]
    else:
        # No address portion detected; keep the title unchanged.
        test_sen_pred = test_sen0

    print(test_sen_pred)

漢軍鑲黃旗。副都統
兵部右侍郎
盛京。戶部侍郎
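The insertion index used in step 3 rests on an alignment detail: the label sequence includes the leading `[CLS]` token, so a `。` label at position `k` belongs to character `k - 1` of the title, and the separator goes right after that character, at string index `k`. The sketch below isolates that step so the two parts can be returned separately. It assumes one token per character, as is typical for BERT-style Chinese tokenizers, and uses `"O"` as a placeholder for the non-split label. `split_address` is a hypothetical helper, not part of the model's API.

```python
def split_address(title, labels, sep="。"):
    """Split a title into (address, office) from per-token labels.

    `labels` is assumed to cover [CLS] + one token per character + [SEP].
    Returns an empty address when no separator label is present.
    """
    if sep not in labels:
        return "", title
    # Position k in `labels` is character k - 1, so the split falls at index k.
    k = labels.index(sep)
    return title[:k], title[k:]


# Hand-written labels for an 8-character title: 10 entries including
# [CLS] and [SEP], with the separator label on the character '旗'.
labels = ["O"] * 10
labels[5] = "。"
address, office = split_address("漢軍鑲黃旗副都統", labels)
print(address, office)
```

Returning a tuple rather than a joined string keeps the address and the office title available as separate fields for downstream use.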

Authors

Queenie Luo (queenieluo[at]g.harvard.edu)
Hongsu Wang
Peter Bol
CBDB Group

License

Copyright (c) 2023 CBDB

Except where otherwise noted, content on this repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
