UAR Play

Literary Character Representations using UAR Play, trained on fictional character utterances.

You can find the training and evaluation repository here.

This model is based on the LUAR implementation. It uses all-distilroberta-v1 as the base sentence encoder and was trained on the Play split of DramaCV, a dataset consisting of drama plays collected from Project Gutenberg.

You can find the model trained on the Scene split at this URL.

Usage

from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gasmichel/UAR_Play")
model = AutoModel.from_pretrained("gasmichel/UAR_Play")
# `episodes` are embedded as collections of documents presumed to come from one author
# NOTE: make sure that `episode_length` is consistent across all episodes
batch_size = 3
episode_length = 16
text = [
    ["Foo"] * episode_length,
    ["Bar"] * episode_length,
    ["Zoo"] * episode_length,
]
text = [j for i in text for j in i]
tokenized_text = tokenizer(
    text, 
    max_length=32,
    padding="max_length", 
    truncation=True,
    return_tensors="pt"
)
# inputs size: (batch_size, episode_length, max_token_length)
tokenized_text["input_ids"] = tokenized_text["input_ids"].reshape(batch_size, episode_length, -1)
tokenized_text["attention_mask"] = tokenized_text["attention_mask"].reshape(batch_size, episode_length, -1)
print(tokenized_text["input_ids"].size())       # torch.Size([3, 16, 32])
print(tokenized_text["attention_mask"].size())  # torch.Size([3, 16, 32])
out = model(**tokenized_text)
print(out.size())   # torch.Size([3, 512])
# to get the Transformer attentions:
out, attentions = model(**tokenized_text, output_attentions=True)
print(attentions[0].size())     # torch.Size([48, 12, 32, 32])
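
The note above about keeping `episode_length` consistent can be enforced before tokenization. The following is a minimal sketch, not part of the model's API: `flatten_episodes` is a hypothetical helper name introduced here for illustration. It flattens a batch of episodes the same way as the list comprehension above and raises an error if the episodes differ in length, since the later `reshape(batch_size, episode_length, -1)` assumes a rectangular batch.

```python
from typing import List, Sequence, Tuple


def flatten_episodes(
    episodes: Sequence[Sequence[str]],
) -> Tuple[List[str], int, int]:
    """Flatten a batch of episodes and check that they share one length.

    Hypothetical helper (not provided by the model repository).
    Returns the flat list of utterances, the batch size, and the
    episode length, i.e. the values needed for the reshape step.
    """
    if not episodes:
        raise ValueError("expected at least one episode")

    episode_length = len(episodes[0])
    if episode_length == 0:
        raise ValueError("episodes must contain at least one utterance")

    # Every episode must have the same number of utterances.
    for index, episode in enumerate(episodes):
        if len(episode) != episode_length:
            raise ValueError(
                f"episode {index} has {len(episode)} utterances, "
                f"expected {episode_length}"
            )

    # Same flattening order as `[j for i in text for j in i]`.
    flat = [utterance for episode in episodes for utterance in episode]
    return flat, len(episodes), episode_length


episodes = [["Foo"] * 16, ["Bar"] * 16, ["Zoo"] * 16]
flat_text, batch_size, episode_length = flatten_episodes(episodes)
print(len(flat_text), batch_size, episode_length)  # 48 3 16
```

The returned `flat_text` can be passed to the tokenizer in place of `text`, and `batch_size` and `episode_length` can be reused in the two `reshape` calls.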

Citing & Authors

If you find this model helpful, feel free to cite our publication.

@inproceedings{michel-etal-2024-improving,
    title = "Improving Quotation Attribution with Fictional Character Embeddings",
    author = "Michel, Gaspard  and
      Epure, Elena V.  and
      Hennequin, Romain  and
      Cerisara, Christophe",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.744",
    doi = "10.18653/v1/2024.findings-emnlp.744",
    pages = "12723--12735",
}

License

UAR Play is distributed under the terms of the Apache License (Version 2.0).

All new contributions must be made under the Apache-2.0 license.
