UAR Play
Literary Character Representations using UAR Play, trained on fictional character utterances.
You can find the training and evaluation repository here.
This model is based on the LUAR implementation. It uses all-distilroberta-v1
as the base sentence encoder and was trained on the Play split of DramaCV, a dataset of drama plays collected from Project Gutenberg.
You can find the model trained on the Scene split at this url.
Usage
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gasmichel/UAR_Play")
model = AutoModel.from_pretrained("gasmichel/UAR_Play")
# `episodes` are embedded as collections of documents presumed to come from an author
# NOTE: make sure that `episode_length` is consistent across episodes
batch_size = 3
episode_length = 16
text = [
    ["Foo"] * episode_length,
    ["Bar"] * episode_length,
    ["Zoo"] * episode_length,
]
text = [j for i in text for j in i]
tokenized_text = tokenizer(
    text,
    max_length=32,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
# inputs size: (batch_size, episode_length, max_token_length)
tokenized_text["input_ids"] = tokenized_text["input_ids"].reshape(batch_size, episode_length, -1)
tokenized_text["attention_mask"] = tokenized_text["attention_mask"].reshape(batch_size, episode_length, -1)
print(tokenized_text["input_ids"].size()) # torch.Size([3, 16, 32])
print(tokenized_text["attention_mask"].size()) # torch.Size([3, 16, 32])
out = model(**tokenized_text)
print(out.size()) # torch.Size([3, 512])
# to get the Transformer attentions:
out, attentions = model(**tokenized_text, output_attentions=True)
print(attentions[0].size()) # torch.Size([48, 12, 32, 32])
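The snippet above assumes every episode already holds exactly `episode_length` utterances. In practice a character rarely speaks a multiple of 16 lines, so some grouping step is needed before tokenization. The following is a minimal sketch of one way to do that. `build_episodes`, `pad_text`, and `drop_last` are hypothetical names introduced for illustration, not part of this model or of LUAR, and padding short episodes with empty strings is an assumption rather than the procedure used for training.

```python
from typing import List


def build_episodes(
    utterances: List[str],
    episode_length: int,
    pad_text: str = "",
    drop_last: bool = False,
) -> List[List[str]]:
    """Group one character's utterances into episodes of equal length.

    Hypothetical helper: an incomplete final episode is either padded with
    `pad_text` (an assumption, not the training procedure) or dropped.
    """
    if episode_length <= 0:
        raise ValueError("episode_length must be a positive integer")

    episodes: List[List[str]] = []
    for start in range(0, len(utterances), episode_length):
        chunk = list(utterances[start:start + episode_length])
        missing = episode_length - len(chunk)
        if missing > 0:
            if drop_last:
                continue  # skip the incomplete final episode
            chunk.extend([pad_text] * missing)  # pad up to episode_length
        episodes.append(chunk)
    return episodes


# Five utterances grouped into episodes of two: the last episode is padded.
utterances = [f"line {i}" for i in range(5)]
episodes = build_episodes(utterances, episode_length=2)

# Flatten in the same way as the usage snippet before calling the tokenizer.
flat = [u for episode in episodes for u in episode]
print(len(episodes), len(flat))  # 3 6
```

After tokenizing `flat`, the reshape call would use `len(episodes)` as `batch_size`. Whether padding or dropping the last episode works better probably depends on how many utterances each character has, so it is worth trying both.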
Citing & Authors
If you find this model helpful, feel free to cite our publication.
@inproceedings{michel-etal-2024-improving,
title = "Improving Quotation Attribution with Fictional Character Embeddings",
author = "Michel, Gaspard and
Epure, Elena V. and
Hennequin, Romain and
Cerisara, Christophe",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.744",
doi = "10.18653/v1/2024.findings-emnlp.744",
pages = "12723--12735",
}
License
UAR Play is distributed under the terms of the Apache License (Version 2.0).
All new contributions must be made under the Apache-2.0 license.