flan-t5-base-paragrapher

This model is designed to preprocess, clean, and reformat text chunks containing line breaks, word breaks, and references into coherent plain text paragraphs. The resulting paragraphs can be used with other models like agentlans/flan-t5-small-title and agentlans/text-summarization.
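
The hand-off to a downstream model might be sketched as follows. This is a minimal sketch rather than the author's own pipeline: `clean_then_title` and `build_hf_steps` are hypothetical helper names, and it assumes that agentlans/flan-t5-small-title loads with the Transformers `text2text-generation` pipeline, which this card does not confirm.

```python
from typing import Callable, List, Tuple

# A "step" is any function that maps a batch of strings to a batch of strings.
TextStep = Callable[[List[str]], List[str]]


def clean_then_title(
    chunks: List[str], paragrapher: TextStep, titler: TextStep
) -> List[Tuple[str, str]]:
    """Clean raw chunks first, then generate a title for each cleaned paragraph."""
    paragraphs = paragrapher(chunks)
    titles = titler(paragraphs)
    return list(zip(titles, paragraphs))


def build_hf_steps() -> Tuple[TextStep, TextStep]:
    """Hypothetical wiring with Hugging Face pipelines (downloads both models)."""
    from transformers import pipeline  # lazy import: only needed for this helper

    para = pipeline("text2text-generation", model="agentlans/flan-t5-base-paragrapher")
    # Assumption: the title model is also a text2text (T5-style) model.
    title = pipeline("text2text-generation", model="agentlans/flan-t5-small-title")

    def as_step(pipe) -> TextStep:
        # Each pipeline result is a dict with a "generated_text" field.
        return lambda texts: [
            r["generated_text"] for r in pipe(texts, max_length=512, truncation=True)
        ]

    return as_step(para), as_step(title)
```

Because the steps are plain callables, the same function can be tried with stand-ins before any model is downloaded, or with `paragrapher, titler = build_hf_steps()` once the models are available.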

Model description

The flan-t5-base-paragrapher is a fine-tuned version of google/flan-t5-base, trained on a dataset of open-source introductory social science textbooks. Although it was trained only on textbooks, it should also work well with other types of educational and academic content.

The model achieves the following results on the evaluation set:

  • Loss: 1.5175
  • Number of Input Tokens Seen: 49 815 380
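
To put these two numbers in more familiar terms, the loss can be converted to a perplexity and the token count to a per-epoch figure. The sketch below assumes that the reported loss is a mean per-token cross-entropy in nats (the usual convention for the Transformers trainer, though the card does not say so) and that the token count covers all 10 epochs.

```python
import math

eval_loss = 1.5175          # reported evaluation loss
tokens_seen = 49_815_380    # reported number of input tokens seen
epochs = 10                 # from the training hyperparameters

# Assumption: the loss is a mean per-token cross-entropy measured in nats.
perplexity = math.exp(eval_loss)

# Assumption: the token count is the total over the whole training run.
tokens_per_epoch = tokens_seen / epochs

print(f"perplexity ~ {perplexity:.2f}")
print(f"input tokens per epoch ~ {tokens_per_epoch:,.0f}")
```

Under those assumptions the perplexity comes out at roughly 4.56, with about 5 million input tokens per epoch.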

Intended uses & limitations

This model is intended for preprocessing and reformatting text chunks into coherent paragraphs. It can be particularly useful for:

  1. Cleaning up text extracted from PDFs or OCR systems
  2. Reformatting text with irregular line breaks or word breaks
  3. Preparing text for further processing or analysis

Limitations:

  • The model may not perform optimally on highly specialized or technical texts outside its training domain.
  • Very long input sequences may be truncated due to the model's maximum sequence length (512 tokens).
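
One way to work around the 512-token limit is to split long text into chunks before sending it to the model. The following is a minimal sketch, not part of the original card: `chunk_text` is a hypothetical helper, and its default token counter simply counts whitespace-separated words, which only approximates the model's tokenizer.

```python
from typing import Callable, List


def chunk_text(
    text: str,
    max_tokens: int = 512,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_tokens.

    Note: whitespace (including line breaks) is normalized to single spaces,
    and a single word longer than max_tokens is kept as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the current chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_text("one two three four five", max_tokens=2))
```

For a closer match to the real limit, `count_tokens` could be replaced by something like `lambda s: len(tokenizer(s).input_ids)` using the model's tokenizer. The sketch re-counts the candidate chunk for every word, which is simple but slow for very large inputs.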

Training and evaluation data

The model was trained on a dataset compiled from open-source textbooks. Due to licensing constraints, the specific training data is not published.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • Learning rate: 5e-05
  • Train batch size: 8
  • Eval batch size: 8
  • Seed: 42
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • LR scheduler type: linear
  • Number of epochs: 10.0
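
As an illustration of what the linear scheduler does to the learning rate, here is a small pure-Python sketch. It is not the training code: `linear_lr` is a hypothetical helper, warmup is not listed in this card so zero warmup is assumed, and the total step count is an estimate derived from the training log (step 44,000 at epoch 9.9054, or about 4,442 steps per epoch).

```python
def linear_lr(
    step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0
) -> float:
    """Optional linear warmup followed by linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumption: about 4,442 optimizer steps per epoch over 10 epochs.
# This is an estimate from the training log, not a number stated in the card.
TOTAL_STEPS = 44_420

for step in (0, 11_105, 22_210, 44_420):
    print(step, f"{linear_lr(step, TOTAL_STEPS):.2e}")
```

With these assumptions the learning rate starts at 5e-05, falls to half of that midway through training, and reaches zero at the final step.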

Training results

Columns, in order: Training Loss, Epoch, Step, Validation Loss, Input Tokens Seen
2.0748 0.1126 500 1.7587 562752
1.9699 0.2251 1000 1.7031 1119424
1.9177 0.3377 1500 1.6701 1676620
1.9179 0.4502 2000 1.6647 2244928
1.8908 0.5628 2500 1.6502 2806840
1.8666 0.6754 3000 1.6427 3364792
1.8456 0.7879 3500 1.6245 3925172
1.8542 0.9005 4000 1.6218 4490100
1.8305 1.0131 4500 1.6211 5052066
1.7588 1.1256 5000 1.6040 5607258
1.7606 1.2382 5500 1.6020 6165278
1.7426 1.3507 6000 1.5993 6727290
1.7477 1.4633 6500 1.5869 7292338
1.7413 1.5759 7000 1.5791 7849466
1.7342 1.6884 7500 1.5792 8415302
1.7247 1.8010 8000 1.5759 8970490
1.7423 1.9136 8500 1.5744 9529290
1.7138 2.0261 9000 1.5655 10091652
1.6719 2.1387 9500 1.5630 10650544
1.6637 2.2512 10000 1.5584 11208648
1.6415 2.3638 10500 1.5609 11776396
1.6565 2.4764 11000 1.5558 12338500
1.6597 2.5889 11500 1.5530 12897552
1.6709 2.7015 12000 1.5477 13460052
1.648 2.8140 12500 1.5424 14021984
1.642 2.9266 13000 1.5433 14586256
1.6258 3.0392 13500 1.5419 15140609
1.6067 3.1517 14000 1.5415 15700397
1.5946 3.2643 14500 1.5450 16265849
1.5835 3.3769 15000 1.5415 16827557
1.5996 3.4894 15500 1.5411 17384857
1.5834 3.6020 16000 1.5382 17945909
1.5956 3.7145 16500 1.5351 18507721
1.5825 3.8271 17000 1.5356 19069425
1.6001 3.9397 17500 1.5294 19631905
1.5677 4.0522 18000 1.5369 20185192
1.5415 4.1648 18500 1.5318 20739888
1.5362 4.2774 19000 1.5311 21304584
1.5251 4.3899 19500 1.5323 21862856
1.5388 4.5025 20000 1.5307 22427236
1.5508 4.6150 20500 1.5282 22985184
1.5692 4.7276 21000 1.5265 23548396
1.5391 4.8402 21500 1.5276 24111452
1.5431 4.9527 22000 1.5270 24673344
1.5147 5.0653 22500 1.5292 25236559
1.4908 5.1778 23000 1.5288 25799675
1.5153 5.2904 23500 1.5288 26352767
1.5099 5.4030 24000 1.5250 26916707
1.5064 5.5155 24500 1.5259 27483639
1.5146 5.6281 25000 1.5249 28040307
1.4938 5.7407 25500 1.5233 28600639
1.5034 5.8532 26000 1.5237 29164539
1.5091 5.9658 26500 1.5219 29730199
1.4853 6.0783 27000 1.5241 30286010
1.4797 6.1909 27500 1.5201 30840802
1.466 6.3035 28000 1.5238 31403710
1.4666 6.4160 28500 1.5226 31962730
1.4732 6.5286 29000 1.5199 32518854
1.4756 6.6412 29500 1.5219 33083634
1.4778 6.7537 30000 1.5195 33644482
1.4674 6.8663 30500 1.5182 34207738
1.4813 6.9788 31000 1.5202 34772050
1.4543 7.0914 31500 1.5211 35331657
1.4389 7.2040 32000 1.5221 35888749
1.4534 7.3165 32500 1.5215 36455101
1.4401 7.4291 33000 1.5208 37016889
1.4435 7.5416 33500 1.5212 37570517
1.4443 7.6542 34000 1.5205 38134577
1.4533 7.7668 34500 1.5209 38700917
1.4589 7.8793 35000 1.5218 39259257
1.4548 7.9919 35500 1.5185 39819093
1.4322 8.1045 36000 1.5207 40382907
1.4271 8.2170 36500 1.5220 40938983
1.4165 8.3296 37000 1.5203 41498811
1.4273 8.4421 37500 1.5197 42053427
1.4281 8.5547 38000 1.5195 42615135
1.4372 8.6673 38500 1.5197 43173055
1.4374 8.7798 39000 1.5175 43737723
1.4278 8.8924 39500 1.5211 44300547
1.442 9.0050 40000 1.5189 44864787
1.4235 9.1175 40500 1.5226 45418155
1.413 9.2301 41000 1.5220 45985195
1.4193 9.3426 41500 1.5201 46538675
1.414 9.4552 42000 1.5202 47101815
1.4084 9.5678 42500 1.5191 47655583
1.408 9.6803 43000 1.5207 48217371
1.4207 9.7929 43500 1.5200 48781351
1.4293 9.9054 44000 1.5198 49345155
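
A log like this can be scanned programmatically to find the step with the lowest validation loss. The sketch below parses a handful of rows copied from the log; `parse_log` is a hypothetical helper, and the row selection is only for illustration.

```python
# A few rows copied from the training log, in the same column order:
# training loss, epoch, step, validation loss, input tokens seen
LOG = """\
2.0748 0.1126 500 1.7587 562752
1.6637 2.2512 10000 1.5584 11208648
1.5388 4.5025 20000 1.5307 22427236
1.4778 6.7537 30000 1.5195 33644482
1.4374 8.7798 39000 1.5175 43737723
1.4293 9.9054 44000 1.5198 49345155
"""


def parse_log(text):
    """Turn whitespace-separated log rows into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        train_loss, epoch, step, val_loss, tokens = line.split()
        rows.append(
            {
                "train_loss": float(train_loss),
                "epoch": float(epoch),
                "step": int(step),
                "val_loss": float(val_loss),
                "tokens": int(tokens),
            }
        )
    return rows


rows = parse_log(LOG)
best = min(rows, key=lambda r: r["val_loss"])
print(f"lowest validation loss {best['val_loss']} at step {best['step']}")
```

In the full log, the lowest validation loss is also 1.5175 at step 39,000, which matches the reported evaluation loss. This is consistent with the best checkpoint having been kept, although the card does not say how the final checkpoint was selected.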

Framework versions

  • Transformers 4.44.2
  • PyTorch 2.5.1+cu124
  • Datasets 3.1.0
  • Tokenizers 0.19.1

Usage

Here's an example of how to use the model:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("agentlans/flan-t5-base-paragrapher")
model = T5ForConditionalGeneration.from_pretrained(
    "agentlans/flan-t5-base-paragrapher", device_map="auto"
)

# Define input texts
# Note: These are not real citations; they are for demonstration purposes only.
input_texts = [
    """ge with a narrative—whether through books, films, or oral traditions—we are invited into another person's experience (Brown & Thompson, 2023). This immersion allows us to see the world through different perspectives, breaking down barriers of misunderstanding and prejudice. For example, novels like Harper Lee's "To Kill a Mockingbird" challenge readers to confront issues of racism and injustice through the eyes of a child (Williams, 2018). Similarly, contemporary works such as Chimamanda Ngozi Adichie's "Americanah" explore themes of identity and belonging in a globalized world (Nguyen & Roberts, 2020). By sharing these experiences through storytelling, authors can cultivate empathy in their audiences, encouraging them to reflect on their own beliefs and biases.
    Shaping Identity Through Narratives
    Stories also play a crucial role in shaping personal and collective identities. From childhood tales told by parents to the myths and legends that define cultural heritage, narratives help individuals understand their place in the world (Anderson & White, 2021). They provide frameworks thro""",
    """cia, M., & Patel, R. (2022). Cultural insights through literature: A comparative analysis. International Journal of Cultural Studies, 15(3), 201-215. Johnson, L., & Lee, H. (2019). Oral traditions: Preserving culture through storytelling. Anthropology Today Journal, 34(4), 56-60. Kumar, P. (2021). Epic tales: Literature as a reflection of society. Literary Critique Review, 29(1), 34-50. Lee, J., & Martinez, F. (2021). Voices unheard: Marginalized narratives in digital spaces. Journal of Digital Culture Studies, 7(2), 45-67. Martinez, C., & Chen, Y. (2022). Cultural navigation: Identity in a globalized world. Global Studies Review Jou""",
]

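# Tokenize input texts (inputs longer than 512 tokens are truncated)
# and move the tensors to whichever device the model was loaded on
inputs = tokenizer(
    input_texts, return_tensors="pt", padding=True, truncation=True, max_length=512
).to(model.device)

# Generate outputs, passing the attention mask so that padding tokens are ignored
outputs = model.generate(**inputs, max_length=512)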

# Print generated outputs
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True) + "\n")

Example output:

Through storytelling, we are invited into another person's experience, breaking down barriers of misunderstanding and prejudice. This immersion allows us to see the world through different perspectives, fostering empathy and re-evaluating our own beliefs and biases. For instance, Harper Lee's "To Kill a Mockingbird" challenges readers to confront issues of racism and injustice through the eyes of a child, while contemporary works like Chimamanda Ngozi Adichie's "Americanah" explore themes of identity and belonging in a globalized world. By sharing these experiences through storytelling, authors

The study of cultural insights through literature has yielded valuable insights into the world. Ci and Patel (2022) conducted a comparative analysis of cultural insights through literature, highlighting the importance of cultural storytelling in preserving culture. Kumar (2021) argued that oral traditions can preserve culture through storytelling, highlighting the importance of storytelling in preserving culture. Lee and Martinez (2021) explored marginalized narratives in digital spaces, highlighting the need for cultural navigation in a globalized world. These studies collectively demonstrate the importance of cultural navigation in fostering identity and identity in a globalized world.

Model size: 248M params (Safetensors, tensor type F32)