polish-roberta-base-v2-cposes-tagging

This model is a fine-tuned version of sdadas/polish-roberta-base-v2 on the nkjp1m dataset. It achieves the following results on the evaluation set:

  • Loss: 0.0458
  • Precision: 0.9913
  • Recall: 0.9912
  • F1: 0.9913
  • Accuracy: 0.9889

You can find the training notebook here: https://github.com/WikKam/roberta-pos-finetuning

Usage

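from transformers import pipeline

# Load the tagger as a token-classification pipeline
nlp = pipeline("token-classification", model="wkaminski/polish-roberta-base-v2-cposes-tagging")

# Returns one prediction (tag, score, character offsets) per subword token
nlp("Ale dzisiaj leje")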
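By default the pipeline returns one prediction per subword token, so a single word may come back in several pieces. The sketch below is one possible way to glue pieces back into words, assuming each prediction carries `entity`, `start` and `end` keys (character offsets are only available with a fast tokenizer). The helper name `merge_subwords` and the sample input are hypothetical; the sample is hand-written to show the format and is not real model output.

```python
from typing import Dict, List, Tuple


def merge_subwords(text: str, tokens: List[Dict]) -> List[Tuple[str, str]]:
    """Merge contiguous tokens that share a tag into (word, tag) pairs."""
    spans: List[list] = []  # each item is [start, end, tag]
    for tok in tokens:
        start, end, tag = tok["start"], tok["end"], tok["entity"]
        # Some tokenizers include a leading space in the offsets; skip it.
        while start < end and text[start].isspace():
            start += 1
        if spans and spans[-1][1] == start and spans[-1][2] == tag:
            # No gap since the previous piece and same tag: extend that word.
            spans[-1][1] = end
        else:
            spans.append([start, end, tag])
    return [(text[s:e], t) for s, e, t in spans]


# Hand-written sample in the pipeline's output format (NOT real predictions).
text = "Ale dzisiaj leje"
sample = [
    {"entity": "Conj", "start": 0, "end": 3},
    {"entity": "Adv", "start": 4, "end": 8},
    {"entity": "Adv", "start": 8, "end": 11},
    {"entity": "V", "start": 12, "end": 16},
]
pairs = merge_subwords(text, sample)
print(pairs)
```

Note that this heuristic also merges two distinct tokens that touch and share a tag (for example "?!"), so treat it as a starting point rather than a full word aligner.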

Model description

This model is a coarse part-of-speech tagger for the Polish language based on sdadas/polish-roberta-base-v2. It supports 13 classes, each representing a coarse part of speech:

{
 0: 'A',
 1: 'Adv',
 2: 'Comp',
 3: 'Conj',
 4: 'Dig',
 5: 'Interj',
 6: 'N',
 7: 'Num',
 8: 'Part',
 9: 'Prep',
 10: 'Punct',
 11: 'V',
 12: 'X'
}
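When working with raw model outputs instead of the pipeline, the mapping above is what turns a predicted class index into a tag. A minimal sketch follows, assuming one list of 13 scores per token; the `decode` helper and the scores are made up for illustration.

```python
# Class index -> tag, copied from the mapping above.
ID2LABEL = {
    0: "A", 1: "Adv", 2: "Comp", 3: "Conj", 4: "Dig", 5: "Interj", 6: "N",
    7: "Num", 8: "Part", 9: "Prep", 10: "Punct", 11: "V", 12: "X",
}
# Inverse mapping, useful when preparing labels for training or evaluation.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(scores_per_token):
    """Pick the highest-scoring class for each token and return its tag."""
    tags = []
    for scores in scores_per_token:
        best = max(range(len(scores)), key=scores.__getitem__)
        tags.append(ID2LABEL[best])
    return tags


# Made-up scores for two tokens (not real model output).
fake_scores = [[0.0] * 13, [0.0] * 13]
fake_scores[0][11] = 5.0  # highest score at index 11
fake_scores[1][10] = 3.0  # highest score at index 10
decoded = decode(fake_scores)
print(decoded)
```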

The tags have the same meaning as in the nkjp1m dataset:

| Tag | Description in English | Description in Polish | Example in Polish |
|---|---|---|---|
| A | Adjective | przymiotnik | szybki |
| Adv | Adverb | przysłówek | szybko |
| Comp | Comparative / Complementizer | stopień porównawczy / spójnik podrzędny | lepszy / że |
| Conj | Conjunction | spójnik | i |
| Dig | Digit | cyfra | 5, 3 |
| Interj | Interjection | wykrzyknik | och! |
| N | Noun | rzeczownik | dom |
| Num | Numeral | liczebnik | jeden |
| Part | Particle | partykuła | by |
| Prep | Preposition | przyimek | w |
| Punct | Punctuation | interpunkcja | ., !, ? |
| V | Verb | czasownik | biegać |
| X | Unknown / Other | niesklasyfikowane | xxx |

Intended uses & limitations

Good POS-tagging tools for Polish already exist (for example Morfeusz, http://morfeusz.sgjp.pl/), but I needed a Polish POS tagger that could be loaded easily in the browser. Hugging Face supports this, which is why I created this model.

Training and evaluation data

The model was trained on half of the test split of the nkjp1m dataset (~0.5 million tokens).

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3
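To make the linear schedule concrete, the sketch below computes the learning rate at each epoch boundary, using the step counts reported in the training results (2155 steps per epoch, 6465 in total). It assumes no warmup, a single device and no gradient accumulation; none of these is stated in this card, so read the numbers as an illustration only.

```python
BASE_LR = 2e-05      # learning_rate from the list above
BATCH_SIZE = 16      # train_batch_size from the list above
STEPS_PER_EPOCH = 2155
TOTAL_STEPS = 6465   # 3 epochs x 2155 steps
WARMUP_STEPS = 0     # assumption: the card does not mention warmup


def linear_lr(step, base_lr=BASE_LR, total=TOTAL_STEPS, warmup=WARMUP_STEPS):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


# Upper bound on training sequences per epoch (the last batch may be smaller).
# Assumes one device and no gradient accumulation.
max_sequences_per_epoch = STEPS_PER_EPOCH * BATCH_SIZE

for step in (0, 2155, 4310, 6465):
    print(step, linear_lr(step))
```

Under these assumptions the rate starts at 2e-05, falls to two thirds of that after the first epoch, and reaches zero at the final step.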

Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.0471 | 1.0 | 2155 | 0.0491 | 0.9896 | 0.9900 | 0.9898 | 0.9873 |
| 0.0291 | 2.0 | 4310 | 0.0467 | 0.9901 | 0.9905 | 0.9903 | 0.9884 |
| 0.0191 | 3.0 | 6465 | 0.0458 | 0.9913 | 0.9912 | 0.9913 | 0.9889 |

Framework versions

  • Transformers 4.35.2
  • Pytorch 2.1.0+cu118
  • Datasets 2.15.0
  • Tokenizers 0.15.0