|
--- |
|
license: apache-2.0

language: en
|
base_model: sentence-transformers/all-MiniLM-L6-v2 |
|
tags: |
|
- generated_from_trainer |
|
metrics: |
|
- accuracy |
|
model-index: |
|
- name: new_classifier_model |
|
results: [] |
|
--- |
|
|
|
|
|
|
# Classifier for Academic Text Content
|
|
|
This model is a fine-tuned version of [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) on a collection of Linguistics publications. |
|
It achieves the following results on the evaluation set: |
|
- Loss: 0.4181 |
|
- Accuracy: 0.9193 |
|
|
|
## Model description |
|
|
|
The model is fine-tuned on academic publications in Linguistics to classify text segments from publications into 4 classes, serving as a filter for downstream tasks.
|
|
|
The 4 classes: |
|
- 0: out of scope - material of low significance, e.g. page numbers, page headers, and noise from OCR/PDF-to-text conversion

- 1: main text - the main body text of the publication, to be used for downstream tasks

- 2: examples - figure captions, quotes, and excerpts

- 3: references - entries in the publication's reference list, excluding in-text citations
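
Depending on how the label names were exported in the model configuration, the pipeline may return generic identifiers such as `LABEL_0` ... `LABEL_3` rather than readable class names. A small mapping kept next to the model makes the output easier to interpret; the sketch below assumes the label ids follow the class numbering above.

```python
# Hypothetical mapping from label id to class name; it assumes the ids
# follow the 4-class numbering described above.
ID2CLASS = {
    0: "out of scope",
    1: "main text",
    2: "examples",
    3: "references",
}

def label_to_class(label: str) -> str:
    """Map a pipeline label such as 'LABEL_1' to a readable class name."""
    return ID2CLASS[int(label.split("_")[-1])]
```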
|
|
|
## Intended uses & limitations |
|
|
|
Intended uses: |
|
- to extract the main text of academic publications for downstream tasks
|
|
|
Limitations: |
|
- the training and evaluation data are limited to English-language academic texts in Linguistics
|
|
|
## How to run |
|
|
|
```python |
|
from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hub.
classifier = pipeline(
    "text-classification",
    model="howanching-clara/classifier_for_academic_texts",
    tokenizer="howanching-clara/classifier_for_academic_texts",
)

# Perform inference on your input text.
your_text = "your text here."
result = classifier(your_text)

print(result)
|
``` |
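
Since the intended use is to keep only the main text of a publication, a typical pattern is to classify each extracted paragraph and filter on the predicted label. The snippet below is a minimal sketch: the paragraph strings are invented for illustration, and whether predictions come back as `LABEL_1` or as a readable class name depends on the model configuration.

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="howanching-clara/classifier_for_academic_texts",
)

# Hypothetical paragraphs as they might come out of a PDF-to-text conversion.
paragraphs = [
    "184 SOME RESIDUAL PROBLEMS",  # page header (class 0 per the definitions above)
    "The passive transformation applies to the embedded sentence.",  # main text (class 1)
    "Peshkovskii, A. M. (1956). Russkii Sintaksis v Nauchnom Osveshchenii. Moscow.",  # reference (class 3)
]

# The pipeline accepts a list of strings and returns one prediction per input.
predictions = classifier(paragraphs)

# Keep only the paragraphs predicted as main text.
main_text = [
    p for p, pred in zip(paragraphs, predictions)
    if pred["label"] in ("LABEL_1", "main text")
]
print(main_text)
```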
|
|
|
|
|
|
|
## Try it yourself with the following examples (not in the training/evaluation data)
|
|
|
|
|
|
Excerpts from Chomsky, N. (2014). Aspects of the Theory of Syntax (No. 11). MIT Press, retrieved from https://apps.dtic.mil/sti/pdfs/AD0616323.pdf. The excerpts are reproduced with the OCR artifacts of the converted PDF, since such noise is part of what the classifier is designed to handle.
|
|
|
- In the case of (ioii) and (1 lii), the passive transformation will apply to the embedded sentence, and in all four cases other operations will give the final surface forms of (8) and (g).
|
|
|
|
|
- (10) (i) Noun Phrase — Verb — Noun Phrase — Sentence (/ — persuaded — a specialist — a specialist will examine John) (ii) Noun Phrase — Verb — Noun Phrase — Sentence (/ — persuaded — John — a specialist will examine John)
|
|
|
|
|
- (13) S Det Predicate-Phrase [+Definite] nom VP their F1...Fm Det N destroy [+Definite] G, ... G, the property
|
|
|
- 184 SOME RESIDUAL PROBLEMS |
|
|
|
- Peshkovskii, A. M. (1956). Russkii Sintaksis v Nauchnom Osveshchenii. Moscow.
|
|
|
|
|
## Problematic cases |
|
|
|
Definitions or findings written in point form are challenging for the model. For example: |
|
|
|
- (2) (i) the string (1) is a Sentence (S); frighten the boy is a Verb Phrase (VP) consisting of the Verb (V) frighten and the Noun Phrase (NP) the boy; sincerity is also an NP; the NP the boy consists of the Determiner (Det) the, followed by a Noun (N); the NP sincerity consists of just an N; the is, furthermore, an Article (Art); may is a Verbal Auxiliary (Aux) and, furthermore, a Modal (M).
|
|
|
- (v) specification of a function m such that m(i) is an integer associated with the grammar G4 as its value (with, let us say, lower value indicated by higher number)
|
|
|
|
|
|
|
## Training and evaluation data |
|
|
|
The training and evaluation data consist of text segments from English-language academic publications in Linguistics (see the Model description above); further details about the dataset are not documented here.
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 2e-05 |
|
- train_batch_size: 16 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 10 |
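
For reference, these settings correspond roughly to the following Hugging Face `TrainingArguments`. This is a sketch only: dataset loading, model instantiation, and the `Trainer` call are omitted, and the Adam settings listed above are the optimizer defaults.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="new_classifier_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    seed=42,
    lr_scheduler_type="linear",
    evaluation_strategy="epoch",  # validation loss/accuracy reported once per epoch
)
```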
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Accuracy | |
|
|:-------------:|:-----:|:----:|:---------------:|:--------:| |
|
| 0.5772 | 1.0 | 762 | 0.3256 | 0.9062 | |
|
| 0.2692 | 2.0 | 1524 | 0.3038 | 0.9163 | |
|
| 0.217 | 3.0 | 2286 | 0.3109 | 0.9180 | |
|
| 0.1773 | 4.0 | 3048 | 0.3160 | 0.9209 | |
|
| 0.1619 | 5.0 | 3810 | 0.3440 | 0.9206 | |
|
| 0.1329 | 6.0 | 4572 | 0.3675 | 0.9160 | |
|
| 0.1165 | 7.0 | 5334 | 0.3770 | 0.9209 | |
|
| 0.0943 | 8.0 | 6096 | 0.4012 | 0.9203 | |
|
| 0.085 | 9.0 | 6858 | 0.4166 | 0.9196 | |
|
| 0.0811 | 10.0 | 7620 | 0.4181 | 0.9193 | |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.34.1 |
|
- Pytorch 2.1.0+cpu |
|
- Datasets 2.14.7 |
|
- Tokenizers 0.14.1 |
|
|