serdarcaglar committed
Commit 505b6e7 · 1 Parent(s): 5149d19

Update README.md

Files changed (1):
  1. README.md +45 -6
README.md CHANGED

@@ -4,29 +4,68 @@ tags:
 - roberta
 - language-model
 - scientific
-- turkish
+- turkish
 license: mit
 model_author: Serdar ÇAĞLAR
 ---
-
+🇹🇷
 # Roberta-Based Language Model Trained on Turkish Scientific Article Abstracts
 
-This model is a powerful natural language processing model trained on Turkish scientific article summaries. It focuses on scientific content in the Turkish language and excels in tasks related to text comprehension. The model can be used for understanding scientific texts, summarization, and various other natural language processing tasks.
+This model is a powerful natural language processing model trained on Turkish scientific article abstracts. It focuses on scientific content in the Turkish language and excels in text-comprehension tasks. The model can be used for understanding scientific texts, summarization, and various other natural language processing tasks. The model is cased.
 
 ## Model Details
 
-- **Data Source**: This model is trained on a custom dataset consisting of Turkish scientific article summaries. The data was collected using web scraping methods from various sources in Turkey, including databases like "trdizin," "yöktez," and "türkiyeklinikler."
+- **Data Source**: This model is trained on a custom dataset of Turkish scientific article abstracts. The data was collected by web scraping from various sources in Turkey, including databases such as "trdizin", "yöktez", and "türkiyeklinikleri".
 
 - **Dataset Preprocessing**: The data underwent preprocessing to facilitate better learning. Texts were segmented into sentences, and improperly split sentences were cleaned.
 
 - **Tokenizer**: The model uses a BPE (Byte Pair Encoding) tokenizer, which breaks the text into subword tokens.
 
-- **Training Details**: The model was trained on a large dataset of Turkish sentences. Training spanned 10 epochs, totaling 240 hours, and the model was built from scratch. No fine-tuning was applied.
+- **Training Details**: The model was trained from scratch on a large dataset of Turkish sentences. Training spanned 10 epochs (5M steps), totaling 240 hours. No fine-tuning was applied.
 
 ## Usage
 
-This model is compatible with Hugging Face's Transformers library and can be employed in various natural language processing projects and applications.
+This model is compatible with Hugging Face's Transformers library. Load it with:
+
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+tokenizer = AutoTokenizer.from_pretrained("serdarcaglar/roberta-base-turkish-scientific-abstract")
+model = AutoModelForMaskedLM.from_pretrained("serdarcaglar/roberta-base-turkish-scientific-abstract")
+```
+
+### Fill-Mask Usage
+
+```python
+from transformers import pipeline
+
+fill_mask = pipeline(
+    "fill-mask",
+    model="serdarcaglar/roberta-base-turkish-scientific-abstract",
+    tokenizer="serdarcaglar/roberta-base-turkish-scientific-abstract"
+)
+
+fill_mask("İnterarteriyel seyirli anormal <mask> arter hastaları ne zaman ameliyat edilmeli ve hangi cerrahi teknik kullanılmalıdır?")
+
+# [{'score': 0.47466886043548584, 'token': 6252, 'token_str': ' koroner',
+#   'sequence': 'İnterarteriyel seyirli anormal koroner arter hastaları ne zaman ameliyat edilmeli ve hangi cerrahi teknik kullanılmalıdır?'},
+#  {'score': 0.10102332383394241, 'token': 16407, 'token_str': ' uterin',
+#   'sequence': 'İnterarteriyel seyirli anormal uterin arter hastaları ne zaman ameliyat edilmeli ve hangi cerrahi teknik kullanılmalıdır?'},
+#  {'score': 0.07669707387685776, 'token': 9978, 'token_str': ' pulmoner',
+#   'sequence': 'İnterarteriyel seyirli anormal pulmoner arter hastaları ne zaman ameliyat edilmeli ve hangi cerrahi teknik kullanılmalıdır?'},
+#  {'score': 0.03238440677523613, 'token': 16284, 'token_str': ' serebral',
+#   'sequence': 'İnterarteriyel seyirli anormal serebral arter hastaları ne zaman ameliyat edilmeli ve hangi cerrahi teknik kullanılmalıdır?'},
+#  {'score': 0.018826927989721298, 'token': 12969, 'token_str': ' renal',
+#   'sequence': 'İnterarteriyel seyirli anormal renal arter hastaları ne zaman ameliyat edilmeli ve hangi cerrahi teknik kullanılmalıdır?'}]
+```
 ## Disclaimer
 
 The use of this model is subject to compliance with specific copyright and legal regulations, which are the responsibility of the users. The model owner or provider cannot be held liable for any issues arising from the use of the model.
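
The **Tokenizer** bullet above says the model uses BPE subword tokenization. The following toy sketch is illustrative only (it is not the model's actual tokenizer, vocabulary, or merge table, and the sample words are invented): it shows how BPE learns merge rules from word frequencies and then segments an unseen word into subwords.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn BPE merge rules from a tiny corpus of words.

    Each word starts as a sequence of single characters; the most
    frequent adjacent pair is merged into one token, repeatedly.
    """
    vocab = Counter({tuple(w): c for w, c in Counter(words).items()})
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the current vocabulary.
        pairs = Counter()
        for word, count in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged.
        new_vocab = Counter()
        for word, count in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return merges

def bpe_encode(word, merges):
    """Apply the learned merges, in order, to segment a word."""
    tokens = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Toy "corpus": frequent stems get merged into whole subword tokens,
# so an unseen inflected form splits into stem + suffix pieces.
merges = bpe_train(["hastalar", "hastane", "hasta", "arter", "arterler"], num_merges=6)
print(bpe_encode("hastalarda", merges))  # unseen word, segmented into learned subwords
```

This is why the real tokenizer can represent out-of-vocabulary scientific terms: any word can always fall back to smaller learned subword pieces.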