Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -54,12 +54,12 @@ print(response)
54
 
55
  ## Hardware and Software
56
 
57
- **Training Factors:** We used [llama-factory]() training library, Cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
58
 
59
 
60
  ## Training Data
61
 
62
- **Overview:** We have collected a large Bangla raw dataset of text data from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribe text, code-mixed text, conversations, and open sources raw data. The dataset is cleaned and filtered by different filtering criteria to ensure the quality of the data. Our collected data size roughly around 268 GB. We separated __22GB__ data from that using a ratio of the data actual data size. Total trained tokens are __3B__ tokens.
63
 
64
  Data sources summary:
65
  - Web documents: Extract, clean, filter common crawl data
@@ -69,7 +69,7 @@ Data sources summary:
69
  - Code-mixed data: We trained a Bangla-English code-mixed LLM model and used it to generate code-mixed data
70
  - Transliteration data: We trained a Bangla-English transliteration LLM model and used it to generate transliterated data
71
  - Synthetic data: We generated synthetic data using a Bangla LLM model
72
- - Others: We scrap some selected websites data, used open-sources data, and used some other data sources
73
 
74
 
75
  ## Benchmarks \- Bangla Text
@@ -77,7 +77,7 @@ Data sources summary:
77
  In this section, we report the results for __titulm-gemma-2-2b-v1.0__ models on standard automatic benchmarks. For all these evaluations, we used the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) evaluation library.
78
 
79
  ### Evaluation Datasets
80
- We evaluated our pretrained models on both Bangla and English benchmark datasets. Although the model is trained on Bangla data, it's English capability is also evaluated on English benchmark datasets. The evaluation datasets are as follows:
81
 
82
  #### Bangla Benchmark datasets
83
  We evaluated the models on the following datasets:
 
54
 
55
  ## Hardware and Software
56
 
57
+ **Training Factors:** We used the [llama-factory]() training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
58
 
59
 
60
  ## Training Data
61
 
62
+ **Overview:** We have collected a large raw Bangla text dataset from a wide variety of sources. The data collected so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered using several filtering criteria to ensure data quality. The collected data is roughly 268 GB in size. From that, we separated __22GB__ of data using a ratio based on the actual data size. In total, the model was trained on __3B__ tokens.
63
 
64
  Data sources summary:
65
  - Web documents: Extract, clean, filter common crawl data
 
69
  - Code-mixed data: We trained a Bangla-English code-mixed LLM model and used it to generate code-mixed data
70
  - Transliteration data: We trained a Bangla-English transliteration LLM model and used it to generate transliterated data
71
  - Synthetic data: We generated synthetic data using a Bangla LLM model
72
+ - Others: We scraped data from some selected websites, used open-source data, and used some other data sources
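
The overview says the 22 GB training subset was separated from the roughly 268 GB collection "using a ratio" of the actual data size. One possible reading is a proportional allocation, where each source contributes the same fraction (22 / 268, about 8.2%) of its size. The sketch below illustrates only that reading. The function name `allocate_budget` and the per-source sizes are hypothetical and are not taken from the README.

```python
def allocate_budget(source_sizes_gb, budget_gb):
    """Split a size budget across sources in proportion to each source's size."""
    total = sum(source_sizes_gb.values())
    if total <= 0:
        raise ValueError("total source size must be positive")
    # Every source keeps the same fraction: budget_gb / total.
    return {name: budget_gb * size / total for name, size in source_sizes_gb.items()}


# Hypothetical per-source sizes in GB, chosen only so that they sum to 268.
sizes = {"web_documents": 150.0, "books": 40.0, "translated": 30.0, "other": 48.0}

subset = allocate_budget(sizes, 22.0)
for name, gb in subset.items():
    print(f"{name}: {gb:.2f} GB")
```

Under this assumption the per-source shares always add up to the 22 GB budget, whatever the real source sizes are.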
73
 
74
 
75
  ## Benchmarks \- Bangla Text
 
77
  In this section, we report the results for __titulm-gemma-2-2b-v1.0__ models on standard automatic benchmarks. For all these evaluations, we used the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) evaluation library.
78
 
79
  ### Evaluation Datasets
80
+ We evaluated our pretrained models on both Bangla and English benchmark datasets. Although the model is trained on Bangla data, its English capability is also evaluated on English benchmark datasets. The evaluation datasets are as follows:
81
 
82
  #### Bangla Benchmark datasets
83
  We evaluated the models on the following datasets: