---
license: mit
language:
- en
- de
- fr
- it
---

This model is trained for document separation on printed reviews from zbMATH Open.

We had old scanned volumes of documents, dating back to the 1800s, which we wanted to convert to a machine-processable LaTeX format. We first converted all scanned documents to LaTeX using Mathpix, and then trained an LLM to match the metadata of each document with the corresponding part of the converted LaTeX (a single page can contain many documents).

1) Download LLaMA-Factory (I recommend this commit: https://github.com/hiyouga/LLaMA-Factory/tree/36039b0fe01c17ae30dba60e247d7ba8a1beb20a).
2) Save your dataset in the data folder and update dataset_info.json (see README.md and data/README.md).
3) Download this model (the LoRA adapter) to a local directory.
4) Run:

python3 -u LLaMA-Factory/src/train.py --stage sft --model_name_or_path DeepKarkhanis/Mistral-Passthrough-8L-10B --adapter_name_or_path {path_to_this_model} --finetuning_type lora --template default --dataset_dir LLaMA-Factory/data --eval_dataset {your_dataset_name} --cutoff_len 16000 --max_samples 100000 --per_device_eval_batch_size 1 --predict_with_generate True --max_new_tokens 16000 --top_p 0.7 --temperature 0.95 --output_dir {your_output_dir} --do_predict True
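
Steps 2 and 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the exact pipeline used for this model: it assumes the dataset is stored as an alpaca-style JSON list (`instruction`, `input`, `output`) and that a `dataset_info.json` entry with only `file_name` is enough for the commit you checked out (confirm this in data/README.md). The helper names `register_dataset` and `build_predict_command`, the dataset name `my_eval`, and the sample record are hypothetical; the command-line flags are copied from the command above.

```python
import json
import shlex
import tempfile
from pathlib import Path


def register_dataset(data_dir, name, records):
    """Write records to <data_dir>/<name>.json and list that file in dataset_info.json."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    file_name = f"{name}.json"
    (data_dir / file_name).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    info_path = data_dir / "dataset_info.json"
    # Keep existing entries if the file is already there.
    info = json.loads(info_path.read_text(encoding="utf-8")) if info_path.exists() else {}
    # Assumption: a bare "file_name" entry is read with the default alpaca columns.
    info[name] = {"file_name": file_name}
    info_path.write_text(json.dumps(info, ensure_ascii=False, indent=2), encoding="utf-8")
    return info_path


def build_predict_command(adapter_path, dataset_name, output_dir):
    """Return the prediction command from step 4 as an argument list."""
    return [
        "python3", "-u", "LLaMA-Factory/src/train.py",
        "--stage", "sft",
        "--model_name_or_path", "DeepKarkhanis/Mistral-Passthrough-8L-10B",
        "--adapter_name_or_path", str(adapter_path),
        "--finetuning_type", "lora",
        "--template", "default",
        "--dataset_dir", "LLaMA-Factory/data",
        "--eval_dataset", dataset_name,
        "--cutoff_len", "16000",
        "--max_samples", "100000",
        "--per_device_eval_batch_size", "1",
        "--predict_with_generate", "True",
        "--max_new_tokens", "16000",
        "--top_p", "0.7",
        "--temperature", "0.95",
        "--output_dir", str(output_dir),
        "--do_predict", "True",
    ]


if __name__ == "__main__":
    # Hypothetical record: the real prompt layout for this model is not documented here.
    sample = [{"instruction": "<metadata of one document>",
               "input": "<LaTeX of the scanned page>",
               "output": ""}]
    # Demo writes into a temporary folder; use "LLaMA-Factory/data" for a real run.
    with tempfile.TemporaryDirectory() as tmp:
        print(register_dataset(tmp, "my_eval", sample))
    print(shlex.join(build_predict_command("path/to/adapter", "my_eval", "predictions")))
```

Building the command as a list keeps every flag and value separate, so it can be passed directly to `subprocess.run` without shell quoting issues; `shlex.join` is used only to print a copy-pasteable form.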