This model is trained to separate individual printed reviews from zbMATH Open. We had old scanned volumes dating back to the 1800s that we wanted to convert into machine-processable LaTeX. We first converted all scanned pages to LaTeX with Mathpix, then trained an LLM to match each document's metadata with its part of the converted LaTeX, since a single page contains many documents.
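
To make the task concrete, one training or evaluation example could be laid out as in the sketch below. This is only an illustration under assumptions: the field names follow the Alpaca-style layout (`instruction`, `input`, `output`) that LLaMA-Factory accepts, while the instruction wording, the helper `make_record`, and the sample strings are hypothetical and not taken from the actual training data.

```python
import json


def make_record(metadata: str, page_latex: str, review_latex: str) -> dict:
    """Build one Alpaca-style example (instruction / input / output)."""
    return {
        # Hypothetical instruction wording; the prompt used in training
        # is not documented in this README.
        "instruction": "Extract the review that matches the metadata from the page.",
        # The input pairs the metadata of one document with the whole page.
        "input": f"Metadata:\n{metadata}\n\nPage:\n{page_latex}",
        # The target is the LaTeX of the single matching document.
        "output": review_latex,
    }


# Made-up placeholder strings, not real zbMATH Open content.
page = (
    "\\textbf{A. Author.} First review text.\n\n"
    "\\textbf{B. Author.} Second review text."
)
record = make_record(
    "Author: B. Author",
    page,
    "\\textbf{B. Author.} Second review text.",
)
print(json.dumps(record, indent=2))
```

A list of such records, saved as one JSON file, is the kind of dataset the steps below expect in the `data` folder.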

To run prediction with this model:
- Download LLaMA-Factory (I recommend this commit: https://github.com/hiyouga/LLaMA-Factory/tree/36039b0fe01c17ae30dba60e247d7ba8a1beb20a).
- Save your dataset in the `data` folder and update `dataset_info.json` (see README.md and data/README.md).
- Download this model (the LoRA adapter) to a local path.
- Run `python3 -u LLaMA-Factory/src/train.py --stage sft --model_name_or_path DeepKarkhanis/Mistral-Passthrough-8L-10B --adapter_name_or_path {path_to_this_model} --finetuning_type lora --template default --dataset_dir LLaMA-Factory/data --eval_dataset {your_dataset_name} --cutoff_len 16000 --max_samples 100000 --per_device_eval_batch_size 1 --predict_with_generate True --max_new_tokens 16000 --top_p 0.7 --temperature 0.95 --output_dir {your_output_dir} --do_predict True`
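
The dataset registration and the final command can also be scripted. The sketch below is a minimal illustration, not part of LLaMA-Factory: the helpers `register_dataset` and `build_predict_command`, the dataset name `zbmath_pages`, and the paths are hypothetical. It assumes that a `dataset_info.json` entry with only a `file_name` key is enough for an Alpaca-style file, so check data/README.md at the commit you use. The flags are copied from the command above.

```python
import json
import shlex
import tempfile
from pathlib import Path


def register_dataset(data_dir: Path, name: str, file_name: str) -> dict:
    """Add (or overwrite) one entry in data_dir/dataset_info.json."""
    info_path = data_dir / "dataset_info.json"
    info = {}
    if info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))
    # Assumption: a bare file_name entry is read as an Alpaca-style dataset.
    info[name] = {"file_name": file_name}
    info_path.write_text(json.dumps(info, indent=2), encoding="utf-8")
    return info


def build_predict_command(adapter_path: str, dataset_name: str,
                          output_dir: str,
                          factory_dir: str = "LLaMA-Factory") -> list:
    """Return the prediction command from the README as an argument list."""
    return [
        "python3", "-u", f"{factory_dir}/src/train.py",
        "--stage", "sft",
        "--model_name_or_path", "DeepKarkhanis/Mistral-Passthrough-8L-10B",
        "--adapter_name_or_path", adapter_path,
        "--finetuning_type", "lora",
        "--template", "default",
        "--dataset_dir", f"{factory_dir}/data",
        "--eval_dataset", dataset_name,
        "--cutoff_len", "16000",
        "--max_samples", "100000",
        "--per_device_eval_batch_size", "1",
        "--predict_with_generate", "True",
        "--max_new_tokens", "16000",
        "--top_p", "0.7",
        "--temperature", "0.95",
        "--output_dir", output_dir,
        "--do_predict", "True",
    ]


# Demo in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    info = register_dataset(Path(tmp), "zbmath_pages", "zbmath_pages.json")

cmd = build_predict_command("path/to/adapter", "zbmath_pages", "predictions")
print(shlex.join(cmd))
```

The script only prints the command. To execute it, pass the list to `subprocess.run(cmd, check=True)` from the directory that contains the LLaMA-Factory checkout.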