I am using mixedbread-ai/deepset-mxbai-embed-de-large-v1 embeddings for semantic search in a niche domain. The embeddings work well, but I would like to make them more domain-specific.
Annotated data is very hard to generate, but I have a large corpus (250 MB of raw text) of domain-specific documents.
If I continue pretraining, for example with TSDAE, will that essentially destroy the existing fine-tuning and leave me with something close to a raw pretrained model? The mixedbread model is based on xlm-roberta-large, so my worry is that continued pretraining would effectively give me back a (domain-adapted) xlm-roberta-large without the embedding fine-tuning. Is my understanding correct?
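For context, the TSDAE setup I have in mind follows the standard sentence-transformers recipe, roughly like this (just a sketch, not tested on my data; I'm not sure whether decoder_name_or_path should be the base xlm-roberta-large or the mixedbread model itself):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, datasets, losses

# Start from the already fine-tuned embedding model
model = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

# Unlabeled, domain-specific sentences from my corpus (placeholders here)
train_sentences = ["domain sentence 1", "domain sentence 2"]

# TSDAE: the dataset adds noise (word deletion) to each sentence, and the loss
# trains an encoder-decoder to reconstruct the original sentence from the noisy input
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)

train_loss = losses.DenoisingAutoEncoderLoss(
    model,
    decoder_name_or_path="xlm-roberta-large",  # base architecture; unsure if this is the right choice here
    tie_encoder_decoder=True,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    weight_decay=0,
    show_progress_bar=True,
)

model.save("mxbai-de-tsdae-domain")
```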
What are my options to make already fine-tuned embeddings more domain-specific without annotated data?
I read about freezing layers during pretraining or adding layers to models, but I have no idea whether that would work for my use case, or how I would set it up with sentence-transformers.
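From poking around, I imagine the freezing part would look roughly like this (a sketch only; model[0].auto_model is my assumption about where the underlying HF model lives, and freezing the lower 18 of 24 layers is an arbitrary choice for illustration):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

# model[0] is the Transformer module; auto_model is the wrapped HF xlm-roberta-large
hf_model = model[0].auto_model

# Freeze the embeddings and the lower encoder layers, keep the top layers trainable
freeze_up_to = 18  # arbitrary cut-off; xlm-roberta-large has 24 encoder layers
for param in hf_model.embeddings.parameters():
    param.requires_grad = False
for layer in hf_model.encoder.layer[:freeze_up_to]:
    for param in layer.parameters():
        param.requires_grad = False

# Sanity check: how many parameters would still be updated during training
trainable = sum(p.numel() for p in hf_model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```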
I’m new to language modeling, so I only know what I’ve read and heard, but I believe this is what is usually called transfer learning.
I thought layer freezing was essential to prevent catastrophic forgetting, but according to the following post, apparently it is not that critical?
If it’s a RoBERTa model, the original author is on HF, so you could send him a direct mention and ask him how to train it. You can reach him from here too. (@+username)