I am using mixedbread-ai/deepset-mxbai-embed-de-large-v1 embeddings for semantic search in a niche domain. The embeddings work well, but I would like to make them more domain-specific.
Annotated data is very hard to generate, but I have a large corpus (250 MB of raw text) of domain-specific documents.
If I continue pretraining, for example with TSDAE, will that essentially destroy the existing fine-tuning and leave me with something close to a raw pretrained model? The mixedbread model is based on xlm-roberta-large, so my worry is that continued pretraining would effectively give me back a (domain-adapted) xlm-roberta-large without the embedding fine-tuning. Is my understanding correct?
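For context, the TSDAE setup I have in mind follows the standard sentence-transformers recipe, roughly like this (just a sketch, not tested on my data; I'm not sure whether decoder_name_or_path should be the base xlm-roberta-large or the mixedbread model itself):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, datasets, losses

# Start from the already fine-tuned embedding model
model = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

# Unlabeled, domain-specific sentences from my corpus (placeholders here)
train_sentences = ["domain sentence 1", "domain sentence 2"]

# TSDAE: the dataset adds noise (word deletion) to each sentence, and the loss
# trains an encoder-decoder to reconstruct the original sentence from the noisy input
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)

train_loss = losses.DenoisingAutoEncoderLoss(
    model,
    decoder_name_or_path="xlm-roberta-large",  # base architecture; unsure if this is the right choice here
    tie_encoder_decoder=True,
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    weight_decay=0,
    show_progress_bar=True,
)

model.save("mxbai-de-tsdae-domain")
```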
What are my options to make already fine-tuned embeddings more domain-specific without annotated data?
I read about freezing layers during pretraining or adding layers to models, but I have no idea whether that would work for my use case, or how I would set it up with sentence-transformers.
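From poking around, I imagine the freezing part would look roughly like this (a sketch only; model[0].auto_model is my assumption about where the underlying HF model lives, and freezing the lower 18 of 24 layers is an arbitrary choice for illustration):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

# model[0] is the Transformer module; auto_model is the wrapped HF xlm-roberta-large
hf_model = model[0].auto_model

# Freeze the embeddings and the lower encoder layers, keep the top layers trainable
freeze_up_to = 18  # arbitrary cut-off; xlm-roberta-large has 24 encoder layers
for param in hf_model.embeddings.parameters():
    param.requires_grad = False
for layer in hf_model.encoder.layer[:freeze_up_to]:
    for param in layer.parameters():
        param.requires_grad = False

# Sanity check: how many parameters would still be updated during training
trainable = sum(p.numel() for p in hf_model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```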
I’m new to language modeling, so I only know what I’ve read and heard, but I believe this is what is usually called transfer learning.
I thought layer freezing was essential to prevent catastrophic forgetting, but according to the following post, apparently it is not that critical?
If it’s a RoBERTa model, the original author is on HF, so you could send him a direct mention and ask him how to train it. You can reach him from here too. (@+username)