Email-tuned BGE-M3

This is a fine-tuned version of BAAI/bge-m3 optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for various email-related queries.

Model Description

  • Model Type: Embedding model (encoder-only)
  • Base Model: BAAI/bge-m3
  • Languages: English, Korean
  • Domain: Email content, business communication
  • Training Data: Mixed-language email dataset with various types of queries (metadata, long-form, short-form, yes/no questions)

Quickstart

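# Requires: pip install langchain-community sentence-transformers faiss-cpu
# (older LangChain releases exposed the same classes under `langchain.*`)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document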

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="doubleyyh/email-tuned-bge-m3",
    model_kwargs={'device': 'cuda'},  # use 'cpu' if no GPU is available
    encode_kwargs={'normalize_embeddings': True}
)

# Example emails
emails = [
    {
        "subject": "회의 일정 변경 안내",  # "Notice of meeting schedule change"
        "from": [["김철수", "[email protected]"]],
        "to": [["이영희", "[email protected]"]],
        "cc": [["박지원", "[email protected]"]],
        "date": "2024-03-26T10:00:00",
        # Body: "Hello, I would like to move tomorrow's project meeting to 2 PM."
        "text_body": "안녕하세요, 내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다."
    },
    {
        "subject": "Project Timeline Update",
        "from": [["John Smith", "[email protected]"]],
        "to": [["Team", "[email protected]"]],
        "cc": [],
        "date": "2024-03-26T11:30:00",
        "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
    }
]

# Format emails into documents
docs = []
for email in emails:
    # Format email content
    content = "\n".join([f"{k}: {v}" for k, v in email.items()])
    docs.append(Document(page_content=content))

# Create FAISS index
db = FAISS.from_documents(docs, embeddings)

# Query examples (supports both Korean and English)
queries = [
    "회의 시간이 언제로 변경되었나요?",  # "What time was the meeting moved to?"
    "When is the meeting rescheduled?",
    "프로젝트 일정",  # "project schedule"
    "Q2 milestones"
]

# Perform similarity search
for query in queries:
    print(f"\nQuery: {query}")
    results = db.similarity_search(query, k=1)
    print(f"Most relevant email:\n{results[0].page_content[:200]}...")
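
The Quickstart joins each field with f"{k}: {v}", so address fields such as from, to, and cc are embedded as Python list literals. If cleaner text is preferred, a small formatter along the following lines could be used instead. This is only a sketch: format_addresses and format_email are hypothetical helper names, the addresses are placeholders, and the source does not document the text layout used during fine-tuning, so whether this layout retrieves better than the original one is an assumption to check on your own data.

```python
from typing import Any, Dict, List, Sequence


def format_addresses(pairs: Sequence[Sequence[str]]) -> str:
    """Render [[name, address], ...] as 'Name <address>, ...'."""
    return ", ".join(f"{name} <{addr}>" for name, addr in pairs)


def format_email(email: Dict[str, Any]) -> str:
    """Flatten one email dict (Quickstart schema) into plain text."""
    lines: List[str] = []
    for key, value in email.items():
        if key in ("from", "to", "cc"):
            # Address fields hold [name, address] pairs; skip empty ones.
            if not value:
                continue
            value = format_addresses(value)
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


sample = {
    "subject": "Project Timeline Update",
    "from": [["John Smith", "john@example.com"]],  # placeholder address
    "to": [["Team", "team@example.com"]],          # placeholder address
    "cc": [],
    "date": "2024-03-26T11:30:00",
    "text_body": "Hi team, I'm writing to update you on the Q2 project milestones.",
}
formatted = format_email(sample)
print(formatted)
```

The result can be passed to Document(page_content=format_email(email)) in place of the joined string used in the Quickstart.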

Intended Use & Limitations

Intended Use

  • Email content retrieval
  • Similar document search in email corpora
  • Question answering over email content
  • Multi-language email search systems
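
For the retrieval and similar-document use cases above, the underlying operation is nearest-neighbour search over embedding vectors. The sketch below shows that step in plain Python, independent of any vector store. With normalize_embeddings=True, as in the Quickstart, vectors have unit length and the cosine reduces to a plain dot product. The names cosine, top_k, and toy_embed are hypothetical; toy_embed is a stand-in letter-count encoder so the example runs without downloading the model, and it is not how the real model encodes text.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: str,
    texts: Sequence[str],
    embed: Callable[[List[str]], Sequence[Vector]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k texts most similar to query."""
    query_vec = embed([query])[0]
    doc_vecs = embed(list(texts))
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in encoder (NOT the real model): counts of the letters a-z."""
    vectors = []
    for text in texts:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        vectors.append(counts)
    return vectors


corpus = ["Project Timeline Update", "Lunch menu for Friday", "Q2 milestones"]
hits = top_k("project timeline", corpus, toy_embed, k=2)
print(hits)  # indices 0 and 2, highest similarity first
```

With the real model, embed could be embeddings.embed_documents from the Quickstart, which takes a list of strings and returns one float vector per string.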

Limitations

  • Performance may vary for domains outside of email content
  • Best suited to business communication
  • Although the model supports both English and Korean, retrieval quality may differ between the two languages
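
Because retrieval quality may differ between English and Korean, it can be worth measuring it separately per language on a small labelled set of your own. The sketch below computes hit rate at k (the fraction of queries whose expected document appears in the top k results), grouped by language. All names here (hit_rate_by_language, search, the language tags, the mail IDs, and the toy rankings) are hypothetical; search stands for any function that returns ranked document IDs, for example a thin wrapper around the FAISS index from the Quickstart.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# One labelled example: (language tag, query text, ID of the expected email).
Example = Tuple[str, str, str]


def hit_rate_by_language(
    examples: Iterable[Example],
    search: Callable[[str, int], List[str]],
    k: int = 5,
) -> Dict[str, float]:
    """Fraction of queries per language whose expected ID is in the top k."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for lang, query, expected_id in examples:
        totals[lang] += 1
        if expected_id in search(query, k)[:k]:
            hits[lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Toy stand-in for a real retriever: a fixed ranking per query.
fake_rankings = {
    "Q2 milestones": ["mail-2", "mail-1"],
    "When is the meeting rescheduled?": ["mail-2", "mail-1"],
    "프로젝트 일정": ["mail-1", "mail-2"],
}


def fake_search(query: str, k: int) -> List[str]:
    """Return the first k IDs of the canned ranking for this query."""
    return fake_rankings.get(query, [])[:k]


labelled = [
    ("en", "Q2 milestones", "mail-2"),
    ("en", "When is the meeting rescheduled?", "mail-1"),
    ("ko", "프로젝트 일정", "mail-1"),
]
rates = hit_rate_by_language(labelled, fake_search, k=1)
print(rates)  # {'en': 0.5, 'ko': 1.0}
```

A large gap between the per-language rates on real data would suggest collecting more examples for the weaker language before relying on the model there.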

Citation

@misc{email-tuned-bge-m3,
  author = {doubleyyh},
  title = {Email-tuned BGE-M3: Fine-tuned Embedding Model for Email Content},
  year = {2024},
  publisher = {HuggingFace}
}

License

This model follows the same license as the base model (bge-m3).

Contact

For questions or feedback, please open an issue in the GitHub repository or get in touch through Hugging Face.
