Model Card for PhysBERT

PhysBERT is a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature. Trained on 1.2 million physics papers, it outperforms general-purpose models on physics-specific tasks.

Model Description

PhysBERT is a BERT-based text embedding model for physics, fine-tuned with SimCSE to produce sentence embeddings suited to physics text. It supports retrieval, categorization, and analysis of physics literature, and achieves higher relevance and accuracy than general-purpose models on domain-specific NLP tasks. This is the cased version; the uncased version can be found here.

Training Data

The model was trained on a 40 GB corpus of physics publications from arXiv, comprising 1.2 million documents and refined for scientific accuracy.

Training Procedure

The model was pre-trained using Masked Language Modeling (MLM) and fine-tuned with SimCSE for sentence embeddings.
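For readers unfamiliar with SimCSE, the contrastive objective behind the fine-tuning stage can be sketched as below. This is a minimal illustration, not the authors' training code: the function name `simcse_loss` is hypothetical, the temperature of 0.05 is the default from the SimCSE paper rather than a value stated in this card, and the card does not say whether the supervised or unsupervised variant was used.

```python
import torch
import torch.nn.functional as F


def simcse_loss(z1, z2, temperature=0.05):
    """Contrastive (InfoNCE) loss with in-batch negatives.

    z1[i] and z2[i] are two embeddings of the same sentence, for example
    from two forward passes with different dropout masks. Every other row
    of z2 serves as a negative for z1[i].
    The temperature value is an assumption, not taken from this model card.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Pairwise cosine similarities, scaled by the temperature
    logits = z1 @ z2.T / temperature
    # The matching pair for row i sits on the diagonal
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```

The loss is small when each sentence is closest to its own second view and large when it is closer to another sentence in the batch, which pushes embeddings of different sentences apart.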

Example of Usage

from transformers import AutoTokenizer, AutoModel
import torch

# Load PhysBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")

# Sample text to embed
sample_text = "Electrons exhibit both particle and wave-like behavior."

# Tokenize the input text and run the model without tracking gradients
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Extract the token embeddings, shape (1, sequence_length, hidden_size)
token_embeddings = outputs.last_hidden_state
# Drop the [CLS] and [SEP] tokens, then take the mean for the sentence embedding.
# This slice assumes a single, unpadded input.
token_embeddings = token_embeddings[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
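The slice above works for a single sentence, but in a padded batch the last position is not always [SEP]. One way to extend the same pooling to batches is sketched below. The helper names `masked_mean_pool`, `embed`, and `rank_by_cosine` are hypothetical and not part of the PhysBERT release; the sketch assumes a Hugging Face tokenizer that supports `return_special_tokens_mask=True`.

```python
import torch
import torch.nn.functional as F


def masked_mean_pool(last_hidden_state, attention_mask, special_tokens_mask):
    """Average token vectors, skipping padding and special tokens ([CLS], [SEP])."""
    keep = attention_mask.bool() & ~special_tokens_mask.bool()
    keep = keep.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * keep).sum(dim=1)
    counts = keep.sum(dim=1).clamp(min=1.0)  # guard against empty rows
    return summed / counts


def embed(texts, tokenizer, model):
    """Embed a list of strings with a tokenizer/model pair loaded as above."""
    enc = tokenizer(texts, padding=True, truncation=True,
                    return_special_tokens_mask=True, return_tensors="pt")
    special = enc.pop("special_tokens_mask")  # not a model input
    with torch.no_grad():
        out = model(**enc)
    return masked_mean_pool(out.last_hidden_state, enc["attention_mask"], special)


def rank_by_cosine(query_embedding, doc_embeddings):
    """Return document indices sorted from most to least similar, plus the scores."""
    sims = F.cosine_similarity(query_embedding.unsqueeze(0), doc_embeddings, dim=1)
    return torch.argsort(sims, descending=True), sims
```

With the tokenizer and model from the example above, `embed(["query text", "abstract one", "abstract two"], tokenizer, model)` would return one row per input, and `rank_by_cosine(rows[0], rows[1:])` would order the abstracts by similarity to the query.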

Citation

If you find this work useful, please consider citing the following paper:

@article{10.1063/5.0238090,
    author = {Hellert, Thorsten and Montenegro, João and Pollastro, Andrea},
    title = "{PhysBERT: A text embedding model for physics scientific literature}",
    journal = {APL Machine Learning},
    volume = {2},
    number = {4},
    pages = {046105},
    year = {2024},
    month = {10},
    issn = {2770-9019},
    doi = {10.1063/5.0238090},
    url = {https://doi.org/10.1063/5.0238090},
    eprint = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0238090/20227307/046105\_1\_5.0238090.pdf},
}

Model Card Authors

Thorsten Hellert, João Montenegro, Andrea Pollastro

Model Card Contact

Thorsten Hellert, Lawrence Berkeley National Laboratory, [email protected]
