---
language: en
tags:
- protein
- protbert
- masked-language-modeling
- bioinformatics
- sequence-prediction
datasets:
- custom
license: mit
library_name: transformers
pipeline_tag: fill-mask
---

# ProtBERT-Unmasking

This model is a fine-tuned version of ProtBERT optimized for unmasking protein sequences: it predicts masked amino acids from the surrounding sequence context.

## Model Description

- **Base Model**: ProtBERT
- **Task**: Protein Sequence Unmasking
- **Training**: Fine-tuned on masked protein sequences
- **Use Case**: Predicting missing or masked amino acids in protein sequences
- **Optimal Use**: Best performance on E. coli sequences with known amino acids K, C, Y, H, S, M

For detailed information about the training methodology and approach, please refer to our paper: [https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)

## Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("your-username/protbert-sequence-unmasking")
tokenizer = AutoTokenizer.from_pretrained("your-username/protbert-sequence-unmasking")

# Example usage for E. coli sequence with known amino acids (K,C,Y,H,S,M).
# The base ProtBERT tokenizer expects residues separated by spaces.
sequence = "M A L N [MASK] K F G P [MASK] L V R K"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits  # shape: (batch, sequence_length, vocab_size)

# Decode the most likely amino acid at each masked position
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = predictions[0, mask_positions].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
```

## Inference API

The model is optimized for:

- **Organism**: E. coli
- **Known Amino Acids**: K, C, Y, H, S, M
- **Task**: Predicting unknown amino acids in a sequence

Example API usage:

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='your-username/protbert-sequence-unmasking')
sequence = "K [MASK] Y H S [MASK]"  # Example with known amino acids K,Y,H,S
results = unmasker(sequence)
# With several [MASK] tokens, the pipeline returns one list of candidates per mask
for position, candidates in enumerate(results, start=1):
    for result in candidates:
        print(f"Mask {position}: predicted amino acid: {result['token_str']}, Score: {result['score']:.3f}")
```

## Limitations and Biases

- This model is specifically designed for protein sequence unmasking in E. coli
- Optimal performance is achieved when working with sequences containing known amino acids K, C, Y, H, S, M
- The model may not perform optimally for:
  - Sequences from other organisms
  - Sequences without the specified known amino acids
  - Other protein-related tasks

## Training Details

The complete details of the training methodology, dataset preparation, and model evaluation can be found in our paper: [https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)
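
As a supplement to the usage examples, the following is a minimal sketch of how an input string might be prepared before tokenization. It rests on several assumptions that are not stated in this card: that the fine-tuned tokenizer keeps the base ProtBERT conventions (space-separated, upper-case residues, with the rare residues U, Z, O, B mapped to X, and `[MASK]` as the mask token), and that "known amino acids" means every other residue is replaced by `[MASK]`. The helper names `to_protbert_input` and `mask_unknown` are hypothetical and not part of any library.

```python
import re

MASK_TOKEN = "[MASK]"  # assumption: same mask token as the base ProtBERT tokenizer
KNOWN_RESIDUES = frozenset("KCYHSM")  # known amino acids listed in this card


def to_protbert_input(sequence: str) -> str:
    """Turn 'MALN[MASK]KF' into 'M A L N [MASK] K F' (hypothetical helper)."""
    compact = "".join(sequence.split())  # drop any existing whitespace
    tokens = re.findall(r"\[MASK\]|[A-Za-z]", compact)
    if "".join(tokens) != compact:
        raise ValueError(f"Unexpected characters in sequence: {sequence!r}")
    formatted = []
    for token in tokens:
        if token == MASK_TOKEN:
            formatted.append(token)
        else:
            # Upper-case the residue and map rare residues to X,
            # following the base ProtBERT preprocessing convention.
            formatted.append(re.sub(r"[UZOB]", "X", token.upper()))
    return " ".join(formatted)


def mask_unknown(sequence: str, known=KNOWN_RESIDUES) -> str:
    """Replace every residue outside `known` with [MASK] (assumed masking scheme)."""
    compact = "".join(sequence.split()).upper()
    if not compact.isalpha():
        raise ValueError(f"Expected only amino acid letters: {sequence!r}")
    return " ".join(res if res in known else MASK_TOKEN for res in compact)


if __name__ == "__main__":
    print(to_protbert_input("MALN[MASK]KFGP[MASK]LVRK"))
    print(mask_unknown("MALNKKFGP"))
```

The resulting strings can be passed to the tokenizer or to the `fill-mask` pipeline in place of a hand-written sequence. Whether masking every residue outside the known set matches the training setup should be checked against the paper.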