faceless-void committed (verified) · Commit 49354fb · 1 Parent(s): 7b34995

Upload README.md with huggingface_hub

Files changed (1): README.md (+49, −5)

README.md CHANGED
---
language: en
tags:
- protein
- protbert
- masked-language-modeling
- bioinformatics
- sequence-prediction
datasets:
- custom
license: mit
library_name: transformers
pipeline_tag: fill-mask
---

# ProtBERT-Unmasking

This model is a fine-tuned version of ProtBERT specifically optimized for unmasking protein sequences. It can predict masked amino acids in protein sequences based on the surrounding context.

- **Task**: Protein Sequence Unmasking
- **Training**: Fine-tuned on masked protein sequences
- **Use Case**: Predicting missing or masked amino acids in protein sequences
- **Optimal Use**: Best performance on E. coli sequences with known amino acids K, C, Y, H, S, M

For detailed information about the training methodology and approach, please refer to our paper:
[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("your-username/protbert-sequence-unmasking")
tokenizer = AutoTokenizer.from_pretrained("your-username/protbert-sequence-unmasking")

# Example usage for an E. coli sequence with known amino acids (K, C, Y, H, S, M)
sequence = "MALN[MASK]KFGP[MASK]LVRK"
inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
predictions = outputs.logits
```
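
The snippet above stops at the raw `predictions` logits. A minimal sketch of one way to decode them into concrete amino-acid candidates, continuing from the variables defined above (the top-3 cutoff is an arbitrary choice for illustration):

```python
import torch

# Locate the [MASK] positions in the encoded input
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

for pos in mask_positions:
    # Top 3 candidate tokens at each masked position
    top_ids = torch.topk(predictions[0, pos], k=3).indices
    print(f"Position {pos.item()}: {tokenizer.convert_ids_to_tokens(top_ids.tolist())}")
```

Note that the upstream Rostlab ProtBERT tokenizer expects residues separated by spaces ("M A L N [MASK] ..."); if this fine-tune keeps that tokenizer, sequences should be written in that form.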

## Inference API

The model is optimized for:
- **Organism**: E. coli
- **Known Amino Acids**: K, C, Y, H, S, M
- **Task**: Predicting unknown amino acids in a sequence

Example API usage:

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='your-username/protbert-sequence-unmasking')
sequence = "K[MASK]YHS[MASK]"  # Example with known amino acids K, Y, H, S
results = unmasker(sequence)

# With more than one [MASK], the pipeline returns one list of candidates per mask
for mask_candidates in results:
    best = mask_candidates[0]
    print(f"Predicted amino acid: {best['token_str']}, Score: {best['score']:.3f}")
```
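
The pipeline above runs the model locally. To call the hosted Inference API instead, a minimal sketch using `huggingface_hub.InferenceClient` (assuming the repository is public on the Hub; shown with a single [MASK] per request):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(model="your-username/protbert-sequence-unmasking")

# Each candidate carries the predicted token string and its score
for candidate in client.fill_mask("K[MASK]YHSM"):
    print(f"{candidate.token_str}: {candidate.score:.3f}")
```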

## Limitations and Biases

- This model is specifically designed for protein sequence unmasking in E. coli
- Optimal performance is achieved when working with sequences containing known amino acids K, C, Y, H, S, M
- The model may not perform optimally for:
  - Sequences from other organisms
  - Sequences without the specified known amino acids
  - Other protein-related tasks

## Training Details

The complete details of the training methodology, dataset preparation, and model evaluation can be found in our paper:
[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)