---
license: cc-by-nc-2.0
library_name: transformers
datasets:
- BIOGRID
- Negatome
pipeline_tag: text-classification
tags:
- protein language model
- biology
widget:
- text: >-
    M S H S V K I Y D T C I G C T Q C V R A C P T D V L E M I P W G G C K A K Q
    I A S A P R T E D C V G C K R C E S A C P T D F L S V R V Y L W H E T T R S
    M G L A Y [SEP] M I N L P S L F V P L V G L L F P A V A M A S L F L H V E K
    R L L F S T K K I N
  example_title: Non-interacting proteins
- text: >-
    M S I N I C R D N H D P F Y R Y K M P P I Q A K V E G R G N G I K T A V L N
    V A D I S H A L N R P A P Y I V K Y F G F E L G A Q T S I S V D K D R Y L V
    N G V H E P A K L Q D V L D G F I N K F V L C G S C K N P E T E I I I T K D
    N D L V R D C K A C G K R T P M D L R H K L S S F I L K N P P D S V S G S K
    K K K K A A T A S A N V R G G G L S I S D I A Q G K S Q N A P S D G T G S S
    T P Q H H D E D E D E L S R Q I K A A A S T L E D I E V K D D E W A V D M S
    E E A I R A R A K E L E V N S E L T Q L D E Y G E W I L E Q A G E D K E N L
    P S D V E L Y K K A A E L D V L N D P K I G C V L A Q C L F D E D I V N E I
    A E H N A F F T K I L V T P E Y E K N F M G G I E R F L G L E H K D L I P L
    L P K I L V Q L Y N N D I I S E E E I M R F G T K S S K K F V P K E V S K K
    V R R A A K P F I T W L E T A E S D D D E E D D E [SEP] M S I E N L K S F D
    P F A D T G D D E T A T S N Y I H I R I Q Q R N G R K T L T T V Q G V P E E
    Y D L K R I L K V L K K D F A C N G N I V K D P E M G E I I Q L Q G D Q R A
    K V C E F M I S Q L G L Q K K N I K I H G F
  example_title: Interacting proteins
---
<img src="https://cdn-uploads.huggingface.co/production/uploads/62f2bd3bdb7cbd214b658c48/Ro4uhQDurP-x7IHJj11xa.png" width="350">

## Model description

SYNTERACT (SYNThetic data-driven protein-protein intERACtion Transformer) is a fine-tuned version of [ProtBERT](https://huggingface.co/Rostlab/prot_bert_bfd) that attends to two amino acid sequences separated by [SEP] to determine whether they plausibly interact in a biological context.

We trained our model on the multivalidated physical interaction dataset from BioGRID, Negatome, and synthetic negative samples. Check out our [preprint](https://www.biorxiv.org/content/10.1101/2023.06.07.544109v1.full) for more details.

SYNTERACT achieved unprecedented performance across a broad phylogeny, with 92-96% accuracy on real, unseen examples, and is already being used to accelerate drug target screening and peptide therapeutic design.


## How to use

```python
# Imports
import re
import torch
import torch.nn.functional as F
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained('GleghornLab/SYNTERACT') # load model
tokenizer = BertTokenizer.from_pretrained('GleghornLab/SYNTERACT') # load tokenizer
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # gather device
model.to(device) # move to device
model.eval() # put in eval mode

sequence_a = 'MEKSCSIGNGREQYGWGHGEQCGTQFLECVYRNASMYSVLGDLITYVVFLGATCYAILFGFRLLLSCVRIVLKVVIALFVIRLLLALGSVDITSVSYSG' # Uniprot A1Z8T3
sequence_b = 'MRLTLLALIGVLCLACAYALDDSENNDQVVGLLDVADQGANHANDGAREARQLGGWGGGWGGRGGWGGRGGWGGRGGWGGRGGWGGGWGGRGGWGGRGGGWYGR' # Uniprot A1Z8H0
# replace rare amino acids (U, Z, O, B) with X, then put a space between every amino acid
sequence_a = ' '.join(re.sub(r'[UZOB]', 'X', sequence_a))
sequence_b = ' '.join(re.sub(r'[UZOB]', 'X', sequence_b))
example = sequence_a + ' [SEP] ' + sequence_b # add SEP token

example = tokenizer(example, return_tensors='pt', padding=False).to(device) # tokenize example
with torch.no_grad():
    logits = model(**example).logits.cpu().detach() # get logits from model

probability = F.softmax(logits, dim=-1) # use softmax to get "confidence" in the prediction
prediction = probability.argmax(dim=-1) # 0 for no interaction, 1 for interaction
```
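
If you need to score many pairs, the string preparation above can be wrapped in a small helper. The sketch below is only an illustration: `format_pair` is a hypothetical name that is not part of this repository, and stripping whitespace and upper-casing the input are our assumptions about how raw sequences might need to be normalized.

```python
import re

def format_pair(sequence_a: str, sequence_b: str) -> str:
    """Build the 'A A A [SEP] B B B' input string used in the example above.

    Hypothetical helper, not part of the SYNTERACT repository.
    """
    def space_out(sequence: str) -> str:
        # Assumption: trim whitespace and upper-case before cleaning.
        cleaned = re.sub(r'[UZOB]', 'X', sequence.strip().upper())
        # Joining a string with ' ' puts one space between every residue.
        return ' '.join(cleaned)

    return space_out(sequence_a) + ' [SEP] ' + space_out(sequence_b)

# Toy sequences, for illustration only.
pairs = [('MKU', 'AZ'), ('MSHS', 'MINL')]
inputs = [format_pair(a, b) for a, b in pairs]
print(inputs[0])  # M K X [SEP] A X
```

A list built this way could then be passed to the tokenizer in one call, for example with `padding=True` so that pairs of different lengths fit in a single batch.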

## Intended use and limitations
We define a protein-protein interaction as physical contact that mediates chemical or conformational change, especially with non-generic function. However, due to SYNTERACT's propensity to predict false positives, we believe that it identifies plausible conformational changes caused by interactions without relevance to function.
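
Given the tendency toward false positives, one option is to require a minimum interaction probability rather than taking the argmax. The sketch below is not a method from the preprint: `call_interaction` and the 0.9 threshold are hypothetical, and any real threshold would need to be calibrated on your own validation data. It uses plain Python so it runs without the model; the logits are assumed to be ordered `[no interaction, interaction]`, as in the example above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def call_interaction(logits, threshold=0.9):
    """Return (is_interaction, p_interaction) for one pair.

    Hypothetical helper; the default threshold is an illustrative
    assumption, not a recommended value.
    """
    p_interaction = softmax(logits)[1]
    return p_interaction >= threshold, p_interaction

# Made-up logits: argmax would call this pair an interaction,
# but the stricter threshold does not.
is_hit, p = call_interaction([0.2, 1.5])
print(is_hit, round(p, 3))
```

Raising the threshold trades recall for precision, so fewer true interactions are reported along with fewer false positives.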

## Our lab
The [Gleghorn lab](https://www.gleghornlab.com/) is an interdisciplinary research group at the University of Delaware that focuses on solving translational problems with our expertise in engineering, biology, and chemistry. We develop inexpensive and reliable tools to study organ development, maternal-fetal health, and drug delivery. Recently, we have begun exploring protein language models and strive to make protein design and annotation accessible.

## Please cite
```
@article {Hallee_ppi_2023,
	author = {Logan Hallee and Jason P. Gleghorn},
	title = {Protein-Protein Interaction Prediction is Achievable with Large Language Models},
	year = {2023},
	doi = {10.1101/2023.06.07.544109},
	publisher = {Cold Spring Harbor Laboratory},
	journal = {bioRxiv}
}
```