AmelieSchreiber committed on
Commit
1e414d1
1 Parent(s): 65808d9

Upload testing_and_inference.ipynb

Files changed (1)
  1. testing_and_inference.ipynb +324 -0
testing_and_inference.ipynb ADDED
@@ -0,0 +1,324 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "04879e6b-3718-4d23-90fa-35e5c8956861",
+ "metadata": {},
+ "source": [
+ "# ESMB for Protein Binding Residue Prediction\n",
+ "\n",
+ "**ESMBind** (or ESMB) predicts which residues in a protein sequence are binding sites or active sites. The `ESMB_35M` series of models consists of Low Rank Adaptation (LoRA) finetuned versions of the protein language model `esm2_t12_35M_UR50D`. The models were finetuned on ~549K protein sequences and appear to achieve competitive performance compared to current SOTA geometric/structural models, surpassing many of them on certain classes and metrics. These models have an especially high recall, meaning they are highly likely to discover most binding residues. \n",
+ "\n",
+ "However, they have relatively low precision, meaning they may also return false positives. Their MCC, AUC, and accuracy values are also quite high. We hope that scaling the model and dataset size in a 1-to-1 fashion will achieve SOTA performance, but the simplicity of the models and training procedure already makes them attractive: the domain knowledge and data preparation required to use them are modest, so the barrier to entry for using these models in practice is low. \n",
+ "\n",
+ "These models also predict binding residues from sequence alone. Since most proteins do not yet have predicted 3D folds or backbone structures, we hope this resource will be valuable to the community for this reason as well. Before we proceed, we need to run a few pip install statements. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b8df2453-f478-4ef5-b69f-633aff114438",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install transformers -q \n",
+ "!pip install accelerate -q \n",
+ "!pip install peft -q \n",
+ "!pip install datasets -q "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bc67c0c9-6847-432d-89c0-2465981e7ebe",
+ "metadata": {},
+ "source": [
+ "## Running Inference \n",
+ "\n",
+ "To run inference, simply replace the protein sequence below with your own. Afterwards, you'll be guided through how to check the train/test metrics on the datasets yourself. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "84d3cdeb-1a77-425c-8641-123107589868",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import AutoModelForTokenClassification, AutoTokenizer\n",
+ "from peft import PeftModel\n",
+ "import torch\n",
+ "\n",
+ "# Path to the saved LoRA model\n",
+ "model_path = \"AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp3\"\n",
+ "# ESM2 base model\n",
+ "base_model_path = \"facebook/esm2_t12_35M_UR50D\"\n",
+ "\n",
+ "# Load the model\n",
+ "base_model = AutoModelForTokenClassification.from_pretrained(base_model_path)\n",
+ "loaded_model = PeftModel.from_pretrained(base_model, model_path)\n",
+ "\n",
+ "# Ensure the model is in evaluation mode\n",
+ "loaded_model.eval()\n",
+ "\n",
+ "# Load the tokenizer\n",
+ "loaded_tokenizer = AutoTokenizer.from_pretrained(base_model_path)\n",
+ "\n",
+ "# Protein sequence for inference (replace with your own)\n",
+ "protein_sequence = \"MAVPETRPNHTIYINNLNEKIKKDELKKSLHAIFSRFGQILDILVSRSLKMRGQAFVIFKEVSSATNALRSMQGFPFYDKPMRIQYAKTDSDIIAKMKGT\" # @param {type:\"string\"}\n",
+ "\n",
+ "# Tokenize the sequence\n",
+ "inputs = loaded_tokenizer(protein_sequence, return_tensors=\"pt\", truncation=True, max_length=1024, padding='max_length')\n",
+ "\n",
+ "# Run the model\n",
+ "with torch.no_grad():\n",
+ "    logits = loaded_model(**inputs).logits\n",
+ "\n",
+ "# Get predictions\n",
+ "tokens = loaded_tokenizer.convert_ids_to_tokens(inputs[\"input_ids\"][0])  # Convert input ids back to tokens\n",
+ "predictions = torch.argmax(logits, dim=2)\n",
+ "\n",
+ "# Define labels\n",
+ "id2label = {\n",
+ "    0: \"No binding site\",\n",
+ "    1: \"Binding site\"\n",
+ "}\n",
+ "\n",
+ "# Print the predicted labels for each token\n",
+ "for token, prediction in zip(tokens, predictions[0].numpy()):\n",
+ "    if token not in ['<pad>', '<cls>', '<eos>']:\n",
+ "        print((token, id2label[prediction]))"
+ ]
+ },
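+ {
+ "cell_type": "markdown",
+ "id": "f0e1d2c3-0001-4000-8000-000000000001",
+ "metadata": {},
+ "source": [
+ "Optionally, the next cell is a minimal sketch that collects the predicted binding-site positions into a list of 1-based residue indices. It assumes the inference cell above has just been run (so `tokens`, `predictions`, `id2label`, and `protein_sequence` are defined); the variable name `binding_site_positions` is purely illustrative."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f0e1d2c3-0001-4000-8000-000000000002",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: collect the predicted binding-site residue positions (1-based\n",
+ "# indices into protein_sequence). Assumes the inference cell above was run.\n",
+ "binding_site_positions = []\n",
+ "position = 0\n",
+ "for token, prediction in zip(tokens, predictions[0].numpy()):\n",
+ "    if token in ['<pad>', '<cls>', '<eos>']:\n",
+ "        continue  # skip special/padding tokens so positions match the sequence\n",
+ "    position += 1\n",
+ "    if id2label[prediction] == \"Binding site\":\n",
+ "        binding_site_positions.append(position)\n",
+ "\n",
+ "print(f\"Predicted {len(binding_site_positions)} binding residues out of {len(protein_sequence)}\")\n",
+ "print(binding_site_positions)"
+ ]
+ },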
+ {
+ "cell_type": "markdown",
+ "id": "38cebb36-6758-4e7d-b82d-e43cabfbe798",
+ "metadata": {},
+ "source": [
+ "## Train/Test Metrics\n",
+ "\n",
+ "### Loading and Tokenizing the Datasets\n",
+ "\n",
+ "To use this notebook to run the model on the train/test split and get the various metrics (accuracy, precision, recall, F1 score, AUC, and MCC), you will need to download the pickle files [found on Hugging Face here](https://huggingface.co/datasets/AmelieSchreiber/binding_sites_random_split_by_family_550K). Navigate to \"Files and versions\" and download the four pickle files (you can ignore the TSV files unless you want to preprocess the data in a different way yourself). Once you have downloaded the pickle files, change the four pickle file paths in the cell below to match their local paths on your machine, then run the cell. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "763eba61-fd1e-45d5-a427-0075e46c6293",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(Dataset({\n",
+ " features: ['input_ids', 'attention_mask', 'labels'],\n",
+ " num_rows: 450330\n",
+ " }),\n",
+ " Dataset({\n",
+ " features: ['input_ids', 'attention_mask', 'labels'],\n",
+ " num_rows: 113475\n",
+ " }))"
+ ]
+ },
+ "execution_count": 1,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "from datasets import Dataset\n",
+ "from transformers import AutoTokenizer\n",
+ "import pickle\n",
+ "\n",
+ "# Load tokenizer\n",
+ "tokenizer = AutoTokenizer.from_pretrained(\"facebook/esm2_t12_35M_UR50D\")\n",
+ "\n",
+ "# Function to truncate labels\n",
+ "def truncate_labels(labels, max_length):\n",
+ "    \"\"\"Truncate labels to the specified max_length.\"\"\"\n",
+ "    return [label[:max_length] for label in labels]\n",
+ "\n",
+ "# Set the maximum sequence length\n",
+ "max_sequence_length = 1000\n",
+ "\n",
+ "# Load the data from pickle files (change to match your local paths)\n",
+ "with open(\"train_sequences_chunked_by_family.pkl\", \"rb\") as f:\n",
+ "    train_sequences = pickle.load(f)\n",
+ "with open(\"test_sequences_chunked_by_family.pkl\", \"rb\") as f:\n",
+ "    test_sequences = pickle.load(f)\n",
+ "with open(\"train_labels_chunked_by_family.pkl\", \"rb\") as f:\n",
+ "    train_labels = pickle.load(f)\n",
+ "with open(\"test_labels_chunked_by_family.pkl\", \"rb\") as f:\n",
+ "    test_labels = pickle.load(f)\n",
+ "\n",
+ "# Tokenize the sequences\n",
+ "train_tokenized = tokenizer(train_sequences, padding=True, truncation=True, max_length=max_sequence_length, return_tensors=\"pt\", is_split_into_words=False)\n",
+ "test_tokenized = tokenizer(test_sequences, padding=True, truncation=True, max_length=max_sequence_length, return_tensors=\"pt\", is_split_into_words=False)\n",
+ "\n",
+ "# Truncate the labels to match the tokenized sequence lengths\n",
+ "train_labels = truncate_labels(train_labels, max_sequence_length)\n",
+ "test_labels = truncate_labels(test_labels, max_sequence_length)\n",
+ "\n",
+ "# Create train and test datasets\n",
+ "train_dataset = Dataset.from_dict({k: v for k, v in train_tokenized.items()}).add_column(\"labels\", train_labels)\n",
+ "test_dataset = Dataset.from_dict({k: v for k, v in test_tokenized.items()}).add_column(\"labels\", test_labels)\n",
+ "\n",
+ "train_dataset, test_dataset\n"
+ ]
+ },
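+ {
+ "cell_type": "markdown",
+ "id": "f0e1d2c3-0002-4000-8000-000000000001",
+ "metadata": {},
+ "source": [
+ "Before computing metrics, you can optionally peek at one tokenized example to confirm the data loaded as expected. The cell below is a minimal sketch that assumes the datasets were just built above; it only prints a few fields of the first training example."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f0e1d2c3-0002-4000-8000-000000000002",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: quick sanity check on the first training example, assuming the\n",
+ "# datasets were built in the cell above.\n",
+ "example = train_dataset[0]\n",
+ "print(\"First 20 token ids:\", example[\"input_ids\"][:20])\n",
+ "print(\"Decoded tokens:    \", tokenizer.decode(example[\"input_ids\"][:20]))\n",
+ "print(\"First 20 labels:   \", example[\"labels\"][:20])\n",
+ "print(\"Non-padding tokens:\", sum(example[\"attention_mask\"]))\n",
+ "print(\"Label length:      \", len(example[\"labels\"]))"
+ ]
+ },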
+ {
+ "cell_type": "markdown",
+ "id": "c56556a3-93a5-45c6-935d-dc959b18c608",
+ "metadata": {},
+ "source": [
+ "### Getting the Train/Test Metrics\n",
+ "\n",
+ "Next, run the following cell. Depending on your hardware, this may take a while, since there are ~549K protein sequences to process in total. The train dataset will naturally take much longer than the test dataset; be patient and let both complete to see both the train and test metrics."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "65dd11e8-f502-44cd-b439-a593bf4d5019",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Some weights of EsmForTokenClassification were not initialized from the model checkpoint at facebook/esm2_t12_35M_UR50D and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
+ "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f110a2bca7314f278e1b97a37f4ab033",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading (…)/adapter_config.json: 0%| | 0.00/457 [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "2bd08fb8fcb644d080746c42dc4d77d1",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading adapter_model.bin: 0%| | 0.00/307k [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ " <div>\n",
+ " \n",
+ " <progress value='200' max='56292' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
+ " [ 200/56292 01:32 < 7:13:37, 2.16 it/s]\n",
+ " </div>\n",
+ " "
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "from sklearn.metrics import (\n",
+ "    matthews_corrcoef,\n",
+ "    accuracy_score,\n",
+ "    precision_recall_fscore_support,\n",
+ "    roc_auc_score\n",
+ ")\n",
+ "import numpy as np  # needed for np.argmax below\n",
+ "from peft import PeftModel\n",
+ "from transformers import DataCollatorForTokenClassification, AutoModelForTokenClassification\n",
+ "from transformers import Trainer\n",
+ "from accelerate import Accelerator\n",
+ "\n",
+ "# Instantiate the accelerator\n",
+ "accelerator = Accelerator()\n",
+ "\n",
+ "# Define paths to the LoRA and base models\n",
+ "base_model_path = \"facebook/esm2_t12_35M_UR50D\"\n",
+ "lora_model_path = \"AmelieSchreiber/esm2_t12_35M_lora_binding_sites_v2_cp3\"  # Replace with the path to your LoRA model if different\n",
+ "\n",
+ "# Load the base model\n",
+ "base_model = AutoModelForTokenClassification.from_pretrained(base_model_path)\n",
+ "\n",
+ "# Load the LoRA model\n",
+ "model = PeftModel.from_pretrained(base_model, lora_model_path)\n",
+ "model = accelerator.prepare(model)  # Prepare the model using the accelerator\n",
+ "\n",
+ "# Define label mappings\n",
+ "id2label = {0: \"No binding site\", 1: \"Binding site\"}\n",
+ "label2id = {v: k for k, v in id2label.items()}\n",
+ "\n",
+ "# Create a data collator\n",
+ "data_collator = DataCollatorForTokenClassification(tokenizer)\n",
+ "\n",
+ "# Define a function to compute the metrics\n",
+ "def compute_metrics(dataset):\n",
+ "    # Get the predictions using the trained model\n",
+ "    trainer = Trainer(model=model, data_collator=data_collator)\n",
+ "    predictions, labels, _ = trainer.predict(test_dataset=dataset)\n",
+ "\n",
+ "    # Remove padding and special tokens\n",
+ "    mask = labels != -100\n",
+ "    true_labels = labels[mask].flatten()\n",
+ "    flat_predictions = np.argmax(predictions, axis=2)[mask].flatten().tolist()\n",
+ "\n",
+ "    # Compute the metrics\n",
+ "    accuracy = accuracy_score(true_labels, flat_predictions)\n",
+ "    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, flat_predictions, average='binary')\n",
+ "    auc = roc_auc_score(true_labels, flat_predictions)\n",
+ "    mcc = matthews_corrcoef(true_labels, flat_predictions)  # Compute the MCC\n",
+ "\n",
+ "    return {\"accuracy\": accuracy, \"precision\": precision, \"recall\": recall, \"f1\": f1, \"auc\": auc, \"mcc\": mcc}  # Include the MCC in the returned dictionary\n",
+ "\n",
+ "# Get the metrics for the training and test datasets\n",
+ "train_metrics = compute_metrics(train_dataset)\n",
+ "test_metrics = compute_metrics(test_dataset)\n",
+ "\n",
+ "train_metrics, test_metrics"
+ ]
+ },
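+ {
+ "cell_type": "markdown",
+ "id": "f0e1d2c3-0003-4000-8000-000000000001",
+ "metadata": {},
+ "source": [
+ "If you only want a quick sanity check before committing to the full multi-hour run, the cell below is a minimal sketch that reuses `compute_metrics` on a small random subset of the test set. Subset metrics only approximate the full-dataset numbers; the subset size of 1000 is arbitrary."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f0e1d2c3-0003-4000-8000-000000000002",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: compute metrics on a small random subset of the test set as a quick\n",
+ "# check. Assumes the cell above has been run so compute_metrics and\n",
+ "# test_dataset are defined. The subset size (1000) is arbitrary.\n",
+ "quick_subset = test_dataset.shuffle(seed=42).select(range(1000))\n",
+ "quick_metrics = compute_metrics(quick_subset)\n",
+ "quick_metrics"
+ ]
+ },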
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d8cc0058-1f81-466d-9fed-4a7ef55ba11f",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.17"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }