---
title: Saccharomyces Pythia
emoji: π
colorFrom: purple
colorTo: gray
sdk: gradio
sdk_version: 4.25.0
app_file: app.py
pinned: true
license: apache-2.0
---
# saccharomyces-pythia: an ML/AI-integrated *Saccharomyces cerevisiae* assistant
## Table of Contents
1. [Introduction](#introduction)
2. [SacCerML: the base ML model](#saccerml-the-base-ml-model)
   - [Training](#training)
     * [Data and preprocessing](#data-and-preprocessing)
     * [Validation](#validation)
   - [Testing](#testing)
3. [saccharomyces-pythia: gene calling and AI integration](#saccharomyces-pythia-gene-calling-and-ai-integration)
4. [Try it out!](#try-it-out)
5. [References](#references)
6. [License](#license)
## Introduction
**saccharomyces-pythia** is the new, rebranded v1.0.0 of SacCerML. Initially conceived as a Python script that leveraged machine learning and bioinformatics tools to predict genes in *Saccharomyces cerevisiae* (baker's yeast) genomic sequences, it is now a complete, AI-integrated tool that can help researchers both as a chatbot and as an ORF predictor.
## SacCerML: the base ML model
### Training
#### Data and preprocessing
All the annotated coding DNA sequences for *S. cerevisiae* (strain S288C) were downloaded from the Saccharomyces Genome Database.
These genetic sequences were split according to their ORF classification (verified, dubious, uncharacterized, pseudogene and transposable element) and for each of them the following parameters were calculated:
- Codon Adaptation Index
- Checksum
After that, the DNA was translated into amino acids and further descriptors were computed from the protein sequences:
- Hydrophobicity
- Isoelectric point
- Aromaticity
- Instability
- Molecular weight
- Secondary structure percentage (helix, turn and sheet)
- Molar extinction (both oxidized and reduced)
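As a rough illustration of the feature-extraction step, the sketch below translates a coding sequence and computes two of the listed descriptors, aromaticity and hydrophobicity (as a GRAVY score on the Kyte-Doolittle scale). It is a dependency-free stand-in, not the project's code: Biopython, listed in the references, provides comparable descriptors, and the function names `translate`, `aromaticity` and `gravy` are hypothetical.

```python
# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Kyte-Doolittle hydropathy values, one per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def translate(cds):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" marks an unknown codon
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def aromaticity(protein):
    """Fraction of aromatic residues (F, W, Y) in the protein."""
    if not protein:
        return 0.0
    return sum(protein.count(aa) for aa in "FWY") / len(protein)


def gravy(protein):
    """Mean Kyte-Doolittle hydropathy over the residues covered by the scale."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein if aa in KYTE_DOOLITTLE]
    return sum(scores) / len(scores) if scores else 0.0


protein = translate("ATGTTTTGGTAA")  # toy sequence, not a real yeast ORF
features = {"aromaticity": aromaticity(protein), "gravy": gravy(protein)}
```

Each sequence would yield one such feature dictionary, that is, one row of the feature table.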
All the computed data were stored in a CSV file, which was used to train a supervised ML model: a Voting Classifier (implemented in the scikit-learn package) composed of a HistGradient Boosting Classifier, a Decision Tree Classifier and an Extra Tree Classifier.
#### Validation
The resulting machine-learning model (called SacCerML) was then evaluated on the entire training set, yielding 99.93% accuracy. A key component of the training was k-fold cross-validation. SacCerML was trained on increasingly large fractions of the training data and tested on the remainder: accuracy was high (>84%) in all the tests, as were recall, F1 and precision. The classification reports already showed a slight bias towards predicting verified and dubious ORFs, with more difficulty in predicting uncharacterized ORFs.
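To make the idea of a per-class bias concrete, the sketch below computes precision, recall and F1 for each class from true and predicted labels, which is the kind of breakdown a classification report gives. The helper `per_class_report` and the toy labels are hypothetical and are not taken from the project's results.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return precision, recall and F1 for every label seen in either list."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            true_pos[truth] += 1
        else:
            false_pos[guess] += 1  # predicted this class wrongly
            false_neg[truth] += 1  # missed the real class

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = true_pos[label] + false_pos[label]
        actual = true_pos[label] + false_neg[label]
        precision = true_pos[label] / predicted if predicted else 0.0
        recall = true_pos[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy labels: one uncharacterized ORF is mistaken for a verified one.
y_true = ["verified", "verified", "dubious", "uncharacterized", "uncharacterized"]
y_pred = ["verified", "verified", "dubious", "verified", "uncharacterized"]
report = per_class_report(y_true, y_pred)
```

In this toy case the recall for `uncharacterized` drops while the precision for `verified` falls, which is the signature of a model leaning towards the better-represented classes.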
### Testing
Data were collected from ORFs of 10 *Saccharomyces cerevisiae* strains, different from the one used for training:
- AWRI1631
- BC187
- BY4741
- CBS7960
- FL100
- g833-1B
- Kyokai7
- LalvinQA23
- Vin13
- YS9
A total of 54,452 transcripts were collected and processed into a CSV file by extracting the previously mentioned features. The model performed well: overall accuracy, F1, precision and recall were always above 86%. Nevertheless, the slight bias towards verified and dubious ORFs was confirmed, although uncharacterized ORFs were also well detected in several tests.
## saccharomyces-pythia: gene calling and AI integration
SacCerML has now reached a new stage of its development (v1.0.0), where it has been rebranded as **saccharomyces-pythia**. You can now enjoy the following upgrades, which make it user-friendly and easy to install:
- [Gradio](https://www.gradio.app/) chatbot interface running completely locally on your computer
- Gene calling with automated ORF detection thanks to [orfipy](https://pypi.org/project/orfipy/): no need to preprocess your reads, just upload one or more FASTA files with *S. cerevisiae* DNA sequences to the chatbot.
- AI assistant, built upon [EleutherAI/pythia-160m-deduped-v0](https://huggingface.co/EleutherAI/pythia-160m-deduped-v0) finetuned on *Saccharomyces cerevisiae and its industrial applications* (Parapouli et al., 2020): this is a text-generation model that will reply to researchers' questions (still a beta feature, it will become more stable in future releases).
- Docker image to download and run the application on your computer
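To illustrate what ORF detection means in the gene-calling step above, here is a deliberately naive pure-Python sketch that scans both strands and all three reading frames for `ATG ... stop` stretches. It is not orfipy's algorithm or API; the function `find_orfs` and its `min_len` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=9):
    """Return (strand, start, end, orf_sequence) tuples for ATG...stop ORFs.

    Coordinates are 0-based and end-exclusive; for the "-" strand they refer
    to the reverse-complemented sequence, not to the original one.
    """
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_len:
                        orfs.append((strand, start, end, strand_seq[start:end]))
                    start = None  # look for the next ATG in this frame
    return orfs


orfs = find_orfs("CCATGAAATAGCC")  # toy sequence with a single short ORF
```

A dedicated tool such as orfipy handles large FASTA inputs and many more options; this sketch only shows the underlying idea.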
## Try it out!
Use the following commands to run **saccharomyces-pythia** on your computer:
```bash
docker pull ghcr.io/astrabert/saccharomyces-pythia:latest
docker run -p 7860:7860 ghcr.io/astrabert/saccharomyces-pythia:latest
```
Wait between 30 seconds and 1 minute: the app should then be reachable at 0.0.0.0:7860 (Linux-based) or localhost:7860 (Windows-based).
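If you would rather not guess when the container is ready, a small polling helper can check the address for you. This is a sketch using only the Python standard library; the helper `wait_for_app` and its default timeout and polling interval are assumptions, not part of the project.

```python
import time
import urllib.request


def wait_for_app(url="http://localhost:7860", timeout=90.0, interval=2.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused or timed out: the app is not up yet.
            pass
        time.sleep(interval)
    return False


# Example call (blocks for up to `timeout` seconds):
# print("ready" if wait_for_app() else "not reachable")
```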
## References
* Saccharomyces Genome Database: <https://www.yeastgenome.org/>
* Biopython: <https://biopython.org/>
* Scikit-learn: <https://scikit-learn.org/stable/>
* Gradio: <https://www.gradio.app/>
* orfipy: <https://pypi.org/project/orfipy/>
* EleutherAI/pythia-160m-deduped-v0: <https://huggingface.co/EleutherAI/pythia-160m-deduped-v0>
* Parapouli et al., 2020: <https://doi.org/10.3934/microbiol.2020001>
Additionally, the following libraries and packages were used in the development of the machine learning model:
* NumPy: <https://numpy.org/>
* Pandas: <https://pandas.pydata.org/>
These libraries and packages were used for data manipulation, analysis, and model training.
## License
The project is provided under the MIT license.
If you are using saccharomyces-pythia for your work, please consider citing its author, [Astra Bertelli](https://astrabert.vercel.app).
*How was this README generated? Leveraging the power of AI with reAIdme, a HuggingChat assistant based on meta-llama/Llama-2-70b-chat-hf. Go and give it a try at this link: <https://hf.co/chat/assistant/660d9a4f590a7924eed02a32>! 🤗*