---
language: en
license: other
tags:
- random-forest
- classification
- bert
- sector-classification
- machine-learning
inference: false
datasets:
- custom
model-index:
- name: RF 48 Sectors Classification Model
  results: []
---

# RF 48 Sectors Classification Model

## Overview

This model is a Random Forest classifier that assigns a dataset to one of 48 predefined sectors based on its column names. The column names are embedded with BERT (bert-base-uncased), and the Random Forest predicts the sector from those embeddings.

## Model Details

- **Model Type**: Random Forest Classifier
- **Embedding Method**: BERT (bert-base-uncased)
- **Number of Sectors**: 48
- **Classification Approach**: Column name embedding and prediction

## 48 Supported Sectors

The model classifies datasets into the following 48 sectors, listed here under 11 top-level categories:

1. Agriculture Sector
   - Crop Production
   - Livestock Farming
   - Agricultural Equipment
   - Agri-tech

2. Banking & Finance Sector
   - Retail Banking
   - Corporate Banking
   - Investment Banking
   - Digital Banking
   - Asset Management
   - Securities & Investments
   - Financial Planning & Advice

3. Construction & Infrastructure
   - Residential Construction
   - Commercial Construction
   - Industrial Construction
   - Infrastructure

4. Consulting Sector
   - Management Consulting
   - IT Consulting
   - Human Resources Consulting
   - Legal Consulting

5. Education Sector
   - Early Childhood Education
   - Primary & Secondary Education
   - Higher Education
   - Adult Education & Vocational Training

6. Engineering Sector
   - Civil Engineering
   - Mechanical Engineering
   - Electrical Engineering
   - Chemical Engineering

7. Entertainment & Media
   - Film & Television
   - Music Industry
   - Video Games
   - Live Events

8. Environmental Sector
   - Environmental Protection
   - Waste Management
   - Renewable Energy
   - Wildlife Conservation

9. Insurance Sector
   - General Insurance Services
   - Life Insurance
   - Health Insurance
   - Property & Casualty Insurance
   - Reinsurance

10. Food Industry
    - Food Processing
    - Food Retail
    - Food Services
    - Food Safety & Quality Control

11. Healthcare Sector
    - Hospitals
    - Clinics & Outpatient Care
    - Pharmaceuticals
    - Medical Equipment & Supplies

## Installation

```bash
pip install transformers torch joblib scikit-learn huggingface_hub
```

## Usage

```python
from huggingface_hub import hf_hub_download
from transformers import BertTokenizer, BertModel
import joblib
import torch

# Load the BERT tokenizer and encoder used to embed column names
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased', ignore_mismatched_sizes=True)
bert_model.eval()

# Download and load the Random Forest model and its label encoder
model_path = hf_hub_download(repo_id="Mageswaran/rf_48_sectors", filename="model_48_sectors.pkl")
label_encoder_path = hf_hub_download(repo_id="Mageswaran/rf_48_sectors", filename="label_encoder_48_sectors.pkl")

rf = joblib.load(model_path)
label_encoder = joblib.load(label_encoder_path)

def get_bert_embeddings(texts):
    # NOTE: this helper is not defined in the original card. Mean pooling over
    # the last hidden state is an ASSUMPTION; the pooling must match whatever
    # was used when the Random Forest was trained.
    inputs = bert_tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = bert_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()

def predict_sector(column_names):
    # Convert the comma-separated column names to a BERT embedding
    embeddings = get_bert_embeddings([column_names])

    # Predict the encoded sector and map it back to its label
    prediction = rf.predict(embeddings)
    return label_encoder.inverse_transform(prediction)[0]

# Example
column_names = "clinical_trial_duration, computer_analysis_score, customer_feedback_score"
sector = predict_sector(column_names)
print(f"Predicted Sector: {sector}")
```
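
`predict_sector` returns only the single most likely label. When a dataset mixes columns from several domains, a ranked shortlist can be more informative. The sketch below ranks labels by probability. `top_k_sectors` is a hypothetical helper, not part of the published model, and the probabilities in the demo are made-up values for illustration only.

```python
def top_k_sectors(labels, probabilities, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Demo with made-up values (NOT real model output)
demo_labels = ["Pharmaceuticals", "Hospitals", "IT Consulting", "Food Retail"]
demo_probs = [0.46, 0.31, 0.15, 0.08]
print(top_k_sectors(demo_labels, demo_probs, k=2))
```

Assuming the saved model is a scikit-learn `RandomForestClassifier`, the inputs could be obtained with `probabilities = rf.predict_proba(embeddings)[0]` and `labels = label_encoder.inverse_transform(rf.classes_)`, where `embeddings` comes from `get_bert_embeddings([column_names])`.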

## Model Performance

- **Embedding Technique**: BERT embeddings from 'bert-base-uncased'
- **Classification Algorithm**: Random Forest
- **Unique Feature**: Sector classification based on column name semantics

## Limitations

- Performance depends on how semantically similar the input column names are to those seen in the training data
- Works best with column names that clearly reflect the dataset's domain
- Requires careful preprocessing of column names
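
The card does not say what preprocessing was applied during training, so the following is only one plausible normalization, not the author's pipeline. The hypothetical helpers `to_snake_case` and `normalize_column_names` convert mixed naming styles to lowercase snake_case and join the columns into a single comma-separated string, which is the format of the input in the usage example.

```python
import re


def to_snake_case(name):
    """Convert a column name such as 'CustomerFeedback Score' to 'customer_feedback_score'."""
    name = name.strip()
    # Insert an underscore at lowercase/digit -> uppercase boundaries (camelCase)
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^A-Za-z0-9]+", "_", name)
    return name.strip("_").lower()


def normalize_column_names(columns):
    """Join cleaned column names into one comma-separated string."""
    cleaned = [to_snake_case(c) for c in columns]
    # Drop names that are empty after cleaning
    return ", ".join(c for c in cleaned if c)


print(normalize_column_names(["ClinicalTrialDuration", "customer feedback-score", " Age "]))
# clinical_trial_duration, customer_feedback_score, age
```

Whether this helps depends on how the training column names were formatted, so it is worth comparing predictions with and without normalization.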

## Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

## License and Usage Restrictions

### Proprietary Usage Policy

**IMPORTANT: This model is NOT freely available for unrestricted use.**

#### Usage Restrictions

- Prior written permission is REQUIRED before using this model
- Commercial use is strictly prohibited without explicit authorization
- Academic or research use requires formal permission from the model's creator
- Unauthorized use, distribution, or reproduction is prohibited

#### Licensing Terms

- This model is protected under proprietary intellectual property rights
- Any use of the model requires a formal licensing agreement
- Contact the model's creator for licensing inquiries and permissions

### Permissions and Inquiries

To request permission for model usage, please contact:

- Email: [Your Contact Email]
- Hugging Face Profile: https://huggingface.co/Mageswaran

**Unauthorized use will result in legal action.**

## Contact

[email protected]

## Citing this Model

If you use this model in your research, please cite it using the following BibTeX entry:

```bibtex
@misc{mageswaran_rf_48_sectors,
  title = {Random Forest 48 Sectors Classification Model},
  author = {Mageswaran},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Mageswaran/rf_48_sectors}}
}
```

## Additional Resources

- [Author's Hugging Face Profile](https://huggingface.co/Mageswaran)
- [Model Repository](https://huggingface.co/Mageswaran/rf_48_sectors)

## Acknowledgments

- Hugging Face Transformers