---
library_name: transformers
tags:
- biobert
- medical-nlp
- icd-9
- classification
- healthcare
license: apache-2.0
language:
- en
base_model:
- dmis-lab/biobert-v1.1
pipeline_tag: text-classification
---
# Model Card for BioBERT Fine-tuned on MIMIC-III for ICD-9 Code Classification
## Model Details
### Model Description
This is a BioBERT model fine-tuned on the MIMIC-III (Medical Information Mart for Intensive Care) corpus for ICD-9 code classification. The model predicts medical diagnostic codes from electronic health record (EHR) and symptom text inputs.
- **Developed by:** [Researcher/Institution Name - to be added]
- **Model type:** Transformer-based medical language model (BioBERT)
- **Language(s):** English (medical domain)
- **License:** Apache 2.0
- **Finetuned from model:** dmis-lab/biobert-v1.1
### Model Sources
- **Repository:** [GitHub/Model Repository Link - to be added]
- **Paper:** [Research Paper Link - to be added]
## Uses
### Direct Use
The primary use of this model is to automatically classify medical conditions by predicting relevant ICD-9 diagnostic codes from clinical text, such as electronic health records, medical notes, or symptom descriptions.
### Downstream Use
This model can be integrated into:
- Clinical decision support systems
- Medical coding automation
- Electronic health record (EHR) analysis tools
- Healthcare informatics research
### Out-of-Scope Use
- The model should not be used for direct medical diagnosis without professional medical oversight
- It is not intended to replace clinical judgment
- Performance may vary with text outside the medical domain or significantly different from the training corpus
## Bias, Risks, and Limitations
- The model's performance is limited to the medical conditions and coding patterns represented in the MIMIC-III dataset
- Potential biases from the original training data may be present
- Accuracy can be affected by variation in medical terminology and writing style, and by complex or atypical cases
### Recommendations
- Validate model predictions with medical professionals
- Use as a supportive tool, not a replacement for expert medical assessment
- Regularly evaluate performance on new datasets
- Be aware of potential demographic or contextual biases in the predictions
## How to Get Started with the Model
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer ('model_path' is a placeholder for the checkpoint)
model = AutoModelForSequenceClassification.from_pretrained('model_path')
tokenizer = AutoTokenizer.from_pretrained('model_path')
model.eval()

# Example prediction function (similar to the provided get_predictions function)
def predict_icd9_codes(input_text, threshold=0.8):
    # Tokenize the input to the model's 512-token limit
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True,
                       max_length=512, padding='max_length')
    # Get model predictions without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.sigmoid(outputs.logits)
    # Keep the labels whose probability exceeds the threshold (multi-label setup)
    predicted_codes = [model.config.id2label[i.item()]
                       for i in (predictions > threshold).nonzero()[:, 1]]
    return predicted_codes
```
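With a loaded checkpoint, the function can then be called on a clinical note. The note below and the printed codes are illustrative only; actual outputs depend on the checkpoint's label set:
```python
note = ("Patient presents with chest pain radiating to the left arm, "
        "shortness of breath, and diaphoresis. History of hypertension.")
print(predict_icd9_codes(note, threshold=0.8))
# e.g. ['401.9', '786.50']  (illustrative output only)
```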
## Training Details
### Training Data
- **Dataset:** MIMIC-III corpus
- **Domain:** Medical/Clinical text
- **Content:** Electronic Health Records (EHR)
### Training Procedure
#### Preprocessing
- Text tokenization with the BioBERT tokenizer (sketched below)
- Maximum sequence length: 512 tokens
- Padding to uniform length
- Text normalization (specific techniques to be documented)
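A minimal sketch of the tokenization step, assuming the base tokenizer from dmis-lab/biobert-v1.1 (the input sentence is illustrative):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dmis-lab/biobert-v1.1')

# Truncate or pad every note to the 512-token limit used in training
encoded = tokenizer(
    "Patient admitted with acute shortness of breath.",
    truncation=True,
    max_length=512,
    padding='max_length',
    return_tensors='pt',
)
print(encoded['input_ids'].shape)  # torch.Size([1, 512])
```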
#### Training Hyperparameters
- **Base Model:** BioBERT (dmis-lab/biobert-v1.1)
- **Training Regime:** Fine-tuning
- **Precision:** [Specify training precision, e.g., mixed precision]
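The exact training setup is not yet documented; the following is a minimal sketch of a multi-label fine-tuning loop with the Hugging Face `Trainer`, where the label count, batch size, learning rate, epoch count, and `train_dataset` are assumptions rather than the values actually used:
```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

NUM_ICD9_CODES = 50  # assumption: e.g. a top-50 ICD-9 label set

# problem_type='multi_label_classification' makes the model use BCEWithLogitsLoss
model = AutoModelForSequenceClassification.from_pretrained(
    'dmis-lab/biobert-v1.1',
    num_labels=NUM_ICD9_CODES,
    problem_type='multi_label_classification',
)

args = TrainingArguments(
    output_dir='biobert-icd9',        # illustrative output path
    per_device_train_batch_size=16,   # assumed hyperparameters
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset)  # assumed: tokenized MIMIC-III split
trainer.train()
```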
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- Held-out subset of the MIMIC-III corpus
- Diverse medical cases and documentation styles
#### Metrics
- Precision
- Recall
- F1-Score
- Multi-label classification metrics
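In the multi-label setting, these metrics are typically micro-averaged over the label set. A small sketch with scikit-learn, using illustrative binary indicator matrices rather than real evaluation data:
```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Rows = clinical notes, columns = ICD-9 codes (illustrative values)
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='micro', zero_division=0)
print(f"P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
```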
## Environmental Impact
- **Carbon emissions:** to be estimated
- **Compute infrastructure:** to be specified
## Technical Specifications
### Model Architecture
- **Base Model:** BioBERT
- **Task:** Multi-label ICD-9 Code Classification
## Citation
[Citation information to be added when research is published]
## More Information
For more details about the model's development, performance, and usage, please contact the model developers.