Model Card for BioBERT Fine-tuned on MIMIC-3 for ICD-9 Code Classification

Model Details

Model Description

This is a BioBERT model fine-tuned on the MIMIC-3 (Medical Information Mart for Intensive Care) corpus for ICD-9 code classification. It predicts diagnostic codes from Electronic Health Record (EHR) text and free-text symptom descriptions.

Developed by: [Researcher/Institution Name - to be added]
Model type: Transformer-based medical language model (BioBERT)
Language(s): English (Medical Domain)
License: [License to be specified]
Finetuned from model: BioBERT base model

Model Sources

Repository: [GitHub/Model Repository Link - to be added]
Paper: [Research Paper Link - to be added]

Uses

Direct Use

The primary use of this model is to automatically classify medical conditions by predicting relevant ICD-9 diagnostic codes from clinical text, such as electronic health records, medical notes, or symptom descriptions.

Downstream Use

This model can be integrated into:

Clinical decision support systems
Medical coding automation
Electronic health record (EHR) analysis tools
Healthcare informatics research

Out-of-Scope Use

The model should not be used for direct medical diagnosis without professional medical oversight
It is not intended to replace clinical judgment
Performance may degrade on text outside the medical domain or text that differs significantly from the training corpus

Bias, Risks, and Limitations

The model can only predict the medical conditions and coding patterns represented in the MIMIC-3 dataset
Potential biases from the original training data may be present
Accuracy can be affected by variations in medical terminology, writing styles, and complex medical cases

Recommendations

Validate model predictions with medical professionals
Use as a supportive tool, not a replacement for expert medical assessment
Regularly evaluate performance on new datasets
Be aware of potential demographic or contextual biases in the predictions

How to Get Started with the Model

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the model and tokenizer ('model_path' is a placeholder for a local path or Hub ID)
model = AutoModelForSequenceClassification.from_pretrained('model_path')
tokenizer = AutoTokenizer.from_pretrained('model_path')
model.eval()  # disable dropout for inference

# Example prediction function (similar to the provided get_predictions function)
def predict_icd9_codes(input_text, threshold=0.8):
    # Tokenize input, truncating and padding to the 512-token limit
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512, padding='max_length')

    # Get one independent probability per label (multi-label, so sigmoid rather than softmax)
    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.sigmoid(outputs.logits)[0]

    # Keep labels above the threshold; convert tensor indices to plain ints
    # because id2label is keyed by int
    label_ids = (probabilities > threshold).nonzero(as_tuple=True)[0].tolist()
    predicted_codes = [model.config.id2label[i] for i in label_ids]

    return predicted_codes
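
The threshold argument controls how many codes are returned. The following is a minimal sketch of that filtering step in plain Python, so it runs without downloading the model. The logits and the three-entry label map are hypothetical values chosen for illustration, not outputs of this model.

```python
import math

def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def codes_above_threshold(logits, id2label, threshold=0.8):
    """Return (code, probability) pairs whose probability exceeds the threshold."""
    scored = [(id2label[i], sigmoid(z)) for i, z in enumerate(logits)]
    return [(code, p) for code, p in scored if p > threshold]

# Hypothetical label map and logits, for illustration only
toy_id2label = {0: "401.9", 1: "428.0", 2: "250.00"}
toy_logits = [2.0, 0.5, -1.0]

strict = codes_above_threshold(toy_logits, toy_id2label, threshold=0.8)
lenient = codes_above_threshold(toy_logits, toy_id2label, threshold=0.5)

print([code for code, _ in strict])   # only the most confident code
print([code for code, _ in lenient])  # a lower threshold admits more codes
```

A higher threshold favors precision and a lower one favors recall, so the default of 0.8 is worth re-tuning on validation data for the intended use.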

Training Details

Training Data

Dataset: MIMIC-3 Corpus
Domain: Medical/Clinical text
Content: Electronic Health Records (EHR)

Training Procedure

Preprocessing

Text tokenization
Maximum sequence length: 512 tokens
Padding to uniform length
Text normalization (techniques not yet specified)
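
The length handling listed above can be sketched as follows. This is a simplified illustration, not the tokenizer's actual implementation: the function name and the pad ID of 0 are assumptions, and a real BERT tokenizer also keeps the closing [SEP] token when it truncates.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Cut a token ID sequence to max_length, then right-pad it to exactly max_length.

    Returns the padded IDs and an attention mask (1 = real token, 0 = padding).
    """
    ids = list(token_ids[:max_length])
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

# Hypothetical token IDs, for illustration only
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(range(600))

print(short_ids, short_mask)
print(len(long_ids), sum(long_mask))
```

Clinical notes often exceed 512 tokens, so any text beyond the limit is dropped and cannot influence the predicted codes.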

Training Hyperparameters

Base Model: BioBERT
Training Regime: Fine-tuning
Precision: [Specify training precision, e.g., mixed precision]

Evaluation

Testing Data, Factors & Metrics

Testing Data

Held-out subset of MIMIC-3 corpus
Diverse medical cases and documentation styles

Metrics

Precision
Recall
F1-Score
Multi-label classification metrics
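
For multi-label output, precision, recall, and F1 are commonly micro-averaged, meaning true and false positives are pooled over every (document, code) pair. The sketch below shows that calculation in plain Python. The card does not state which averaging was used, so micro-averaging is an assumption here, and the toy labels are hypothetical.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over multi-hot label rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p and t:
                tp += 1      # code predicted and actually assigned
            elif p and not t:
                fp += 1      # code predicted but not assigned
            elif t and not p:
                fn += 1      # assigned code that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two hypothetical documents, three candidate codes each
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]

precision, recall, f1 = micro_prf(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Macro-averaging, which computes the metrics per code and then averages them, weights rare codes more heavily and is worth reporting alongside the micro scores when the code distribution is skewed.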

Environmental Impact

Estimated carbon emissions to be calculated
Compute details to be specified

Technical Specifications

Model Architecture

Base Model: BioBERT
Task: Multi-label ICD-9 Code Classification
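
In a multi-label setup each record can carry several codes at once, so training targets are usually multi-hot vectors with one slot per code. The sketch below illustrates that encoding under stated assumptions: the helper names and the example code sets are hypothetical, and the card does not document how labels were actually encoded.

```python
def build_label_map(all_codes):
    """Assign each distinct code a stable index (sorted for reproducibility)."""
    return {code: i for i, code in enumerate(sorted(set(all_codes)))}

def to_multi_hot(codes, label2id):
    """Encode one record's code set as a float vector of 0.0 and 1.0 values."""
    target = [0.0] * len(label2id)
    for code in codes:
        target[label2id[code]] = 1.0
    return target

# Hypothetical code assignments for two records
records = [["401.9", "428.0"], ["250.00"]]

label2id = build_label_map(code for codes in records for code in codes)
targets = [to_multi_hot(codes, label2id) for codes in records]

print(label2id)
print(targets)
```

In the transformers library, setting `problem_type="multi_label_classification"` on the model config makes the sequence classification head train with a binary cross-entropy loss over float targets of this shape. Whether this model was trained that way is not stated in the card.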

Citation

[Citation information to be added when research is published]

More Information

For more details about the model's development, performance, and usage, please contact the model developers.
