Model Card for BioBERT Fine-tuned on MIMIC-III for ICD-9 Code Classification

Model Details

Model Description

This is a BioBERT model fine-tuned on the MIMIC-III (Medical Information Mart for Intensive Care III) corpus for ICD-9 code classification. It predicts diagnostic codes from clinical free text, such as Electronic Health Record (EHR) notes and symptom descriptions. Because a single note can carry several codes, the task is multi-label.

  • Developed by: [Researcher/Institution Name - to be added]
  • Model type: Transformer-based medical language model (BioBERT)
  • Language(s): English (Medical Domain)
  • License: [License to be specified]
  • Finetuned from model: BioBERT base model

Model Sources

  • Repository: [GitHub/Model Repository Link - to be added]
  • Paper: [Research Paper Link - to be added]

Uses

Direct Use

The primary use of this model is to automatically classify medical conditions by predicting relevant ICD-9 diagnostic codes from clinical text, such as electronic health records, medical notes, or symptom descriptions.

Downstream Use

This model can be integrated into:

  • Clinical decision support systems
  • Medical coding automation
  • Electronic health record (EHR) analysis tools
  • Healthcare informatics research

Out-of-Scope Use

  • The model should not be used for direct medical diagnosis without professional medical oversight
  • It is not intended to replace clinical judgment
  • Performance may vary with text outside the medical domain or significantly different from the training corpus

Bias, Risks, and Limitations

  • The model's performance is limited to the medical conditions and coding patterns in the MIMIC-III dataset
  • Potential biases from the original training data may be present
  • Accuracy can be affected by variations in medical terminology, writing styles, and complex medical cases

Recommendations

  • Validate model predictions with medical professionals
  • Use as a supportive tool, not a replacement for expert medical assessment
  • Regularly evaluate performance on new datasets
  • Be aware of potential demographic or contextual biases in the predictions

How to Get Started with the Model

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the model and tokenizer. 'model_path' is a placeholder: use a local
# directory or the Hub ID, e.g. 'ashishkgpian/biobert_icd9_classifier_ehr'
model = AutoModelForSequenceClassification.from_pretrained('model_path')
tokenizer = AutoTokenizer.from_pretrained('model_path')

# Example prediction function (similar to the provided get_predictions function)
def predict_icd9_codes(input_text, threshold=0.8):
    # Tokenize input, truncating to the 512-token limit
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512, padding='max_length')

    # Get model predictions; sigmoid scores each label independently (multi-label)
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.sigmoid(outputs.logits)

    # Keep label indices above threshold, converted to plain ints
    # because id2label is keyed by int, not by tensor
    label_ids = (predictions > threshold).nonzero()[:, 1].tolist()
    predicted_codes = [model.config.id2label[i] for i in label_ids]

    return predicted_codes
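
To see how close each code sits to the threshold, rather than only the final list, the same sigmoid-and-threshold step can be sketched on plain Python values. This is a minimal illustration, not part of the released model: the helper name rank_codes, the example logits, and the three-entry label mapping are hypothetical stand-ins for outputs.logits[0].tolist() and model.config.id2label.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function: logit -> probability in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def rank_codes(logits, id2label, threshold=0.8):
    """Return (code, probability) pairs above threshold, highest first."""
    scored = [(id2label[i], sigmoid(z)) for i, z in enumerate(logits)]
    kept = [(code, p) for code, p in scored if p > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical values for illustration only (not real model output)
example_id2label = {0: "401.9", 1: "428.0", 2: "250.00"}
example_logits = [2.0, -1.0, 3.0]

ranked = rank_codes(example_logits, example_id2label)
for code, probability in ranked:
    print(f"{code}: {probability:.3f}")
```

Returning the probabilities alongside the codes makes it easier for a reviewer to judge borderline predictions and to experiment with thresholds other than 0.8.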

Training Details

Training Data

  • Dataset: MIMIC-III Corpus
  • Domain: Medical/Clinical text
  • Content: Electronic Health Records (EHR)

Training Procedure

Preprocessing

  • Text tokenization
  • Maximum sequence length: 512 tokens
  • Padding to uniform length
  • Text normalization may also have been applied (techniques not specified)
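
The truncation and padding steps listed above can be illustrated on a plain list of token IDs. This is a simplified sketch under stated assumptions: the function name pad_or_truncate is hypothetical, the padding ID of 0 is assumed, and the token IDs are arbitrary. A real tokenizer additionally inserts special tokens such as [CLS] and [SEP] and preserves them when truncating, which this sketch does not attempt.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Clip a token-ID sequence to max_length, then right-pad it.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding so the model can ignore padded positions.
    """
    clipped = list(token_ids[:max_length])
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask

# Toy run with max_length=6 so the result is easy to read
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=6)
print(ids)
print(mask)
```

Note that clinical notes often exceed 512 tokens, so with plain truncation any text past the limit never reaches the model.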

Training Hyperparameters

  • Base Model: BioBERT
  • Training Regime: Fine-tuning
  • Precision: [Specify training precision, e.g., mixed precision]

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Held-out subset of the MIMIC-III corpus
  • Diverse medical cases and documentation styles

Metrics

  • Precision
  • Recall
  • F1-Score
  • Multi-label classification metrics
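
For a multi-label task, precision, recall, and F1 are usually reported with an averaging scheme. The card does not say which one was used, so the sketch below computes both micro and macro averages from binary indicator rows. The function name multilabel_prf and the toy labels are hypothetical, and macro F1 is taken here as the mean of per-label F1 scores.

```python
def multilabel_prf(y_true, y_pred):
    """Micro- and macro-averaged (precision, recall, F1).

    y_true, y_pred: equal-shaped lists of 0/1 rows, one row per document
    and one column per code.
    """
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1

    def prf(tp_, fp_, fn_):
        # Undefined ratios (zero denominator) are reported as 0.0
        precision = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        recall = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    # Micro: pool counts over all labels, then compute once
    micro = prf(sum(tp), sum(fp), sum(fn))
    # Macro: compute per label, then take the unweighted mean
    per_label = [prf(tp[j], fp[j], fn[j]) for j in range(n_labels)]
    macro = tuple(sum(m[k] for m in per_label) / n_labels for k in range(3))
    return {"micro": micro, "macro": macro}

# Toy data: 3 documents, 3 codes (illustrative only)
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
scores = multilabel_prf(y_true, y_pred)
print(scores)
```

Micro averaging is dominated by frequent codes, while macro averaging weights every code equally, so the two can diverge sharply when the code distribution is skewed.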

Environmental Impact

  • Estimated carbon emissions to be calculated
  • Compute details to be specified

Technical Specifications

Model Architecture

  • Base Model: BioBERT
  • Task: Multi-label ICD-9 Code Classification
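
The multi-label framing is why the usage example applies a sigmoid to the logits instead of a softmax. The sketch below contrasts the two on the same hypothetical logits: sigmoid scores each code independently, so several codes can clear a threshold at once, whereas softmax forces the scores to compete and sum to 1. The function names and logit values are illustrative assumptions.

```python
import math

def sigmoid_scores(logits):
    """Independent per-label probabilities (fine for moderate logits)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax_scores(logits):
    """Mutually exclusive class probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: two codes strongly indicated, one not
logits = [2.0, 2.0, -3.0]
independent = sigmoid_scores(logits)
exclusive = softmax_scores(logits)

threshold = 0.8
print([p > threshold for p in independent])  # per-label decisions
print([p > threshold for p in exclusive])    # softmax splits the mass
```

With these values, both strongly indicated codes pass the 0.8 threshold under sigmoid, but neither does under softmax because the two share the probability mass.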

Citation

[Citation information to be added when research is published]

More Information

For more details about the model's development, performance, and usage, please contact the model developers.

  • Model ID: ashishkgpian/biobert_icd9_classifier_ehr
  • Model size: 109M params (Safetensors, tensor type F32)