---
library_name: transformers
tags:
- biobert
- medical-nlp
- icd-9
- classification
- healthcare
license: apache-2.0
language:
- en
base_model:
- dmis-lab/biobert-v1.1
pipeline_tag: text-classification
---

# Model Card for BioBERT Fine-tuned on MIMIC-III for ICD-9 Code Classification

## Model Details

### Model Description

This is a BioBERT model fine-tuned on the MIMIC-III (Medical Information Mart for Intensive Care III) corpus for ICD-9 code classification. The model predicts medical diagnostic codes from Electronic Health Record (EHR) text and symptom descriptions.

- **Developed by:** [Researcher/Institution Name - to be added]
- **Model type:** Transformer-based medical language model (BioBERT)
- **Language(s):** English (medical domain)
- **License:** Apache 2.0
- **Finetuned from model:** dmis-lab/biobert-v1.1

### Model Sources

- **Repository:** [GitHub/Model Repository Link - to be added]
- **Paper:** [Research Paper Link - to be added]

## Uses

### Direct Use

The primary use of this model is to automatically classify medical conditions by predicting relevant ICD-9 diagnostic codes from clinical text, such as electronic health records, medical notes, or symptom descriptions.

### Downstream Use

This model can be integrated into:
- Clinical decision support systems
- Medical coding automation
- Electronic health record (EHR) analysis tools
- Healthcare informatics research

### Out-of-Scope Use

- The model should not be used for direct medical diagnosis without professional medical oversight
- It is not intended to replace clinical judgment
- Performance may degrade on text outside the medical domain or text that differs significantly from the training corpus

## Bias, Risks, and Limitations

- The model's coverage is limited to the medical conditions and coding patterns present in the MIMIC-III dataset
- Biases present in the original training data may carry over into predictions
- Accuracy can be affected by variation in medical terminology and writing style, and by complex or atypical cases

### Recommendations

- Validate model predictions with medical professionals
- Use as a supportive tool, not a replacement for expert medical assessment
- Regularly evaluate performance on new datasets
- Be aware of potential demographic or contextual biases in the predictions

## How to Get Started with the Model

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer
# (replace 'model_path' with the checkpoint path or Hub repo id)
model = AutoModelForSequenceClassification.from_pretrained('model_path')
tokenizer = AutoTokenizer.from_pretrained('model_path')
model.eval()

# Example prediction function
def predict_icd9_codes(input_text, threshold=0.8):
    # Tokenize the clinical text
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True,
                       max_length=512, padding='max_length')

    # Multi-label setup: score each label independently with a sigmoid
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.sigmoid(outputs.logits)

    # Keep the label indices whose probability exceeds the threshold
    label_indices = (predictions > threshold).nonzero()[:, 1]
    predicted_codes = [model.config.id2label[i.item()] for i in label_indices]

    return predicted_codes
```
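
For example (the clinical note below is invented for illustration; the returned codes depend entirely on the fine-tuned label set):

```python
note = "Patient presents with chest pain, shortness of breath, and elevated troponin."
print(predict_icd9_codes(note))
# e.g. ['410.71', '786.05'] - actual output depends on the model's labels
```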

## Training Details

### Training Data

- **Dataset:** MIMIC-III corpus
- **Domain:** Medical/Clinical text
- **Content:** Electronic Health Records (EHR)

### Training Procedure

#### Preprocessing
- Text tokenization
- Maximum sequence length: 512 tokens
- Padding to uniform length
- Potential text normalization techniques (a tokenization sketch follows below)
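
A minimal tokenization sketch, assuming the fine-tuned model reuses the standard `dmis-lab/biobert-v1.1` tokenizer (the notes are made up for illustration):

```python
from transformers import AutoTokenizer

# Assumption: the fine-tuned model reuses the BioBERT base tokenizer
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

# Made-up clinical notes, for illustration only
notes = [
    "Patient admitted with acute respiratory distress.",
    "History of type 2 diabetes mellitus, managed with metformin.",
]

# Truncate/pad every note to the fixed 512-token window described above
batch = tokenizer(notes, truncation=True, max_length=512,
                  padding="max_length", return_tensors="pt")
print(batch["input_ids"].shape)  # torch.Size([2, 512])
```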

#### Training Hyperparameters
- **Base Model:** BioBERT
- **Training Regime:** Fine-tuning
- **Precision:** [Specify training precision, e.g., mixed precision]
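
The full hyperparameter configuration is not reported here. As a rough illustration only, a multi-label fine-tuning setup with `transformers` typically looks like the sketch below; every value in it (label count, learning rate, batch size, epochs) is an assumption, not the recorded configuration:

```python
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Assumption: the classification head covers the top-50 ICD-9 codes;
# the real label set used for this model is not documented here
model = AutoModelForSequenceClassification.from_pretrained(
    "dmis-lab/biobert-v1.1",
    num_labels=50,
    problem_type="multi_label_classification",  # trains with BCEWithLogitsLoss
)

# Illustrative hyperparameters, not the recorded training configuration
args = TrainingArguments(
    output_dir="biobert-mimic3-icd9",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    fp16=True,  # mixed precision, if the hardware supports it
)

# `train_ds` stands for a tokenized MIMIC-III split with multi-hot float
# label vectors; it is not defined in this sketch
# trainer = Trainer(model=model, args=args, train_dataset=train_ds)
# trainer.train()
```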

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
- Held-out subset of the MIMIC-III corpus
- Diverse medical cases and documentation styles

#### Metrics
- Precision
- Recall
- F1-Score
- Multi-label classification metrics
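
A minimal sketch of computing these multi-label metrics with scikit-learn; the indicator matrices below are illustrative, not real evaluation data:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative binary indicator matrices: rows = notes, columns = ICD-9 codes
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])

# Micro-averaging aggregates over all (note, code) decisions, a common
# choice for ICD coding because the label distribution is highly skewed
for avg in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    f = f1_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```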

## Environmental Impact

- Estimated carbon emissions to be calculated
- Compute details to be specified

## Technical Specifications

### Model Architecture
- **Base Model:** BioBERT
- **Task:** Multi-label ICD-9 Code Classification

## Citation

[Citation information to be added when research is published]

## More Information

For more details about the model's development, performance, and usage, please contact the model developers.