CT-BERT-JPN
CT-BERT-JPN is a Japanese BERT-based model for multilabel classification of abnormal findings in radiology reports, fine-tuned on the CT-RATE-JPN dataset.
Model Overview
This model is based on BERT base Japanese v3 and was fine-tuned on the CT-RATE-JPN dataset, which provides Japanese translations of the radiology reports in the CT-RATE dataset. The training data consists of deduplicated radiology reports with their corresponding abnormality labels.
How to Use
!pip install fugashi unidic_lite
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load the model and tokenizer from Hugging Face Hub
model_name = "YYama0/CT-BERT-JPN"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=18,
    problem_type="multi_label_classification",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Define the inference function
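def infer(input_texts):
    # Tokenize the batch of reports, padding to the longest and truncating overlong inputs
    inputs = tokenizer(input_texts, padding=True, truncation=True, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    # Sigmoid gives an independent probability per label (multilabel setting)
    probs = torch.sigmoid(logits)
    return probs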
# Run inference
input_texts = ["気管および両主気管支の内腔は開存しています。気管および両主気管支の内腔には閉塞病変は認められませんでした。縦隔内の主要血管構造、心臓の輪郭、サイズは正常です。胸部大動脈の直径は正常です。心嚢水、心膜肥厚は確認されませんでした。胸部食道径は正常であり、非造影検査の範囲内で有意な病的壁肥厚は認められませんでした。縦隔内の上下部気管傍、大動脈肺動脈窓の血管前領域、および気管分岐下において、最大短軸が7mmのリンパ節が認められました。肺野条件では、右側の胸膜葉の間に最大で8cmの厚さに達する広範な胸水が認められました。左側では、最も広い部分で26mmです。隣接する肺実質には、特に右側でびまん性の無気肺変化が認められました。加えて、両肺で小葉間隔壁の肥厚を伴うびまん性のすりガラス陰影の増加およびcrazy paving appearancesが認められました。これらの所見は感染症過程と一致している可能性があります。肺水腫も鑑別診断に考慮されるべきです。臨床および検査との対比を考慮すること、および治療後の管理が推奨されます。両肺にミリ単位の非特異的な実質性結節がいくつか認められました。検査範囲の上腹部では、有意な病変は認められませんでした。骨構造においても溶骨性破壊病変は認められませんでした。"]
probs = infer(input_texts)
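The probabilities returned by `infer` are one value per label, so they still need to be mapped to finding names. The following is a minimal sketch of that step in plain Python. It assumes that the model's output order matches the row indices (0 to 17) of the performance table in this card, which the card does not state explicitly, and that a probability equal to the threshold counts as positive. `FINDINGS` and `decode_findings` are hypothetical names introduced for illustration, and the example probabilities are made up rather than real model output.

```python
# Assumption: label order follows the row indices of the performance table.
FINDINGS = [
    "Medical material",
    "Arterial wall calcification",
    "Cardiomegaly",
    "Pericardial effusion",
    "Coronary artery wall calcification",
    "Hiatal hernia",
    "Lymphadenopathy",
    "Emphysema",
    "Atelectasis",
    "Lung nodule",
    "Lung opacity",
    "Pulmonary fibrotic sequela",
    "Pleural effusion",
    "Mosaic attenuation pattern",
    "Peribronchial thickening",
    "Consolidation",
    "Bronchiectasis",
    "Interlobular septal thickening",
]


def decode_findings(report_probs, threshold=0.5):
    """Return (finding, probability) pairs at or above the threshold for one report."""
    if len(report_probs) != len(FINDINGS):
        raise ValueError(f"expected {len(FINDINGS)} probabilities, got {len(report_probs)}")
    return [(name, p) for name, p in zip(FINDINGS, report_probs) if p >= threshold]


# Made-up probabilities for illustration only (not real model output).
example = [0.02] * 18
example[8] = 0.97   # Atelectasis
example[12] = 0.99  # Pleural effusion
print(decode_findings(example))
```

With the tensor from the snippet above, the per-report list could be obtained with `probs[0].tolist()` and passed to `decode_findings`.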
Model Performance
The following table shows the results for each class on the validation data (n=150) of CT-RATE-JPN. For metrics requiring binarization, a threshold of 0.5 was used.
Index | Finding | Positive samples | Accuracy | Precision | Recall | F1 | AUC-ROC | AP |
---|---|---|---|---|---|---|---|---|
0 | Medical material | 14 | 0.973 | 0.778 | 1 | 0.875 | 0.999 | 0.99 |
1 | Arterial wall calcification | 49 | 0.987 | 0.961 | 1 | 0.98 | 1 | 1 |
2 | Cardiomegaly | 25 | 0.987 | 1 | 0.92 | 0.958 | 0.999 | 0.996 |
3 | Pericardial effusion | 12 | 1 | 1 | 1 | 1 | 1 | 1 |
4 | Coronary artery wall calcification | 45 | 0.987 | 0.978 | 0.978 | 0.978 | 1 | 1 |
5 | Hiatal hernia | 24 | 1 | 1 | 1 | 1 | 1 | 1 |
6 | Lymphadenopathy | 37 | 0.987 | 0.973 | 0.973 | 0.973 | 0.994 | 0.987 |
7 | Emphysema | 31 | 0.98 | 0.938 | 0.968 | 0.952 | 0.989 | 0.96 |
8 | Atelectasis | 49 | 0.993 | 0.98 | 1 | 0.99 | 1 | 1 |
9 | Lung nodule | 82 | 0.967 | 0.975 | 0.963 | 0.969 | 0.991 | 0.994 |
10 | Lung opacity | 55 | 0.953 | 0.929 | 0.945 | 0.937 | 0.991 | 0.985 |
11 | Pulmonary fibrotic sequela | 47 | 0.953 | 0.935 | 0.915 | 0.925 | 0.981 | 0.973 |
12 | Pleural effusion | 19 | 0.987 | 0.905 | 1 | 0.95 | 1 | 0.997 |
13 | Mosaic attenuation pattern | 25 | 1 | 1 | 1 | 1 | 1 | 1 |
14 | Peribronchial thickening | 21 | 0.96 | 1 | 0.714 | 0.833 | 0.985 | 0.948 |
15 | Consolidation | 24 | 0.933 | 0.706 | 1 | 0.828 | 0.996 | 0.985 |
16 | Bronchiectasis | 20 | 0.98 | 0.87 | 1 | 0.93 | 0.99 | 0.873 |
17 | Interlobular septal thickening | 7 | 0.993 | 0.875 | 1 | 0.933 | 1 | 1 |
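To make the relationship between the threshold-based columns concrete, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts used in the example (14 true positives, 4 false positives, 0 false negatives, 132 true negatives) are not reported in this card; they are inferred from the "Medical material" row (14 positive samples, recall 1, precision 0.778, n=150) and should be read as an assumption. `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute threshold-based metrics from confusion-matrix counts, rounded to 3 decimals."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return {
        "accuracy": round(accuracy, 3),
        "precision": round(precision, 3),
        "recall": round(recall, 3),
        "f1": round(f1, 3),
    }


# Counts inferred from the "Medical material" row (an assumption, not source data).
metrics = binary_metrics(tp=14, fp=4, fn=0, tn=132)
print(metrics)
# {'accuracy': 0.973, 'precision': 0.778, 'recall': 1.0, 'f1': 0.875}
```

Under these inferred counts the output agrees with the reported row, which illustrates how a class with perfect recall can still show lower precision when a few negative reports are flagged.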
Citation
Base BERT Model:
Please cite the original BERT Japanese model from cl-tohoku/bert-japanese.
CT-RATE Dataset:
Please visit the original CT-RATE repository for the appropriate citation information.
CT-RATE-JPN (CT-BERT-JPN):
Citation information for CT-RATE-JPN is provided below. The research paper details the dataset and the translation methodology.
@misc{yamagishi2024ctrep,
title={Development of a Large-scale Dataset of Chest Computed Tomography Reports in Japanese and a High-performance Finding Classification Model},
author={Yosuke Yamagishi and Yuta Nakamura and Tomohiro Kikuchi and Yuki Sonoda and Hiroshi Hirakawa and Shintaro Kano and Satoshi Nakamura and Shouhei Hanaoka and Takeharu Yoshikawa and Osamu Abe},
year={2024},
eprint={2412.15907},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.15907},
}
License
This model was trained using the CT-RATE-JPN dataset, which is released under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. Regarding model usage:
- The model outputs and weights can be used for non-commercial research purposes only
- When using the dataset, users must comply with the terms of the original CC BY-NC-SA license
Acknowledgments
- The original BERT Japanese model developers (cl-tohoku)
- The CT-RATE dataset creators