DrBERT-CASM2

Model description

DrBERT-CASM2 is a French Named Entity Recognition model fine-tuned from DrBERT: A PreTrained model in French for biomedical and clinical domains. It was trained with the medkit Trainer to detect three types of entities: problem, treatment and test.

  • Fine-tuned using medkit GitHub Repo

  • Developed by @camila-ud, medkit, HeKA Research team

  • Dataset source

    Annotated version from @aneuraz called 'corpusCasM2: A corpus of annotated clinical texts'

    • The annotation was performed collaboratively by master's students from Université Paris Cité.

    • The corpus contains documents from CAS:

      Natalia Grabar, Vincent Claveau, and Clément Dalloux. 2018. CAS: French Corpus with Clinical Cases.
      In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis,
      pages 122–128, Brussels, Belgium. Association for Computational Linguistics.
      

Intended uses & limitations

Limitations and bias

This model was trained for development and test phases. This model is limited by its training dataset, and it should be used with caution. The results are not guaranteed, and the model should be used only in data exploration stages. The model may be able to detect entities in the early stages of the analysis of medical documents in French.

The maximum token size was reduced to 128 tokens to minimize training time.
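Because of this limit, long documents are probably best split into shorter pieces before they are passed to the model. The sketch below is one possible way to do that and is not part of medkit: `chunk_sentences` and `count_tokens` are hypothetical names, the regex sentence splitter is deliberately naive, and the assumption that over-long inputs lose information (for example through truncation) is not confirmed by this card.

```python
import re
from typing import Callable, List

MAX_TOKENS = 128  # limit stated in this model card


def chunk_sentences(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
) -> List[str]:
    """Greedily pack sentences into chunks that stay within max_tokens.

    A single sentence longer than max_tokens is kept as its own
    (still over-long) chunk; this sketch does not split inside sentences.
    """
    # Naive split after ., ! or ? followed by whitespace (hypothetical helper logic).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def whitespace_count(s: str) -> int:
    """Stand-in token counter: counts whitespace-separated words."""
    return len(s.split())


print(chunk_sentences("A b c. D e f. G h.", whitespace_count, max_tokens=5))
```

In practice `count_tokens` would more sensibly wrap the model's own tokenizer, for example `lambda s: len(tokenizer(s)["input_ids"])` with a Hugging Face tokenizer, since subword counts are usually higher than word counts.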

How to use

Install medkit

First of all, please install medkit with the following command:

pip install 'medkit-lib[optional]'

Please check the documentation for more info and examples.

Using the model

from medkit.core.text import TextDocument
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

matcher = HFEntityMatcher(model="medkit/DrBERT-CASM2")

test_doc = TextDocument("Elle souffre d'asthme mais n'a pas besoin d'Allegra")
detected_entities = matcher.run([test_doc.raw_segment])

# show information
msg = "|".join(f"'{entity.label}':{entity.text}" for entity in detected_entities)
print(f"Text: '{test_doc.text}'\n{msg}")

Output:

Text: 'Elle souffre d'asthme mais n'a pas besoin d'Allegra'
'problem':asthme|'treatment':Allegra
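
For downstream exploration it can be handy to group the detected entities by label. The following is a minimal sketch that relies only on the `label` and `text` attributes used in the snippet above; `FakeEntity` and `group_by_label` are hypothetical names introduced here so the example runs without medkit or the model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class FakeEntity:
    """Stand-in for a medkit entity; only `label` and `text` are used."""
    label: str
    text: str


def group_by_label(entities: Iterable) -> Dict[str, List[str]]:
    """Collect entity texts under their label, preserving detection order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.label].append(entity.text)
    return dict(grouped)


# Same entities as in the example output above, built by hand.
entities = [FakeEntity("problem", "asthme"), FakeEntity("treatment", "Allegra")]
print(group_by_label(entities))
# {'problem': ['asthme'], 'treatment': ['Allegra']}
```

With the real matcher, the list returned by `matcher.run(...)` would take the place of `entities`.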

Training data

This model was fine-tuned on CASM2, an internal corpus of clinical cases (in French) annotated by master's students. The corpus contains more than 5000 medkit documents (~ phrases) with entities to detect.

Number of documents (~ phrases) by split

Split       # medkit docs
Train       5824
Validation  1457
Test        1821

Number of examples per entity type

Split       treatment  test  problem
Train       3258       3990  6808
Validation  842        1007  1745
Test        994        1289  2113
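
The counts in the two tables above can be turned into proportions to see how the corpus is split and how the entity types are balanced. This is a small sketch using only the numbers reported in this card; the function names are illustrative.

```python
from typing import Dict

# Counts copied from the tables above.
DOCS = {"Train": 5824, "Validation": 1457, "Test": 1821}
ENTITIES = {
    "Train": {"treatment": 3258, "test": 3990, "problem": 6808},
    "Validation": {"treatment": 842, "test": 1007, "problem": 1745},
    "Test": {"treatment": 994, "test": 1289, "problem": 2113},
}


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each count as a fraction of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


for split, doc_share in shares(DOCS).items():
    per_label = shares(ENTITIES[split])
    summary = ", ".join(f"{label} {value:.1%}" for label, value in per_label.items())
    print(f"{split:<10} {doc_share:.1%} of documents | {summary}")
```

The documents are split roughly 64/16/20 between train, validation and test, and `problem` accounts for a little under half of the annotated entities in every split.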

Training procedure

This model was fine-tuned using the medkit trainer on CPU; training takes about 3 hours.

Model performance

Model performance computed on the CASM2 test dataset (using the medkit seqeval evaluator)

Entity     precision  recall  f1
treatment  0.7492     0.7666  0.7578
test       0.7449     0.8240  0.7824
problem    0.6884     0.7304  0.7088
Overall    0.7188     0.7660  0.7416
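
As a quick sanity check, the F1 column can be recomputed from the precision and recall columns with the usual harmonic-mean formula. The sketch below uses only the values in the table; since those inputs are already rounded to four decimals, the recomputed F1 may differ from the reported one in the last digit. Applying the same formula to the Overall row assumes that row relates its precision and recall the same way, which this card does not state.

```python
# (precision, recall, reported F1) copied from the table above.
REPORTED = {
    "treatment": (0.7492, 0.7666, 0.7578),
    "test": (0.7449, 0.8240, 0.7824),
    "problem": (0.6884, 0.7304, 0.7088),
    "Overall": (0.7188, 0.7660, 0.7416),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for label, (p, r, reported) in REPORTED.items():
    recomputed = f1_score(p, r)
    print(f"{label:<10} reported={reported:.4f} recomputed={recomputed:.4f}")
```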

How to evaluate using medkit

from medkit.text.metrics.ner import SeqEvalEvaluator
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

# test_documents: a list of TextDocument objects with gold entity annotations (not defined here)
# load the matcher and get predicted entities by document
matcher = HFEntityMatcher(model="medkit/DrBERT-CASM2")
predicted_entities = [matcher.run([doc.raw_segment]) for doc in test_documents]

evaluator = SeqEvalEvaluator(tagging_scheme="iob2")
evaluator.compute(test_documents, predicted_entities=predicted_entities)

You can also pass the Hugging Face tokenizer to the evaluator to evaluate by tokens instead of characters:

from transformers import AutoTokenizer

tokenizer_drbert = AutoTokenizer.from_pretrained("medkit/DrBERT-CASM2", use_fast=True)

evaluator = SeqEvalEvaluator(tokenizer=tokenizer_drbert, tagging_scheme="iob2")
evaluator.compute(test_documents, predicted_entities=predicted_entities)

Citation

@online{medkit-lib,
  author={HeKA Research Team},
  title={medkit, A Python library for a learning health system},
  url={https://pypi.org/project/medkit-lib/},
  urldate = {2023-07-24}, 
}
HeKA Research Team, “medkit, a Python library for a learning health system.” https://pypi.org/project/medkit-lib/ (accessed Jul. 24, 2023).