Commit ac083fd
0 Parent(s)
Duplicate from emilyalsentzer/Bio_ClinicalBERT
Co-authored-by: Emily Alsentzer <[email protected]>
- .gitattributes +9 -0
- LICENSE +21 -0
- README.md +46 -0
- config.json +16 -0
- flax_model.msgpack +3 -0
- graph.pbtxt +0 -0
- model.ckpt-150000.data-00000-of-00001 +3 -0
- model.ckpt-150000.index +0 -0
- model.ckpt-150000.meta +0 -0
- pytorch_model.bin +3 -0
- tf_model.h5 +3 -0
- vocab.txt +0 -0
.gitattributes
ADDED
@@ -0,0 +1,9 @@
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tar.gz filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
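As a rough check of which files in this commit the `.gitattributes` patterns above would route through Git LFS, the sketch below matches the names from the commit's file list against the nine patterns. It uses Python's `fnmatch.fnmatchcase` as an approximation of gitattributes pattern matching (an assumption: Git's own matcher has extra rules for paths containing slashes), and the helper name `lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the .gitattributes hunk above.
LFS_PATTERNS = [
    "*.bin.*", "*.lfs.*", "*.bin", "*.h5", "*.tflite",
    "*.tar.gz", "*.ot", "*.onnx", "*.msgpack",
]

# File names copied from this commit's file list.
COMMIT_FILES = [
    ".gitattributes", "LICENSE", "README.md", "config.json",
    "flax_model.msgpack", "graph.pbtxt",
    "model.ckpt-150000.data-00000-of-00001",
    "model.ckpt-150000.index", "model.ckpt-150000.meta",
    "pytorch_model.bin", "tf_model.h5", "vocab.txt",
]

def lfs_tracked(name, patterns=LFS_PATTERNS):
    """Return the patterns that match `name` (hypothetical helper)."""
    return [p for p in patterns if fnmatchcase(name, p)]

# Map every file in the commit to the patterns it matches, then list the hits.
matches = {name: lfs_tracked(name) for name in COMMIT_FILES}
for name, hits in matches.items():
    if hits:
        print(f"{name}: {hits}")
```

Under this approximation only `flax_model.msgpack`, `pytorch_model.bin`, and `tf_model.h5` match a pattern. The checkpoint data file matches none of them, although it appears as an LFS pointer further down; the diff itself does not say why.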
LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2019 Emily Alsentzer
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
ADDED
@@ -0,0 +1,46 @@
+---
+language: "en"
+tags:
+- fill-mask
+license: mit
+
+---
+
+# ClinicalBERT - Bio + Clinical BERT Model
+
+The [Publicly Available Clinical BERT Embeddings](https://arxiv.org/abs/1904.03323) paper contains four unique clinicalBERT models: initialized with BERT-Base (`cased_L-12_H-768_A-12`) or BioBERT (`BioBERT-Base v1.0 + PubMed 200K + PMC 270K`) & trained on either all MIMIC notes or only discharge summaries.
+
+This model card describes the Bio+Clinical BERT model, which was initialized from [BioBERT](https://arxiv.org/abs/1901.08746) & trained on all MIMIC notes.
+
+## Pretraining Data
+The `Bio_ClinicalBERT` model was trained on all notes from [MIMIC III](https://www.nature.com/articles/sdata201635), a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC, see [here](https://mimic.physionet.org/). All notes from the `NOTEEVENTS` table were included (~880M words).
+
+## Model Pretraining
+
+### Note Preprocessing
+Each note in MIMIC was first split into sections using a rules-based section splitter (e.g. discharge summary notes were split into "History of Present Illness", "Family History", "Brief Hospital Course", etc. sections). Then each section was split into sentences using SciSpacy (`en core sci md` tokenizer).
+
+### Pretraining Procedures
+The model was trained using code from [Google's BERT repository](https://github.com/google-research/bert) on a GeForce GTX TITAN X 12 GB GPU. Model parameters were initialized with BioBERT (`BioBERT-Base v1.0 + PubMed 200K + PMC 270K`).
+
+### Pretraining Hyperparameters
+We used a batch size of 32, a maximum sequence length of 128, and a learning rate of 5 · 10−5 for pre-training our models. The models trained on all MIMIC notes were trained for 150,000 steps. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15
+and max predictions per sequence = 20).
+
+## How to use the model
+
+Load the model via the transformers library:
+```
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
+model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
+```
+
+## More Information
+
+Refer to the original paper, [Publicly Available Clinical BERT Embeddings](https://arxiv.org/abs/1904.03323) (NAACL Clinical NLP Workshop 2019) for additional details and performance on NLI and NER tasks.
+
+## Questions?
+
+Post a Github issue on the [clinicalBERT repo](https://github.com/EmilyAlsentzer/clinicalBERT) or email [email protected] with any questions.
+
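The hyperparameters quoted in the README hunk above imply a few simple totals, worked through below. This is back-of-the-envelope arithmetic, not something the README reports: the sequence and token counts are upper bounds (they assume every sequence is filled to the maximum length), and the masking rule is assumed to follow the data-creation script in Google's BERT repository (round the masked fraction, at least one position, capped at the maximum).

```python
# Values quoted in the README's "Pretraining Hyperparameters" section.
BATCH_SIZE = 32
MAX_SEQ_LENGTH = 128
TRAIN_STEPS = 150_000
MASKED_LM_PROB = 0.15
MAX_PREDICTIONS_PER_SEQ = 20

# Each training step consumes one batch, so this is the number of sequences seen.
sequences_seen = BATCH_SIZE * TRAIN_STEPS

# Upper bound on tokens processed, if every sequence were full length.
max_tokens_seen = sequences_seen * MAX_SEQ_LENGTH

# Assumed masking rule: mask ~15% of tokens, at least 1, capped at the maximum.
masked_per_full_sequence = min(
    MAX_PREDICTIONS_PER_SEQ,
    max(1, int(round(MAX_SEQ_LENGTH * MASKED_LM_PROB))),
)

print(sequences_seen)            # 4800000
print(max_tokens_seen)           # 614400000
print(masked_per_full_sequence)  # 19
```

For a full 128-token sequence the 15% rate yields 19 masked positions, so under this assumed rule the cap of 20 is not binding at this sequence length.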
config.json
ADDED
@@ -0,0 +1,16 @@
+{
+  "attention_probs_dropout_prob": 0.1,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "type_vocab_size": 2,
+  "vocab_size": 28996
+}
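For orientation, the dimensions in `config.json` determine the size of the encoder. The sketch below counts parameters for a BERT encoder with a pooler (embeddings, 12 transformer layers, pooler), assuming the usual BERT layout with a bias on every dense layer and a weight and bias per LayerNorm. It does not count the masked-LM or next-sentence heads, and the function name `bert_param_count` is hypothetical.

```python
# Dimensions copied from the config.json hunk above.
CONFIG = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 28996,
}

def linear(n_in, n_out):
    """Parameters in a dense layer with bias."""
    return n_in * n_out + n_out

def bert_param_count(cfg):
    """Count encoder + pooler parameters under the standard BERT layout (assumed)."""
    h, ff = cfg["hidden_size"], cfg["intermediate_size"]
    layer_norm = 2 * h  # weight and bias

    embeddings = (
        cfg["vocab_size"] * h                  # token embeddings
        + cfg["max_position_embeddings"] * h   # position embeddings
        + cfg["type_vocab_size"] * h           # segment embeddings
        + layer_norm
    )
    per_layer = (
        3 * linear(h, h)    # query, key, value projections
        + linear(h, h)      # attention output projection
        + layer_norm
        + linear(h, ff)     # feed-forward expansion
        + linear(ff, h)     # feed-forward projection
        + layer_norm
    )
    pooler = linear(h, h)
    return embeddings + cfg["num_hidden_layers"] * per_layer + pooler

total = bert_param_count(CONFIG)
print(total)      # 108310272
print(total * 4)  # 433241088 bytes at 4 bytes per float32 parameter
```

At four bytes per parameter this comes to roughly 433 MB, close to the 433,248,237-byte size recorded in the `flax_model.msgpack` pointer below. The small remainder is presumably serialization overhead, though the diff does not confirm the stored dtype.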
flax_model.msgpack
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:23c147c8e9394cd9d9d1849e0b09bc1f75da9a7b4c1a69612e5361a3eef806b4
+size 433248237
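The hunk above is a Git LFS pointer file: the repository stores this small text stub, while the actual weights live in LFS storage and are identified by the SHA-256 digest. A minimal parser, assuming the three-key `version` / `oid` / `size` layout shown here (the names `LfsPointer` and `parse_lfs_pointer` are hypothetical and not part of any Git LFS tooling):

```python
from typing import NamedTuple

class LfsPointer(NamedTuple):
    version: str
    hash_algo: str
    digest: str
    size: int  # size of the real file in bytes

def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the 'key value' lines of a pointer file into an LfsPointer."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return LfsPointer(fields["version"], hash_algo, digest, int(fields["size"]))

# Pointer contents copied from the flax_model.msgpack hunk above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:23c147c8e9394cd9d9d1849e0b09bc1f75da9a7b4c1a69612e5361a3eef806b4
size 433248237
"""

ptr = parse_lfs_pointer(POINTER)
print(ptr.hash_algo, len(ptr.digest), f"{ptr.size / 1e6:.1f} MB")
```

The same function applies unchanged to the other pointer hunks in this commit, since they share the layout.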
graph.pbtxt
ADDED
The diff for this file is too large to render.
model.ckpt-150000.data-00000-of-00001
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4eaf6d2ad94f501933b1799f207fe747e9c27d8b0e5e1a67144bda6dee5c04fb
+size 1307195216
model.ckpt-150000.index
ADDED
Binary file (23.4 kB)
model.ckpt-150000.meta
ADDED
Binary file (4.07 MB)
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a18c4c260fb5c0978b86658615106d5617050b5f14dac6ceb5e0d8beb2f9f719
+size 435778770
tf_model.h5
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:28a06cf7d31bf659ed8eddf6f77394306a1e0a075c3ee827debaa1004b6579ec
+size 526681584
vocab.txt
ADDED
The diff for this file is too large to render.