nlpaueb committed on
Commit
671430e
·
1 Parent(s): ab92028

Update README.md

  1. README.md +45 -0
README.md CHANGED
@@ -1,3 +1,48 @@
  ---
+ language: en
+ pipeline_tag: fill-mask
  license: cc-by-sa-4.0
+ thumbnail: https://i.ibb.co/0yz81K9/sec-bert-logo.png
+ tags:
+ - finance
+ - financial
+ widget:
+ - text: "Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018."
  ---
+
+ # SEC-BERT
+
+ <img align="center" src="https://i.ibb.co/0yz81K9/sec-bert-logo.png" alt="sec-bert-logo" width="300"/>
+
+ SEC-BERT is a family of BERT models for the financial domain, intended to assist financial NLP research and FinTech applications.<br>
+ SEC-BERT consists of the following models:
+ * SEC-BERT-BASE (this model)
+ * [SEC-BERT-NUM](https://huggingface.co/nlpaueb/sec-bert-num): We replace every number token with a [NUM] pseudo-token, handling all numeric expressions in a uniform manner and preventing their fragmentation.
+ * [SEC-BERT-SHAPE](https://huggingface.co/nlpaueb/sec-bert-shape): We replace numbers with pseudo-tokens that represent the number’s shape, so numeric expressions (of known shapes) are no longer fragmented.<br>
+ (e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]').<br>
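
The two number-handling schemes described above can be illustrated with a minimal sketch. The regular expression, the function names `to_num` and `to_shape`, and the sample sentence are assumptions made for illustration; the README does not specify the exact preprocessing used for SEC-BERT-NUM or SEC-BERT-SHAPE, and the real SHAPE model only covers a fixed set of known shapes, which this sketch does not enforce.

```python
import re

# Assumed number pattern: digits, optional comma-separated thousands groups,
# and an optional decimal part. The actual pattern used for SEC-BERT may differ.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def to_num(text: str) -> str:
    """SEC-BERT-NUM style: replace every number with a single [NUM] pseudo-token."""
    return NUMBER.sub("[NUM]", text)


def to_shape(text: str) -> str:
    """SEC-BERT-SHAPE style: replace each digit with X and wrap the result in brackets."""
    return NUMBER.sub(lambda m: "[" + re.sub(r"\d", "X", m.group()) + "]", text)


# Illustrative sentence (not taken from the pre-training corpus).
sentence = "Revenue rose 53.2% to 40,200.5 million."
print(to_num(sentence))
print(to_shape(sentence))
```

Either way, each numeric expression maps to one token, so a WordPiece tokenizer whose vocabulary contains these pseudo-tokens has no reason to split it into sub-word fragments.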
+
+ ## Pre-training corpus
+
+ The model was pre-trained on 260,773 10-K filings from 1993-2019, publicly available from the <a href="https://www.sec.gov/">U.S. Securities and Exchange Commission (SEC)</a>.
+
+ ## Pre-training details
+
+ * We created a new vocabulary of 30k subwords by training a [BertWordPieceTokenizer](https://github.com/huggingface/tokenizers) from scratch on the pre-training corpus.
+ * We trained BERT using the official code provided in [Google BERT's GitHub repository](https://github.com/google-research/bert).
+ * We then used [Hugging Face](https://huggingface.co)'s [Transformers](https://github.com/huggingface/transformers) conversion script to convert the TF checkpoint into the desired format, so that the model can be loaded in two lines of code by both PyTorch and TF2 users.
+ * We release a model similar to the English BERT-BASE model (12-layer, 768-hidden, 12-heads, 110M parameters).
+ * We chose to follow the same training set-up: 1 million training steps with batches of 256 sequences of length 512 and an initial learning rate of 1e-4.
+ * We were able to use a single Google Cloud TPU v3-8 provided for free by [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc), while also utilizing [GCP research credits](https://edu.google.com/programs/credits/research). Huge thanks to both Google programs for supporting us!
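
To give a sense of scale for the training set-up listed above, the following back-of-the-envelope calculation multiplies out the stated step count, batch size, and sequence length. The variable names are illustrative, and the totals are upper bounds under the assumption that every sequence is filled to the full 512 tokens; padded or shorter sequences would lower the real figure.

```python
# Figures taken from the training set-up described above.
steps = 1_000_000   # training steps
batch_size = 256    # sequences per batch
seq_len = 512       # maximum tokens per sequence

tokens_per_step = batch_size * seq_len    # token positions processed per step
total_sequences = steps * batch_size      # sequences seen over the whole run
total_tokens = steps * tokens_per_step    # upper bound on token positions processed

print(f"tokens per step:  {tokens_per_step:,}")
print(f"total sequences:  {total_sequences:,}")
print(f"total tokens:     {total_tokens:,}")
```

These totals count token positions processed during training, including repeated passes over the same filings, not the number of distinct tokens in the corpus.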
+
+ ## Load Pretrained Model
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")
+ model = AutoModel.from_pretrained("nlpaueb/sec-bert-base")
+ ```
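
Since the model card declares `pipeline_tag: fill-mask`, the model can presumably also be queried for masked-token predictions. The sketch below uses the Transformers `pipeline` API; the helper name `predict_masked`, the single-mask check, and the assumption that the mask token is `[MASK]` (as in the widget example) are illustrative choices, not part of the original README.

```python
MODEL_ID = "nlpaueb/sec-bert-base"
MASK_TOKEN = "[MASK]"  # assumed from the widget example in the model card


def predict_masked(text, top_k=5):
    """Return (token, score) pairs for the single masked position in `text`."""
    # Validate the input first, so bad input fails before any model is loaded.
    if text.count(MASK_TOKEN) != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN} in the input text")

    # Imported lazily: loading the pipeline downloads the model on first use.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=MODEL_ID)
    # With a single mask, the pipeline returns a flat list of prediction dicts.
    return [(p["token_str"], p["score"]) for p in unmasker(text, top_k=top_k)]
```

Calling `predict_masked("Total net sales [MASK] 2% or $5.4 billion during 2019 compared to 2018.")` would return the model's top candidates for the masked word together with their scores. The first call requires network access to download the weights.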
+
+ FiNER: Financial Numeric Entity Recognition for XBRL Tagging
+
+ [Manos Fergadiotis](https://manosfer.github.io) on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)