lrodrigues committed on
Commit · cd7feb8
1 Parent(s): 18947ca

upload

- README.md +83 -0
- config.json +67 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +10 -0
- tokenizer_config.json +23 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,83 @@
---
datasets:
- sqa
- wikisql
- wtq
language: en
license: mit
pipeline_tag: table-question-answering
tags:
- tapas
- table-question-answering
---

# TAPAS large model fine-tuned on WikiTable Questions (WTQ)

TAPAS is a BERT-like Transformer model pretrained in a self-supervised fashion on a large corpus of English data from Wikipedia. This means it was pretrained on raw tables and their associated texts only, with no human labelling (which is why it can use lots of publicly available data); an automatic process generates the inputs and labels from those texts. More precisely, it was pretrained with two objectives:

- Masked language modeling (MLM): given a (flattened) table and its associated context, 15% of the words in the input are randomly masked, and the entire (partially masked) sequence is run through the model, which then has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of a table and its associated text.
- Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model on a balanced dataset of millions of synthetically created training examples. Here, the model must predict (classify) whether a sentence is supported or refuted by the contents of a table. The training examples are created from synthetic as well as counterfactual statements.

This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table.
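
To make the MLM objective more concrete, here is a minimal sketch of masking a flattened table. It is an illustration only, not the TAPAS pre-training code: the helpers `flatten_table` and `mask_tokens` are hypothetical names, whitespace splitting stands in for WordPiece tokenization, and the real pipeline also adds row, column and rank embeddings that are omitted here.

```python
import random


def flatten_table(question, header, rows):
    """Flatten a question and a table into one token list, row by row.

    Simplification: whitespace splitting instead of WordPiece tokenization.
    """
    tokens = question.lower().split()
    tokens += [cell.lower() for cell in header]
    for row in rows:
        tokens += [cell.lower() for cell in row]
    return tokens


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask roughly `mask_prob` of the tokens (at least one).

    Returns the masked inputs and a {position: original_token} label map.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}
    return inputs, labels


tokens = flatten_table(
    "how many stars",
    ["Repository", "Stars"],
    [["Transformers", "36542"], ["Datasets", "4512"]],
)
masked_inputs, masked_labels = mask_tokens(tokens)
print(masked_inputs)
print(masked_labels)
```

The model would be trained to recover the entries of `masked_labels` from `masked_inputs`, using context on both sides of each masked position.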

[TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349)

## Training Data
This model was pre-trained with MLM and an additional step that the authors call intermediate pre-training, and then fine-tuned in a chain on [SQA](https://www.microsoft.com/en-us/download/details.aspx?id=54253), [WikiSQL](https://github.com/salesforce/WikiSQL) and finally [WTQ](https://github.com/ppasupat/WikiTableQuestions).

It can be used to answer questions about a table.

## Usage and Performance
The trained model can be used like this:
```python
from transformers import AutoModelForTableQuestionAnswering, AutoTokenizer, pipeline

# Load model & tokenizer
tapas_model = AutoModelForTableQuestionAnswering.from_pretrained('navteca/tapas-large-finetuned-wtq')
tapas_tokenizer = AutoTokenizer.from_pretrained('navteca/tapas-large-finetuned-wtq')

# Get predictions
nlp = pipeline('table-question-answering', model=tapas_model, tokenizer=tapas_tokenizer)

result = nlp({
    'table': {
        'Repository': [
            'Transformers',
            'Datasets',
            'Tokenizers'
        ],
        'Stars': [
            '36542',
            '4512',
            '3934'
        ],
        'Contributors': [
            '651',
            '77',
            '34'
        ],
        'Programming language': [
            'Python',
            'Python',
            'Rust, Python and NodeJS'
        ]
    },
    'query': 'How many stars does the transformers repository have?'
})

print(result)

#{
#  "answer": "SUM > 36542",
#  "coordinates": [
#    [
#      0,
#      1
#    ]
#  ],
#  "cells": [
#    "36542"
#  ],
#  "aggregator": "SUM"
#}
```
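
Note that the `answer` field in the output above is a string that names the aggregator and the selected cells; it is not a computed number. The following sketch shows one way to turn such a result into a final value. The helper `apply_aggregator` is hypothetical (it is not part of `transformers`), and it assumes that the selected cells hold plain numeric strings whenever the aggregator is `SUM` or `AVERAGE`.

```python
def apply_aggregator(result):
    """Compute a final answer from a pipeline-style result dict.

    Assumes `result` has a "cells" list of strings and an "aggregator"
    that is one of NONE, SUM, AVERAGE or COUNT.
    """
    aggregator = result.get("aggregator", "NONE")
    cells = result["cells"]
    if aggregator == "NONE":
        # No aggregation: the selected cells are the answer.
        return ", ".join(cells)
    if aggregator == "COUNT":
        return len(cells)
    # SUM and AVERAGE need numeric values; strip thousands separators first.
    values = [float(cell.replace(",", "")) for cell in cells]
    if aggregator == "SUM":
        return sum(values)
    if aggregator == "AVERAGE":
        return sum(values) / len(values)
    raise ValueError(f"Unknown aggregator: {aggregator}")


example = {
    "answer": "SUM > 36542",
    "coordinates": [[0, 1]],
    "cells": ["36542"],
    "aggregator": "SUM",
}
final_answer = apply_aggregator(example)
print(final_answer)
```

Cells that contain units or free text would need extra parsing before they can be summed or averaged.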
config.json
ADDED
@@ -0,0 +1,67 @@
{
  "aggregation_labels": {
    "0": "NONE",
    "1": "SUM",
    "2": "AVERAGE",
    "3": "COUNT"
  },
  "aggregation_loss_weight": 1,
  "aggregation_temperature": 1,
  "allow_empty_column_selection": false,
  "answer_loss_cutoff": 0.664694,
  "answer_loss_importance": 1,
  "architectures": [
    "TapasForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "average_approximation_function": "ratio",
  "average_logits_per_cell": false,
  "cell_selection_preference": 0.207951,
  "disable_per_token_loss": false,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "huber_loss_delta": 0.121194,
  "init_cell_selection_weights_to_zero": true,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_num_columns": 32,
  "max_num_rows": 64,
  "max_position_embeddings": 1024,
  "model_type": "tapas",
  "no_aggregation_label_index": 0,
  "num_aggregation_labels": 4,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "positive_label_weight": 10,
  "reset_position_index_per_cell": true,
  "select_one_column": true,
  "softmax_temperature": 1,
  "temperature": 0.0352513,
  "type_vocab_size": [
    3,
    256,
    256,
    2,
    256,
    256,
    10
  ],
  "type_vocab_sizes": [
    3,
    256,
    256,
    2,
    256,
    256,
    10
  ],
  "use_answer_as_supervision": true,
  "use_gumbel_for_aggregation": false,
  "use_gumbel_for_cells": false,
  "use_normalized_answer_loss": false,
  "vocab_size": 30522
}
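
As a rough illustration of how the `aggregation_labels` map in this config can be read, the sketch below picks the highest-scoring aggregation index and looks up its name. The function `aggregation_label` and the logit values are hypothetical; the library has its own post-processing, and this only shows the index-to-label lookup. Note that the JSON keys are strings, so the integer index has to be converted before the lookup.

```python
import json

# Fragment copied from the config.json above; only the fields used here.
config_fragment = json.loads("""
{
  "aggregation_labels": {"0": "NONE", "1": "SUM", "2": "AVERAGE", "3": "COUNT"},
  "no_aggregation_label_index": 0,
  "num_aggregation_labels": 4
}
""")


def aggregation_label(logits, config):
    """Map a list of aggregation logits to the name of the best-scoring label."""
    if len(logits) != config["num_aggregation_labels"]:
        raise ValueError("expected one logit per aggregation label")
    index = max(range(len(logits)), key=lambda i: logits[i])
    # JSON object keys are strings, so convert the index before the lookup.
    return config["aggregation_labels"][str(index)]


hypothetical_logits = [0.1, 2.3, -0.4, 0.7]  # made-up values for illustration
label = aggregation_label(hypothetical_logits, config_fragment)
print(label)
```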
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4f1b5620999629496833aad6c111737ca7acaad44a53a50fb3c1de66ad788f8f
size 1347101703
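
The three lines above are a Git LFS pointer, not the weights themselves: the repository stores the pointer, and the actual file is fetched by its hash. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for this example, and it assumes the simple `key value` line layout shown above.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:4f1b5620999629496833aad6c111737ca7acaad44a53a50fb3c1de66ad788f8f
size 1347101703
"""
pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algorithm"], pointer["size_bytes"])
```

The `size` field gives the download size of the weights in bytes, here roughly 1.35 GB.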
special_tokens_map.json
ADDED
@@ -0,0 +1,10 @@
{
  "additional_special_tokens": [
    "[EMPTY]"
  ],
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer_config.json
ADDED
@@ -0,0 +1,23 @@
{
  "additional_special_tokens": [
    "[EMPTY]"
  ],
  "cell_trim_length": -1,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "drop_rows_to_fit": false,
  "empty_token": "[EMPTY]",
  "mask_token": "[MASK]",
  "max_column_id": null,
  "max_row_id": null,
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "strip_column_names": false,
  "tokenize_chinese_chars": true,
  "unk_token": "[UNK]",
  "update_answer_coordinates": false
}
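
Two of the settings above, `do_lower_case` and `empty_token`, affect how table cells look before they reach the model. The sketch below approximates their combined effect on raw cell strings. It is an assumption-laden illustration, not the tokenizer's implementation: `normalize_cells` is a hypothetical helper, and the real tokenizer additionally applies WordPiece splitting and builds row and column ids.

```python
# Only the two settings used in this sketch, copied from tokenizer_config.json.
tokenizer_settings = {"do_lower_case": True, "empty_token": "[EMPTY]"}


def normalize_cells(rows, settings):
    """Lowercase cell text and replace empty cells with the empty token."""
    normalized = []
    for row in rows:
        new_row = []
        for cell in row:
            text = cell.strip()
            if not text:
                # Empty cells get a placeholder so every cell yields a token.
                new_row.append(settings["empty_token"])
            elif settings["do_lower_case"]:
                new_row.append(text.lower())
            else:
                new_row.append(text)
        normalized.append(new_row)
    return normalized


rows = [["Transformers", "36542"], ["Tokenizers", ""]]
normalized_rows = normalize_cells(rows, tokenizer_settings)
print(normalized_rows)
```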
vocab.txt
ADDED
The diff for this file is too large to render.