Muthukumaran committed on
Commit 419bbb3 · 1 Parent(s): f01d42f

Update README.md

Files changed (1)
  1. README.md +33 -74
README.md CHANGED
@@ -1,95 +1,54 @@
  ---
  license: apache-2.0
  language:
- - en
+ - en
  library_name: transformers
  pipeline_tag: fill-mask
  tags:
- - climate
- - biology
+ - climate
+ - biology
  ---
- # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->
+ # Model Card for nasa-smd-ibm-v0.1

- This domain-adapted,(RoBERTa)[https://huggingface.co/roberta-base] based, Encoder-only transformer model is finetuned using select scientific journals and articles related to NASA Science Mission Directorate(SMD). It's intended purpose is to aid in NLP efforts within NASA. e.g.: Information retrieval, Intelligent search and discovery.
+ nasa-smd-ibm-v0.1 is a RoBERTa-based, Encoder-only transformer model, domain-adapted for NASA Science Mission Directorate (SMD) applications. It's fine-tuned on scientific journals and articles relevant to NASA SMD, aiming to enhance natural language technologies like information retrieval and intelligent search.

  ## Model Details
- - RoBERTa as base model
- - Custom tokenizer
- - 125M parameters
- - Masked Language Modeling (MLM) pretraining strategy
+ - **Base Model**: RoBERTa
+ - **Tokenizer**: Custom
+ - **Parameters**: 125M
+ - **Pretraining Strategy**: Masked Language Modeling (MLM)

- ### Model Description
+ ## Training Data
+ - Wikipedia English (Feb 1, 2020)
+ - NASA datasets
+ - Scientific papers (NASA Earth Science, Astrophysics)
+ - PubMed abstracts
+ - PMC (commercial license subset)

- <!-- - **Developed by:** NASA IMPACT and IBM Research
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed] -->
+ ![Dataset Size Chart](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/CTNkn0WHS268hvidFmoqj.png)

- ## Uses
-
- - Named Entity Recognition (NER), Information revreival, sentence-transformers.
-
- ## Training Details
-
- ### Training Data
-
- The model was trained on the following datasets:
- 1. Wikipedia English dump of February 1, 2020
- 2. NASA own data
- 3. NASA papers
- 4. NASA Earth Science papers
- 5. NASA Astrophysics Data System
- 6. PubMed abstract
- 7. PMC : subset with commercial license
-
- The sizes of the dataset is shown in the following chart.
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/CTNkn0WHS268hvidFmoqj.png)
-
- <!-- Provide the basic links for the model.
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
- -->
-
- ### Training Procedure
- The model was trained on fairseq 0.12.1 with PyTorch 1.9.1 on transformer version 4.2.0. Masked Language Modeling (MLM) is the pretraining stragegy used.
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+ ## Training Procedure
+ - **Framework**: fairseq 0.12.1 with PyTorch 1.9.1
+ - **Transformer Version**: 4.2.0
+ - **Strategy**: Masked Language Modeling (MLM)

  ## Evaluation
+ - BLURB Benchmark
+ - Pruned SQuAD2.0 (SQ2) Benchmark
+ - NASA SMD Experts Benchmark (WIP)

- ### BLURB Benchmark
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/K0IpQnTQmrfQJ1JXxn1B6.png)
-
+ ![BLURB Benchmark Results](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/K0IpQnTQmrfQJ1JXxn1B6.png)
+ ![SQ2 Benchmark Results](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/R4oMJquUz4puah3lvd5Ve.png)

- ### Pruned SQuAD2.0 (SQ2) Benchmark
-
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61099e5d86580d4580767226/R4oMJquUz4puah3lvd5Ve.png)
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
-
- ### NASA SMD Experts Benchmark
-
- WIP!
+ ## Uses
+ - Named Entity Recognition (NER)
+ - Information Retrieval
+ - Sentence Transformers

  ## Citation
+ Refer to the DOI provided by Huggingface for citations.

- Please use the DOI provided by Huggingface to cite the model.
-
- ## Model Card Authors [optional]
-
- Bishwaranjan Bhattacharjee, IBM Research
- Muthukumaran Ramasubramanian, NASA-IMPACT ([email protected])
-
- ## Model Card Contact
-
- Muthukumaran Ramasubramanian ([email protected])
+ ## Contacts
+ - Bishwaranjan Bhattacharjee, IBM Research
+ - Muthukumaran Ramasubramanian, NASA-IMPACT ([email protected])
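
The updated card declares `pipeline_tag: fill-mask` and `library_name: transformers` but carries no usage snippet. A minimal sketch of how the model might be queried is given here for reference. The repository ID `nasa-impact/nasa-smd-ibm-v0.1` is an assumption (the diff names only `nasa-smd-ibm-v0.1`, not the owning organization), and the helper names `mask_word` and `top_predictions` are hypothetical. Because the card lists a custom tokenizer, the sketch reads the mask token from the loaded tokenizer rather than assuming RoBERTa's default `<mask>`.

```python
from typing import List

# Assumed repository ID; the commit only names the model "nasa-smd-ibm-v0.1".
MODEL_ID = "nasa-impact/nasa-smd-ibm-v0.1"


def mask_word(sentence: str, target: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `target` in `sentence` with `mask_token`."""
    if target not in sentence:
        raise ValueError(f"{target!r} not found in sentence")
    return sentence.replace(target, mask_token, 1)


def top_predictions(sentence: str, target: str,
                    model_id: str = MODEL_ID, k: int = 5) -> List[str]:
    """Mask `target` and return the model's top-k candidate tokens for that slot."""
    # Imported lazily so the string helper above works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token, since the card lists a custom tokenizer.
    masked = mask_word(sentence, target, fill.tokenizer.mask_token)
    # With a single mask, the pipeline returns a list of dicts,
    # each holding a "token_str" and a "score".
    return [pred["token_str"].strip() for pred in fill(masked, top_k=k)]
```

Calling `top_predictions("Ozone depletion affects the stratosphere.", "stratosphere")` would download the checkpoint on first use and return the five highest-scoring fillers; no expected output is shown because it depends on the published weights.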